Spark create dataframe python

Contents

DataFrame
Attributes and underlying data
Conversion
Indexing, iteration
Binary operator functions
Function application, GroupBy & Window
Computations / Descriptive Stats
Reindexing / Selection / Label manipulation
Missing data handling
Reshaping, sorting, transposing
Combining / joining / merging
Time series-related
Serialization / IO / Conversion
Spark-related
Plotting
Pandas-on-Spark specific

DataFrame

pandas-on-Spark DataFrame that corresponds to pandas DataFrame logically.

Attributes and underlying data

The index (row labels) of the DataFrame.

The column labels of the DataFrame.

Returns true if the current DataFrame is empty.

Return the dtypes in the DataFrame.

Return a tuple representing the dimensionality of the DataFrame.

Return a list representing the axes of the DataFrame.

Return an int representing the number of array dimensions.

Return an int representing the number of elements in this object.

Return a subset of the DataFrame’s columns based on the column dtypes.

Return a Numpy representation of the DataFrame or the Series.
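The attributes above can be tried on a small frame. This is a minimal sketch that assumes plain pandas as a stand-in so it runs without a Spark session; with pyspark installed, swapping the import for `import pyspark.pandas as pd` is expected to behave the same way for these calls. The column names are made up for illustration.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical sample data, not taken from the reference itself.
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "score": [1.5, 2.0, 3.5], "count": [1, 2, 3]}
)

shape = df.shape      # (number of rows, number of columns)
ndim = df.ndim        # number of array dimensions
size = df.size        # total number of elements (rows * columns)
is_empty = df.empty   # True only when the frame holds no elements
numeric = df.select_dtypes(include="number")  # keep numeric columns only

print(shape, ndim, size, is_empty, list(numeric.columns))
```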

Conversion

Make a copy of this object’s indices and data.

Detects missing values for items in the current DataFrame.

Cast a pandas-on-Spark object to a specified dtype.

Detects missing values for items in the current DataFrame.

Detects non-missing values for items in the current DataFrame.

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

Return the bool of a single element in the current object.
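A short sketch of the missing-value checks and the dtype cast described above. It is written against plain pandas as an assumption (no Spark runtime needed); the frame and its column names are illustrative only.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical data: one numeric column with a gap, one column of digit strings.
df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["1", "2", "3"]})

missing = df.isna()                # True where a value is missing
present = df.notna()               # element-wise complement of isna()
typed = df.astype({"y": "int64"})  # cast only column "y" to integers

print(missing["x"].tolist(), typed["y"].tolist())
```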

Indexing, iteration

Access a single value for a row/column label pair.

Access a single value for a row/column pair by integer position.

Return index of first occurrence of maximum over requested axis.

Return index of first occurrence of minimum over requested axis.

Access a group of rows and columns by label(s) or a boolean Series.

Purely integer-location based indexing for selection by position.

Iterator over (column name, Series) pairs.

This is an alias of items.

Iterate over DataFrame rows as (index, Series) pairs.

Iterate over DataFrame rows as namedtuples.

Return item and drop from frame.

Return cross-section from the DataFrame.

Get item from object for given key (DataFrame column, Panel slice, etc.).

Replace values where the condition is False.

Replace values where the condition is True.

Query the columns of a DataFrame with a boolean expression.
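The label-based, position-based and conditional accessors listed above differ mainly in how a cell or row is addressed. The sketch below contrasts them on one small frame; it assumes plain pandas so that it runs locally, and the labels and values are invented for the example.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"price": [10, 25, 40], "qty": [3, 0, 7]},
    index=["apple", "pear", "plum"],
)

by_label = df.at["pear", "price"]          # single value by row/column label
by_position = df.iat[2, 1]                 # single value by integer position
subset = df.loc[df["qty"] > 0, ["price"]]  # boolean Series selects the rows
kept = df.where(df > 5, 0)                 # replace values where condition is False
masked = df.mask(df > 5, 0)                # replace values where condition is True
cheap = df.query("price < 30")             # boolean expression over column names

print(by_label, by_position, list(subset.index), list(cheap.index))
```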

Binary operator functions

Get Addition of dataframe and other, element-wise (binary operator +).

Get Floating division of dataframe and other, element-wise (binary operator /).

Get Multiplication of dataframe and other, element-wise (binary operator *).

Get Subtraction of dataframe and other, element-wise (binary operator -).

Get Exponential power of dataframe and other, element-wise (binary operator **).

Get Exponential power of dataframe and other, element-wise (binary operator **).

Get Modulo of dataframe and other, element-wise (binary operator %).

Get Integer division of dataframe and other, element-wise (binary operator //).

Compare if the current value is less than the other.

Compare if the current value is greater than the other.

Compare if the current value is less than or equal to the other.

Compare if the current value is greater than or equal to the other.

Compare if the current value is not equal to the other.

Compare if the current value is equal to the other.

Compute the matrix multiplication between the DataFrame and others.

Update null elements with value in the same location in other.
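The element-wise operators and the null-filling combination above can be sketched as follows. Plain pandas is assumed here so the example runs without Spark; in pandas-on-Spark, operations between two separate DataFrames may additionally need the `compute.ops_on_diff_frames` option enabled. All frames are hypothetical.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

a = pd.DataFrame({"x": [7, 8, 9]})
b = pd.DataFrame({"x": [2, 3, 4]})

total = a.add(b)        # same result as a + b
floor = a.floordiv(b)   # same result as a // b
rem = a.mod(b)          # same result as a % b
power = a.pow(2)        # same result as a ** 2
at_least = a.ge(8)      # element-wise a >= 8

# Fill the nulls of one frame with values at the same location in another.
left = pd.DataFrame({"y": [1.0, None, 3.0]})
right = pd.DataFrame({"y": [10.0, 20.0, 30.0]})
filled = left.combine_first(right)

print(total["x"].tolist(), filled["y"].tolist())
```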

Function application, GroupBy & Window

Apply a function along an axis of the DataFrame.

Apply a function to a DataFrame elementwise.

Apply func(self, *args, **kwargs).

Aggregate using one or more operations over the specified axis.

Group DataFrame or Series using one or more columns.

Provide rolling transformations.

Provide expanding transformations.

Call func on self producing a Series with transformed values and that has the same length as its input.
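Grouping, aggregation and function application are easiest to compare side by side. The following sketch assumes plain pandas (so no Spark session is required) and uses invented column names; it is meant as an illustration of the call shapes rather than a reference implementation.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical per-player records.
df = pd.DataFrame(
    {"team": ["a", "a", "b"], "points": [3, 5, 4], "fouls": [1, 0, 2]}
)
stats = df[["points", "fouls"]]

totals = df.groupby("team").sum()           # one row per team
summary = stats.agg(["min", "max"])         # several aggregations at once
doubled = stats.apply(lambda col: col * 2)  # function applied to each column
bumped = stats.pipe(lambda d, k: d + k, 1)  # calls func(self, *args)

print(totals)
print(summary)
```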

Computations / Descriptive Stats

Return a Series/DataFrame with absolute numeric value of each element.

Return whether all elements are True.

Return whether any element is True.

Trim values at input threshold(s).

Compute pairwise correlation of columns, excluding NA/null values.

Compute pairwise correlation.

Count non-NA cells for each column.

Compute pairwise covariance of columns, excluding NA/null values.

Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Provide exponentially weighted window transformations.

Return unbiased kurtosis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0).

Return the mean absolute deviation of values.

Return the maximum of the values.

Return the mean of the values.

Return the minimum of the values.

Return the median of the values for the requested axis.

Get the mode(s) of each element along the selected axis.

Percentage change between the current and a prior element.

Return the product of the values.

Return value at the given quantile.

Return number of unique elements in the object.

DataFrame.sem([axis, skipna, ddof, numeric_only])

Return unbiased standard error of the mean over requested axis.

Return unbiased skew normalized by N-1.

Return the sum of the values.

DataFrame.std([axis, skipna, ddof, numeric_only])

Return sample standard deviation.

Return cumulative minimum over a DataFrame or Series axis.

Return cumulative maximum over a DataFrame or Series axis.

Return cumulative sum over a DataFrame or Series axis.

Return cumulative product over a DataFrame or Series axis.

Round a DataFrame to a variable number of decimal places.

First discrete difference of element.

Evaluate a string describing operations on DataFrame columns.
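A few of the computations above, applied to a single hypothetical column. The sketch assumes plain pandas in place of pandas-on-Spark so it can run anywhere.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

df = pd.DataFrame({"v": [1, 4, 9, 16]})

clipped = df.clip(lower=2, upper=10)  # trim values at the given thresholds
running = df.cumsum()                 # cumulative sum down the column
steps = df.diff()                     # first discrete difference (first row is NaN)
mean = df.mean()                      # Series holding one mean per column
distinct = df.nunique()               # number of unique values per column

print(clipped["v"].tolist(), running["v"].tolist(), float(mean["v"]))
```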

Reindexing / Selection / Label manipulation

Prefix labels with string prefix.

Suffix labels with string suffix.

Align two objects on their axes with the specified join method.

Select values at particular time of day (example: 9:30AM).

Select values between particular times of the day (example: 9:00-9:30 AM).

DataFrame.drop([labels, axis, index, columns])

Drop specified labels from columns.

Return DataFrame with requested index / column level(s) removed.

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Return boolean Series denoting duplicate rows, optionally only considering certain columns.

Compare if the current value is equal to the other.

Subset rows or columns of dataframe according to labels in the specified index.

Select first periods of time series data based on a date offset.

Select final periods of time series data based on a date offset.

Set the name of the axis for the index or columns.

Reset the index, or a level of it.

Set the DataFrame index (row labels) using one or more existing columns.

Interchange axes and swap values axes appropriately.

Swap levels i and j in a MultiIndex on a particular axis.

Return the elements in the given positional indices along an axis.

Whether each element in the DataFrame is contained in values.

Return a random sample of items from an axis of object.

Truncate a Series or DataFrame before and after some index value.
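The label-manipulation methods above tend to be chained. Below is a small sketch of deduplicating rows and moving a column in and out of the index; plain pandas is assumed as a stand-in, and the data is made up.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical frame with one repeated row.
df = pd.DataFrame(
    {"id": [1, 2, 2, 3], "city": ["Oslo", "Rome", "Rome", "Lima"]}
)

dupe_flags = df.duplicated()            # True for repeats after the first occurrence
unique_rows = df.drop_duplicates()      # repeated (2, "Rome") row removed
indexed = unique_rows.set_index("id")   # existing column becomes the row labels
restored = indexed.reset_index()        # index moves back into a column
prefixed = df.add_prefix("col_")        # prefix every column label
without_city = df.drop(columns=["city"])  # drop labels from columns

print(dupe_flags.tolist(), list(indexed.index), list(prefixed.columns))
```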

Missing data handling

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

Returns a new DataFrame replacing a value with another value.

Synonym for DataFrame.fillna() or Series.fillna() with method=`bfill`.

Synonym for DataFrame.fillna() or Series.fillna() with method=`ffill`.

Fill NaN values using an interpolation method.
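How the fill strategies above differ is clearest on a column with a gap in the middle. This sketch assumes plain pandas so it runs without Spark; the single column `t` is hypothetical.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

df = pd.DataFrame({"t": [1.0, None, None, 4.0]})

forward = df.ffill()            # carry the last valid value forward
backward = df.bfill()           # pull the next valid value backward
linear = df.interpolate()       # default linear interpolation across the gap
swapped = df.replace(4.0, 0.0)  # replace one value with another

print(forward["t"].tolist(), backward["t"].tolist(), linear["t"].tolist())
```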

Reshaping, sorting, transposing

Create a spreadsheet-style pivot table as a DataFrame.

Return reshaped DataFrame organized by given index / column values.

Sort object by labels (along an axis).

Sort by the values along either axis.

Return the first n rows ordered by columns in descending order.

Return the first n rows ordered by columns in ascending order.

Stack the prescribed level(s) from columns to index.

Pivot the (necessarily hierarchical) index labels.

Unpivot a DataFrame from wide format to long format, optionally leaving identifier variables set.

Transform each element of a list-like to a row, replicating index values.

Squeeze 1 dimensional axis objects into scalars.

Transpose index and columns.

Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index.

Return a DataFrame with matching indices as other object.

DataFrame.rank([method, ascending, numeric_only])

Compute numerical data ranks (1 through n) along axis.
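Pivoting to wide format and melting back to long format are inverse-style reshapes, which the sketch below walks through together with two ordering helpers. It assumes plain pandas as a stand-in for pandas-on-Spark, and the store/item data is invented.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

# Hypothetical long-format sales records.
df = pd.DataFrame(
    {
        "store": ["n", "n", "s", "s"],
        "item": ["tea", "jam", "tea", "jam"],
        "sales": [5, 2, 7, 1],
    }
)

# Long to wide: one row per store, one column per item.
wide = df.pivot_table(index="store", columns="item", values="sales", aggfunc="sum")

# Wide back to long: "store" stays as the identifier variable.
long = wide.reset_index().melt(id_vars="store", var_name="item", value_name="sales")

top2 = df.nlargest(2, "sales")    # first 2 rows by descending sales
ordered = df.sort_values("sales") # all rows by ascending sales

print(wide)
print(top2["sales"].tolist(), ordered["sales"].tolist())
```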

Combining / joining / merging

Append rows of other to the end of caller, returning a new object.

Assign new columns to a DataFrame.

Merge DataFrame objects with a database-style join.

Join columns of another DataFrame.

Modify in place using non-NA values from another DataFrame.

Insert column into DataFrame at specified location.
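A database-style merge and a column assignment, as described above, might look like this. The sketch assumes plain pandas so it runs locally; the `orders` and `names` frames and their columns are hypothetical.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

orders = pd.DataFrame({"cust": [1, 2, 3], "amount": [20, 35, 10]})
names = pd.DataFrame({"cust": [1, 2, 4], "name": ["Ann", "Bo", "Di"]})

inner = orders.merge(names, on="cust")             # default how="inner"
left = orders.merge(names, on="cust", how="left")  # keep every order row
with_fee = orders.assign(total=orders["amount"] + 5)  # new column from an existing one

print(inner)
print(left)
```

The inner join keeps only customers present in both frames, while the left join keeps customer 3 with a missing name.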

Time series-related

Shift DataFrame by desired number of periods.

Retrieves the index of the first valid value.

Return index for last non-NA/null value.
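Shifting and the first/last valid index lookups can be checked on a column padded with nulls at both ends. Plain pandas is assumed here in place of pandas-on-Spark, and the integer index labels are arbitrary.

```python
import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

df = pd.DataFrame({"v": [None, 2.0, 3.0, None]}, index=[10, 20, 30, 40])

lagged = df.shift(1)            # values move down one row, index stays in place
first = df.first_valid_index()  # label of the first non-null row
last = df.last_valid_index()    # label of the last non-null row

print(first, last, lagged["v"].tolist())
```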

Serialization / IO / Conversion

Convert structured or recorded ndarray to DataFrame.

Print a concise summary of a DataFrame.

Write the DataFrame into a Spark table.

Write the DataFrame out as a Delta Lake table.

Write the DataFrame out as a Parquet file or directory.

Write the DataFrame out to a Spark data source.

Write object to a comma-separated values (csv) file.

Return a pandas DataFrame.

Render a DataFrame as an HTML table.

A NumPy ndarray representing the values in this DataFrame or Series.

Render a DataFrame to a console-friendly tabular output.

Convert the object to a JSON string.

Convert the DataFrame to a dictionary.

Write object to an Excel sheet.

Copy object to the system clipboard.

Print Series or DataFrame in Markdown-friendly format.

Convert DataFrame to a NumPy record array.

Render an object to a LaTeX tabular environment table.

Property returning a Styler object containing methods for building a styled HTML representation for the DataFrame.
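The in-memory conversions above (dictionary, JSON string, NumPy array) can be sketched without touching the file system. Plain pandas is assumed so that no Spark session is needed; the writers that target Spark tables, Delta Lake or Parquet are left out because they need a running Spark environment. The frame is hypothetical.

```python
import json

import pandas as pd  # assumption: plain pandas stands in for pyspark.pandas

df = pd.DataFrame({"k": ["a", "b"], "n": [1, 2]})

as_dict = df.to_dict(orient="list")     # column name -> list of values
as_json = df.to_json(orient="records")  # JSON string, one object per row
records = json.loads(as_json)           # parse back to check the structure
as_array = df.to_numpy()                # 2-D NumPy array of the values

print(as_dict)
print(as_json)
```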

Spark-related

DataFrame.spark provides features that do not exist in pandas but do exist in Spark. These can be accessed through the DataFrame.spark accessor.

Return the current DataFrame as a Spark DataFrame.

Yields and caches the current DataFrame.

Yields and caches the current DataFrame with a specific StorageLevel.

Specifies some hint on the current DataFrame.

Write the DataFrame into a Spark table.

Write the DataFrame out to a Spark data source.

Applies a function that takes and returns a Spark DataFrame.

Returns a new DataFrame partitioned by the given partitioning expressions.

Returns a new DataFrame that has exactly num_partitions partitions.

Plotting

DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods, which are accessed through DataFrame.plot.

alias of pyspark.pandas.plot.core.PandasOnSparkPlotAccessor

Make a horizontal bar plot.

Draw one histogram of the DataFrame’s columns.

Make a box plot of the Series columns.

Plot DataFrame/Series as lines.

Create a scatter plot with varying marker point size and color.

Generate Kernel Density Estimate plot using Gaussian kernels.