Pandas

Basic Pandas Imports

Python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

SERIES

1-dimensional labelled array; the axis labels are referred to as the INDEX.
The index can contain repetitions.

Python
s = Series(data, index=index, name='name')
# DATA: {python dict, ndarray, scalar value}
# NAME: {string}
s = Series(dict)  # Series created from python dict, dict keys become index values
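A minimal sketch of the dict-based constructor (the labels and values are made up for illustration):

```python
import pandas as pd

# dict keys become the index labels, dict values the data
s = pd.Series({"a": 1, "b": 2, "c": 3}, name="example")
```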

INDEXING / SELECTION / SLICING

Python
s['label']  # selection by index label
s[condition]  # return slice selected by condition
s[ : ]  # slice (label slices include the endpoint)
s[ : ] = *value  # modify values of entire slice
s[condition] = *value  # modify slice by condition
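The selection and assignment forms above, sketched on a toy Series (labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])
by_label = s["b"]        # selection by index label -> 20
big = s[s > 15]          # slice by condition: keeps "b" and "c"
sub = s["a":"b"]         # label slice: endpoint "b" IS included
s[s > 25] = 0            # modify slice by condition: "c" becomes 0
```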

MISSING DATA

Missing data appears as NaN (Not a Number).

Python
pd.isnull(array)  # return a boolean Series indicating which labels are missing data
pd.notnull(array)  # return a boolean Series indicating which labels have data
array.isnull()
array.notnull()
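A short sketch of both the function and method forms (toy data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
mask = pd.isnull(s)      # equivalent to s.isnull()
present = s.notnull()    # True where data is present
```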

SERIES ATTRIBUTES

Python
s.values  # NumPy representation of Series
s.index  # index object of Series
s.name = "Series name"  # renames Series object
s.index.name = "index name"  # renames index

SERIES METHODS

Python
pd.Series.isin(self, values)  # boolean Series showing whether elements in Series match elements in values exactly

# Conform Series to new index, new object produced unless the new index is equivalent to current one and copy=False
pd.Series.reindex(self, index=None, **kwargs)
# INDEX: {array} -- new labels / index
# METHOD: {none (don't fill gaps), pad (fill or carry values forward), backfill (fill or carry values backward)} -- hole filling method
# COPY: {bool} -- return new object even if index is same -- DEFAULT True
# FILLVALUE: {scalar} -- value to use for missing values -- DEFAULT NaN

pd.Series.drop(self, index=None, **kwargs)  # return Series with specified index labels removed
# INPLACE: {bool} -- if true do operation in place and return None -- DEFAULT False
# ERRORS: {ignore, raise} -- If "ignore", suppress error and existing labels are dropped
# KeyError raised if not all of the labels are found in the selected axis

pd.Series.value_counts(self, normalize=False, sort=True, ascending=False, bins=None, dropna=True)
# NORMALIZE: {bool} -- if True then object returned will contain relative frequencies of unique values
# SORT: {bool} -- sort by frequency -- DEFAULT True
# ASCENDING: {bool} -- sort in ascending order -- DEFAULT False
# BINS: {int} -- group values into half-open bins, only works with numeric data
# DROPNA: {bool} -- don't include counts of NaN
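The three methods above, sketched on toy Series (all labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([0, 1], index=["a", "b"])
r = s.reindex(["a", "b", "c"], fill_value=-1)  # new label "c" gets the fill value
d = s.drop(["a"])                              # label "a" removed

counts = pd.Series(["x", "x", "y"]).value_counts()  # sorted by frequency
```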

DATAFRAME

2-dimensional labeled data structure with columns of potentially different types. Index and columns can contain repetitions.

Python
df = DataFrame(data, index=row_labels, columns=column_labels)
# DATA: {list, dict (of lists), nested dicts, series, dict of 1D ndarray, 2D ndarray, DataFrame}
# INDEX: {list of row_labels}
# COLUMNS: {list of column_labels}
# outer dict keys interpreted as index labels, inner dict keys interpreted as column labels

INDEXING / SELECTION / SLICING

Python
df[col]  # column selection
df.at[row, col]  # access a single value for a row/column label pair
df.iat[row, col]  # access a single value for a row/column pair by integer position

df.column_label  # column selection

df.loc[label]  # row selection by label
df.iloc[loc]  # row selection by integer location

df[ : ]  # slice rows
df[bool_vec]  # slice rows by boolean vector
df[condition]  # slice rows by condition

df.loc[:, ["column_1", "column_2"]]  # slice columns by names
df.loc[:, bool_vector]  # slice columns by boolean vector

df[col] = *value  # modify column contents, if column is missing it will be created
df[ : ] = *value  # modify row contents
df[condition] = *value  # modify contents by condition

del df[col]  # delete column
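A compact sketch of the access patterns above (column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])
first_col = df["a"]           # column selection
cell = df.at["r1", "b"]       # single value by row/column labels -> 3
row = df.loc["r2"]            # row selection by label
df["c"] = df["a"] + df["b"]   # assignment to a missing column creates it
del df["b"]                   # delete column
```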

DATAFRAME ATTRIBUTES

Python
df.index  # row labels
df.columns  # column labels
df.values  # NumPy representation of DataFrame
df.index.name = "index name"
df.columns.name = "columns name"
df.T  # transpose

DATAFRAME METHODS

Python
pd.DataFrame.isin(self, values)  # boolean DataFrame showing whether elements in DataFrame match elements in values exactly

# Conform DataFrame to new index, new object produced unless the new index is equivalent to current one and copy=False
pd.DataFrame.reindex(self, index=None, columns=None, **kwargs)
# INDEX: {array} -- new labels / index
# COLUMNS: {array} -- new labels / columns
# METHOD: {none (don't fill gaps), pad (fill or carry values forward), backfill (fill or carry values backward)} -- hole filling method
# COPY: {bool} -- return new object even if index is same -- DEFAULT True
# FILLVALUE: {scalar} -- value to use for missing values -- DEFAULT NaN

pd.DataFrame.drop(self, index=None, columns=None, **kwargs)  # Remove rows or columns by specifying label names
# INPLACE: {bool} -- if true do operation in place and return None -- DEFAULT False
# ERRORS: {ignore, raise} -- If "ignore", suppress error and existing labels are dropped
# KeyError raised if not all of the labels are found in the selected axis

INDEX OBJECTS

Holds axis labels and metadata, immutable.

INDEX TYPES

Python
pd.Index  # immutable ordered ndarray, sliceable. stores axis labels
pd.Int64Index  # special case of Index with purely integer labels
pd.MultiIndex  # multi-level (hierarchical) index object for pandas objects
pd.PeriodIndex  # immutable ndarray holding ordinal values indicating regular periods in time
pd.DatetimeIndex  # nanosecond timestamps (uses NumPy datetime64)

INDEX ATTRIBUTES

Python
pd.Index.is_monotonic_increasing  # Return True if the index contains only equal or increasing values
pd.Index.is_monotonic_decreasing  # Return True if the index contains only equal or decreasing values
pd.Index.is_unique  # Return True if the index has unique values.
pd.Index.hasnans  # Return True if the index has NaNs

INDEX METHODS

Python
pd.Index.append(self, other)  # append a collection of Index objects together

pd.Index.difference(self, other, sort=None)  # set difference of two Index objects
# SORT: {None (attempt sorting), False (don't sort)}

pd.Index.intersection(self, other, sort=None)  # set intersection of two Index objects
# SORT: {None (attempt sorting), False (don't sort)}

pd.Index.union(self, other, sort=None)  # set union of two Index objects
# SORT: {None (attempt sorting), False (don't sort)}

pd.Index.isin(self, values, level=None)  # boolean array indicating where the index values are in values
pd.Index.insert(self, loc, item)  # make new Index inserting new item at location
pd.Index.delete(self, loc)  # make new Index with passed location(-s) deleted

pd.Index.drop(self, labels, errors='raise')  # Make new Index with passed list of labels deleted
# ERRORS: {ignore, raise} -- If 'ignore', suppress error and existing labels are dropped
# KeyError raised if not all of the labels are found in the selected axis

pd.Index.reindex(self, target, **kwargs)  # create index with target's values (move/add/delete values as necessary)
# METHOD: {none (don't fill gaps), pad (fill or carry values forward), backfill (fill or carry values backward)} -- hole filling method

ARITHMETIC OPERATIONS

NumPy array operations preserve the label-value link.
Arithmetic operations automatically align differently indexed data.
Missing values propagate in arithmetic computations (NaN <operator> value = NaN).
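Alignment and NaN propagation can be sketched with two partially overlapping Series (labels and values are illustrative):

```python
import pandas as pd

a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "z"])

plain = a + b                    # non-overlapping labels "x" and "z" become NaN
filled = a.add(b, fill_value=0)  # missing operands substituted with 0 instead
```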

ADDITION

Python
self + other
pd.Series.add(self, other, fill_value=None)  # add(), supports substitution of NaNs
pd.Series.radd(self, other, fill_value=None)  # radd(), supports substitution of NaNs
pd.DataFrame.add(self, other, axis=columns, fill_value=None)  # add(), supports substitution of NaNs
pd.DataFrame.radd(self, other, axis=columns, fill_value=None)  # radd(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

SUBTRACTION

Python
self - other
pd.Series.sub(self, other, fill_value=None)  # sub(), supports substitution of NaNs
pd.Series.rsub(self, other, fill_value=None)  # rsub(), supports substitution of NaNs
pd.DataFrame.sub(self, other, axis=columns, fill_value=None)  # sub(), supports substitution of NaNs
pd.DataFrame.rsub(self, other, axis=columns, fill_value=None)  # rsub(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

MULTIPLICATION

Python
self * other
pd.Series.mul(self, other, fill_value=None)  # mul(), supports substitution of NaNs
pd.Series.rmul(self, other, fill_value=None)  # rmul(), supports substitution of NaNs
pd.DataFrame.mul(self, other, axis=columns, fill_value=None)  # mul(), supports substitution of NaNs
pd.DataFrame.rmul(self, other, axis=columns, fill_value=None)  # rmul(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

DIVISION (float division)

Python
self / other
pd.Series.div(self, other, fill_value=None)  # div(), supports substitution of NaNs
pd.Series.rdiv(self, other, fill_value=None)  # rdiv(), supports substitution of NaNs
pd.Series.truediv(self, other, fill_value=None)  # truediv(), supports substitution of NaNs
pd.Series.rtruediv(self, other, fill_value=None)  # rtruediv(), supports substitution of NaNs
pd.DataFrame.div(self, other, axis=columns, fill_value=None)  # div(), supports substitution of NaNs
pd.DataFrame.rdiv(self, other, axis=columns, fill_value=None)  # rdiv(), supports substitution of NaNs
pd.DataFrame.truediv(self, other, axis=columns, fill_value=None)  # truediv(), supports substitution of NaNs
pd.DataFrame.rtruediv(self, other, axis=columns, fill_value=None)  # rtruediv(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

FLOOR DIVISION

Python
self // other
pd.Series.floordiv(self, other, fill_value=None)  # floordiv(), supports substitution of NaNs
pd.Series.rfloordiv(self, other, fill_value=None)  # rfloordiv(), supports substitution of NaNs
pd.DataFrame.floordiv(self, other, axis=columns, fill_value=None)  # floordiv(), supports substitution of NaNs
pd.DataFrame.rfloordiv(self, other, axis=columns, fill_value=None)  # rfloordiv(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

MODULO

Python
self % other
pd.Series.mod(self, other, fill_value=None)  # mod(), supports substitution of NaNs
pd.Series.rmod(self, other, fill_value=None)  # rmod(), supports substitution of NaNs
pd.DataFrame.mod(self, other, axis=columns, fill_value=None)  # mod(), supports substitution of NaNs
pd.DataFrame.rmod(self, other, axis=columns, fill_value=None)  # rmod(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

POWER

Python
self ** other
pd.Series.pow(self, other, fill_value=None)  # pow(), supports substitution of NaNs
pd.Series.rpow(self, other, fill_value=None)  # rpow(), supports substitution of NaNs
pd.DataFrame.pow(self, other, axis=columns, fill_value=None)  # pow(), supports substitution of NaNs
pd.DataFrame.rpow(self, other, axis=columns, fill_value=None)  # rpow(), supports substitution of NaNs
# OTHER: {scalar, sequence, Series, DataFrame}
# AXIS: {0, 1, index, columns} -- whether to compare by the index or columns
# FILLVALUE: {None, float} -- fill missing value

ESSENTIAL FUNCTIONALITY

FUNCTION APPLICATION AND MAPPING

NumPy ufuncs work fine with pandas objects.

Python
pd.DataFrame.applymap(self, func)  # apply function element-wise

pd.DataFrame.apply(self, func, axis=0, args=())  # apply a function along an axis of a DataFrame
# FUNC: {function} -- function to apply
# AXIS: {0, 1, index, columns} -- axis along which the function is applied
# ARGS: {tuple} -- positional arguments to pass to func in addition to the array/series
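The two application forms above, sketched on a toy DataFrame (column names and the lambdas are illustrative; newer pandas also offers DataFrame.map as the element-wise variant):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
col_range = df.apply(lambda col: col.max() - col.min())  # per column (axis=0)
doubled = df.applymap(lambda x: x * 2)                   # element-wise
```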

SORTING AND RANKING

Python
pd.Series.sort_index(self, ascending=True, **kwargs)  # sort Series by index labels
pd.Series.sort_values(self, ascending=True, **kwargs)  # sort series by the values
# ASCENDING: {bool} -- if True, sort values in ascending order, otherwise descending -- DEFAULT True
# INPLACE: {bool} -- if True, perform operation in-place
# KIND: {quicksort, mergesort, heapsort} -- sorting algorithm
# NA_POSITION: {first, last} -- 'first' puts NaNs at the beginning, 'last' puts NaNs at the end

pd.DataFrame.sort_index(self, axis=0, ascending=True, **kwargs)  # sort object by labels along an axis
pd.DataFrame.sort_values(self, axis=0, ascending=True, **kwargs)  # sort object by values along an axis
# AXIS: {0, 1, index, columns} -- the axis along which to sort
# ASCENDING: {bool} -- if True, sort values in ascending order, otherwise descending -- DEFAULT True
# INPLACE: {bool} -- if True, perform operation in-place
# KIND: {quicksort, mergesort, heapsort} -- sorting algorithm
# NA_POSITION: {first, last} -- 'first' puts NaNs at the beginning, 'last' puts NaNs at the end
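A quick sketch of both sort orders on a toy Series (labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([3, 1, 2], index=["c", "a", "b"])
by_index = s.sort_index()                  # sorts by labels: a, b, c
by_value = s.sort_values(ascending=False)  # sorts by values: 3, 2, 1
```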

DESCRIPTIVE AND SUMMARY STATISTICS

COUNT

Python
pd.Series.count(self)  # return number of non-NA/null observations in the Series
pd.DataFrame.count(self, numeric_only=False)  # count non-NA cells for each column or row
# NUMERIC_ONLY: {bool} -- Include only float, int or boolean data -- DEFAULT False

DESCRIBE

Generate descriptive statistics summarizing central tendency, dispersion and shape of dataset's distribution (exclude NaN).

Python
pd.Series.describe(self, percentiles=None, include=None, exclude=None)
pd.DataFrame.describe(self, percentiles=None, include=None, exclude=None)
# PERCENTILES: {list-like of numbers} -- percentiles to include in output, between 0 and 1 -- DEFAULT [.25, .5, .75]
# INCLUDE: {all, None, list of dtypes} -- white list of dtypes to include in the result. ignored for Series
# EXCLUDE: {None, list of dtypes} -- black list of dtypes to omit from the result. ignored for Series

MAX - MIN

Python
pd.Series.max(self, skipna=None, numeric_only=None)  # maximum of the values for the requested axis
pd.Series.min(self, skipna=None, numeric_only=None)  # minimum of the values for the requested axis
pd.DataFrame.max(self, axis=None, skipna=None, numeric_only=None)  # maximum of the values for the requested axis
pd.DataFrame.min(self, axis=None, skipna=None, numeric_only=None)  # minimum of the values for the requested axis
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

IDXMAX - IDXMIN

Python
pd.Series.idxmax(self, skipna=True)  # row label of the maximum value
pd.Series.idxmin(self, skipna=True)  # row label of the minimum value
pd.DataFrame.idxmax(self, axis=0, skipna=True)  #  Return index of first occurrence of maximum over requested axis
pd.DataFrame.idxmin(self, axis=0, skipna=True)  #  Return index of first occurrence of minimum over requested axis
# AXIS:{0, 1, index, columns} -- row-wise or column-wise
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, result will be NA

QUANTILE

Python
pd.Series.quantile(self, q=0.5, interpolation='linear')  # return values at the given quantile
pd.DataFrame.quantile(self, q=0.5, axis=0, numeric_only=True, interpolation='linear') # return values at the given quantile over requested axis
# Q: {float, array} -- value between 0 <= q <= 1, the quantile(s) to compute -- DEFAULT 0.5 (50%)
# NUMERIC_ONLY: {bool} -- if False, quantile of datetime and timedelta data will be computed as well
# INTERPOLATION: {linear, lower, higher, midpoint, nearest} -- SEE DOCS

SUM

Python
pd.Series.sum(self, skipna=None, numeric_only=None, min_count=0)  # sum of the values
pd.DataFrame.sum(self, axis=None, skipna=None, numeric_only=None, min_count=0)  # sum of the values for the requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series
# MIN_COUNT: {int} -- required number of valid values to perform the operation. if fewer than min_count non-NA values are present the result will be NA
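The effect of skipna can be sketched in two lines (toy data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
total = s.sum()               # NaN skipped by default -> 4.0
strict = s.sum(skipna=False)  # NaN propagates into the result
```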

MEAN

Python
pd.Series.mean(self, skipna=None, numeric_only=None)   # mean of the values
pd.DataFrame.mean(self, axis=None, skipna=None, numeric_only=None)  # mean of the values for the requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

MEDIAN

Python
pd.Series.median(self, skipna=None, numeric_only=None)  # median of the values
pd.DataFrame.median(self, axis=None, skipna=None, numeric_only=None)  # median of the values for the requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

MAD (mean absolute deviation)

Python
pd.Series.mad(self, skipna=None)  # mean absolute deviation
pd.DataFrame.mad(self, axis=None, skipna=None)  # mean absolute deviation of the values for the requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result

VAR (variance)

Python
pd.Series.var(self, skipna=None, ddof=1, numeric_only=None)  # unbiased variance
pd.DataFrame.var(self, axis=None, skipna=None, ddof=1, numeric_only=None)  #  unbiased variance over requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, the result will be NA
# DDOF: {int} -- Delta Degrees of Freedom. divisor used in calculations is N - ddof (N represents the number of elements) -- DEFAULT 1
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

STD (standard deviation)

Python
pd.Series.std(self, skipna=None, ddof=1, numeric_only=None)  # sample standard deviation
pd.DataFrame.std(self, axis=None, skipna=None, ddof=1, numeric_only=None)  # sample standard deviation over requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, the result will be NA
# DDOF: {int} -- Delta Degrees of Freedom. divisor used in calculations is N - ddof (N represents the number of elements) -- DEFAULT 1
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

SKEW

Python
pd.Series.skew(self, skipna=None, numeric_only=None)  # unbiased skew, normalized by N-1
pd.DataFrame.skew(self, axis=None, skipna=None, numeric_only=None)  # unbiased skew over requested axis, normalized by N-1
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

KURT

Unbiased kurtosis over requested axis using Fisher's definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Python
pd.Series.kurt(self, skipna=None, numeric_only=None)
pd.DataFrame.kurt(self, axis=None, skipna=None, numeric_only=None)
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values when computing the result
# NUMERIC_ONLY: {bool} -- include only float, int, boolean columns, not implemented for Series

CUMSUM (cumulative sum)

Python
pd.Series.cumsum(self, skipna=True)  # cumulative sum
pd.DataFrame.cumsum(self, axis=None, skipna=True)  # cumulative sum over requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, the result will be NA

CUMMAX - CUMMIN (cumulative maximum - minimum)

Python
pd.Series.cummax(self, skipna=True)  # cumulative maximum
pd.Series.cummin(self, skipna=True)  # cumulative minimum
pd.DataFrame.cummax(self, axis=None, skipna=True)  # cumulative maximum over requested axis
pd.DataFrame.cummin(self, axis=None, skipna=True)  # cumulative minimum over requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, the result will be NA

CUMPROD (cumulative product)

Python
pd.Series.cumprod(self, skipna=True)  # cumulative product
pd.DataFrame.cumprod(self, axis=None, skipna=True)  # cumulative product over requested axis
# AXIS: {0, 1, index, columns} -- axis for the function to be applied on
# SKIPNA: {bool} -- exclude NA/null values. if an entire row/column is NA, the result will be NA

DIFF

Calculates the difference of a DataFrame element compared with another element in the DataFrame.
(default is the element in the same column of the previous row)

Python
pd.Series.diff(self, periods=1)
pd.DataFrame.diff(self, periods=1, axis=0)
# PERIODS: {int} -- Periods to shift for calculating difference, accepts negative values -- DEFAULT 1
# AXIS: {0, 1, index, columns} -- Take difference over rows or columns

PCT_CHANGE

Percentage change between the current and a prior element.

Python
pd.Series.pct_change(self, periods=1, fill_method='pad', limit=None, freq=None)
pd.DataFrame.pct_change(self, periods=1, fill_method='pad', limit=None)
# PERIODS: {int} -- periods to shift for forming percent change
# FILL_METHOD: {pad, ffill, backfill, bfill} -- how to handle NAs before computing percent changes -- DEFAULT pad
# LIMIT: {int} -- number of consecutive NAs to fill before stopping -- DEFAULT None
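diff and pct_change side by side on a toy Series (values are illustrative):

```python
import pandas as pd

s = pd.Series([100.0, 110.0, 121.0])
change = s.diff()      # difference from the previous row (first row is NaN)
pct = s.pct_change()   # relative change: 10% growth at each step
```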

HANDLING MISSING DATA

FILTERING OUT MISSING DATA

Python
pd.Series.dropna(self, inplace=False)  # return a new Series with missing values removed
pd.DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)  # return a new DataFrame with missing values removed
# AXIS: {0, 1, index, columns} -- axis along which missing values are removed -- DEFAULT 0
# HOW: {any, all} -- determine if row or column is removed from DataFrame (ANY = if any NA present, ALL = if all values are NA) -- DEFAULT any
# THRESH: {int} -- require that many non-NA values
# SUBSET: {array} -- labels along other axis to consider
# INPLACE: {bool} -- if True, do operation inplace and return None -- DEFAULT False
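A sketch of dropna along both axes and with a threshold (column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [3.0, 4.0]})
rows = df.dropna()          # drop rows containing any NaN
cols = df.dropna(axis=1)    # drop columns containing any NaN
kept = df.dropna(thresh=1)  # keep rows with at least 1 non-NA value
```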

FILLING IN MISSING DATA

Fill NA/NaN values using the specified method.

Python
pd.Series.fillna(self, value=None, method=None, inplace=False, limit=None)
pd.DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None)
# VALUE: {scalar, dict, Series, DataFrame} -- value to use to fill holes, dict/Series/DataFrame specifying which value to use for each index or column
# METHOD: {backfill, pad, None} -- method to use for filling holes -- DEFAULT None
# AXIS: {0, 1, index, columns} -- axis along which to fill missing values
# INPLACE: {bool} -- if true fill in-place (will modify views of object) -- DEFAULT False
# LIMIT: {int} --  maximum number of consecutive NaN values to forward/backward fill -- DEFAULT None
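Scalar fill versus forward fill, sketched on a toy Series (newer pandas prefers the ffill()/bfill() methods over method='pad'/'backfill'):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
zeros = s.fillna(0)   # fill holes with a scalar value
padded = s.ffill()    # forward fill, same effect as method='pad'
```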

HIERARCHICAL INDEXING (MultiIndex)

Enables storing and manipulating data with an arbitrary number of dimensions inside lower-dimensional data structures like Series (1D) and DataFrame (2D).

MULTIINDEX CREATION

Python
pd.MultiIndex.from_arrays(arrays, names=None)  # convert list of arrays to MultiIndex
pd.MultiIndex.from_tuples(tuples, names=None)  # convert list of tuples to MultiIndex
pd.MultiIndex.from_frame(df, names=None)  # convert DataFrame to MultiIndex
pd.MultiIndex.from_product(iterables, names=None)  # MultiIndex from cartesian product of iterables
pd.Series(data, index=arrays)  # constructor makes MultiIndex from a list of arrays
pd.DataFrame(data, index=arrays)  # constructor makes MultiIndex from a list of arrays
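A small sketch using from_product and get_level_values (level names and data are illustrative):

```python
import pandas as pd

# cartesian product of ["a", "b"] and [1, 2] -> 4 index entries
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])
s = pd.Series(range(4), index=mi)
letters = mi.get_level_values("letter")  # label vector for the first level
```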

MULTIINDEX LEVELS

Vector of label values for requested level, equal to the length of the index.

Python
pd.MultiIndex.get_level_values(self, level)

PARTIAL AND CROSS-SECTION SELECTION

Partial selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame.

Python
pd.Series.xs(self, key, axis=0, level=None, drop_level=True)  # cross-section from Series
pd.DataFrame.xs(self, key, axis=0, level=None, drop_level=True)  #  cross-section from DataFrame
# KEY: {label, tuple of label} -- label contained in the index, or partially in a MultiIndex
# AXIS: {0, 1, index, columns} -- axis to retrieve cross-section on -- DEFAULT 0
# LEVEL: -- in case of key partially contained in MultiIndex, indicate which levels are used. Levels referred by label or position
# DROP_LEVEL: {bool} -- If False, returns object with same levels as self -- DEFAULT True
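Partial selection versus a cross-section on an inner level, sketched on a toy MultiIndex Series:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
s = pd.Series([10, 20, 30], index=mi)
partial = s.loc["a"]      # partial selection: first level dropped from result
cross = s.xs(1, level=1)  # cross-section on the second level
```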

INDEXING, SLICING

Multi index keys take the form of tuples.

Python
df.loc[('lvl_1', 'lvl_2', ...)]  # selection of single row
df.loc[('idx_lvl_1', 'idx_lvl_2', ...), ('col_lvl_1', 'col_lvl_2', ...)]  # selection of single value

df.loc['idx_lvl_1':'idx_lvl_1']  # slice of rows (aka partial selection)
df.loc[('idx_lvl_1', 'idx_lvl_2') : ('idx_lvl_1', 'idx_lvl_2')]  # slice of rows with levels

REORDERING AND SORTING LEVELS

Python
pd.MultiIndex.swaplevel(self, i=-2, j=-1)  # swap level i with level j
pd.Series.swaplevel(self, i=-2, j=-1) # swap levels i and j in a MultiIndex
pd.DataFrame.swaplevel(self, i=-2, j=-1, axis=0)  # swap levels i and j in a MultiIndex on a particular axis

pd.MultiIndex.sortlevel(self, level=0, ascending=True, sort_remaining=True)  # sort MultiIndex at requested level
# LEVEL: {str, int, list-like} -- DEFAULT 0
# ASCENDING: {bool} -- if True, sort values in ascending order, otherwise descending -- DEFAULT True
# SORT_REMAINING: {bool} -- sort by the remaining levels after level

DATA LOADING, STORAGE, FILE FORMATS

Python
pd.read_fwf(filepath, colspecs='infer', widths=None, infer_nrows=100)  # read a table of fixed-width formatted lines into DataFrame
# FILEPATH: {str, path object} -- any valid string path is acceptable, could be a URL. Valid URLs: http, ftp, s3, and file
# COLSPECS: {list of tuple (int, int), 'infer'} -- list of tuples giving extents of fixed-width fields of each line as half-open intervals { [from, to) }
# WIDTHS: {list of int} -- list of field widths which can be used instead of "colspecs" if intervals are contiguous
# INFER_NROWS: {int} -- number of rows to consider when letting parser determine colspecs -- DEFAULT 100

pd.read_excel()  # read an Excel file into a pandas DataFrame
pd.read_json()  # convert a JSON string to pandas object
pd.read_html()  # read HTML tables into a list of DataFrame objects
pd.read_sql()  # read SQL query or database table into a DataFrame

pd.read_csv(filepath, sep=',', *args, **kwargs)  # read a comma-separated values (csv) file into DataFrame
pd.read_table(filepath, sep='\t', *args, **kwargs)  # read general delimited file into DataFrame
# FILEPATH: {str, path object} -- any valid string path is acceptable, could be a URL. Valid URLs: http, ftp, s3, and file
# SEP: {str} -- delimiter to use -- DEFAULT ',' for read_csv, '\t' (tab) for read_table
# HEADER {int, list of int, 'infer'} -- row numbers to use as column names, and the start of the data -- DEFAULT 'infer'
# NAMES:{array} -- list of column names to use -- DEFAULT None
# INDEX_COL: {int, str, False, sequence of int/str, None} -- Columns to use as row labels of DataFrame, given as string name or column index -- DEFAULT None
# SKIPROWS: {list-like, int, callable} -- Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file
# NA_VALUES: {scalar, str, list-like, dict} -- additional strings to recognize as NA/NaN. if dict passed, specific per-column NA values
# THOUSANDS: {str} -- thousand separator
# *ARGS, **KWARGS -- SEE DOCS
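A self-contained round-trip sketch using an in-memory buffer instead of a real file (the CSV content is made up):

```python
import io

import pandas as pd

csv_text = "a,b\n1,2\n3,4\n"
df = pd.read_csv(io.StringIO(csv_text))  # any file-like object works as filepath
out = df.to_csv(index=False)             # write back without the row index
```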

# write object to a comma-separated values (csv) file
pd.DataFrame.to_csv(self, path_or_buf, sep=',', na_rep='', columns=None, header=True, index=True, encoding='utf-8', line_terminator=None, decimal='.', *args, **kwargs)
# SEP: {str len 1} --  Field delimiter for the output file
# NA_REP: {str} -- missing data representation
# COLUMNS: {sequence} -- columns to write
# HEADER: {bool, list of str} -- write out column names. if a list of strings is given it is assumed to be aliases for the column names
# INDEX: {bool, list of str} -- write out row names (index)
# ENCODING: {str} -- string representing encoding to use -- DEFAULT "utf-8"
# LINE_TERMINATOR: {str} -- newline character or character sequence to use in the output file -- DEFAULT os.linesep
# DECIMAL: {str} -- character recognized as decimal separator (in EU ,)

pd.DataFrame.to_excel()
pd.DataFrame.to_json()
pd.DataFrame.to_html()
pd.DataFrame.to_sql()