Pandas cookbook


Selected Pandas code snippets and recipes that I revisit now and again. The snippets are adapted from different Python scripts written over time, so ignore the variable names.

This post was inspired by the wonderful work of Chris Albon and his snippets of code blocks.

Last update: 13th June 2021

Reading and Writing

Reading a CSV file that has no header row

# If you have no column names but know the number of columns
pd.read_csv(file_name, header=None, names=['col1','col2'])

Saving a file so it does not have an ‘Unnamed: 0’ column

df1.to_csv(os.path.join(output_dir, 'file_name_to_save_as.csv'), sep=',', columns=df1.columns, index=False, header=False) # index=False avoids the 'Unnamed: 0' column; header=False also drops the column names
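If a file was already written with its index (so an ‘Unnamed: 0’ column shows up on read), one option on the reading side, assuming the stray index is the first column, is:

df_clean = pd.read_csv('file_with_saved_index.csv', index_col=0) # hypothetical file name; treat the first column as the index instead of data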

Quickly generate a pandas DataFrame from n lists

smiles = ["C", "CCC"]
labels = [1.5, 2.3]
df = pd.DataFrame( list( zip(smiles, labels) ), columns=["smiles", "task"] )

Information about the dataframe

pandas_dataframe.info()

Summary statistics (mean, quartile ranges)

pandas_dataframe.describe().round(2)

Replace

df = df.replace(list_of_values_to_replace, value_to_replace_with)
# Eg: df.replace([98, 99], np.nan)

Replace characters in the columns

# List of characters to remove
chars_to_remove = ['+','$',',']

# List of column names to clean
cols_to_clean = ['Installs','Price']

# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Replace the character with an empty string
        apps[col] = apps[col].apply(lambda x: x.replace(char, ''))
    # Convert col to float data type
    apps[col] = apps[col].astype(float)
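A shorter, vectorized sketch of the same clean-up, assuming the columns hold strings (the characters are regex-escaped because + and $ are metacharacters):

import re

pattern = '|'.join(re.escape(c) for c in chars_to_remove)
for col in cols_to_clean:
    # str.replace with regex=True strips all listed characters in one pass
    apps[col] = apps[col].str.replace(pattern, '', regex=True).astype(float)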

Convert titles containing spaces into lowercase, hyphen-separated strings

reduced_df['product_title'] = reduced_df['product_title'].apply( lambda x: x.lower().replace(' ', '-') )

Define a new column filled with a placeholder value

pandas_dataframe['column_name'] = 42

Create column names in a loop

pandas_dataframe.columns = ['feature_' + str(i) for i in range(n_columns)]

Dropping miscellaneous columns and NaN entries

columns_to_drop = ['CookTimeInMins', 'Servings', 'Course', 'Diet', 'Instructions', 'TranslatedInstructions', 'URL']
food_df = food_df.drop(columns = columns_to_drop).dropna()

Post-processing

Highlight cells based on a condition

df = pd.DataFrame({
    "col1":[-5,-2,1,4],
    "col2":[2,3,-1,4]
})

def highlight_number(row):
    return [
        "background-color: red; color: white"
        if cell <= 0
        else "background-color: green; color: white"
        for cell in row
    ]

df.style.apply(highlight_number)

Quick plotting

Simple Pearson correlation plot

# Generate Pearson Correlation Matrix for HOUSING 
corr_matrix=housing.corr()

# Edit the visuals and precision 
corr_matrix.style.background_gradient(cmap='coolwarm').set_precision(2)

# Look at Pearson values for one attribute 
corr_matrix['median_house_value'].sort_values(ascending=True)

Plot multiple scatter plots

from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_axes = scatter_matrix(housing[attributes], figsize=(12, 8));

Handling missing values

Option A: Drop the rows with NaN in that column

housing.dropna(subset=["total_bedrooms"])

Option B: Drop that column entirely

housing.drop("total_bedrooms", axis=1)

Option C: Fill missing value with some central tendency

attribute_median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna( attribute_median, inplace=True ) 

Checking the NULL entries in the dataset

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()

Get number of NULL entries in the dataframe columns

null_columns=food_df.columns[food_df.isnull().any()]
food_df[null_columns].isnull().sum()

Print full rows having NULL entries in the df

is_NaN = food_df.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = food_df[row_has_NaN]

Dropping NULL only from a particular column

df_income_drop_na = df.dropna(subset=['INCOME2'])

Join two datasets

1. Inner join

Only returns rows with matching values in both df.

A_B = A.merge(B, on='common_column_name', suffixes=('_A', '_B')) # suffixes are appended to columns that share a name in both frames

Remember that .merge() only returns rows where the values match in both tables.
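As a reminder of how the join type changes the result, the how argument controls whether non-matching rows are kept. A small sketch with made-up frames:

A = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
B = pd.DataFrame({'key': [2, 3], 'b': ['p', 'q']})

A.merge(B, on='key')               # inner (default): only key 2 survives
A.merge(B, on='key', how='outer')  # outer: keys 1, 2 and 3, with NaN where there is no match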

2. Merging more than one table

df1.merge(df2, on='col_A') \
    .merge(df3, on='col_B') \
    .merge(df4, on='col_C')

3. Merge across multiple columns

df1.merge( df2, on = ['col_A', 'col_B'])

Searching

Find column names based on a string

df_raw_data.columns[df_raw_data.columns.str.contains('STRING_SUBSET')]

Find rows in column Model based on a string

Select all row entries that start with Mac

df[df['model'].str.match('Mac')]

Select all row entries that contain ac in it

df[df['model'].str.contains('ac')]

Filter entries in a column based on a count threshold. The data has Indian-inspired international cuisines, which are not what we are interested in.

cuisin_counts = food_df['Cuisine'].value_counts()
cuisin_counts_more_than_50 = cuisin_counts.iloc[np.where(cuisin_counts > 50)]
food_df_top_cuisine = food_df.loc[ food_df['Cuisine'].isin(list(cuisin_counts_more_than_50.index))  ] 
# Dropping entries in `food_df` which have non-Indian cuisine tags

Clean up entries with partial matches

df.loc[df['Store Name'].str.contains('Wal', case=False), 'Store_Group_1'] = 'Walmarts'
south_indian_tag = ['Chettinad', 'Andhra', 'Karnataka', 'Tamil Nadu', 'Kerala Recipes', 'South Indian Recipes']
food_df_top_cuisine.loc[food_df_top_cuisine['Cuisine'].isin(south_indian_tag), 'Combined_cuisine'] = 'South Indian'

With OR statements

string_filter_option = ['cond_1', 'cond_2']
pandas_dataframe[ pandas_dataframe['column'].str.contains('|'.join(string_filter_option)) ]

Filter rows in the pandas df with another list

month_list = ['May','Jun','Jul','Aug','Sep']
df_pH.loc[df_pH['Month'].isin(month_list)]

Filter out values using names: Making a separate list of those that DO NOT satisfy the constraint

no_bands = halftime_musicians[ ~halftime_musicians.musician.str.contains('Marching') ]

Statistics & Distributions

Histogram

df.hist('WTKG3')

PMF and CDF

# Functions for the PMF and CDF of a pandas Series
def pmf(pandas_series):
    values, counts = np.unique(pandas_series, return_counts = True)
    pmf = np.c_[ values, counts / sum(counts) ]

    return pmf 

def cdf(pandas_series):
    values, counts = np.unique(pandas_series, return_counts = True)
    pmf = np.c_[ values, counts / sum(counts) ]
    cdf = np.zeros(shape=pmf.shape) 
    
    for i in range(0, pmf.shape[0]):
        cdf[i] = [pmf[i][0], np.sum(pmf[:i+1], axis=0)[-1]] 
        
    return cdf
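A usage sketch, assuming the same weight column WTKG3 used in the histogram above (NaNs are dropped first, since np.unique does not handle them gracefully):

weight_pmf = pmf(df['WTKG3'].dropna())  # column 0: unique values, column 1: probabilities
weight_cdf = cdf(df['WTKG3'].dropna())  # column 1: cumulative probabilities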

Confidence interval

A bootstrap analysis of the reduction of deaths due to handwashing

boot_mean_diff = []
for i in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append( boot_after.mean() - boot_before.mean() )

Calculating a 95% confidence interval from boot_mean_diff

confidence_interval = pd.Series(boot_mean_diff).quantile([0.025,0.975])

Convert variables

Convert continuous variable to discrete

pd.cut 

Example 1:

housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf], #bins around 2-5 income bracket
                               labels=[1, 2, 3, 4, 5])

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Example 2:

pd.cut(iris_df['sepal_length'], bins=3, right=True, labels=['low','med','high'], retbins=True)
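With retbins=True the call returns both the binned values and the bin edges, so it is usually unpacked into two variables:

binned, bin_edges = pd.cut(iris_df['sepal_length'], bins=3, right=True,
                           labels=['low', 'med', 'high'], retbins=True)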

Fine-tune the labeling

def convert_to_cat(panda_series):
    first_quarter = panda_series.describe()['25%']
    third_quarter = panda_series.describe()['75%']
    print(first_quarter, third_quarter)
    
    cat_list = ['temp'] * len(panda_series) 

    for i, entry in enumerate(panda_series):
        if entry <= first_quarter: 
            cat_list[i] = 'SMALL'
        elif first_quarter < entry <= third_quarter:
            cat_list[i] = 'MED'
        else:
            cat_list[i] = 'LARGE'
    
    return cat_list
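A usage sketch, assuming the iris DataFrame from the earlier examples (the new column name is made up):

iris_df['sepal_length_cat'] = convert_to_cat(iris_df['sepal_length'])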

Categorical variables to one-hot

# Pandas get dummies is one option 
pd.get_dummies(iris_df['sepal_width_cat'], prefix='sepal_width')

One-hot discrete variable with more granularity

def OHE_discreet(point, pandas_series, intervals):
    '''
    define range for one-hot, for every entry find the closest value in the one-hot
    '''

    z = np.linspace(min(pandas_series), max(pandas_series), intervals)
    ohe = np.zeros(len(z))
    ohe[np.argmin(abs(z - point)**2)] = 1
    return ohe

iris_df['sepal_width_OHE'] = iris_df['sepal_width'].apply(OHE_discreet, args=(iris_df['sepal_width'], 11))

Grouping data by entries in a column

Example 1

licenses_zip_ward.groupby('alderman').agg({'income':'median'})

Compute the median of ‘income’ after grouping the dataframe by the entries in the ‘alderman’ column.

Example 2

counted_df = licenses_owners.groupby('title').agg({'account':'count'})

I want to know the number of accounts each unique title entry has in the df. Here the account column was counted, and the totals were reported after the dataframe was grouped by the entries in the title column.
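Since this is just a count of rows per title, the same numbers can be obtained with a one-liner (assuming one row per account):

licenses_owners['title'].value_counts()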

Example 3

Groupby multiple columns and show the counts

# Create a column that will store the month
data['month'] = data['date'].dt.month

# Create a column that will store the year
data['year'] = data['date'].dt.year

# Group by the month and year and count the pull requests
counts = data.groupby(['month','year'])['pid'].count()

Example 4

Group and pivot the table. Find the number of pull requests for the repo every year for two authors:

# The developers we are interested in
authors = ['xeno-by', 'soc']

# Get all the developers' pull requests
by_author = pulls.loc[ pulls['user'].isin(authors) ].copy()  # .copy() avoids a SettingWithCopyWarning when adding the year column
by_author['year'] = by_author['date'].dt.year

# Count the number of pull requests submitted each year
counts = by_author.groupby(['user', 'year']).agg({'pid': 'count'}).reset_index()

# Convert the table to a wide format
counts_wide = counts.pivot_table(index='year', columns='user', values='pid', fill_value=0)

# Display the wide-format table
counts_wide
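To get an actual plot rather than the table, one option (assuming matplotlib is available) is a bar chart of the wide table:

counts_wide.plot(kind='bar')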