Selected pandas code snippets and recipes that I revisit now and again. The snippets are adapted from different Python scripts written over time; ignore the variable names.
This post was inspired by the wonderful work of Chris Albon and his collection of code snippets.
Last update: 13th June 2021
Reading and Writing
Basic reading of a csv file with no header
# If you have no column names but know the number of columns
pd.read_csv(file_name, header=None, names=['col1', 'col2'])
Saving a file so it does not have an 'Unnamed' column
# header=False for no column names
df1.to_csv(os.path.join(output_dir, 'file_name_to_save_as.csv'), sep=',', columns=df1.columns, index=False, header=False)
Quickly generate a pandas dataframe from n lists
= ["C", "CCC"]
smiles = [1.5, 2.3]
labels = pd.DataFrame( list( zip(smiles, labels) ), columns=["smiles", "task"] ) df
Information about the dataframe
pandas_dataframe.info()
Summary statistics (mean, quartile ranges)
pandas_dataframe.describe().round(2)
Replace
df = df.replace( [list_of_value_to_replace], value_to_replace_with )
# Eg: df.replace( [98, 99], np.nan )
Replace characters in the columns
# List of characters to remove
chars_to_remove = ['+', '$', ',']

# List of column names to clean
cols_to_clean = ['Installs', 'Price']

# Loop for each column in cols_to_clean
for col in cols_to_clean:
    # Loop for each char in chars_to_remove
    for char in chars_to_remove:
        # Replace the character with an empty string
        apps[col] = apps[col].apply(lambda x: x.replace(char, ''))
    # Convert col to float data type
    apps[col] = apps[col].astype(float)
Convert titles with spaces into one word separated by '-'
reduced_df['product_title'] = reduced_df['product_title'].apply( lambda x: x.lower().replace(' ', '-') )
Define a new column with temp entries
pandas_dataframe['columns_name'] = 42
Create columns in a loop
pandas_dataframe.columns = ['feature_' + str(i) for i in range(n_columns)]
Dropping miscellaneous columns and NaN entries
columns_to_drop = ['CookTimeInMins', 'Servings', 'Course', 'Diet', 'Instructions', 'TranslatedInstructions', 'URL']
food_df = food_df.drop(columns=columns_to_drop).dropna()
Post process
Highlight cells based on a condition
df = pd.DataFrame({
    "col1": [-5, -2, 1, 4],
    "col2": [2, 3, -1, 4]
})

def highlight_number(row):
    return [
        "background-color: red; color: white"
        if cell <= 0
        else "background-color: green; color: white"
        for cell in row
    ]

df.style.apply(highlight_number)
Quick plotting
Simple Pearson correlation plot
# Generate Pearson correlation matrix for HOUSING
corr_matrix = housing.corr()

# Edit the visuals and precision
corr_matrix.style.background_gradient(cmap='coolwarm').set_precision(2)

# Look at Pearson values for one attribute
corr_matrix['median_house_value'].sort_values(ascending=True)
Plot multiple scatter plots
from pandas.plotting import scatter_matrix
= ["median_house_value", "median_income", "total_rooms",
attributes "housing_median_age"]
= scatter_matrix(housing[attributes], figsize=(12, 8)); scatter_axes
Handling missing values
Option A: Drop rows with NaN in that column
housing.dropna(subset=["total_bedrooms"])
Option B: Drop that column entirely
housing.drop("total_bedrooms", axis=1)
Option C: Fill missing values with some central tendency
attribute_median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna( attribute_median, inplace=True )
Checking the NULL entries in the dataset
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
Get number of NULL entries in the dataframe columns
null_columns = food_df.columns[food_df.isnull().any()]
food_df[null_columns].isnull().sum()
Print full rows having NULL entries in the df
is_NaN = food_df.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = food_df[row_has_NaN]
Dropping NULL only from a particular column
df_income_drop_na = df.dropna(subset=['INCOME2'])
Join two datasets
1. Inner join
Only returns rows with matching values in both dataframes.
A_B = A.merge(B, on='common_column', suffixes=('_A', '_B'))  # 'suffixes' is a tuple appended to columns that share a name in both frames
Remember that .merge() only returns rows where the values match in both tables.
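As a quick illustration, here is a minimal sketch with two made-up frames (the names and values below are assumptions for illustration, not from the original notebooks):
import pandas as pd

# Hypothetical example frames
A = pd.DataFrame({'id': [1, 2, 3], 'price': [10, 20, 30]})
B = pd.DataFrame({'id': [2, 3, 4], 'price': [12, 25, 40]})

# Inner join on 'id': only ids 2 and 3 exist in both frames,
# so the result has two rows with columns id, price_A, price_B
A_B = A.merge(B, on='id', suffixes=('_A', '_B'))
print(A_B)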
2. Merging more than one table
df1.merge(df2, on='col_A') \
   .merge(df3, on='col_B') \
   .merge(df4, on='col_C')
3. Merge across multiple columns
df1.merge( df2, on=['col_A', 'col_B'] )
Searching
Find column names based on a string
df_raw_data.columns[df_raw_data.columns.str.contains('STRING_SUBSET')]
Find rows in the column 'model' based on a string
Select all row entries that start with Mac
df[df['model'].str.match('Mac')]
Select all row entries that contain 'ac' in them
df[df['model'].str.contains('ac')]
Filter entries in the column based on a threshold. The data has Indian-inspired international cuisines, which are not what we are interested in.
cuisin_counts = food_df['Cuisine'].value_counts()
cuisin_counts_more_than_50 = cuisin_counts.iloc[np.where(cuisin_counts > 50)]
# Dropping entries in `food_df` which have non-Indian cuisines
food_df_top_cuisine = food_df.loc[ food_df['Cuisine'].isin(list(cuisin_counts_more_than_50.index)) ]
Clean up entries with partial matches
df.loc[df['Store Name'].str.contains('Wal', case=False), 'Store_Group_1'] = 'Walmarts'

south_indian_tag = ['Chettinad', 'Andhra', 'Karnataka', 'Tamil Nadu', 'Kerala Recipes', 'South Indian Recipes']
food_df_top_cuisine.loc[food_df_top_cuisine['Cuisine'].isin(south_indian_tag), 'Combined_cuisine'] = 'South Indian'
With 'or' statements
string_filter_option = ['cond_1', 'cond_2']
pandas_dataframe[ pandas_dataframe['columns'].str.contains('|'.join(string_filter_option)) ]
Filter rows in the pandas df with another list
month_list = ['May','Jun','Jul','Aug','Sep']
df_pH.loc[df_pH['Month'].isin(month_list)]
Filter out values using names: Making a separate list of those that DO NOT satisfy the constraint
no_bands = halftime_musicians[ ~halftime_musicians.musician.str.contains('Marching') ]
Statistics & Distributions
Histogram
df.hist('WTKG3')
CDF and PDF
# Functions for PMF and CDF, we will come to those later in the notebook
def pmf(pandas_series):
    values, counts = np.unique(pandas_series, return_counts=True)
    pmf = np.c_[ values, counts / sum(counts) ]
    return pmf

def cdf(pandas_series):
    values, counts = np.unique(pandas_series, return_counts=True)
    pmf = np.c_[ values, counts / sum(counts) ]
    cdf = np.zeros(shape=pmf.shape)
    for i in range(0, pmf.shape[0]):
        cdf[i] = [pmf[i][0], np.sum(pmf[:i+1], axis=0)[-1]]
    return cdf
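A small usage sketch for the two helpers above (the column name and plotting calls are assumptions for illustration):
import matplotlib.pyplot as plt

# Column 0 holds the unique values, column 1 the (cumulative) probabilities
sepal_pmf = pmf(iris_df['sepal_length'])
sepal_cdf = cdf(iris_df['sepal_length'])
plt.step(sepal_cdf[:, 0], sepal_cdf[:, 1], where='post')
plt.xlabel('sepal_length')
plt.ylabel('CDF')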
Confidence interval
A bootstrap analysis of the reduction of deaths due to handwashing
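The snippet assumes two pandas Series of per-period death proportions already exist; a hypothetical stand-in (purely illustrative, not the real handwashing data) could be:
import numpy as np
import pandas as pd

# Made-up proportions standing in for before_proportion / after_proportion
before_proportion = pd.Series(np.random.uniform(0.05, 0.15, size=76))
after_proportion = pd.Series(np.random.uniform(0.01, 0.05, size=98))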
boot_mean_diff = []
for i in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append( boot_after.mean() - boot_before.mean() )
Calculating a 95% confidence interval from boot_mean_diff
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
Convert variables
Convert continuous variable to discrete
pd.cut
Example 1:
"income_cat"] = pd.cut(housing["median_income"],
housing[=[0., 1.5, 3.0, 4.5, 6., np.inf], #bins around 2-5 income bracket
bins=[1, 2, 3, 4, 5]) labels
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
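For instance, a minimal sketch of the age-range idea mentioned above (the ages, bin edges, and labels are made up):
ages = pd.Series([5, 17, 25, 42, 67, 81])
# Each age falls into the bin whose edges enclose it
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 120], labels=['child', 'young adult', 'adult', 'senior'])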
Example 2:
pd.cut(iris_df['sepal_length'], bins=3, right=True, labels=['low','med','high'], retbins=True)
Fine tune the labeling
def convert_to_cat(panda_series):
    first_quarter = panda_series.describe()['25%']
    third_quarter = panda_series.describe()['75%']
    print(first_quarter, third_quarter)

    cat_list = ['temp'] * len(panda_series)

    for i, entry in enumerate(panda_series):
        if entry <= first_quarter:
            cat_list[i] = 'SMALL'
        elif first_quarter < entry <= third_quarter:
            cat_list[i] = 'MED'
        else:
            cat_list[i] = 'LARGE'

    return cat_list
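A possible usage sketch, reusing the iris dataframe from the other examples (the new column name is hypothetical):
# 'sepal_length_size' is a made-up column name for the categorical labels
iris_df['sepal_length_size'] = convert_to_cat(iris_df['sepal_length'])
iris_df['sepal_length_size'].value_counts()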
Categorical variables to one-hot
# Pandas get dummies is one option
pd.get_dummies(iris_df['sepal_width_cat'], prefix='sepal_width')
One-hot discrete variable with more granularity
def OHE_discreet(point, pandas_series, intervals):
    '''
    define range for one-hot, for every entry find the closest value in the one-hot
    '''
    z = np.linspace(min(pandas_series), max(pandas_series), intervals)
    ohe = np.zeros(len(z))
    ohe[np.argmin(abs(z - point)**2)] = 1
    return ohe

iris_df['sepal_width_OHE'] = iris_df['sepal_width'].apply(OHE_discreet, args=(iris_df['sepal_width'], 11))
Grouping data by entries in a column:
Example 1
licenses_zip_ward.groupby('alderman').agg({'income':'median'})
Estimate the median of 'income' after grouping the dataframe by the entries in the column 'alderman'.
Example 2
counted_df = licenses_owners.groupby('title').agg({'account':'count'})
I want to know the number of accounts each unique title entry has in the df. Here the column account was counted, and the totals were reported after the dataframe was grouped by the entries in the title column.
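A follow-up sketch for sorting that count to see which titles hold the most accounts (just an illustration of the next step one might take):
# Sort the counted accounts in descending order and inspect the top entries
sorted_df = counted_df.sort_values('account', ascending=False)
print(sorted_df.head())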
Example 3
Groupby multiple columns and show the counts
# Create a column that will store the month
data['month'] = data['date'].dt.month

# Create a column that will store the year
data['year'] = data['date'].dt.year

# Group by the month and year and count the pull requests
counts = data.groupby(['month','year'])['pid'].count()
Example 4
Group and pivot table. Find the number of pull requests for the repo every year for the two authors:
# The developers we are interested in
authors = ['xeno-by', 'soc']
# Get all the developers' pull requests
by_author = pulls.loc[ pulls['user'].isin(authors) ]
by_author['year'] = by_author['date'].dt.year
# Count the number of pull requests submitted each year
counts = by_author.groupby(['user', 'year']).agg({'pid': 'count'}).reset_index()
# Convert the table to a wide format
counts_wide = counts.pivot_table(index='year', columns='user', values='pid', fill_value=0)
# Plot the results
counts_wide.plot(kind='bar')