Skip to Content
ModulesPandas TutorialAdvanced Python

Advanced Pandas

As you become more proficient in Pandas, mastering its advanced features can significantly improve your data analysis workflows. This page covers MultiIndex DataFrames, pivot tables, crosstabs, and performance tuning techniques.


MultiIndex DataFrames

MultiIndex DataFrames allow hierarchical indexing, enabling you to work with complex datasets more effectively.

Creating a MultiIndex DataFrame

import pandas as pd # Creating a MultiIndex DataFrame data = { "Year": [2020, 2020, 2021, 2021], "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"], "Population": [30, 20, 35, 25] } df = pd.DataFrame(data) df.set_index(["Year", "City"], inplace=True) print(df)

Output:

Population Year City 2020 Delhi 30 Mumbai 20 2021 Delhi 35 Mumbai 25

Accessing Data in MultiIndex

# Accessing data for a specific index print(df.loc[(2020, "Delhi")]) # Accessing data for a specific level print(df.xs(2020, level="Year"))

Pivot Tables and Crosstabs

Pivot tables and crosstabs are powerful tools for summarizing data.

Creating a Pivot Table

# Sample DataFrame data = { "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"], "Year": [2020, 2020, 2021, 2021], "Sales": [100, 200, 150, 250] } df = pd.DataFrame(data) # Creating a pivot table pivot = df.pivot_table(values="Sales", index="Year", columns="City", aggfunc="sum") print(pivot)

Output:

City Delhi Mumbai Year 2020 100 200 2021 150 250

Creating a Crosstab

# Creating a crosstab crosstab = pd.crosstab(df["City"], df["Year"]) print(crosstab)

Output:

Year 2020 2021 City Delhi 1 1 Mumbai 1 1

Efficient Performance Tuning with Pandas

When working with large datasets, efficiency is critical. Here are some tips to improve performance:

1. Using eval() and query() for Speed

# Using eval for faster computation df["Total"] = pd.eval("df['Sales'] * 2") print(df) # Using query for efficient filtering filtered = df.query("City == 'Delhi' and Year == 2021") print(filtered)

2. Optimizing Memory Usage

# Downcasting numeric columns df["Sales"] = pd.to_numeric(df["Sales"], downcast="integer") print(df.dtypes)

3. Parallelizing Operations

# Using Dask for parallelized Pandas operations import dask.dataframe as dd dask_df = dd.from_pandas(df, npartitions=2) print(dask_df.compute())

Try It Yourself

Problem 1: MultiIndex DataFrame

Create a MultiIndex DataFrame for a dataset containing years, cities, and temperatures. Access data for a specific year and city.

Show Code

import pandas as pd # Create MultiIndex DataFrame data = { "Year": [2020, 2020, 2021, 2021], "City": ["Delhi", "Mumbai", "Delhi", "Mumbai"], "Temperature": [40, 35, 42, 38] } df = pd.DataFrame(data).set_index(["Year", "City"]) print(df) # Access specific data print(df.loc[(2020, "Delhi")])

Problem 2: Create a Pivot Table

Given a dataset of students’ marks in different subjects, create a pivot table to summarize the average marks per subject for each class.

Show Code

import pandas as pd # Sample Data data = { "Class": ["10th", "10th", "11th", "11th"], "Subject": ["Math", "Science", "Math", "Science"], "Marks": [85, 90, 88, 92] } df = pd.DataFrame(data) # Create Pivot Table pivot = df.pivot_table(values="Marks", index="Class", columns="Subject", aggfunc="mean") print(pivot)

Pyground

Play with Python!

Output:

Last updated on