Skip to Content
ModulesPandas TutorialData Cleansing and Preprocessing

Data Cleaning and Preprocessing in Pandas

Data cleaning is an essential step in preparing raw data for analysis. Pandas provides several tools for handling missing data, replacing values, removing duplicates, and changing data types.


Handling Missing Data

Missing data can occur in datasets for various reasons. Pandas offers methods to identify and handle these missing values.

Identifying Missing Data

import pandas as pd # Sample DataFrame data = { 'Name': ['Anika', 'Rahul', None], 'Age': [25, None, 22], 'City': ['Delhi', 'Mumbai', None] } df = pd.DataFrame(data) print(df.isnull()) # Check for missing values print(df.notnull()) # Check for non-missing values

Output:

Name Age City 0 False False False 1 False True False 2 True False True

Dropping Missing Values

# Drop rows with missing values df_dropped = df.dropna() print(df_dropped) # Drop columns with missing values df_dropped_col = df.dropna(axis=1) print(df_dropped_col)

Filling Missing Values

# Fill missing values with a specified value df_filled = df.fillna("Unknown") print(df_filled) # Fill missing values with different values for each column df_filled_custom = df.fillna({"Name": "No Name", "Age": 0, "City": "Unknown"}) print(df_filled_custom)

Replacing Values

Replace specific values in the dataset using the replace() method.

# Replace values in a column df_replaced = df.replace({"Delhi": "New Delhi", None: "Unavailable"}) print(df_replaced)

Removing Duplicates

Identify and remove duplicate rows using the duplicated() and drop_duplicates() methods.

# Sample DataFrame with duplicates data = { 'Name': ['Anika', 'Rahul', 'Anika'], 'Age': [25, 30, 25], 'City': ['Delhi', 'Mumbai', 'Delhi'] } df = pd.DataFrame(data) print(df.duplicated()) # Identify duplicates # Remove duplicate rows df_no_duplicates = df.drop_duplicates() print(df_no_duplicates)

Changing Data Types

Convert data types using the astype() method.

# Sample DataFrame data = { 'Name': ['Anika', 'Rahul', 'Sneha'], 'Age': ['25', '30', '22'], 'Salary': ['50000', '60000', '45000'] } df = pd.DataFrame(data) # Convert Age and Salary to integers df['Age'] = df['Age'].astype(int) df['Salary'] = df['Salary'].astype(float) print(df.dtypes)

Try It Yourself

Problem 1: Handle Missing Data

Create a DataFrame with some missing values. Fill missing values in one column with a specific value and drop rows with missing values in another column.

Show Code

data = { 'Product': ['Laptop', 'Phone', None], 'Price': [80000, None, 20000], 'Stock': [50, 150, None] } df = pd.DataFrame(data) # Fill missing Product values with "Unknown" df['Product'] = df['Product'].fillna("Unknown") # Drop rows with missing values in the Price column df_cleaned = df.dropna(subset=['Price']) print(df_cleaned)

Problem 2: Remove Duplicates and Change Data Types

Create a DataFrame with duplicate rows. Remove the duplicates and convert a column to a different data type.

Show Code

data = { 'Name': ['Anika', 'Rahul', 'Anika'], 'Age': ['25', '30', '25'], 'Salary': ['50000', '60000', '50000'] } df = pd.DataFrame(data) # Remove duplicate rows df_no_duplicates = df.drop_duplicates() # Convert Age and Salary to numeric types df_no_duplicates['Age'] = df_no_duplicates['Age'].astype(int) df_no_duplicates['Salary'] = df_no_duplicates['Salary'].astype(float) print(df_no_duplicates)

Pyground

Play with Python!

Output:

Last updated on