Pandas Drop Duplicate Rows

When dealing with large datasets, you will often encounter duplicate rows, whether from data entry errors or from merging datasets from multiple sources. Identifying and removing these duplicates is crucial for maintaining data integrity and conducting accurate analyses.

In this tutorial, you will learn how to remove these duplicate rows from a DataFrame.

Table of Contents

Dropping Duplicate Rows 🗑️
Keeping the First Occurrence 📅
Keeping the Last Occurrence ⏰
Conclusion 🌟

1. Dropping Duplicate Rows 🗑️

To drop duplicate rows, we can use the drop_duplicates() method provided by Pandas. By default, this method treats two rows as duplicates only when their values match in every column, and it returns a new DataFrame with the repeated rows removed.

The following example shows how you can drop duplicate rows from a DataFrame.

import pandas as pd

# Creating a DataFrame with duplicate rows
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# 👇 Drop duplicate rows
df = df.drop_duplicates()

print("\nDataFrame after dropping duplicate rows:")
print(df)

Output:

Original DataFrame:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
3    Alice   25   NY

DataFrame after dropping duplicate rows:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF

As you can see, the second Alice row (index 3), which repeated the name, age, and city of row 0, has been removed.
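Before dropping anything, it can help to check which rows Pandas considers duplicates. The following is a minimal sketch, assuming the same sample data as above: the duplicated() method returns a boolean Series that marks each row repeating an earlier one, which are the rows drop_duplicates() would remove. The names mask and n_duplicates are only illustrative.

```python
import pandas as pd

# Same sample data as in the example above
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row in all columns
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]

# Number of rows that drop_duplicates() would remove
n_duplicates = int(mask.sum())
print(n_duplicates)  # 1

# Show only the rows flagged as duplicates
print(df[mask])
```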

There are a few parameters you can use to customize the behavior of this method. Learn more about them in the Pandas drop_duplicates() tutorial.
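As a brief illustration of one of those parameters, the sketch below uses subset to compare only selected columns. The data here is hypothetical and differs from the example above: the second Alice row has a different age, so it is not a full-row duplicate.

```python
import pandas as pd

# Hypothetical data: the two Alice rows differ in Age
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 41],
    'City': ['NY', 'LA', 'SF', 'NY']
})

# No row is identical across all columns, so nothing is dropped
all_columns = df.drop_duplicates()
print(len(all_columns))  # 4

# Compare only Name and City: the second Alice row now counts as a duplicate
by_name_city = df.drop_duplicates(subset=['Name', 'City'])
print(len(by_name_city))  # 3
print(by_name_city)
```

Restricting the comparison this way is useful when one or two columns identify a record and the remaining columns may legitimately differ.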

2. Keeping the First Occurrence 📅

By default, the drop_duplicates() method keeps the first occurrence of each duplicated row and removes the rest. You can also state this behavior explicitly with the keep='first' parameter.

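# Recreate the DataFrame from `data`, because `df` was already de-duplicated above
df = pd.DataFrame(data)

# 👇 Keep the first occurrence of duplicate rows
df = df.drop_duplicates(keep='first')

print("DataFrame keeping the first occurrence:")
print(df)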
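To confirm that keep='first' matches the default behavior, the two results can be compared directly. This is a small self-contained sketch using the sample data from section 1; the names default_result and first_result are illustrative.

```python
import pandas as pd

# Sample data from section 1, with one repeated row (Alice)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
}
df = pd.DataFrame(data)

default_result = df.drop_duplicates()
first_result = df.drop_duplicates(keep='first')

# Both calls return the same rows with the same index labels
print(first_result.equals(default_result))  # True

# The first Alice row (index 0) is kept; the repeat at index 3 is dropped
print(first_result.index.tolist())  # [0, 1, 2]
```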

3. Keeping the Last Occurrence ⏰

Similarly, you can use the keep='last' parameter to keep the last occurrence of each duplicated row and remove the earlier ones.

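# Recreate the DataFrame from `data`, because `df` no longer contains duplicates
df = pd.DataFrame(data)

# 👇 Keep the last occurrence of duplicate rows
df = df.drop_duplicates(keep='last')

print("DataFrame keeping the last occurrence:")
print(df)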
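The effect of keep='last' is easiest to see in the index labels of the result. The sketch below reuses the sample data from section 1 and also shows the ignore_index=True option, which renumbers the rows from 0; this assumes pandas 1.0 or later, where that parameter is available. The names last_result and renumbered are illustrative.

```python
import pandas as pd

# Sample data from section 1, with one repeated row (Alice)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
}
df = pd.DataFrame(data)

# The first Alice row (index 0) is dropped; the one at index 3 is kept
last_result = df.drop_duplicates(keep='last')
print(last_result.index.tolist())  # [1, 2, 3]
print(last_result)

# Optionally renumber the remaining rows starting from 0
renumbered = df.drop_duplicates(keep='last', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Because the original index labels are preserved by default, the surviving Alice row appears at the bottom of the result rather than the top.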

Conclusion 🌟

Handling duplicate rows in a Pandas DataFrame is essential for maintaining data quality and ensuring accurate analyses. Whether dropping duplicates, keeping the first occurrence, or keeping the last occurrence, Pandas provides flexible methods to suit your specific needs.

Apply these techniques to keep your DataFrames clean and efficient in your Python data analysis workflows. 🚀🐍