Pandas drop_duplicates() Method


Pandas, the powerful data manipulation library in Python, provides a variety of methods to clean and manipulate data efficiently. One such method, drop_duplicates(), allows us to eliminate duplicate rows from a DataFrame.

We will learn about the drop_duplicates() method in detail with all its parameters and examples.

    Table of Contents

  1. drop_duplicates() Method
    1. Syntax
    2. Return Value
  2. Examples
    1. Removing complete duplicate rows
    2. Removing partial duplicate rows
    3. Drop Duplicate Rows based on Multiple Columns
    4. Drop Duplicate Columns
  3. Conclusion

1. drop_duplicates() Method

The drop_duplicates() method is used to remove duplicate rows from a DataFrame. It takes a few parameters to customize the behavior of the method.

It operates based on the values in one or more columns, providing flexibility in identifying and eliminating duplicates.

1.1 Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Here is the description of the parameters:

subset

It specifies the column or list of columns to consider for identifying duplicate rows. If no column is specified, all the columns are considered.

keep

It specifies which occurrence of the duplicate row should be kept. It can take the following values:

  • first (Default): It keeps the first occurrence of the duplicate row.
  • last: It keeps the last occurrence of the duplicate row.
  • False: It drops all the duplicate rows.

inplace

It specifies whether the changes should be made in the original DataFrame or a new DataFrame should be returned. It can take the following values:

  • True: It makes changes in the original DataFrame.
  • False (Default): It returns a new DataFrame with the changes.

ignore_index

It specifies whether the index of the DataFrame should be reset after dropping the duplicate rows. It can take the following values:

  • True: It resets the index of the DataFrame.
  • False (Default): It does not reset the index of the DataFrame.
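
To get a feel for the keep and ignore_index parameters, here is a minimal sketch (using a small made-up DataFrame, separate from the examples below):

import pandas as pd

# A small DataFrame where the first and last rows are identical
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'City': ['NY', 'LA', 'NY']
})

# keep=False drops every row that has a duplicate, so no occurrence survives
print(df.drop_duplicates(keep=False))
#   Name City
# 1  Bob   LA

# ignore_index=True resets the index of the result to 0, 1, ..., n-1
print(df.drop_duplicates(ignore_index=True))
#     Name City
# 0  Alice   NY
# 1    Bob   LA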

1.2 Return Value

It returns a DataFrame with the duplicate rows dropped if inplace=False (default). If inplace=True, it modifies the original DataFrame and returns None.
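
The following minimal sketch (with a hypothetical two-column DataFrame) shows the difference between the two return behaviors:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'x', 'y']})

# Default (inplace=False): a new DataFrame is returned, the original is untouched
result = df.drop_duplicates()
print(result is df)      # False
print(len(df))           # 3 -> the original still contains the duplicate row

# inplace=True: the original DataFrame is modified and None is returned
returned = df.drop_duplicates(inplace=True)
print(returned is None)  # True
print(len(df))           # 2 -> the duplicate row was removed from df itself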


2. Examples

Going through examples will help us understand the method deeply.

Example 1: Removing complete duplicate rows

When drop_duplicates() is called without any arguments, only rows that are duplicated across all columns are removed. Here, the first and last rows are identical, so only the first occurrence is kept.

import pandas as pd

# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 25],
    'City': ['NY', 'LA', 'SF', 'NY']
})

# Drop complete duplicate rows (all columns are considered)
df_no_duplicates = df.drop_duplicates()

print("DataFrame after dropping duplicates:")
print(df_no_duplicates)

Output:

DataFrame after dropping duplicates:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF

Example 2: Removing partial duplicate rows

Suppose some rows in the DataFrame have duplicate values in only a few columns, while the remaining columns are unique. In this case, use the subset parameter and pass the column label (or labels) that you want to keep unique.

import pandas as pd

# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Age': [25, 30, 35, 44],  # Age values are not duplicated
    'City': ['NY', 'LA', 'SF', 'NY']
})

# Drop duplicate rows based on the 'Name' column only
df_unique_names = df.drop_duplicates(subset=['Name'])

print("DataFrame after dropping duplicates:")
print(df_unique_names)

Output:

DataFrame after dropping duplicates:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF

Example 3: Drop Duplicate Rows based on Multiple Columns

If we want a combination of columns to be unique throughout the DataFrame, we can pass a list of column names to the subset parameter.

This will remove all the rows where the combination of values in the specified columns is duplicated.

import pandas as pd

# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'A': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3'],
    'B': ['B1', 'B1', 'B1', 'B1', 'B2', 'B2'],
    'C': ['C1', 'C2', 'C2', 'C2', 'C3', 'C3'],
})

# Drop duplicate rows based on the 'A' and 'B' columns
df_multiple_columns = df.drop_duplicates(subset=['A', 'B'])
print("DataFrame after dropping duplicates based on 'A' and 'B' columns:")
print(df_multiple_columns)

# Drop duplicate rows based on the 'A' and 'C' columns, keeping the last occurrence
df_multiple_columns = df.drop_duplicates(subset=['A', 'C'], keep='last')
print("DataFrame after dropping duplicates based on 'A' and 'C' columns:")
print(df_multiple_columns)

Output:

DataFrame after dropping duplicates based on 'A' and 'B' columns:
    A   B   C
0  A1  B1  C1
2  A2  B1  C2
4  A3  B2  C3
DataFrame after dropping duplicates based on 'A' and 'C' columns:
    A   B   C
0  A1  B1  C1
1  A1  B1  C2
3  A2  B1  C2
5  A3  B2  C3

Example 4: Drop Duplicate Columns

Not only rows, but duplicate columns can also be removed using the drop_duplicates() method. For this, we can take the transpose of the DataFrame (rows become columns and columns become rows), remove the duplicate rows, and then transpose it back.

Learn how to drop duplicate columns in pandas.
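
Here is a rough sketch of the transpose approach (assuming a small DataFrame where columns 'A' and 'C' hold identical values):

import pandas as pd

# 'A' and 'C' are duplicate columns
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [1, 2, 3]
})

# Transpose so columns become rows, drop the duplicate rows, then transpose back
df_unique_cols = df.T.drop_duplicates().T

print(df_unique_cols)
#    A  B
# 0  1  4
# 1  2  5
# 2  3  6

For large DataFrames the double transpose can be costly; a common alternative is df.loc[:, ~df.T.duplicated()], which keeps only the columns whose transposed rows are not marked as duplicates.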


3. Conclusion

Armed with this knowledge, you can confidently tackle duplicate data in your datasets. Whether you're cleaning rows, columns, or need to retain the last occurrence, the drop_duplicates() method is a versatile tool in your data manipulation arsenal.