Pandas Drop Duplicate Columns


Pandas, the versatile data manipulation library in Python, provides efficient methods for handling duplicate columns in a DataFrame.

Duplicate columns in a DataFrame can arise from various scenarios, such as merging DataFrames or loading data from different sources 🤔. Identifying these duplicates is crucial for maintaining data integrity and avoiding redundancy.

In this tutorial, we will learn how to identify and drop duplicate columns, ensuring your DataFrame remains clean and optimized.
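
As a minimal sketch of how duplicates can appear and be detected, the example below assumes two small hypothetical frames (left and right, which are not part of the tutorial's data) placed side by side with pd.concat. Index.duplicated() flags repeated column names, and calling duplicated() on the transposed frame flags columns whose values repeat an earlier column.

```python
import pandas as pd

# Hypothetical frames that both carry an 'Age' column
left = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
right = pd.DataFrame({'City': ['NY', 'LA'], 'Age': [25, 30]})

# Concatenating side by side keeps both 'Age' columns
combined = pd.concat([left, right], axis=1)

# Boolean array: True marks a column name that has already appeared
name_mask = combined.columns.duplicated()
print(list(combined.columns[name_mask]))

# Boolean Series: True marks a column whose values repeat an earlier column
value_mask = combined.T.duplicated()
print(value_mask.tolist())
```

Here both checks flag the second Age column, because it repeats both the name and the values of the first.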

    Table of Contents

  1. Dropping Duplicate Columns 🗑️
    1. Using Transpose and drop_duplicates()
    2. Using loc
  2. Keeping the First Occurrence 📅
  3. Keeping the Last Occurrence ⏰
  4. Conclusion 🌟

1. Dropping Duplicate Columns 🗑️

There are multiple ways to remove duplicate columns from a DataFrame. Two common methods are covered below.

1.1 Using Transpose and drop_duplicates()

To drop duplicate columns, we can use the T (transpose) property of the DataFrame along with the drop_duplicates() method.

The T property swaps the rows and columns of the DataFrame, and the drop_duplicates() method drops duplicate rows from the DataFrame.

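The idea is to transpose the DataFrame so that the columns become rows, remove duplicate rows, and then transpose the DataFrame back to its original form. Because drop_duplicates() compares row values, this approach removes columns whose contents are identical, regardless of their names.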

Let's illustrate this with an example.

import pandas as pd

# Create the row data as a list of lists; the last value in each row repeats the age
data = [
    ['Alice', 25, 'NY', 25],  # Each list represents a row
    ['Bob', 30, 'LA', 30],
    ['Charlie', 35, 'SF', 35]
]

# Provide explicit column names, including duplicates
column_names = ['Name', 'Age', 'City', 'Age']  # Duplicate 'Age' column

# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
print("Original DataFrame:")
print(df)

# 👇 Drop the duplicate column
df = df.T.drop_duplicates().T
print("\nDataFrame with duplicate columns dropped:")
print(df)

Output:

Original DataFrame:
      Name  Age City  Age
0    Alice   25   NY   25
1      Bob   30   LA   30
2  Charlie   35   SF   35

DataFrame with duplicate columns dropped:
      Name Age City
0    Alice  25   NY
1      Bob  30   LA
2  Charlie  35   SF

This method ensures that only the first occurrence of each duplicated column is retained.
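
One side effect is worth knowing: transposing a DataFrame with mixed column types forces the values into the generic object dtype, so the round trip does not preserve the original dtypes. The sketch below reuses the data from the example above and tries infer_objects() as one possible way to recover them. Whether that conversion is appropriate depends on your data.

```python
import pandas as pd

data = [
    ['Alice', 25, 'NY', 25],
    ['Bob', 30, 'LA', 30],
    ['Charlie', 35, 'SF', 35],
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Age'])

# Round trip through the transpose to drop the duplicate column
deduped = df.T.drop_duplicates().T
print(deduped.dtypes)   # 'Age' is now object instead of an integer dtype

# infer_objects() asks pandas to re-detect a better dtype for object columns
restored = deduped.infer_objects()
print(restored.dtypes)  # 'Age' is an integer dtype again
```

The transpose also copies the whole frame twice, so this method can be slow on large DataFrames.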


1.2 Using loc

Another way to drop columns with duplicate names is to use the loc property of the DataFrame.

The loc property selects rows and columns by label or by boolean mask. df.columns.duplicated() returns a boolean array that is True for every column name that has already appeared, and the ~ operator inverts it. As a result, df.loc[:, ~df.columns.duplicated()] selects every column except the repeated ones.

import pandas as pd

# Create the row data as a list of lists; the last value in each row repeats the age
data = [
    ['Alice', 25, 'NY', 25],  # Each list represents a row
    ['Bob', 30, 'LA', 30],
    ['Charlie', 35, 'SF', 35]
]

# Provide explicit column names, including duplicates
column_names = ['Name', 'Age', 'City', 'Age']  # Duplicate 'Age' column

# Create the DataFrame
df = pd.DataFrame(data, columns=column_names)
print("Original DataFrame:")
print(df)

# 👇 Drop the duplicate column
df = df.loc[:, ~df.columns.duplicated()]
print("\nDataFrame with duplicate columns dropped:")
print(df)

Output:

Original DataFrame:
      Name  Age City  Age
0    Alice   25   NY   25
1      Bob   30   LA   30
2  Charlie   35   SF   35

DataFrame with duplicate columns dropped:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2  Charlie   35   SF
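
The two approaches do not test the same thing: the transpose method compares column values, while columns.duplicated() compares column names. The sketch below uses a hypothetical frame (the Years column is made up for illustration) in which two differently named columns hold identical values.

```python
import pandas as pd

# 'Years' is a hypothetical column that repeats the values of 'Age'
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Years': [25, 30, 35],
})

# Value-based: 'Years' is dropped because its contents match 'Age'
by_values = df.T.drop_duplicates().T
print(list(by_values.columns))

# Name-based: nothing is dropped because every column name is unique
by_names = df.loc[:, ~df.columns.duplicated()]
print(list(by_names.columns))
```

Conversely, two columns that share a name but hold different values are both kept by the transpose method and reduced to one by the loc method, so the right choice depends on which kind of duplicate you want to remove.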

2. Keeping the First Occurrence 📅

By default, both of the above methods keep the first occurrence of each duplicate column. Passing keep='first' explicitly does not change the result, but it makes the intent visible in the code.

# 👇 Transpose method: drop duplicate columns, keeping the first occurrence
df = df.T.drop_duplicates(keep='first').T

# 👇 loc method: drop duplicate columns, keeping the first occurrence
df = df.loc[:, ~df.columns.duplicated(keep='first')]

3. Keeping the Last Occurrence ⏰

Similarly, we can specify the keep='last' argument to keep the last occurrence of each duplicate column.

# 👇 Transpose method: drop duplicate columns, keeping the last occurrence
df = df.T.drop_duplicates(keep='last').T

# 👇 loc method: drop duplicate columns, keeping the last occurrence
df = df.loc[:, ~df.columns.duplicated(keep='last')]
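
To see what keep changes in practice, the sketch below prints the mask from columns.duplicated() for each setting, using the same column names as the earlier examples. It also tries keep=False, a third value the method accepts that is not covered above: it flags every occurrence, so no copy of a duplicated column survives.

```python
import pandas as pd

data = [
    ['Alice', 25, 'NY', 25],
    ['Bob', 30, 'LA', 30],
    ['Charlie', 35, 'SF', 35],
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City', 'Age'])

# Map each keep setting to the column names that survive
results = {}
for keep in ('first', 'last', False):
    mask = df.columns.duplicated(keep=keep)
    results[keep] = list(df.loc[:, ~mask].columns)
    print(keep, mask.tolist(), results[keep])
```

Note that keep='last' also changes the column order, since the surviving Age column is the one in the final position.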

4. Conclusion 🌟

Handling duplicate columns in a Pandas DataFrame is essential for maintaining data quality and avoiding redundant data.

Whether dropping duplicates, keeping the first occurrence, or keeping the last occurrence, Pandas provides flexible methods to suit your specific needs.