Back to Blog

To remove duplicate rows from a CSV, define which column or columns determine uniqueness, then scan for rows where those columns match. Most tools default to keeping the first occurrence and removing subsequent duplicates. The complexity comes from deciding what counts as "the same" — exact match, case-insensitive match, or trimmed-whitespace match. Get that definition wrong, and you either leave duplicates in or remove rows that should have stayed.

Why Deduplication Is Non-Trivial

The obvious approach — sort the file, compare adjacent rows, delete matches — breaks on the three problems that appear in almost every real-world dataset.

Case sensitivity. user@example.com and USER@EXAMPLE.COM are different strings. A case-sensitive deduplication pass treats them as two unique records. They represent the same email address. The result is that your "deduplicated" file still has both, and the duplicate remains in the system.

Whitespace. example.com with a trailing space is a different string than example.com without one. The trailing space is invisible when you view the data in a spreadsheet. It's invisible when the CSV is opened in a text editor unless you explicitly enable whitespace display. But the system processing your import sees it, and two rows that look identical on screen become two unique records in the database.

Column combination logic. A list deduplicated by email alone produces different results than the same list deduplicated by email plus first name. If a contact appears twice with the same email address but different first names — because two people share an email address, or because the same person used two different name variations when filling out a form — email-only deduplication removes one row, while email-plus-name deduplication keeps both. Neither outcome is automatically correct. You need to decide what your data model requires.

Normalization Before Deduplication

Run these three steps on every column you plan to deduplicate on, before the deduplication pass:

  1. Trim whitespace. Remove leading and trailing spaces from all values. Most tools have a "trim" or "strip" function; apply it across all columns, not just the key column.
  2. Lowercase string columns. For fields like email address where the comparison should be case-insensitive, lowercase the entire column before deduplication. This is a destructive change — you're modifying the data — but for deduplication purposes it's the correct approach.
  3. Normalize empty values. A field might contain an actual empty string, a string with only spaces, or the word NULL. Treat all of these as "empty" for deduplication purposes. A row with email = NULL and a row with email = "" both lack an email address — they may or may not be duplicates of each other depending on your data model.

If you skip normalization and run deduplication on raw values, you'll produce a "deduplicated" file that still contains duplicates distinguishable only by invisible whitespace or case variation.

How to Deduplicate in Different Tools

Excel: Select the data range, go to Data → Remove Duplicates, choose which columns to check. Excel performs case-insensitive comparison by default, which handles the casing problem. It does not automatically trim whitespace — Jane and Jane remain distinct. After removing duplicates, run a trim operation on all columns if your source data may contain padding.

Excel's Remove Duplicates always keeps the first occurrence. There's no built-in option to keep the last occurrence or the most complete record. If you need that behavior, sort the data first to put the records you want to keep at the top, then run Remove Duplicates.

Google Sheets: There's no native Remove Duplicates for a full sheet in the base product. The closest built-in option is the UNIQUE() function, which outputs a de-duplicated range into a new location — but it operates on the entire row, not a specific column, and you can't configure which duplicate to keep. The Extensions menu has deduplication add-ons if you need column-level control. For scripted deduplication, Apps Script gives you full control via SpreadsheetApp.

Python with pandas: The most configurable approach.

import pandas as pd

df = pd.read_csv('contacts.csv')

# Normalize before deduplication
df['email'] = df['email'].str.lower().str.strip()

# Deduplicate by email, keep first occurrence
df_clean = df.drop_duplicates(subset=['email'], keep='first')

df_clean.to_csv('contacts_clean.csv', index=False)

To keep the last occurrence instead: keep='last'. To mark duplicates without removing them: keep=False marks all duplicates (including the first occurrence) so you can review them before deciding what to remove.

Browser-based CSV Editor: Upload the file, select the key column or columns, specify case sensitivity, and run the deduplication pass. The CSV Editor normalizes whitespace automatically and shows you how many rows were removed before you export. Your file is processed locally — nothing is uploaded to a server.

Which Duplicate to Keep

Three strategies, each correct for different situations:

First occurrence. The default. The original entry predates the duplicate and typically has the most reliable data — it was entered first, before any re-entries or list merges introduced the duplicate. Use this when your list was built incrementally over time and earlier entries are your source of truth.

Last occurrence. The correct choice when you're merging an updated export over an older list. If your CRM exported a customer list in January and again in March, and a customer updated their phone number in February, the March export contains the correct data. Keeping the last occurrence ensures the updated record wins.

Most complete record. Keep the row with the fewest empty fields. This is the right choice when deduplication is happening during a data migration where multiple incomplete records were created for the same entity and you want to salvage as much information as possible. This strategy requires custom logic — it's not built into Excel or basic CSV tools. In pandas, you'd score each row by completeness and sort before calling drop_duplicates.

Deduplication Across Multiple Columns

Deduplicating by a single column is the common case. Multi-column deduplication is less common but necessary when no single column uniquely identifies a record.

Example: a contacts table where email is optional. Some records have an email; some don't. Deduplicating by email alone would collapse all email-less records into one, treating them all as "the same" because they share the same empty value in the email column. The correct approach is a composite key: email plus name, or email plus phone, where you only mark two rows as duplicates if all key fields match.

In pandas: df.drop_duplicates(subset=['email', 'first_name', 'last_name'], keep='first'). In Excel: check multiple columns in the Remove Duplicates dialog.

Document Your Row Counts

Before running deduplication, record the starting row count. After deduplication, record the ending count and the number of rows removed.

A list that shrinks from 10,000 to 9,800 rows (2% removed) is normal for a well-maintained database built from a single source. A list that shrinks from 10,000 to 6,000 rows (40% removed) indicates a structural problem in how the list was assembled — likely multiple exports merged without deduplication at the source, or a subscription confirmation system that wasn't preventing duplicate signups. That's worth fixing upstream, not just cleaning after the fact.

Record the row counts in a comment or log. If someone later asks why the list is smaller than the source export, you have the answer.

When Duplicate Rows Are Intentional

Not all duplicate values in a key column are errors. Before deduplicating, confirm that your data model expects unique rows for your chosen key.

A transaction log where the same customer ID appears 20 times — once per purchase — is correct. Each row represents a distinct event. Deduplicating by customer ID would destroy most of your data. The correct key for a transaction log is transaction ID, not customer ID.

An event attendance list where the same person registered for three separate events should have three rows if you're storing one row per registration. Deduplication by email would reduce it to one.

The question to ask before deduplication: "Should the value in my key column appear more than once in this file, given what each row represents?" If the answer is yes for some cases, you need a filter — deduplicate only within a specific subset of rows, or add a condition that distinguishes legitimate multiple appearances from true duplicates.