Back to Blog

Cleaning a CSV file means removing duplicate rows, fixing or removing invalid data — malformed emails, blank required fields — standardizing formatting, and resolving encoding issues. A dirty CSV fed into an email platform, CRM, or database produces errors, bounced messages, and corrupted records. The cleaner your list before import, the less damage you have to undo afterward.

What Dirty CSV Data Actually Looks Like

You know data is dirty when your import fails, your email campaign bounces at an unusual rate, or your database queries return unexpected results. The underlying causes are usually one of five things.

Duplicate rows. The same email address appears three times because the list was assembled by merging exports from three separate sources without deduplication. Each export was valid on its own. Combined, they produce a contact list that sends three copies of every campaign message to a subset of your audience.

Invalid email addresses. These come in several varieties: missing @ symbol, missing domain, missing TLD, extra spaces (user @example.com), obvious typos (user@gmial.com), and role addresses like info@company.com or support@company.com. Role addresses are structurally valid but represent shared inboxes, not individuals — they generate complaint rates that email service providers penalize heavily.

Inconsistent casing. John Smith in one row and john smith in another are different strings in most systems. For deduplication and matching, case-insensitive comparison is required. For name personalization in email templates, you need consistent casing.

Trailing and leading whitespace. Jane and Jane (with a trailing space) are different strings. Most systems won't match them. Databases won't de-dupe them. This is a common artifact of CSV exports from older enterprise software where fields are padded to a fixed width.

Encoding issues. If you're seeing ?, é, or garbled character sequences where accented letters should be, the file was saved in one encoding and opened in another. This is especially common with Windows-1252 files opened on macOS or UTF-8 files processed by Excel on Windows.

Step 1 — Remove Duplicates

Before removing duplicates, define what makes a row a duplicate. The most common definition is same email address, but depending on your data, you might define it as same name plus company, same phone number, or a combination of fields.

Decide before you start, because the choice affects your results. A list deduplicated by email alone will keep a row with a different name but the same address as a single contact. A list deduplicated by email plus phone number will keep two rows if one of them is missing a phone number — even if the email is identical.

Three normalization steps before deduplication:

  1. Trim whitespace from all fields. A trailing space is invisible in most spreadsheet views but creates a false non-match. Strip it.
  2. Lowercase email addresses. Email is case-insensitive at the protocol level. User@Example.com and user@example.com are the same mailbox. Lowercase both before comparison.
  3. Choose which duplicate to keep. The first occurrence is the default in most tools and is usually correct — the original entry predates the duplicate and is more likely to have complete data. If you're merging an updated export over an older list, keeping the last occurrence makes more sense because newer records should override older ones.

After deduplication, compare your row counts. A list that shrinks from 10,000 to 6,000 rows tells you that 40% of your data was duplicate — which points to a structural problem in how the list was assembled, worth fixing at the source. A list that shrinks by 2-3% is normal and expected for any well-maintained database.

Step 2 — Validate Email Addresses

A structurally valid email address has exactly one @ symbol, a local part before it, a domain after it with at least one dot, and a TLD of at least two characters. That definition catches most invalid formats: multiple @ symbols, spaces, disallowed special characters in the local part, domains without TLDs.

Specific invalid patterns to filter:

  • user@ — no domain
  • @example.com — no local part
  • user @example.com — space in the address
  • user@example — no TLD
  • user@@example.com — double @
  • user@.com — domain starts with a dot

Beyond format validation, domain existence checks add a second layer. An MX record lookup confirms that the domain has a mail server configured to accept messages. A typo like @gmial.com passes format validation but fails an MX record check because that domain doesn't have mail exchange records. Browser-based CSV tools can do this check client-side using DNS lookup APIs.

Role addresses deserve special handling. Addresses starting with info@, admin@, support@, sales@, noreply@, webmaster@, and similar prefixes are shared inboxes. They're structurally valid. They're real email addresses. But email marketing platforms track spam complaint rates per sending domain, and role addresses have complaint rates that are high enough to cause deliverability problems. Most email platforms include an option to filter them at import; cleaning them from the source CSV is the more reliable approach.

After filtering invalids and role addresses, record what you removed and why. A 5% removal rate is typical for a list sourced from a web form with minimal validation. A 20% removal rate suggests the list was acquired from a source with no validation at all.

Step 3 — Standardize Formatting

Standardization isn't about fixing errors — it's about ensuring that identical values are represented identically. This matters for deduplication, search, and display.

Lowercase email addresses. Do this after deduplication or do it as part of deduplication — just be consistent. Email comparison is case-insensitive by specification, but many database systems perform case-sensitive string comparisons by default. If your database stores User@Example.com and you later search for user@example.com, you'll get no result unless your column uses a case-insensitive collation or you normalize the values.

Trim whitespace from all fields. This applies to names, companies, phone numbers, and every other field, not just email. Run it as the first step on every import.

Consistent name casing. Title Case (first letter of each word capitalized) is the standard for display in email templates — John Smith is correct, john smith and JOHN SMITH are not. If you're storing names for matching purposes only and not displaying them in personalized messages, lowercase everything. Choose one approach and apply it consistently.

Phone numbers. If your CSV includes phone numbers, standardize the format before import. (555) 867-5309, 555-867-5309, 5558675309, and +15558675309 are the same number in four different formats. Your CRM will store them as four different values and won't recognize them as duplicates. Choose a target format (E.164 international format +15558675309 is the most unambiguous) and normalize to it.

Step 4 — Resolve Encoding Issues

UTF-8 is the standard encoding for text on the web and in modern databases. If your CSV contains characters from languages other than English, or any accented characters (é, ü, ñ, ç), encoding must be handled explicitly.

The symptom of an encoding mismatch is garbled text: é where é should appear, ? where a special character should appear, or a byte order mark (BOM) character at the beginning of the file that appears as garbage characters in the first field of the first row.

The cause is almost always one of two scenarios:

Scenario A: UTF-8 file opened in Excel on Windows. Excel opens CSV files using the system locale encoding (Windows-1252 on most Windows installations). A UTF-8 file opened this way will display accented characters and non-ASCII characters incorrectly. If you then save the file from Excel, it will be saved in Windows-1252, and you've now converted the encoding with data loss on any characters that don't exist in Windows-1252.

The fix: open the file in a text editor (VS Code, Notepad++, Sublime Text) that handles encoding explicitly, verify the encoding shown in the status bar, and save as UTF-8 without BOM before processing.

Scenario B: Legacy database export in Windows-1252. Older enterprise software and some CRM exports produce Windows-1252 files. These are fine within the Windows ecosystem but will appear corrupted when imported into a system expecting UTF-8. The fix: open the file specifying Windows-1252 encoding, then re-save as UTF-8.

Most CSV cleaning tools, including the browser-based CSV Editor, handle encoding detection automatically and allow you to specify the encoding on load if detection fails.

Why This Matters Before an Email Campaign

Email service providers — Mailchimp, Klaviyo, Postmark, SendGrid — measure your sending reputation based on three metrics: bounce rate, complaint rate, and unsubscribe rate. Each provider publishes thresholds, and exceeding them triggers account warnings or suspension.

Bounce rate thresholds sit at 2-5% for most providers. A list with 15% invalid addresses will hit that limit on your first campaign. A single campaign to a dirty list can damage a sending reputation that took months to build.

The sequence of events: you import a 10,000-row list without cleaning it, 1,500 addresses are invalid, 1,200 bounce on the first send, your bounce rate registers at 12%, your account is flagged for review, and your second campaign is delayed pending a manual review of your sending practices.

Cleaning before every import is not optional maintenance — it's the step that determines whether your messages get delivered.

Using the CSV Editor for Cleanup

The CSV Editor handles all four cleaning steps in a single browser session. Upload your file, run the deduplication pass with your chosen key columns, validate and filter email addresses, trim whitespace across all fields, and export the cleaned file.

For specific use cases, the mailing list cleanup and CRM data cleanup workflows provide a focused interface for the steps most relevant to each type of data.

Your file is processed in the browser. Your contact data — email addresses, names, phone numbers — never leaves your device. No server receives the data, and nothing is logged.

The edge case worth knowing: if your CSV contains multi-line values (a field that contains a line break, which is valid in the CSV format when the field is quoted), some tools will misparse the file and report extra rows or malformed columns. The CSV Editor handles RFC 4180-compliant multi-line fields correctly. If you're seeing unexpected row counts after import, check whether your source data contains multi-line fields.