Data Cleaning Checklist: 10 Things to Check Before Your Next Analysis
You've got a dataset. You've got a deadline. You've got a boss who wants insights by Thursday.
The temptation is to skip straight to the analysis. Don't.
Dirty data doesn't announce itself. It hides in plain sight until your quarterly report shows revenue doubled (it didn't) or your email campaign goes out to 4,000 contacts who are actually the same 900 people entered multiple ways. I've seen both happen. The revenue one was worse.
Here's what to check before you trust any dataset enough to make decisions from it.

1. Hunt for duplicates—and not just the obvious ones
The easy duplicates are exact matches. Same name, same email, same everything. Excel's "Remove Duplicates" handles those fine.
The problem is the duplicates that aren't exact. "Jon Smith" and "Jonathan Smith" are probably the same person. "Acme Corp" and "Acme Corporation" and "ACME Corp." are definitely the same company. Your database doesn't know that. You do.
Check name columns for slight variations. Look for entries that share an email or phone number but differ elsewhere. If you've got customer data from multiple sources, assume there's overlap until you prove otherwise.
This one takes time. There's no shortcut except automation.
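If you work in Python, even the standard library gets you a first pass. Here's a minimal sketch using difflib to surface near-matches for human review; the names and the 0.7 threshold are placeholders, and a real pipeline would lean on a dedicated fuzzy-matching library:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data -- substitute your own name column.
names = ["Jon Smith", "Jonathan Smith", "Acme Corp",
         "Acme Corporation", "ACME Corp."]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairwise comparison is O(n^2) -- fine for a review list, not for millions
# of rows. The threshold is arbitrary; tune it against your own data.
THRESHOLD = 0.7
for a, b in combinations(names, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"Possible duplicate ({score:.2f}): {a!r} vs {b!r}")
```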
2. Validate email formats
Not every string with an @ symbol is a valid email. Look for:
- Missing domains (john@.com)
- Typos in common domains (gmail.con, yahooo.com)
- Spaces where they shouldn't be
- Multiple @ symbols
- Obviously fake entries (test@test.com, asdf@asdf.com)
A quick regex check catches format issues. Actually verifying deliverability is harder—but at minimum, filter out the stuff that's obviously broken before you send anything.
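Here's roughly what that check looks like in Python. The pattern is deliberately loose, and the typo and fake-address lists are tiny starting points, not a complete validator:

```python
import re

# One @, a non-empty local part, and a domain containing at least one dot.
# This catches format errors only -- it says nothing about deliverability.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

BAD_DOMAINS = {"gmail.con", "yahooo.com"}            # extend from your data
FAKE_ADDRESSES = {"test@test.com", "asdf@asdf.com"}

def looks_valid(email: str) -> bool:
    email = email.strip().lower()
    if email in FAKE_ADDRESSES or not EMAIL_RE.match(email):
        return False
    return email.split("@")[1] not in BAD_DOMAINS

for e in ["john@.com", "jane doe@gmail.com", "a@b@c.com", "ok@example.com"]:
    print(e, "->", looks_valid(e))
```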
3. Standardize phone numbers
Phone numbers are chaos. I pulled a dataset last year with numbers formatted as:
- (555) 123-4567
- 555-123-4567
- 5551234567
- +1 555 123 4567
- 555.123.4567
All the same number. Five different formats. This breaks sorting, deduplication, and any automated dialing system you might use downstream.
Pick a format. Apply it everywhere. If you've got international numbers, include country codes consistently.
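A sketch of the strip-to-digits approach in Python, assuming US-style ten-digit numbers; for genuinely international data, a purpose-built phone library is the safer choice:

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Collapse a US-style number to +1XXXXXXXXXX. A sketch only."""
    digits = re.sub(r"\D", "", raw)      # drop everything but digits
    if len(digits) == 10:                # assume a missing US country code
        digits = default_country + digits
    return "+" + digits

samples = ["(555) 123-4567", "555-123-4567", "5551234567",
           "+1 555 123 4567", "555.123.4567"]
print({normalize_phone(s) for s in samples})  # one value: {'+15551234567'}
```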
4. Look for impossible values
These are the showstoppers. Values that can't exist but somehow do:
- Negative ages
- Birth dates in the future
- Order totals below zero (unless you handle returns that way intentionally)
- Hire dates before the company existed
- Percentages over 100
- Dates like "02/30/2024" (February 30th isn't a thing)
Sort each numerical column low-to-high and high-to-low. Scan the extremes. The impossible stuff clusters at the edges.
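In pandas, each of these checks is a one-line boolean filter. The column names below are hypothetical, so adjust them to your schema:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 57],
    "order_total": [19.99, 120.00, -5.00],
    "birth_date": pd.to_datetime(["1990-05-01", "2031-01-15", "1967-11-30"]),
})

impossible = df[
    (df["age"] < 0)
    | (df["order_total"] < 0)            # drop this test if negatives mean returns
    | (df["birth_date"] > pd.Timestamp.today())
]
print(impossible)  # rows to investigate, not silently delete
```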
5. Handle missing values consistently
Empty cells happen. The question is what you do about them.
First, figure out why they're empty. Is the data actually unknown? Was it optional? Did an import fail? Did someone enter "N/A" as text instead of leaving it blank?
Then decide on a strategy:
- Leave blank (fine for truly unknown values)
- Fill with a default (works for categorical data sometimes)
- Fill with an average or median (risky—only if the pattern supports it)
- Flag and exclude from certain calculations
Whatever you pick, be consistent. A mix of blank cells, "N/A", "NULL", "-", and "unknown" in the same column will break things.
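In pandas, the usual first move is collapsing every spelling of "missing" into a real NaN before you pick a fill strategy. A sketch, with a made-up sentinel list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"region": ["West", "N/A", "", "NULL", "-", "unknown", "East"]})

# Normalize first; decide how to fill (or not) afterwards.
SENTINELS = ["N/A", "NULL", "-", "unknown", ""]
df["region"] = df["region"].replace(SENTINELS, np.nan)
print(df["region"].isna().sum(), "missing values after normalizing")
```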
6. Check for inconsistent capitalization and spacing
This one seems minor until it isn't. "NEW YORK" and "New York" and "new york" are the same city. Your pivot table disagrees.
Common culprits:
- ALL CAPS entries mixed with normal case
- Leading or trailing spaces (invisible but deadly for matching)
- Double spaces between words
- Inconsistent treatment of abbreviations (St. vs Street vs ST)
Trim whitespace first. Then standardize case. For proper nouns like names and cities, title case usually works. For codes or IDs, pick uppercase or lowercase and stick with it.
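In pandas that cleanup is a short chain of string operations, sketched here on a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NEW YORK ", " new york", "New  York"]})

# Trim, collapse runs of internal whitespace, then standardize case.
df["city"] = (
    df["city"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)
print(df["city"].nunique())  # 1 -- all three entries now match
```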
7. Standardize date formats
Excel thinks it's helping when it auto-formats dates. It isn't.
"01/02/2024" means January 2nd in the US and February 1st in most of Europe. If your data comes from multiple sources or was touched by people with different regional settings, you might have both interpretations in the same column without knowing it.
Check that dates parse correctly. Convert everything to an unambiguous format (YYYY-MM-DD is safest for data work). Watch for text-formatted dates that look right but won't sort or calculate properly.
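If you know the source's convention, say so explicitly instead of letting the parser guess. A pandas sketch, assuming a US-formatted source:

```python
import pandas as pd

# An explicit format means nothing is guessed: this reads month/day/year.
# Use format="%d/%m/%Y" for day-first sources.
dates = pd.to_datetime(["01/02/2024", "12/31/2023"], format="%m/%d/%Y")

# Store and export in the unambiguous ISO form.
print(dates.strftime("%Y-%m-%d").tolist())  # ['2024-01-02', '2023-12-31']
```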
8. Look for outliers that might skew your analysis
Not every outlier is an error. Sometimes someone really did place a $50,000 order when the average is $200. But you need to know about those values before they skew your averages and make your charts unreadable.
Calculate basic stats for each numerical column: min, max, mean, median. If the mean and median are wildly different, outliers are dragging the mean. Decide whether to include them, exclude them, or analyze them separately.
A single extreme value can move an average more than a thousand normal ones. Make sure that's what you want.
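One common rule of thumb for flagging them is the 1.5x-interquartile-range test, sketched below in pandas. The numbers are made up, and the 1.5 multiplier is a convention, not a law:

```python
import pandas as pd

orders = pd.Series([180, 195, 205, 210, 220, 50_000])  # one whale order

print(f"mean={orders.mean():.0f}  median={orders.median():.0f}")  # far apart

# Flag anything beyond 1.5x the interquartile range.
q1, q3 = orders.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders < q1 - 1.5 * iqr) | (orders > q3 + 1.5 * iqr)]
print(outliers)  # review these before including or excluding them
```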
9. Verify categorical values are standardized
If you've got a "Country" column, how many ways have people entered the United States?
- USA
- US
- United States
- United States of America
- U.S.A.
- America
I've seen all of these in a single dataset. Your analysis will treat them as six different countries unless you fix it.
Pull a list of unique values for every categorical column. Look for variations that should be consolidated. Status fields are especially bad for this—"Active" vs "active" vs "ACTIVE" vs "A" all meaning the same thing.
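The pandas pattern: list the unique values, then map every variant onto one canonical label. The mapping here is hypothetical; build yours from the unique-value list:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "US", "United States",
                               "U.S.A.", "America"]})

print(df["country"].value_counts())  # step 1: see what you're dealing with

# Step 2: consolidate. Unmapped values fall through unchanged.
CANONICAL = {v: "United States" for v in
             ["USA", "US", "United States", "United States of America",
              "U.S.A.", "America"]}
df["country"] = df["country"].str.strip().map(CANONICAL).fillna(df["country"])
print(df["country"].unique())  # ['United States']
```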
10. Cross-check related fields for logical consistency
Some errors only show up when you compare columns:
- End date before start date
- Age that doesn't match birth date
- State/province that doesn't match the country
- Job title "CEO" with department "Intern Pool"
- Ship date before order date
Build a few sanity-check comparisons. They don't need to be complicated—just "does column A make sense given column B?" A surprising number of records will fail these basic tests.
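Each rule is one boolean comparison. A pandas sketch of the ship-date check, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "ship_date":  pd.to_datetime(["2024-03-03", "2024-03-08"]),
})

# "Does column A make sense given column B?"
bad_sequence = df[df["ship_date"] < df["order_date"]]
print(bad_sequence)  # the second order shipped before it was placed
```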
The honest truth
This checklist works. It'll catch most of the problems that cause embarrassment, bad decisions, or broken reports.
It's also tedious. Steps 1, 3, 6, and 9 alone can eat an entire afternoon on a dataset of any real size. Multiply that by every dataset you touch, and data cleaning becomes a significant chunk of your workweek.
That's the tradeoff. Do the work manually and trust the results, or skip it and hope nothing's broken.
Or automate it. But that's a different post.
William Flaiz is a digital transformation executive and former Novartis Executive Director who has led consolidation initiatives saving enterprises over $200M in operational costs. He holds MIT's Applied Generative AI certification and specializes in helping pharmaceutical and healthcare companies align MarTech with customer-centric objectives. Connect with him on LinkedIn or at williamflaiz.com.