Data Cleaning Checklist: 10 Things to Check Before Your Next Analysis
You've got a dataset. You've got a deadline. You've got a boss who wants insights by Thursday.
The temptation is to skip straight to the analysis. Don't.
Dirty data doesn't announce itself. It hides in plain sight until your quarterly report shows revenue doubled (it didn't) or your email campaign goes out to 4,000 contacts who are actually the same 900 people entered multiple ways. I've seen both happen. The revenue one was worse.
Here's what to check before you trust any dataset enough to make decisions from it.

1. Hunt for duplicates—and not just the obvious ones
The easy duplicates are exact matches. Same name, same email, same everything. Excel's "Remove Duplicates" handles those fine.
The problem is the duplicates that aren't exact. "Jon Smith" and "Jonathan Smith" are probably the same person. "Acme Corp" and "Acme Corporation" and "ACME Corp." are definitely the same company. Your database doesn't know that. You do.
Check name columns for slight variations. Look for entries that share an email or phone number but differ elsewhere. If you've got customer data from multiple sources, assume there's overlap until you prove otherwise.
This one takes time. There's no shortcut except automation.
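If you work in Python, even the standard library gets you a first pass. Here's a minimal sketch using difflib to surface near-matches for human review; the names and the 0.7 threshold are placeholders, and a real pipeline would lean on a dedicated fuzzy-matching library:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data -- substitute your own name column.
names = ["Jon Smith", "Jonathan Smith", "Acme Corp",
         "Acme Corporation", "ACME Corp."]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pairwise comparison is O(n^2) -- fine for a review list, not for millions
# of rows. The threshold is arbitrary; tune it against your own data.
THRESHOLD = 0.7
for a, b in combinations(names, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"Possible duplicate ({score:.2f}): {a!r} vs {b!r}")
```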
2. Validate email formats
Not every string with an @ symbol is a valid email. Look for:
- Missing domains (john@.com)
- Typos in common domains (gmail.con, yahooo.com)
- Spaces where they shouldn't be
- Multiple @ symbols
- Obviously fake entries (test@test.com, asdf@asdf.com)
A quick regex check catches format issues. Actually verifying deliverability is harder—but at minimum, filter out the stuff that's obviously broken before you send anything.
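Here's roughly what that check looks like in Python. The pattern is deliberately loose, and the typo and fake-address lists are tiny starting points, not a complete validator:

```python
import re

# One @, a non-empty local part, and a domain containing at least one dot.
# This catches format errors only -- it says nothing about deliverability.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

BAD_DOMAINS = {"gmail.con", "yahooo.com"}            # extend from your data
FAKE_ADDRESSES = {"test@test.com", "asdf@asdf.com"}

def looks_valid(email: str) -> bool:
    email = email.strip().lower()
    if email in FAKE_ADDRESSES or not EMAIL_RE.match(email):
        return False
    return email.split("@")[1] not in BAD_DOMAINS

for e in ["john@.com", "jane doe@gmail.com", "a@b@c.com", "ok@example.com"]:
    print(e, "->", looks_valid(e))
```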
3. Standardize phone numbers
Phone numbers are chaos. I pulled a dataset last year with numbers formatted as:
- (555) 123-4567
- 555-123-4567
- 5551234567
- +1 555 123 4567
- 555.123.4567
All the same number. Five different formats. This breaks sorting, deduplication, and any automated dialing system you might use downstream.
Pick a format. Apply it everywhere. If you've got international numbers, include country codes consistently.
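A sketch of the strip-to-digits approach in Python, assuming US-style ten-digit numbers; for genuinely international data, a purpose-built phone library is the safer choice:

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Collapse a US-style number to +1XXXXXXXXXX. A sketch only."""
    digits = re.sub(r"\D", "", raw)      # drop everything but digits
    if len(digits) == 10:                # assume a missing US country code
        digits = default_country + digits
    return "+" + digits

samples = ["(555) 123-4567", "555-123-4567", "5551234567",
           "+1 555 123 4567", "555.123.4567"]
print({normalize_phone(s) for s in samples})  # one value: {'+15551234567'}
```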
4. Look for impossible values
These are the showstoppers. Values that can't exist but somehow do:
- Negative ages
- Birth dates in the future
- Order totals below zero (unless you handle returns that way intentionally)
- Hire dates before the company existed
- Percentages over 100
- Dates like "02/30/2024" (February 30th isn't a thing)
Sort each numerical column low-to-high and high-to-low. Scan the extremes. The impossible stuff clusters at the edges.
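In pandas, each of these checks is a one-line boolean filter. The column names below are hypothetical, so adjust them to your schema:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 57],
    "order_total": [19.99, 120.00, -5.00],
    "birth_date": pd.to_datetime(["1990-05-01", "2031-01-15", "1967-11-30"]),
})

impossible = df[
    (df["age"] < 0)
    | (df["order_total"] < 0)            # drop this test if negatives mean returns
    | (df["birth_date"] > pd.Timestamp.today())
]
print(impossible)  # rows to investigate, not silently delete
```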
5. Handle missing values consistently
Empty cells happen. The question is what you do about them.
First, figure out why they're empty. Is the data actually unknown? Was it optional? Did an import fail? Did someone enter "N/A" as text instead of leaving it blank?
Then decide on a strategy:
- Leave blank (fine for truly unknown values)
- Fill with a default (works for categorical data sometimes)
- Fill with an average or median (risky—only if the pattern supports it)
- Flag and exclude from certain calculations
Whatever you pick, be consistent. A mix of blank cells, "N/A", "NULL", "-", and "unknown" in the same column will break things.
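In pandas, the usual first move is collapsing every spelling of "missing" into a real NaN before you pick a fill strategy. A sketch, with a made-up sentinel list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"region": ["West", "N/A", "", "NULL", "-", "unknown", "East"]})

# Normalize first; decide how to fill (or not) afterwards.
SENTINELS = ["N/A", "NULL", "-", "unknown", ""]
df["region"] = df["region"].replace(SENTINELS, np.nan)
print(df["region"].isna().sum(), "missing values after normalizing")
```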
6. Check for inconsistent capitalization and spacing
This one seems minor until it isn't. "NEW YORK" and "New York" and "new york" are the same city. Your pivot table disagrees.
Common culprits:
- ALL CAPS entries mixed with normal case
- Leading or trailing spaces (invisible but deadly for matching)
- Double spaces between words
- Inconsistent treatment of abbreviations (St. vs Street vs ST)
Trim whitespace first. Then standardize case. For proper nouns like names and cities, title case usually works. For codes or IDs, pick uppercase or lowercase and stick with it.
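In pandas that cleanup is a short chain of string operations, sketched here on a hypothetical city column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["NEW YORK ", " new york", "New  York"]})

# Trim, collapse runs of internal whitespace, then standardize case.
df["city"] = (
    df["city"]
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)
print(df["city"].nunique())  # 1 -- all three entries now match
```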
7. Standardize date formats
Excel thinks it's helping when it auto-formats dates. It isn't.
"01/02/2024" means January 2nd in the US and February 1st in most of Europe. If your data comes from multiple sources or was touched by people with different regional settings, you might have both interpretations in the same column without knowing it.
Check that dates parse correctly. Convert everything to an unambiguous format (YYYY-MM-DD is safest for data work). Watch for text-formatted dates that look right but won't sort or calculate properly.
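If you know the source's convention, say so explicitly instead of letting the parser guess. A pandas sketch, assuming a US-formatted source:

```python
import pandas as pd

# An explicit format means nothing is guessed: this reads month/day/year.
# Use format="%d/%m/%Y" for day-first sources.
dates = pd.to_datetime(["01/02/2024", "12/31/2023"], format="%m/%d/%Y")

# Store and export in the unambiguous ISO form.
print(dates.strftime("%Y-%m-%d").tolist())  # ['2024-01-02', '2023-12-31']
```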
8. Look for outliers that might skew your analysis
Not every outlier is an error. Sometimes someone really did place a $50,000 order when the average is $200. But you need to know about those values before they skew your averages and make your charts unreadable.
Calculate basic stats for each numerical column: min, max, mean, median. If the mean and median are wildly different, outliers are dragging the mean. Decide whether to include them, exclude them, or analyze them separately.
A single extreme value can move an average more than a thousand normal ones. Make sure that's what you want.
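One common rule of thumb for flagging them is the 1.5x-interquartile-range test, sketched below in pandas. The numbers are made up, and the 1.5 multiplier is a convention, not a law:

```python
import pandas as pd

orders = pd.Series([180, 195, 205, 210, 220, 50_000])  # one whale order

print(f"mean={orders.mean():.0f}  median={orders.median():.0f}")  # far apart

# Flag anything beyond 1.5x the interquartile range.
q1, q3 = orders.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = orders[(orders < q1 - 1.5 * iqr) | (orders > q3 + 1.5 * iqr)]
print(outliers)  # review these before including or excluding them
```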
9. Verify categorical values are standardized
If you've got a "Country" column, how many ways have people entered the United States?
- USA
- US
- United States
- United States of America
- U.S.A.
- America
I've seen all of these in a single dataset. Your analysis will treat them as six different countries unless you fix it.
Pull a list of unique values for every categorical column. Look for variations that should be consolidated. Status fields are especially bad for this—"Active" vs "active" vs "ACTIVE" vs "A" all meaning the same thing.
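The pandas pattern: list the unique values, then map every variant onto one canonical label. The mapping here is hypothetical; build yours from the unique-value list:

```python
import pandas as pd

df = pd.DataFrame({"country": ["USA", "US", "United States",
                               "U.S.A.", "America"]})

print(df["country"].value_counts())  # step 1: see what you're dealing with

# Step 2: consolidate. Unmapped values fall through unchanged.
CANONICAL = {v: "United States" for v in
             ["USA", "US", "United States", "United States of America",
              "U.S.A.", "America"]}
df["country"] = df["country"].str.strip().map(CANONICAL).fillna(df["country"])
print(df["country"].unique())  # ['United States']
```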
10. Cross-check related fields for logical consistency
Some errors only show up when you compare columns:
- End date before start date
- Age that doesn't match birth date
- State/province that doesn't match the country
- Job title "CEO" with department "Intern Pool"
- Ship date before order date
Build a few sanity-check comparisons. They don't need to be complicated—just "does column A make sense given column B?" A surprising number of records will fail these basic tests.
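Each rule is one boolean comparison. A pandas sketch of the ship-date check, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-10"]),
    "ship_date":  pd.to_datetime(["2024-03-03", "2024-03-08"]),
})

# "Does column A make sense given column B?"
bad_sequence = df[df["ship_date"] < df["order_date"]]
print(bad_sequence)  # the second order shipped before it was placed
```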
The honest truth
This checklist works. It'll catch most of the problems that cause embarrassment, bad decisions, or broken reports.
It's also tedious. Steps 1, 3, 6, and 9 alone can eat an entire afternoon on a dataset of any real size. Multiply that by every dataset you touch, and data cleaning becomes a significant chunk of your workweek.
That's the tradeoff. Do the work manually and trust the results, or skip it and hope nothing's broken.
Or automate it. But that's a different post.
William Flaiz is a digital transformation executive and former Novartis Executive Director who has led consolidation initiatives saving enterprises over $200M in operational costs. He holds MIT's Applied Generative AI certification and specializes in helping pharmaceutical and healthcare companies align MarTech with customer-centric objectives. Connect with him on LinkedIn or at williamflaiz.com.