Data Cleaning Checklist: 10 Things to Check Before Your Next Analysis

William Flaiz • December 9, 2025

You've got a dataset. You've got a deadline. You've got a boss who wants insights by Thursday.

The temptation is to skip straight to the analysis. Don't.


Dirty data doesn't announce itself. It hides in plain sight until your quarterly report shows revenue doubled (it didn't) or your email campaign goes out to 4,000 contacts who are actually the same 900 people entered multiple ways. I've seen both happen. The revenue one was worse.


Here's what to check before you trust any dataset enough to make decisions from it.


1. Hunt for duplicates—and not just the obvious ones

The easy duplicates are exact matches. Same name, same email, same everything. Excel's "Remove Duplicates" handles those fine.


The problem is the duplicates that aren't exact. "Jon Smith" and "Jonathan Smith" are probably the same person. "Acme Corp" and "Acme Corporation" and "ACME Corp." are definitely the same company. Your database doesn't know that. You do.


Check name columns for slight variations. Look for entries that share an email or phone number but differ elsewhere. If you've got customer data from multiple sources, assume there's overlap until you prove otherwise.

This one takes time. There's no shortcut except automation.
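That automation can be as simple as a fuzzy-matching pass over name columns. Here's a minimal sketch using Python's standard-library difflib; the 0.85 threshold and the sample names are illustrative, not recommendations:

```python
from difflib import SequenceMatcher

def near_duplicates(names, threshold=0.85):
    """Flag pairs whose similarity ratio clears the threshold.
    0.85 is a starting point to tune against your own data."""
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            ratio = SequenceMatcher(None, names[i].lower(), names[j].lower()).ratio()
            if ratio >= threshold:
                flagged.append((names[i], names[j], round(ratio, 2)))
    return flagged

pairs = near_duplicates(["Acme Corp", "ACME Corp.", "Globex Inc"])
```

Pairwise comparison is O(n²), so on a big list, bucket records first (by shared email, phone, or name prefix) and only compare within buckets.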


2. Validate email formats

Not every string with an @ symbol is a valid email. Look for:


  • Missing domains (john@.com)
  • Typos in common domains (gmail.con, yahooo.com)
  • Spaces where they shouldn't be
  • Multiple @ symbols
  • Obviously fake entries (test@test.com, asdf@asdf.com)


A quick regex check catches format issues. Actually verifying deliverability is harder—but at minimum, filter out the stuff that's obviously broken before you send anything.
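Here's a sketch of that regex check. The pattern is deliberately coarse rather than full RFC 5322, and the typo/fake lists are illustrative placeholders to grow from your own data:

```python
import re

# One @, no spaces, and a dot somewhere in the domain. Coarse on purpose.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

TYPO_DOMAINS = {"gmail.con", "yahooo.com"}   # illustrative; extend from your data
FAKE_LOCALS = {"test", "asdf"}

def email_problems(addr):
    """Return a list of reasons to distrust this address (empty = looks OK)."""
    problems = []
    if not EMAIL_RE.match(addr):
        problems.append("bad format")
    else:
        local, domain = addr.rsplit("@", 1)
        if domain.lower() in TYPO_DOMAINS:
            problems.append("likely typo domain")
        if local.lower() in FAKE_LOCALS:
            problems.append("looks fake")
    return problems
```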


3. Standardize phone numbers

Phone numbers are chaos. I pulled a dataset last year with numbers formatted as:


  • (555) 123-4567
  • 555-123-4567
  • 5551234567
  • +1 555 123 4567
  • 555.123.4567


All the same number. Five different formats. This breaks sorting, deduplication, and any automated dialing system you might use downstream.


Pick a format. Apply it everywhere. If you've got international numbers, include country codes consistently.
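Picking a format and applying it everywhere fits in one small function. A sketch targeting E.164-style output, under the assumption that bare 10-digit numbers are North American:

```python
import re

def normalize_phone(raw, default_country="1"):
    """Normalize to +<country><number>. Assumes bare 10-digit numbers
    are North American; adjust for your market."""
    digits = re.sub(r"\D", "", raw)          # strip everything but digits
    if raw.strip().startswith("+"):
        return "+" + digits
    if len(digits) == 10:
        return "+" + default_country + digits
    if len(digits) == 11 and digits.startswith(default_country):
        return "+" + digits
    return None  # ambiguous; flag for manual review

formats = ["(555) 123-4567", "555-123-4567", "5551234567",
           "+1 555 123 4567", "555.123.4567"]
```

All five formats from the list above collapse to the same string, which is what makes deduplication and matching work again.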


4. Look for impossible values

These are the showstoppers. Values that can't exist but somehow do:


  • Negative ages
  • Birth dates in the future
  • Order totals below zero (unless you handle returns that way intentionally)
  • Hire dates before the company existed
  • Percentages over 100
  • Dates like "02/30/2024" (February 30th isn't a thing)


Sort each numerical column low-to-high and high-to-low. Scan the extremes. The impossible stuff clusters at the edges.
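The rules above translate directly into range checks you can run over every record. A sketch where the field names, reference date, and founding date are all illustrative:

```python
from datetime import date

def impossible_value_checks(record, today=date(2025, 12, 9),
                            founded=date(2001, 1, 1)):
    """Return the rules a record violates. Field names, the reference
    date, and the founding date are illustrative placeholders."""
    failures = []
    if record.get("age") is not None and record["age"] < 0:
        failures.append("negative age")
    if record.get("birth_date") and record["birth_date"] > today:
        failures.append("birth date in future")
    if record.get("order_total") is not None and record["order_total"] < 0:
        failures.append("negative order total")
    if record.get("hire_date") and record["hire_date"] < founded:
        failures.append("hire date before company existed")
    if record.get("discount_pct") is not None and not 0 <= record["discount_pct"] <= 100:
        failures.append("percentage out of range")
    return failures
```

Parsing dates into real date objects catches impossible calendar dates for free: `date(2024, 2, 30)` raises a ValueError instead of slipping through as text.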


5. Handle missing values consistently

Empty cells happen. The question is what you do about them.


First, figure out why they're empty. Is the data actually unknown? Was it optional? Did an import fail? Did someone enter "N/A" as text instead of leaving it blank?


Then decide on a strategy:


  • Leave blank (fine for truly unknown values)
  • Fill with a default (works for categorical data sometimes)
  • Fill with an average or median (risky—only if the pattern supports it)
  • Flag and exclude from certain calculations


Whatever you pick, be consistent. A mix of blank cells, "N/A", "NULL", "-", and "unknown" in the same column will break things.
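Collapsing those spellings into one sentinel is a one-liner per cell. A sketch; the token list is an assumption, so build yours from the unique values you actually see in each column:

```python
# Tokens that mean "missing" in this (assumed) dataset.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "-", "unknown"}

def normalize_missing(value):
    """Collapse the many spellings of 'missing' into a single None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return None
    return value

column = ["42", "N/A", "  ", "NULL", "-", "17", "unknown"]
cleaned = [normalize_missing(v) for v in column]
```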


6. Check for inconsistent capitalization and spacing

This one seems minor until it isn't. "NEW YORK" and "New York" and "new york" are the same city. Your pivot table disagrees.


Common culprits:


  • ALL CAPS entries mixed with normal case
  • Leading or trailing spaces (invisible but deadly for matching)
  • Double spaces between words
  • Inconsistent treatment of abbreviations (St. vs Street vs ST)


Trim whitespace first. Then standardize case. For proper nouns like names and cities, title case usually works. For codes or IDs, pick uppercase or lowercase and stick with it.
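That trim-then-standardize order can be one small helper:

```python
import re

def clean_text(value, case="title"):
    """Trim, collapse runs of whitespace, then standardize case."""
    value = re.sub(r"\s+", " ", value.strip())
    if case == "title":
        return value.title()
    return value.upper() if case == "upper" else value.lower()

cities = ["NEW YORK", "  new york ", "New  York"]
```

One caveat: `str.title()` is naive about names like McDonald (it produces "Mcdonald"), so keep an exceptions list for proper nouns that break the pattern.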


7. Standardize date formats

Excel thinks it's helping when it auto-formats dates. It isn't.


"01/02/2024" means January 2nd in the US and February 1st in most of Europe. If your data comes from multiple sources or was touched by people with different regional settings, you might have both interpretations in the same column without knowing it.


Check that dates parse correctly. Convert everything to an unambiguous format (YYYY-MM-DD is safest for data work). Watch for text-formatted dates that look right but won't sort or calculate properly.
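A sketch of that conversion: try a short list of known formats and emit ISO dates. The key assumption is that `dayfirst` comes from knowing the source's locale, because you cannot infer it from a value like 01/02/2024:

```python
from datetime import datetime

def parse_date(raw, dayfirst=False):
    """Parse a short list of known formats into ISO YYYY-MM-DD.
    `dayfirst` must come from knowledge of the source, not the data."""
    ambiguous = "%d/%m/%Y" if dayfirst else "%m/%d/%Y"
    for fmt in ("%Y-%m-%d", ambiguous, "%d-%b-%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unparseable, including impossible dates like 02/30/2024
```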


8. Look for outliers that might skew your analysis

Not every outlier is an error. Sometimes someone really did place a $50,000 order when the average is $200. But you need to know about those values before they skew your averages and make your charts unreadable.


Calculate basic stats for numerical columns: min, max, mean, median. If the mean and median are wildly different, you've got outliers pulling things. Decide whether to include them, exclude them, or analyze them separately.


A single extreme value can move an average more than a thousand normal ones. Make sure that's what you want.
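The mean-versus-median check plus a standard IQR fence takes a few lines with the statistics module. A sketch; the 1.5 multiplier is the conventional Tukey fence, and the sample orders are made up:

```python
import statistics

def outlier_summary(values, k=1.5):
    """Min/max/mean/median plus IQR-fence outliers. k=1.5 is the usual
    Tukey fence; widen it if your data is naturally heavy-tailed."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {
        "min": min(values), "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "outliers": [v for v in values if v < low or v > high],
    }

orders = [180, 195, 200, 210, 220, 50_000]
summary = outlier_summary(orders)
```

With these sample orders, the mean lands around 8,500 while the median sits at 205: exactly the wildly-different signal described above.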


9. Verify categorical values are standardized

If you've got a "Country" column, how many ways have people entered the United States?


  • USA
  • US
  • United States
  • United States of America
  • U.S.A.
  • America


I've seen all of these in a single dataset. Your analysis will treat them as six different countries unless you fix it.

Pull a list of unique values for every categorical column. Look for variations that should be consolidated. Status fields are especially bad for this—"Active" vs "active" vs "ACTIVE" vs "A" all meaning the same thing.
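Once you've reviewed the unique values (`sorted(set(column))` gets you the list), consolidation is a mapping table. A sketch with an illustrative map:

```python
# Illustrative consolidation map; build yours from the unique values
# you actually find in the column.
COUNTRY_MAP = {
    "usa": "United States", "us": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "u.s.a.": "United States", "america": "United States",
}

def standardize_country(value):
    key = value.strip().lower()
    return COUNTRY_MAP.get(key, value.strip())  # unmapped values pass through

raw = ["USA", "US", "United States", "U.S.A.", "America", "Canada"]
standardized = [standardize_country(v) for v in raw]
```

Letting unmapped values pass through unchanged means new variations show up in your next unique-values review instead of silently disappearing.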


10. Cross-check related fields for logical consistency

Some errors only show up when you compare columns:


  • End date before start date
  • Age that doesn't match birth date
  • State/province that doesn't match the country
  • Job title "CEO" with department "Intern Pool"
  • Ship date before order date


Build a few sanity-check comparisons. They don't need to be complicated—just "does column A make sense given column B?" A surprising number of records will fail these basic tests.
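Those sanity checks can live as a list of named predicates, which keeps adding a new rule to a single line. A sketch with illustrative field names:

```python
from datetime import date

# Each rule: a name plus a predicate that returns True when the record
# is internally consistent. Field names are illustrative.
RULES = [
    ("end after start", lambda r: r["end"] >= r["start"]),
    ("ship after order", lambda r: r["ship_date"] >= r["order_date"]),
]

def failed_rules(record):
    return [name for name, ok in RULES if not ok(record)]

record = {"start": date(2024, 3, 1), "end": date(2024, 2, 1),
          "order_date": date(2024, 1, 5), "ship_date": date(2024, 1, 7)}
```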


The honest truth

This checklist works. It'll catch most of the problems that cause embarrassment, bad decisions, or broken reports.


It's also tedious. Steps 1, 3, 6, and 9 alone can eat an entire afternoon on a dataset of any real size. Multiply that by every dataset you touch, and data cleaning becomes a significant chunk of your workweek.


That's the tradeoff. Do the work manually and trust the results, or skip it and hope nothing's broken.

Or automate it. But that's a different post.


William Flaiz is a digital transformation executive and former Novartis Executive Director who has led consolidation initiatives saving enterprises over $200M in operational costs. He holds MIT's Applied Generative AI certification and specializes in helping pharmaceutical and healthcare companies align MarTech with customer-centric objectives. Connect with him on LinkedIn or at williamflaiz.com.
