Data Cleaning Checklist: 10 Things to Check Before Your Next Analysis

William Flaiz • December 9, 2025

You've got a dataset. You've got a deadline. You've got a boss who wants insights by Thursday.

The temptation is to skip straight to the analysis. Don't.


Dirty data doesn't announce itself. It hides in plain sight until your quarterly report shows revenue doubled (it didn't) or your email campaign goes out to 4,000 contacts who are actually the same 900 people entered multiple ways. I've seen both happen. The revenue one was worse.


Here's what to check before you trust any dataset enough to make decisions from it.

Diagram depicting data processing through a series of filters or layers, with green data particles flowing through.

1. Hunt for duplicates—and not just the obvious ones

The easy duplicates are exact matches. Same name, same email, same everything. Excel's "Remove Duplicates" handles those fine.


The problem is the duplicates that aren't exact. "Jon Smith" and "Jonathan Smith" are probably the same person. "Acme Corp" and "Acme Corporation" and "ACME Corp." are definitely the same company. Your database doesn't know that. You do.


Check name columns for slight variations. Look for entries that share an email or phone number but differ elsewhere. If you've got customer data from multiple sources, assume there's overlap until you prove otherwise.

This one takes time. There's no shortcut except automation.
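
As a starting point for that automation, fuzzy string matching can surface the near-duplicates a plain "Remove Duplicates" misses. Here's a minimal sketch using Python's standard-library difflib; the sample names and the 0.7 similarity threshold are illustrative assumptions you'd tune for your own data:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how closely two strings match, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

names = ["Acme Corp", "Acme Corporation", "ACME Corp.", "Globex Inc"]

# Compare every pair once; flag likely duplicates above a threshold.
suspects = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= 0.7
]
```

Pairwise comparison is O(n²), so on large datasets you'd block on a shared key (email domain, zip code) first and only compare within blocks.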


2. Validate email formats

Not every string with an @ symbol is a valid email. Look for:

  • Missing domains (john@.com)
  • Typos in common domains (gmail.con, yahooo.com)
  • Spaces where they shouldn't be
  • Multiple @ symbols
  • Obviously fake entries (test@test.com, asdf@asdf.com)


A quick regex check catches format issues. Actually verifying deliverability is harder—but at minimum, filter out the stuff that's obviously broken before you send anything.
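
A sketch of that quick regex check, plus a small blocklist for the obviously fake entries. The pattern is a pragmatic approximation, not a full RFC 5322 validator, and the blocklist entries are the examples from the list above:

```python
import re

# Rough format check: something@something.tld, no spaces, one @ sign.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")
FAKE = {"test@test.com", "asdf@asdf.com"}  # illustrative blocklist

def looks_valid(email: str) -> bool:
    email = email.strip().lower()
    return bool(EMAIL_RE.match(email)) and email not in FAKE

emails = ["john@example.com", "john@.com", "a b@example.com",
          "double@@example.com", "test@test.com"]
valid = [e for e in emails if looks_valid(e)]
```

Note this passes typo domains like gmail.con; catching those takes a lookup table of common domains or an actual DNS check.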


3. Standardize phone numbers

Phone numbers are chaos. I pulled a dataset last year with numbers formatted as:


  • (555) 123-4567
  • 555-123-4567
  • 5551234567
  • +1 555 123 4567
  • 555.123.4567


All the same number. Five different formats. This breaks sorting, deduplication, and any automated dialing system you might use downstream.


Pick a format. Apply it everywhere. If you've got international numbers, include country codes consistently.


4. Look for impossible values

These are the showstoppers. Values that can't exist but somehow do:


  • Negative ages
  • Birth dates in the future
  • Order totals below zero (unless you handle returns that way intentionally)
  • Hire dates before the company existed
  • Percentages over 100
  • Dates like "02/30/2024" (February 30th isn't a thing)


Sort each numerical column low-to-high and high-to-low. Scan the extremes. The impossible stuff clusters at the edges.
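
The checks above can also run row by row. A minimal sketch, with illustrative field names; note that datetime.strptime rejects dates like 02/30/2024 for free:

```python
from datetime import datetime

def find_impossible(row: dict) -> list[str]:
    problems = []
    if row.get("age", 0) < 0:
        problems.append("negative age")
    if not 0 <= row.get("discount_pct", 0) <= 100:
        problems.append("percentage out of range")
    try:
        dt = datetime.strptime(row["birth_date"], "%m/%d/%Y")
        if dt > datetime.now():
            problems.append("birth date in the future")
    except ValueError:
        problems.append("invalid calendar date")  # e.g. Feb 30th
    return problems

rows = [
    {"age": 34, "discount_pct": 15, "birth_date": "02/30/2024"},   # bad date
    {"age": -2, "discount_pct": 150, "birth_date": "01/15/1990"},  # two issues
]
flags = [find_impossible(r) for r in rows]
```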


5. Handle missing values consistently

Empty cells happen. The question is what you do about them.


First, figure out why they're empty. Is the data actually unknown? Was it optional? Did an import fail? Did someone enter "N/A" as text instead of leaving it blank?


Then decide on a strategy:


  • Leave blank (fine for truly unknown values)
  • Fill with a default (works for categorical data sometimes)
  • Fill with an average or median (risky—only if the pattern supports it)
  • Flag and exclude from certain calculations


Whatever you pick, be consistent. A mix of blank cells, "N/A", "NULL", "-", and "unknown" in the same column will break things.
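
A sketch of collapsing those spellings into one sentinel (None here) so downstream code sees a single representation; the MISSING set is illustrative and should be built from what's actually in your columns:

```python
MISSING = {"", "n/a", "na", "null", "none", "-", "unknown"}

def normalize_missing(value):
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING:
        return None
    return value

column = ["42", "N/A", "", "NULL", " - ", "17", None]
cleaned = [normalize_missing(v) for v in column]
```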


6. Check for inconsistent capitalization and spacing

This one seems minor until it isn't. "NEW YORK" and "New York" and "new york" are the same city. Your pivot table disagrees.


Common culprits:


  • ALL CAPS entries mixed with normal case
  • Leading or trailing spaces (invisible but deadly for matching)
  • Double spaces between words
  • Inconsistent treatment of abbreviations (St. vs Street vs ST)


Trim whitespace first. Then standardize case. For proper nouns like names and cities, title case usually works. For codes or IDs, pick uppercase or lowercase and stick with it.
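
That order of operations, sketched in a few lines. Title case suits proper nouns like the city example; for codes or IDs you'd swap in .upper() or .lower() instead:

```python
import re

def standardize(text: str) -> str:
    text = re.sub(r"\s+", " ", text.strip())  # trim, collapse double spaces
    return text.title()

cities = ["NEW YORK", "  new york ", "New  York"]
standardized = {standardize(c) for c in cities}
```

All three variants collapse to one value, so the pivot table agrees again. (Watch .title() on names like "McDonald" or "O'Brien"; it's a blunt instrument.)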


7. Standardize date formats

Excel thinks it's helping when it auto-formats dates. It isn't.


"01/02/2024" means January 2nd in the US and February 1st in most of Europe. If your data comes from multiple sources or was touched by people with different regional settings, you might have both interpretations in the same column without knowing it.


Check that dates parse correctly. Convert everything to an unambiguous format (YYYY-MM-DD is safest for data work). Watch for text-formatted dates that look right but won't sort or calculate properly.


8. Look for outliers that might skew your analysis

Not every outlier is an error. Sometimes someone really did place a $50,000 order when the average is $200. But you need to know about those values before they skew your averages and make your charts unreadable.


Calculate basic stats for numerical columns: min, max, mean, median. If the mean and median are wildly different, you've got outliers pulling things. Decide whether to include them, exclude them, or analyze them separately.


A single extreme value can move an average more than a thousand normal ones. Make sure that's what you want.
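
Those basic stats take three lines with the standard library. The order values and the "mean more than twice the median" rule of thumb are illustrative:

```python
import statistics

orders = [180, 195, 200, 210, 220, 205, 50_000]  # one extreme order

mean = statistics.mean(orders)
median = statistics.median(orders)

# A mean far above the median signals outliers pulling the average up.
skewed = mean > 2 * median
```

Here one $50,000 order drags the mean above $7,000 while the median stays near $200; that single row is doing almost all the work in the average.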


9. Verify categorical values are standardized

If you've got a "Country" column, how many ways have people entered the United States?


  • USA
  • US
  • United States
  • United States of America
  • U.S.A.
  • America


I've seen all of these in a single dataset. Your analysis will treat them as six different countries unless you fix it.

Pull a list of unique values for every categorical column. Look for variations that should be consolidated. Status fields are especially bad for this—"Active" vs "active" vs "ACTIVE" vs "A" all meaning the same thing.
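
Once you've pulled the unique values, consolidation is an explicit mapping table. The mapping below covers the country example; in practice you'd build yours from the unique values you actually find:

```python
from collections import Counter

COUNTRY_MAP = {
    "usa": "United States", "us": "United States",
    "united states": "United States",
    "united states of america": "United States",
    "u.s.a.": "United States", "america": "United States",
}

def canonical(value: str) -> str:
    key = value.strip().lower()
    return COUNTRY_MAP.get(key, value.strip())

raw = ["USA", "US", "United States", "U.S.A.", "America", "Canada"]
counts = Counter(canonical(v) for v in raw)
```

An explicit map beats clever fuzzy logic here: it's auditable, and anything it doesn't recognize passes through unchanged for a human to review.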


10. Cross-check related fields for logical consistency

Some errors only show up when you compare columns:


  • End date before start date
  • Age that doesn't match birth date
  • State/province that doesn't match the country
  • Job title "CEO" with department "Intern Pool"
  • Ship date before order date


Build a few sanity-check comparisons. They don't need to be complicated—just "does column A make sense given column B?" A surprising number of records will fail these basic tests.
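
Those comparisons can live as a list of named rules so failures are self-describing. A sketch with illustrative field names:

```python
from datetime import date

# Each rule is (name, predicate-that-is-True-when-broken).
RULES = [
    ("end before start", lambda r: r["end"] < r["start"]),
    ("shipped before ordered", lambda r: r["shipped"] < r["ordered"]),
]

def failed_rules(record: dict) -> list[str]:
    return [name for name, broken in RULES if broken(record)]

record = {
    "start": date(2024, 3, 1), "end": date(2024, 2, 1),       # impossible
    "ordered": date(2024, 3, 5), "shipped": date(2024, 3, 7),  # fine
}
```

Adding a new check is one line, and the rule names double as the error messages in your cleaning report.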


The honest truth

This checklist works. It'll catch most of the problems that cause embarrassment, bad decisions, or broken reports.


It's also tedious. Steps 1, 3, 6, and 9 alone can eat an entire afternoon on a dataset of any real size. Multiply that by every dataset you touch, and data cleaning becomes a significant chunk of your workweek.


That's the tradeoff. Do the work manually and trust the results, or skip it and hope nothing's broken.

Or automate it. But that's a different post.

Automate Your Data Cleaning →

William Flaiz is a digital transformation executive and former Novartis Executive Director who has led consolidation initiatives saving enterprises over $200M in operational costs. He holds MIT's Applied Generative AI certification and specializes in helping pharmaceutical and healthcare companies align MarTech with customer-centric objectives. Connect with him on LinkedIn or at williamflaiz.com.
