CSV Validation Rules Every Team Should Enforce
Every bad data problem I've seen in the past year started the same way: someone imported a CSV without checking it first.
The file looked fine. It opened in Excel. It had the right columns. So it went straight into the system, and three weeks later someone's asking why the quarterly numbers don't add up, or why 200 customers have the same phone number, or why there's a negative value in a field that should never be negative.
Validation rules are the fix. Not complicated ones—just a checklist of things that should be true about any file before it touches your production data. Catch the problems at the door instead of finding them later in a broken report.
This post covers the rules worth enforcing on every CSV import, from basic structure checks to field-specific validation. Steal these defaults and adapt them to your data.

Why Validation Rules Matter
The argument for validation isn't abstract. It's about time and trust.
Time: Every data quality issue you catch at import is an issue you don't have to investigate, fix, and explain later. The ratio isn't even close—five minutes of validation saves hours of cleanup.
Trust: When stakeholders learn that bad data got into reports, they stop trusting the reports. Even after you fix the issue, the doubt lingers. "Are we sure this number is right?" is a question that kills momentum.
Validation rules are a forcing function. They make it impossible to import garbage without at least acknowledging you're doing it. That friction is the point.
Required Columns and Types
The most basic validation: does this file have what we expect?
Column presence: Define the columns that must exist. If your customer import expects "email" and "company_name" and the file has "e-mail" and "company," that's a problem to catch now, not after the import silently maps things wrong or drops data.
Column order: Decide whether order matters. Some systems require exact column positions; others match by header name. Know which you're dealing with and validate accordingly.
Data types: Each column should have an expected type. Is this field supposed to be a number? A date? Text? A value like "$45.99" sitting in a numeric field is technically text, and it will break calculations downstream.
Type validation catches:
- Numbers stored as text ("1,234" vs 1234)
- Dates in the wrong format (or not dates at all)
- Empty strings where nulls should be
- Mixed types in a single column
The rules don't have to be fancy. "Column A must exist and contain only integers" is a rule. Write it down, check it on every import.
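To make that concrete, here's a minimal sketch in Python with pandas. The column names (`email`, `company_name`, `seats`) and the file name are placeholders for whatever your import actually expects, not a standard.

```python
import pandas as pd

# Hypothetical expectations for a customer import -- adjust to your own schema.
REQUIRED_COLUMNS = {"email", "company_name", "seats"}
INTEGER_COLUMNS = {"seats"}

def check_structure(path: str) -> list[str]:
    """Return a list of structural problems; an empty list means the file passes."""
    problems = []
    df = pd.read_csv(path, dtype="string")   # read everything as text first

    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems   # no point type-checking columns that aren't there

    # Try to coerce integer columns; anything that fails coercion becomes NaN.
    for col in INTEGER_COLUMNS:
        coerced = pd.to_numeric(df[col], errors="coerce")
        bad = df[col].notna() & coerced.isna()
        if bad.any():
            problems.append(f"{col}: {bad.sum()} non-numeric values")
        not_integer = coerced.dropna().mod(1).ne(0)
        if not_integer.any():
            problems.append(f"{col}: {not_integer.sum()} non-integer values")

    return problems

print(check_structure("customers.csv") or "structure and types look OK")
```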
Allowed Values and Ranges
Once you know the columns exist and have the right types, check whether the values make sense.
Enumerated values: If a field should only contain specific options, validate against the list. Status fields are classic—if your system expects "active," "inactive," and "pending," a value of "Active" (capitalized) or "on hold" (not in the list) shouldn't get through.
Numeric ranges: Define the reasonable bounds. Age should be 0-120. Percentages should be 0-100. Order quantities should be positive. Prices probably shouldn't be negative (unless you handle credits).
String length: Set minimums and maximums. A two-character "name" field is probably wrong. A 10,000-character "notes" field might break your UI. Phone numbers should be within a reasonable digit range.
Regex patterns: For structured text, define the pattern. US zip codes are 5 digits or 5+4 with a hyphen. URLs should start with http:// or https://. Product codes probably follow a specific format.
Here's a starter set of range rules:
| Field Type | Rule |
|---|---|
| Age | 0-120, integer |
| Percentage | 0-100 |
| Price/Amount | >=0 (usually) |
| Quantity | >0, integer |
| Year | 1900-2100 |
| Rating (1-5) | 1, 2, 3, 4, or 5 only |
Adjust for your domain. The point is to have explicit rules rather than hoping the data is reasonable.
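Here's roughly what those checks look like in pandas. The field names, the allowed status list, and the bounds are illustrative; swap in your own.

```python
import pandas as pd

# Illustrative allowed-value list -- substitute your own statuses.
ALLOWED_STATUS = {"active", "inactive", "pending"}

def check_values(df: pd.DataFrame) -> list[str]:
    problems = []

    # Enumerated values: anything outside the list (including "Active") fails.
    bad_status = ~df["status"].isin(ALLOWED_STATUS)
    if bad_status.any():
        problems.append(f"status: {bad_status.sum()} values not in the allowed list")

    # Numeric range: coerce first, then flag missing, non-numeric, or out-of-range.
    age = pd.to_numeric(df["age"], errors="coerce")
    bad_age = ~age.between(0, 120)
    if bad_age.any():
        problems.append(f"age: {bad_age.sum()} values missing, non-numeric, or outside 0-120")

    # String length: catch fields that might break the UI downstream.
    too_long = df["notes"].fillna("").str.len().gt(10_000)
    if too_long.any():
        problems.append(f"notes: {too_long.sum()} values longer than 10,000 characters")

    return problems

df = pd.read_csv("customers.csv", dtype="string")
print(check_values(df) or "value checks pass")
```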
Date, Phone, and Email Rules
These three fields cause disproportionate pain. They deserve specific attention.
Dates are a mess because formats vary. Is "01/02/2024" January 2nd or February 1st? Depends who created the file. Your validation should either enforce a specific format (ISO 8601: YYYY-MM-DD is the least ambiguous) or at least flag ambiguous dates for review.
Date validation rules:
- Must be parseable as a date
- Must be within a reasonable range (not year 1900 unless you actually have historical data, not year 2099 unless you're scheduling far-future events)
- Start dates must be before end dates
- Dates shouldn't be in the future if they represent past events (order dates, birth dates)
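A sketch of those rules in pandas, assuming hypothetical `order_date` and `ship_date` columns and an ISO 8601 format requirement:

```python
import pandas as pd

def check_dates(df: pd.DataFrame) -> list[str]:
    """Illustrative date checks on hypothetical order_date / ship_date columns."""
    problems = []

    # Enforce one unambiguous format; anything else becomes NaT.
    order_date = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
    ship_date = pd.to_datetime(df["ship_date"], format="%Y-%m-%d", errors="coerce")

    if order_date.isna().any():
        problems.append(f"order_date: {order_date.isna().sum()} values not in YYYY-MM-DD format")

    # Past events shouldn't carry future dates.
    today = pd.Timestamp.today().normalize()
    if order_date.gt(today).any():
        problems.append("order_date: some dates are in the future")

    # Reasonable range: flag suspiciously old dates (often placeholder values).
    if order_date.lt(pd.Timestamp("2000-01-01")).any():
        problems.append("order_date: some dates are before 2000 (check for placeholders)")

    # Start dates must come before end dates.
    if (ship_date < order_date).any():
        problems.append("ship_date is before order_date on some rows")

    return problems
```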
Phone numbers vary by country and format. At minimum, check that they contain only valid characters (digits, spaces, parentheses, hyphens, plus signs) and fall within a reasonable length (7-15 digits typically). If you're US-only, you can be stricter: 10 digits, area code shouldn't start with 0 or 1.
Phone validation rules:
- Contains only valid characters
- 7-15 digits after stripping formatting
- No obviously fake patterns (000-000-0000, 123-456-7890)
- Consistent format within the file (or flag inconsistency)
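And a sketch of the phone rules. The character whitelist and the 7-15 digit bound follow the guidance above; the column is passed in as a pandas Series.

```python
import pandas as pd

PHONE_PATTERN = r"^[0-9\s()+\-.]+$"          # digits, spaces, (), +, -, .
FAKE_NUMBERS = {"0000000000", "1234567890"}  # obviously fake patterns

def check_phones(series: pd.Series) -> list[str]:
    """Illustrative phone checks; the length bounds assume international numbers."""
    problems = []
    phones = series.dropna().astype("string")

    # Only valid characters allowed before we strip formatting.
    bad_chars = ~phones.str.fullmatch(PHONE_PATTERN)
    if bad_chars.any():
        problems.append(f"{bad_chars.sum()} phone numbers contain invalid characters")

    # Strip everything but digits, then check the length range.
    digits = phones.str.replace(r"\D", "", regex=True)
    bad_length = ~digits.str.len().between(7, 15)
    if bad_length.any():
        problems.append(f"{bad_length.sum()} phone numbers outside 7-15 digits")

    if digits.isin(FAKE_NUMBERS).any():
        problems.append("obviously fake numbers present (e.g. 123-456-7890)")

    return problems
```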
Emails need both format validation and domain validation. The format check catches obvious errors: missing @ symbol, spaces, invalid characters. Domain validation confirms the domain actually exists and can receive mail.
Email validation rules:
- Contains exactly one @ symbol
- Has text before and after the @
- Domain has a valid TLD (.com, .org, .co.uk, etc.)
- No spaces or invalid characters
- Domain has MX records (if you want to verify deliverability)
Don't over-engineer email validation with complex regex. The edge cases in valid email addresses are weirder than you'd think, and most "strict" patterns reject legitimate addresses. Basic structural checks plus domain verification catch the real problems.
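In code, the structural check can be as small as this; MX verification would sit on top as a separate DNS lookup, which isn't shown here.

```python
import pandas as pd

# Deliberately loose: one @, something on both sides, a dot in the domain.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

def check_emails(series: pd.Series) -> list[str]:
    """Basic structural email checks; domain/MX verification is a separate step."""
    emails = series.dropna().astype("string")
    bad = ~emails.str.fullmatch(EMAIL_PATTERN)
    if bad.any():
        return [f"{bad.sum()} emails fail the basic structure check"]
    return []
```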

Row-Level vs. Dataset-Level Checks
Some rules apply to individual rows. Others apply to the file as a whole.
Row-level validation checks each record independently:
- Does this row have all required fields?
- Are the values in valid ranges?
- Do the fields pass format validation?
Dataset-level validation looks at the file as a whole:
- Are there duplicate primary keys?
- Is the row count within expected bounds?
- Are there unexpected patterns (same value repeated across all rows, sequential IDs with gaps)?
- Does the distribution look reasonable (not 99% of values in one category)?
Dataset-level checks catch problems that row-level checks miss. A row with customer_id "12345" is valid on its own. Two rows with customer_id "12345" is a duplicate problem. You only see it when you look at the whole file.
Useful dataset-level rules:
- Uniqueness: Primary key columns should have no duplicates
- Completeness: Critical columns should have < X% null values
- Cardinality: Categorical columns should have a reasonable number of distinct values
- Row count: File should have between N and M rows (catches truncated exports or runaway appends)
- Cross-field consistency: If "country" is "USA," then "state" should be a valid US state
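A sketch of a few of those whole-file checks in pandas. The column names and thresholds (5% nulls, 1,000-100,000 rows, 99% concentration) are placeholders, not recommendations.

```python
import pandas as pd

def check_dataset(df: pd.DataFrame) -> list[str]:
    """Illustrative whole-file checks; column names and thresholds are placeholders."""
    problems = []

    # Uniqueness: the primary key column should have no duplicates.
    dupes = df["customer_id"].duplicated().sum()
    if dupes:
        problems.append(f"{dupes} duplicate customer_id values")

    # Completeness: critical columns should be mostly populated.
    null_rate = df["email"].isna().mean()
    if null_rate > 0.05:
        problems.append(f"email is {null_rate:.0%} null (threshold 5%)")

    # Row count: catch truncated exports or runaway appends.
    if not 1_000 <= len(df) <= 100_000:
        problems.append(f"row count {len(df)} outside expected 1,000-100,000")

    # Distribution: one value dominating a categorical column is suspicious.
    counts = df["status"].value_counts(normalize=True)
    if not counts.empty and counts.iloc[0] > 0.99:
        problems.append(f"status: one value accounts for {counts.iloc[0]:.0%} of rows")

    return problems
```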
Automating Validation
Manual validation doesn't scale. If you're checking CSVs by hand, you'll skip it when you're busy, miss things when you're tired, and apply rules inconsistently.
Automation options, from simple to sophisticated:
- Spreadsheet formulas: For small files, you can build validation into Excel or Google Sheets. Conditional formatting highlights problems, and helper columns flag specific rule violations. It's manual to set up but reusable.
- Scripts: Python with pandas, R, or even bash scripts can validate CSVs against defined rules. Write once, run on every import. The code becomes your documentation of what "valid" means.
- Database constraints: If the CSV is heading into a database, let the database enforce rules. NOT NULL constraints, CHECK constraints, foreign keys, unique indexes. The import fails if the data violates constraints—which is exactly what you want.
- Dedicated tools: Data quality platforms and ETL tools often have validation built in. CleanSmart validates structure, formats, and field rules automatically when you upload a CSV—you get a report of everything that failed before you decide whether to proceed.
The best approach depends on your volume and complexity. One CSV a week? A spreadsheet template might be fine. Dozens of files from multiple sources? You need automation.
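As a rough illustration of the script approach: wrap a couple of checks in a small command-line script that exits non-zero when anything fails, so whatever runs the import (a cron job, a CI step, a pipeline) refuses to load the file. The schema here is hypothetical.

```python
import sys
import pandas as pd

REQUIRED = {"customer_id", "email", "company_name"}   # hypothetical schema

def validate(path: str) -> list[str]:
    df = pd.read_csv(path, dtype="string")
    problems = []
    missing = REQUIRED - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "customer_id" in df.columns and df["customer_id"].duplicated().any():
        problems.append("duplicate customer_id values")
    return problems

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    for p in problems:
        print(f"FAIL: {p}")
    sys.exit(1 if problems else 0)   # non-zero exit stops the rest of the pipeline
```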
Building Your Validation Checklist
Start with these defaults and customize for your data:
Structure checks:
- All required columns present
- Column names match expected (exact match or mapped)
- No unexpected extra columns
- Row count within expected range
Type checks:
- Numeric columns contain only numbers
- Date columns contain parseable dates
- No mixed types within columns
Value checks:
- Required fields are not null/empty
- Numeric values within valid ranges
- Categorical values in allowed lists
- String lengths within bounds
Format checks:
- Emails are valid format
- Phone numbers match expected pattern
- Dates in consistent format
- URLs are well-formed
Integrity checks:
- Primary keys are unique
- Foreign keys exist in reference data
- Cross-field relationships are consistent
Not every check applies to every file. But having the list means you're making conscious decisions about what to validate rather than hoping for the best.
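One way to keep that checklist honest is to write it down as data and loop over it, so adding a rule means editing a dict rather than writing new code. A minimal sketch, with illustrative fields and bounds:

```python
import pandas as pd

# The checklist written down as data -- illustrative fields, patterns, and bounds.
RULES = {
    "customer_id": {"required": True, "unique": True},
    "email":       {"required": True, "pattern": r"^[^@\s]+@[^@\s]+\.[^@\s]+$"},
    "age":         {"min": 0, "max": 120},
    "status":      {"allowed": {"active", "inactive", "pending"}},
}

def apply_rules(df: pd.DataFrame) -> list[str]:
    problems = []
    for col, rule in RULES.items():
        if col not in df.columns:
            problems.append(f"{col}: column missing")
            continue
        s = df[col]
        if rule.get("required") and s.isna().any():
            problems.append(f"{col}: {s.isna().sum()} empty values")
        if rule.get("unique") and s.duplicated().any():
            problems.append(f"{col}: duplicate values")
        if "allowed" in rule and not s.dropna().isin(rule["allowed"]).all():
            problems.append(f"{col}: values outside the allowed list")
        if "pattern" in rule and not s.dropna().astype("string").str.fullmatch(rule["pattern"]).all():
            problems.append(f"{col}: values that don't match the expected format")
        if "min" in rule:
            nums = pd.to_numeric(s, errors="coerce")
            if not nums.dropna().between(rule["min"], rule["max"]).all():
                problems.append(f"{col}: values outside {rule['min']}-{rule['max']}")
    return problems
```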
Start Validating
Upload a CSV to CleanSmart and get an instant validation report. Every column gets type-checked, every field gets format-validated, and every row gets flagged if something's off. You'll know exactly what's wrong before you decide whether to clean it or reject it.
What should I do when a file fails validation?
It depends on what failed and why. For minor issues—a few malformed phone numbers, some dates in the wrong format—you can often clean the data and proceed. For structural problems—wrong columns, completely wrong data types, massive duplicate counts—reject the file and go back to the source. The validation failure is telling you something's wrong with the upstream process, and patching the symptoms doesn't fix the cause. Ask why the export produced bad data rather than just fixing it every time.
How strict should validation rules be?
Strict enough to catch real problems, loose enough to not reject legitimate data. Start stricter than you think you need, then loosen rules when they generate false positives. It's easier to relax a rule that's blocking good data than to tighten one after bad data has been flowing through. For critical systems (financial data, healthcare, anything regulatory), err on the side of too strict. For exploratory analysis, you can afford more flexibility.
Should I validate CSVs that come from internal systems?
Yes. Internal systems produce bad data too—maybe more than external sources, because there's an assumption of trust. Database exports get truncated. ETL jobs fail silently. Someone changes a field definition without updating downstream processes. Validate everything, regardless of source. The five minutes it takes to check is always worth it compared to the alternative.
William Flaiz is a digital transformation executive and former Novartis Executive Director who has led consolidation initiatives saving enterprises over $200M in operational costs. He holds MIT's Applied Generative AI certification and specializes in helping pharmaceutical and healthcare companies align MarTech with customer-centric objectives. Connect with him on LinkedIn or at williamflaiz.com.