Guide · July 4, 2026

The AI data readiness checklist: 12 checks before you build.

By the MortarIQ Founder · 6 minute read

Most AI data readiness checklists are vibes. “Ensure high data quality.” “Establish strong governance.” You cannot check either of those boxes, because neither is a check; they are aspirations wearing a checkbox costume. Then the project starts, the retrieval pipeline reads a column called status2_old, and everyone rediscovers why the aspirations mattered.

This checklist is different in one specific way: every item on it is verifiable from warehouse metadata. Not from interviews, not from a data quality tool that needs three weeks and read access to your rows. Schema, descriptions, timestamps, tags, policies. If you can query your catalog, you can answer all twelve today, and if you would rather not do it by hand, the automated scan described under “How to actually run this” takes minutes.

The twelve checks are grouped by the six factors MortarIQ scores, adapted from the open-source Snowflake Labs framework. The grouping matters less than the habit: check what is checkable, before you build.

[Figure: the six factors of AI data readiness: Clean, Current, Contextual, Compliant, Correlated, Consumable]

Contextual: can a machine understand it?

1. The tables your workload touches have descriptions

Not the whole estate. The tables this AI project will actually read. A model, an agent, or the engineer wiring up retrieval has no tribal knowledge; the description field is the only voice your table has. Check coverage on the target schemas, not the average.
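
As a rough sketch of what "coverage on the target schemas" means in practice: the rows below stand in for whatever your catalog query returns, and the schema names, table names, and the (schema, table, description) shape are all hypothetical. The point is only that coverage is computed per target schema, and that an empty string counts as no description.

```python
from collections import defaultdict

# Hypothetical catalog rows: (schema, table, description).
# In practice these would come from your warehouse's catalog views.
TABLES = [
    ("sales", "orders", "One row per customer order."),
    ("sales", "order_items", ""),
    ("sales", "refunds", None),
    ("marketing", "campaigns", "One row per campaign."),
]

def description_coverage(rows, target_schemas):
    """Return {schema: (described, total)} for the target schemas only."""
    counts = defaultdict(lambda: [0, 0])
    for schema, _table, description in rows:
        if schema not in target_schemas:
            continue
        counts[schema][1] += 1
        # Treat None, empty, and whitespace-only descriptions as missing.
        if description and description.strip():
            counts[schema][0] += 1
    return {schema: tuple(pair) for schema, pair in counts.items()}

coverage = description_coverage(TABLES, {"sales"})
for schema, (described, total) in coverage.items():
    print(f"{schema}: {described}/{total} tables described")
```

Averaging across both schemas here would report half the tables as described; scoped to the schema the workload reads, it is one in three.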

2. Columns with ambiguous names are documented

status, type, flag, amount, value. If a human needs to ask what a column means, an AI will guess, and it will guess confidently. Every column whose name does not fully explain itself needs a description.
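
One way to make this check mechanical, sketched under assumptions: split each column name on underscores and flag it when any token is on an ambiguous-word list and the description is empty. The word list is just the five examples above, and the table and column names are invented for illustration.

```python
# The ambiguous words named above; extend the set for your own estate.
AMBIGUOUS = {"status", "type", "flag", "amount", "value"}

# Hypothetical catalog rows: (table, column, description).
COLUMNS = [
    ("orders", "status", None),
    ("orders", "order_total_usd", None),
    ("orders", "type", "Order channel: web, store, or phone."),
    ("payments", "refund_flag", ""),
]

def undocumented_ambiguous(columns):
    """Return 'table.column' for ambiguous names with no description."""
    flagged = []
    for table, column, description in columns:
        tokens = set(column.lower().split("_"))
        has_description = bool(description and description.strip())
        if tokens & AMBIGUOUS and not has_description:
            flagged.append(f"{table}.{column}")
    return flagged

flagged = undocumented_ambiguous(COLUMNS)
print(flagged)
```

A token match is a blunt instrument. It will miss ambiguous names that are not on the list, so treat the output as a floor, not a ceiling.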

Current: is it fresh enough to act on?

3. Every source table has refreshed within its expected window

Compare the last-modified timestamp against how often the table is supposed to load. A table that should refresh nightly and last changed eleven days ago is a silent failure that will feed your workload stale answers.

4. Freshness expectations are written down somewhere queryable

If nobody declared how fresh a table should be, staleness is undetectable by definition. A tag, a label, or even a convention in the description is enough to make check 3 automatic.
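
A minimal sketch of how checks 3 and 4 fit together, assuming a made-up convention where the expected window is written into the table description as `freshness=24h`. The convention, the table names, and the timestamps are all hypothetical; the logic is just "declared window versus time since last modification".

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed convention (not a standard): "freshness=<hours>h" in the description.
FRESHNESS = re.compile(r"freshness=(\d+)h")

def freshness_status(last_modified, description, now):
    """Return 'fresh', 'stale', or 'undeclared' for one table."""
    match = FRESHNESS.search(description or "")
    if match is None:
        return "undeclared"  # check 4 fails, so check 3 cannot run
    window = timedelta(hours=int(match.group(1)))
    return "stale" if now - last_modified > window else "fresh"

NOW = datetime(2026, 7, 4, 12, 0, tzinfo=timezone.utc)

# Hypothetical metadata: table -> (last_modified, description).
TABLES = {
    "orders": (NOW - timedelta(hours=6), "Nightly load. freshness=24h"),
    "customers": (NOW - timedelta(days=11), "Nightly load. freshness=24h"),
    "legacy_events": (NOW - timedelta(days=2), "Event archive."),
}

status = {
    name: freshness_status(modified, description, NOW)
    for name, (modified, description) in TABLES.items()
}
print(status)
```

The "undeclared" result is the interesting one: the table may be perfectly fresh, but nothing in the metadata lets you prove it.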

Compliant: can it be used without creating exposure?

5. Columns that look like personal data carry a classification tag

Names, emails, phone numbers, addresses, national identifiers. The candidates are visible from column names and types alone. Untagged PII is not a paperwork gap; it is the input an AI pipeline will happily read and repeat.

6. Tagged personal data is covered by a masking policy

A tag without a policy is a label on an open door. Check that a masking policy, or your platform's equivalent dynamic data policy, is actually attached to each sensitive column, because the AI workload reads whatever the role it runs as can see.
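
Checks 5 and 6 chain naturally: name-based candidate, then tag, then policy. The sketch below assumes you have already pulled each column's classification tag and attached masking policy into one record; the field names, tag values, policy names, and the hint pattern are all illustrative, and substring matching on names will produce false positives such as file_name.

```python
import re

# Illustrative name hints only; a real list would be longer and tuned.
PII_HINTS = re.compile(r"name|email|phone|address|ssn|national_id", re.IGNORECASE)

# Hypothetical merged metadata, one record per column.
COLUMNS = [
    {"column": "customers.email", "tag": "pii", "masking_policy": "mask_email"},
    {"column": "customers.phone_number", "tag": "pii", "masking_policy": None},
    {"column": "customers.full_name", "tag": None, "masking_policy": None},
    {"column": "orders.total", "tag": None, "masking_policy": None},
]

def pii_status(col):
    """Classify one column against checks 5 (tagged) and 6 (masked)."""
    column_name = col["column"].split(".")[-1]
    looks_personal = bool(PII_HINTS.search(column_name))
    if col["tag"] is None:
        # Check 5: looks like personal data but carries no classification.
        return "untagged_candidate" if looks_personal else "not_flagged"
    if col["masking_policy"] is None:
        # Check 6: tagged, but nothing masks it.
        return "tagged_unmasked"
    return "covered"

report = {col["column"]: pii_status(col) for col in COLUMNS}
print(report)
```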

Clean: is the structure trustworthy?

7. Keys are declared on the tables that matter

Primary keys, unique constraints, or their warehouse-native equivalents. Even where the platform does not enforce them, the declaration tells every consumer, human or machine, what a row means.

8. Types are honest

Dates stored as strings, numerics stored as text, JSON blobs holding what should be columns. Every dishonest type is a parsing decision an AI system will make for you, silently.
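
Type honesty can be approximated from names and declared types alone. This is a heuristic sketch, not a detector: the suffix lists are assumptions about naming habits, the column names are invented, and JSON and VARIANT are used as examples of semi-structured types.

```python
STRING_TYPES = {"STRING", "VARCHAR", "TEXT", "CHAR"}
SEMI_STRUCTURED_TYPES = {"JSON", "VARIANT"}

# Assumed naming habits; adjust to your own conventions.
DATE_HINTS = ("_date", "_at", "_ts", "_time")
NUMERIC_HINTS = ("_amount", "_count", "_qty", "_price", "_total")

def suspicious_types(columns):
    """Return (column, reason) pairs for types that look dishonest."""
    findings = []
    for name, data_type in columns:
        # Drop length/precision, e.g. VARCHAR(32) -> VARCHAR.
        base_type = data_type.upper().split("(")[0]
        lowered = name.lower()
        if base_type in STRING_TYPES:
            if lowered.endswith(DATE_HINTS):
                findings.append((name, "date stored as text"))
            elif lowered.endswith(NUMERIC_HINTS):
                findings.append((name, "number stored as text"))
        elif base_type in SEMI_STRUCTURED_TYPES:
            findings.append((name, "semi-structured blob"))
    return findings

# Hypothetical (column, declared type) pairs from a catalog query.
COLUMNS = [
    ("created_at", "VARCHAR(32)"),
    ("order_total", "STRING"),
    ("payload", "VARIANT"),
    ("shipped_date", "DATE"),
    ("notes", "TEXT"),
]

findings = suspicious_types(COLUMNS)
for column, reason in findings:
    print(f"{column}: {reason}")
```

A flagged blob is a prompt to look, not a verdict; some payloads genuinely belong in a semi-structured column.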

Correlated: do the joins survive contact?

9. Relationships between core entities are declared

Foreign keys or documented join paths between the tables your workload must combine. An undeclared relationship is a join condition someone will infer from column names, and column names lie.

10. One entity, one authoritative table

If there are four customer tables, which one does retrieval read? Duplicated entities force every consumer to relitigate which copy is canonical, and an AI consumer will just pick one.
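
Finding duplicated entities from names alone is necessarily fuzzy. One hedged approach is to strip common prefixes and suffixes, singularize naively, and group what is left. The prefix and suffix lists and the table names below are assumptions for illustration; expect both false matches and misses.

```python
import re
from collections import defaultdict

# Assumed noise: layer prefixes and version/backup suffixes.
NOISE = re.compile(r"^(dim|fact|stg|tmp)_|_(v\d+|old|new|backup|copy)$")

def entity_key(table_name):
    """Reduce a table name to a rough entity name."""
    name = table_name.lower()
    previous = None
    while previous != name:  # repeat until nothing more can be stripped
        previous = name
        name = NOISE.sub("", name)
    # Naive singularization: good enough for grouping, wrong for display.
    return name[:-1] if name.endswith("s") else name

def duplicated_entities(table_names):
    """Return {entity: [tables]} for entities with more than one table."""
    groups = defaultdict(list)
    for table in table_names:
        groups[entity_key(table)].append(table)
    return {entity: tables for entity, tables in groups.items() if len(tables) > 1}

# Hypothetical table list from a catalog query.
TABLES = ["customers", "dim_customers_v2", "customer_old", "stg_customer_backup", "orders"]

duplicates = duplicated_entities(TABLES)
print(duplicates)
```

The output is a list of questions, not answers: for each group, someone still has to name the authoritative table and say so in its description.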

Consumable: can a workload actually read it efficiently?

11. Big tables are partitioned or clustered for their access pattern

A workload that scans a two-terabyte table end to end for every question is a cost problem first and a latency problem second. Partitioning metadata is visible without touching a row.
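
In its simplest form this check is a filter over size and layout metadata. The sketch assumes each table record already carries its size in bytes plus any partitioning and clustering columns; the field names, the one-terabyte threshold, and the tables are all hypothetical.

```python
TB = 1024 ** 4  # bytes in a tebibyte

def unpartitioned_large_tables(tables, threshold_bytes=TB):
    """Return names of large tables with neither partitioning nor clustering."""
    return sorted(
        t["name"]
        for t in tables
        if t["size_bytes"] >= threshold_bytes
        and not t["partitioned_by"]
        and not t["clustered_by"]
    )

# Hypothetical table metadata from a catalog query.
TABLES = [
    {"name": "events", "size_bytes": 2 * TB, "partitioned_by": [], "clustered_by": []},
    {"name": "orders", "size_bytes": 3 * TB, "partitioned_by": ["order_date"], "clustered_by": []},
    {"name": "countries", "size_bytes": 10_000, "partitioned_by": [], "clustered_by": []},
]

offenders = unpartitioned_large_tables(TABLES)
print(offenders)
```

This only tells you that a layout exists, not that it matches the access pattern. Whether order_date is the right partition column depends on how the workload filters.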

12. Naming follows one convention

snake_case here, camelCase there, dim_ prefixes on half the star schema. Inconsistency is friction for people and a hallucination surface for machines that autocomplete table names.
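
A small sketch of how to measure this: classify each identifier by pattern, take the most common convention as the de facto standard, and list everything that departs from it. The patterns and names are illustrative. On platforms that fold unquoted identifiers to a single case, a case-based check like this says little, and separators and prefixes are the better signal.

```python
import re
from collections import Counter

# Illustrative patterns; the first match wins.
PATTERNS = {
    "snake_case": re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*"),
    "camelCase": re.compile(r"[a-z][a-z0-9]*([A-Z][a-z0-9]*)+"),
    "PascalCase": re.compile(r"([A-Z][a-z0-9]+)+"),
    "UPPER_SNAKE": re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"),
}

def convention(name):
    """Return the label of the first convention the whole name matches."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(name):
            return label
    return "other"

def naming_outliers(names):
    """Return (dominant convention, names that do not follow it)."""
    labels = {name: convention(name) for name in names}
    dominant = Counter(labels.values()).most_common(1)[0][0]
    return dominant, [name for name in names if labels[name] != dominant]

# Hypothetical identifiers from a catalog query.
NAMES = ["order_items", "orderItems", "customer_address", "CustomerAddress", "Order_Items"]

dominant, outliers = naming_outliers(NAMES)
print(dominant, outliers)
```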

How to actually run this

You can work through the twelve by hand. On BigQuery and Snowflake the catalog views make most checks a morning’s work for one engineer per schema, and the platform posts in this series walk through the exact surfaces to query. The failure mode is not difficulty; it is that nobody re-runs the morning’s work in August, and readiness drifts.

MortarIQ automates the list. A read-only, metadata-only connection scores your warehouse against all 50 requirements behind these checks, weighted for the workload you pick: Estate Scan, RAG, Agents, Training, or Feature Serving. The scan never reads a row of your data, which is why the security review that usually stalls this kind of tool tends not to stall this one. You get a score, the requirement-level detail behind it, and a fix plan ordered by impact.

What this checklist cannot tell you

Everything above is structural. A table can pass all twelve checks and still contain wrong values, and no metadata scan will catch a customer record whose email column holds a phone number. That is data quality territory, it needs row access, and tools like dbt tests own it. The honest claim for this checklist is narrower and, for AI projects, usually more urgent: it catches the failures that stop a workload from finding, understanding, and safely reading your data at all. In our experience those are the ones that stall projects, because they are invisible until integration day.

Frequently asked questions

What is AI data readiness?

AI data readiness measures whether a consumer with none of your team's context can find, understand, trust, and safely use your data. It is a property of structure and governance: documentation, freshness, declared relationships, masking on personal data, and consistent conventions. It is related to, but distinct from, data quality, which measures whether the values themselves are correct.

Can I run this checklist without giving anyone access to my data?

Yes. Every item on the checklist is answerable from catalog metadata: schema, column descriptions, types, freshness timestamps, masking policies, and classification tags. MortarIQ automates the whole list from a read-only, metadata-only connection, and its CLI can run the same assessment inside your own network with no account at all.

How is this different from a data quality audit?

A data quality audit reads values to check whether they are correct: nulls, duplicates, out-of-range numbers. This checklist stays one level up, at whether the data is organized, documented, and governed well enough for an AI workload to use it. Both matter, but readiness gaps are the ones that stall AI projects before quality is even measurable.

What readiness score counts as ready?

It depends on the workload, which is why MortarIQ scores against a selected profile (Estate Scan, RAG, Agents, Training, or Feature Serving) rather than a universal bar. A corpus destined for retrieval-augmented generation needs documentation coverage far more than declared foreign keys; training data needs freshness and lineage more than descriptions. Ready means the requirements your workload depends on pass.

How long does an automated readiness assessment take?

Minutes. Because the assessment reads only metadata, a scan of a warehouse with thousands of tables completes in a few minutes and produces a scored breakdown of all 50 requirements plus a prioritized fix plan.

Want to see what the automated version produces? Read a sample readiness report built entirely from metadata.