CIOs Must Redefine Data Quality for AI Success

Mar 31, 2026
Interview
Vernon Yai stands at the intersection of traditional data integrity and the chaotic, high-stakes world of artificial intelligence. As a leading voice in data governance and protection, Yai has spent decades helping organizations navigate the shift from rigid, structured databases to the fluid, often messy data landscapes required to fuel modern machine learning. In an era where a single faulty algorithm can lead to millions in losses or a public relations disaster, his expertise in risk management and innovative detection techniques has become indispensable for CIOs looking to turn raw information into a competitive advantage. This conversation explores the shift toward contextual data assessment, the necessity of “projectizing” invisible infrastructure work, and the strategic reframing required to secure executive buy-in for foundational data preparation.

Traditional data standards prioritize “clean” datasets, yet AI often thrives on “messy” real-world inputs like logs with typos or shifting categories. How do you reorient an engineering team to embrace this paradox, and what specific metrics determine if “dirty” data is actually fit for a specific use case?

To reorient a team, you first have to dismantle the fear that “messy” means “useless.” I tell my engineers that while clean data is rare, valuable data is everywhere; we have to embrace the reality of logs full of typos, sensor readings that randomly freeze, and category names that change monthly. It can be hair-raising for a traditional data manager to see values that are “manually adjusted” by different teams, but real impact comes from making sense of that instability rather than scrubbing it away. We determine fitness by looking at the specific business use case rather than a universal standard of “squeaky clean” perfection. If the data is relevant and helps the AI grasp its subject domain fully, it is fit for use, even if it falls short of the quality standards we set for traditional IT systems.
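The fitness-for-use idea above can be made concrete: instead of one universal cleanliness bar, each use case defines which fields it actually needs and what coverage is acceptable. The sketch below is illustrative only; the names (`UseCase`, `coverage_ratio`, the 0.6 threshold) are assumptions, not anything Yai prescribes.

```python
from dataclasses import dataclass

@dataclass
class UseCase:
    name: str
    required_fields: set   # fields this model actually consumes
    min_coverage: float    # fraction of records that must be usable

def coverage_ratio(records, use_case):
    """Share of records carrying every field this use case needs."""
    if not records:
        return 0.0
    usable = sum(
        1 for r in records
        if all(r.get(f) is not None for f in use_case.required_fields)
    )
    return usable / len(records)

def fit_for_use(records, use_case):
    """'Dirty' data passes if it covers what the AI needs for THIS task."""
    return coverage_ratio(records, use_case) >= use_case.min_coverage

# Logs with typos in free-text fields can still be fit for a use case
# that only needs timestamps and status codes.
logs = [
    {"ts": 1, "status": 200, "msg": "okk"},    # typo, but usable
    {"ts": 2, "status": None, "msg": "boot"},  # missing status
    {"ts": 3, "status": 500, "msg": "eror"},
]
uc = UseCase("error-rate model", {"ts", "status"}, min_coverage=0.6)
print(fit_for_use(logs, uc))  # → True (2 of 3 records usable)
```

The point of the sketch is that the typo-laden records still pass, because the quality bar is tied to the model's actual inputs rather than to cosmetic cleanliness.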

Preparing data for AI requires moving beyond standard extract, transform, and load (ETL) routines toward contextual assessment and specialized governance. What new skills must traditional data analysts acquire to handle these pipelines, and how do you structure their training to ensure they don’t default to old habits?

Traditional analysts are used to a linear flow, but for AI, they must develop a keen sense for contextual assessment—evaluating data not just for its format, but for its role within a specific algorithm’s logic. This requires a transition from basic ETL skills to advanced AI governance work, where analysts must decide which data fragments are essential and which are noise. To prevent a default to old habits, we structure training around variegated data sources, forcing analysts to work with non-standard, uncurated inputs that don’t fit into neat rows and columns. We emphasize that their role is no longer just moving data, but interpreting it to ensure the AI isn’t misled by incomplete or garbled information that hasn’t been properly contextualized.
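One way to picture the shift from linear ETL to contextual assessment is a classifier that keeps or drops each fragment based on the role it plays for a specific model. This is a hypothetical sketch; the rule structure and field names are assumptions for illustration.

```python
def assess(fragment, context):
    """Classify a data fragment as 'essential' or 'noise' for one model.

    `context` supplies model-specific rules: which categories the
    algorithm reasons about, and which fields it actually consumes.
    """
    if fragment.get("category") in context["relevant_categories"]:
        return "essential"
    if any(k in fragment for k in context["consumed_fields"]):
        return "essential"
    return "noise"

context = {
    "relevant_categories": {"payment", "refund"},
    "consumed_fields": {"amount", "currency"},
}
fragments = [
    {"category": "payment", "amount": 12.5},
    {"category": "marketing", "clicks": 9},       # noise for this model
    {"category": "legacy-refnd", "amount": 3.0},  # garbled category, but
                                                  # carries a consumed field
]
print([assess(f, context) for f in fragments])
# → ['essential', 'noise', 'essential']
```

Note that the garbled `legacy-refnd` record is retained: the assessment looks at what the algorithm can use, not at whether the record is tidy.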

Executives often view data preparation as low-value “grunt work,” making budget approval difficult for invisible infrastructure. When justifying these foundational investments to the C-suite, how do you frame the risks of faulty outcomes versus the cost of prep? Please provide a step-by-step strategy for securing funding.

Securing funding requires a shift in narrative from “infrastructure cost” to “risk mitigation.” First, I explicitly acknowledge to the CEO and the board that while this groundwork seems like non-value-added grunt work, it is the only thing preventing a costly business misstep or a public relations embarrassment. Second, I walk them through the specific risks of an AI delivering faulty results due to an imperfect algorithm fed by unconditioned data. Third, I present a tiered investment plan where data prep is linked to specific AI outcomes, showing that the cost of preparation is a fraction of the cost of a failed launch. Finally, I emphasize that AI must be fully informed by whatever data is “out there” to be effective, and that cutting corners on prep effectively blinds the system they are spending millions to build.

It is common to hide data preparation within broader project tasks, which often leads to missed deadlines and cost overruns. Why is “projectizing” these tasks essential for transparency, and what specific workflow ensures data work remains visible to stakeholders throughout the project lifecycle?

Burying data preparation within larger tasks is a recipe for disaster because it makes the most time-consuming part of the project invisible until it causes a delay. By “projectizing” these tasks, we bring them into the official project plan, ensuring that every hour spent on data cleanup or assessment is tracked and accounted for in the budget. We use a workflow that includes dedicated milestones for data ingestion, screening for accuracy, and contextual filtering before any actual modeling begins. This transparency ensures that stakeholders see the complexity of the work early on, preventing the typical shock when deadlines are missed due to “unexpected” data issues. When the work is visible, it commands the resources and respect it deserves as a foundational element of the project.
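The workflow above can be sketched as a plan in which each prep stage is a tracked milestone with its own budget, so hours spent on data work never vanish into a parent task. The structure below is purely illustrative; the stage names mirror the interview, but the fields and budgets are assumptions.

```python
MILESTONES = [
    {"stage": "data ingestion",       "budget_hours": 80,  "logged": 0},
    {"stage": "accuracy screening",   "budget_hours": 120, "logged": 0},
    {"stage": "contextual filtering", "budget_hours": 60,  "logged": 0},
    {"stage": "modeling",             "budget_hours": 200, "logged": 0},
]

def log_hours(plan, stage, hours):
    """Record hours against a named milestone so the work stays visible."""
    for m in plan:
        if m["stage"] == stage:
            m["logged"] += hours
            return m
    raise KeyError(f"unknown milestone: {stage}")

def status_report(plan):
    """Stakeholder view: every prep stage, spent vs. budgeted hours."""
    return [
        (m["stage"], m["logged"], m["budget_hours"],
         m["logged"] > m["budget_hours"])  # overrun flag
        for m in plan
    ]

log_hours(MILESTONES, "accuracy screening", 130)
for stage, spent, budget, over in status_report(MILESTONES):
    print(f"{stage}: {spent}/{budget}h" + (" OVERRUN" if over else ""))
```

Because screening has its own line item, the overrun surfaces at that milestone rather than appearing later as an "unexpected" slip in the modeling phase.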

Some AI systems require aggressive filtering, such as removing sensor jitter or narrowing massive research pipelines to specific parameters. How do you design governance frameworks that balance this need for data precision with the necessity of keeping the AI fully informed on its subject domain?

Designing these frameworks requires a dual-track approach where we filter for technical noise while preserving thematic depth. For instance, in an environment using sensor-generated data, we must purposefully remove “jitter” to ensure the AI isn’t reacting to hardware glitches rather than real-world signals. However, in complex fields like vaccine development, the framework must be sophisticated enough to narrow a worldwide research pipeline specifically to mentions of a single molecule by name without losing the broader context of the study. We balance this by setting precision parameters that are use-case specific, ensuring the pipeline is narrowed only when the volume of data threatens to overwhelm the algorithm’s focus. This allows the AI to remain “fully informed” on the relevant domain while ignoring the massive amounts of irrelevant or distracting data that would otherwise lead to faulty conclusions.
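The two technical filters mentioned above, damping jitter and catching frozen readings, can be sketched with simple signal checks. This is a minimal illustration, not the framework itself; the rolling-median window and the frozen-run threshold are assumptions that would be tuned per sensor.

```python
from statistics import median

def smooth_jitter(readings, window=3):
    """Rolling-median filter: damps one-sample spikes from hardware
    glitches so the model reacts to real-world signal, not jitter."""
    half = window // 2
    out = []
    for i in range(len(readings)):
        lo, hi = max(0, i - half), min(len(readings), i + half + 1)
        out.append(median(readings[lo:hi]))
    return out

def frozen_spans(readings, min_run=4):
    """Index ranges where a sensor repeats one value long enough to
    look stuck, e.g. readings that 'randomly freeze'."""
    spans, start = [], 0
    for i in range(1, len(readings) + 1):
        if i == len(readings) or readings[i] != readings[start]:
            if i - start >= min_run:
                spans.append((start, i - 1))
            start = i
    return spans

raw = [10.0, 10.1, 42.0, 10.2, 7.0, 7.0, 7.0, 7.0, 10.3]
print(smooth_jitter(raw))  # the 42.0 spike at index 2 is damped
print(frozen_spans(raw))   # → [(4, 7)]
```

The median filter removes the spike without touching legitimate level shifts, which is the balance the framework aims for: stripping technical noise while leaving the real signal fully intact for the model.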

What is your forecast for AI data management?

I forecast that the “clean data” obsession will eventually give way to a sophisticated acceptance of data entropy, where the value of an organization is measured by its ability to extract insights from the messiest parts of its operation. Within the next few years, we will see the emergence of autonomous data-prep layers that can identify and correct sensor jitter or log inconsistencies in real-time, reducing the manual “grunt work” that currently bogs down AI initiatives. However, this automation will only increase the demand for human experts who can set the governance rules and ensure the AI remains aligned with business ethics. Ultimately, the companies that thrive will be those that stop trying to force their data into perfect boxes and instead build resilient systems capable of navigating the beautiful, chaotic reality of real-world information.
