Vernon Yai is a distinguished data protection expert specializing in privacy protection and data governance within the healthcare sector. As a recognized thought leader, his work focuses on risk management and the creation of advanced detection techniques to protect sensitive information while ensuring its clinical utility. In this conversation, Vernon addresses the seismic shift in the FDA’s approach to real-world evidence and how the industry must evolve its data pipelines to meet new standards for completeness and traceability. His insights provide a vital roadmap for navigating the complexities of clinical data in an era where structured codes are no longer sufficient to meet regulatory scrutiny.
The FDA’s operational guidance on real-world evidence signals a definitive end to the era of structured-data-only submissions. Why is the agency now insisting that sponsors look beyond basic health data codes and dive into the complexities of clinical notes?
On February 17, 2026, the landscape for medical device submissions fundamentally changed when the FDA’s final guidance became operational, moving the focus from broad datasets to individual clinical facts. The agency realized that the traditional pipeline—relying on electronic health record fields and claims data—was essentially a house of cards because it ignored the most critical patient details hidden in narrative notes. Sponsors are now required to demonstrate that their data is relevant, reliable, and traceable for every single clinical fact, rather than just asserting the quality of a dataset as a whole. This shift acknowledges that the structural problem in how we build clinical data pipelines has finally become too large to ignore, especially since the signal that truly matters for regulatory decisions almost always lives in free text.
You’ve pointed out that social determinants of health are often completely absent from structured data. How vast is the discrepancy when we compare traditional ICD-10 codes to what can be uncovered through natural language processing?
The discrepancy is not just a minor oversight; it is a total blind spot that can derail a regulatory argument regarding patient outcomes. A 2024 study in npj Digital Medicine illustrated a staggering gap where natural language processing on clinical notes identified adverse social determinants in 93.8% of patients, while the structured ICD-10 Z-codes for those same individuals identified a mere 2.0%. This means that if a regulatory question involves how housing, food, or transportation security affects a patient, relying on structured data gives you a view that is functionally non-existent. Without the ability to parse clinical notes, you are missing almost the entire picture of the patient’s lived reality, making any derived conclusions about health equity or long-term safety inherently flawed and indefensible.
In specialized fields like oncology and neurology, how does the omission of family history and pathology details from the structured record affect the predictive power of the models we use today?
It effectively guts the predictive signal, leaving genetics-aware risk models to operate across a gap of more than tenfold in available information. In neurology, a 2015 study showed that 58.7% of admission notes contained specific family history, yet only 5.2% of that same information made it into the structured record. This loss is even more critical in oncology, where the data driving staging and therapy decisions lives in pathology reports, and where extraction performed directly on the text can reach 93.5% to 97.6% accuracy. Without that level of detail, a cancer registry or an external-control-arm submission becomes technically inadequate because the structured record alone is too thin to support a high-stakes regulatory claim.
Beyond the issue of missing data, the research suggests that the structured codes we do have are often “noisier” and more unreliable than most realize. What are the implications of these errors for safety analyses and pharmacovigilance?
The structured layer is not just thin; it is frequently deceptive, which is perhaps the most uncomfortable realization for those in data governance. A 2017 simulation found that only about half of the entered diagnosis codes were actually appropriate for the clinical scenario, and a 2022 study reported an average of 4.9 medication discrepancies per patient. We also have to face the cold reality that the CDC has documented that about one in five new prescriptions is never even filled, and that roughly half of those that are filled are taken incorrectly. For pharmacovigilance, this means that if you only look at the codes, you might miss a safety signal entirely. For instance, observed suicidality events were found to double once unstructured data was added to the surveillance window, because only 3% of ideation events typically carry corresponding ICD codes.
There is a notable study regarding Charlson comorbidity scores that seems to prove that the data source itself changes the outcome of mortality predictions. How should this change how we view the validity of our current data pipelines?
This 2018 study is a watershed moment because it proves that the data layer—not the math—is what determines the validity of a clinical conclusion. Researchers computed Charlson scores from two sources: one from free-text notes and one from the structured problem list, and the results were night and day. The version pulled from the clinical notes successfully predicted long-term mortality, while the version from the structured record completely failed to do so, despite the math being identical in both cases. This highlights why the FDA’s new framework cares less about the sheer volume of a submission and more about whether the clinical facts accurately represent what happened to the patient, as a submission built on structured-only data is simply not fit for the regulatory question it seeks to answer.
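To make the "same math, different data layer" point concrete, here is a minimal sketch in Python. It is not the study's method or data: the weight table is only a subset of the published Charlson condition weights, and the patient, the condition names used as keys, and the function name charlson_score are all hypothetical.

```python
# Subset of the original Charlson condition weights, for illustration only.
# A real implementation would cover the complete condition list.
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "chronic_pulmonary_disease": 1,
    "diabetes": 1,
    "moderate_severe_renal_disease": 2,
    "metastatic_solid_tumor": 6,
}


def charlson_score(conditions):
    """Sum the weights of the distinct conditions recorded for a patient."""
    return sum(CHARLSON_WEIGHTS.get(name, 0) for name in set(conditions))


# Hypothetical patient: the structured problem list carries one condition,
# while the free-text notes also mention heart failure and renal disease.
from_problem_list = ["diabetes"]
from_notes = [
    "diabetes",
    "congestive_heart_failure",
    "moderate_severe_renal_disease",
]

print(charlson_score(from_problem_list))  # 1
print(charlson_score(from_notes))         # 4
```

The scoring function is identical in both calls; only the input layer differs, and that alone moves this made-up patient from a score of 1 to a score of 4.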
Moving from data analysis to technical architecture, what must change in how companies ingest and parse medical information to meet these new, more rigorous standards of traceability?
The unit of work has to shift from “dataset completeness” to “fact-level accuracy,” which requires rebuilding traditional data pipelines from the ground up. The architecture must now ingest and parse every modality losslessly, including FHIR, HL7, DICOM, and even PDFs, without discarding the original source material. This requires healthcare-specific language models that can handle complex clinical context like negation, temporality, and assertion status to ensure the extraction is accurate and auditable. Furthermore, you need a sophisticated reconciliation layer that can handle conflicts, such as when a pharmacy feed says 40 mg but the clinical chart says 80 mg. That layer should surface the discrepancy rather than silently picking a value, and every claim it emits should be dated and tied to its source context.
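A reconciliation layer of this kind might look roughly like the following sketch, assuming a simple in-memory model. The Fact and Reconciled types, the field names, the source labels, and the reference strings are all hypothetical; a production system would more likely build on something like FHIR Provenance than on ad hoc records.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One clinical assertion plus the provenance needed to audit it."""
    patient_id: str
    attribute: str     # hypothetical key, e.g. "dose_mg"
    value: str
    source: str        # e.g. "pharmacy_feed" or "clinical_chart"
    source_ref: str    # pointer back to the original message or note
    observed_on: date


@dataclass
class Reconciled:
    patient_id: str
    attribute: str
    status: str        # "agreed" or "conflict"
    facts: list        # every contributing fact is kept, none discarded


def reconcile(facts):
    """Group facts by (patient, attribute) and flag any disagreement."""
    groups = defaultdict(list)
    for fact in facts:
        groups[(fact.patient_id, fact.attribute)].append(fact)

    results = []
    for (patient_id, attribute), group in groups.items():
        distinct_values = {f.value for f in group}
        status = "agreed" if len(distinct_values) == 1 else "conflict"
        results.append(Reconciled(patient_id, attribute, status, group))
    return results


if __name__ == "__main__":
    sample = [
        Fact("p1", "dose_mg", "40", "pharmacy_feed", "rx/001", date(2026, 1, 5)),
        Fact("p1", "dose_mg", "80", "clinical_chart", "note/17", date(2026, 1, 9)),
    ]
    for item in reconcile(sample):
        print(item.status, [(f.source, f.value) for f in item.facts])
```

The design choice worth noting is that reconcile never chooses a winner. A conflict is returned with both facts, their sources, and their dates attached, so a reviewer or a downstream rule can decide and the decision itself stays auditable.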
Capturing data correctly is one thing, but proving its accuracy fact-by-fact is another. Do you have any advice for our readers who are navigating these new regulatory waters?
My advice is to stop viewing data as a static asset and start seeing every clinical assertion as a claim that requires a robust, auditable paper trail. You should immediately begin auditing your existing pipelines to see where critical information like social determinants or family history is being lost between the clinical note and the structured code. Investing in specialized language models that can handle the nuance of healthcare, such as assertion status and temporal context, is no longer a luxury but a regulatory necessity. By prioritizing a reconciliation layer that surfaces discrepancies rather than hiding them, you will build a foundation of data integrity that not only satisfies the FDA but truly reflects the clinical journey of every patient you serve.
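One way to start such an audit is sketched below in Python. It assumes you already have, for each concept, the set of patients where it was found in notes and the set where it appears in structured codes. The function name audit_capture, the concept labels, and the patient identifiers are hypothetical, and the sample data is made up.

```python
def audit_capture(note_findings, structured_findings):
    """Compare what the notes say with what the structured record captured.

    Both arguments map a concept name to the set of patient ids for whom
    that concept was found in the given layer.
    """
    report = {}
    for concept, in_notes in note_findings.items():
        in_structured = structured_findings.get(concept, set())
        captured = in_notes & in_structured
        report[concept] = {
            "patients_in_notes": len(in_notes),
            "patients_captured": len(captured),
            # Share of note-documented patients who also have a code;
            # None when the notes contain no patients for the concept.
            "capture_rate": len(captured) / len(in_notes) if in_notes else None,
            "lost_patients": sorted(in_notes - in_structured),
        }
    return report


if __name__ == "__main__":
    notes = {
        "housing_instability": {"p1", "p2", "p3", "p4"},
        "family_history_documented": {"p1", "p5"},
    }
    structured = {
        "housing_instability": {"p2"},
    }
    for concept, row in audit_capture(notes, structured).items():
        print(concept, row)
```

The lost_patients list is the useful output: it points to the specific records where a fact exists in the note but never reached the coded layer, which is where a pipeline review should begin.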


