Electronic health record data is simultaneously the most promising and the most treacherous resource in real-world evidence. It captures clinical care as it actually happens -- the diagnoses, treatments, labs, and outcomes that emerge from millions of patient encounters. It also reflects every billing incentive, documentation quirk, and data entry shortcut that clinicians and coders have developed over decades.
I have seen studies fail not because of analytic complexity but because researchers did not understand how their data was generated. The EHR is not a research database. It is a clinical and administrative system that we are repurposing for research. That distinction matters.
Start with Data Quality Assessment
Before any analysis begins -- before you write a single line of code -- you need to understand your data. This is not optional, and it is not something you can delegate entirely to the data engineering team. You need to look at the data yourself.
Completeness varies systematically. Not all patients receive the same level of documentation. Patients seen frequently have richer records. Patients who receive care outside your health system have gaps you cannot see. Sicker patients generate more data points, which can create bias when you treat data availability as a proxy for disease burden.
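One way to see this pattern in your own data is to profile completeness against contact intensity. The sketch below is illustrative only -- the file names, columns, and the HbA1c example are hypothetical stand-ins for whatever data element matters in your study.

```python
import pandas as pd

# Hypothetical extracts; patient_id, encounter_date, and lab_name are
# illustrative column names, not a real schema.
encounters = pd.read_parquet("encounters.parquet")   # patient_id, encounter_date
labs = pd.read_parquet("labs.parquet")               # patient_id, lab_name, result_date

# Encounters per patient over the baseline window.
n_enc = encounters.groupby("patient_id").size().rename("n_encounters")

# Whether each patient has at least one HbA1c result documented anywhere.
has_a1c = (
    labs.loc[labs["lab_name"] == "HBA1C", "patient_id"]
    .drop_duplicates()
    .to_frame()
    .assign(has_a1c=True)
    .set_index("patient_id")["has_a1c"]
)

profile = pd.concat([n_enc, has_a1c], axis=1).fillna({"has_a1c": False})

# Completeness by contact intensity: if documentation rises sharply with
# encounter count, "missing" often means "measured elsewhere or never
# ordered," not "normal."
profile["enc_quartile"] = pd.qcut(profile["n_encounters"], 4, labels=False, duplicates="drop")
print(profile.groupby("enc_quartile")["has_a1c"].mean())
```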
I recommend starting with a simple exercise: pick a sample of patients and manually review their records. Compare what you see in the structured data to what you see in the clinical notes. Where are the discrepancies? What information is captured reliably, and what is missing? This manual review is tedious, but it will save you from drawing conclusions from data artifacts.
Accuracy is diagnosis-specific. Some conditions are coded consistently because they drive reimbursement or trigger quality measures. Other conditions are undercoded because they are secondary to the primary reason for the visit, or because coding them creates documentation burden without benefit.
For any condition central to your study, you need to assess coding accuracy empirically. What is the positive predictive value of an ICD-10 code for that condition? How many true cases are missed by relying on diagnosis codes alone? These questions have different answers in different institutions and even in different departments within the same institution.
Timeliness creates hidden biases. EHR data is not real-time. Lab results may take hours to flow into the record. Discharge diagnoses may not appear until days after discharge. Claims data can lag by weeks or months. If you are studying acute conditions or time-sensitive interventions, these lags can introduce measurement error that is hard to characterize.
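A quick way to characterize those lags is to compare the clinical timestamp with the system timestamp for the same event. A minimal sketch, again with hypothetical column names:

```python
import pandas as pd

# Hypothetical lab extract with a clinical timestamp (specimen collection)
# and a system timestamp (when the result landed in the record).
labs = pd.read_parquet("labs.parquet")   # collected_at, resulted_at -- illustrative names

lag_hours = (
    pd.to_datetime(labs["resulted_at"]) - pd.to_datetime(labs["collected_at"])
).dt.total_seconds() / 3600

# The tail matters most for time-sensitive designs: a median lag of a few
# hours can hide a 95th percentile of several days.
print(lag_hours.describe(percentiles=[0.5, 0.9, 0.95, 0.99]))
```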
Phenotype Validation is Non-Negotiable
The most common failure mode I see in EHR studies is inadequate phenotype validation. Researchers define their study population using a combination of diagnosis codes, procedure codes, and lab values, then proceed to analysis without ever checking whether that definition actually identifies the patients they think it identifies.
Chart review validation is the gold standard. Select a random sample of patients who meet your phenotype definition and have a clinician review their records. Does the clinical picture match what your algorithm predicted? What proportion of identified patients actually have the condition? This is the positive predictive value of your phenotype, and you need to know it.
Equally important: select a sample of patients who do not meet your definition and check whether any of them actually have the condition. This tells you about the sensitivity of your phenotype -- how many true cases are you missing?
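Turning those chart-review counts into estimates is straightforward. The sketch below assumes a two-part review -- a random sample of flagged charts and a random sample of unflagged charts -- and uses made-up counts purely for illustration:

```python
def validate_phenotype(n_flagged_total, n_unflagged_total,
                       flagged_reviewed, flagged_true,
                       unflagged_reviewed, unflagged_true):
    """Estimate PPV and sensitivity from a two-part chart review.

    A random sample of algorithm-flagged charts gives the PPV directly.
    A random sample of unflagged charts gives the miss rate, which, scaled
    to the full cohort, yields an approximate sensitivity.
    """
    ppv = flagged_true / flagged_reviewed
    miss_rate = unflagged_true / unflagged_reviewed
    est_true_flagged = ppv * n_flagged_total
    est_true_missed = miss_rate * n_unflagged_total
    sensitivity = est_true_flagged / (est_true_flagged + est_true_missed)
    return ppv, sensitivity

# Made-up counts: 5,000 flagged and 95,000 unflagged patients, 200 charts
# reviewed in each group, 172 and 3 true cases found respectively.
ppv, sens = validate_phenotype(5_000, 95_000, 200, 172, 200, 3)
print(f"PPV ~ {ppv:.2f}, estimated sensitivity ~ {sens:.2f}")   # ~0.86 and ~0.75
```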
Sensitivity analyses are mandatory, not optional. Even with chart validation, there is uncertainty around your phenotype definition. Address this by running your primary analysis under multiple reasonable phenotype definitions. If your conclusions hold across definitions, you can be more confident. If your conclusions change dramatically when you modify the definition slightly, that is a signal that your findings may be artifacts of how you chose to define the population.
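Operationally, this can be as simple as re-running the same model over a handful of cohort definitions. The sketch below assumes a hypothetical analytic table with boolean columns encoding progressively stricter definitions:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical one-row-per-patient analytic table; the boolean columns encode
# progressively stricter phenotype definitions.
df = pd.read_parquet("analytic_table.parquet")

definitions = {
    "dx_code_only": df["has_dx_code"],
    "dx_plus_lab": df["has_dx_code"] & df["has_confirmatory_lab"],
    "dx_lab_and_rx": df["has_dx_code"] & df["has_confirmatory_lab"] & df["has_treatment_rx"],
}

# Fit the same (illustrative) outcome model under each definition and compare
# the exposure estimates side by side.
estimates = {}
for name, in_cohort in definitions.items():
    fit = smf.logit("outcome ~ exposure + age + sex", data=df[in_cohort]).fit(disp=0)
    estimates[name] = fit.params["exposure"]

print(pd.Series(estimates, name="log_odds_ratio"))
```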
External benchmarks provide context. How does the prevalence of your condition in your study population compare to external epidemiological estimates? If you are finding that 40% of elderly patients have heart failure when population studies suggest 10%, something is wrong -- either with your data, your phenotype definition, or both.
Confounding Control in Observational Data
Here is the uncomfortable truth about EHR studies: patients are not randomized to treatments. The treatments they receive reflect clinical decision-making that incorporates information about their health status, preferences, and circumstances. This information is often not fully captured in the structured data available to you.
Propensity score methods address measured confounding. If you can identify the factors that influence treatment assignment and measure them accurately, propensity score matching or weighting can create a comparison group that is balanced on those factors. This is standard methodology, but it only addresses confounders you can see and measure.
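As a concrete illustration, a minimal inverse-probability-of-treatment-weighting sketch with a weighted balance check might look like the following. The table, column names, and covariate list are all hypothetical, and the covariates are assumed numeric or already encoded:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical analytic table; 'treated' is 0/1 and the covariates below are
# assumed numeric or already encoded.
df = pd.read_parquet("analytic_table.parquet")
covariates = ["age", "sex", "charlson_index", "baseline_egfr"]

# Propensity model: probability of treatment given measured covariates.
ps = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["ps"] = ps.predict_proba(df[covariates])[:, 1]

# Unstabilized inverse probability of treatment weights, for brevity.
df["iptw"] = np.where(df["treated"] == 1, 1 / df["ps"], 1 / (1 - df["ps"]))

def weighted_smd(x, treated, w):
    """Standardized mean difference in the weighted pseudo-population."""
    m1 = np.average(x[treated == 1], weights=w[treated == 1])
    m0 = np.average(x[treated == 0], weights=w[treated == 0])
    v1 = np.average((x[treated == 1] - m1) ** 2, weights=w[treated == 1])
    v0 = np.average((x[treated == 0] - m0) ** 2, weights=w[treated == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

# Balance diagnostics: weighted SMDs above roughly 0.1 suggest residual
# imbalance on that measured covariate.
t = df["treated"].to_numpy()
w = df["iptw"].to_numpy()
for c in covariates:
    print(c, round(weighted_smd(df[c].to_numpy(), t, w), 3))
```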
The key assumptions are that you have correctly specified the propensity model and that there are no unmeasured confounders. In EHR data, the second assumption is almost certainly violated. Clinicians use information -- visual impressions of patient frailty, nuances from conversations, family context -- that does not appear in structured fields.
Instrumental variable approaches can address unmeasured confounding when a valid instrument exists. The classic example is using provider preference as an instrument for treatment choice, under the assumption that which provider a patient happens to see is essentially random with respect to unmeasured patient factors. This is a strong assumption, and you need to argue for it convincingly.
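For illustration, here is a sketch of a leave-one-out provider-preference instrument with manual two-stage least squares. The table and column names are hypothetical, and the naive second-stage standard errors are not valid -- use a dedicated IV routine for inference.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical patient-level table: 'provider_id', 'treated' (0/1), 'outcome',
# plus a couple of measured covariates.
df = pd.read_parquet("analytic_table.parquet")

# Leave-one-out provider preference: the fraction of each provider's OTHER
# patients who received the treatment. Providers with a single patient get
# NaN and are dropped by the formula interface.
grp = df.groupby("provider_id")["treated"]
df["provider_pref"] = (grp.transform("sum") - df["treated"]) / (grp.transform("count") - 1)

# Manual two-stage least squares, for illustration only: the second-stage
# standard errors are not corrected for the generated regressor.
first = smf.ols("treated ~ provider_pref + age + sex", data=df).fit()
df["treated_hat"] = first.fittedvalues
second = smf.ols("outcome ~ treated_hat + age + sex", data=df).fit()
print("IV estimate:", round(second.params["treated_hat"], 3))

# Instrument strength: with a single instrument, the partial F statistic is
# the square of its first-stage t statistic; values well above 10 are wanted.
print("first-stage partial F:", round(first.tvalues["provider_pref"] ** 2, 1))
```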
Negative control outcomes expose unmeasured confounding. If your treatment affects outcomes that it should not biologically affect, you have evidence of confounding. For example, if you are studying whether a statin reduces cardiovascular events and you find it also "reduces" hip fractures, that association is almost certainly confounded -- and if one outcome is confounded, others probably are too.
Sensitivity analyses quantify how robust your conclusions are. How much unmeasured confounding would be required to explain away your observed effect? Methods like E-values give you a quantitative answer. If a modest amount of unmeasured confounding would eliminate your effect, you should be cautious about causal claims.
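The E-value itself is a one-line formula (VanderWeele and Ding): for a risk ratio RR taken on the side away from the null, E = RR + sqrt(RR x (RR - 1)). A small helper:

```python
import math

def e_value(rr: float) -> float:
    """E-value for a risk ratio (VanderWeele & Ding): the minimum strength of
    association an unmeasured confounder would need with both treatment and
    outcome, on the risk-ratio scale, to fully explain away the estimate."""
    rr = max(rr, 1 / rr)                 # work on the side away from the null
    return rr + math.sqrt(rr * (rr - 1))

# Illustrative: an observed risk ratio of 0.75 gives an E-value of 2.0, i.e.
# a confounder associated with both treatment and outcome by RR ~2 could
# explain it away. Apply the same formula to the confidence limit nearer
# the null to see how fragile the interval is.
print(round(e_value(0.75), 2))
```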
Time is More Complicated Than You Think
EHR data has timestamps, which makes it tempting to think of it as a clean timeline of events. It is not. Time in EHR data is complex and often misleading.
Index date selection shapes everything. When does your study begin for each patient? The choice of index date determines who enters your cohort, what counts as baseline characteristics, and what counts as follow-up. Small changes in this definition can dramatically change your results.
Immortal time bias is pervasive and easy to miss. If treatment status is defined by something that happens during follow-up, you can inadvertently credit the treatment with the follow-up time during which treated patients, by definition, had to survive long enough to receive it. This bias inflates apparent treatment benefits and has led to retracted studies.
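The standard fix is to treat exposure as time-varying, so the person-time before treatment initiation is counted as unexposed. A minimal sketch of the data restructuring, assuming a hypothetical one-row-per-patient cohort table:

```python
import pandas as pd

# Hypothetical one-row-per-patient cohort: follow-up runs from day 0 to
# 'end_day'; 'treat_day' is the day treatment started (NaN if never treated);
# 'event' is 0/1 at end of follow-up.
df = pd.read_parquet("cohort.parquet")

rows = []
for r in df.itertuples():
    if pd.isna(r.treat_day) or r.treat_day >= r.end_day:
        # Never treated during follow-up: one unexposed interval.
        rows.append((r.patient_id, 0, r.end_day, 0, r.event))
    else:
        # Split follow-up at treatment start: the pre-treatment ("immortal")
        # person-time is counted as unexposed, not credited to treatment.
        rows.append((r.patient_id, 0, r.treat_day, 0, 0))
        rows.append((r.patient_id, r.treat_day, r.end_day, 1, r.event))

long = pd.DataFrame(rows, columns=["patient_id", "start", "stop", "exposed", "event"])
# 'long' is now in counting-process format, suitable for a time-dependent
# Cox model (e.g., lifelines' CoxTimeVaryingFitter) rather than a naive
# ever-treated vs. never-treated comparison.
```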
Protopathic bias distorts associations. When symptoms of an undiagnosed outcome trigger the treatment that gets blamed for the outcome, you observe a spurious association. The classic example is studying whether aspirin causes stomach cancer when aspirin is often prescribed for stomach pain that turns out to be early cancer symptoms.
Time-varying confounding requires special methods. When confounders change over time and also influence treatment decisions, standard regression adjustment can introduce bias rather than remove it. Methods like marginal structural models can address this, but they add complexity and require careful thought about the causal structure of your problem.
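As a greatly simplified sketch of the weighting step -- assuming hypothetical person-period data sorted by patient and interval, and omitting the treatment-history terms and censoring weights a real analysis would need:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical person-period data, one row per patient per interval, sorted by
# patient and interval: baseline covariates (age, sex), a time-varying
# confounder ('severity'), a time-varying 'treated' flag, and 'event'.
pp = pd.read_parquet("person_period.parquet")

# Stabilized weights: numerator uses baseline covariates only, denominator
# adds the time-varying confounder; the ratio is cumulated within patient.
denom = smf.logit("treated ~ age + sex + severity", data=pp).fit(disp=0)
numer = smf.logit("treated ~ age + sex", data=pp).fit(disp=0)

p_d = np.where(pp["treated"] == 1, denom.predict(pp), 1 - denom.predict(pp))
p_n = np.where(pp["treated"] == 1, numer.predict(pp), 1 - numer.predict(pp))
pp["sw"] = pd.Series(p_n / p_d, index=pp.index).groupby(pp["patient_id"]).cumprod()

# Weighted pooled logistic regression for the outcome approximates a marginal
# structural model; cluster-robust standard errors by patient are still needed.
msm = smf.glm("event ~ treated + age + sex", data=pp,
              family=sm.families.Binomial(), freq_weights=pp["sw"]).fit()
print(msm.params["treated"])
```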
Documentation and Transparency
Rigorous methodology means nothing if it is not documented in a way that others can evaluate and reproduce.
Prespecify your analysis plan. Before you run your primary analysis, write down exactly what you are going to do. What is your study population? What is your exposure? What is your outcome? What is your statistical approach? What sensitivity analyses will you run? This prevents the temptation to selectively report only the analyses that yield favorable results.
Report your validation results. What was the positive predictive value of your phenotypes? What was the balance after propensity score adjustment? What did your negative controls show? Readers cannot evaluate your study without this information.
Provide code and analytic details. Ideally, your entire analysis should be reproducible from code and data. When data sharing is not possible due to privacy constraints, provide detailed enough documentation that someone with access to similar data could replicate your approach.
The Reality Check
I want to be clear about the limitations of even the best EHR studies. They are not randomized trials. They cannot definitively establish causation. They are subject to biases that are difficult or impossible to fully address.
This does not mean they are useless. EHR studies can generate hypotheses, provide real-world context for trial findings, study populations excluded from trials, examine long-term outcomes, and identify safety signals. They complement trials -- they do not replace them.
The danger is when researchers overclaim. When an observational study with modest confounding control is presented as if it were definitive causal evidence, that undermines trust in the entire field. The best EHR researchers are the ones who are most explicit about what their studies can and cannot establish.
My Recommendation
If you are designing an EHR-based study, here is my recommended sequence:
- Spend time with your data before you design your study. Understand its structure, its limitations, and its quirks. Talk to the people who generated it.
- Define your phenotypes carefully and validate them empirically. Do not assume that standard definitions work in your context.
- Think hard about confounding. Draw the causal diagram. Identify what you can measure and what you cannot. Be honest about the assumptions you are making.
- Design sensitivity analyses upfront. What alternative decisions could you have made? How would those change your results?
- Document everything. Write your analysis plan before you see results. Report your methods in enough detail that others can evaluate them.
- Present conclusions appropriately. If your methods support association but not causation, say so. If there are important limitations, acknowledge them. Build trust by being honest about what you have actually demonstrated.
EHR data is powerful, but power requires responsibility. The studies that move the field forward are the ones that pair sophisticated methods with methodological humility.