How to Actually Read a Study
Everything they should have taught us before someone shared a study on Twitter
In 2012, a cardiologist named Franz Messerli published a paper in the New England Journal of Medicine showing a striking linear correlation between a country’s per capita chocolate consumption and the number of Nobel laureates it produced. Switzerland, naturally, came out on top. Messerli estimated that an extra 0.4 kilograms of chocolate per person per year was associated with roughly one additional Nobel laureate per 10 million people.
The paper was tongue-in-cheek, of course. But it ran in one of the most prestigious medical journals on earth, it used real data, the correlation was statistically significant, and if you read only the abstract (or headline), you might think we had discovered the world’s tastiest cognitive enhancer. It is, in miniature, everything that can go wrong when you read a study poorly: a real statistical association, a mechanism, a prestigious venue, and a conclusion that is obviously, hilariously wrong. Again, Messerli and NEJM were having fun here, but…
Most misreadings of research aren’t this funny. They look more like this: someone shares a study on social media, it confirms what they already believe, the sample size “looks big,” the result is “statistically significant,” and they walk away thinking science has spoken. I have done this. You have probably done this. The entire health journalism industry is, in large part, built on doing this professionally.
This post is an attempt to explain what you should actually be doing instead. Not in the sense of “here’s a statistics textbook,” but in the sense of “here are the six or seven questions that, if you ask them every time you encounter a study, will make you a dramatically better consumer of evidence than almost everyone you know.”
I’ll be upfront: this is a skill that researchers themselves frequently get wrong. The American Statistical Association literally had to release an official statement in 2016 explaining what p-values do and don’t mean, which is a bit like the American Medical Association having to release a statement reminding doctors what a stethoscope is for. The problem isn’t that you’re dumb. The problem is that the incentive structure of science, journalism, and social media conspires to make study results seem much more clear-cut than they almost ever are.
Start with the question, not the answer
Before you look at a single number in a study, you need to know what question it was trying to answer. This sounds trivially obvious, and it is, which is why almost nobody does it.
The question has several components. Who was studied? What was the intervention or exposure? What was it compared to? What outcome was measured? Over what time period? In clinical research, this gets formalized as PICO (Population, Intervention, Comparator, Outcome), and it’s worth internalizing because it immediately reveals things that headlines hide.
Consider a headline like “Meditation reduces anxiety.” Okay. What kind of meditation? Compared to what? Doing nothing, or doing another relaxation technique? In whom? College students with mild stress, or people with diagnosed anxiety disorders? Over what period? Did they measure anxiety with a validated clinical scale, or did they just ask people if they felt better? Each of these details can transform the meaning of the finding.
Once you have the question, ask the next one: what kind of study is this, and can that kind of study answer that kind of question?
This is where most casual readers go off the rails. Different study designs can answer different questions. A randomized controlled trial (RCT) is the best tool we have for figuring out whether a treatment causes an improvement, because random assignment ensures the treatment and control groups don’t systematically differ in ways that would confuse the result. An observational study, where you just watch who takes the treatment and who doesn’t without randomizing, can be useful for studying things you can’t or shouldn’t randomize (you can’t randomly assign people to smoke for 30 years), but it’s perpetually haunted by confounding: the possibility that the groups differ in some unmeasured way that explains the outcome.
This is not a hypothetical concern. For decades, observational studies consistently found that women who took hormone replacement therapy had dramatically lower rates of heart disease, somewhere in the range of 30-50% lower risk. Then in 2002, the Women’s Health Initiative randomized trial of combined estrogen-progestin therapy found the opposite: it increased the risk of coronary heart disease, breast cancer, and stroke. A major part of the explanation is that women who chose to take HRT were, on average, wealthier, healthier, more health-conscious, and younger than women who didn’t. The observational studies weren’t measuring the effect of hormones. They were measuring the effect of being the kind of person who goes to the doctor regularly and has good insurance.
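You don't have to take the epidemiologists' word for it; you can fake it yourself in a dozen lines. Here's a toy simulation (all numbers invented) in which the treatment does literally nothing, but a health-consciousness variable drives both who takes it and who does well:

```python
import random

random.seed(2)
N = 100_000

# Toy model: "health-consciousness" raises the chance of both taking the
# treatment and having a good outcome. The treatment itself does nothing.
treated_outcomes, untreated_outcomes = [], []
for _ in range(N):
    health_conscious = random.random() < 0.5
    takes_treatment = random.random() < (0.7 if health_conscious else 0.3)
    good_outcome = random.random() < (0.6 if health_conscious else 0.4)
    (treated_outcomes if takes_treatment else untreated_outcomes).append(good_outcome)

rate_treated = sum(treated_outcomes) / len(treated_outcomes)
rate_untreated = sum(untreated_outcomes) / len(untreated_outcomes)
print(f"treated: {rate_treated:.2f}, untreated: {rate_untreated:.2f}")
# Expect roughly 0.54 vs 0.46: the treated group "does better" anyway.
```

An eight-point gap in outcomes, from a treatment with zero effect. That's confounding in action.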
That’s the sort of thing that a study’s design either protects you from or doesn’t.
The most important number in the paper (it’s not the p-value)
If you take one habit away from this post, let it be this: look at the effect size first. The effect size is the actual magnitude of the difference the study found. It might be a risk ratio (“the treatment group had half the risk”), or an absolute risk difference (“2% of the treatment group had the outcome versus 4% of the control group”), or a mean difference (“the treatment group scored 3 points higher on a 100-point scale”).
The effect size is what the study actually found. Everything else, including the p-value, is commentary.
And that is where things get interesting. Relative and absolute effects can tell radically different stories. Suppose a drug “cuts your risk of heart attack by 50%.” Sounds incredible. But if your baseline risk was 2 in 1,000 and the drug lowered it to 1 in 1,000, that’s a 50% relative risk reduction built on an absolute reduction of 0.1 percentage points. You would need to treat 1,000 people to prevent one heart attack. Whether that’s worth it depends enormously on cost, side effects, and how much you personally value a 0.1% reduction in cardiac risk, none of which the “50% reduction” headline tells you.
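The arithmetic behind those three numbers is simple enough to sketch. A toy helper, plugged with the hypothetical risks above:

```python
def risk_summary(risk_control: float, risk_treated: float) -> dict:
    """Convert two event risks into the numbers that actually matter."""
    arr = risk_control - risk_treated      # absolute risk reduction
    rrr = arr / risk_control               # relative risk reduction
    nnt = 1 / arr                          # number needed to treat
    return {"ARR": arr, "RRR": rrr, "NNT": nnt}

# The heart-attack example: 2 in 1,000 on placebo, 1 in 1,000 on the drug.
print(risk_summary(0.002, 0.001))
# {'ARR': 0.001, 'RRR': 0.5, 'NNT': 1000.0}
```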
This is not a made-up scenario. It’s roughly the shape of the statin-for-primary-prevention debate. The CTT Collaborators’ meta-analysis of 27 randomized trials found that for people at low vascular risk, each 1 mmol/L reduction in LDL cholesterol prevented about 11 major vascular events per 1,000 people treated over five years. That’s a meaningful relative reduction, but the absolute benefit for any individual person is small. Whether that’s “a lot” or “a little” is a judgment call that the study alone cannot make for you. The study can tell you the size of the effect. Whether the effect matters depends on values, context, and trade-offs that are outside the study’s jurisdiction.
Confidence intervals: the error bars you should actually care about
Every estimate in a study is uncertain. The confidence interval (CI) tells you how uncertain.
If a study reports a risk ratio of 0.75 (meaning the treatment group had 75% the risk of the control group), the confidence interval might be [0.60, 0.93]. That means the data are reasonably compatible with effects ranging from a 40% reduction to a 7% reduction. This is enormously useful information. It tells you not just what the study found, but what it didn’t rule out.
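If you're curious where an interval like that comes from, here's the standard log-scale approximation, fed with event counts I invented to land near the example:

```python
import math

def risk_ratio_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Risk ratio with a 95% CI, via the usual log-scale approximation."""
    rr = (events_t / n_t) / (events_c / n_c)
    se_log = math.sqrt(1/events_t - 1/n_t + 1/events_c - 1/n_c)
    lo = math.exp(math.log(rr) - z * se_log)
    hi = math.exp(math.log(rr) + z * se_log)
    return rr, lo, hi

# Hypothetical counts: 150/1000 events with treatment vs 200/1000 without.
print(risk_ratio_ci(150, 1000, 200, 1000))
# ~(0.75, 0.62, 0.91): a point estimate of 0.75, compatible with
# anything from a 38% reduction to a 9% reduction.
```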
A common rookie mistake is to look at a confidence interval only to check whether it crosses 1.0 (or zero, for mean differences). If it doesn’t cross, the result is “statistically significant.” If it does, it isn’t. This is technically true but misses the point. A confidence interval that runs from a huge benefit to a huge harm tells you the study was basically uninformative; the data are compatible with everything. A confidence interval that sits entirely in the “trivially small effect” zone tells you the treatment probably doesn’t matter much, even if it’s “statistically significant.”
The right way to read a CI is to hold it up against what would actually matter. If you’re evaluating a depression treatment, and the minimal clinically important difference on the outcome scale is 5 points, then a CI of [0.5, 2.1] tells you the data are compatible with effects that are probably too small to make a depressed person actually feel better. The point estimate could be statistically significant and still clinically useless.
One more subtlety that Greenland and colleagues stress in their excellent guide to misinterpretations: a 95% confidence interval does not mean “there is a 95% probability that the true value falls in this interval.” In the frequentist framework (which is what most studies use), the 95% refers to the long-run performance of the procedure. If you repeated the study many times, about 95% of the intervals you would compute would contain the true value. That’s a statement about the method, not about any particular interval. In practice, this distinction rarely changes what you should do (look at the interval, take it seriously, don’t over-interpret the boundaries). But it’s worth knowing because people routinely describe CIs as probability statements about the specific interval in front of them, and that’s not quite right.
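If the long-run framing feels slippery, it's easy to watch it happen. A toy simulation: draw many samples from a distribution whose mean we know, build the usual interval each time, and count how often it captures the truth:

```python
import random
import statistics

random.seed(0)
TRUE_MEAN, SD, N, REPS = 10.0, 5.0, 50, 10_000

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    covered += (mean - 1.96 * se) <= TRUE_MEAN <= (mean + 1.96 * se)

print(f"intervals containing the true mean: {covered / REPS:.1%}")
# ≈ 95% (a touch under, since 1.96 is the large-sample critical value)
```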
Confidence intervals are, however, only as trustworthy as the study that produced them. A tight interval around a biased estimate is just precise misinformation.
P-values: the most misunderstood number in all of science
In 2016, the American Statistical Association did something it had never done in its 177-year history: it released an official statement telling people how to interpret a specific statistical concept. The concept was the p-value, and the statement amounted to a polite, institution-wide scream of “you’re all doing it wrong.”
Here is what a p-value actually means: it tells you how incompatible the observed data are with a specified statistical model that usually includes the assumption of no effect. That’s it.
Here is what a p-value does not tell you:
It does not tell you the probability that the hypothesis is true.
It does not tell you the probability that the result happened “by chance.”
It does not tell you whether the effect is large.
It does not tell you whether the result is important.
Let me linger on these because they are counterintuitive. A p-value of 0.01 does not mean there’s a 1% chance the finding is a fluke. It means: if you assume the treatment has literally zero effect, then data this extreme (or more extreme) would occur about 1% of the time. That’s a statement about the data under a hypothetical model, not a statement about how likely the hypothesis is to be true. Whether the hypothesis is actually true depends on all sorts of things the p-value doesn’t know about: the prior probability that the treatment works, the quality of the study design, whether the outcome was prespecified, and so on.
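If that's hard to hold in your head, simulate it. Here's a sketch (invented numbers throughout): assume a world where the treatment does nothing, rerun the study thousands of times, and count how often a gap at least as large as the one we "observed" shows up anyway:

```python
import random
import statistics

random.seed(1)

def null_world_gap(n):
    """One simulated trial in a world where the treatment truly does nothing."""
    treated = [random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    return statistics.mean(treated) - statistics.mean(control)

observed_gap = 0.52   # pretend our real study found this difference in means
n, reps = 50, 20_000
extreme = sum(abs(null_world_gap(n)) >= observed_gap for _ in range(reps))
print(f"p ≈ {extreme / reps:.3f}")
# ≈ 0.01: a gap this big arises about 1% of the time
# even when the true effect is exactly zero.
```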
The reason this matters practically is that people use p < 0.05 as a binary verdict machine. Below 0.05? The treatment works. Above 0.05? It doesn’t. Jacob Cohen, in a famous 1994 paper whose title, “The Earth Is Round (p < .05),” tells you everything about his mood, dismantled this ritual pretty thoroughly. The ritual persists anyway.
A 2019 editorial in The American Statistician was blunter: “We conclude, based on our review of the articles in this special issue and the description of the ASA statement on p-values, that it is time to stop using the term ‘statistically significant’ entirely.” The editors weren’t saying statistics don’t matter. They were saying that the binary framing, significant or not, hides more than it reveals.
So how should you treat a p-value when you encounter one? As a rough measure of surprise. A low p-value says the data would be surprising if the null hypothesis were true; a high p-value says they wouldn't be. Neither, on its own, tells you whether the treatment actually works. You have to combine it with the effect size, the confidence interval, the study design, the risk of bias, and everything else available to you.
Statistical significance vs. clinical significance (a.k.a. the distinction that would save the world)
This is the single most important conceptual distinction in this entire essay, and it is the one most consistently ignored by everyone from journalists to physicians.
Statistical significance means the data would be unusual if the null hypothesis were true, as operationalized by p < 0.05 (or whatever threshold was chosen). Clinical significance means the effect is large enough that a patient, policymaker, or anyone with skin in the game would actually care about it.
These two things can come apart in both directions.
A very large trial can detect a very small, clinically meaningless effect and still achieve p < 0.001. The SPRINT trial randomized over 9,300 people to either intensive or standard blood pressure management. It found a statistically significant benefit for intensive treatment, but the more interesting question (and the one that occupied cardiologists for years afterward) was whether the absolute benefit was large enough to justify the additional side effects, monitoring, and medication costs. The answer turned out to be context-dependent: for high-risk patients, yes; for lower-risk patients, it was less clear. The p-value alone couldn’t answer that.
Conversely, a small trial can find a potentially important effect but fail to achieve statistical significance simply because it didn’t have enough participants to detect it. This is the “absence of evidence is not evidence of absence” problem. Douglas Altman and Martin Bland warned about it in 1995 and people still treat every non-significant result as “the treatment doesn’t work.”
The fix is to think in terms of thresholds that matter. Ask: “What is the smallest effect that would actually be important?” Then look at the confidence interval and see whether it rules that threshold in or out. This is the logic behind what’s called the minimal clinically important difference (MCID), and it’s the bridge between what the statistics tell you and what the world needs to know.
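The whole decision rule fits in a few lines. A sketch, using the depression-scale numbers from above:

```python
def read_the_interval(ci_low, ci_high, mcid):
    """Hold a confidence interval up against the smallest effect that matters."""
    if ci_low >= mcid:
        return "probably matters: even the low end clears the threshold"
    if ci_high < mcid:
        return "probably doesn't matter: even the high end falls short"
    return "inconclusive: compatible with both trivial and meaningful effects"

# The depression example: MCID of 5 points, CI of [0.5, 2.1].
print(read_the_interval(0.5, 2.1, mcid=5))
# -> "probably doesn't matter: even the high end falls short"
```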
Bias: the Methods section is more important than the Results
A study can have gorgeous statistics and still be wrong, because the statistical analysis is only as good as the data that goes into it. Bias is any systematic distortion that pushes the results in a particular direction, and it comes from the study’s design and execution, not from the math.
For randomized trials, the main things to check are allocation concealment (was the randomization sequence hidden from the people enrolling patients, so nobody could steer sicker patients into one group?), blinding (did the patients and clinicians know who got the treatment and who got the placebo?), and whether they analyzed patients in the groups they were assigned to, regardless of what they actually did (intention-to-treat analysis). If patients who got worse dropped out of the treatment arm, and the analysts just excluded them, the treatment is going to look a lot better than it is. Schulz and colleagues showed empirically that trials with inadequate allocation concealment and inadequate blinding produced inflated treatment effects.
For observational studies, the big worry is confounding: the treatment and control groups differ in ways that affect the outcome, and you can’t fully adjust for it. The STROBE guidelines for reporting observational studies explicitly tell authors to discuss both measured and unmeasured confounding, because statistical adjustment only works for the confounders you know about and measure well. There’s always the possibility that something you didn’t measure explains the finding. Always.
And then there’s the issue of whether the study measured what it said it would measure, or whether the goalposts quietly shifted during the game.
Preregistration and outcome switching: the fog of war
Here is a fact that should make you at least mildly uncomfortable: Chan and colleagues compared the protocols of randomized trials (what the researchers said they would measure before the study started) with their published papers (what they actually reported), and found that 62% of trials had at least one primary outcome that was either changed, introduced, or omitted between protocol and publication. Outcomes that were statistically significant were more likely to be fully reported. Outcomes that weren’t were more likely to disappear.
This is called selective outcome reporting, and it is devastating to the integrity of the evidence. If you run a study measuring 20 outcomes and only report the three that happened to cross p < 0.05, you haven’t proven anything. You’ve gone fishing and reported your catches while hiding all the times you came home empty-handed.
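The back-of-the-envelope arithmetic is sobering. Assuming truly null outcomes tested independently at the 0.05 threshold:

```python
# Chance of at least one p < 0.05 among k independent, truly null outcomes.
for k in (1, 5, 20):
    print(f"k = {k:>2}: {1 - 0.95 ** k:.0%}")
# k =  1: 5%
# k =  5: 23%
# k = 20: 64%
```

Run 20 null tests and you're more likely than not to land at least one "significant" result to write up.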
This is why trial registration exists. The International Committee of Medical Journal Editors requires prospective registration of clinical trials before they enroll any patients. The idea is simple: if you commit in advance to your primary outcome, your sample size, and your analysis plan, it becomes much harder to play games afterwards. So when you read a trial, check whether it was registered, and whether what the paper reports matches what the registry says. ClinicalTrials.gov is the most common registry. If a paper says its primary outcome was overall survival but the registry says it was originally progression-free survival, that’s a red flag.
The same logic applies to subgroup analyses. If a study finds that the treatment only worked in left-handed women over 55, and that subgroup wasn’t prespecified, your default reaction should be skepticism. When you slice data enough ways, you will find patterns, and nearly all of them are noise.
Sample size: important, but probably not in the way you think
Non-experts tend to judge studies by sample size first. “Was it a big study?” is one of the first questions people ask. It’s not a bad instinct, but it misses the important part.
What sample size mainly buys you is precision: smaller confidence intervals, more stable estimates. A small study might find a risk ratio of 0.50, but with a confidence interval stretching from 0.10 to 2.50, which tells you essentially nothing. A larger study might find a risk ratio of 0.80 with a CI of [0.72, 0.89], which is quite informative.
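You can see the square-root law directly. For a proportion near 50%, the half-width of a 95% interval shrinks like 1/√n, so quadrupling the sample size only halves the error bars:

```python
import math

# Half-width of a 95% CI for a proportion near 50%, at various sample sizes.
for n in (100, 400, 1_600, 500_000):
    half_width = 1.96 * math.sqrt(0.25 / n)
    print(f"n = {n:>7,}: ±{half_width:.4f}")
# n =     100: ±0.0980
# n =     400: ±0.0490
# n =   1,600: ±0.0245
# n = 500,000: ±0.0014
```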
But sample size doesn’t buy you importance. A study of 500,000 people can precisely measure an effect so small that nobody cares. The precision just means you’re very confident the effect is trivially small. And a study of 200 people can observe a large, potentially important effect but not have the precision to confirm it.
The question to ask isn’t “how big is the study?” but rather “how precise is the estimate, and is the effect large enough to matter?” The confidence interval answers both questions simultaneously, which is why I keep coming back to it. It’s the Swiss Army knife of study interpretation.
No study is an island
Here’s John Ioannidis, in probably the most famous methods paper ever published, arguing in 2005 that most published research findings are false. His argument was mathematical: under realistic assumptions about study power, the prior probability that tested hypotheses are true, and the prevalence of various biases, the post-study probability that a significant result reflects a true effect is often below 50%.
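The engine of the argument is a simple application of Bayes' rule, and it's worth seeing stripped down. A sketch that ignores the bias terms Ioannidis layers on top:

```python
def ppv(prior, power, alpha=0.05):
    """Post-study probability that a 'significant' finding is a true effect."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# A long-shot hypothesis (1 in 10 true) tested in a typical underpowered study:
print(f"{ppv(prior=0.10, power=0.50):.0%}")   # ~53%
```

Even before anyone p-hacks anything, a significant result for a long-shot hypothesis is closer to a coin flip than to proof.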
You can quibble with Ioannidis’s specific assumptions (many people have), but the broad point holds up well: any single study is a noisy, biased, potentially unreliable snapshot. What matters is the totality of the evidence. One RCT is better than one observational study. But one RCT is still just one study. The gold standard, to the extent there is one, is a well-conducted systematic review or meta-analysis: a study of studies that explicitly searches for all relevant evidence on a question, assesses its quality, and synthesizes the results.
But even meta-analyses can mislead. If the included studies are all small and all biased in the same direction, pooling them gives you a very precise estimate of a biased answer. Publication bias (positive results get published, negative results sit in file drawers) can make the available literature systematically rosier than reality. The PRISMA guidelines for systematic reviews exist precisely because even syntheses of evidence can be done and reported well or badly, and transparent reporting is what lets readers tell the difference.
So when you encounter a single study, especially one making a surprising claim, the right move is not to accept or reject it on its own merits. The right move is to ask: where does this fit in the broader evidence base? Is there a Cochrane review? A recent meta-analysis? Do other studies point in the same direction, or is this an outlier? Single studies are dots. What you want is the trend line.
The habit of thought that actually matters
I started this post with a joke about chocolate and Nobel Prizes because I wanted to make a point that runs underneath everything I’ve said: the hard part of reading a study is not learning the vocabulary. It’s learning to resist the pull of a neat conclusion.
The human brain is wired to convert complexity into simple stories. “Study finds X” is a simple story. “Study finds an effect of this size, with this much uncertainty, using this design with these limitations, in this population, consistent with some previous studies and inconsistent with others, and here’s what it might mean for you specifically” is not a simple story. It is, unfortunately, the truth.
The best study readers I know (and I include researchers, clinicians, and a few unusually careful journalists) all share the same habit: they are deeply, constitutionally suspicious of confident conclusions drawn from single pieces of evidence. They ask “how big?” before “is it significant?” They ask “compared to what?” before “does it work?” They read the methods section more carefully than the abstract. They check the trial registry. And when they’ve done all of that, they usually end up not with certainty, but with a calibrated sense of how likely various possibilities are, which is less satisfying than “science proves X” but has the considerable advantage of being closer to the truth.
That’s the skill. Not a list of statistical terms you can rattle off at parties (though if you actually do this at parties, I have concerns). The skill is a temperament: the willingness to sit with uncertainty, to ask one more question before reaching a conclusion, and to change your mind when the evidence warrants it.
In a world where every study gets laundered through a headline designed to maximize clicks, that temperament is worth more than a PhD in biostatistics. I sincerely believe that.