Statistical Significance
Statistical significance is your quality gate for product decisions. It separates signal from noise.
Your A/B Test Looks Like a Winner. But Is It?
That 3% lift in conversion might be real. Or it might be absolutely nothing.
Statistical significance is the one concept that separates PMs who ship based on evidence from those who ship based on vibes.
Let’s break it down without the academic jargon.
What statistical significance actually means
Statistical significance tells you whether an observed result is likely real or just random noise. It doesn’t tell you the result matters. It tells you the result exists.
The mechanism is straightforward: you calculate a p-value, which represents the probability of seeing your result (or something more extreme) if there were actually no difference between your variants.
A p-value of 0.05 or less means there’s less than a 5% chance of seeing a result this extreme if there were truly no difference, and you call it “statistically significant”.
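To make the mechanism concrete, here is a minimal sketch of a two-proportion z-test using only Python’s standard library. The conversion counts are hypothetical, chosen to show how a lift that looks healthy can still fail the significance bar:

```python
from statistics import NormalDist

# Hypothetical numbers: control converts 500/10,000, variant 545/10,000
control_conv, control_n = 500, 10_000
variant_conv, variant_n = 545, 10_000

p1, p2 = control_conv / control_n, variant_conv / variant_n

# Pooled two-proportion z-test: under the null, both variants share one true rate
pooled = (control_conv + variant_conv) / (control_n + variant_n)
se = (pooled * (1 - pooled) * (1 / control_n + 1 / variant_n)) ** 0.5
z = (p2 - p1) / se

# Two-sided p-value: probability of a gap at least this large under the null
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

With these numbers the variant shows a 9% relative lift, yet the p-value lands well above 0.05. That is exactly the “winner that isn’t” scenario from the intro.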
The three concepts you need to internalize
P-value (α = 0.05): The industry-standard threshold. Below it, you reject the null hypothesis (that nothing changed). Above it, you can’t confidently say your change had an effect.
Confidence intervals: Don’t just look at a single number. The interval shows the range where the true effect likely falls. This tells you the magnitude of impact.
Effect size vs. statistical significance: A massive sample can make a 0.01% lift “significant.” That doesn’t make it worth shipping. Always pair statistical significance with practical significance.
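The interval idea is easy to compute yourself. This sketch builds a 95% confidence interval for the difference in conversion rates between two variants (counts are hypothetical, unpooled standard error):

```python
from statistics import NormalDist

# Hypothetical conversion counts per variant
control_conv, control_n = 500, 10_000
variant_conv, variant_n = 580, 10_000

p1, p2 = control_conv / control_n, variant_conv / variant_n
diff = p2 - p1  # observed absolute lift

# Unpooled standard error of the difference in proportions
se = (p1 * (1 - p1) / control_n + p2 * (1 - p2) / variant_n) ** 0.5
z_crit = NormalDist().inv_cdf(0.975)  # ~1.96 for a 95% interval

lo, hi = diff - z_crit * se, diff + z_crit * se
print(f"lift = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

Here the interval runs from roughly a 0.2-point to a 1.4-point absolute lift: significant, but the low end may or may not clear your bar for practical significance. That range, not the point estimate, is what should drive the ship decision.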
Where PMs get it wrong
Peeking at results early. This is the number one mistake. Every time you check results before your test reaches the required sample size, you inflate your false positive rate. Set your sample size upfront using power analysis, then walk away until it’s done.
Ignoring sample size requirements. Power analysis tells you how many users you need per variant to detect a meaningful difference. Skip this step, and you’ll either call a test too early or run it far longer than necessary.
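A power analysis is a one-liner’s worth of arithmetic. This sketch uses the standard normal-approximation formula for comparing two proportions; the baseline rate and MDE are hypothetical inputs:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift of
    `mde_abs` over `baseline` (two-sided test, equal traffic per variant)."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Hypothetical: 5% baseline conversion, want to detect a 1-point absolute lift
n = sample_size_per_variant(0.05, 0.01)
print(f"~{n} users per variant")
```

Note how the required n explodes as the MDE shrinks: halving the detectable lift roughly quadruples the sample you need. That trade-off is the real output of a power analysis.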
Treating “not significant” as “no effect.” A non-significant result doesn’t prove that nothing happened. It means you didn’t collect enough evidence. This distinction matters when deciding whether to iterate or kill a feature.
A quick real-world framework
Before launching any A/B test, answer these four questions:
What’s my primary metric and minimum detectable effect (MDE)?
How many users do I need per variant (use a sample size calculator)?
How long will it take to reach that sample size, given my traffic?
What decision will I make for each possible outcome: significant win, significant loss, or inconclusive?
This prevents the “now what?” moment after results come in.
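Question three is the one teams most often skip, so here is a back-of-the-envelope sketch. All numbers are hypothetical placeholders; the per-variant sample size would come from a power calculation like the one above:

```python
# Hypothetical pre-launch inputs
required_per_variant = 8_200    # e.g. from a sample size calculator
variants = 2                    # control + one treatment
eligible_daily_traffic = 1_500  # users entering the experiment per day

# Days of runtime needed before the test can be read
days = (required_per_variant * variants) / eligible_daily_traffic
print(f"Plan to run at least {days:.0f} days before reading results")
```

If the answer comes back as six weeks and your roadmap can’t wait, that’s a signal to raise the MDE or rethink the test, before launch, not after.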
The takeaway
Statistical significance is your quality gate for product decisions. It separates signal from noise. But it’s only one input. Pair it with effect size, segment analysis, and business context before shipping anything.
Next time someone on your team says “the test is significant,” ask them: significant and meaningful, or just significant?
Until next time,
Samet Özkale, AI for Product Power
AI Product & Design Manager | samet.works


