
# Introduction
When applying for a job at Meta (formerly Facebook), Apple, Amazon, Netflix, or Alphabet (Google), collectively referred to as FAANG, interviews rarely test whether you can recite textbook definitions. Instead, interviewers want to see whether you analyze data critically and whether you can identify a bad analysis before it ships to production. Statistical traps are one of the most reliable ways to test that.
These pitfalls reflect the kinds of decisions analysts face every day: a dashboard number that looks fine but is actually misleading, or an experiment result that seems actionable but contains a structural flaw. The interviewer already knows the answer. What they are watching is your thought process: whether you ask the right questions, notice missing information, and push back on a number that looks good at first glance. Candidates stumble over these traps repeatedly, even those with strong mathematical backgrounds.
We'll examine five of the most common traps.
# Understanding Simpson’s Paradox
This trap is designed to catch people who unquestioningly trust aggregated numbers.
Simpson's paradox happens when a trend appears in several groups of data but vanishes or reverses when those groups are combined. The classic example is UC Berkeley's 1973 admissions data: overall admission rates favored men, but when broken down by department, women had equal or better admission rates. The aggregate number was misleading because women applied to more competitive departments.
The paradox can arise whenever groups have different sizes and different base rates. Understanding that is what separates a surface-level answer from a deep one.
In interviews, a question might look like this: "We ran an A/B test. Overall, variant B had a higher conversion rate. However, when we break it down by device type, variant A performed better on both mobile and desktop. What is happening?" A strong candidate names Simpson's paradox, explains its cause (group proportions differ between the two variants), and asks to see the breakdown rather than trusting the aggregate figure.
Interviewers use this to check whether you instinctively ask about subgroup distributions. If you just report the overall number, you have lost points.
// Demonstrating With A/B Test Data
In the following demonstration using pandas, we can see how the aggregate rate can be misleading.
```python
import pandas as pd

# A wins on each device individually, but B wins in aggregate
# because B gets most of its traffic from higher-converting mobile.
data = pd.DataFrame({
    'device': ['mobile', 'mobile', 'desktop', 'desktop'],
    'variant': ['A', 'B', 'A', 'B'],
    'converts': [90, 765, 95, 10],
    'visitors': [100, 900, 900, 100],
})
data['rate'] = data['converts'] / data['visitors']

print('Per device:')
print(data[['device', 'variant', 'rate']].to_string(index=False))

print('\nAggregate (misleading):')
agg = data.groupby('variant')[['converts', 'visitors']].sum()
agg['rate'] = agg['converts'] / agg['visitors']
print(agg['rate'])
```
In the output, variant A leads on both devices (0.90 vs. 0.85 on mobile, roughly 0.106 vs. 0.10 on desktop), yet the aggregate rates flip to favor B (0.775 vs. 0.185), because B's traffic is concentrated on high-converting mobile.
# Identifying Selection Bias
This test lets interviewers assess whether you think about where data comes from before analyzing it.
Selection bias arises when the data you have is not representative of the population you are trying to understand. Because the bias lives in the data collection process rather than in the analysis, it is easy to overlook.
Consider these possible interview framings:
- We analyzed a survey of our users and found that 80% are satisfied with the product. Does that tell us our product is good? A solid candidate would point out that satisfied users are more likely to respond to surveys. The 80% figure probably overstates satisfaction, since unhappy users most likely chose not to participate.
- We examined customers who left last quarter and discovered they mostly had poor engagement scores. Should we focus on engagement to reduce churn? The problem here is that you only have engagement data for churned users. Without the same data for users who stayed, it is impossible to know whether low engagement actually predicts churn or is simply a characteristic of churned users in general.
A related variant worth knowing is survivorship bias: you only observe the outcomes that made it through some filter. If you only use data from successful products to analyze why they succeeded, you are ignoring the products that failed for the same traits you are treating as strengths.
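To make that filter effect concrete, here is a toy simulation (the "bold" strategy and all numbers are hypothetical, invented for illustration): a bold strategy adds variance but no expected benefit, yet it dominates among the survivors you would end up studying.

```python
import numpy as np

np.random.seed(7)
n = 100_000

# Hypothetical: half the products take a "bold" strategy
bold = np.random.rand(n) < 0.5

# Bold adds variance but no expected benefit: same mean, wider spread
outcome = np.where(bold,
                   np.random.normal(0, 3, n),   # bold: higher variance
                   np.random.normal(0, 1, n))   # safe: lower variance

# Survivorship filter: only the top 1% of outcomes get studied
survivors = outcome > np.percentile(outcome, 99)

print(f"Share of bold products overall:         {bold.mean():.0%}")
print(f"Share of bold products among survivors: {bold[survivors].mean():.0%}")
```

Roughly half the products are bold overall, but nearly all of the top performers are, because the wider distribution dominates the upper tail. Study only the winners and boldness looks like a success factor, even though it has zero expected payoff here.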
// Simulating Survey Non-Response
We can simulate how non-response bias skews results using NumPy.
```python
import numpy as np

np.random.seed(42)

# Simulate users where satisfied users are more likely to respond
satisfaction = np.random.choice([0, 1], size=1000, p=[0.5, 0.5])

# Response probability: 80% for satisfied, 20% for unsatisfied
response_prob = np.where(satisfaction == 1, 0.8, 0.2)
responded = np.random.rand(1000) < response_prob

print(f"True satisfaction rate: {satisfaction.mean():.2%}")
print(f"Survey satisfaction rate: {satisfaction[responded].mean():.2%}")
```
The true satisfaction rate lands near 50%, while the survey rate comes out around 80%: the gap is created entirely by who chose to respond.
Interviewers use selection bias questions to see whether you separate "what the data shows" from "what is true about users."
# Preventing p-Hacking
p-hacking (also called data dredging) happens when you run many tests and only report the ones with p < 0.05.
The issue is that p-values are only meaningful for individual tests. If you run 20 tests at a 5% significance level, one false positive is expected by chance alone. Fishing for a significant result inflates the false discovery rate.
An interviewer might ask: "Last quarter, we ran fifteen feature experiments. Three came out significant at p < 0.05. Should we ship all three?" A weak answer says yes.
A strong answer first asks what the hypotheses were before the tests were run, whether the significance threshold was set up front, and whether the team corrected for multiple comparisons.
The follow-up often covers how you would design experiments to avoid this. Pre-registering hypotheses before data collection is the most direct fix, because it removes the option to decide after the fact which tests were "real."
// Watching False Positives Accumulate
We can watch false positives occur by chance using SciPy.
```python
import numpy as np
from scipy import stats

np.random.seed(0)

# 20 A/B tests where the null hypothesis is TRUE (no real effect)
n_tests, alpha = 20, 0.05
false_positives = 0
for _ in range(n_tests):
    a = np.random.normal(0, 1, 1000)
    b = np.random.normal(0, 1, 1000)  # identical distribution!
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f'Tests run: {n_tests}')
print(f'False positives (p<0.05): {false_positives}')
print(f'Expected by chance alone: {n_tests * alpha:.0f}')
```
Even with zero real effect, roughly 1 in 20 tests clears p < 0.05 by chance. If a team runs 15 experiments and reports only the significant ones, those results are most likely noise.
It is equally important to treat exploratory analysis as a form of hypothesis generation rather than confirmation. Before anyone acts on an exploratory result, a confirmatory experiment is needed.
# Managing Multiple Testing
This trap is closely related to p-hacking, but it is worth understanding on its own.
The multiple testing problem is the formal statistical issue: when you run many hypothesis tests simultaneously, the probability of at least one false positive grows quickly. Even when the treatment has no effect, you should expect roughly five false positives if you test 100 metrics in an A/B test and declare anything with p < 0.05 significant.
The corrections for this are well known: the Bonferroni correction (divide alpha by the number of tests) and Benjamini-Hochberg (which controls the false discovery rate rather than the family-wise error rate).
Bonferroni is conservative: if you test 50 metrics, your per-test threshold drops to 0.05 / 50 = 0.001, making real effects harder to detect. Benjamini-Hochberg is more appropriate when you are willing to accept some false discoveries in exchange for more statistical power.
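As a minimal sketch of both corrections, here is how statsmodels' multipletests applies them; the 50 pure-noise metrics are an assumption invented for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

np.random.seed(1)

# Hypothetical: p-values for 50 metrics where nothing actually changed
pvals = np.random.uniform(0, 1, 50)

# Naive rule: anything under 0.05 counts as "significant"
print('Naive significant:', (pvals < 0.05).sum())

# Bonferroni: controls the family-wise error rate (per-test alpha = 0.001)
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method='bonferroni')
print('Bonferroni significant:', reject_bonf.sum())

# Benjamini-Hochberg: controls the false discovery rate
reject_bh, _, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print('Benjamini-Hochberg significant:', reject_bh.sum())
```

With pure-noise p-values, the naive rule typically flags a couple of metrics while both corrections flag few or none, which is exactly the behavior you want when nothing real changed.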
In interviews, this comes up when discussing how a company tracks experiment metrics. A question might be: "We monitor 50 metrics per experiment. How do you decide which ones matter?" A solid response discusses pre-specifying primary metrics before the experiment runs and treating secondary metrics as exploratory, while acknowledging the multiple testing problem.
Interviewers want to find out whether you understand that running more tests produces more noise, not more information.
# Addressing Confounding Variables
This trap catches candidates who treat correlation as causation without asking what else might explain the relationship.
A confounding variable is one that influences both the independent and dependent variables, creating the illusion of a direct relationship where none exists.
The classic example: ice cream sales and drowning rates are correlated, but the confounder is summer heat; both rise in warm months. Acting on that correlation without accounting for the confounder leads to bad decisions.
Confounding is particularly dangerous in observational data. Unlike a randomized experiment, observational data does not distribute potential confounders evenly between groups, so the differences you see might not be caused by the variable you are studying at all.
A common interview framing is: "We noticed that users who use our mobile app more tend to have significantly higher revenue. Should we push notifications to increase app opens?" A weak candidate says yes. A strong one asks what kind of user opens the app frequently in the first place: likely the most engaged, highest-value users.
Engagement drives both app opens and spending. The app opens are not causing revenue; they are a symptom of the same underlying user quality.
Interviewers use confounding to test whether you distinguish correlation from causation before drawing conclusions, and whether you would push for randomized experimentation or propensity score matching before recommending action.
// Simulating A Confounded Relationship
The simulation below builds data where user quality drives both app opens and revenue, then compares the naive correlation with the within-group correlations.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
n = 1000

# Confounder: user quality (0 = low, 1 = high)
user_quality = np.random.binomial(1, 0.5, n)

# App opens driven by user quality, not independent
app_opens = user_quality * 5 + np.random.normal(0, 1, n)

# Revenue also driven by user quality, not by app opens
revenue = user_quality * 100 + np.random.normal(0, 10, n)

df = pd.DataFrame({
    'user_quality': user_quality,
    'app_opens': app_opens,
    'revenue': revenue,
})

# Naive correlation looks strong, which is misleading
naive_corr = df['app_opens'].corr(df['revenue'])

# Within-group correlation (controlling for the confounder) is near zero
low = df[df['user_quality'] == 0]
high = df[df['user_quality'] == 1]
corr_low = low['app_opens'].corr(low['revenue'])
corr_high = high['app_opens'].corr(high['revenue'])

print(f"Naive correlation (app opens vs revenue): {naive_corr:.2f}")
print("Correlation controlling for user quality:")
print(f"  Low-quality users: {corr_low:.2f}")
print(f"  High-quality users: {corr_high:.2f}")
```
Output:
```
Naive correlation (app opens vs revenue): 0.91
Correlation controlling for user quality:
  Low-quality users: 0.03
  High-quality users: -0.07
```
The naive number looks like a strong signal. Once you control for the confounder, it disappears entirely. An interviewer who sees a candidate run this kind of stratified check, rather than accepting the aggregate correlation, knows they are talking to someone who will not ship a broken recommendation.
# Wrapping Up
All five of these traps have something in common: they require you to slow down and question the data before accepting what the numbers seem to show at first glance. Interviewers use these scenarios precisely because your first instinct is often wrong, and the depth of your answer after that first instinct is what separates a candidate who can work independently from one who needs direction on every analysis.

None of these ideas are obscure, and interviewers ask about them because they are typical failure modes in real data work. The candidate who recognizes Simpson's paradox in a product metric, catches a selection bias in a survey, or questions whether an experiment result survived multiple comparisons is the one who will ship fewer bad decisions.
If you go into FAANG interviews with a reflex to ask the following questions, you are already ahead of most candidates:
- How was this data collected?
- Are there subgroups that tell a different story?
- How many tests contributed to this result?
Beyond helping in interviews, these habits can also prevent bad decisions from reaching production.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.