Statistics Every Trader Should Know
Statistics is the backbone of quantitative trading. Every strategy, every risk metric, every signal is grounded in statistical concepts. This guide covers the essential statistics every trader and quant should know, with financial examples throughout.
Mean, Median, and Standard Deviation
You probably know these, but let's apply them to finance:
    import numpy as np

    # Simulate 252 trading days of stock returns
    np.random.seed(42)
    returns = np.random.normal(0.0004, 0.015, 252)  # mean ~0.04%/day, std ~1.5%/day

    mean_return = np.mean(returns)
    median_return = np.median(returns)
    std_return = np.std(returns)  # population std (ddof=0); pass ddof=1 for the sample estimate

    # Annualise: multiply mean by 252, multiply std by sqrt(252)
    print(f"Mean daily return: {mean_return:.4f} ({mean_return*252:.2%} annualised)")
    print(f"Median daily return: {median_return:.4f}")
    print(f"Std deviation: {std_return:.4f} ({std_return*np.sqrt(252):.2%} annualised)")

Why Standard Deviation = Risk
In finance, standard deviation IS volatility. A stock with a 1% daily standard deviation typically moves about 1% per day, up or down. That is about 16% per year (1% x sqrt(252)), assuming daily returns are independent of each other. Higher standard deviation = more unpredictable = more risk.
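To make the annualisation arithmetic concrete, here is a minimal sketch using only the Python standard library. The helper names and the 252-trading-day convention are assumptions for illustration, and the square-root rule only holds if daily returns are independent.

```python
import math

TRADING_DAYS = 252  # assumed number of trading days in a year


def annualise_mean(daily_mean, periods=TRADING_DAYS):
    """Scale an average daily return to a yearly figure (simple, no compounding)."""
    return daily_mean * periods


def annualise_volatility(daily_std, periods=TRADING_DAYS):
    """Scale daily volatility to yearly volatility with the square-root-of-time rule."""
    return daily_std * math.sqrt(periods)


annual_vol = annualise_volatility(0.01)  # 1% daily standard deviation
annual_mean = annualise_mean(0.0004)     # 0.04% average daily return

print(f"Annualised volatility: {annual_vol:.2%}")
print(f"Annualised mean return: {annual_mean:.2%}")
```

The mean scales with the number of days while volatility scales with its square root, which is why a 1% daily standard deviation turns into roughly 16% per year rather than 252%.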
The Normal Distribution (And Why Returns Are Not Normal)
Many financial models assume returns follow a bell curve (normal distribution). In reality, extreme events happen much more often than a bell curve predicts. These are called "fat tails."
    from scipy import stats
    import yfinance as yf

    # Real stock returns
    spy = yf.download("SPY", start="2000-01-01", end="2024-01-01")
    # squeeze() gives a 1-D Series even if yfinance returns "Close" as a one-column DataFrame
    real_returns = spy["Close"].squeeze().pct_change().dropna()

    # Compare to normal distribution
    # Normal distribution has (excess) kurtosis = 0; real returns have kurtosis > 0 (fat tails)
    print(f"Kurtosis: {stats.kurtosis(real_returns):.2f}")
    # Negative skew means more extreme negative returns than positive
    print(f"Skewness: {stats.skew(real_returns):.2f}")

Fat Tails Kill
The 2008 crash, the COVID crash, and the Flash Crash were all events that a normal distribution says should happen once in thousands of years. They happened within 12 years of each other. Any model that assumes normality will underestimate tail risk.
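As a rough illustration of how rare large moves are supposed to be under normality, the sketch below converts a move of k standard deviations into an expected waiting time. The function names are hypothetical, and the calculation assumes independent, normally distributed daily returns and 252 trading days per year.

```python
import math

TRADING_DAYS = 252  # assumed number of trading days in a year


def normal_tail_probability(k):
    """Probability that a normal return lands more than k std devs from its mean (either side)."""
    return math.erfc(k / math.sqrt(2))


def expected_years_between(k, periods=TRADING_DAYS):
    """Average wait, in years, between daily moves of at least k standard deviations."""
    return 1 / (normal_tail_probability(k) * periods)


for k in (3, 5, 7):
    p = normal_tail_probability(k)
    years = expected_years_between(k)
    print(f"{k}-sigma day: probability {p:.2e}, about once every {years:,.1f} years")
```

Under these assumptions a 3-sigma day shows up every year or two, while a 5-sigma day should be thousands of years apart. Real markets produce such days far more often, which is the fat-tail problem in numbers.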
Correlation vs Causation
Correlation measures whether two things move together. It does NOT mean one causes the other. Ice cream sales and drowning deaths are correlated because both increase in summer, not because ice cream causes drowning.
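The ice cream example can be simulated in a few lines. This is a toy sketch with made-up numbers: a hidden common driver (standing in for temperature) feeds two otherwise unrelated series, and the `pearson` helper is written out by hand so that only the standard library is needed.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
n_days = 2000

# Hidden common driver (think: temperature) plus independent noise for each series
temperature = [rng.gauss(0, 1) for _ in range(n_days)]
ice_cream_noise = [rng.gauss(0, 0.5) for _ in range(n_days)]
drowning_noise = [rng.gauss(0, 0.5) for _ in range(n_days)]

ice_cream = [t + e for t, e in zip(temperature, ice_cream_noise)]
drownings = [t + e for t, e in zip(temperature, drowning_noise)]

corr_observed = pearson(ice_cream, drownings)
corr_without_driver = pearson(ice_cream_noise, drowning_noise)

print(f"Correlation of the two series: {corr_observed:.2f}")
print(f"Correlation once the common driver is removed: {corr_without_driver:.2f}")
```

The two series are strongly correlated even though neither one influences the other. Once the common driver is stripped out, the correlation drops to roughly zero.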
    # Correlation between two stocks
    import yfinance as yf

    data = yf.download(["AAPL", "MSFT"], start="2023-01-01", end="2024-01-01")["Close"]
    returns = data.pct_change().dropna()

    correlation = returns["AAPL"].corr(returns["MSFT"])
    print(f"AAPL-MSFT correlation: {correlation:.3f}")
    # Typically high for two big tech stocks (the exact value depends on the period)
    # But AAPL going up does NOT cause MSFT to go up

Regression: Does X Predict Y?
    from scipy import stats
    import yfinance as yf

    # Does the market (SPY) predict individual stock returns?
    # squeeze() gives a 1-D Series even if yfinance returns "Close" as a one-column DataFrame
    spy = yf.download("SPY", start="2023-01-01", end="2024-01-01")["Close"].squeeze().pct_change().dropna()
    aapl = yf.download("AAPL", start="2023-01-01", end="2024-01-01")["Close"].squeeze().pct_change().dropna()

    # Align dates
    common = spy.index.intersection(aapl.index)
    spy_aligned = spy.loc[common]
    aapl_aligned = aapl.loc[common]

    # Linear regression: AAPL = alpha + beta * SPY
    slope, intercept, r_value, p_value, std_err = stats.linregress(spy_aligned, aapl_aligned)
    print(f"Beta (slope): {slope:.3f}")    # How much AAPL moves per 1% SPY move
    print(f"Alpha: {intercept:.6f}")       # Daily excess return (positive = outperforming)
    print(f"R-squared: {r_value**2:.3f}")  # Share of AAPL's variance explained by SPY
    print(f"P-value: {p_value:.6f}")       # Is this relationship statistically significant?

P-Values: When Is a Result Significant?
P-Value in Plain English
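The p-value answers: If there were NO real relationship, how likely is it that I would see results this extreme just by chance? A p-value below 0.05 means that, if there were no real relationship, results this extreme would turn up less than 5% of the time. It is not the probability that your result is random noise. And be careful: with enough data, even a tiny effect becomes statistically significant, so always check the size of the effect as well as its p-value.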
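The effect of sample size on significance can be shown with a simple z-test on a mean return. This is a sketch under stated assumptions: the edge and volatility figures are hypothetical, and the normal approximation stands in for a proper t-test.

```python
import math


def two_sided_p_value(mean, std, n):
    """p-value for the hypothesis 'the true mean is zero', using a normal-approximation z-test."""
    z = mean / (std / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


tiny_edge = 0.0001  # hypothetical average daily return of 0.01%
daily_std = 0.015   # hypothetical daily standard deviation of 1.5%

p_one_year = two_sided_p_value(tiny_edge, daily_std, 252)
p_huge_sample = two_sided_p_value(tiny_edge, daily_std, 1_000_000)

print(f"252 observations:       p = {p_one_year:.3f}")
print(f"1,000,000 observations: p = {p_huge_sample:.2e}")
```

The effect is identical in both cases. Only the sample size changes, and that alone moves the result from nowhere near significant to overwhelmingly significant.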
Survivorship Bias
If you backtest a strategy on today's S&P 500 stocks, you are only testing on companies that survived until today. All the companies that went bankrupt, got delisted, or were acquired are excluded. This makes your backtest look better than reality.
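A toy calculation shows the size of the distortion. The tickers and returns below are invented purely for illustration. The point is only the gap between the two averages.

```python
# Hypothetical universe: (ticker, total return over the backtest, still listed today?)
universe = [
    ("AAA", 0.80, True),
    ("BBB", 0.35, True),
    ("CCC", 0.10, True),
    ("DDD", -1.00, False),  # went bankrupt
    ("EEE", -0.60, False),  # delisted after a collapse
]


def average_return(stocks):
    """Equal-weighted average return of a list of (ticker, return, survived) tuples."""
    return sum(ret for _, ret, _ in stocks) / len(stocks)


survivors = [stock for stock in universe if stock[2]]

survivors_only = average_return(survivors)
full_universe = average_return(universe)

print(f"Backtest on survivors only: {survivors_only:.1%}")
print(f"Backtest on full universe:  {full_universe:.1%}")
```

Testing only on the survivors turns a losing universe into an apparently profitable one, which is why point-in-time constituent lists matter.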
Look-Ahead Bias
Look-ahead bias means using information from the future in your historical backtest. For example, a backtest might use a company's earnings figures to make a "decision" the day before they were announced. It is easy to do accidentally and devastating to your results.
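A minimal sketch of the most common form of this bug, an off-by-one-day alignment between signal and return. The returns are simulated noise and the "strategy" is deliberately naive. Both are assumptions chosen only to make the bias visible.

```python
import random

rng = random.Random(7)
returns = [rng.gauss(0.0, 0.01) for _ in range(500)]  # simulated daily returns, pure noise


def sign(x):
    """+1 for a positive number, -1 otherwise."""
    return 1 if x > 0 else -1


# Signal for day t: long after an up day, short after a down day.
signals = [sign(r) for r in returns]

# Look-ahead bug: day t's signal, which already "knows" day t's return, trades day t's return.
biased_pnl = sum(s * r for s, r in zip(signals, returns))

# Honest version: day t's signal can only be traded on day t+1.
honest_pnl = sum(s * r for s, r in zip(signals[:-1], returns[1:]))

print(f"PnL with look-ahead:    {biased_pnl:.2%}")
print(f"PnL without look-ahead: {honest_pnl:.2%}")
```

On pure noise the honest version has no edge, while the biased version never loses because it is trading on returns it has already seen. In pandas, lagging the signal with `shift(1)` before multiplying by returns is the usual way to avoid this.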