CalcHub
Beginner

Statistics Every Trader Should Know

Statistics is the backbone of quantitative trading. Every strategy, every risk metric, every signal is grounded in statistical concepts. This guide covers the essential statistics every trader and quant should know, with financial examples throughout.

Mean, Median, and Standard Deviation

You probably know these, but let's apply them to finance:

```python
import numpy as np

# Simulating 252 days of stock returns
np.random.seed(42)
returns = np.random.normal(0.0004, 0.015, 252)  # mean ~0.04%/day, std ~1.5%/day

mean_return = np.mean(returns)
median_return = np.median(returns)
std_return = np.std(returns)

print(f"Mean daily return:   {mean_return:.4f} ({mean_return*252:.2%} annualised)")
print(f"Median daily return: {median_return:.4f}")
print(f"Std deviation:       {std_return:.4f} ({std_return*np.sqrt(252):.2%} annualised)")

# Annualise: multiply mean by 252, multiply std by sqrt(252)
```

Why Standard Deviation = Risk

In finance, standard deviation IS volatility. A stock with 1% daily std moves about 1% per day on average. That is about 16% per year (1% x sqrt(252)). Higher std = more unpredictable = more risk.
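The annualisation arithmetic can be checked directly. A minimal sketch, using the 1% daily standard deviation from above as an illustrative assumption:

```python
import numpy as np

daily_std = 0.01  # assumed 1% daily standard deviation
annual_std = daily_std * np.sqrt(252)  # 252 trading days per year

print(f"Annualised volatility: {annual_std:.1%}")  # roughly 16%
```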

The Normal Distribution (And Why Returns Are Not Normal)

Many financial models assume returns follow a bell curve (normal distribution). In reality, extreme events happen much more often than a bell curve predicts. These are called "fat tails."

```python
from scipy import stats
import yfinance as yf

# Real stock returns
spy = yf.download("SPY", start="2000-01-01", end="2024-01-01")
# squeeze() in case yfinance returns a one-column DataFrame
real_returns = spy["Close"].pct_change().dropna().squeeze()

# Compare to a normal distribution
print(f"Kurtosis: {stats.kurtosis(real_returns):.2f}")
# A normal distribution has (excess) kurtosis = 0
# Real returns have kurtosis > 0 (fat tails)

print(f"Skewness: {stats.skew(real_returns):.2f}")
# Negative skew means more extreme negative returns than positive
```

Fat Tails Kill

The 2008 crash, the COVID crash, and the Flash Crash were all events that a normal distribution says should happen roughly once in thousands of years. They happened within 12 years of each other. Any model that assumes normality will underestimate tail risk.
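You can ask a normal model directly how likely a crash-scale day is. A sketch with illustrative assumptions (1% daily volatility, a -10% one-day move, which is roughly the scale of October 2008 or March 2020):

```python
from scipy import stats

daily_std = 0.01  # assumed 1% daily volatility
crash = -0.10     # a -10% single day (illustrative crash-scale move)

# Probability of a move this bad or worse on any given day, if returns were normal
p = stats.norm.cdf(crash, loc=0, scale=daily_std)  # P(return <= -10%), a 10-sigma event

# Expected waiting time in years (252 trading days per year)
years = 1 / (p * 252)
print(f"P(one-day return <= -10%): {p:.2e}")
print(f"Expected once every {years:.2e} years under normality")
```

The normal model puts the waiting time far beyond the age of the universe, which is exactly why fat tails matter.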

Correlation vs Causation

Correlation measures whether two things move together. It does NOT mean one causes the other. Ice cream sales and drowning deaths are correlated because both increase in summer, not because ice cream causes drowning.

```python
import yfinance as yf

# Correlation between two stocks
data = yf.download(["AAPL", "MSFT"], start="2023-01-01", end="2024-01-01")["Close"]
returns = data.pct_change().dropna()

correlation = returns["AAPL"].corr(returns["MSFT"])
print(f"AAPL-MSFT correlation: {correlation:.3f}")
# Will be high (~0.7-0.9) because both are big tech
# But AAPL going up does NOT cause MSFT to go up
```

Regression: Does X Predict Y?

```python
from scipy import stats
import yfinance as yf

# Does the market (SPY) predict individual stock returns?
spy = yf.download("SPY", start="2023-01-01", end="2024-01-01")["Close"].pct_change().dropna().squeeze()
aapl = yf.download("AAPL", start="2023-01-01", end="2024-01-01")["Close"].pct_change().dropna().squeeze()

# Align dates
common = spy.index.intersection(aapl.index)
spy_aligned = spy.loc[common]
aapl_aligned = aapl.loc[common]

# Linear regression: AAPL = alpha + beta * SPY
slope, intercept, r_value, p_value, std_err = stats.linregress(spy_aligned, aapl_aligned)

print(f"Beta (slope):  {slope:.3f}")      # How much AAPL moves per 1% SPY move
print(f"Alpha:         {intercept:.6f}")  # Excess return (positive = outperforming)
print(f"R-squared:     {r_value**2:.3f}") # Share of AAPL's variance explained by SPY
print(f"P-value:       {p_value:.6f}")    # Is this relationship statistically significant?
```

P-Values: When Is a Result Significant?

P-Value in Plain English

The p-value answers: if there were NO real relationship, how likely is it that I would see results this extreme just by chance? A p-value below 0.05 means that, if there were no real effect, results this extreme would turn up less than 5% of the time. It is NOT the probability that your result is random noise. And be careful: with enough data, everything becomes significant, even if the effect is tiny.
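That last caveat is easy to demonstrate with simulated data (all numbers here are invented for illustration): a microscopic effect, given a million observations, produces a "highly significant" p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A tiny true effect: mean 0.0001 against a null of 0, with std 0.01
tiny_effect = rng.normal(0.0001, 0.01, 1_000_000)

# One-sample t-test against zero
t_stat, p_value = stats.ttest_1samp(tiny_effect, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
# p is minuscule, yet the effect (0.01% per day) may be smaller than trading costs
```

Significance tells you an effect is probably real; it says nothing about whether it is large enough to trade.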

Survivorship Bias

If you backtest a strategy on today's S&P 500 stocks, you are only testing on companies that survived until today. All the companies that went bankrupt, got delisted, or were acquired are excluded. This makes your backtest look better than reality.
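A simulated sketch of the effect (all numbers invented): generate a universe of stocks, let the worst performers "delist", and compare the average return of the survivors to the full universe.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate one year of returns for 1,000 stocks
all_returns = rng.normal(0.05, 0.30, 1000)

# Suppose stocks that lost more than 40% were delisted and vanish from today's index
survivors = all_returns[all_returns > -0.40]

print(f"Full universe mean return: {all_returns.mean():.2%}")
print(f"Survivors-only mean:       {survivors.mean():.2%}")
# A backtest run only on survivors overstates the average return
```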

Look-Ahead Bias

Look-ahead bias means using information from the future in your historical backtest. For example, using a company's earnings announcement to make a "decision" the day before the announcement happened. It is easy to do accidentally and devastating to your results.
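A common accidental form of this in pandas: building a signal from the same day's close and "trading" on it that same day. A sketch with an invented price series, showing the standard fix of lagging the signal with shift(1):

```python
import pandas as pd

prices = pd.Series([100, 102, 101, 105, 107, 104], name="close")
returns = prices.pct_change()

# Signal: "today's return was positive"
signal = (returns > 0).astype(int)

# WRONG: trading today on a signal computed from today's close (look-ahead)
biased_pnl = (signal * returns).sum()

# RIGHT: trade tomorrow on today's signal, i.e. lag the signal by one day
honest_pnl = (signal.shift(1) * returns).sum()

print(f"With look-ahead: {biased_pnl:.4f}")
print(f"Without:         {honest_pnl:.4f}")
```

The biased version "knows" each day's return before taking the position, so it always looks better than the honest one.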