Frequentist statistics

Summary

 * Q: What kinds of distributions appear in nature and why? How spread out are distributions? How can we get a handle on complex distributions?
 * Ideas:
    * Refresher on probability: independence, conditional probability, variance
    * Chernoff/Azuma-Hoeffding
    * Martingalery
    * Normal distributions, log-normal distributions, power laws
 * Exercises/skills:
    * Applications of probability (to tricky questions)
    * Probability questions (about concentration)
    * Applications of martingales (to tricky questions)
    * Intuitive statistics

Logistics

 * Time: rough guess ~ 3 hours [Paul]
 * Teacher: ?
 * Pairs with assigning probabilities / quick and dirty reasoning

Outline

 * We'll start with a very fast intro to probability (this may be inadvisable!):
 * When we want to apply probability theory, we start with a set of possible worlds. For example, if I flip a coin over and over again, infinitely many times, there is a possible world for each infinite sequence 0011001001…
 * Then we introduce something called a *measure*, which assigns to each set S of possible worlds a probability P(S).
 * The "probability" of any single sequence 00100101000… is 0…
 * But we can talk about the probability of larger sets. For example, the probability of the set of worlds in which the first coin flip is 0, { 0s… }, is 1/2.
 * P satisfies some natural axioms; in particular, if A and B are disjoint, P(A union B) = P(A) + P(B), and for any set S, 0 <= P(S) <= 1.
 * A random variable is just a function on possible worlds. For example, the function which maps a sequence to its first element is a random variable, as is the function that maps a sequence to the sum of 2^{-i} over each i where the coin came up heads.
 * The *expectation* of a random variable X is defined as an integral against the measure: E[X] = int_w X(w) dP(w). Think of this as exactly analogous to the sum when there are only finitely many possible worlds: E[X] = sum_w P(w) X(w). If X is positive we could also define it as a supremum over partitions: E[X] = sup_{partitions Pi} sum_i P(Pi_i) * min_{w in Pi_i} X(w). (picture)
 * E[X+Y] = E[X] + E[Y]
 * Perhaps the most important phenomenon in probability theory is independence. Formally, events A and B are independent if P(A cap B) = P(A)P(B). Two random variables X and Y are independent if every event defined in terms of X is independent of every event defined in terms of Y.
 * If X and Y are independent, then E[X*Y] = E[X] * E[Y]
 * Note that E[X*Y] = E[X] * E[Y] is only guaranteed to hold when X and Y are independent, but E[X+Y] = E[X] + E[Y] is *always* true (both identities are sanity-checked in the simulation sketch below).
 * Independence captures the intuitive notion of unrelatedness, which makes it very powerful. For example, if X, Y, and Z are independent, then f(X, Y) is independent of Z.
 * (Conditional probability discussion?)
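A minimal Monte Carlo sketch of the two identities above: linearity of expectation holds unconditionally, while the product rule needs independence. The uniform variables and all names here are illustrative choices, not from the outline:

```python
# Estimate E[X+Y] and E[X*Y] by sampling, once with independent X, Y
# and once with Y deliberately dependent on X (Y = X).
import random

N = 1_000_000
xs = [random.random() for _ in range(N)]        # X ~ Uniform[0, 1], E[X] = 1/2
ys_indep = [random.random() for _ in range(N)]  # Y independent of X
ys_dep = xs                                     # Y = X: maximally dependent

def mean(vs):
    return sum(vs) / len(vs)

# Linearity of expectation holds in both cases (both print ~1.0):
print(mean([x + y for x, y in zip(xs, ys_indep)]))
print(mean([x + y for x, y in zip(xs, ys_dep)]))

# The product rule holds only under independence:
print(mean([x * y for x, y in zip(xs, ys_indep)]))  # ~0.25 = E[X] * E[Y]
print(mean([x * y for x, y in zip(xs, ys_dep)]))    # ~0.333 = E[X^2] != 1/4
```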


 * Probability distributions are messy, complicated objects, but for distributions over real numbers there are often basically two things you care about:
 * Mean
 * Concentration about the mean
 * E.g.:
 * When I'm trying to estimate something, I care about what the best estimate is, but I also care how confident I should be--i.e., how far is the truth likely to be from my estimate?
 * When making a risky investment, I care about how much I expect to make, but I also care about how "lumpy" the winnings are--since I value money less in worlds where I'm richer, I like investments less if most of the expected winnings occur in worlds where I'm already rich.
 * When trying to estimate the "surprisingness" of an observation (e.g., 65 heads out of 100 coin flips), I need to know not only the expected value, but how close to the expected value the observation should be.
 * The mean of a random variable isn't so complicated, thanks to the fact that E[X+Y] = E[X] + E[Y].
 * Talking about the concentration of random variables is not so straightforward in general.
 * One good measure of the concentration of a random variable is its variance, Var(X) = E[ (X - E[X])^2 ] = E[X^2] - E[X]^2. (prove second equality as exercise)
 * Var(X+Y) = ?
 * We have (X+Y) - E[X+Y] = (X - E[X]) + (Y - E[Y]); squaring gives (X - E[X])^2 + 2 (X - E[X]) (Y - E[Y]) + (Y - E[Y])^2.
 * If X and Y are independent, E[ (X - E[X]) (Y - E[Y]) ] = E[ X - E[X] ] E[ Y - E[Y] ] = 0 * 0 = 0, so Var(X+Y) = Var(X) + Var(Y).
 * If Y = -X, then X+Y = 0, so Var(X+Y) = 0.
 * If Y = X, then Var(X+Y) = Var(2X) = E[ (2(X - E[X]))^2 ] = 4 Var(X) > 2 Var(X). (All three cases are checked numerically in the sketch below.)
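A quick numerical check of the three cases above; the uniform variables are an illustrative choice:

```python
# Var(X+Y) in three cases: independent Y, Y = -X, and Y = X.
import random

N = 1_000_000

def var(vs):
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)

xs = [random.random() for _ in range(N)]  # Var(X) = 1/12 for Uniform[0, 1]
ys = [random.random() for _ in range(N)]  # independent copy

print(var([x + y for x, y in zip(xs, ys)]))  # ~1/6 = Var(X) + Var(Y)
print(var([x - x for x in xs]))              # Y = -X: exactly 0
print(var([x + x for x in xs]))              # Y = X: ~1/3 = 4 Var(X)
```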
 * What is the intuitive significance of variance?
 * How far do we expect X to be from its mean? Intuitively, since Var(X) is the expected square of the distance, the distance should be about sqrt(Var(X)), which we therefore call the standard deviation.
 * Formally (this is Chebyshev's inequality; the derivation is spelled out below):
 * Var(X) = E[ (X - E[X])^2 ] >= P( |X - E[X]| > N * sqrt(Var(X)) ) * N^2 Var(X),
 * so P( |X - E[X]| > N * sqrt(Var(X)) ) <= 1 / N^2.
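The middle step is just Markov's inequality applied to the nonnegative variable (X - E[X])^2; written out:

```latex
% Chebyshev from Markov: for Z >= 0 and t > 0, Markov gives
% E[Z] >= t * P(Z >= t). Take Z = (X - E[X])^2 and t = N^2 Var(X):
\[
  \mathrm{Var}(X) = \mathrm{E}\big[(X - \mathrm{E}[X])^2\big]
  \;\ge\; N^2\,\mathrm{Var}(X)\cdot
  \Pr\!\big[(X - \mathrm{E}[X])^2 \ge N^2\,\mathrm{Var}(X)\big]
\]
\[
  \Rightarrow\quad
  \Pr\!\big[\,|X - \mathrm{E}[X]| \ge N\sqrt{\mathrm{Var}(X)}\,\big]
  \;\le\; \frac{1}{N^2}.
\]
```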
 * These tools are enough to start making some interesting statements:
 * Suppose I flip an unbiased coin 10000 times and I count the number of heads. What can I say about the probability that the number of heads is more than 5200?
 * Var(one flip) = E[ (# of heads - E[# of heads])^2 ] = E[ (# of heads - 1/2)^2 ] = 1/4
 * Var(two flips) = Var(one flip + one flip) = 2 Var(one flip) by independence
 * Var(N flips) = N Var(one flip) = N/4.
 * So Var(10000 flips) = 2500
 * So sqrt(Var(10000 flips)) = 50
 * So P(# of heads > 5200) <= 1/16, since 5200 is N = 4 standard deviations above the mean of 5000. (Compared against the exact tail probability below.)
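For comparison, the exact binomial tail is far smaller than the Chebyshev bound; a two-line check (scipy is an illustrative tooling choice):

```python
# Chebyshev bound vs. the exact binomial tail for the example above.
from scipy.stats import binom

n, p = 10000, 0.5
print(binom.sf(5200, n, p))  # exact P(# heads > 5200), roughly 3e-05
print(1 / 16)                # the Chebyshev bound: 0.0625
```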
 * In general, if I take the average of N independent random variables, each of which is bounded, then the deviation from the expected average will be about 1 / sqrt(N). If I sample N people, I expect to get an estimate of the real population mean which is within about 1 / sqrt(N) of the truth (in units of the per-person standard deviation). If I take N little bets, I expect to gain or lose an amount which is roughly sqrt(N) times the average stakes.
 * But this bound is a little weak:
 * it gives P(# of heads > 6000) <= 1/400 (N = 20 standard deviations, 1/N^2 = 1/400)
 * which is a pretty small probability, but in fact nowhere near as small as the probability really is (see the numerical comparison at the end of this outline).
 * How can we do better?
 * Idea: we computed E[X^2] and showed that X can't often be much larger than sqrt(E[X^2]).
 * Instead, look at E[exp(tX)] for a parameter t > 0, and show that X can't often be much larger than log(E[exp(tX)]) / t.
 * [ Prove Chernoff ]
 * Maybe go on to martingalery
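A numerical illustration of how much the Chernoff method improves on Chebyshev for the 6000-heads tail above. The bound used here is the Hoeffding form exp(-2 N eps^2), one standard instance of the Chernoff method; scipy is again an illustrative tooling choice:

```python
# Chebyshev vs. Chernoff/Hoeffding vs. the exact tail for
# P(# heads >= 6000) out of 10000 fair coin flips.
import math
from scipy.stats import binom

n, p, k = 10000, 0.5, 6000
eps = k / n - p                     # deviation of the empirical mean: 0.1

print(1 / 400)                      # Chebyshev bound: 2.5e-03
print(math.exp(-2 * n * eps ** 2))  # Hoeffding bound: exp(-200) ~ 1.4e-87
print(binom.sf(k - 1, n, p))        # exact tail: smaller still
```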