# tl;dr mit 6.s085

a friend mentioned that this course is a good review of applied statistics. since it was pretty short, i decided to read through last weekend. it was a good recap of the basics. i’ve summarized the big chapter-wise tl;drs here. the case studies were very well picked. i would recommend reading through the examples in the lecture notes.

## basic statistics

ask the right questions. be ultra aware of what you are answering. fake polling data example: asking the right question is more important than using the most powerful test. examining even/odd parity is not an obvious question to ask.

base rate fallacy via xkcd: when the base rate of true effects is very, very low, even a modest false positive rate means that most of your positives are false. in that setting your threshold for establishing significance should be much stricter than usual.
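a quick sketch of the arithmetic behind the base rate fallacy, using bayes’ rule. the function name and all three numbers are made up for illustration; they are not from the course.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """P(effect is real | test came back positive), via bayes' rule."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# hypothetical numbers: 1 in 1000 effects is real, the test catches 99% of
# real effects, and it fires on 5% of the nulls
ppv = positive_predictive_value(0.001, 0.99, 0.05)
print(f"{ppv:.3f}")  # about 0.019: roughly 98% of the positives are false
```

tightening the false positive rate is the only lever that helps much here, which is why a 0.05 threshold is too lax when real effects are rare.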

statistics fails silently - understand what you are doing in a step by step manner.

complex effects: important to check for them by visualizing, as the analysis will require more sophisticated models. we need to watch out for multimodal data - means are not very accurate summaries. we also need to watch for skew. visualize the data: anscombe’s quartet is the perfect counterexample to using metrics without fully understanding their importance in context (all the graphs have the same $R^2$).

aggregate data vs individual data: immigrant population and literacy rates by state are correlated on the aggregate level. does this mean immigrants cause higher literacy rates? no. immigrants choose to settle in states with high literacy rates. individual level data shows that immigrants were in fact less likely to be literate. be ultra aware of which question you are answering. john kerry vs bush by gelman: more rich people voted bush and more poor people voted kerry, but kerry won the rich states and bush won the poor states. the rich people in the poor states voted heavily for bush. the few rich folks who voted kerry did so in the rich states. this led to a false perception because the poor states had very strong income-linked voting preferences.

student t distribution: gaussian random variable with an uncertain standard deviation. for a standard normal Z and an independent U distributed as chi squared with r degrees of freedom, the random variable $t = \frac{Z}{\sqrt{U/r}}$ has a student t distribution with r degrees of freedom. the uncertainty in the denominator allows you to pad your tails appropriately and allot higher probabilities to seemingly rare events when you have small samples.

correction factor for the sample variance: the sample mean is a sort of ‘optimum’ with respect to our sample (it minimizes the sum of squared deviations) and therefore we will underestimate the dispersion of the data on average. since there is some uncertainty about where the true mean is, our sum of squared terms is smaller than it should be. we can adjust for this by dividing the sum by n-1 instead of n, and we get an unbiased estimator of the variance.
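a small simulation sketch of the bias, assuming standard normal data (true variance 1) and tiny samples of size 5. the function name and settings are my own choices for illustration.

```python
import random
import statistics


def average_variance_estimates(n=5, trials=20000, seed=0):
    """Average the divide-by-n and divide-by-(n-1) variance estimates."""
    rng = random.Random(seed)
    biased_total, unbiased_total = 0.0, 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased_total += statistics.pvariance(sample)   # divides by n
        unbiased_total += statistics.variance(sample)  # divides by n - 1
    return biased_total / trials, unbiased_total / trials


biased_avg, unbiased_avg = average_variance_estimates()
# the divide-by-n average should sit near (n - 1) / n = 0.8,
# the divide-by-(n-1) average near the true variance of 1
print(biased_avg, unbiased_avg)
```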

## uncertainty

given some estimated value of a parameter, the confidence interval is a range of values (based on the sampling distribution) constructed so that it contains the true value with a probability of $1 - \alpha$. the fixed parameter will be within our confidence interval roughly $100(1 - \alpha)$% of the time over repeated samples. the uncertainty about the parameter stems from the fact that our estimate is based on a sample: to decrease uncertainty - get more samples! we don’t know where our parameter is within the confidence interval. a good way of getting a confidence interval (when you have no idea what the sampling distribution looks like) is to use the bootstrap. basic math-stat 101.

we hypothesize that the underlying parameter value is something and then check the chance (under the null distribution) of generating the data which we observe. if you are very unlikely to see this data under that assumption then you can reject your hypothesis. the p-value is simply the chance - under the null distribution - of observing data at least as extreme as what you saw in your experiment. a type I error is rejecting the null hypothesis when it is actually true (false positive); its probability is the significance level. a type II error is the failure to reject the null hypothesis even though it is false (false negative): our test was not designed well enough to catch the actual effect. permutation tests are particularly useful for this. baby’s math-stat class recap.

the power of a test is {1 - probability of a false negative}. this is the probability that you correctly reject the null hypothesis (the probability of getting a true positive). to measure the power we need a null distribution, a presumed alternative distribution and a chosen significance level. the power is the mass of the alternative distribution that falls beyond the point where you reject the null. this is actually useful - choose the smallest effect you care about and see if your experiment has the power to detect it. also, if you know your desired significance level and have a hypothesized and an alternative value then you can determine the sample size n which will get you the statistical power that you desire from your test. diagram from the course: rejection of the null tells us that there is something systematic which violated your null hypothesis assumption; it does not tell you what that systematic cause is. detecting confounding by thinking about forms of bias in your data is important - see the statin study wherein the recipients of the drug were older on average.
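a minimal sketch of the power and sample size calculation, assuming the simplest possible setup: a one-sided z test on a mean with a known standard deviation. the function names and the effect size of 0.5 are hypothetical; the course examples may use a different test.

```python
import math
from statistics import NormalDist


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z test of H0: mu = mu0 vs H1: mu = mu0 + effect."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection point under the null
    shift = effect * math.sqrt(n) / sigma      # how far the alternative sits
    return 1 - NormalDist().cdf(z_crit - shift)


def sample_size_one_sided_z(effect, sigma, alpha=0.05, power=0.8):
    """Smallest n that reaches the desired power for the same test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


# hypothetical: detect a shift of half a standard deviation with 80% power
n_needed = sample_size_one_sided_z(effect=0.5, sigma=1.0)
print(n_needed, power_one_sided_z(0.5, 1.0, n_needed))
```

with no effect at all the power collapses to the significance level, which is a handy sanity check on the formula.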

the funnel diagram shows the success rate which a clinic needs to have in order to beat the given threshold of the test. fairly easy to arrive at this number: $32 \pm z\,\sigma/\sqrt{n}$, i.e. 32 plus/minus critical value * standard deviation / root(sample size). with larger n you can detect smaller changes. this matters because an underpowered study could have absolutely terrible results (even though you ran the appropriate hypothesis test). you need to understand the power of your study before arriving at any conclusions.

t tests along with a bunch of variance calculations/assumptions can be used to assess the difference in means for two samples. permutation tests achieve roughly the same effect with a lot less hoo-ha. how does the power of these tests differ? i know that permutation tests are actually quite powerful.

multiple comparisons, bonferroni corrections, false discovery rates: relevant xkcd. conducting 20 significance tests at a level of 0.05 means you should expect about one false positive from chance alone, even if every null is true. casi is a fantastic resource for this topic.
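a rough sketch of the two standard corrections, assuming independent tests. the p values below are invented for illustration, and the benjamini-hochberg step-up procedure is my pick for the false discovery rate method; the course may present a different one.

```python
def familywise_error_rate(n_tests, alpha):
    """Chance of at least one false positive across independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only the p values below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k whose sorted p value is <= k / m * alpha
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject


hypothetical_p = [0.001, 0.012, 0.025, 0.041, 0.6]
print(familywise_error_rate(20, 0.05))            # about 0.64
print(bonferroni_reject(hypothetical_p))          # only the first survives
print(benjamini_hochberg_reject(hypothetical_p))  # the first three survive
```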

you can run tests with dependent samples, but you must be aware of whether your data is independent (basic sanity check - lag plot). you will need to identify if you are dealing with dependent samples and you will need to adjust for it appropriately.

## nonparametric statistics

again, most of this is a recap of things i am familiar with, so i’m using a light touch.

you want to use nonparametric tests when you have skewed and highly variable data, when you cannot make assumptions about the data generating process, and when your statistic has difficult to calculate sampling distributions.

the right parametric test is probably more powerful than a nonparametric test. also, a quick sanity check on your assumptions is to run both a nonparametric and a parametric test side by side and then see if the results are roughly congruent with each other.

a bunch of nonparametric tests: kolmogorov-smirnov, shapiro-wilk, anderson-darling etc. can be used to check distributional assumptions. unfortunately, these assume independence as well. wilcoxon’s signed-rank test can be used on paired data in order to compare the medians of two groups. mann-whitney (takes you back to the baby inference classes) can test if one group has systematically larger or smaller values than the other group. there are well established sampling distributions for the statistics mentioned above.

using the data to generate a null distribution: so you are trying to examine the probability that something anomalous is happening in the schools. you do not have a sampling distribution for a complicated statistic so you treat the 50th to 75th percentile of the two distributions to obtain a null distribution for the correlation between the two metrics. this is a fantastic use of resampling.

permutation testing in the case of test answers: fantastic idea.

permutation testing: generally reserved for comparing more complicated statistics across a bunch of different groups. if there is no systematic difference then the statistic should be roughly similar upon permuting the labels. if you compute the statistic over all (or many) possible permutations then you have a good approximation to the null distribution, and you can use this null distribution to arrive at a fairly accurate empirical p value. stupid idea - try and perturb one group to understand the power of your test (for example adding 0.01 to each point in group a in a comparison of means test).
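the label-shuffling idea can be sketched like this, using the absolute difference in means as the statistic. the function name, the two groups and the number of permutations are all placeholders of mine.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Empirical two-sided p value for a difference in means."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # shuffling the pool is the same as permuting labels
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if permuted >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    # the +1 counts the observed labelling itself, so p is never exactly 0
    return (hits + 1) / (n_permutations + 1)


# hypothetical measurements for two clearly separated groups
p_separated = permutation_p_value([5.1, 4.9, 5.3, 5.0, 5.2, 4.8],
                                  [1.0, 1.2, 0.9, 1.1, 0.8, 1.3])
print(p_separated)
```

any other statistic (a median, a correlation, something custom) can be swapped in for the difference in means without changing the rest of the procedure.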

permutation tests to check for trends in time series data: fit trends, permute the data. check the probability of the original trend. this uses permutation tests on non-exchangeable data-points.

bootstrap aka confidence intervals for complicated statistics: treat your original sample as the overall population and sample 10000 batches of data of size n with replacement from it. the bias, variance and other characteristics of this simulated distribution approximate those of the theoretical sampling distribution, per the tibshirani monograph. you can get a rough confidence interval via the percentiles of your simulated data. you can use the bc_a bootstrap in order to correct for skew, etc. kind of analogous to cross validation.
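a minimal sketch of the percentile bootstrap, here for the median. this is the plain percentile interval, not the bc_a variant, and the function name, data and resample count are my own placeholders.

```python
import random
import statistics


def bootstrap_percentile_ci(sample, statistic=statistics.median,
                            n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = sorted(
        statistic(rng.choices(sample, k=n))  # one resample of size n
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


hypothetical_sample = list(range(1, 101))
ci_low, ci_high = bootstrap_percentile_ci(hypothetical_sample)
print(ci_low, ci_high)
```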

model selection via cross validation: basic model selection recap. use the validation set to choose the best values of your parameters. do not unwrap the test set until you are ready to 86 the model if things go wrong. the test set is sacred and should not be unwrapped until the last step. you can use a bootstrap analogue for your cross-validation: subsample the dataset at random K different times, divide each subsample into training and testing data, and collect all your metrics.

penalizing complexity - address the tradeoff between complexity and error by directly penalizing complexity. express the badness of the model as the sum of an error term and a penalty term for model complexity. this is a form of regularization. aic adds a penalty of 2 * the number of parameters. bic adds a penalty of the number of parameters * log of the number of data points. in both cases the error term is -2 * the log likelihood. this is an interesting contrast to simply penalizing the size of the parameters. mdl is an information theoretic penalty which compresses the model and the error. it attempts to strike a balance between a big model and high error.
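the two criteria written out as code, using the standard definitions with -2 * log likelihood as the error term. the log likelihoods and parameter counts below are invented purely to show the comparison.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_points):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_points) - 2 * log_likelihood


# hypothetical fits on 100 points: the bigger model fits slightly better
simple_fit = {"log_likelihood": -105.0, "n_params": 2}
complex_fit = {"log_likelihood": -103.0, "n_params": 6}

for name, fit in [("simple", simple_fit), ("complex", complex_fit)]:
    print(name, aic(**fit), bic(**fit, n_points=100))
```

bic penalizes each parameter harder than aic once you have more than about 8 data points (log n > 2), so it leans toward smaller models.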

bayesian occams razor: rests on the principle that a model which can produce fewer outcomes puts an individually higher probability on each plausible outcome. the simpler model is usually right.

## experimental design

replication is key! your results need to be independently verified sometimes and the basic logic is that as samples increase, your results become more consolidated. again, you could have mistakenly rejected the null hypothesis and independent replications will help confirm this.

baselines are important: what would happen in the absence of the effect? the effect needs to cause a difference, but it needs to be measured relative to some reference value. generally a placebo + control + effect structure is important in randomized studies. placebos are sometimes able to simulate the treatment by themselves. the hawthorne effect {if you tell the subjects you are studying their responses this might bias outcomes from the outset. also the perception of an effect can cause a change in the outcome} and the school district example.

if there are external sources of variability you must identify them and adopt a blocked design by dividing your data into further groups based on the level of the confounding factor. then examine your hypothesis within these ‘controlled groups’. block the confounding factor and analyze the blocks separately. make sure to watch for multiple comparisons.

randomization: your data points should both be randomly selected and randomly assigned to groups. the lack of randomization can lead to bias in your analysis. also, randomization needs to be done appropriately - don’t speak to a bunch of people in a plaza ‘randomly’ and then say that you sampled at random. that is incorrect.

george box: block what you can, randomize what you cannot.

gathering data: drawing from the representative population. multiple different strategies.

srs: draw at random without replacement. not independent samples but when the population size is large we can treat this as independent. simple random samples often overlook significant subpopulations if they are small in the population - for example: rare diseases.

stratified random sample: divide the population into non-overlapping groups with small within-group variation and then sample proportionally from them. neymans optimal allocation takes both size and variation into account: for a subpopulation l with proportion $W_l$ and standard deviation $\sigma_l$, we sample from group l in proportion to $W_l\sigma_l$.
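neyman allocation is a one-liner once the strata are described. a sketch with made-up strata (the weights and standard deviations are hypothetical); real allocations would be rounded to whole units.

```python
def neyman_allocation(total_sample, weights, sigmas):
    """Split a sample across strata in proportion to W_l * sigma_l."""
    products = [w * s for w, s in zip(weights, sigmas)]
    total = sum(products)
    return [total_sample * p / total for p in products]


# hypothetical strata: the smallest stratum is by far the most variable
allocation = neyman_allocation(210, weights=[0.5, 0.3, 0.2], sigmas=[1.0, 2.0, 5.0])
print(allocation)  # roughly [50, 60, 100]
```

proportional sampling would have given the last stratum only 20% of the sample; accounting for its spread pushes it to nearly half.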

cluster sampling: divide the population into heterogeneous smaller groups which are well representative of the overall population. sample a few groups and then sample at random from within each group. so city → city blocks → sample from city block. this would be problematic if we do not account for socioeconomic indicators in zip codes in nyc for example.

getting an unbiased list of subjects to sample from is a problem in itself. non-response bias arises because response rates can be systematically different for different groups. frame the question neutrally so as not to evoke a strong response + randomize the ordering of questions - framing can affect the response.

sample size doesn’t matter if your population isn’t selected well. the literary digest chose a systematically biased population to sample from (rich people) and then had 2.4 million responses, all worth nothing because they were not representative of the underlying population.

in an srs our samples are not independent. this leads to a finite population correction on our variance of the mean, which is approximately equal to 1 when the sample size is small in comparison to the overall population size. for a sufficiently large sample, our srs case has a sample average which is approximately normal. this is not due to the clt - which requires iid samples - we employ a separate theorem which tolerates dependence between random variables.

it’s best to have paired data while working with treatments. this allows for greater power of the statistical test. a repeated measures design allows us to serve multiple treatments to the same person in identical settings. so different treatments back to back - therefore each person effectively serves as their own control. it is important to randomize the order of the treatments and consider the temporal effects. you should consider caffeine withdrawal for example, or finals week. therefore you need identical settings. also, temporal effects can be modeled with autocorrelation models.

so we will need to account for multiple factors which we will need to block on: left/right handed MEN and WOMEN. so four groups to block on. if we don’t have enough data to replicate the experiment within each sub block we will need to assign different sub blocks to different treatments.

the simplest example of this design can be seen in a school/semester setting. we use a latin squares setting in order to assign the treatment, control and placebo to different locations at different times in order to control for the effect of the semester: all experimental conditions are tested.

fortunately i don’t have to deal with much experimental design because it seems hard - but incredibly exciting.

## linear regression

very light touch on this section. i’d like to believe i know this well already.

anscombe’s quartet again: four regression plots with equivalent point estimates - the $R^2$ for each model is the same. some models are blatantly incorrect.

we assume a model on our noise and we assume that our responses $y_i$ can be expressed as a linear function of the $x_i$. our estimates for the slope and intercept are $\beta_1 = corr(x,y)\,sd(y)/sd(x), \beta_0 = \bar{y} - \bar{x}\beta_1$ respectively. our slope is higher as the correlation is greater and as the spread of the y increases. the slope is also greater as the spread of our x variable decreases, since the same variation in y is then packed into a narrower range of x. our intercept is there to align the line as it passes through the y axis.
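the slope and intercept formulas can be checked directly. a sketch on a tiny made-up dataset that lies exactly on y = 2x + 1; the function name and data are mine.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept via corr(x, y) * sd(y) / sd(x)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    corr = sxy / math.sqrt(sxx * syy)
    sd_x, sd_y = math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1))
    slope = corr * sd_y / sd_x
    intercept = y_bar - slope * x_bar
    return slope, intercept, corr


# hypothetical data lying exactly on y = 2x + 1
slope, intercept, corr = fit_line([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])
print(slope, intercept, corr)
```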

if there isn’t a high correlation you will be sool because your model is badly specified. the correlation is the covariance between the z-scored x and y. because this was brought up a few days ago: correlation cannot be greater than 1 (in absolute value) because it is the cosine of the angle between the two data vectors and cosine is bounded by 1; this can also be seen through the cauchy-schwarz inequality. for this you will need to imagine your data as n dimensional standardized vectors.

a strong correlation could be indicative of many things:

1. causality - but you don’t know in which direction
2. hidden cause - some z causes both x and y. knowing x does give you information about y but not a causal relationship.
3. confounding - some other variable (z) influences y and possibly even more strongly than x. we don’t know z, therefore our conclusions are misspecified in the absence of the confounding factor.
4. coincidence - most frustratingly. sunspots and congress.

wald’s tests for the coefficients. this is very formulaic. standard error for the model predictions via basic sampling intervals. an interesting plot was the test statistic for the correlation coefficient differentiated by the number of data points used to construct the sampling distribution. the sample correlation needs to be truly large in order to assume large values of the t statistic. the blue curve is for 10 data points versus the green curve for 100 data points. extrapolation is not a very fantastic idea for linear models or models over space and time. extrapolation will often lead to false and misleading predictions.

don’t fall into the trap of multiple comparisons. realize that at a significance level of alpha and around n experiments, about $n\alpha$ experiments will on average show statistically significant outcomes just by chance.

not familiar with this experimental pipeline: try checking if your model explains a significant portion of the variability via an anova test - if your anova is promising then you go ahead and start testing your individual variables.

check for effect size via the standardized coefficients in order to compare apples to apples. this will allow you to circumvent issues of disparate scale.

the model can be viewed through the lens of the r squared statistic: intuitively it is the proportion of the variance in the data which can be explained by the model. you can decompose the variance in your response as follows: variance in the response variable = mean of the squared residuals + mean of the squared difference between the predicted values and the mean of the response.
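a quick numerical check of the decomposition for a least squares line with an intercept (it does not hold in general for other fits). the data points are made up.

```python
def variance_decomposition(xs, ys):
    """Return (total, residual, explained) mean squares for a fitted line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    predictions = [intercept + slope * x for x in xs]
    total = sum((y - y_bar) ** 2 for y in ys) / n
    residual = sum((y - p) ** 2 for y, p in zip(ys, predictions)) / n
    explained = sum((p - y_bar) ** 2 for p in predictions) / n
    return total, residual, explained


# hypothetical noisy data
total, residual, explained = variance_decomposition(
    [1, 2, 3, 4, 5, 6], [1.2, 1.9, 3.2, 3.8, 5.1, 5.9])
r_squared = explained / total
print(total, residual + explained, r_squared)
```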

the f statistic is very similar: it is a measure of the variability in the data due to the model vs due to random error. it is calculated as the ratio of the sum of squares of the model to the sum of squared errors, and we divide each through by its degrees of freedom so that the ratio of the two scaled chi squared quantities follows an F distribution. yadda yadda this is mechanistic calculation nonsense.

## more regression

very light touch on this section. i’d like to believe i know this well already.

residual analysis is very important. if the model is valid the residuals should look roughly like the parametric assumptions you made on your noise. remember that the noise is not equal to the residuals, but it is related to the residuals through the standard $(I - H)\epsilon$ formula. the residuals are slightly correlated with one another - due to the missing degrees of freedom.

standardized residuals (each residual divided by $\sigma\sqrt{1 - h_{ii}}$) have the same variance as assumed on the noise, though they are still slightly correlated. dividing by the estimated standard deviation of the noise instead puts a chi squared quantity in the denominator and yields studentized residuals, which are roughly t distributed. analyze studentized or standardized residuals as opposed to the raw residuals. this is an interesting default? should you standardize residuals from the get go?

outlier - something far away from the rest of the data. investigate this. leverage - the potential influence of a point, described by how far away it is in the x direction. being closer to the mean of x results in a lower leverage, due to a weaker effect on the fitted line. influential point - a high leverage point which significantly affects the slope.

you need to determine which one of the two factors is at play: are you systematically missing some data in a particular region or are you just dealing with some irrelevant outliers which you can address and remove from your analysis.

leverage is formally defined as the corresponding diagonal element of the hat matrix. this is the rate of change of the ith prediction with respect to the ith observed response. the further the point deviates in x, the larger the element and the more it influences the overall coefficient estimates.
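for a simple regression with an intercept, the diagonal of the hat matrix has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$, so leverage can be sketched without any matrix code. the x values are made up, with one point deliberately far out.

```python
def leverages(xs):
    """Hat matrix diagonal for a one-variable regression with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [1 / n + (x - x_bar) ** 2 / sxx for x in xs]


# hypothetical design: the last point sits far from the rest in x
hypothetical_xs = [1, 2, 3, 4, 20]
h = leverages(hypothetical_xs)
print(h)       # the last entry is close to 1
print(sum(h))  # leverages sum to the number of coefficients, here 2
```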

cook’s distance - a metric which captures how much your predictions would deviate once you remove a particular point from the data set. this can also be recomputed in terms of the leverage of the data point and its residual using some fancy algebra.

robust regression aka special loss functions: median estimator - aka laplacian noise aka lad - changing the value of a point just a little could have a very great impact on the model because of the kink in the loss function at 0. huber loss - lad but differentiable: quadratic near 0 and linear in the tails, which turns it into an easier optimization problem. (cauchy noise gives yet another heavy-tailed loss.) bisquare loss - is like mse but removes the loss on outliers in your analysis. this does not make much sense to me - because you could simply remove the outliers and use mse in that case. these would change your

heavier tailed distributions assume that you can have a reasonable probability for unlikely events, which means that outliers should be a little more likely in these models. loss functions are tailored to be able to accommodate these deviations in a less harsh manner.

ransac model: data is mostly inliers - pick a subset of points and compute a model based on it. any other points which reliably fit this model are added to the inliers. compute the error. repeat for a bunch of different subsets and then pick the model with the smallest error. this kind of just feels like you are cherry picking your training set to find the subset which best suits your model class.
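a bare-bones sketch of the ransac loop for fitting a line. note one swap: instead of picking the model with the smallest error, this version keeps the model with the largest consensus set (a common variant), since smallest error alone would favor tiny subsets. the function names, tolerance and data are all placeholders of mine.

```python
import random


def fit_line(points):
    """Least squares line through (x, y) points; None if x has no spread."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    if sxx == 0:
        return None
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return slope, y_bar - slope * x_bar


def ransac_line(points, n_iterations=200, tolerance=0.5, seed=0):
    """Keep the line whose inlier set (points within tolerance) is largest."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(n_iterations):
        model = fit_line(rng.sample(points, 2))  # minimal subset for a line
        if model is None:
            continue
        slope, intercept = model
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tolerance]
        if len(inliers) > len(best_inliers):
            refit = fit_line(inliers)  # refit on the whole consensus set
            if refit is not None:
                best_model, best_inliers = refit, inliers
    return best_model, best_inliers


# hypothetical data: 20 points on y = 2x + 1 plus three gross outliers
data = [(x, 2 * x + 1) for x in range(20)] + [(3, 50), (10, -40), (15, 90)]
model, inliers = ransac_line(data)
print(model, len(inliers))
```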

ridge regression by regularizing the parameters: lambda = 0 leads to the old solution. ridge doesn’t favor sparsity - ridge favors smaller overall effect sizes. it will choose (1, 1, 1) over (0, 0, 3): several similar effects. ridge is a map framework of thinking about regression via a gaussian prior on the parameters with 0 mean and a variance that shrinks as $\lambda$ grows ($\sigma^2/\lambda$ for a penalty of $\lambda\|\beta\|^2$ and noise variance $\sigma^2$).

sparse solutions can be achieved through lasso. we want something like a hamming loss for whether a coefficient is activated, but that is an optimization problem over a discrete space. instead we choose to penalize the absolute value of each parameter, assuming that cumulatively the effects should not be simultaneously large and many in number. this is the same as a map framework using a laplacian prior over our parameter space. sampling distributions, confidence intervals and hypothesis tests for lasso are obtained via nonparametric statistics. this is a convex problem and we can find a tractable solution.

generalized linear models: a nonlinear inverse link function captures special prediction properties. we assume that the mean of the response is some nonlinear transformation of a linear model. glms allow us to exercise more freedom in the relationship by allowing us to fully specify the distribution of the response. generally cannot be solved in a closed form.

## categorical data

this section is comparatively new: stuff i haven’t dealt with extensively.

categorical data without a natural ordering: can be expressed in the form of a contingency table. numeric counts for each category across the n X m input boxes. risk of outcome i for treatment j: empirical conditional probability of outcome i given treatment j. relative risk of outcome i for treatments j, k is the {risk of outcome i, treatment j}/{risk of outcome i, treatment k}. odds ratio for two outcomes i, j and two treatments $t_a$, $t_b$ is the simple ratio {outcome i, $t_a$/outcome j, $t_a$}/{outcome i, $t_b$/outcome j, $t_b$}. this compares the odds of the two outcomes across two treatments. we are interested in the size of an effect as well as its significance; using a confidence interval around the odds ratio can help us capture these two things simultaneously.
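the three quantities on a small made-up 2x2 table. the treatment and outcome labels and the counts are hypothetical.

```python
def risk(table, treatment, outcome):
    """Empirical P(outcome | treatment) from a table of counts."""
    row = table[treatment]
    return row[outcome] / sum(row.values())


def relative_risk(table, outcome, treatment_j, treatment_k):
    """Risk of the outcome under treatment j relative to treatment k."""
    return risk(table, treatment_j, outcome) / risk(table, treatment_k, outcome)


def odds_ratio(table, outcome_i, outcome_j, treatment_a, treatment_b):
    """Odds of outcome i vs j under treatment a, relative to treatment b."""
    odds_a = table[treatment_a][outcome_i] / table[treatment_a][outcome_j]
    odds_b = table[treatment_b][outcome_i] / table[treatment_b][outcome_j]
    return odds_a / odds_b


# hypothetical counts
counts = {
    "drug": {"recovered": 30, "sick": 10},
    "placebo": {"recovered": 20, "sick": 20},
}
print(risk(counts, "drug", "recovered"))                         # 0.75
print(relative_risk(counts, "recovered", "drug", "placebo"))     # 1.5
print(odds_ratio(counts, "recovered", "sick", "drug", "placebo"))  # 3.0
```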

simpsons paradox: confounding factors are very important for categorical data. when you aggregate trends they might seem the reverse of what they actually are due to the omission of a confounding variable. titanic example of how class survival rates are confounded with gender. since women and children were evacuated first - and more of them were in first and higher classes - the first/higher classes had a higher survival rate than the other classes. classic berkeley grad school admissions example: gender was confounded with selectiveness of department. hospital example in the sheet.

chi squared significance test: independent data points, large enough values in each entry. get the data, do some eda, check for confounding factors. then, the test can see if your treatment influences our outcome OR if they are independent of one-another. test statistic is $\sum (obs-exp)^2/exp \sim \chi^2$. you can see how this is sort of supposed to be chi squared distributed, each data point is independently collected, each term in the summation resembles the square of a standardized gaussian random variable. we sum up across r rows and c columns, however, since we have an almost fully determined system (except the last row and column), this is a chi square distribution with (r-1)(c-1) degrees of freedom. you calculate your p value for the obtained statistic.

expected counts: you get a table with $x_i/N$ of the total in row i and $y_j/N$ of the total in column j. you simply formulate the expected count for cell (i, j) as $x_iy_j/N$. this is the table in an independent world - your conditional distributions look exactly the same.
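expected counts and the chi squared statistic, sketched on a made-up 2x2 table of counts. getting the p value needs the chi squared cdf, which is left out here to stay within the standard library.

```python
def expected_counts(table):
    """Counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum of (obs - exp)^2 / exp, with (r - 1)(c - 1) degrees of freedom."""
    expected = expected_counts(table)
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# hypothetical counts: rows are treatments, columns are outcomes
observed = [[30, 10], [20, 20]]
print(expected_counts(observed))        # [[25.0, 15.0], [25.0, 15.0]]
print(chi_squared_statistic(observed))  # statistic of about 5.33, 1 dof
```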

fishers test: for small enough tables we can calculate the exact p values under the null hypothesis through a simple permutation argument. we can also use a permutation test in order to get some approximation to the exact test with a sufficiently large number of repetitions: sample a bunch of tables using random permutations and then look at the empirical p value. yates correction for 2x2 tables: subtract 0.5 from each |obs - exp| before squaring to make the chi squared approximation more accurate. you can even change the null hypothesis from a test of independence to something else.

anova: analysis of categorical data {through factors which have different levels (which may have different outcomes)} with continuous outcomes. anova can be interpreted as a comparison of means (the conditional expectation function) or a linear regression on categorical data. we want to check if some particular level of a factor has a systematic difference in the mean outcome compared to the other levels. our anova learns the following model: $y_i = \tau_{GLOBAL} + \tau_{x_i} + \epsilon_i = \mu_{x_i} + \epsilon_i$ . we have some global mean and then some offset for each group - which together equal the group specific mean. and then the particular value for that specific data point is some further gaussian noise offset from this group mean. this is analogous to the case of linear regression. we can also interpret this as a one hot encoded design matrix in linear regression with a y_i output (and no intercept in the handout). if our model is a good fit, then the x_i explain the variance to a certain extent and our groups have unequal means. our null hypothesis through either interpretation is that the group specific means are equal. we put these through the F test $F = \frac{SS_{model}/df_{model}}{SS_{error}/df_{error}}$ and get a p value from the resulting ratio of scaled chi squares.
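the one-way anova f statistic computed by hand, with k - 1 and n - k degrees of freedom for k groups and n points in total. the three groups are made up so the arithmetic is easy to follow.

```python
def one_way_anova_f(groups):
    """F = (SS_model / df_model) / (SS_error / df_error) for one factor."""
    all_values = [v for group in groups for v in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in groups]
    # between-group variation: group means around the global mean
    ss_model = sum(len(group) * (mean - grand_mean) ** 2
                   for group, mean in zip(groups, group_means))
    # within-group variation: points around their own group mean
    ss_error = sum((v - mean) ** 2
                   for group, mean in zip(groups, group_means)
                   for v in group)
    df_model, df_error = k - 1, n - k
    return (ss_model / df_model) / (ss_error / df_error), df_model, df_error


# hypothetical groups with means 2, 3 and 6
f_stat, df_model, df_error = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df_model, df_error)  # F of 13 on (2, 6) degrees of freedom
```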

anova only tells us that there is some systematic difference: no indication of where the variation stems from. post hoc analysis via appropriate hypothesis tests will be required. will definitely need to watch out for false discovery. anova makes two main assumptions: identical variance in groups (homoskedasticity across levels), and normally distributed, independent data points.

alternative anovas:

two way anova: our model is $y_i = \mu + \tau_{x_i} + \tau_{z_i} + \gamma_{x_i, z_i} + \epsilon_i$ . we need to make sure that our categorical counts are large. our design matrix will pretty much be the same but with an extension for the extra dummy variables. you will not be able to correctly attribute and disentangle crossover effects if you do not have sufficiently large data sets.

ancova: adding a continuous variable to our anova in order to control for it appropriately. this is the effect of the treatment on the outcome, controlled for the continuous variable.

manova: multiple outputs - which may not be independent - anova. can be extended to ancova as well.

kruskal-wallis: nonparametric anova for difference in the medians of several groups. assumes a roughly similar shape for distributions. example: different directions of skew would mess this test up - roughly same shape of the distributions.

## other resources linked on the website

the engineering statistics guide is a fantastic standard reference to get unstuck. 8 chapters covering most things that you will need to get started in trying to find the right direction for your experiments. understand the assumptions being made, understand what you are trying to search for and understand what you are currently doing. again - the most important thing is checking your assumptions. else you have a system with garbage in, garbage out.

i went through some papers but they weren’t very useful without the accompanying discussions. the ‘racial preferences in dating’ paper seemed fairly straightforward to understand. abandoned this section because i have literally no way to know if i am making progress.