A couple of days ago I was trying to nail down p-values, significance levels and null hypotheses. I know that if the p-value is lower than 0.05, the result is significant and the null hypothesis is rejected – but that’s just memorization. I wanted to *understand* it, know when it should be applied, and be able to explain it to others. I turned to my trusty co-worker, R., to help in this endeavor. (If you prefer you can skip the dorky “Brianne comes to understand p-values” conversation and go right to the “why this shit is important” section below.)

**Me:** R – do you have a good understanding of the concepts underlying p-values, significance levels and null hypotheses?

**R:** Not really a *great* understanding. I use stats software to calculate p-values. Anything less than 0.05 is significant.

**Me:** Yeah, *I know* that, but can you help me understand something? C’mere and look at my screen.

*[R, who is used to me pulling us off-topic, sighs good-naturedly and comes over]*

**Me:** So look at this bell curve in Wikipedia:

**Me:** So, a significance level is just some arbitrary number that we usually set at 0.05 or 0.01…

**R:** I don’t think it’s arbitrary.

**Me:** Well, it’s not set by …err… &*^% … I don’t know. Let’s tackle that one another day. So, let’s say we set it at 0.05 (me mumbling under my breath “arbitrarily”). That means 95% of the time (everything not green) we’re not going to observe the null hypothesis?

**R:** I hate double negatives.

**Me:** So?

**R:** No. It’s saying that there is a 5% probability of making our observation purely by chance. The smaller the number the less likely it is that our observation was made by chance. The greater the number the more likely it was made by chance. A small number rejects the null hypothesis.

**Me:** Okay. It’s an estimate of how likely we are to observe our result by chance. So…let’s say that we’re trying to determine if a chemical is stable over time. We test it several times over a month, then look at the difference over time in how the chemical performs. We do a linear regression and get a slope of 0.1* The p-value on the data is calculated to be 0.05. That means there’s a 5% probability of obtaining the observed result (0.1) or greater by chance if no real effect exists (that is, if our chemical is stable).

**R:** Yes.

*A long pause.*

**Me:** Sweet. I think I get it. But, that means that p-value is sort of the opposite of what we observe…

**R:** Yeah, it’s counter-intuitive.

**Me:** It’s *math*. It’s all counter-intuitive to me.

Biodork: helping you lose confidence in the scientists behind the development of the technologies you use every day!

I’m kidding. Statistics is not what I do for my company. But understanding statistical analyses is important in my field and it feels good to finally have a grasp of the concept.
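In fact, the chemical-stability example from the conversation can be sketched as a little simulation. Everything here is made up – the noise level, the number of measurements – but it shows what “5% probability of obtaining the observed result by chance” means in practice: pretend the null hypothesis is true (the chemical is stable), and count how often chance alone produces a slope at least as big as the one we observed.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

random.seed(0)
days = list(range(30))        # one measurement a day for a month
observed_slope = 0.1          # the slope from the (made-up) experiment
noise_sd = 2.9                # made-up measurement noise

# Simulate worlds where the null hypothesis is true: the chemical is
# perfectly stable, so any slope comes from measurement noise alone.
trials = 10_000
as_extreme = sum(
    ols_slope(days, [random.gauss(0, noise_sd) for _ in days]) >= observed_slope
    for _ in range(trials)
)
p_value = as_extreme / trials
print(f"p ≈ {p_value:.3f}")   # with these made-up numbers, lands near 0.05
```

The noise level was chosen so this toy example lands near the p = 0.05 from the conversation; with a different noise level you’d get a different p-value from the very same slope.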

I was a little hesitant to write this because it feels like p-value is something that should be understood by A Real Scientist™. You see p-values all over the literature, so of course everyone understands them, right? But it’s *not* an easy concept. It *is* counter-intuitive. And I thought I’d throw it out there that I for one was struggling with it, and it’s okay if you are too. Mathphobes unite!

There’s another good reason to understand p-values and statistical analyses in general: it’s a huge part of not getting fooled by shoddy statistics put out by proponents of bullshit. I’m looking at you, alternative medicine. Some readers of this blog are probably also fans of Science-Based Medicine. SBM has many articles that discuss the statistics gymnastics that alt med proponents perform to make it seem as if their [insert bullshit product here] has an effect on [insert condition here].

I’m reading a book right now called *Intuitive Biostatistics* by Dr. Harvey Motulsky. I’m finding it hard to put down, and I don’t often say that about math books. Here’s how Dr. Motulsky explains his book:

Unlike statistics texts that emphasize mathematics, Intuitive Biostatistics focuses on proper scientific interpretation of statistical tests. The book is perfect for researchers, clinicians and students who need to understand statistical results published in biological and medical journals. Intuitive Biostatistics covers all the topics typically found in introductory texts, and adds several topics of particular interest to biologists – including multiple comparison tests, nonlinear regression and survival analysis.

The first chapter is all about how we trick ourselves into seeing patterns where none exist, and the importance of correctly analyzing data so that it can speak for itself without our dumb brains getting in the way. Motulsky gives illuminating examples throughout the book and leaves the math formulae to the statistics textbooks. You can find it free online or on Amazon if you want a paper copy. I recommend this book for every skeptic who has to deal with data sets or statistical defenses of woo.

And finally, if you have any rules or tricks to help explain p-value in a concise manner, I’d love to see them in the comments below. Also, I am a self-admitted mathphobe and doubter of my own mathematical skills, so if you catch me in an error, please do call it out. I’ll be happy to learn and grateful for the chance to clear up any misinformation that I might be spreading.

*For my mathphobe friends – a simple explanation for a linear regression is that it’s a line drawn through the points that we’ve just plotted, and a slope refers to…well…the slope of the line we’ve drawn. A slope of 0.1 in the example means there is about a 10% difference between how the chemical performed on the last day as compared to how it performed on the first day we tested it.

*Le hand-drawn by moi linear regression.*

**felicis:** A *tiny* clarification:

“It’s an estimate of how likely we are to observe our result by chance.”

Assuming that the null hypothesis is true. That’s why we reject the null hypothesis if we get a very small p-value. That does not mean the null hypothesis is false – just that this particular experimental outcome has a low likelihood of occurring if it is true.

See XKCD for another illustration…

**Brianne Bilyeu:** An essential aspect of p-value. Thanks for pointing it out.

**Cuttlefish:** I have my stats students actually flip coins a few hundred times to get a *feel* for probability first. And we go into the relative costs of Type I and Type II errors, and *why* a cutoff criterion might differ from the “standard” 0.05 or 0.01.

Love this stuff…

**unbound:** P-value is massively misunderstood by many people, including the scientists who use it. I would highly recommend reading an excellent article on the subject of statistics (including p-values) from Science News. I found a copy here – http://ckwri.tamuk.edu/fileadmin/user_upload/A-Litt/Odds_Are__It_s_Wrong_-_Science_News.pdf

**Greg Laden:** Yeah, I think you basically get it. A few additional comments.

First, as you say in part of your post and felicis notes in a comment, the probability dealie assumes the null hypothesis, for instance, that you really are tossing coins (or whatever). In fact, your expectation (the hypothesis) in a strongly determined system may be way different.

The 0.05 value is pretty much arbitrary. There may or may not be valid reasons to put the value there, but I would count most of that as arm waving. In the ’70s and ’80s, in archaeology, we used 0.10 because otherwise nothing would ever be significant. Large-scale studies on the efficacy of dangerous substances that used a 0.05 level might not be convincing to a panel of evaluators prepared to give the substance to babies. In some fields, under certain conditions, the arbitrary number is lower, like 0.01.

Now, two somewhat more subtle and often ignored aspects of this. First, we assume in science that we are replicating and reproducing and re-trying results. With a 0.05 level of acceptance, this means that we will frequently make a mistake. In your workplace, consider the number of times someone used a p value in the last year. One in 20 of those (roughly) would be one kind of mistake. That’s a lot.

But really, it’s not a lot, because it is not the case that every single experiment was the only experiment done in relation to a particular system. It is absolutely possible to get a “significant” result when you “know” you shouldn’t, or the opposite: a result that doesn’t fit the overall pattern of results. No matter what, you are always doing a kind of internal meta-study of everything.

The second thing has to do with that graph and its exact shape. In classic statistics, you calculate F statistics or regressions, or moments (mean, standard deviation, etc.) and then make underlying assumptions about the nature of the numbers, which ultimately leads to a p-value. In the really old days, people actually looked up some of these numbers on tables using a test statistic, degrees of freedom, etc. But those distributions that are used to ultimately get to a p-value are well-understood theoretical distributions (based on a combination of empirical understanding and inference). All phenomena are divided into a handful of these theoretical distributions (even with distribution-free stats) and we just assume that this all makes sense. But the actual distributions are anywhere from a little bit to a lot different than those, thus modern techniques like bootstrapping.

So imagine taking all the data you ever got from a certain lab test. Then you randomly draw 100 cases to make a subsample, and calculate the mean. Repeat 100,000 times. Make a distribution. Now, run your new experiment and find that result on that distribution.

Now THAT’s your p-value. If 1,000 of the previously calculated values (out of 100,000) are to the right of your new number, then you’ve got a p-value of 1,000/100,000 = 0.01. And if you look at the exact shape of that distribution you made (with the 100,000 samples) you’ll see, perhaps, that it is not exactly like the one you used above in the diagram. It will have a shape that more accurately reflects reality. Bounded on one side, not the other. Discontinuous at high values because you use larger beakers with different graduations on them for larger amounts. Some values are missing because one of your instruments had a spot of dirt on it for three years.
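Greg’s resampling recipe can be sketched in a few lines of code. The “historical” data here is invented, and I use 10,000 resamples instead of 100,000 to keep it quick:

```python
import random
import statistics

random.seed(1)

# Stand-in for "all the data you ever got from a certain lab test":
# 5,000 hypothetical historical measurements centered on 10.0.
history = [random.gauss(10.0, 1.5) for _ in range(5_000)]

# Draw a 100-case subsample, take its mean, repeat many times.
resampled_means = [
    statistics.fmean(random.sample(history, 100))
    for _ in range(10_000)
]

new_result = 10.4   # made-up result from the new experiment

# Empirical p-value: the fraction of resampled means at or beyond it.
p = sum(m >= new_result for m in resampled_means) / len(resampled_means)
print(f"empirical p ≈ {p:.4f}")
```

No theoretical bell curve is assumed anywhere: the distribution is whatever shape the historical data actually produces, which is exactly the point Greg is making.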

**Brianne Bilyeu:** Thanks for all of the information and examples, Greg. I’ll be reading over this four or five times!

**jaxkayaker:** I recommend Gotelli and Ellison’s *A Primer of Ecological Statistics* and Zar’s *Biostatistical Analysis* as well. I also just purchased the 2nd edition of *Intuitive Biostatistics* and it looks good, but in the early printings of the 2nd edition, the formula for the standard deviation is missing the radical sign, and is therefore actually the formula for the variance.

**Andrew G.:** Consign your p-values to the dustbin of history and join the Bayesian Conspiracy!

(not really joking)

**Brianne Bilyeu:** Oh sure. I just get a handle on p-values and now they’re outmoded. Typical! 😉


**Jonas:** A comment on the linear regression example.

The slope in your drawing is declining, so the slope of the line should be -0.1, and not 0.1. More importantly, the interpretation of the slope is also wrong. A slope of -0.1 does not mean the activity of the chemical will be 10% lower each day (the unit on the x-axis), but that it will be 0.1 lower than the day before, measured in the units on the y-axis. Say the activity is 1 on the first day; then the second day it will be 0.9, the third day 0.8, the fourth day 0.7, etc. 0.7 surely is not 10% less than 0.8. To get a “10% decline each day” interpretation, you would have to use some sort of log transformation.

**Jonas:** Correction on the last bit:

If your data showed a 10% declining trend, you would have to log-transform your data to fit a straight line. If your data is linear already, doing a log transform would ruin the whole thing. I have to admit I’m not quite sure how I would proceed if I wanted a “10% decline” interpretation of the slope, but I would have made the interpretation of the model more complicated than necessary.

**confusopoly:** I think you’d fit an exponential curve to your data instead of a line. The difference should already be pretty visible in your data when you choose that.
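Jonas’s point about the log transform can be checked with a quick sketch (activity numbers invented): data that declines by exactly 10% per day is curved on the raw scale, but perfectly straight after taking logs, and back-transforming the log-scale slope recovers the 10%-per-day rate.

```python
import math
import statistics

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

days = list(range(10))
activity = [0.9 ** d for d in days]     # exactly 10% lower each day

# A straight line through the raw data does NOT have slope -0.1:
raw_slope = ols_slope(days, activity)

# On log-transformed data the decline is exactly linear, and
# back-transforming the slope recovers the per-day percentage change:
log_slope = ols_slope(days, [math.log(a) for a in activity])
per_day_change = math.exp(log_slope) - 1    # -0.10, i.e. 10% down per day
print(round(raw_slope, 3), round(per_day_change, 3))
```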

**Brianne Bilyeu:** Yes – it would be -0.1. Thank you. The intent was to create a line that would show a 10% trend over time (over all points). I’m not sure if an exponential curve would be the appropriate tool to use here, as confusopoly suggests. I’ll find out and report back!

**anat:** Adding to Greg’s comment: In the area of transcriptomics studies (quantifying as many transcripts as possible in a tissue/cell sample) this is an important consideration, because one is measuring thousands of things for each sample. For every 2,000 transcripts I am comparing, I should expect about 100 to vary by two standard deviations or more just by chance. So if I am comparing two samples, I need to see more differences than that to be impressed that the two are indeed expressing a different set of transcripts. That is counter-intuitive for people used to old-fashioned molecular biology, where one only compared a handful of genes/transcripts at a time.

**Lou Jost:** The problem with p-values is that they do not indicate the magnitude of the effect you are testing, which is the thing you really want to know. It is essential to understand that a highly significant p-value does not mean the effect you are testing is big or strong. It just means your sample size was large enough to detect some nonrandom signal in your data. In biology, where there is always some natural variation between groups, p-values generally don’t give you useful information.

Consider an example: suppose you are testing the impact of herbicide drift on the plant diversity of forests that are next to crop fields. You might have data sets on the diversities of forests next to sprayed fields, and forests next to unsprayed fields. Then you might test the null hypothesis that the diversities are the same in both groups of forests. You test your data and find a p-value of 0.0001. You can confidently reject the null hypothesis. Should you get excited about your success? No, because WE KNOW IN ADVANCE that any such exact null hypothesis is false, without taking any data whatsoever. It is virtually impossible that the two groups of real forests would have exactly the same diversities, to five decimal places, even if herbicide drift had no effect whatsoever. This means you can always obtain whatever p-value you want–all you need to do is make sure your sample size is large enough. The p-value framework you are using turns science into an empty game which can always be won if the investigator has enough resources to take big enough samples. (Using a directional null hypothesis would improve things, and multiple independent tests of a directional null hypothesis can give better info, but they still won’t tell you the magnitude of the effect.)
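Lou’s point about sample size is easy to see numerically. Here is a sketch (all numbers hypothetical) of a two-sample z-test where the true difference between group means is fixed at a biologically trivial 0.05: the p-value can be driven as low as you like just by increasing n.

```python
import math

def two_sample_p(effect, sd, n):
    """Two-sided p-value for a difference in means `effect`, per-group
    noise `sd`, and n observations per group (normal approximation)."""
    z = effect / (sd * math.sqrt(2.0 / n))
    return math.erfc(z / math.sqrt(2.0))   # equals 2 * (1 - Phi(z))

tiny_effect, sd = 0.05, 1.0   # a trivially small, fixed true difference

# Same tiny effect, ever-larger samples: p marches toward zero.
for n in (10, 100, 1_000, 100_000):
    print(n, f"{two_sample_p(tiny_effect, sd, n):.6f}")
```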

Null-hypothesis-testing (with its associated p-values) is only the right model if the mere presence of an effect is noteworthy. For example, in physics, the mere presence of objects that travel faster than light would be newsworthy (recall the recent neutrino claims). It didn’t really matter how much faster than light they traveled; a fundamental law was at stake. In this situation, p-values based on a null hypothesis of “no effect” are appropriate. But I think this situation almost never occurs in biology. Almost always, then, the null-hypothesis-testing model, with its p-values, is not appropriate. What you really want to know is not “Is there an effect?” but rather “How big is the effect?” P-values don’t tell you that (they depend strongly on sample size as well as the magnitude of the effect).

The correct approach is usually the parameter-estimation model, with confidence intervals expressing the statistical uncertainties in the estimates. This means more work – you need to find a measure whose actual magnitudes are interpretable. But it answers the real question.
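The parameter-estimation approach Lou describes looks something like this (the “diversity differences” below are invented numbers): instead of a bare p-value you report the estimated effect with a confidence interval, which carries both the magnitude and the uncertainty.

```python
import math

def mean_with_ci(xs, z=1.96):
    """Mean and approximate 95% confidence interval (normal approximation)."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    half = z * sd / math.sqrt(n)
    return m, m - half, m + half

# Invented diversity differences (sprayed minus unsprayed forest pairs)
diffs = [0.4, -0.1, 0.6, 0.2, 0.3, 0.5, 0.0, 0.4, 0.1, 0.6]
est, lo, hi = mean_with_ci(diffs)
print(f"effect ≈ {est:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

A reader can now judge for themselves whether an effect of that size matters biologically, which no p-value alone can tell them.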

All statisticians know this, and any good stat book discusses it. Yet biologists (and many others) continue to use the inappropriate model with its p-values, and most editors and reviewers actually order biologists to keep making this mistake.