Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Day 21 Pre-Class Assignment

University of Missouri

✅  Put your name here

Multiple Testing and p-Hacking; or the dangers of torturing data on YOUR favor

This comic is about data dredging (aka p-hacking), and the misrepresentation of science and statistics in the media.

Credits: xkcd.com

Learning goals for today’s pre-class assignment

  • Recognize the dangers of muliple testing and state ways that we can obtain not significant results and yet have a small p-value

  • Use stats to perform Benjamini-Hochberg procedures to adjust p-values whenever doing multiple testing

  • Be aware of Mann-Whitney’s U-test as an alternative to Student’s t-test

Assignment instructions

This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-class assignments” submission folder. If you run into issues with your code, make sure to use Slack to help each other out and receive some assistance from the instructors. Submission instructions can be found at the end of the notebook.


1. p-hacking or the ethics of responsible research

Data dredging—also known as p-hacking, cherry picking, data snooping—is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing the risk of false positives. This can be done in several ways, the two most common being some sort of:

  • Perform many statistical tests on the data and only report those that come back with significant results.

  • Add/remove one datapoint at a time until the p-value is deemed small.

1.1 But how exactly can you p-hack, maybe without realizing it?

✅  Task

Watch this video that goes into more details on how p-hacking can occur inadvertedly for each of the two points raised above:

from IPython.display import YouTubeVideo, HTML
YouTubeVideo('HDCOUXE3HMM', width=640, height=360)
Loading...

✅  Question 1

Like Josh said in the video above, say you are testing whether one of the 100 different drugs you’ve developed actually works. That means that you perform a t-test between a placebo and each of these 100 drugs—so 100 t-tests in total.

  • If one drug returns a p-value of 0.01, why can’t you just call it a day and say you’ve found a drug that works?

  • As usual, you say that a p-value smaller than 0.05 will be considered significant. If in reality none of your 100 drugs work, how many of them would still return a significant p-value?

Explain your answers here.

✅  Question 2

We still hold on to the rule that “0.05 or lower is significant.” Which means: there is a 1 in 20 chance that whatever difference/trend/effect we measure in our data, it is due to sheer chance.

Now let’s suppose you only have developed a single drug. To check whether this drug has an effect compared to placebo, you run 20 separate statistical tests with the same data:

  • Kolmogorov-Smirnov  →  p-value = 0.3

  • Mann-Whitney  →  p-value = 0.2

  • Welch’s  →  p-value = 0.5

  • [...16 other tests...]

  • t-test  →  p-value = 0.01

All these tests check whether two samples are the same or not—we will discuss more about them near the end of the course, for the time being, think of them as t-test variations. Your data fulfills the requirements for all of them.

  • The t-test returned a small p-value. Can you just call it a day and say you’ve found a drug that works?

Explain your answer here.

✅  Question 3

Same as in Q2, you have a single drug to test. Your statistical setup looks like;

  1. You collect data from 10 placebo subjects and 10 drug-treated subjects.

  2. You randomly select with replacement only 5 and 5 (5 placebo, 5 drug-treated)

  3. Then you perform a t-test with these (the only test you perform)

  4. You repeat steps 2–3 multiple times: You randomly select another 5 and 5 and peform a t-test.

Eventually, you get a p-value of 0.01.

  • Can you just call it a day and say you’ve found a drug that works?

  • What if we were replacing without replacement?

Explain your answer here.

✅  Question 4

Similar to Q3, but you change your statistical setup to resemble bootstrap:

  1. You collect data from 10 placebo subjects and 10 drug-treated subjects.

  2. Like in bootstrap, you randomly select with replacement 10 and 10—instead of 5 and 5.

  3. Then you perform a t-test with these (the only test you perform)

  4. You repeat steps 2–3 multiple times: You randomly select another 10 and 10 and peform a t-test.

Eventually, we get a p-value of 0.02

  • Can you just call it a day and say you’ve found a drug that works?

Put your answer here.

✅  Question 5

Go back to the top of the Notebook and read again the xkcd comic.

  • Do you think the main characters p-hacked?

Justify your answer here.

✅  Question 6: Full context & multiple testing

At its core, the problem of multiple testing is that if you look for something enough times, eventually you’ll eventually find it due to sheer chance. In a wider context, some events are not as impressive—or significant—as they might appear at first.

Imagine that you go to Disneyland (or your favorite amusement park) with your family/friends. Think of the odds of the following:

  • By sheer chance, you stumble upon a relative of yours.

  • There is at least one Disneyland patron that stumbles upon a relative of theirs by sheer chance.

Which statement has a higher probability of happening (and thus it is less impressive)? Can you write an equivalent pair of statements but with jelly bean colors and acne instead of Disneyland and relatives?

Write your answers here.


2. Adjust p-values for false positives and multiple testing

Plenty of times, you’ll find yourself computing 100s if not 10,000s t-tests at a time. Common scenarios:

  • You’re checking for significant differences in yield in a panel of 100 lines of maize.

    • That means you check each of the 1000 lines against the other 99, resulting in 100×99/25100\times99/2 \approx 5k different pairs and 5k t-tests.

  • You’re dealing with control and treatment subjects. You are checking which of their 20,000 genes are differentially expressed

    • In other words, you’re looking for genes which are much more expressed in control compared to treatment, or viceversa.

    • That means you perform 20,000 t-tests

These situations are prime ground for p-hacking and for false positives: results that are reported as significant but in reality they are not so.

The good news is that we can adjust the resulting p-values so that we keep only the truly significant results. Here we will only discuss two of the most common approaches you might find out there.

2.1 The Bonferroni correction

The simplest correction is to multiply every p-value by the total number of tests. If you did mm tests—and you have mm p-values p1,p2,pmp_1,p_2,\ldots p_m, then your Bonferroni-corrected p-values are mp1,mp2,,mpmmp_1,mp_2,\ldots,mp_m. Corrected p-values are capped at 1.

For example, say that the correlation between acne and green jelly beans has an associated p-value of 0.02. You might call the press and report a significant result. But you actually checked for 20 colors. So your Bonferroni-corrected p-value actually is 20×0.02=0.420\times0.02 = 0.4. Not so impressive now, is it?

Turns out that the green jelly beans were a false positive.

✅  Question 7

Like in Q1, you are checking whether any of your 100 new drugs works better than placebo.

  • Drug A produces a p-value of 0.003

  • Drug B produces a p-value of 0.0001

  • The other 98 drugs produce p-values larger than 0.05

Assuming that any p-value smaller than 0.05 will be deemed significant, are drugs A and B true or false positives?

# feel free to use Python as a calculator
apval = 0.003
bpval = 0.0001

Justify your answer here.

The Bonferroni correction has good and bad news:

  • The good news is that it is really good at keeping at bay false positives.

  • The bad news is that it is extremely conservative: it is very prone to generate false negatives. You might find Bonferroni corrected p-values that are reported as not significant when in reality they should be considered significant.

In general, only consider Bonferroni when you have thousands (if not millions) of p-values and you only want to focus on a handful of results. For example, you are looking at 20k genes and you only want to make knock-out lines for 3–4 genes.

Out in the wild you might find Bonferroni-like corrections, like Holm or Šidák. These keep at bay false positives the same way as Bonferroni while being slightly less conservative. Keyword is slightly.

2.3 The Benjamini-Hochberg (BH) procedure aka False Discovery Rate (FDR)

A very different way to correct p-values is through the Benjamini-Hochberg procedure—sometimes announced as False Discovery Rate. This method is much less conservative than Bonferroni and should be your go to whenever you are not looking at 1000s of tests.

✅  Task

Watch this other StatQuest on how to control false positives with the Benjamini-Hochberg procedure

from IPython.display import YouTubeVideo, HTML
YouTubeVideo('K8LQSvtjcEo', width=640, height=360)
Loading...

✅  Question 8

Say you are a research assistant at MU’s UDAIR (University Data, Analytics, and Institutional Research) with access to aggregated data of student enrollment since 1839 (when MU was founded).

  • Describe a research question that would produce numerous t-tests that would then require BH adjustment

  • (Assume that your data is normal)

Describe your experiment here.


3. Practicing with switchgrass

Let’s revisit the switchgrass dataset from Ricketts et al (2023). Let’s focus on Figure 2a, where:

  • For every location and genotype, they have 5 replicates for yield data per year, and 4 years of data: sample size is thus 5×4=205\times4=20.

  • Then they check whether there are significant yield differences between different genotypes (for separate locations).

End-of-season total switchgrass biomass yield for each cultivar averaged over the 4 years of the study. Colors indicate the different cultivars and letters indicate significant differences between cultivar biomass yields within each site found using Wilcox posthoc tests (p < 0.05; n = 5 replicates × 4 years = 20). Boxplot lines represent the median, first and third quartiles, and whiskers represent the minimum and maximum values, except for outliers which are represented by circles.

Credits: Ricketts (2023)

3.1 Load and wrangle the data

We start by importing the needed packages, then load the HARVEST.tsv dataset

import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
nudge = rng.uniform(-0.15,.15, 1000)

colors = ['r', 'orange', 'forestgreen', 'palegreen', 'royalblue', 'blueviolet']
markers = ['o','^','v','s','D','>']

data = pd.read_csv('HARVEST.tsv', sep='\t')
print('There are',len(data),'rows')
data.head()
There are 360 rows
Loading...
  • Visualize the data as jitterplots: one jitter per site and genotype.

  • The central point is the mean, while the error bars are the 95% confidence interval of the mean.

Important: yield is an in-built Python statement for functions. Never ever name a variable yield.

sites = data['SITE'].unique()
genos = ['Alamo', 'Kanlow', 'Carthage', 'Liberty', 'CIR', 'Blackwell']

fs = 12
fig, ax = plt.subplots(1,len(sites),figsize=(len(sites)*3,3.5), sharex=True, sharey=True)
for i in range(len(sites)):
    ax[i].set_title(sites[i], fontsize=fs)
    ax[i].set_facecolor('snow')
    ax[i].grid(axis = 'y', zorder=1)
    ax[i].set_xticks(range(len(genos)), '')

    for j in range(len(genos)):

        # Look only at the yield data from one site and one genotype
        yields = data.loc[ (data['SITE'] == sites[i]) & (data['GENO'] == genos[j]), 'BIOMASS_AVG_Mg_ha' ]
        ci = stats.t.ppf(0.975, len(yields) - 1)*yields.sem()

        ax[i].scatter(j + nudge[:len(yields)], yields,
                      c=colors[j], marker=markers[j], alpha=0.75, ec='lightgray', label=genos[j], zorder=2)
        ax[i].errorbar(j, yields.mean(), yerr=ci, color='k', mew=1, elinewidth=1, capsize=5, mfc='w', marker='D', zorder=3)

handles,labels = ax[0].get_legend_handles_labels()
fig.legend(handles[:len(genos)], labels[:len(genos)], ncols=len(genos),
           loc = 'center', bbox_to_anchor=(0.5,0), handletextpad=0, fontsize=fs)
fig.supylabel('Biomass yield [Mg/ha]', fontsize=fs)
fig.tight_layout();
<Figure size 900x350 with 3 Axes>

✅  Task 9 [Don’t get too hung up on this Task; we’ll discuss in class later]

  • Look at the .get_legend_handles_labels function we used above.

  • Look at the fig.legend parameters that we used in the plot above.

  • Read what these parameters do in the legend documentation. If unsure, try changing them and see what happens.

  • Are there any parts of the plot code above that are confusing?

Put your three answers here.

✅  Question 10

Let’s focus just on the Missouri data.

  • Based on the visualization alone, do you think there is a cultivar that is statiscally different from the rest?

  • What two cultivars might be statistically similar?

Put your three answers here.

3.2 Missouri and the statistical differences

Ricketts et al. (2023) say early in section 2.5 (Data processing and statistical analysis):

Because our data were not normally distributed, we used non-parametric tests (Wilcox posthoc tests [sic] with Benjamini and Hochberg p-value corrections) to compare biomass yields among cultivars and ecotypes at each site regardless of year, among sites within each cultivar, and year-to-year changes in yield.

Sidenote: The authors meant the (Wilcoxon)-Mann-Whitney U-tests that we discussed in Day 19: the non-parametric alternative to Welch’s. This “typo” is probably because R calls performs the U-test with the stats::wilcox.test function.

Back to Python:

  • We write a nested loop to consider all pairs

  • Notice the start and end points: this is to avoid comparing a cultivar to itself

  • We also keep track of all the pvalues in a Series.

In case you’re wondering why a Series:

  • A Series is much better than a list or an array because we can keep additional information: what cultivar vs what cultivar are we comparing

  • A dictionary can also keep additional information, but we cannot do quick stats and math with it.

# Do a Mann-Whitney U-test between every pair of cultivars in Missouri

site = 'MO'

# Get a subDataFrame with just the Missouri data
df = data[data['SITE'] == site]
pvals = pd.Series()

for i in range(len(genos) - 1):
    # Just the yield data for one cultivar
    yi = df.loc[df['GENO'] == genos[i], 'BIOMASS_AVG_Mg_ha']

    for j in range(i+1, len(genos)):

        # Yield data for another cultivar
        yj = df.loc[df['GENO'] == genos[j], 'BIOMASS_AVG_Mg_ha']

        # Mann-Whitney U-test
        utest = stats.mannwhitneyu(yi, yj)

        pvals.loc[genos[i]+' vs '+genos[j]] = utest.pvalue
pvals
Alamo vs Kanlow 0.001227 Alamo vs Carthage 0.000687 Alamo vs Liberty 0.000687 Alamo vs CIR 0.005115 Alamo vs Blackwell 0.005115 Kanlow vs Carthage 0.394171 Kanlow vs Liberty 0.297677 Kanlow vs CIR 0.818149 Kanlow vs Blackwell 0.989209 Carthage vs Liberty 0.473481 Carthage vs CIR 0.014364 Carthage vs Blackwell 0.049864 Liberty vs CIR 0.000161 Liberty vs Blackwell 0.000305 CIR vs Blackwell 0.394171 dtype: float64

3.3 Adjusting p-values

Looking at the pvals Series and at the jitterplot above, we get small p-values whenever the jitters are visually different. And we get large p-values otherwise.

But remember: a p-value of 0.05 indicates that there is a 1 in 20 probability that whatever difference we see in the data, it was produced by sheer chance and not any underlying biology. In this case, we are making 15 comparisons in total (check len(pvals)). So maybe, just maybe, one of the “significant” p-values is a false positive.

✅  Question 11

Carthage vs Blackwell report a p-value of 0.0499, which is significant.

  • Looking at the jitterplot and their 95% CIs, do you think this is a true positive or a false positive?

Put your answer

The good news is that SciPy stats can do p-value adjustment easily with its .false_discovery_control function. Notice that it has two methods and that by defaut it uses bh, Benjamini-Hochberg. This is the method described in the FDR StatQuest above and also the one used by Ricketts et al (2023).

# Applying a Benjamini-Hochberg correction to our pvalues
# By default, it will return a NumPy array
# But we will then put the output in a Series to keep track of what comparison is which

bh_pvals = pd.Series( stats.false_discovery_control(pvals), index=pvals.index)
bh_pvals
Alamo vs Kanlow 0.003682 Alamo vs Carthage 0.002576 Alamo vs Liberty 0.002576 Alamo vs CIR 0.010961 Alamo vs Blackwell 0.010961 Kanlow vs Carthage 0.492714 Kanlow vs Liberty 0.446515 Kanlow vs CIR 0.876589 Kanlow vs Blackwell 0.989209 Carthage vs Liberty 0.546324 Carthage vs CIR 0.026932 Carthage vs Blackwell 0.083106 Liberty vs CIR 0.002286 Liberty vs Blackwell 0.002286 CIR vs Blackwell 0.492714 dtype: float64

Now Carthage vs Blackwell has an adjusted p-value of 0.08, which confirms our visual intuition that they did not look that different. We can also check that they are the only pair that went from below 0.05 to above 0.05 after the Benjamini-Hochberg adjustment.

# Check which p-vals went from < 0.05 to >= 0.05

signif = 0.05
pvals[(bh_pvals >= signif) & (pvals < signif)]
Carthage vs Blackwell 0.049864 dtype: float64

4. Relevant sidenotes

4.1 On published inaccuracies

If you look at Figure 2a from Ricketts et al (2023), it says that the switchgrass yield between Carthage and Blackwell cultivars is significantly different in Missouri.

But we just saw they are not! We even repeated the steps Ricketts themselves described!

✅  Question 12

  • What do you think went wrong? Give it your best guess.

Put your answer

4.2 On the underlying issues behind p-hacking

In a way, you can argue that much of the p-hacking you see in academic literature is due to the pervasive Publish or Perish mindset: if you don’t publish, your academic standing is at risk. However:

  • Papers must be published in reputable journals

  • Reputable journals will publish you if and only if you have significant results

  • A result is deemed significant if and only if it has an associated p-value lower than 0.05

✅  Question 13

Picture this: you are an emerging scientist, with a really cool idea that biologically speaking makes sense. You follow through with an experiment that ultimately produces a adjusted p-value of 0.08. No correlation with Carthage.

  • What would you do next?

Put your answer


Congratulations, you’re done!

Submit this assignment by uploading it to the course Canvas web page. Go to the “Pre-class assignments” folder, find the appropriate submission folder link, and upload it there.

See you in class!

© Copyright 2026, Division of Plant Science & Technology—University of Missouri

References
  1. Ricketts, M. P., Heckman, R. W., Fay, P. A., Matamala, R., Jastrow, J. D., Fritschi, F. B., Bonnette, J., & Juenger, T. E. (2023). Local adaptation of switchgrass drives trait relations to yield and differential responses to climate and soil environments. GCB Bioenergy, 15(5), 680–696. 10.1111/gcbb.13046