Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Day 16 In-Class Assignment

University of Missouri

✅ Put your name here.

✅ Put your group member names here.

A large amount of dry soybeans.

Credits: MU Variety Testing Program

Learning goals of today’s assignment

  • Practice fitting linear models with SciPy.

  • Assess how good is the linear model in the first place.

  • Use your model to predict oil and protein content for other soybean cultivars.

Assignment instructions

Work with your group to complete this assignment. Instructions for submitting this assignment are at the end of the notebook. The assignment is due at the end of class.


Background

Soybean is currently promoted in sub-Saharan Africa as a high-quality protein source. Breeding soybean for improved seed composition requires understanding how cultivars interact with these new environments. This study evaluated genotype by environment interactions and the genetic diversity of cultivars grown in the Pan-African Soybean Variety Trials (PATs). They identified cultivars with increased protein or oil in specific environments or stable performance across environments. Their results provide valuable resources for improving soybean seed composition in sub-Saharan Africa. Work supported by USAID’s Feed the Future Soybean Innovation Lab and by the University of Missouri Northern Soybean Breeding Program.

Linear relationship between protein and oil content for various African soybean cultivars.

Credits: De Meyer et al (2024)

More details in:

De Meyer, E., Prenger, E., Mahmood, A., da Fonseca Santos, M., Chigeza, G., Song, Q., Mwadzingeni, L., Mukaro, R., Chibanda, M., Mabuyaye, G., Diers, B., & Scaboo, A. (2024) Evaluating genetic diversity and seed composition stability within Pan-African Soybean Variety Trials. Crop Science, 64, 3272–3292.

✅  Question 1

  • What is reflected on the x-axis?

  • What is reflected on the y-axis?

  • In your own words, what information do you get out of this figure?

Put your answer here


1. Setting everything up

Before jumping straight into linear regressions, we need to go through the usual setup steps.

✅  Task 2

  • Import the usual suspects: NumPy, pandas, matplotlib, and Scipy’s stats

# Remember that it is always good coding practice to import your libraries at the top before doing anything else.
# Import the rest of modules

import numpy as np

✅  Task 3

  • With pandas, load the dataset PATSA_final_QCd_rep_data.csv as a variable data

  • Drop all the pesky rows with NaNs.

  • Display its first few rows to make sure everything worked as expected.

# Load the data

# data = ???

✅  Question 4

Put your answer here.

✅  Task 5

  • With matplotlib, make a scatterplot of oil vs protein content for all the samples in the data.

  • Make sure to label your axes

You’ll notice that your scatterplot is a bit different from the one published by De Meyer et al. In general, all your correlation and regression values will be a bit off, but just a bit. That’s because we are skipping an averaging step. We’ll fix that later.

# Your plot

✅  Task 6

  • With stats.pearsonr, compute and print the Pearson correlation between oil and protein content and its associated p-value

# Pearson correlation

✅  Question 7

  • Looking at both the plot, the correlation coefficient, and its p-value, do you think that oil and protein content are correlated?

  • If so, do you think it makes sense to fit a linear model?

Put your answer here.


2. Predicting values with linear regression and stats.linregress

Now it is your turn to compute linear regressions as in the Pre-Class

✅  Task 8

  • Compute a linear regression between protein and oil soybean content. Remember that order matter!

  • Is your reported linear formula close to the one in De Meyer et al (2024)?

# Your regression

✅  Question 9

Based solely on the slope and intercept values:

  • How many grams does oil content increase (or decrease) for every protein gram gained by the soybean?

  • Theoretically speaking, what is the maximum possible oil content a soybean can ever achieve?

Put your answer here.

✅  Question 10

Compare the linregress reported values on Peason’s correlation coefficient and p-value versus those reported by pearsonr

  • Are they the same. Just type YES/NO

The cool thing of linregress is that it computes correlation at the same time!

Put your answer here.

✅  Task 11

Now plot the actual best-fit line.

  • Comment what the array pred_protein below is.

  • Copy/paste your scatterplot code from Task 5.

  • Draw the best-fit line

# Comment this line
# Replace `data` for whatever variable you loaded in T2

pred_protein = np.linspace(0.95*data['PROTEIN_g/kg'].min(), 1.05*data['PROTEIN_g/kg'].max(), 50)

# Finish the plot
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[6], line 4
      1 # Comment this line
      2 # Replace `data` for whatever variable you loaded in T2
----> 4 pred_protein = np.linspace(0.95*data['PROTEIN_g/kg'].min(), 1.05*data['PROTEIN_g/kg'].max(), 50)
      6 # Finish the plot

NameError: name 'data' is not defined

✅  Question 12

Compare the linregress reported values on Peason’s correlation coefficient and p-value versus those reported by pearsonr

  • Are they the same. Just type YES/NO

The cool thing of linregress is that it computes correlation at the same time!

Put your answer here.

✅  Task 13: Predicting Values

Based on you linear model:

  • What would be the oil content of a soybean cultivar with 500g of protein? 250g?

# Write some code to answer the two questions

✅  Task 14: the other way around

Based on you linear model:

  • What would be the protein content of a soybean cultivar with 150g of oil?

  • What is the theoretical limit of protein cotent for a soybean?

# Write some code to answer the questions above
# Feel free to add more cells if needed

3. R2R^2 to assess goodness-of-fit (using with sklearn.metrics.r2_score)

We have computed a linear model, but how much does that model actually tell us? Remember that the R2R^2 score tells us how much of the yy-axis variance is explained by the xx-axis. And you guessed it, Python can compute the R2R^2 score for us! While you don’t need to remember the formulas (because Python does it for us) it is crucial that you know how to interpret the results.

To do the computations, we need to import the submodule metrics from the module Scikit-Learn. Scikit-Learn is Python’s main module to do anything machine learning related and more! This semester we will not delve into it, but it will be a recurring friend in the next course (just saying 😉.)

In this case, we want to know how much of the oil content in a soybean can be explained by its protein content.

✅  Task 15

  • Make a variable oil_pred with the predicted oil content values for all the protein content values present in the data based on your linear model.

  • Then compute the R2R^2 score of the linear model with metrics.r2_score.

Important: the order of the arguments does matter! Read more about the function here.

# import the necessary submodule
from sklearn import metrics

# compute the R2 score

4. Averaging values: wrangling practice

As mentioned earlier, the regression values are a bit off compared to DeMeyer et al (2024). That is because the paper considers averages across genotype and environment. In the raw data we’re using, we’re considering replicates. Let’s fix it!

4.1 Pandas, .loc, and masking reminder

During the pre and in-class 12 we went through masking. As a quick reminder

# To get all the rows where certain column matches a value
data[ data['column name'] == value ]

# To get all the rows where two certain columns match two values
data[ (data['column name 1'] == value1) & (data['column name 2'] == value2) ]

# To get all the rows where two certain columns match two values
data[ (data['column name 1'] == value1) & (data['column name 2'] == value2) & (data['column name 3'] == value3)]
# And so on and so forth

# To get all the rows where one OR two certain columns match one OR two values
data[ (data['column name 1'] == value1) | (data['column name 2'] == value2) ]
# Notice that `&` means AND, while `|` means OR

We can combine masking with .loc to just get specific column(s) instead:

# get a column where two other certain columns match two values
data.loc[ (data['column name 1'] == value1) & (data['column name 2'] == value2) , 'column name 3' ]

# get two columns where two other certain columns match two values
data.loc[ (data['column name 1'] == value1) & (data['column name 2'] == value2) , ['column name 3', 'column name 4'] ]

✅  Task 15

  • With the gen and env variables below, get the protein content of all the samples from Bimha genotype grown in environment Type 1.

  • Then use .mean() to compute the mean protein content

# Mask and .loc practice
gen = 'BIMHA'
env = 'ENV1'

We can also keep a summary of all the means for all the genotype-environment combinations. This summary can be a dataframe. Its entries are written with .loc.

✅  Task 16

  • Comment every line of the code below

  • If you are unsure what a line does, copy/paste it in another cell and print its contents

    • Also duckduckgo for documentation

  • Compare the values in the DataFrame with those in Figure 3a from De Meyer et al (2024).

# Comment every line below

genotypes = data['GEN'].unique()
environments = data['ENV'].unique()
print(genotypes, environments, sep='\n\n')

protein = pd.DataFrame(0., index=genotypes, columns=environments)
for gen in protein.index:
    for env in protein.columns:
        protein.loc[gen, env] = data.loc[ (data['GEN'] == gen) & (data['ENV'] == env) , 'PROTEIN_g/kg'].mean()

mean_protein = protein.to_numpy().flatten()
protein
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 3
      1 # Comment every line below
----> 3 genotypes = data['GEN'].unique()
      4 environments = data['ENV'].unique()
      5 print(genotypes, environments, sep='\n\n')

NameError: name 'data' is not defined

✅  Task 17

  • Make a summary dataframe oil and its flattened array mean_oil for average oil content across genotypes and environments.

  • You can either copy/paste the code above and edit; or simply add a few lines to the existing code

# 

5. Recomputing the model

Now that we have average protein and oil content, let’s reproduce that actual De Meyer et al (2024) values. You’ll just need to copy paste your code from above and edit accordingly.

✅  Task 18

Using mean_protein and mean_oil arrays:

  • Compute the best-fit line and print the model values

  • Print the correlation coefficient, its p-value, and the R2R^2 score

  • Make a scatter plot of oil vs protein content and the best-fit line.

# Your code
# Add more cells below if needed

Congratulations, you’re done!

Submit this assignment by uploading it to the course Canvas web page. Go to the “In-class assignments” folder, find the appropriate submission link, and upload it there.

See you next class!

© Copyright 2026, Division of Plant Science & Technology—University of Missouri

References
  1. De Meyer, E., Prenger, E., Mahmood, A., da Fonseca Santos, M., Chigeza, G., Song, Q., Mwadzingeni, L., Mukaro, R., Chibanda, M., Mabuyaye, G., Diers, B., & Scaboo, A. (2024). Evaluating genetic diversity and seed composition stability within Pan‐African Soybean Variety Trials. Crop Science, 64(6), 3272–3292. 10.1002/csc2.21356