Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Day 26 Pre-Class Assignment

University of Missouri

✅  Put your name here

Non-linear curve-fitting and assessing your fit

An illustration of several plots of the same data with curves fitted to the points, paired with conclusions that you might draw about the person who made them.

Credits: xkcd.com

Learning goals for today’s pre-class assignment

  • Use SciPy’s optimize.curve_fit function to compute various kinds of regressions.

  • Fit and interpret confidence bands and intervals to your linear prediction.

  • Understand how confidence bands are similiar and different compared to confidence intervals

Assignment instructions

This assignment is due by 11:59 p.m. the day before class, and should be uploaded into the appropriate “Pre-class assignments” submission folder. If you run into issues with your code, make sure to use Slack to help each other out and receive some assistance from the instructors. Submission instructions can be found at the end of the notebook.

Make all the needed imports before anything else

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn import metrics

1. Univariate models and least squares optimization

In the past classes, we have worked with univariate models: we have one independent variable in the xx axis—like protein content or femur circumference—that we use to predict one response variable in the yy axis—like oil content or body mass.

So far we have focused on linear models. Which means we are looking for constant numbers aa and bb that make the best fit:

y^=ax+b,\hat{y} = ax + b,

which gives us predicted responses y^\hat y (notice the hat symbol ^{\wedge} on top of it.)

1.1 Mathematically speaking, what makes a line the best-fit line: least squares optimization

Usually in our data, we have x1,,xnx_1,\ldots,x_n independent values alongside corresponding y1,,yny_1,\ldots,y_n response variables. Any pair of a,ba,b values will give us a set of predictions y1^,,yn^\widehat{y_1},\ldots,\widehat{y_n}. We are looking specifically for the aa and bb so that most of the y^\hat y’s are as close as possible to the yy’s most of the time.

In other words, we are looking for a,ba,b so that the least squares function is as small as possible:

Least Squares(a,b)=i=0n(yiyi^)2\text{Least Squares}(a,b) = \sum_{i=0}^n\left(y_i - \widehat{y_i}\right)^2

In mathematical lingo, we want to optimize the least squares function. Keep in mind that the least squares method works in general for any parametric function, not just lines. More on that later.

1.2 Beyond linear models

We have dealt with power-law models from allometry, where we look for the best aa and bb numbers that make the best fit:

y^=axb.\hat{y} = ax^b.

In the power-law case, we were able to linearize it. That is, we can make some mathematical manipulations so it looks more like a linear model. Namely, we can log-transform both sides of the equation to get a linear-looking model:

log(y)^=blog(x)+log(a).\widehat{\log(y)} = b\log(x) + \log(a).

In all those cases, we can use stats.linregress, which also gives us Pearson’s rr correlation coefficient and its associated p-value so we can assess how good this linear fit is in the first place.

Sometimes, linearization is too difficult, if not straight impossible. For example, in general, we cannot linearize a quadratic model

y=ax2+bx+c.y = ax^2 + bx + c.

In such cases, we can use the optimize sub-module from SciPy and its curve_fit fuction! This function optimizes the Least Squares(a,b,c)\text{Least Squares}(a,b,c) function.

Let’s load it! Notice that we can load several sub-modules from SciPy at once.

from scipy import stats, optimize

2. A quadratic example: Nitrogen versus biomass accrual in wild spinach

Allometry is not limited to animals but it explains plant development as well. Let's look at Chenopodium album (wild spinach) data and its relationship between nitrogen accrual and plant age. The data comes from Bernacchi et al (2007), where they look at a quadratic allometric relationship:

log10(N mass)=a×age2+b×age+c.\log_{10}(\text{N mass}) = a\times\text{age}^2 + b\times\text{age} + c.

Notice that the proposed best-fit “lines” in the figure below are not really lines but parabolas: a degree-2 polynomial instead of a degree-1 polynomial (a line).

Whole-plant N mass (ln g) accrual for Abutilon theophrasti, Chenopodium album, Amaranthus retroflexus and Polygonum pensylvanicum individuals, over a 57 d period, in response to varying light, nutrient, water or CO2. Low, moderate and high levels of resources are designated by downward triangle, circle and upward triangle symbols, respectively. Lines represent second-order polynomial curves fit to each resource level.

Credits: Bernacchi et al (2007)

More information:

Bernacchi, CJ, Thompson, JN, Coleman, JS and McConnaughay, KDM (2007) Allometric analysis reveals relatively little variation in nitrogen versus biomass accrual in four plant species exposed to varying light, nutrients, water and CO2. Plant, Cell & Environment, 30: 1216–1222.

2.1 Loading and masking the data

  • We first load the data set Bernacchi et al 2007.xls and make sure it looks good.

data = pd.read_excel('Bernacchi et al 2007.xls')
print('The dataset contains', data.shape[0], 'rows and', data.shape[1], 'columns.')
data.head()
The dataset contains 720 rows and 9 columns.
Loading...
  • We will focus just on the data for wild spinach (Chenopodium) under different light conditions.

species, gradient = 'CHENOPODIUM', 'LIGHT'
#species, gradient = 'ABUTILON', 'WATER'

df = data[(data['SPECIES'] == species) & (data['GRADIENT'] == gradient)]
df.head()
Loading...
  • And to keep things even smaller, let’s start by looking at those samples subject to medium light conditions.

  • We want to relate plant age (independent variable) to log Nitrogen mass (response variable).

  • Let’s make a scatter plot to have a better idea what we are dealing with.

level = 'MEDIUM'
xvals = df.loc[df['LEVEL'] == level, 'AGE']
yvals = np.log(df.loc[df['LEVEL'] == level, 'N mass (g)'])

# Make a plot
fs = 12
fig, ax = plt.subplots(figsize=(4,3))
ax.scatter(xvals, yvals, c='b', marker='.', label=level.title())
ax.set_xlabel('Age', fontsize=fs)
ax.set_ylabel('Ln (N mass)', fontsize=fs)
ax.set_title(f'{species.title()} ({gradient.title()})' , fontsize=fs)
ax.legend(loc='lower right');
<Figure size 400x300 with 1 Axes>

3. My first optimize.curve_fit

The optimize.curve_fit function can be used to fit any parametric function to our data via least squares. We just need to define the function we want to fit. Do you still remember how to define functions?

3.1 Another way to compute linear models

  • Let’s keep things simple and do a linear fit first

  • We’ll need to define a linear function

The model function, f(x,...) must take the independent variable as the first argument and the parameters to fit as separate remaining arguments.

  • And we then ask curve_fit to figure out the best parameters that fit the given data.

In this case, a linear function consists of two parameters, a and b (slope and intercept, respectively).

# Linear function to optimize
# The x-axis parameter goes first
# Followed by the rest of parameters we want to estimate: slope and intercept
def linear(x, a, b):
    return a*x + b

# Get the optimized linear parameters
# And the covariance matrix
lopt, lcov = optimize.curve_fit(linear, xvals, yvals)
print('The linear model has parameters:\t', lopt)
The linear model has parameters:	 [  0.13554295 -11.30587277]
  • According to the optimized parameters, we have the equation

    logN=0.14×Age11.3.\log\text{N} = 0.14\times\text{Age} - 11.3.
  • We can verify that those are indeed the same parameters we would get with linregress:

Note: You don’t need to worry too much about the lcov (covariance matrix) for this assignment, but you are encourage to read the documentation to understand what it encodes and how it can be useful.

linregress = stats.linregress(xvals, yvals)
print('Slope:    \t',linregress.slope)
print('Intercept:\t',linregress.intercept)
Slope:    	 0.13554294763861865
Intercept:	 -11.30587276528547
  • With the function and the optimized parameters, we can predict N biomasses for our recorded age values.

  • We can also check how good this model is with the R2R^2 coefficient of determination.

# Predict y-axis values
pred_y = linear(xvals, *lopt)

# Compare the true biomass values against the predicted values
lr2 = metrics.r2_score(yvals, pred_y)
print('R2:\t', lr2)
R2:	 0.8426373326334833
  • According to our lineal model, plant age explains 84% of the variation of the log N biomass. That is pretty good!

But we should never trust numbers blindly (remember the Anscombe’s quartet): we should make plots to verify that things indeed look linear.

  • Define an increasing sequence of x-axis values, from the youngest to the oldest age value

  • Then predict the y-axis values associated to such sequence

  • This will help us then plot the best fit line instead of using axline. Later, we will plot parabolas, which cannot be drawn with axline.

# Sequence of 100 ascending age values
xaxis = np.linspace(xvals.min(), xvals.max(), 100)
lyaxis = linear(xaxis, *lopt) # predicted linear values for these 100 ages

# Plotting everything

fig, ax = plt.subplots(figsize=(4,3))
ax.set_facecolor('snow')
ax.scatter(xvals, yvals, c='b', marker='.')
ax.plot(xaxis, lyaxis, c='forestgreen', label=f'$R^2$ = {lr2:.2f}')
ax.set_xlabel('Age', fontsize=fs)
ax.set_ylabel('Ln (N mass)', fontsize=fs)
ax.set_title(f'{species.title()} ({gradient.title()})' , fontsize=fs)
ax.legend(loc='upper left');
<Figure size 400x300 with 1 Axes>
  • Can we do better?

3.2 Quadratic models: drawing parabolas

Bernacci et al (2007) originally proposed a quadratic model (ax2+bx+cax^2+bx+c) instead of a linear one. Let’s use curve_fit!

  • Notice that we are essentially copy/pasting the code we already have for the linear model.

  • We are just editing the original curve function and renaming the variables.

In this case, our quadratic model must fit three parameters: a, b, and c.

# Quadratic function to optimize
# The x-axis parameter goes first
# Followed by the rest of parameters we want to estimate: slope and intercept
def quadratic(x, a, b, c):
    return a*x**2 + b*x + c

# Get the optimized quadratic parameters
qopt, qcov = optimize.curve_fit(quadratic, xvals, yvals)
print('The quadratic model has parameters:\t', qopt)

# Predict y-axis values
pred_y = quadratic(xvals, *qopt)

# Compare the true biomass values against the predicted values
qr2 = metrics.r2_score(yvals, pred_y)

qyaxis = quadratic(xaxis, *qopt) # predicted quadratic values for the previously defined 100 ages

# Plotting everything

fig, ax = plt.subplots(figsize=(4,3))
ax.set_facecolor('snow')
ax.scatter(xvals, yvals, c='b', marker='.')
ax.plot(xaxis, qyaxis, c='m', label=f'$R^2$ = {qr2:.2f}')
ax.set_xlabel('Age', fontsize=fs)
ax.set_ylabel('Ln (N mass)', fontsize=fs)
ax.set_title(f'{species.title()} ({gradient.title()})' , fontsize=fs)
ax.legend(loc='upper left');
The quadratic model has parameters:	 [ 1.87508600e-03  3.51781749e-02 -1.02958688e+01]
<Figure size 400x300 with 1 Axes>
  • The quadratic curve looks a bit better than the line

  • With our quadratic model, age explains 86% of the variance observed in N biomass. That’s a slight improvement!


4. Fitting cubic functions, or the dangers of overfitting

What if we fit a cubic model instead (ax3+bx2+cx+dax^3 + bx^2 + cx + d). Will that improve our results even further?

  • Notice that we just copy/pasted the code from above and made some edits

In this case, our cubic model uses four parameters.

# Cubic function to optimize
# The x-axis parameter goes first
# Followed by the rest of parameters we want to estimate: slope and intercept
def cubic(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

# Get the optimized quadratic parameters
copt, ccov = optimize.curve_fit(cubic, xvals, yvals)
print('The cubic model has parameters:\t', copt)

# Predict y-axis values
pred_y = cubic(xvals, *copt)

# Compare the true biomass values against the predicted values
cr2 = metrics.r2_score(yvals, pred_y)

cyaxis = cubic(xaxis, *copt) # predicted cubic values for the previously defined 100 ages

# Plotting everything

fig, ax = plt.subplots(figsize=(4,3))
ax.set_facecolor('snow')
ax.scatter(xvals, yvals, c='b', marker='.')
ax.plot(xaxis, cyaxis, c='gray', label=f'$R^2$ = {cr2:.2f}')
ax.set_xlabel('Age', fontsize=fs)
ax.set_ylabel('Ln (N mass)', fontsize=fs)
ax.set_title(f'{species.title()} ({gradient.title()})' , fontsize=fs)
ax.legend(loc='upper left');
The cubic model has parameters:	 [-1.95971672e-04  1.79670645e-02 -3.46266991e-01 -7.84297931e+00]
<Figure size 400x300 with 1 Axes>
  • The cubic curve looks even better than the parabola.

  • With our quadratic model, age explains 89% of the variance observed in N biomass. Nice!

✅  Question 1

Let’s assume you are part of the Nitrogen Biomass Lab. Of the three models we have seen so far, which model would you use?

Put your answer here.

4.1 Take a step back: are tightly-fitted curves actually good?

Remember that the true power of regression is that we can predict values—N biomass in this case—even when we don’t have those actual measurements.

  • In general, the more parameters we fit, the resulting curve will follow closer our datapoints.

  • But this risks overfitting: a model that can perfectly predict our recorded age values but it can be wildly wrong for age values we don’t have. This defeats the whole purpose of regression!

  • A good regression model should be a compromise between simplicity and information you can extract from it.

Let’s check how our three curves behave whenever we look a younger and older plants. Let’s look at age values from 1 day old to 70.

  • We draw vertical dashed lines to indicate which values are being extrapolated.

  • We also put some things in a list to loop later

# New sequence of day values
xaxis = np.linspace(1, 70, 200)

curve_function = [linear, quadratic, cubic]
opt = [lopt, qopt, copt]
color = ['forestgreen', 'm', 'gray']
title = ['linear', 'quadratic', 'cubic']

fig, ax = plt.subplots(1, len(opt), figsize=(12,3.5), sharex=True, sharey=True)
for i in range(len(ax)):
    yaxis = curve_function[i](xaxis, *opt[i])

    ax[i].set_facecolor('snow')
    ax[i].scatter(xvals, yvals, c='b', marker='.')
    ax[i].plot(xaxis, yaxis, c=color[i])
    ax[i].axvline(xvals.min(), c='k', ls='--', lw=0.5)
    ax[i].axvline(xvals.max(), c='k', ls='--', lw=0.5)
    ax[i].set_title(f'{title[i].title()} model')

#fig.suptitle(f'{species.title()} ({gradient.title()})' , fontsize=fs)
fig.supxlabel('Age', fontsize=fs)
fig.supylabel('Ln (N mass)', fontsize=fs)
fig.tight_layout()
<Figure size 1200x350 with 3 Axes>

✅  Question 2

  • What do you notice with the curves once we try to predict N biomass values for older plants (plants older than 50 days)?

  • Does this change your answer from Q1?

Put your answer here.

4.2 Confidence bands—or how you should never use cubic or higher polynomial models

(Unless you have a really good domain-knowledge reason to do so)

In general, you should avoid fitting polynomial curves of degree 3 (cubic) or higher. They produce really nice curves, but their behavior is extrememly erratic outside your observed range. Even if you don’t care about extrapolating, these high-degree polynomial are very sensitive to noise. And any real-life data will be noisy.

To illustrate the point better, let’s look at the 95% confidence bands.

Unlike the linear models, these confidence bands are not so simple to compute. The math behind it—the delta method—goes well beyond the scope of this course.

However, we still have a basic tool to get some sort of bands: bootstrap!

  1. Like in In-Class 22, we get a bootstrapped paired sample (x1, y1), ..., (xN, yN)

  2. Recompute the best-fit curve

  3. Record yaxis: the predicted values for xaxis

  4. Repeat steps 1–3 for N = 1000 times

  5. Get the 0.025- and 0.975-th quantiles of all the predicted values (corresponding to 95% of confidence).

# The following might take a bit to run
# Bootstrapping for confidence bands for the three models
rng = np.random.default_rng(42)
N = 1000

# Arrays where the predicted values will be stored
lboot = np.zeros((N, len(xaxis)))
qboot = np.zeros((N, len(xaxis)))
cboot = np.zeros((N, len(xaxis)))

# Get bootstrapped indices, like in Day 22
idx = rng.integers(0, len(xvals), size=(N, len(xvals)) )

for i in range(N):
    # Bootstrapped x and y values
    x = xvals.iloc[idx[i]]
    y = yvals.iloc[idx[i]]

    # New curves fitted to these bootstrapped samples
    lopt_, _ = optimize.curve_fit(linear, x,y)
    qopt_, _ = optimize.curve_fit(quadratic, x,y)
    copt_, _ = optimize.curve_fit(cubic, x,y)

    # Predict curve values
    lboot[i] = linear(xaxis, *lopt_)
    qboot[i] = quadratic(xaxis, *qopt_)
    cboot[i] = cubic(xaxis, *copt_)

# Get the 0.025 and 0.975 quantiles (corresponding to 95% confidence)
lq = np.quantile(lboot, [0.025, 0.975], axis=0)
qq = np.quantile(qboot, [0.025, 0.975], axis=0)
cq = np.quantile(cboot, [0.025, 0.975], axis=0)
  • Finally, we plot the 95% confidence bands

boot = [lq, qq, cq]

fig, ax = plt.subplots(1, len(opt), figsize=(12,3.5), sharex=True, sharey=True)
for i in range(len(ax)):
    yaxis = curve_function[i](xaxis, *opt[i])

    ax[i].set_facecolor('snow')
    ax[i].scatter(xvals, yvals, c='b', marker='.')
    ax[i].fill_between(xaxis, boot[i][0], boot[i][1], alpha=0.25, color=color[i], ec=None, zorder=1)
    ax[i].plot(xaxis, yaxis, c=color[i])
    ax[i].axvline(xvals.min(), c='k', ls='--', lw=0.5)
    ax[i].axvline(xvals.max(), c='k', ls='--', lw=0.5)
    ax[i].set_title(f'{title[i].title()} model')
    ax[i].set_ylim(-12.5, 2)

fig.supxlabel('Age', fontsize=fs)
fig.supylabel('Ln (N mass)', fontsize=fs)
fig.tight_layout()
<Figure size 1200x350 with 3 Axes>

✅  Question 3

  • What do you notice with the 95% confidence bands once we try to predict N biomass values for older plants (plants older than 50 days)?

  • Does this change your answer from Q1?

Put your answer here.


Congratulations, you’re done!

Submit this assignment by uploading it to the course Canvas web page. Go to the “Pre-class assignments” folder, find the appropriate submission folder link, and upload it there.

See you in class!

© Copyright 2026, Division of Plant Science & Technology—University of Missouri

References
  1. BERNACCHI, C. J., THOMPSON, J. N., COLEMAN, J. S., & MCCONNAUGHAY, K. D. M. (2007). Allometric analysis reveals relatively little variation in nitrogen versus biomass accrual in four plant species exposed to varying light, nutrients, water and CO2. Plant, Cell & Environment, 30(10), 1216–1222. 10.1111/j.1365-3040.2007.01698.x