Data Contextualization and Wrangling

Credits: PHD Comics
Learning goals for today’s assignment
Search for, locate, and contextualize data sets from the internet and develop troubleshooting methods for loading data sets into a Jupyter notebook.
Articulate the context for a data set
Experiment with some of the ways you can load data into a Jupyter Notebook
Identify the arguments needed for different ways to load data
Practice loading a data file in need of wrangling
Assignment instructions
Work with your group to complete this assignment. Instructions for submitting this assignment are at the end of the Notebook. The assignment is due at the end of class.
1. Pre-class Review/Discussion (7-10 minutes)
In the pre-class, you explored two main ideas: data contextualization and data wrangling. You also searched for data sets on your own.
✅ Question 1
As a group, discuss:
What did you discover (or not discover) when you went to find the context for your data sets in the pre-class?
What tools did you need to load your data into your notebooks in the pre-class?
Record a list of the tools/arguments below.
Include specific examples from each of your group members’ experiences.
✎ Put your tools here!
2. Practice with Loading Data
✅ Task 2
Below are the links to four different data sets. Work together as a group to:
Obtain the data.
Load the data into your notebook.
Make at least one change to make the data more usable.
Links:
As you work, take notes on which tools/advice from Part 1 you used (you’ll want these notes to answer the discussion questions)!
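As one illustration of the kind of loading arguments you might end up needing, here is a minimal sketch using pandas `read_csv`. The file contents, the column names, and the `-999` missing-value code are all made up for this example (they do not come from any of the linked data sets), and the text is read from an in-memory string so the cell runs without downloading anything. Your own file will likely need a different combination of arguments.

```python
import io
import pandas as pd

# Hypothetical raw file: two note lines above the header, a semicolon
# delimiter, and -999 used as a missing-value code.
raw_text = """# Station readings (made-up example)
# Units: degrees C
site;year;temp_c
A;2015;12.5
A;2016;-999
B;2015;14.1
"""

df = pd.read_csv(
    io.StringIO(raw_text),  # for a real file, pass a file path or URL instead
    sep=";",                # delimiter used in this file
    skiprows=2,             # skip the two note lines above the header
    na_values=["-999"],     # treat -999 as missing
)

# One change to make the data more usable: a clearer column name.
df = df.rename(columns={"temp_c": "temperature_celsius"})
print(df)
```

If a load fails or the columns look wrong, opening the raw file and checking the delimiter, the number of header lines, and any missing-value codes is often the quickest way to work out which arguments to change.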
# put your code here to load a dataset
# change the dataset (in Python) at least once to improve the usefulness of the data
✅ Question 3
Discuss the strategies you used and describe the challenges you encountered.
✎ Summarize your challenges here!
3. Finding Data to Answer an Exploratory Question
✅ Task 4
Below are several questions that require data (and your expert Python skills) to answer. Your task is to choose one of the following questions and use internet resources to find a dataset that could answer it. You can use the sites from the pre-class to get started, but you should also try a broader search.
Q1: Which Missouri sites experienced the largest change in air quality between 2015 and 2025?
Q2: For a single grain crop, how have the exports or imports changed over time?
Q3: How has national park usage/visitation changed over time (either for a specific park or for all parks as a whole)?
Q4: In the 2015 Major League Baseball season, how many games did each team win/lose?
Q5: How many exoplanets are discovered each year (since the first discovery in 1992)?
✎ Write the exploratory question you chose here!
4. Putting it All Together
✅ Question 5
Now that you have some practice articulating data context and finding data, it’s time to combine your skills (much like you will be asked to do in your semester projects)!
Obtain a dataset that will (hopefully) answer your group’s chosen exploratory question.
Then, do your best to answer the data contextualization questions (as you did in the pre-class):
Who collected/generated the data?
How was the data collected/generated?
Who/what is included in the data?
Who/what is not included in the data?
What are the limitations or biases of the data?
✎ Write your answers to the questions above here and include the link to your dataset
✅ Task 6
Now, load the dataset into your notebook in the cell(s) below. If you need to clean up your data to be usable, put those steps here! Remember that you can also open the data file(s) in Jupyter if you are having trouble with loading/cleaning.
Make sure everyone in your group is able to get the data loaded and cleaned.
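Cleaning steps depend entirely on your file, but a minimal sketch might look like the following. The small table and its column names are invented for illustration. The calls shown (`str.strip`, `pd.to_numeric`, `dropna`) are standard pandas tools, and your data may need different ones.

```python
import pandas as pd

# Made-up table with common problems: padded column names, numbers stored
# as text with thousands separators, and a blank entry.
messy = pd.DataFrame({
    " Year ": ["2015", "2016", "2017"],
    "Visitors ": ["1,200", "", "1,450"],
})

clean = messy.copy()

# Tidy the column names: " Year " -> "year", "Visitors " -> "visitors".
clean.columns = clean.columns.str.strip().str.lower()

# Convert text to numbers; entries that cannot be parsed become NaN.
clean["year"] = pd.to_numeric(clean["year"])
clean["visitors"] = pd.to_numeric(
    clean["visitors"].str.replace(",", ""),  # remove thousands separators
    errors="coerce",
)

# Drop rows with no visitor count and renumber the remaining rows.
clean = clean.dropna(subset=["visitors"]).reset_index(drop=True)
print(clean)
```

Dropping rows is itself a choice that affects what the data can show, so it is worth noting which rows were removed and why.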
# put your code here!
✅ Question 7
Looking at your dataset, answer the following questions:
What columns do you think will be most helpful for answering your question?
What do those columns mean/stand for?
Why do you think they might answer your question?
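A quick way to get oriented before answering these questions is to inspect the column names, data types, and summary statistics. This is a minimal sketch, assuming your data is already in a pandas DataFrame; the small `df` below is a made-up stand-in for your own.

```python
import pandas as pd

# Made-up stand-in; replace with your own loaded DataFrame.
df = pd.DataFrame({
    "team": ["A", "B", "C"],
    "wins": [90, 81, 68],
    "losses": [72, 81, 94],
})

print(df.columns.tolist())  # column names
print(df.dtypes)            # data type of each column
print(df.head())            # first few rows
print(df.describe())        # summary statistics for the numeric columns
```

Column names alone often do not say what was measured or in what units, so check the documentation that came with your dataset as well.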
✎ Write your answers here
✅ Task 8
Now that your data is loaded and cleaned, you are going to make a plot (e.g. scatter plot, line plot, histogram, bar plot, etc.).
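As a starting point, here is a minimal sketch of a line plot with labeled axes and a legend using matplotlib. The values and labels are made up for illustration only; substitute the columns from your own data and choose the plot type that suits your question.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only; use columns from your own data.
years = [2015, 2016, 2017, 2018]
visitors = [1200, 1350, 1280, 1450]

fig, ax = plt.subplots()
ax.plot(years, visitors, marker="o", label="Hypothetical park")
ax.set_xlabel("Year")
ax.set_ylabel("Visitors")
ax.set_title("Visitors per year (made-up data)")
ax.set_xticks(years)  # show whole years rather than fractional ticks
ax.legend()
plt.show()
```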
# put your code here
✅ Question 9
Answer the following questions about your plot:
What is your plot showing?
What part of your exploratory question does it answer?
What information might be missing or incomplete?
How does the context of the data you articulated affect your interpretation of the plot?
✎ Write your answers here!
5. Sharing your Work: Tell me a story
As a group, prepare one slide showing the main plot that answers your research question. Your slide should have:
The names of all group members
The main question you sought to answer
Your completed plot(s) with labeled axes and legends (if appropriate)
Beyond all the Python, data science at its core is storytelling. Make sure your plot tells a story.
Make sure to explicitly discuss the choices you made in your data cleaning and plot-making process and how they affect the representation of the data and the claims you can make (e.g. What is your data showing? Who/what is missing from the data? What information does the plot not capture?). You will be asked to do this on your final project, so this is great practice! When you are making the slide, also consider what your classmates might find helpful if they encountered similar data.
Each group member is responsible for presenting some component of your group's process to the class. As you listen to other students, you are responsible for taking notes on the things that each group needed to do to get their data usable!
✎ Put your notes here!
Congratulations, you’re done!
Submit this assignment by uploading it to the course Canvas web page. Go to the “In-class assignments” folder, find the appropriate submission link, and upload it there.
See you next class!
Material drawn with permission from:
© Copyright 2025. Department of Computational Mathematics, Science and Engineering at Michigan State University
Adapted for:
© Copyright 2026, Division of Plant Science & Technology—University of Missouri