Day 12 Pre-Class Assignment

✅ Put your name here

Masking

Learning goals for today’s assignment

Practice using masks with NumPy arrays

Assignment instructions

Watch the videos below, do the readings linked to below the videos, and complete the assigned programming problems. Please get started early, and come to office hours if you have any questions! Make use of Slack as well!

This assignment is due by 11:59 p.m. the day before class and should be uploaded to the appropriate “Pre-class assignments” submission folder. Submission instructions can be found at the end of the notebook.

1. Masking

Masking is an extremely important and absurdly useful tool that we will be adding to our coding toolbox. Fundamentally, “masking” is a process that allows us to select the specific parts of our data that meet some condition. Let's work through some examples to better understand what this means.
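As a quick preview, here is a minimal sketch of the idea on a small made-up array. The names temps, is_hot, and hot_temps, as well as the values, are illustrative assumptions and not part of the assignment data.

```python
import numpy as np

# Hypothetical temperatures (made-up values, for illustration only)
temps = np.array([18, 25, 31, 22, 35])

# Comparing an array to a number gives one True/False per element
is_hot = temps > 30

# Indexing with the Boolean array keeps only the elements marked True
hot_temps = temps[is_hot]

print(is_hot)
print(hot_temps)
```

The comparison does not change temps itself; it only builds a second array of the same length that records which elements meet the condition.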

1.1 Masking and numbers

Let's start by looking at the following set of numbers.

import numpy as np
vals = np.array([3, 11, 6, 9, 7, 12, 8, 11, 5, 3, 15, 13])

✅ Task 1

Write code that uses a for loop and an if statement to identify all values of vals that are above 8.
Append the values that meet this condition to a new list
Print out this new list

# Insert your code here

What we just did was select a subset of our data; specifically, all values in vals greater than 8. Now, let's do the same thing using masking. The code below creates a mask and then uses that mask to select all of the data points that meet our condition.

mask = vals > 8
vals_masked = vals[mask]

Let's break this code down, piece by piece.

✅ Task 2

Print out mask.

# Insert your code here

✅ Question 3

You should see an array of Boolean values (True or False).

Compare your mask values to the values of vals. What do you notice?

✎ Write your answer here

✅ Task 4

Now print out the values of vals_masked.

# Insert your code here

✅ Question 5

What values does vals_masked contain?
How are these values connected to vals, mask, and the list that you created for Task 1?

✎ Write your answer here

✅ Task 6

Try tinkering with the mask generation code (the mask = vals > 8 line) by changing the condition. For example:
- values below 8
- values above 15
- values equal to 11
- etc.
Print out the mask and vals_masked values for each one to convince yourself that the “masks” you create match your expectations.

# Insert code here

✅ Question 7

Take a moment and reflect.

What is a mask?
What does the process of masking data or values do?

✎ Write your answer here

✅ Question 8

Before moving on, explain the concept of masking as if you were talking to someone who has never coded before.

✎ Write your answer here

1.2 Masking and Pandas

As we mentioned earlier, masking is absurdly useful. One of the many reasons is that it is not limited to numbers. You can also mask strings! We’ll be doing a lot of string and number masking in the upcoming assignments.

As an example, let’s look at a dataset similar to the ones you analyzed in the last in-class assignment: mean concentration of cannabinoids in different tissues for cattle fed with hemp. We have concentration values (in ng/g) based on the number of days since the last cannabis dosage.

✅ Task 9:

Make sure you have downloaded the tissues_mean.csv file and placed it in the same folder as this Notebook.

Import the Pandas library
With Pandas, load the CSV file into a DataFrame called tconc
Display the first few lines (remember which Pandas function does that?)

# Your code here

#import
#tconc = pd.

This file is actually all the previous tissue files but combined into a single CSV.

This is very common in data analysis: rather than looking at individual files (one per tissue, species, treatment, etc.), you’ll be looking at a single, combined file. Some of the columns of this single file will tell you exactly which rows correspond to which treatment, tissue, or plant.

Just like with NumPy, we can mask with Pandas. Say we want to focus just on “Kidney” tissue.

# The tissues are listed in the "Tissue" column
print(tconc['Tissue'])


# We then make a mask, like with numbers
# Remember that "==" checks whether two values are equal
mask = tconc['Tissue'] == 'Kidney'
print(mask)


# Mask the dataframe
tconc[mask]


# We usually combine the two cells above into a single one
tconc[ tconc['Tissue'] == 'Kidney' ]
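If you do not have tissues_mean.csv loaded yet, the same steps can be tried on a tiny stand-in table. This is only a sketch: the DataFrame demo and every value in it are made up for illustration and are not the real measurements.

```python
import pandas as pd

# Hypothetical stand-in for tconc: made-up values, NOT the real data
demo = pd.DataFrame({
    'Tissue':   ['Kidney', 'Liver', 'Kidney', 'Liver'],
    'Time (d)': [1, 1, 3, 3],
    '9-THC':    [0.5, 1.2, 0.3, 0.8],
})

# One True/False per row, depending on the value in the 'Tissue' column
mask = demo['Tissue'] == 'Kidney'

# Keep only the rows where the mask is True
kidney_rows = demo[mask]
print(kidney_rows)
```

Notice that the selected rows keep their original row labels (0 and 2 here); masking does not renumber them.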


✅ Task 10:

Use tconc and masking for the following:

In one cell, display the data corresponding to the liver tissue
In another cell, display the data corresponding to cannabinoid concentration after 3 days

# Your code here. Add new cells below if needed.

2. Accessing Pandas values with `.loc` and `.iloc`, revisited

In the last session you saw how we could access row values in a DataFrame using either the row names with .loc or the row index numbers with .iloc. We can also use either .loc or .iloc to access column values.

But we can also use .loc or .iloc to access row and column values simultaneously by doing

data.loc[ <row names> , <column names> ]

OR

data.iloc[ <row indices> , <column indices> ]

For example, to get the rows 5-10 and the first 4 columns, we can do:

# With index numbers and .iloc
tconc.iloc[ 5:11, :4 ]


# With row/column names and .loc
tconc.loc[ [5,6,7,8,9,10], ['Tissue','Time (d)','8-THC','9-THC']]


# Alternative ONLY BECAUSE the row names are just sequential numbers
tconc.loc[ range(5,11), ['Tissue','Time (d)','8-THC','9-THC']]
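One detail worth checking for yourself is how the two accessors treat the end of a slice. The sketch below uses a small stand-in table; the name demo and all of its values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in table with made-up values
demo = pd.DataFrame({
    'Tissue':   ['Kidney', 'Liver', 'Muscle', 'Fat', 'Kidney', 'Liver'],
    'Time (d)': [1, 1, 1, 1, 3, 3],
    '9-THC':    [0.5, 1.2, 0.1, 2.0, 0.3, 0.8],
})

# .iloc uses positions: the stop value (4) is excluded, as in Python slicing
by_position = demo.iloc[1:4, :2]

# .loc uses labels: both ends of a label slice (1 and 3) are included
by_label = demo.loc[1:3, ['Tissue', 'Time (d)']]

# Both selections contain rows 1, 2, 3 and the first two columns
print(by_position.equals(by_label))
```

The two selections match only because the row labels here are the default sequential numbers, so each label coincides with its position.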


2.1 Masking and `.loc`

The real power of selecting rows and columns at the same time shows up when we combine .loc with masking.

data.loc[ <row condition/mask> , <column names> ]

For example, say we want to look at the entries corresponding to kidney tissue and Δ8-THC concentration. We know that the mask for the first part is tconc['Tissue'] == 'Kidney'. We know that the name of the column is '8-THC'. And so we do:

# Using .loc to look specifically at 8-THC concentration
tconc.loc[ tconc['Tissue'] == 'Kidney' , '8-THC' ]


We can look at several columns at once if we want to: we just need to list all the column names we are interested in.

# Using .loc to get more cannabinoids
tconc.loc[ tconc['Tissue'] == 'Kidney' , ['Time (d)', '9-THC', 'CBLA'] ]
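The shape of the result depends on whether we pass a single column name or a list of names. Here is a minimal sketch on a stand-in table; demo, is_kidney, and all the values are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in table with made-up values
demo = pd.DataFrame({
    'Tissue':   ['Kidney', 'Liver', 'Kidney', 'Liver'],
    'Time (d)': [1, 1, 3, 3],
    '9-THC':    [0.5, 1.2, 0.3, 0.8],
    'CBD':      [2.0, 4.1, 1.5, 3.2],
})

is_kidney = demo['Tissue'] == 'Kidney'

# A single column name returns a one-dimensional Series
one_column = demo.loc[is_kidney, '9-THC']

# A list of column names returns a DataFrame
several_columns = demo.loc[is_kidney, ['Time (d)', '9-THC']]

print(type(one_column).__name__)
print(type(several_columns).__name__)
```

Storing the mask in a variable such as is_kidney can make the .loc line easier to read when the condition gets long.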


2.2 Your turn

✅ Task 11

Use tconc, masking, and .loc for the following:

In one cell, display the data corresponding to the liver tissue and THC-glu concentration
In another cell, display the data corresponding to THCA and CBD concentration after 3 days for all tissues

# Your answer

2.3 Bring it all together: Writing entries in a DataFrame

So far we have used .iloc and .loc to read values from a DataFrame. We can also use them to write values in a DataFrame. This is very useful when we want a summary DataFrame.

Say we want a DataFrame called summary_kidney where we list the sum and the maximum concentration of THCA, THC-acid, and CBDA. We put these three cannabinoids in a list. The columns of the empty DataFrame correspond to the cannabinoids of interest, while the rows correspond to the sum and max values.

Then, with a loop, we mask just the data relevant to kidney tissue and the current cannabinoid. Finally, the sum and max are computed and the results are stored in the right place of the summary DataFrame.

✅ Task 12

Add a comment explaining what each line of the code below does.

# Notice that we initially set the empty dataframe to FLOAT zeroes "0." instead of INT zeroes "0"
# The decimal point is important

cannabinoids = ['THCA', 'THC-acid', 'CBDA']
summary_kidney = pd.DataFrame(0., columns=cannabinoids, index=['sum', 'max'])

for cannabinoid in cannabinoids:
    concentrations = tconc.loc[ tconc['Tissue'] == 'Kidney', cannabinoid ]
    summary_kidney.loc[ 'sum' , cannabinoid ] = concentrations.sum()
    summary_kidney.loc[ 'max' , cannabinoid ] = concentrations.max()
summary_kidney
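To see what the decimal point changes, we can compare the column types of two empty tables. This is a small sketch; the column names 'A' and 'B' and the variable names are placeholders chosen for illustration.

```python
import pandas as pd

columns = ['A', 'B']      # hypothetical column names
rows = ['sum', 'max']

# Filled with the float 0. versus the integer 0
float_table = pd.DataFrame(0., columns=columns, index=rows)
int_table = pd.DataFrame(0, columns=columns, index=rows)

# .dtypes reports the data type of each column
print(float_table.dtypes)
print(int_table.dtypes)
```

Concentrations are decimal numbers, so starting from float columns means the stored results already match the column type. Writing a decimal value into an integer column forces pandas to change the column's type, and recent pandas versions may warn about that.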


✅ Task 13 (optional challenge)

We can forgo the loop and compute sums and maxima for all cannabinoids at once.

Add a comment explaining what each line of the code below does.

Do not worry if you find the code below confusing. It will make more sense as you get more practice with Pandas.

summary_kidney = pd.DataFrame(0., columns=cannabinoids, index=['sum', 'max'])

summary_kidney.loc['sum'] = tconc.loc[ tconc['Tissue'] == 'Kidney', cannabinoids].sum()
summary_kidney.loc['max'] = tconc.loc[ tconc['Tissue'] == 'Kidney', cannabinoids].max()
summary_kidney
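One piece that may help here is seeing what .sum() returns when several columns are selected. The sketch below uses a stand-in table; demo, compounds, and all the values are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical stand-in table with made-up values
demo = pd.DataFrame({
    'Tissue': ['Kidney', 'Liver', 'Kidney'],
    'THCA':   [1.0, 5.0, 2.0],
    'CBDA':   [0.5, 4.0, 1.5],
})
compounds = ['THCA', 'CBDA']

# Selecting several columns and calling .sum() adds up each column
# separately and returns a Series labeled by column name
kidney_sums = demo.loc[demo['Tissue'] == 'Kidney', compounds].sum()
print(kidney_sums)

summary = pd.DataFrame(0., columns=compounds, index=['sum', 'max'])

# Assigning the Series to a row fills each column by matching label
summary.loc['sum'] = kidney_sums
print(summary)
```

Because the values are matched by column label, the order of the names in compounds does not need to match the column order in the original table.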


Follow-up Questions

Copy and paste the following questions into the appropriate box in the assignment survey included below and answer them there. (Note: You’ll have to fill out the section number and the assignment number and go to the “NEXT” section of the survey to paste in these questions.)

In your own words, explain why we might use a “mask” when working with data.
What is the difference between using .loc and .iloc when masking?

Congratulations, you’re done!

Submit this assignment by uploading it to the course Canvas web page. Go to the “Pre-class assignments” folder, find the appropriate submission folder link, and upload it there.

See you in class!
