ASSESSMENT 4
In this Assignment, you will use Python Pandas. The assignment is designed in sections, with each section having examples. You will run the code that is provided and record the results.
Select Data from Pandas Dataframes
Indexing and Selections From Pandas Dataframes
There are two kinds of indexing in pandas dataframes: location-based and label-based.
In the lesson introducing pandas dataframes, you learned that these data structures have an inherent tabular structure (i.e. rows and columns with header names) that support selecting data with indexing, such as selecting individual cells identified by their location at the intersection of rows and columns.
You can also select data from pandas dataframes without knowing the location of that data within the pandas dataframe, using specific labels such as a column name.
Location-based Indexing
After working with indexing for Python lists and numpy arrays, you are familiar with location-based indexing. You already know that Python location-based indexing begins with [0], and you have learned how to use location-based indexing to query data within Python lists or numpy arrays.
You can use location-based indexing to query pandas dataframes using the attribute .iloc and providing the row and column selection as ranges (i.e. start and stop locations along the rows and columns).
Just like with numpy arrays, the range provided is inclusive of the first value, but not the second value.
This means that you need to use the range [0:1] to select the first index, so your selection begins at [0] but does not include [1] (the second index).
For example, you can select the first row and the first column of a pandas dataframes providing the range [0:1] for the row selection and then providing the range [0:1] for the column selection.
dataframe.iloc[0:1, 0:1]
Label-based Indexing
Pandas dataframes can also be queried using label-based indexing.
This feature of pandas dataframes is very useful because you can create an index for pandas dataframes using a specific column (i.e. label) that you want to use for organizing and querying your data.
For example, you can create an index from a specific column of values, and then use the attribute .loc to select data from the pandas dataframes using a value that is found in that index.
dataframe.set_index(“column”)
dataframe.loc[[value]]
Filtering Data Values
In addition to using indexing, you can also select or filter data from pandas dataframes by querying for values that met a certain criteria.
For example, you can select data in a pandas dataframe based on specific values within a column using:
dataframe[
dataframe[“column”]
== value]
This will return all rows containing that value within the specified column.
On this page, you will review how indexing works for pandas dataframes and you will learn how to select data from pandas dataframes using both indexing and filtering of data values.
Import Python Packages and Get Data
Begin by importing the necessary Python packages and then downloading and importing data into pandas dataframes.
You can use the earthpy package to download the data files, os to set the working directory, and pandas to import data files into pandas dataframes.
# Import necessary packages
import os
import pandas as pd
import earthpy as et
# URL for .csv with avg monthly precip data
avg_monthly_precip
_url = “
https://ndownloader.figshare.com/files/12710618
“
# Download file
et.data.get_data(url=avg_monthly_precip_url)
Downloading from
https://ndownloader.figshare.com/files/12710618
‘/root/earth-analytics/data/earthpy-downloads/avg-precip-months-seasons.csv’
# Set working directory to earth-analytics
os.chdir(os.path.join(et.io.HOME, “earth-analytics”))
# Import data from .csv file
fname = os.path.join(“data”, “earthpy-downloads”,
“avg-precip-months-seasons.csv”)
avg_monthly_precip = pd.read_csv(fname)
avg_monthly_precip
Select Data Using Location Index (.iloc)
You can use .iloc to select individual rows and columns or a series of rows and columns by providing the range (i.e. start and stop locations along the rows and columns) that you want to select.
# Select first row and first column
avg_monthly_precip.iloc[0:1, 0:1]
# Select first two rows and first column
avg_monthly_precip.iloc[0:2, 0:1]
# Select first row and first two columns
avg_monthly_precip.iloc[0:1, 0:2]
# Select first row with all columns
avg_monthly_precip.iloc[0:1, :]
# Select first column with all rows
avg_monthly_precip.iloc[:, 0:1]
Select Data Using Label Index (.loc)
In addition to selecting data based on location, you can also select data based on labels.
To do this, you first create a new index using a column of values that you want to use for organizing and querying your data.
For example, you can create an index from a specific column of values using:
dataframe.set_index(“column”)
# Create new dataframe with `months` as index
avg_monthly_precip_index
= avg_monthly_precip.set_index(“months”)
avg_monthly_precip_index
Test that months no longer functions as a column by attempting to select that column name:
avg_monthly_precip_index[[“months”]]
The following error is returned:
KeyError: “None of [Index([‘months’], dtype=’object’)] are in the [columns]”
Notice that the error message indicates that the value months is not in the index. This is because months is actually now the index!
After setting an index, you can use .loc to select data from the pandas dataframe using a value that is found in that index.
When selecting text string values, you need to specify the text string with quotations “”, as shown below for the text string “Aug”.
# Select Aug using months index
avg_monthly_precip_index.loc[[“Aug”]]
Select Data Using Columns
In addition to location-based and label-based indexing, you can also select data from pandas dataframes by selecting entire columns using the column names.
For example, you can select all data from a specific column in a pandas dataframe using:
dataframe[“column”]
which provides the data from the column as a pandas series, which is a one-dimensional array. A pandas series is useful for selecting columns for plotting using matplotlib.
# Select the `months` column as series
avg_monthly_precip[“months”]
0 Jan
1 Feb
2 Mar
3 Apr
4 May
5 June
6 July
7 Aug
8 Sept
9 Oct
10 Nov
11 Dec
Name: months, dtype: object
You can also specify that you want an output that is also a pandas dataframe using:
dataframe[[“column”]]
which includes a second set of brackets [] to indicate that the output should be a pandas dataframe.
# Select the `months` column as dataframe
avg_monthly_precip[[“months”]]
Notice that your results are now a pandas dataframe.
You can also select all data from multiple columns in a pandas dataframe using:
dataframe[[“column”, “column”]]
Since the results of your selection are also a pandas dataframe, you can assign the results to a new pandas dataframe.
For example, you can create a new pandas dataframe that only contains the months and seasons columns, effectively dropping the precip values.
# Save months and seasons to new dataframe
avg_monthly_precip_text
= avg_monthly_precip[[‘months’, ‘seasons’]]
avg_monthly_precip_text
Filter Data Using Specific Values
In addition to location-based and label-based indexing, you can select or filter data based on specific values within a column using:
dataframe[dataframe[“column”] == value]
This will return all rows containing that value within the specified column.
If you are selecting data using a text string column, you need to provide the value within parentheses (e.g. “text”).
For example, you can select all rows that have a seasons value of Summer.
# Select rows with Summer in seasons
avg_monthly_precip[avg_monthly_precip[“seasons”] == “Summer”]
Again, you can also save the output to a new dataframe by setting it equal to the output of the filter.
For example, you can filter for the values in the months column that are equal to Jan and save the output to a new dataframe.
# Select rows with Jan in months
jan_avg_precip
= avg_monthly_precip[avg_monthly_precip[“months”] == “Jan”]
jan_avg_precip
You can also select data based on numeric values. Note that these selections on numeric values do not require the use of quotations “” because they are not text strings.
For example, you can select all rows that have a specific value in precip such as 1.62.
# Select rows equal to 1.62 in precip
avg_monthly_precip[avg_monthly_precip[“precip”] == 1.62]
You can also filter using a comparison operator on numeric values.
For example, you can select all rows from the dataframe that have precipitation value greater than 2.0 inches by filtering on the precip column using the greater than > operator.
# Save rows with values greater than 2.0 to new dataframe
gt2_avg_monthly_precip
= avg_monthly_precip[avg_monthly_precip[“precip”] > 2.0]
gt2_avg_monthly_precip