Pandas 3: Indexing, Subsetting, Conditionals¶

Overview¶

In this notebook, we’ll work with Pandas DataFrame and Series objects to do the following:

Specify a particular column to use as the row index name
Select various rows, columns, or both
Filter a dataset based on specified conditions
Calculate and output some summary information, using data from multiple files

We’ll once again use NYS Mesonet data from 2 September 2021 at 0200 UTC.¶

Prerequisites¶

Concepts	Importance	Notes
Pandas	Necessary	Intro

Time to learn: 30 minutes

Imports¶

import pandas as pd
import numpy as np

Create and inspect the relevant DataFrames.

nysm_sites = pd.read_csv('/spare11/atm533/data/nysm_sites.csv')
nysm_data = pd.read_csv('/spare11/atm533/data/nysm_data_2021090202.csv')

nysm_data

	station	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
0	ADDI	2021-09-02 02:00:00 UTC	13.3	13.4	92.5	0.00	0.00	0.00	2.6	5.8	...	NaN	0.0	0.0	0.0	19.9	20.3	19.9	0.52	0.44	0.44
1	ANDE	2021-09-02 02:00:00 UTC	14.0	13.7	100.0	0.28	10.67	0.00	1.9	2.6	...	NaN	0.0	0.0	0.0	19.1	19.1	19.3	0.25	0.21	0.14
2	BATA	2021-09-02 02:00:00 UTC	14.8	16.3	77.5	0.00	0.00	0.00	1.9	2.3	...	NaN	0.0	0.0	0.0	20.5	21.5	21.3	0.25	0.21	0.22
3	BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37
4	BELD	2021-09-02 02:00:00 UTC	14.3	14.5	94.9	0.00	1.16	0.00	2.6	4.8	...	NaN	0.0	0.0	0.0	19.7	20.1	20.2	0.50	0.43	0.41
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
121	WFMB	2021-09-02 02:00:00 UTC	12.9	13.6	75.0	0.00	0.00	0.00	0.8	1.5	...	NaN	0.0	0.0	0.0	18.8	19.9	19.9	0.24	0.18	0.20
122	WGAT	2021-09-02 02:00:00 UTC	13.8	13.8	79.5	0.00	0.00	0.00	1.4	4.0	...	NaN	0.0	0.0	0.0	18.8	20.2	20.7	0.16	0.25	0.08
123	WHIT	2021-09-02 02:00:00 UTC	15.7	15.8	95.9	0.00	0.00	0.00	1.4	2.6	...	NaN	0.0	0.0	0.0	19.3	20.3	20.1	0.28	0.47	0.46
124	WOLC	2021-09-02 02:00:00 UTC	14.0	16.6	84.6	0.00	0.00	0.00	0.4	0.9	...	NaN	0.0	0.0	0.0	21.9	23.5	24.2	0.18	0.03	0.07
125	YORK	2021-09-02 02:00:00 UTC	12.0	13.9	96.0	0.00	0.00	0.00	0.0	0.3	...	NaN	0.0	0.0	0.0	20.5	21.8	21.9	0.13	0.24	0.24

126 rows × 30 columns

nysm_sites

	stid	number	name	lat	lon	elevation	county	nearest_city	state	distance_from_town [km]	direction_from_town [degrees]	climate_division	climate_division_name	wfo	commissioned	decommissioned
0	ADDI	107	Addison	42.040360	-77.237260	507.6140	Steuben	Addison	NY	6.9	S	1	Western Plateau	BGM	2016-08-10 18:15:00 UTC	NaN
1	ANDE	111	Andes	42.182270	-74.801390	518.2820	Delaware	Andes	NY	1.5	WSW	2	Eastern Plateau	BGM	2016-08-04 15:55:00 UTC	NaN
2	BATA	24	Batavia	43.019940	-78.135660	276.1200	Genesee	Batavia	NY	4.9	ENE	9	Great Lakes	BUF	2016-02-18 18:40:00 UTC	NaN
3	BEAC	76	Beacon	41.528750	-73.945270	90.1598	Dutchess	Beacon	NY	3.3	NE	5	Hudson Valley	ALY	2016-08-22 16:45:00 UTC	NaN
4	BELD	90	Belden	42.223220	-75.668520	470.3700	Broome	Belden	NY	2.2	NNE	2	Eastern Plateau	BGM	2015-11-30 20:20:00 UTC	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
121	WFMB	14	Whiteface Mountain Base	44.393236	-73.858829	614.5990	Essex	Wilmington	NY	3.5	W	3	Northern Plateau	BTV	2016-01-29 20:55:00 UTC	NaN
122	WGAT	123	Woodgate	43.532408	-75.158597	442.9660	Oneida	Woodgate	NY	1.4	NNW	3	Northern Plateau	BGM	2016-08-29 18:20:00 UTC	NaN
123	WHIT	10	Whitehall	43.485073	-73.423071	36.5638	Washington	Whitehall	NY	8.0	S	7	Champlain Valley	ALY	2015-08-26 20:30:00 UTC	NaN
124	WOLC	79	Wolcott	43.228680	-76.842610	121.2190	Wayne	Wolcott	NY	2.4	WNW	9	Great Lakes	BUF	2016-03-09 18:10:00 UTC	NaN
125	YORK	99	York	42.855040	-77.847760	177.9420	Livingston	York	NY	3.6	ESE	10	Central Lakes	BUF	2016-08-09 17:55:00 UTC	NaN

126 rows × 16 columns

Select a row index name from a particular column¶

For the nysm_data DataFrame, let’s set the station column as the row index, in place of the default index. The inplace argument makes it so the DataFrame object is updated with that change.

nysm_data.set_index('station', inplace=True)

Question: How do we discover the possible arguments for the set_index function?

Answer: We can follow any function (aka "method") with a `?` to read documentation on it.

nysm_data.set_index?

One of the arguments to set_index is drop. If true, it will drop (i.e., delete) the corresponding column from the DataFrame. Its default is True, so we don’t need to include it if that’s what we want.

Let’s look at the first couple rows in the DataFrame to verify that the index got changed.

nysm_data.head(2)

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
ADDI	2021-09-02 02:00:00 UTC	13.3	13.4	92.5	0.00	0.00	0.0	2.6	5.8	0.9	...	NaN	0.0	0.0	0.0	19.9	20.3	19.9	0.52	0.44	0.44
ANDE	2021-09-02 02:00:00 UTC	14.0	13.7	100.0	0.28	10.67	0.0	1.9	2.6	0.4	...	NaN	0.0	0.0	0.0	19.1	19.1	19.3	0.25	0.21	0.14

2 rows × 29 columns

Success! Now the station name is the row index.

Selecting rows and/or columns from a `DataFrame` using the `loc` and `iloc` operators (full documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)¶

Let’s say we just want to select a few of the NYS Mesonet sites for analysis. We can select rows / columns either by name or index number. First, let’s select by name.¶

The loc operator can accept a single row/column name, or a list of names. Try a single site first. As loc is a type of Python list or array index operator, we place the arguments inside brackets [].

# Display data for the Manhattan (MANH) site.
nysm_data.loc['MANH']

time                                     2021-09-02 02:00:00 UTC
temp_2m [degC]                                              18.7
temp_9m [degC]                                               NaN
relative_humidity [percent]                                100.0
precip_incremental [mm]                                     4.64
precip_local [mm]                                         111.03
precip_max_intensity [mm/min]                               1.22
avg_wind_speed_prop [m/s]                                    5.4
max_wind_speed_prop [m/s]                                    9.5
wind_speed_stddev_prop [m/s]                                 2.0
wind_direction_prop [degrees]                              352.0
wind_direction_stddev_prop [degrees]                        24.0
avg_wind_speed_sonic [m/s]                                   6.6
max_wind_speed_sonic [m/s]                                  12.5
wind_speed_stddev_sonic [m/s]                                2.2
wind_direction_sonic [degrees]                             352.0
wind_direction_stddev_sonic [degrees]                       18.0
solar_insolation [W/m^2]                                       0
station_pressure [mbar]                                    989.3
snow_depth [cm]                                              NaN
frozen_soil_05cm [bit]                                       NaN
frozen_soil_25cm [bit]                                       NaN
frozen_soil_50cm [bit]                                       NaN
soil_temp_05cm [degC]                                        NaN
soil_temp_25cm [degC]                                        NaN
soil_temp_50cm [degC]                                        NaN
soil_moisture_05cm [m^3/m^3]                                 NaN
soil_moisture_25cm [m^3/m^3]                                 NaN
soil_moisture_50cm [m^3/m^3]                                 NaN
Name: MANH, dtype: object

Multiple sites can be passed as a Python `list` … with its own set of brackets `[]`! This list can be in any order. Once we select more than one row, the output is now a `DataFrame` … and will be nicely formatted in the notebook.¶

# Display data for the Rush (RUSH), Bronx (BRON) and Voorheeseville (VOOR) sites.
nysm_data.loc[['RUSH','BRON','VOOR']]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
RUSH	2021-09-02 02:00:00 UTC	11.3	12.9	98.3	0.00	0.00	0.00	0.0	0.0	0.0	...	NaN	0.0	0.0	0.0	20.5	21.3	21.3	0.26	0.25	0.25
BRON	2021-09-02 02:00:00 UTC	18.0	17.8	99.0	6.86	110.35	1.84	6.8	11.5	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
VOOR	2021-09-02 02:00:00 UTC	14.9	14.9	98.9	0.26	11.67	0.00	2.2	3.6	0.6	...	NaN	0.0	0.0	0.0	20.2	21.1	21.1	0.25	0.33	0.19

3 rows × 29 columns

We can also select by row number () using the iloc operator. Manhattan is in row 66. Recall that Python uses zero-based indexing, so the first row is element 0.

nysm_data.iloc[65]

time                                     2021-09-02 02:00:00 UTC
temp_2m [degC]                                              18.7
temp_9m [degC]                                               NaN
relative_humidity [percent]                                100.0
precip_incremental [mm]                                     4.64
precip_local [mm]                                         111.03
precip_max_intensity [mm/min]                               1.22
avg_wind_speed_prop [m/s]                                    5.4
max_wind_speed_prop [m/s]                                    9.5
wind_speed_stddev_prop [m/s]                                 2.0
wind_direction_prop [degrees]                              352.0
wind_direction_stddev_prop [degrees]                        24.0
avg_wind_speed_sonic [m/s]                                   6.6
max_wind_speed_sonic [m/s]                                  12.5
wind_speed_stddev_sonic [m/s]                                2.2
wind_direction_sonic [degrees]                             352.0
wind_direction_stddev_sonic [degrees]                       18.0
solar_insolation [W/m^2]                                       0
station_pressure [mbar]                                    989.3
snow_depth [cm]                                              NaN
frozen_soil_05cm [bit]                                       NaN
frozen_soil_25cm [bit]                                       NaN
frozen_soil_50cm [bit]                                       NaN
soil_temp_05cm [degC]                                        NaN
soil_temp_25cm [degC]                                        NaN
soil_temp_50cm [degC]                                        NaN
soil_moisture_05cm [m^3/m^3]                                 NaN
soil_moisture_25cm [m^3/m^3]                                 NaN
soil_moisture_50cm [m^3/m^3]                                 NaN
Name: MANH, dtype: object

We can pass in a list to `iloc` too:¶

# Display data for the 1st, 32nd, and 88th row in the DataFrame.
nysm_data.iloc[[0,31,87]]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
ADDI	2021-09-02 02:00:00 UTC	13.3	13.4	92.5	0.0	0.0	0.0	2.6	5.8	0.9	...	NaN	0.0	0.0	0.0	19.9	20.3	19.9	0.52	0.44	0.44
COPE	2021-09-02 02:00:00 UTC	12.2	12.3	94.4	0.0	0.0	0.0	0.8	1.7	0.4	...	NaN	0.0	0.0	NaN	19.2	19.5	NaN	0.45	0.26	NaN
REDF	2021-09-02 02:00:00 UTC	12.4	12.5	82.1	0.0	0.0	0.0	1.8	3.2	0.6	...	NaN	0.0	0.0	0.0	18.9	19.2	18.9	0.41	0.41	0.29

3 rows × 29 columns

Now let’s select columns. To do so, we first select rows (in this case, we’ll select all rows, as indicated by `:`), and then continue with a list of columns.¶

nysm_data.loc[:,['time','temp_2m [degC]','precip_incremental [mm]']]

	time	temp_2m [degC]	precip_incremental [mm]
station
ADDI	2021-09-02 02:00:00 UTC	13.3	0.00
ANDE	2021-09-02 02:00:00 UTC	14.0	0.28
BATA	2021-09-02 02:00:00 UTC	14.8	0.00
BEAC	2021-09-02 02:00:00 UTC	16.0	1.93
BELD	2021-09-02 02:00:00 UTC	14.3	0.00
...	...	...	...
WFMB	2021-09-02 02:00:00 UTC	12.9	0.00
WGAT	2021-09-02 02:00:00 UTC	13.8	0.00
WHIT	2021-09-02 02:00:00 UTC	15.7	0.00
WOLC	2021-09-02 02:00:00 UTC	14.0	0.00
YORK	2021-09-02 02:00:00 UTC	12.0	0.00

126 rows × 3 columns

We can specify specific rows and columns¶

nysm_data.loc[['BATA','COLD','QUEE'],['time','temp_2m [degC]','precip_incremental [mm]']]

	time	temp_2m [degC]	precip_incremental [mm]
station
BATA	2021-09-02 02:00:00 UTC	14.8	0.00
COLD	2021-09-02 02:00:00 UTC	14.0	0.00
QUEE	2021-09-02 02:00:00 UTC	19.0	11.95

We can use index slicing with `iloc`; note that the last element is excluded … like typical Python slicing behavior.¶

nysm_data.iloc[0:4] # Will display first four rows of DataFrame

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
ADDI	2021-09-02 02:00:00 UTC	13.3	13.4	92.5	0.00	0.00	0.00	2.6	5.8	0.9	...	NaN	0.0	0.0	0.0	19.9	20.3	19.9	0.52	0.44	0.44
ANDE	2021-09-02 02:00:00 UTC	14.0	13.7	100.0	0.28	10.67	0.00	1.9	2.6	0.4	...	NaN	0.0	0.0	0.0	19.1	19.1	19.3	0.25	0.21	0.14
BATA	2021-09-02 02:00:00 UTC	14.8	16.3	77.5	0.00	0.00	0.00	1.9	2.3	0.2	...	NaN	0.0	0.0	0.0	20.5	21.5	21.3	0.25	0.21	0.22
BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	1.4	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37

4 rows × 29 columns

As a convenience, one can also display a specified number of rows without using the `loc` or `iloc` operators.¶

# Display the first four rows (indexes 0, 1, 2, 3)
nysm_data[:4]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
ADDI	2021-09-02 02:00:00 UTC	13.3	13.4	92.5	0.00	0.00	0.00	2.6	5.8	0.9	...	NaN	0.0	0.0	0.0	19.9	20.3	19.9	0.52	0.44	0.44
ANDE	2021-09-02 02:00:00 UTC	14.0	13.7	100.0	0.28	10.67	0.00	1.9	2.6	0.4	...	NaN	0.0	0.0	0.0	19.1	19.1	19.3	0.25	0.21	0.14
BATA	2021-09-02 02:00:00 UTC	14.8	16.3	77.5	0.00	0.00	0.00	1.9	2.3	0.2	...	NaN	0.0	0.0	0.0	20.5	21.5	21.3	0.25	0.21	0.22
BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	1.4	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37

4 rows × 29 columns

Question:This next cell will fail; why?

# This will fail; uncomment to see the error message
#nysm_data[2]

Answer:Because the *row index names* are now the four-character site ids!

We can, however, pass in a slice: in this case, rows 2 and 3 (remember, slices in Python exclude the stop value, by default).¶

nysm_data[2:4]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
BATA	2021-09-02 02:00:00 UTC	14.8	16.3	77.5	0.00	0.00	0.00	1.9	2.3	0.2	...	NaN	0.0	0.0	0.0	20.5	21.5	21.3	0.25	0.21	0.22
BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	1.4	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37

2 rows × 29 columns

Tip: In general, rely on loc and iloc when you need to perform selections.

When you select a single column from a `DataFrame`, a `Series` is returned.¶

Select the second column from the DataFrame (the first column would be referenced as 0 in iloc) and assign the returned Series to its own object name. Inspect this Series object.

tmpc = nysm_data.iloc[:,1]
tmpc

station
ADDI    13.3
ANDE    14.0
BATA    14.8
BEAC    16.0
BELD    14.3
        ... 
WFMB    12.9
WGAT    13.8
WHIT    15.7
WOLC    14.0
YORK    12.0
Name: temp_2m [degC], Length: 126, dtype: float64

We can employ selection operations, such as `loc` and `iloc` to `Series` as well as `DataFrames`.¶

Use loc to select a particular row index:

tmpc.loc['BATA']

14.8

This next line is equivalent:

tmpc['BATA']

14.8

Use iloc:

tmpc.iloc[2]

14.8

Referencing via the array index is equivalent:

tmpc[2]

14.8

Slice a Series with iloc:

tmpc.iloc[0:4]

station
ADDI    13.3
ANDE    14.0
BATA    14.8
BEAC    16.0
Name: temp_2m [degC], dtype: float64

Equivalently, slice on the array indices:

tmpc[0:4]

station
ADDI    13.3
ANDE    14.0
BATA    14.8
BEAC    16.0
Name: temp_2m [degC], dtype: float64

Note: In a departure from typical Python behavior regarding slices, slicing index names on a Series does not exclude the stop value:

Here, ‘BEAC’ is included in the returned Series:

tmpc.loc['ANDE':'BEAC']

station
ANDE    14.0
BATA    14.8
BEAC    16.0
Name: temp_2m [degC], dtype: float64

# Equivalent to previous cell
tmpc['ANDE':'BEAC']

station
ANDE    14.0
BATA    14.8
BEAC    16.0
Name: temp_2m [degC], dtype: float64

Select particular rows by expressing conditions:¶

Applying a conditional to a `Series` produces a series of booleans (i.e., `True`/`False` values … aka a truth table)¶

# What stations had a temperature of at least a certain threshold value?
thresh = 15
tmpc >= thresh

station
ADDI    False
ANDE    False
BATA    False
BEAC     True
BELD    False
        ...  
WFMB    False
WGAT    False
WHIT     True
WOLC    False
YORK    False
Name: temp_2m [degC], Length: 126, dtype: bool

We pass in this Boolean series to the series it was derived from, in order to produce another series containing the actual values that meet the specified condition.

tmpc[tmpc >= thresh]

station
BEAC    16.0
BERK    15.4
BKLN    19.9
BREW    16.2
BRON    18.0
COPA    15.6
DEPO    16.0
DOVE    16.0
ELMI    15.2
ESSX    15.1
FAYE    15.7
GFAL    16.0
GROT    15.1
HFAL    15.9
JORD    15.7
KIND    16.7
MANH    18.7
ONTA    18.2
OTIS    15.7
PENN    15.2
QUEE    19.0
REDH    16.6
SCHO    16.4
SOME    16.5
SOUT    19.3
SPRA    15.1
STAT    18.5
STEP    15.7
STON    20.3
SUFF    16.5
TICO    15.5
WALL    16.2
WANT    24.1
WARW    16.2
WHIT    15.7
Name: temp_2m [degC], dtype: float64

Tip: Until you get used to it, the way the above two cells are coded is a bit non-intuitive. It may make more sense if we instead set the Boolean series to its own object name:

condition = tmpc >= thresh
tmpc[condition]

station
BEAC    16.0
BERK    15.4
BKLN    19.9
BREW    16.2
BRON    18.0
COPA    15.6
DEPO    16.0
DOVE    16.0
ELMI    15.2
ESSX    15.1
FAYE    15.7
GFAL    16.0
GROT    15.1
HFAL    15.9
JORD    15.7
KIND    16.7
MANH    18.7
ONTA    18.2
OTIS    15.7
PENN    15.2
QUEE    19.0
REDH    16.6
SCHO    16.4
SOME    16.5
SOUT    19.3
SPRA    15.1
STAT    18.5
STEP    15.7
STON    20.3
SUFF    16.5
TICO    15.5
WALL    16.2
WANT    24.1
WARW    16.2
WHIT    15.7
Name: temp_2m [degC], dtype: float64

We can also apply conditions to the entire `DataFrame`.¶

Show a subset of the original DataFrame containing only those sites that recorded precip during the five-minute period between NYSM observations:

prcpMeasurable = nysm_data['precip_incremental [mm]' ] > 0.0 

prcpMeasurable

station
ADDI    False
ANDE     True
BATA    False
BEAC     True
BELD    False
        ...  
WFMB    False
WGAT    False
WHIT    False
WOLC    False
YORK    False
Name: precip_incremental [mm], Length: 126, dtype: bool

nysm_data[prcpMeasurable]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
ANDE	2021-09-02 02:00:00 UTC	14.0	13.7	100.0	0.28	10.67	0.00	1.9	2.6	0.4	...	NaN	0.0	0.0	0.0	19.1	19.1	19.3	0.25	0.21	0.14
BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	1.4	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37
BKLN	2021-09-02 02:00:00 UTC	19.9	20.0	97.9	3.21	75.19	0.93	3.4	9.5	2.3	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BREW	2021-09-02 02:00:00 UTC	16.2	16.5	99.7	1.01	26.89	0.27	4.9	9.8	1.7	...	NaN	0.0	0.0	0.0	20.3	20.3	20.3	0.41	0.33	0.35
BRON	2021-09-02 02:00:00 UTC	18.0	17.8	99.0	6.86	110.35	1.84	6.8	11.5	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
CLAR	2021-09-02 02:00:00 UTC	13.9	14.0	97.9	0.52	13.89	0.11	0.8	2.0	0.5	...	NaN	0.0	0.0	0.0	17.9	19.1	19.3	0.22	0.22	0.15
DEPO	2021-09-02 02:00:00 UTC	16.0	16.1	90.1	0.05	3.17	0.00	2.5	5.8	1.5	...	NaN	0.0	0.0	NaN	19.9	20.7	NaN	0.17	0.27	NaN
DOVE	2021-09-02 02:00:00 UTC	16.0	NaN	95.2	1.35	24.69	0.36	2.1	4.1	0.8	...	NaN	0.0	0.0	0.0	19.6	21.5	21.5	0.39	0.22	0.15
DUAN	2021-09-02 02:00:00 UTC	13.2	13.1	100.0	0.36	6.11	0.00	0.9	2.5	0.5	...	NaN	0.0	0.0	0.0	19.5	20.2	20.1	0.55	0.38	0.42
ELDR	2021-09-02 02:00:00 UTC	14.9	NaN	98.6	0.64	16.78	0.16	0.9	2.3	0.4	...	NaN	0.0	0.0	0.0	19.1	20.1	20.7	0.38	0.52	0.51
HARP	2021-09-02 02:00:00 UTC	13.4	13.6	96.6	0.18	10.54	0.00	3.1	5.7	1.2	...	NaN	0.0	0.0	0.0	18.6	19.3	19.0	0.31	0.28	0.52
HFAL	2021-09-02 02:00:00 UTC	15.9	15.9	99.5	0.86	21.89	0.20	4.3	7.5	1.3	...	NaN	0.0	0.0	0.0	19.7	21.1	22.1	0.47	0.35	0.35
KIND	2021-09-02 02:00:00 UTC	16.7	16.7	98.6	0.73	9.75	0.19	2.3	5.1	1.0	...	NaN	0.0	0.0	0.0	21.3	21.9	21.7	0.46	0.41	0.44
MANH	2021-09-02 02:00:00 UTC	18.7	NaN	100.0	4.64	111.03	1.22	5.4	9.5	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
MEDU	2021-09-02 02:00:00 UTC	14.2	14.2	94.4	0.56	10.39	0.15	3.1	5.4	1.1	...	NaN	0.0	0.0	0.0	17.7	19.1	18.8	0.39	0.37	0.38
NBRA	2021-09-02 02:00:00 UTC	14.8	14.9	94.0	0.27	8.95	0.00	1.4	3.2	0.8	...	NaN	0.0	0.0	0.0	18.9	20.7	20.5	0.44	0.43	0.50
OTIS	2021-09-02 02:00:00 UTC	15.7	15.6	98.8	0.88	28.49	0.19	1.1	2.6	0.6	...	NaN	0.0	0.0	NaN	18.4	20.4	NaN	0.34	0.28	NaN
QUEE	2021-09-02 02:00:00 UTC	19.0	NaN	98.0	11.95	63.43	2.68	10.0	16.5	2.5	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
REDH	2021-09-02 02:00:00 UTC	16.6	16.6	97.8	0.82	19.73	0.24	3.7	6.6	1.2	...	NaN	0.0	0.0	0.0	21.9	23.9	24.3	0.28	0.25	0.30
ROXB	2021-09-02 02:00:00 UTC	14.0	14.0	98.2	0.60	11.85	0.14	3.0	5.3	0.9	...	NaN	0.0	0.0	NaN	17.2	17.3	NaN	0.55	0.52	NaN
SCHA	2021-09-02 02:00:00 UTC	14.9	15.1	98.1	0.53	10.01	0.12	0.5	1.4	0.4	...	NaN	0.0	0.0	0.0	21.5	22.1	22.4	0.22	0.14	0.05
SCHO	2021-09-02 02:00:00 UTC	16.4	16.2	99.9	0.55	9.07	0.18	1.6	4.5	1.0	...	NaN	0.0	0.0	0.0	21.1	21.9	21.9	0.33	0.19	0.14
SCHU	2021-09-02 02:00:00 UTC	14.8	14.7	96.3	0.07	2.75	0.00	1.3	2.4	0.4	...	NaN	0.0	0.0	0.0	21.2	22.4	22.1	0.07	0.12	0.08
SOME	2021-09-02 02:00:00 UTC	16.5	16.4	98.3	1.75	46.43	0.44	2.6	6.0	1.4	...	NaN	0.0	0.0	0.0	20.3	21.7	21.3	0.69	0.42	0.62
STAT	2021-09-02 02:00:00 UTC	18.5	18.5	97.6	0.39	93.40	0.00	6.1	9.4	1.6	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
STEP	2021-09-02 02:00:00 UTC	15.7	15.6	93.0	0.73	8.54	0.21	3.1	5.9	1.1	...	NaN	0.0	0.0	0.0	19.4	19.9	19.7	0.32	0.33	0.31
SUFF	2021-09-02 02:00:00 UTC	16.5	16.2	98.6	0.65	66.08	0.16	4.2	12.1	2.0	...	NaN	0.0	0.0	0.0	20.3	20.1	21.7	0.47	0.34	0.28
TANN	2021-09-02 02:00:00 UTC	13.1	13.4	98.9	0.80	17.90	0.17	0.8	3.5	0.7	...	NaN	0.0	0.0	0.0	18.0	18.4	18.4	0.59	0.49	0.42
VOOR	2021-09-02 02:00:00 UTC	14.9	14.9	98.9	0.26	11.67	0.00	2.2	3.6	0.6	...	NaN	0.0	0.0	0.0	20.2	21.1	21.1	0.25	0.33	0.19
WALL	2021-09-02 02:00:00 UTC	16.2	16.2	98.4	1.77	24.48	0.45	3.3	5.9	1.3	...	NaN	0.0	0.0	0.0	19.9	21.9	21.5	0.57	0.42	0.44
WANT	2021-09-02 02:00:00 UTC	24.1	24.0	98.7	0.06	3.47	0.00	8.0	13.3	1.9	...	NaN	0.0	0.0	0.0	23.7	24.4	24.3	0.22	0.05	0.05
WARW	2021-09-02 02:00:00 UTC	16.2	16.2	100.0	1.35	46.92	0.29	2.7	4.4	0.6	...	NaN	0.0	0.0	0.0	18.2	21.1	21.8	0.43	0.32	0.31
WBOU	2021-09-02 02:00:00 UTC	14.4	14.2	99.1	0.52	19.27	0.11	3.6	5.8	0.9	...	NaN	0.0	0.0	0.0	17.5	20.5	21.3	0.51	0.36	0.34

33 rows × 29 columns

Of those rows that met the conditon, display the 2-m temperatures.¶

nysm_data.loc[prcpMeasurable,'temp_2m [degC]']

station
ANDE    14.0
BEAC    16.0
BKLN    19.9
BREW    16.2
BRON    18.0
CLAR    13.9
DEPO    16.0
DOVE    16.0
DUAN    13.2
ELDR    14.9
HARP    13.4
HFAL    15.9
KIND    16.7
MANH    18.7
MEDU    14.2
NBRA    14.8
OTIS    15.7
QUEE    19.0
REDH    16.6
ROXB    14.0
SCHA    14.9
SCHO    16.4
SCHU    14.8
SOME    16.5
STAT    18.5
STEP    15.7
SUFF    16.5
TANN    13.1
VOOR    14.9
WALL    16.2
WANT    24.1
WARW    16.2
WBOU    14.4
Name: temp_2m [degC], dtype: float64

Now filter based on two conditions:¶

WarmWet = (nysm_data['precip_incremental [mm]'] > 0.0) & (nysm_data['temp_2m [degC]'] > thresh)

WarmWet

station
ADDI    False
ANDE    False
BATA    False
BEAC     True
BELD    False
        ...  
WFMB    False
WGAT    False
WHIT    False
WOLC    False
YORK    False
Length: 126, dtype: bool

nysm_data.loc[WarmWet]

	time	temp_2m [degC]	temp_9m [degC]	relative_humidity [percent]	precip_incremental [mm]	precip_local [mm]	precip_max_intensity [mm/min]	avg_wind_speed_prop [m/s]	max_wind_speed_prop [m/s]	wind_speed_stddev_prop [m/s]	...	snow_depth [cm]	frozen_soil_05cm [bit]	frozen_soil_25cm [bit]	frozen_soil_50cm [bit]	soil_temp_05cm [degC]	soil_temp_25cm [degC]	soil_temp_50cm [degC]	soil_moisture_05cm [m^3/m^3]	soil_moisture_25cm [m^3/m^3]	soil_moisture_50cm [m^3/m^3]
station
BEAC	2021-09-02 02:00:00 UTC	16.0	15.9	98.6	1.93	37.53	0.48	3.2	6.5	1.4	...	NaN	0.0	0.0	0.0	18.4	19.3	19.8	0.51	0.36	0.37
BKLN	2021-09-02 02:00:00 UTC	19.9	20.0	97.9	3.21	75.19	0.93	3.4	9.5	2.3	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
BREW	2021-09-02 02:00:00 UTC	16.2	16.5	99.7	1.01	26.89	0.27	4.9	9.8	1.7	...	NaN	0.0	0.0	0.0	20.3	20.3	20.3	0.41	0.33	0.35
BRON	2021-09-02 02:00:00 UTC	18.0	17.8	99.0	6.86	110.35	1.84	6.8	11.5	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
DEPO	2021-09-02 02:00:00 UTC	16.0	16.1	90.1	0.05	3.17	0.00	2.5	5.8	1.5	...	NaN	0.0	0.0	NaN	19.9	20.7	NaN	0.17	0.27	NaN
DOVE	2021-09-02 02:00:00 UTC	16.0	NaN	95.2	1.35	24.69	0.36	2.1	4.1	0.8	...	NaN	0.0	0.0	0.0	19.6	21.5	21.5	0.39	0.22	0.15
HFAL	2021-09-02 02:00:00 UTC	15.9	15.9	99.5	0.86	21.89	0.20	4.3	7.5	1.3	...	NaN	0.0	0.0	0.0	19.7	21.1	22.1	0.47	0.35	0.35
KIND	2021-09-02 02:00:00 UTC	16.7	16.7	98.6	0.73	9.75	0.19	2.3	5.1	1.0	...	NaN	0.0	0.0	0.0	21.3	21.9	21.7	0.46	0.41	0.44
MANH	2021-09-02 02:00:00 UTC	18.7	NaN	100.0	4.64	111.03	1.22	5.4	9.5	2.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
OTIS	2021-09-02 02:00:00 UTC	15.7	15.6	98.8	0.88	28.49	0.19	1.1	2.6	0.6	...	NaN	0.0	0.0	NaN	18.4	20.4	NaN	0.34	0.28	NaN
QUEE	2021-09-02 02:00:00 UTC	19.0	NaN	98.0	11.95	63.43	2.68	10.0	16.5	2.5	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
REDH	2021-09-02 02:00:00 UTC	16.6	16.6	97.8	0.82	19.73	0.24	3.7	6.6	1.2	...	NaN	0.0	0.0	0.0	21.9	23.9	24.3	0.28	0.25	0.30
SCHO	2021-09-02 02:00:00 UTC	16.4	16.2	99.9	0.55	9.07	0.18	1.6	4.5	1.0	...	NaN	0.0	0.0	0.0	21.1	21.9	21.9	0.33	0.19	0.14
SOME	2021-09-02 02:00:00 UTC	16.5	16.4	98.3	1.75	46.43	0.44	2.6	6.0	1.4	...	NaN	0.0	0.0	0.0	20.3	21.7	21.3	0.69	0.42	0.62
STAT	2021-09-02 02:00:00 UTC	18.5	18.5	97.6	0.39	93.40	0.00	6.1	9.4	1.6	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
STEP	2021-09-02 02:00:00 UTC	15.7	15.6	93.0	0.73	8.54	0.21	3.1	5.9	1.1	...	NaN	0.0	0.0	0.0	19.4	19.9	19.7	0.32	0.33	0.31
SUFF	2021-09-02 02:00:00 UTC	16.5	16.2	98.6	0.65	66.08	0.16	4.2	12.1	2.0	...	NaN	0.0	0.0	0.0	20.3	20.1	21.7	0.47	0.34	0.28
WALL	2021-09-02 02:00:00 UTC	16.2	16.2	98.4	1.77	24.48	0.45	3.3	5.9	1.3	...	NaN	0.0	0.0	0.0	19.9	21.9	21.5	0.57	0.42	0.44
WANT	2021-09-02 02:00:00 UTC	24.1	24.0	98.7	0.06	3.47	0.00	8.0	13.3	1.9	...	NaN	0.0	0.0	0.0	23.7	24.4	24.3	0.22	0.05	0.05
WARW	2021-09-02 02:00:00 UTC	16.2	16.2	100.0	1.35	46.92	0.29	2.7	4.4	0.6	...	NaN	0.0	0.0	0.0	18.2	21.1	21.8	0.43	0.32	0.31

20 rows × 29 columns

Question: What if you wanted to produce some text output for each time, expressing certain extreme values for a variety of parameters, and reference it to a column in another DataFrame?

Answer: Use the idxmax / idxmin method to display the row label of the maximum / minimum value in a Series.

Let’s create Series objects from the NYS Mesonet Station Info table, as well as from the data table.

stid = nysm_sites['stid']
lats = nysm_sites['lat']
lons = nysm_sites['lon']

time = nysm_data['time']
tmpc = nysm_data['temp_2m [degC]']
rh   = nysm_data['relative_humidity [percent]']
pres = nysm_data['station_pressure [mbar]']
wspd = nysm_data['max_wind_speed_prop [m/s]']
drct = nysm_data['wind_direction_prop [degrees]']
pinc = nysm_data['precip_incremental [mm]']
ptot = nysm_data['precip_local [mm]']
pint = nysm_data['precip_max_intensity [mm/min]']

What is the maximum 2m temperature of all stations in the DataFrame?

tmpcMax = tmpc.max()
tmpcMax

24.1

In what row # occurs the maximum temperature? We might think to use the idxmax method …

siteTmpcMax = tmpc.idxmax()
siteTmpcMax

'WANT'

That works, but it outputs the row index name (recall that when we read in the data file, we specified that the site ID column woudl be used for the row index name).¶

Let’s use the argmax method to get the row index number.

indxTmpcMax = tmpc.argmax()
indxTmpcMax

We can do the same for another extrema … e.g., minimum 2m temperature.

tmpcMin = tmpc.min()
siteTmpcMin = tmpc.idxmin()
indxTmpcMin = tmpc.argmin()

Knowing the row index #, we can get the corresponding full station name from the NYSM station table:¶

siteNameTmpcMax = nysm_sites['name'].iloc[indxTmpcMax]

siteNameTmpcMin = nysm_sites['name'].iloc[indxTmpcMin]

print ("Warmest NYSM Site: %s (%s), %.1f" % (siteNameTmpcMax,siteTmpcMax, tmpcMax))
print ("Coldest NYSM Site: %s (%s), %.1f" % (siteNameTmpcMin,siteTmpcMin, tmpcMin))

Warmest NYSM Site: Wantagh (WANT), 24.1
Coldest NYSM Site: Edwards (EDWA), 10.3

Exercise: find and output the highest and lowest sea-level pressure from the dataset.

Consider: would you want to just find the max/min station pressure values, and then convert those two values to SLP? Or would it be better to first apply the SLP reduction to the entire station pressure Series?

Solution: It's possible that the max/min station pressure values might not also correspond to the stations with max/min SLP, since there is elevation dependence.

So, first create a series of the station pressure values, and then apply the SLP reduction formula. This requires that we retrieve the elevation data from the site file as well.

pres = nysm_data['station_pressure [mbar]']
elev = nysm_sites['elevation']

Now, it should be as straightforward as passing these two `Series` to the SLP reduction formula … shouldn’t it?¶

Employ the SLP reduction equation that we used in the previous notebook:

sensorHeight = .5
# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf 
slp = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))

Looks like it worked! Let’s see what we get for max/min values:

print(slp.max(), slp.min())

nan nan

Uh-oh … `nan` signifies missing data! Let’s look at what our slp `Series` object looks like:¶

slp

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
        ..
WFMB   NaN
WGAT   NaN
WHIT   NaN
WOLC   NaN
YORK   NaN
Length: 252, dtype: float64

Look at the length of the series! It’s 252 … which is 126 times 2!¶

Let’s just look at all the rows of the series first …

pd.set_option('display.max_rows',None)
slp

0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
21     NaN
22     NaN
23     NaN
24     NaN
25     NaN
26     NaN
27     NaN
28     NaN
29     NaN
30     NaN
31     NaN
32     NaN
33     NaN
34     NaN
35     NaN
36     NaN
37     NaN
38     NaN
39     NaN
40     NaN
41     NaN
42     NaN
43     NaN
44     NaN
45     NaN
46     NaN
47     NaN
48     NaN
49     NaN
50     NaN
51     NaN
52     NaN
53     NaN
54     NaN
55     NaN
56     NaN
57     NaN
58     NaN
59     NaN
60     NaN
61     NaN
62     NaN
63     NaN
64     NaN
65     NaN
66     NaN
67     NaN
68     NaN
69     NaN
70     NaN
71     NaN
72     NaN
73     NaN
74     NaN
75     NaN
76     NaN
77     NaN
78     NaN
79     NaN
80     NaN
81     NaN
82     NaN
83     NaN
84     NaN
85     NaN
86     NaN
87     NaN
88     NaN
89     NaN
90     NaN
91     NaN
92     NaN
93     NaN
94     NaN
95     NaN
96     NaN
97     NaN
98     NaN
99     NaN
100    NaN
101    NaN
102    NaN
103    NaN
104    NaN
105    NaN
106    NaN
107    NaN
108    NaN
109    NaN
110    NaN
111    NaN
112    NaN
113    NaN
114    NaN
115    NaN
116    NaN
117    NaN
118    NaN
119    NaN
120    NaN
121    NaN
122    NaN
123    NaN
124    NaN
125    NaN
ADDI   NaN
ANDE   NaN
BATA   NaN
BEAC   NaN
BELD   NaN
BELL   NaN
BELM   NaN
BERK   NaN
BING   NaN
BKLN   NaN
BRAN   NaN
BREW   NaN
BROC   NaN
BRON   NaN
BROO   NaN
BSPA   NaN
BUFF   NaN
BURD   NaN
BURT   NaN
CAMD   NaN
CAPE   NaN
CHAZ   NaN
CHES   NaN
CINC   NaN
CLAR   NaN
CLIF   NaN
CLYM   NaN
COBL   NaN
COHO   NaN
COLD   NaN
COPA   NaN
COPE   NaN
CROG   NaN
CSQR   NaN
DELE   NaN
DEPO   NaN
DOVE   NaN
DUAN   NaN
EAUR   NaN
EDIN   NaN
EDWA   NaN
ELDR   NaN
ELLE   NaN
ELMI   NaN
ESSX   NaN
FAYE   NaN
FRED   NaN
GABR   NaN
GFAL   NaN
GFLD   NaN
GROT   NaN
GROV   NaN
HAMM   NaN
HARP   NaN
HARR   NaN
HART   NaN
HERK   NaN
HFAL   NaN
ILAK   NaN
JOHN   NaN
JORD   NaN
KIND   NaN
LAUR   NaN
LOUI   NaN
MALO   NaN
MANH   NaN
MEDI   NaN
MEDU   NaN
MORR   NaN
NBRA   NaN
NEWC   NaN
NHUD   NaN
OLDF   NaN
OLEA   NaN
ONTA   NaN
OPPE   NaN
OSCE   NaN
OSWE   NaN
OTIS   NaN
OWEG   NaN
PENN   NaN
PHIL   NaN
PISE   NaN
POTS   NaN
QUEE   NaN
RAND   NaN
RAQU   NaN
REDF   NaN
REDH   NaN
ROXB   NaN
RUSH   NaN
SARA   NaN
SBRI   NaN
SCHA   NaN
SCHO   NaN
SCHU   NaN
SCIP   NaN
SHER   NaN
SOME   NaN
SOUT   NaN
SPRA   NaN
SPRI   NaN
STAT   NaN
STEP   NaN
STON   NaN
SUFF   NaN
TANN   NaN
TICO   NaN
TULL   NaN
TUPP   NaN
TYRO   NaN
VOOR   NaN
WALL   NaN
WALT   NaN
WANT   NaN
WARS   NaN
WARW   NaN
WATE   NaN
WBOU   NaN
WELL   NaN
WEST   NaN
WFMB   NaN
WGAT   NaN
WHIT   NaN
WOLC   NaN
YORK   NaN
dtype: float64

Missing (i.e. `NaN`) values everywhere, and indeed, we have double the number of rows that we thought we’d have!¶

Let’s take a look at our elev and pres Series objects.

# go back to default # of rows displayed
pd.set_option('display.max_rows',10)

elev

    507.6140
    518.2820
    276.1200
     90.1598
    470.3700
         ...   
  614.5990
  442.9660
   36.5638
  121.2190
  177.9420
Name: elevation, Length: 126, dtype: float64

pres

station
ADDI     951.47
ANDE     947.40
BATA     979.93
BEAC     994.69
BELD     954.02
         ...   
WFMB     941.26
WGAT     959.46
WHIT    1006.29
WOLC     996.90
YORK     991.17
Name: station_pressure [mbar], Length: 126, dtype: float64

The issue is that our elev and pres series have different row indexes! In order for calculations to work when applied to different Series objects, their indexes must correspond!¶

Recall that we set the row index for the nysm_data DataFrame to station id, but kept the default for nysm_sites.

Since `tmpc` and `pres` both have their row index set to station id, let’s set `elev`’s index to that as well.¶

elev.index = pres.index
elev

station
ADDI    507.6140
ANDE    518.2820
BATA    276.1200
BEAC     90.1598
BELD    470.3700
          ...   
WFMB    614.5990
WGAT    442.9660
WHIT     36.5638
WOLC    121.2190
YORK    177.9420
Name: elevation, Length: 126, dtype: float64

Now, repeat the SLP reduction calculation.¶

slp = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))
slpMax = slp.max()
slpMin = slp.min()
print (slpMax, slpMin)

1014.0598722184151 998.940242497739

With the row indexes common among all three Series in the equation, it works!¶

We can now find the corresponding row numbers and summarize as we did for max/min temperature.

siteSlpMax = slp.idxmax()
indxSlpMax = slp.argmax()
siteSlpMin = slp.idxmin()
indxSlpMin = slp.argmin()

siteNameSlpMax = nysm_sites['name'].iloc[indxSlpMax]
siteNameSlpMin = nysm_sites['name'].iloc[indxSlpMin]

print ("Highest SLP NYSM Site: %s (%s), %.1f" % (siteNameSlpMax,siteSlpMax, slpMax))
print ("Lowest  SLP NYSM Site: %s (%s), %.1f" % (siteNameSlpMin,siteSlpMin, slpMin))

Highest SLP NYSM Site: Clymer (CLYM), 1014.1
Lowest  SLP NYSM Site: Brooklyn (BKLN), 998.9

Tip: Be mindful of how you define row indexes when working with multiple `DataFrame`s!

Summary¶

Use the loc and iloc methods to select rows, columns, or combinations of the two in a Pandas DataFrame.
These two methods can also be used for Series.
Setting conditions on particular rows or columns produce True/False values, which can then be passed into a DataFrame or Series for more meaningful output.
The idxmax / idxmin and argmax / argminfunctions can be combined to provide meaningful output from multiple DataFrames.

What’s Next?¶

In the next notebook, we will explore how Pandas lets us efficiently work with data that are date- and time-based.

ATM433/533 Fall 2023

Pandas 3: Indexing, Subsetting, Conditionals

Contents

Pandas 3: Indexing, Subsetting, Conditionals¶

Overview¶

We’ll once again use NYS Mesonet data from 2 September 2021 at 0200 UTC.¶

Prerequisites¶

Imports¶

Select a row index name from a particular column¶

Selecting rows and/or columns from a `DataFrame` using the `loc` and `iloc` operators (full documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)¶

Let’s say we just want to select a few of the NYS Mesonet sites for analysis. We can select rows / columns either by name or index number. First, let’s select by name.¶

Multiple sites can be passed as a Python `list` … with its own set of brackets `[]`! This list can be in any order. Once we select more than one row, the output is now a `DataFrame` … and will be nicely formatted in the notebook.¶

We can pass in a list to `iloc` too:¶

Now let’s select columns. To do so, we first select rows (in this case, we’ll select all rows, as indicated by `:`), and then continue with a list of columns.¶

We can specify specific rows and columns¶

We can use index slicing with `iloc`; note that the last element is excluded … like typical Python slicing behavior.¶

As a convenience, one can also display a specified number of rows without using the `loc` or `iloc` operators.¶

We can, however, pass in a slice: in this case, rows 2 and 3 (remember, slices in Python exclude the stop value, by default).¶

When you select a single column from a `DataFrame`, a `Series` is returned.¶

We can employ selection operations, such as `loc` and `iloc` to `Series` as well as `DataFrames`.¶

Select particular rows by expressing conditions:¶

Applying a conditional to a `Series` produces a series of booleans (i.e., `True`/`False` values … aka a truth table)¶

We can also apply conditions to the entire `DataFrame`.¶

Of those rows that met the conditon, display the 2-m temperatures.¶

Now filter based on two conditions:¶

That works, but it outputs the row index name (recall that when we read in the data file, we specified that the site ID column woudl be used for the row index name).¶

Knowing the row index #, we can get the corresponding full station name from the NYSM station table:¶

Now, it should be as straightforward as passing these two `Series` to the SLP reduction formula … shouldn’t it?¶

Uh-oh … `nan` signifies missing data! Let’s look at what our slp `Series` object looks like:¶

Look at the length of the series! It’s 252 … which is 126 times 2!¶

Missing (i.e. `NaN`) values everywhere, and indeed, we have double the number of rows that we thought we’d have!¶

The issue is that our elev and pres series have different row indexes! In order for calculations to work when applied to different Series objects, their indexes must correspond!¶

Since `tmpc` and `pres` both have their row index set to station id, let’s set `elev`’s index to that as well.¶

Now, repeat the SLP reduction calculation.¶

With the row indexes common among all three Series in the equation, it works!¶

Summary¶

What’s Next?¶

Resources and References¶

ATM433/533 Fall 2023

Pandas 3: Indexing, Subsetting, Conditionals

Contents

Pandas 3: Indexing, Subsetting, Conditionals¶

Overview¶

We’ll once again use NYS Mesonet data from 2 September 2021 at 0200 UTC.¶

Prerequisites¶

Imports¶

Select a row index name from a particular column¶

Selecting rows and/or columns from a DataFrame using the loc and iloc operators (full documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)¶

Let’s say we just want to select a few of the NYS Mesonet sites for analysis. We can select rows / columns either by name or index number. First, let’s select by name.¶

Multiple sites can be passed as a Python list … with its own set of brackets []! This list can be in any order. Once we select more than one row, the output is now a DataFrame … and will be nicely formatted in the notebook.¶

We can pass in a list to iloc too:¶

Now let’s select columns. To do so, we first select rows (in this case, we’ll select all rows, as indicated by :), and then continue with a list of columns.¶

We can specify specific rows and columns¶

We can use index slicing with iloc; note that the last element is excluded … like typical Python slicing behavior.¶

As a convenience, one can also display a specified number of rows without using the loc or iloc operators.¶

We can, however, pass in a slice: in this case, rows 2 and 3 (remember, slices in Python exclude the stop value, by default).¶

When you select a single column from a DataFrame, a Series is returned.¶

We can employ selection operations, such as loc and iloc to Series as well as DataFrames.¶

Select particular rows by expressing conditions:¶

Applying a conditional to a Series produces a series of booleans (i.e., True/False values … aka a truth table)¶

We can also apply conditions to the entire DataFrame.¶

Of those rows that met the conditon, display the 2-m temperatures.¶

Now filter based on two conditions:¶

That works, but it outputs the row index name (recall that when we read in the data file, we specified that the site ID column woudl be used for the row index name).¶

Knowing the row index #, we can get the corresponding full station name from the NYSM station table:¶

Now, it should be as straightforward as passing these two Series to the SLP reduction formula … shouldn’t it?¶

Uh-oh … nan signifies missing data! Let’s look at what our slp Series object looks like:¶

Look at the length of the series! It’s 252 … which is 126 times 2!¶

Missing (i.e. NaN) values everywhere, and indeed, we have double the number of rows that we thought we’d have!¶

The issue is that our elev and pres series have different row indexes! In order for calculations to work when applied to different Series objects, their indexes must correspond!¶

Since tmpc and pres both have their row index set to station id, let’s set elev’s index to that as well.¶

Now, repeat the SLP reduction calculation.¶

With the row indexes common among all three Series in the equation, it works!¶

Summary¶

What’s Next?¶

Resources and References¶

Selecting rows and/or columns from a `DataFrame` using the `loc` and `iloc` operators (full documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)¶

Multiple sites can be passed as a Python `list` … with its own set of brackets `[]`! This list can be in any order. Once we select more than one row, the output is now a `DataFrame` … and will be nicely formatted in the notebook.¶

We can pass in a list to `iloc` too:¶

Now let’s select columns. To do so, we first select rows (in this case, we’ll select all rows, as indicated by `:`), and then continue with a list of columns.¶

We can use index slicing with `iloc`; note that the last element is excluded … like typical Python slicing behavior.¶

As a convenience, one can also display a specified number of rows without using the `loc` or `iloc` operators.¶

When you select a single column from a `DataFrame`, a `Series` is returned.¶

We can employ selection operations, such as `loc` and `iloc` to `Series` as well as `DataFrames`.¶

Applying a conditional to a `Series` produces a series of booleans (i.e., `True`/`False` values … aka a truth table)¶

We can also apply conditions to the entire `DataFrame`.¶

Now, it should be as straightforward as passing these two `Series` to the SLP reduction formula … shouldn’t it?¶

Uh-oh … `nan` signifies missing data! Let’s look at what our slp `Series` object looks like:¶

Missing (i.e. `NaN`) values everywhere, and indeed, we have double the number of rows that we thought we’d have!¶

Since `tmpc` and `pres` both have their row index set to station id, let’s set `elev`’s index to that as well.¶