Pandas 8: Multi-index DataFrames
Contents
Pandas 8: Multi-index DataFrames¶
In many instances, tabular data may best be represented with more than a single row index. For the case of NYSM hourly data files, which have 126 stations each with 13 sets of 5-minute obs, we definitely want to take advantage of using Multi-index Dataframes.¶
Overview¶
Open an hour’s worth of NYSM data using a single row index dataframe
Review selection and conditional selection methodology
Create a dataframe with station ID and date/time as the two row indices
Use selection and conditions on a multi-index dataframe
Briefly define tuples as opposed to lists in Python
Work with the time index as a Datetime object
Imports¶
import pandas as pd
from datetime import datetime
Open an hour’s worth of NYSM data using a single row index dataframe¶
nysm_data_file = '/spare11/atm533/data/nysm_data_2021090202a.csv'
df = pd.read_csv(nysm_data_file)
# Look at the first few rows
df.head(3)
station | time | temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ADDI | 2021-09-02 02:00:00 UTC | 13.3 | 13.4 | 92.5 | 0.0 | 0.0 | 0.0 | 2.6 | 5.8 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 |
1 | ADDI | 2021-09-02 02:05:00 UTC | 13.2 | 13.4 | 92.8 | 0.0 | 0.0 | 0.0 | 2.7 | 4.3 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 |
2 | ADDI | 2021-09-02 02:10:00 UTC | 13.1 | 13.3 | 92.8 | 0.0 | 0.0 | 0.0 | 2.1 | 3.4 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 |
3 rows × 30 columns
Remind ourselves that the default DataFrame
’s index is a special type of Python object, called a RangeIndex.
df.index
RangeIndex(start=0, stop=1512, step=1)
Review selection and conditional selection methodology¶
Here we have multiple stations and multiple times. We can select rows/columns, set conditions, and make further selections based on those conditions as we did in the 03_Pandas_IndexSubsetsConditonals notebook.
df[df['station']=='VOOR'] # Voorheesville, NY
station | time | temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1332 | VOOR | 2021-09-02 02:00:00 UTC | 14.9 | 14.9 | 98.9 | 0.26 | 11.67 | 0.00 | 2.2 | 3.6 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.2 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
1333 | VOOR | 2021-09-02 02:05:00 UTC | 14.8 | 14.8 | 99.0 | 0.27 | 11.94 | 0.00 | 3.0 | 4.6 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
1334 | VOOR | 2021-09-02 02:10:00 UTC | 14.7 | 14.7 | 98.9 | 0.39 | 12.33 | 0.11 | 3.2 | 5.1 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
1335 | VOOR | 2021-09-02 02:15:00 UTC | 14.6 | 14.6 | 98.9 | 0.80 | 13.13 | 0.18 | 3.6 | 5.6 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.26 | 0.33 | 0.19 |
1336 | VOOR | 2021-09-02 02:20:00 UTC | 14.6 | 14.6 | 98.9 | 0.88 | 14.01 | 0.18 | 2.9 | 6.4 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.26 | 0.33 | 0.19 |
1337 | VOOR | 2021-09-02 02:25:00 UTC | 14.5 | 14.5 | 98.9 | 0.66 | 14.67 | 0.15 | 2.9 | 5.8 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.27 | 0.33 | 0.19 |
1338 | VOOR | 2021-09-02 02:30:00 UTC | 14.5 | 14.5 | 99.0 | 0.45 | 15.12 | 0.00 | 2.5 | 4.6 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.27 | 0.34 | 0.19 |
1339 | VOOR | 2021-09-02 02:35:00 UTC | 14.4 | 14.5 | 99.0 | 0.55 | 15.67 | 0.13 | 2.6 | 3.8 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.0 | 21.1 | 21.1 | 0.28 | 0.34 | 0.19 |
1340 | VOOR | 2021-09-02 02:40:00 UTC | 14.4 | 14.4 | 99.0 | 0.99 | 16.66 | 0.25 | 2.2 | 4.3 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 21.1 | 21.1 | 0.28 | 0.34 | 0.19 |
1341 | VOOR | 2021-09-02 02:45:00 UTC | 14.3 | 14.4 | 98.9 | 1.51 | 18.17 | 0.39 | 1.7 | 3.5 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 21.1 | 21.1 | 0.29 | 0.34 | 0.19 |
1342 | VOOR | 2021-09-02 02:50:00 UTC | 14.2 | 14.3 | 98.8 | 1.30 | 19.47 | 0.31 | 1.5 | 3.4 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.9 | 21.1 | 0.29 | 0.35 | 0.19 |
1343 | VOOR | 2021-09-02 02:55:00 UTC | 14.2 | 14.3 | 98.4 | 0.77 | 20.24 | 0.19 | 2.1 | 3.7 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.9 | 21.1 | 0.29 | 0.35 | 0.19 |
12 rows × 30 columns
Create a second dataframe that first contains just VOOR rows, then further subset by choosing one time.
df2 = df[df['station']=='VOOR']
df2[df2['time'] == '2021-09-02 02:50:00 UTC'] # we haven't recast this as a Datetime object ... it's just a string
station | time | temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1342 | VOOR | 2021-09-02 02:50:00 UTC | 14.2 | 14.3 | 98.8 | 1.3 | 19.47 | 0.31 | 1.5 | 3.4 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.9 | 21.1 | 0.29 | 0.35 | 0.19 |
1 rows × 30 columns
However, there is a better way! Since our DataFrame consists of one station after another, each associated with a number of discrete times, let’s create a multi-indexed DataFrame, that has station as its outer row index and time as its inner.
Create a dataframe with station ID and date/time as the two row indices¶
df.set_index(['station', 'time'], inplace = True)
inplace
means that the df object gets re-created with its new indexes. It's convenient, but if at any point we wish to re-run cells beyond this point, it is better to start from the beginning to ensure that the df
object conforms to whatever code cell operates on it.df
temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | wind_speed_stddev_prop [m/s] | wind_direction_prop [degrees] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
station | time | |||||||||||||||||||||
ADDI | 2021-09-02 02:00:00 UTC | 13.3 | 13.4 | 92.5 | 0.0 | 0.0 | 0.0 | 2.6 | 5.8 | 0.9 | 344.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 |
2021-09-02 02:05:00 UTC | 13.2 | 13.4 | 92.8 | 0.0 | 0.0 | 0.0 | 2.7 | 4.3 | 0.6 | 340.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 | |
2021-09-02 02:10:00 UTC | 13.1 | 13.3 | 92.8 | 0.0 | 0.0 | 0.0 | 2.1 | 3.4 | 0.5 | 336.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 | |
2021-09-02 02:15:00 UTC | 13.0 | 13.2 | 92.9 | 0.0 | 0.0 | 0.0 | 2.4 | 4.1 | 0.6 | 342.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 | |
2021-09-02 02:20:00 UTC | 12.9 | 13.1 | 93.0 | 0.0 | 0.0 | 0.0 | 2.3 | 4.6 | 0.9 | 338.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
YORK | 2021-09-02 02:35:00 UTC | 11.7 | 12.8 | 97.1 | 0.0 | 0.0 | 0.0 | 0.2 | 0.6 | 0.2 | 136.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.3 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 |
2021-09-02 02:40:00 UTC | 11.6 | 12.5 | 97.0 | 0.0 | 0.0 | 0.0 | 0.6 | 0.8 | 0.1 | 131.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.3 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 | |
2021-09-02 02:45:00 UTC | 11.6 | 12.5 | 97.2 | 0.0 | 0.0 | 0.0 | 0.1 | 0.7 | 0.2 | 165.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.3 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 | |
2021-09-02 02:50:00 UTC | 11.5 | 12.7 | 97.3 | 0.0 | 0.0 | 0.0 | 0.7 | 0.8 | 0.1 | 193.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.2 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 | |
2021-09-02 02:55:00 UTC | 11.5 | 12.4 | 97.4 | 0.0 | 0.0 | 0.0 | 0.5 | 0.7 | 0.1 | 168.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 |
1512 rows × 28 columns
df.index
MultiIndex([('ADDI', '2021-09-02 02:00:00 UTC'),
('ADDI', '2021-09-02 02:05:00 UTC'),
('ADDI', '2021-09-02 02:10:00 UTC'),
('ADDI', '2021-09-02 02:15:00 UTC'),
('ADDI', '2021-09-02 02:20:00 UTC'),
('ADDI', '2021-09-02 02:25:00 UTC'),
('ADDI', '2021-09-02 02:30:00 UTC'),
('ADDI', '2021-09-02 02:35:00 UTC'),
('ADDI', '2021-09-02 02:40:00 UTC'),
('ADDI', '2021-09-02 02:45:00 UTC'),
...
('YORK', '2021-09-02 02:10:00 UTC'),
('YORK', '2021-09-02 02:15:00 UTC'),
('YORK', '2021-09-02 02:20:00 UTC'),
('YORK', '2021-09-02 02:25:00 UTC'),
('YORK', '2021-09-02 02:30:00 UTC'),
('YORK', '2021-09-02 02:35:00 UTC'),
('YORK', '2021-09-02 02:40:00 UTC'),
('YORK', '2021-09-02 02:45:00 UTC'),
('YORK', '2021-09-02 02:50:00 UTC'),
('YORK', '2021-09-02 02:55:00 UTC')],
names=['station', 'time'], length=1512)
The date/time index is currently a string. Let’s do this nifty trick to convert it to a datetime
object.
df.index = df.index.set_levels([df.index.levels[0], pd.to_datetime(df.index.levels[1])])
#Examine first three row indices
df.index[:3]
MultiIndex([('ADDI', '2021-09-02 02:00:00+00:00'),
('ADDI', '2021-09-02 02:05:00+00:00'),
('ADDI', '2021-09-02 02:10:00+00:00')],
names=['station', 'time'])
Use selection and conditions on a multi-index dataframe¶
Select a column in the usual way.
df.loc['VOOR']
temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | wind_speed_stddev_prop [m/s] | wind_direction_prop [degrees] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
time | |||||||||||||||||||||
2021-09-02 02:00:00+00:00 | 14.9 | 14.9 | 98.9 | 0.26 | 11.67 | 0.00 | 2.2 | 3.6 | 0.6 | 301.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.2 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
2021-09-02 02:05:00+00:00 | 14.8 | 14.8 | 99.0 | 0.27 | 11.94 | 0.00 | 3.0 | 4.6 | 0.8 | 299.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
2021-09-02 02:10:00+00:00 | 14.7 | 14.7 | 98.9 | 0.39 | 12.33 | 0.11 | 3.2 | 5.1 | 0.8 | 301.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.25 | 0.33 | 0.19 |
2021-09-02 02:15:00+00:00 | 14.6 | 14.6 | 98.9 | 0.80 | 13.13 | 0.18 | 3.6 | 5.6 | 0.7 | 299.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.26 | 0.33 | 0.19 |
2021-09-02 02:20:00+00:00 | 14.6 | 14.6 | 98.9 | 0.88 | 14.01 | 0.18 | 2.9 | 6.4 | 0.9 | 300.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.26 | 0.33 | 0.19 |
2021-09-02 02:25:00+00:00 | 14.5 | 14.5 | 98.9 | 0.66 | 14.67 | 0.15 | 2.9 | 5.8 | 0.9 | 298.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.27 | 0.33 | 0.19 |
2021-09-02 02:30:00+00:00 | 14.5 | 14.5 | 99.0 | 0.45 | 15.12 | 0.00 | 2.5 | 4.6 | 0.8 | 301.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.1 | 21.1 | 21.1 | 0.27 | 0.34 | 0.19 |
2021-09-02 02:35:00+00:00 | 14.4 | 14.5 | 99.0 | 0.55 | 15.67 | 0.13 | 2.6 | 3.8 | 0.5 | 292.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.0 | 21.1 | 21.1 | 0.28 | 0.34 | 0.19 |
2021-09-02 02:40:00+00:00 | 14.4 | 14.4 | 99.0 | 0.99 | 16.66 | 0.25 | 2.2 | 4.3 | 0.8 | 298.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 21.1 | 21.1 | 0.28 | 0.34 | 0.19 |
2021-09-02 02:45:00+00:00 | 14.3 | 14.4 | 98.9 | 1.51 | 18.17 | 0.39 | 1.7 | 3.5 | 0.7 | 290.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 21.1 | 21.1 | 0.29 | 0.34 | 0.19 |
2021-09-02 02:50:00+00:00 | 14.2 | 14.3 | 98.8 | 1.30 | 19.47 | 0.31 | 1.5 | 3.4 | 0.8 | 307.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.9 | 21.1 | 0.29 | 0.35 | 0.19 |
2021-09-02 02:55:00+00:00 | 14.2 | 14.3 | 98.4 | 0.77 | 20.24 | 0.19 | 2.1 | 3.7 | 0.8 | 301.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.9 | 21.1 | 0.29 | 0.35 | 0.19 |
12 rows × 28 columns
Use loc
to select not only the station id, but also a specific time: we pass in a tuple
to the loc
method
Briefly define tuples as opposed to lists in Python¶
A tuple
is a core Python class, similar to but distinct from a list
Tuples are enclosed in parentheses (); lists are enclosed in brackets []
Elements in a
tuple
are immutable (i.e. can’t be changed), butlist
elements can be changed.
a = (4,7,9) # a tuple
print (a)
print(a[2])
# this next line won't work; comment it out to see:
# a[2] = 8
(4, 7, 9)
9
b = [4, 7, 9]
print (b)
print (b[2])
# this will work:
b[2] = 8
print(b)
[4, 7, 9]
9
[4, 7, 8]
What can be confusing is that just because something is enclosed in brackets, that doesn’t mean it’s a Python list object. In terms of Pandas, the DataFrame’s loc
method typically expects a string that’s enclosed in brackets.
df.loc[('VOOR','2021-09-02 02:45:00 UTC')]
temp_2m [degC] 14.30
temp_9m [degC] 14.40
relative_humidity [percent] 98.90
precip_incremental [mm] 1.51
precip_local [mm] 18.17
precip_max_intensity [mm/min] 0.39
avg_wind_speed_prop [m/s] 1.70
max_wind_speed_prop [m/s] 3.50
wind_speed_stddev_prop [m/s] 0.70
wind_direction_prop [degrees] 290.00
wind_direction_stddev_prop [degrees] 21.00
avg_wind_speed_sonic [m/s] 1.90
max_wind_speed_sonic [m/s] 3.60
wind_speed_stddev_sonic [m/s] 0.70
wind_direction_sonic [degrees] 294.00
wind_direction_stddev_sonic [degrees] 20.00
solar_insolation [W/m^2] 0.00
station_pressure [mbar] 994.87
snow_depth [cm] NaN
frozen_soil_05cm [bit] 0.00
frozen_soil_25cm [bit] 0.00
frozen_soil_50cm [bit] 0.00
soil_temp_05cm [degC] 19.90
soil_temp_25cm [degC] 21.10
soil_temp_50cm [degC] 21.10
soil_moisture_05cm [m^3/m^3] 0.29
soil_moisture_25cm [m^3/m^3] 0.34
soil_moisture_50cm [m^3/m^3] 0.19
Name: (VOOR, 2021-09-02 02:45:00+00:00), dtype: float64
As above, but also select a particular column:
df.loc[('VOOR','2021-09-02 02:45:00 UTC'),'temp_2m [degC]' ]
14.3
Taking advantage of multi-indexing allows us to write selection criteria that is more intuitive than how we’ve previously defined and utilized conditonal statements in our selection criteria (as in our cell earlier in the notebook with nested data frame object names). It also executes quicker!
Pass in multiple columns (either as a list or a tuple) … get back a DataFrame
whose columns are the ones we selected
df.loc[('VOOR','2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] 14.30
precip_incremental [mm] 1.51
Name: (VOOR, 2021-09-02 02:45:00+00:00), dtype: float64
Pass in multiple indexes and multiple columns also returns DataFrame
s:
Two outer and one inner index:
df.loc[(('VOOR','KIND'),'2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] | precip_incremental [mm] | ||
---|---|---|---|
station | time | ||
VOOR | 2021-09-02 02:45:00+00:00 | 14.3 | 1.51 |
KIND | 2021-09-02 02:45:00+00:00 | 16.4 | 0.75 |
Two outer and two inner:
df.loc[(('VOOR','KIND'),('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] | precip_incremental [mm] | ||
---|---|---|---|
station | time | ||
VOOR | 2021-09-02 02:45:00+00:00 | 14.3 | 1.51 |
2021-09-02 02:55:00+00:00 | 14.2 | 0.77 | |
KIND | 2021-09-02 02:45:00+00:00 | 16.4 | 0.75 |
2021-09-02 02:55:00+00:00 | 16.4 | 0.56 |
One outer and two inner:
df.loc[('VOOR',('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] | precip_incremental [mm] | ||
---|---|---|---|
station | time | ||
VOOR | 2021-09-02 02:45:00+00:00 | 14.3 | 1.51 |
2021-09-02 02:55:00+00:00 | 14.2 | 0.77 |
A more efficient way to get all stations at a particular time is via Pandas’ xs
DataFrame method.
# Pass in one index value and what index it belongs to
df.xs('2021-09-02 02:45:00 UTC', level='time')
temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | wind_speed_stddev_prop [m/s] | wind_direction_prop [degrees] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
station | |||||||||||||||||||||
ADDI | 12.8 | 13.0 | 92.3 | 0.00 | 0.00 | 0.00 | 2.3 | 5.0 | 0.9 | 346.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 20.3 | 19.9 | 0.52 | 0.44 | 0.44 |
ANDE | 14.2 | 14.0 | 99.7 | 0.16 | 13.74 | 0.00 | 3.1 | 6.0 | 0.9 | 344.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.1 | 19.1 | 19.2 | 0.28 | 0.22 | 0.14 |
BATA | 14.1 | 16.2 | 77.0 | 0.00 | 0.00 | 0.00 | 2.4 | 2.6 | 0.1 | 309.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.9 | 21.5 | 21.3 | 0.25 | 0.21 | 0.22 |
BEAC | 15.8 | 15.6 | 98.9 | 1.01 | 56.83 | 0.25 | 3.4 | 5.5 | 0.9 | 35.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 17.9 | 18.9 | 19.5 | 0.51 | 0.36 | 0.37 |
BELD | 14.1 | 14.2 | 93.8 | 0.00 | 1.16 | 0.00 | 3.4 | 6.5 | 1.2 | 5.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.7 | 20.1 | 20.2 | 0.50 | 0.43 | 0.41 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
WFMB | 13.0 | 13.6 | 74.2 | 0.00 | 0.00 | 0.00 | 0.4 | 2.0 | 0.4 | 260.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 18.3 | 19.9 | 19.9 | 0.25 | 0.18 | 0.20 |
WGAT | 13.5 | 13.4 | 80.4 | 0.00 | 0.00 | 0.00 | 0.9 | 3.3 | 0.7 | 22.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 18.4 | 20.1 | 20.7 | 0.15 | 0.25 | 0.08 |
WHIT | 16.3 | 16.3 | 93.2 | 0.00 | 0.00 | 0.00 | 1.6 | 3.1 | 0.5 | 28.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 19.1 | 20.3 | 20.1 | 0.28 | 0.47 | 0.46 |
WOLC | 13.6 | 16.3 | 86.1 | 0.00 | 0.00 | 0.00 | 0.6 | 0.9 | 0.1 | 317.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 21.5 | 23.5 | 24.2 | 0.18 | 0.03 | 0.07 |
YORK | 11.6 | 12.5 | 97.2 | 0.00 | 0.00 | 0.00 | 0.1 | 0.7 | 0.2 | 165.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 20.3 | 21.7 | 21.9 | 0.13 | 0.24 | 0.24 |
126 rows × 28 columns
Include just a list of columns (a tuple won’t work here)
df.xs('2021-09-02 02:45:00 UTC', level = 'time')[['temp_2m [degC]','temp_9m [degC]']]
temp_2m [degC] | temp_9m [degC] | |
---|---|---|
station | ||
ADDI | 12.8 | 13.0 |
ANDE | 14.2 | 14.0 |
BATA | 14.1 | 16.2 |
BEAC | 15.8 | 15.6 |
BELD | 14.1 | 14.2 |
... | ... | ... |
WFMB | 13.0 | 13.6 |
WGAT | 13.5 | 13.4 |
WHIT | 16.3 | 16.3 |
WOLC | 13.6 | 16.3 |
YORK | 11.6 | 12.5 |
126 rows × 2 columns
Work with the time index as a Datetime object:¶
Pandas’ handling of datetime
objects is incredibly powerful (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) . Below, we select 3 times using the date_range
method.
timeRange = pd.date_range('2021-09-02 02:25', periods=3, freq='5min')
df.loc[('MANH',timeRange.values),('max_wind_speed_prop [m/s]', 'max_wind_speed_sonic [m/s]')]
/tmp/ipykernel_4158000/1874800629.py:1: FutureWarning: Indexing a timezone-aware DatetimeIndex with a timezone-naive datetime is deprecated and will raise KeyError in a future version. Use a timezone-aware object instead.
df.loc[('MANH',timeRange.values),('max_wind_speed_prop [m/s]', 'max_wind_speed_sonic [m/s]')]
max_wind_speed_prop [m/s] | max_wind_speed_sonic [m/s] | ||
---|---|---|---|
station | time | ||
MANH | 2021-09-02 02:25:00+00:00 | 8.2 | 10.2 |
2021-09-02 02:30:00+00:00 | 8.7 | 10.3 | |
2021-09-02 02:35:00+00:00 | 12.2 | 11.5 |
Summary¶
One can recast a Pandas
DataFrame
so it leverages multi-indexingSelecting and specifying conditions on a multi-index
DataFrame
work just as in a single-indexDataFrame
Pandas has a large set of utilities to work with time-series based data.
xs
is one such method.In Python, tuples are akin to lists, but a tuple’s elements are immutable.