pandas Logo

Pandas 8: Multi-index DataFrames


In many instances, tabular data may best be represented with more than a single row index. For the case of NYSM hourly data files, which have 126 stations each with 13 sets of 5-minute obs, we definitely want to take advantage of using Multi-index Dataframes.

Overview

  1. Open an hour’s worth of NYSM data using a single row index dataframe

  2. Review selection and conditional selection methodology

  3. Create a dataframe with station ID and date/time as the two row indices

  4. Use selection and conditions on a multi-index dataframe

  5. Briefly define tuples as opposed to lists in Python

  6. Work with the time index as a Datetime object

Prerequisites

Concepts

Importance

Notes

Pandas notebooks 1-8

Necessary

  • Time to learn: 15 minutes

Imports

import pandas as pd
from datetime import datetime 

Open an hour’s worth of NYSM data using a single row index dataframe

nysm_data_file = '/spare11/atm533/data/nysm_data_2021090202a.csv'
df = pd.read_csv(nysm_data_file)
# Look at the first few rows
df.head(3)
station time temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
0 ADDI 2021-09-02 02:00:00 UTC 13.3 13.4 92.5 0.0 0.0 0.0 2.6 5.8 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
1 ADDI 2021-09-02 02:05:00 UTC 13.2 13.4 92.8 0.0 0.0 0.0 2.7 4.3 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
2 ADDI 2021-09-02 02:10:00 UTC 13.1 13.3 92.8 0.0 0.0 0.0 2.1 3.4 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44

3 rows × 30 columns

Remind ourselves that the default DataFrame’s index is a special type of Python object, called a RangeIndex.

df.index
RangeIndex(start=0, stop=1512, step=1)

Review selection and conditional selection methodology

Here we have multiple stations and multiple times. We can select rows/columns, set conditions, and make further selections based on those conditions as we did in the 03_Pandas_IndexSubsetsConditonals notebook.

df[df['station']=='VOOR'] # Voorheesville, NY
station time temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
1332 VOOR 2021-09-02 02:00:00 UTC 14.9 14.9 98.9 0.26 11.67 0.00 2.2 3.6 ... NaN 0.0 0.0 0.0 20.2 21.1 21.1 0.25 0.33 0.19
1333 VOOR 2021-09-02 02:05:00 UTC 14.8 14.8 99.0 0.27 11.94 0.00 3.0 4.6 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.25 0.33 0.19
1334 VOOR 2021-09-02 02:10:00 UTC 14.7 14.7 98.9 0.39 12.33 0.11 3.2 5.1 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.25 0.33 0.19
1335 VOOR 2021-09-02 02:15:00 UTC 14.6 14.6 98.9 0.80 13.13 0.18 3.6 5.6 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.26 0.33 0.19
1336 VOOR 2021-09-02 02:20:00 UTC 14.6 14.6 98.9 0.88 14.01 0.18 2.9 6.4 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.26 0.33 0.19
1337 VOOR 2021-09-02 02:25:00 UTC 14.5 14.5 98.9 0.66 14.67 0.15 2.9 5.8 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.27 0.33 0.19
1338 VOOR 2021-09-02 02:30:00 UTC 14.5 14.5 99.0 0.45 15.12 0.00 2.5 4.6 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.27 0.34 0.19
1339 VOOR 2021-09-02 02:35:00 UTC 14.4 14.5 99.0 0.55 15.67 0.13 2.6 3.8 ... NaN 0.0 0.0 0.0 20.0 21.1 21.1 0.28 0.34 0.19
1340 VOOR 2021-09-02 02:40:00 UTC 14.4 14.4 99.0 0.99 16.66 0.25 2.2 4.3 ... NaN 0.0 0.0 0.0 19.9 21.1 21.1 0.28 0.34 0.19
1341 VOOR 2021-09-02 02:45:00 UTC 14.3 14.4 98.9 1.51 18.17 0.39 1.7 3.5 ... NaN 0.0 0.0 0.0 19.9 21.1 21.1 0.29 0.34 0.19
1342 VOOR 2021-09-02 02:50:00 UTC 14.2 14.3 98.8 1.30 19.47 0.31 1.5 3.4 ... NaN 0.0 0.0 0.0 19.9 20.9 21.1 0.29 0.35 0.19
1343 VOOR 2021-09-02 02:55:00 UTC 14.2 14.3 98.4 0.77 20.24 0.19 2.1 3.7 ... NaN 0.0 0.0 0.0 19.9 20.9 21.1 0.29 0.35 0.19

12 rows × 30 columns

Create a second dataframe that first contains just VOOR rows, then further subset by choosing one time.

df2 = df[df['station']=='VOOR']
df2[df2['time'] == '2021-09-02 02:50:00 UTC'] # we haven't recast this as a Datetime object ... it's just a string
station time temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
1342 VOOR 2021-09-02 02:50:00 UTC 14.2 14.3 98.8 1.3 19.47 0.31 1.5 3.4 ... NaN 0.0 0.0 0.0 19.9 20.9 21.1 0.29 0.35 0.19

1 rows × 30 columns

However, there is a better way! Since our DataFrame consists of one station after another, each associated with a number of discrete times, let’s create a multi-indexed DataFrame, that has station as its outer row index and time as its inner.

Create a dataframe with station ID and date/time as the two row indices

df.set_index(['station', 'time'], inplace = True)
Note: inplace means that the df object gets re-created with its new indexes. It's convenient, but if at any point we wish to re-run cells beyond this point, it is better to start from the beginning to ensure that the df object conforms to whatever code cell operates on it.
df
temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] wind_speed_stddev_prop [m/s] wind_direction_prop [degrees] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
station time
ADDI 2021-09-02 02:00:00 UTC 13.3 13.4 92.5 0.0 0.0 0.0 2.6 5.8 0.9 344.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
2021-09-02 02:05:00 UTC 13.2 13.4 92.8 0.0 0.0 0.0 2.7 4.3 0.6 340.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
2021-09-02 02:10:00 UTC 13.1 13.3 92.8 0.0 0.0 0.0 2.1 3.4 0.5 336.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
2021-09-02 02:15:00 UTC 13.0 13.2 92.9 0.0 0.0 0.0 2.4 4.1 0.6 342.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
2021-09-02 02:20:00 UTC 12.9 13.1 93.0 0.0 0.0 0.0 2.3 4.6 0.9 338.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
YORK 2021-09-02 02:35:00 UTC 11.7 12.8 97.1 0.0 0.0 0.0 0.2 0.6 0.2 136.0 ... NaN 0.0 0.0 0.0 20.3 21.7 21.9 0.13 0.24 0.24
2021-09-02 02:40:00 UTC 11.6 12.5 97.0 0.0 0.0 0.0 0.6 0.8 0.1 131.0 ... NaN 0.0 0.0 0.0 20.3 21.7 21.9 0.13 0.24 0.24
2021-09-02 02:45:00 UTC 11.6 12.5 97.2 0.0 0.0 0.0 0.1 0.7 0.2 165.0 ... NaN 0.0 0.0 0.0 20.3 21.7 21.9 0.13 0.24 0.24
2021-09-02 02:50:00 UTC 11.5 12.7 97.3 0.0 0.0 0.0 0.7 0.8 0.1 193.0 ... NaN 0.0 0.0 0.0 20.2 21.7 21.9 0.13 0.24 0.24
2021-09-02 02:55:00 UTC 11.5 12.4 97.4 0.0 0.0 0.0 0.5 0.7 0.1 168.0 ... NaN 0.0 0.0 0.0 20.1 21.7 21.9 0.13 0.24 0.24

1512 rows × 28 columns

df.index
MultiIndex([('ADDI', '2021-09-02 02:00:00 UTC'),
            ('ADDI', '2021-09-02 02:05:00 UTC'),
            ('ADDI', '2021-09-02 02:10:00 UTC'),
            ('ADDI', '2021-09-02 02:15:00 UTC'),
            ('ADDI', '2021-09-02 02:20:00 UTC'),
            ('ADDI', '2021-09-02 02:25:00 UTC'),
            ('ADDI', '2021-09-02 02:30:00 UTC'),
            ('ADDI', '2021-09-02 02:35:00 UTC'),
            ('ADDI', '2021-09-02 02:40:00 UTC'),
            ('ADDI', '2021-09-02 02:45:00 UTC'),
            ...
            ('YORK', '2021-09-02 02:10:00 UTC'),
            ('YORK', '2021-09-02 02:15:00 UTC'),
            ('YORK', '2021-09-02 02:20:00 UTC'),
            ('YORK', '2021-09-02 02:25:00 UTC'),
            ('YORK', '2021-09-02 02:30:00 UTC'),
            ('YORK', '2021-09-02 02:35:00 UTC'),
            ('YORK', '2021-09-02 02:40:00 UTC'),
            ('YORK', '2021-09-02 02:45:00 UTC'),
            ('YORK', '2021-09-02 02:50:00 UTC'),
            ('YORK', '2021-09-02 02:55:00 UTC')],
           names=['station', 'time'], length=1512)

The date/time index is currently a string. Let’s do this nifty trick to convert it to a datetime object.

df.index = df.index.set_levels([df.index.levels[0], pd.to_datetime(df.index.levels[1])])
#Examine first three row indices
df.index[:3]
MultiIndex([('ADDI', '2021-09-02 02:00:00+00:00'),
            ('ADDI', '2021-09-02 02:05:00+00:00'),
            ('ADDI', '2021-09-02 02:10:00+00:00')],
           names=['station', 'time'])

Use selection and conditions on a multi-index dataframe

Select a column in the usual way.

df.loc['VOOR']
temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] wind_speed_stddev_prop [m/s] wind_direction_prop [degrees] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
time
2021-09-02 02:00:00+00:00 14.9 14.9 98.9 0.26 11.67 0.00 2.2 3.6 0.6 301.0 ... NaN 0.0 0.0 0.0 20.2 21.1 21.1 0.25 0.33 0.19
2021-09-02 02:05:00+00:00 14.8 14.8 99.0 0.27 11.94 0.00 3.0 4.6 0.8 299.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.25 0.33 0.19
2021-09-02 02:10:00+00:00 14.7 14.7 98.9 0.39 12.33 0.11 3.2 5.1 0.8 301.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.25 0.33 0.19
2021-09-02 02:15:00+00:00 14.6 14.6 98.9 0.80 13.13 0.18 3.6 5.6 0.7 299.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.26 0.33 0.19
2021-09-02 02:20:00+00:00 14.6 14.6 98.9 0.88 14.01 0.18 2.9 6.4 0.9 300.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.26 0.33 0.19
2021-09-02 02:25:00+00:00 14.5 14.5 98.9 0.66 14.67 0.15 2.9 5.8 0.9 298.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.27 0.33 0.19
2021-09-02 02:30:00+00:00 14.5 14.5 99.0 0.45 15.12 0.00 2.5 4.6 0.8 301.0 ... NaN 0.0 0.0 0.0 20.1 21.1 21.1 0.27 0.34 0.19
2021-09-02 02:35:00+00:00 14.4 14.5 99.0 0.55 15.67 0.13 2.6 3.8 0.5 292.0 ... NaN 0.0 0.0 0.0 20.0 21.1 21.1 0.28 0.34 0.19
2021-09-02 02:40:00+00:00 14.4 14.4 99.0 0.99 16.66 0.25 2.2 4.3 0.8 298.0 ... NaN 0.0 0.0 0.0 19.9 21.1 21.1 0.28 0.34 0.19
2021-09-02 02:45:00+00:00 14.3 14.4 98.9 1.51 18.17 0.39 1.7 3.5 0.7 290.0 ... NaN 0.0 0.0 0.0 19.9 21.1 21.1 0.29 0.34 0.19
2021-09-02 02:50:00+00:00 14.2 14.3 98.8 1.30 19.47 0.31 1.5 3.4 0.8 307.0 ... NaN 0.0 0.0 0.0 19.9 20.9 21.1 0.29 0.35 0.19
2021-09-02 02:55:00+00:00 14.2 14.3 98.4 0.77 20.24 0.19 2.1 3.7 0.8 301.0 ... NaN 0.0 0.0 0.0 19.9 20.9 21.1 0.29 0.35 0.19

12 rows × 28 columns

Use loc to select not only the station id, but also a specific time: we pass in a tuple to the loc method

Briefly define tuples as opposed to lists in Python

A tuple is a core Python class, similar to but distinct from a list

  1. Tuples are enclosed in parentheses (); lists are enclosed in brackets []

  2. Elements in a tuple are immutable (i.e. can’t be changed), but list elements can be changed.

a = (4,7,9) # a tuple
print (a)
print(a[2])
# this next line won't work; comment it out to see:
# a[2] = 8
(4, 7, 9)
9
b = [4, 7, 9]
print (b)
print (b[2])
# this will work:
b[2] = 8
print(b)
[4, 7, 9]
9
[4, 7, 8]
Tip: Some Pandas methods accept tuples as their arguments, although lists will usually work too. When in doubt, append a ? to the method call to see the documentation.

What can be confusing is that just because something is enclosed in brackets, that doesn’t mean it’s a Python list object. In terms of Pandas, the DataFrame’s loc method typically expects a string that’s enclosed in brackets.

Tip: Note: we can pass in the requested time as a string ... it will work!
df.loc[('VOOR','2021-09-02 02:45:00 UTC')]
temp_2m [degC]                            14.30
temp_9m [degC]                            14.40
relative_humidity [percent]               98.90
precip_incremental [mm]                    1.51
precip_local [mm]                         18.17
precip_max_intensity [mm/min]              0.39
avg_wind_speed_prop [m/s]                  1.70
max_wind_speed_prop [m/s]                  3.50
wind_speed_stddev_prop [m/s]               0.70
wind_direction_prop [degrees]            290.00
wind_direction_stddev_prop [degrees]      21.00
avg_wind_speed_sonic [m/s]                 1.90
max_wind_speed_sonic [m/s]                 3.60
wind_speed_stddev_sonic [m/s]              0.70
wind_direction_sonic [degrees]           294.00
wind_direction_stddev_sonic [degrees]     20.00
solar_insolation [W/m^2]                   0.00
station_pressure [mbar]                  994.87
snow_depth [cm]                             NaN
frozen_soil_05cm [bit]                     0.00
frozen_soil_25cm [bit]                     0.00
frozen_soil_50cm [bit]                     0.00
soil_temp_05cm [degC]                     19.90
soil_temp_25cm [degC]                     21.10
soil_temp_50cm [degC]                     21.10
soil_moisture_05cm [m^3/m^3]               0.29
soil_moisture_25cm [m^3/m^3]               0.34
soil_moisture_50cm [m^3/m^3]               0.19
Name: (VOOR, 2021-09-02 02:45:00+00:00), dtype: float64

As above, but also select a particular column:

df.loc[('VOOR','2021-09-02 02:45:00 UTC'),'temp_2m [degC]' ]
14.3

Taking advantage of multi-indexing allows us to write selection criteria that is more intuitive than how we’ve previously defined and utilized conditonal statements in our selection criteria (as in our cell earlier in the notebook with nested data frame object names). It also executes quicker!

Pass in multiple columns (either as a list or a tuple) … get back a DataFrame whose columns are the ones we selected

df.loc[('VOOR','2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC]             14.30
precip_incremental [mm]     1.51
Name: (VOOR, 2021-09-02 02:45:00+00:00), dtype: float64

Pass in multiple indexes and multiple columns also returns DataFrames:

  1. Two outer and one inner index:

df.loc[(('VOOR','KIND'),'2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] precip_incremental [mm]
station time
VOOR 2021-09-02 02:45:00+00:00 14.3 1.51
KIND 2021-09-02 02:45:00+00:00 16.4 0.75
  1. Two outer and two inner:

df.loc[(('VOOR','KIND'),('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] precip_incremental [mm]
station time
VOOR 2021-09-02 02:45:00+00:00 14.3 1.51
2021-09-02 02:55:00+00:00 14.2 0.77
KIND 2021-09-02 02:45:00+00:00 16.4 0.75
2021-09-02 02:55:00+00:00 16.4 0.56
  1. One outer and two inner:

df.loc[('VOOR',('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]
temp_2m [degC] precip_incremental [mm]
station time
VOOR 2021-09-02 02:45:00+00:00 14.3 1.51
2021-09-02 02:55:00+00:00 14.2 0.77

A more efficient way to get all stations at a particular time is via Pandas’ xs DataFrame method.

# Pass in one index value and what index it belongs to
df.xs('2021-09-02 02:45:00 UTC', level='time')
temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] wind_speed_stddev_prop [m/s] wind_direction_prop [degrees] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
station
ADDI 12.8 13.0 92.3 0.00 0.00 0.00 2.3 5.0 0.9 346.0 ... NaN 0.0 0.0 0.0 19.9 20.3 19.9 0.52 0.44 0.44
ANDE 14.2 14.0 99.7 0.16 13.74 0.00 3.1 6.0 0.9 344.0 ... NaN 0.0 0.0 0.0 19.1 19.1 19.2 0.28 0.22 0.14
BATA 14.1 16.2 77.0 0.00 0.00 0.00 2.4 2.6 0.1 309.0 ... NaN 0.0 0.0 0.0 19.9 21.5 21.3 0.25 0.21 0.22
BEAC 15.8 15.6 98.9 1.01 56.83 0.25 3.4 5.5 0.9 35.0 ... NaN 0.0 0.0 0.0 17.9 18.9 19.5 0.51 0.36 0.37
BELD 14.1 14.2 93.8 0.00 1.16 0.00 3.4 6.5 1.2 5.0 ... NaN 0.0 0.0 0.0 19.7 20.1 20.2 0.50 0.43 0.41
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
WFMB 13.0 13.6 74.2 0.00 0.00 0.00 0.4 2.0 0.4 260.0 ... NaN 0.0 0.0 0.0 18.3 19.9 19.9 0.25 0.18 0.20
WGAT 13.5 13.4 80.4 0.00 0.00 0.00 0.9 3.3 0.7 22.0 ... NaN 0.0 0.0 0.0 18.4 20.1 20.7 0.15 0.25 0.08
WHIT 16.3 16.3 93.2 0.00 0.00 0.00 1.6 3.1 0.5 28.0 ... NaN 0.0 0.0 0.0 19.1 20.3 20.1 0.28 0.47 0.46
WOLC 13.6 16.3 86.1 0.00 0.00 0.00 0.6 0.9 0.1 317.0 ... NaN 0.0 0.0 0.0 21.5 23.5 24.2 0.18 0.03 0.07
YORK 11.6 12.5 97.2 0.00 0.00 0.00 0.1 0.7 0.2 165.0 ... NaN 0.0 0.0 0.0 20.3 21.7 21.9 0.13 0.24 0.24

126 rows × 28 columns

Include just a list of columns (a tuple won’t work here)

df.xs('2021-09-02 02:45:00 UTC', level = 'time')[['temp_2m [degC]','temp_9m [degC]']]
temp_2m [degC] temp_9m [degC]
station
ADDI 12.8 13.0
ANDE 14.2 14.0
BATA 14.1 16.2
BEAC 15.8 15.6
BELD 14.1 14.2
... ... ...
WFMB 13.0 13.6
WGAT 13.5 13.4
WHIT 16.3 16.3
WOLC 13.6 16.3
YORK 11.6 12.5

126 rows × 2 columns

Work with the time index as a Datetime object:

Pandas’ handling of datetime objects is incredibly powerful (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) . Below, we select 3 times using the date_range method.

timeRange = pd.date_range('2021-09-02 02:25', periods=3, freq='5min')
df.loc[('MANH',timeRange.values),('max_wind_speed_prop [m/s]', 'max_wind_speed_sonic [m/s]')]
/tmp/ipykernel_4158000/1874800629.py:1: FutureWarning: Indexing a timezone-aware DatetimeIndex with a timezone-naive datetime is deprecated and will raise KeyError in a future version. Use a timezone-aware object instead.
  df.loc[('MANH',timeRange.values),('max_wind_speed_prop [m/s]', 'max_wind_speed_sonic [m/s]')]
max_wind_speed_prop [m/s] max_wind_speed_sonic [m/s]
station time
MANH 2021-09-02 02:25:00+00:00 8.2 10.2
2021-09-02 02:30:00+00:00 8.7 10.3
2021-09-02 02:35:00+00:00 12.2 11.5

Summary

  • One can recast a Pandas DataFrame so it leverages multi-indexing

  • Selecting and specifying conditions on a multi-index DataFrame work just as in a single-index DataFrame

  • Pandas has a large set of utilities to work with time-series based data. xs is one such method.

  • In Python, tuples are akin to lists, but a tuple’s elements are immutable.

What’s Next?

We will next explore dataframes with geographic attributes, via the GeoPandas package.