
Pandas: Split/Apply/Combine


Overview

In this notebook, we will analyze a “relatively” large (~1 million-row) DataFrame, consisting of all five-minute NYS Mesonet observations from August 2025. We will make heavy use of Pandas’ powerful groupby function, which will split the dataset into smaller groups; apply functions on the groups; and then combine the results of these functions into a new DataFrame.

Prerequisites

Concepts   Importance   Notes
Pandas     Necessary
  • Time to learn: 20 minutes

Imports

import pandas as pd
import numpy as np
import metpy.calc as mpcalc
from metpy.units import units
from datetime import datetime

Read in NYSM 5-minute data from a past month. Keep the default row and column index names, but convert the date/time data from string to datetime and explicitly set the timezone.

nysm_data_file = '/spare11/atm533/data/202508_nysm_merged.csv'
timeFormat = "%Y-%m-%d %H:%M:%S UTC"
nysm_data = pd.read_csv(nysm_data_file, parse_dates=['time'], date_format=timeFormat)
nysm_data['time'] = nysm_data['time'].dt.tz_localize(tz='UTC')
nysm_data['time']
0          2025-07-31 23:00:00+00:00
1          2025-07-31 23:05:00+00:00
2          2025-07-31 23:10:00+00:00
3          2025-07-31 23:15:00+00:00
4          2025-07-31 23:20:00+00:00
                      ...
1274359    2025-09-03 20:05:00+00:00
1274360    2025-09-03 20:10:00+00:00
1274361    2025-09-03 20:15:00+00:00
1274362    2025-09-03 20:20:00+00:00
1274363    2025-09-03 20:25:00+00:00
Name: time, Length: 1274364, dtype: datetime64[ns, UTC]
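The same pattern can be tried on a small in-memory CSV. The stations and values below are hypothetical; the point is that `tz_localize` returns a new Series, so its result must be assigned back to keep the timezone information:

```python
import io
import pandas as pd

# Toy CSV mimicking the NYSM layout (hypothetical stations and values)
csv_text = """station,time,temp_2m [degC]
VOOR,2025-08-01 00:00:00 UTC,21.5
VOOR,2025-08-01 00:05:00 UTC,21.3
WOLC,2025-08-01 00:00:00 UTC,19.8
"""

# The literal " UTC" in the format string matches the trailing text
df = pd.read_csv(io.StringIO(csv_text),
                 parse_dates=['time'],
                 date_format="%Y-%m-%d %H:%M:%S UTC")

# tz_localize returns a new Series; assign it back to keep the tz
df['time'] = df['time'].dt.tz_localize('UTC')
print(df['time'].dtype)  # datetime64[ns, UTC]
```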

Remove unwanted columns

nysm_data.drop(['temp_9m [degC]',
       'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
       'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
       'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
       'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
       'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
       'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')

The raw data files use varying column names. Let’s standardize by creating a dictionary that maps the current column names to common (and in some cases much shorter) names.

column_mapping = {'station' : 'STID',
                  'time': 'TIME',
                  'temp_2m [degC]': 'TMPC',
                  'relative_humidity [percent]': 'RELH',
                  'precip_incremental [mm]': 'PRCP',
                  'precip_local [mm]': 'PTOT',
                  'precip_max_intensity [mm/min]': 'PRAT',
                  'avg_wind_speed_sonic [m/s]': 'SPED',
                  'max_wind_speed_sonic [m/s]': 'GUMS',
                  'wind_direction_sonic [degrees]': 'DRCT', 
                  'station_pressure [mbar]': 'PRES',
                  'stid': 'STID',
                  'name': 'NAME',
                  'lat': 'SLAT',
                  'lon': 'SLON',
                  'elevation': 'SELV',
                  'STN':  'STID',
                  'YYMMDD/HHMM': 'TIME'}

Rename the columns according to our dictionary, then examine the DataFrame to see how it looks.

nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()

The NYSM data contains neither dewpoint nor sea-level pressure. Calculate and create a column for dewpoint; for the purposes of this notebook, we’ll skip the SLP conversion.

tmpc = nysm_data['TMPC'].values * units('degC')
rh = nysm_data['RELH'].values * units('percent')
tmpc
nysm_data['DWPC'] = mpcalc.dewpoint_from_relative_humidity(tmpc, rh)

Let’s rearrange the columns into a more logical order.

colOrder = ['STID','TIME','TMPC','DWPC','RELH','PRES','DRCT','SPED','GUMS','PRCP','PTOT','PRAT']
nysm_data = nysm_data.reindex(columns=colOrder)
nysm_data

Split-Apply-Combine

Now go through the Split-Apply-Combine methodology, as defined by Wickham (2011) and illustrated by McKinney (2017); see the references at the end of this notebook.

[Figure: the Split-Apply-Combine workflow]

Split: Group by station id

grouped = nysm_data.groupby(['STID'])
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x14be50f64e30>

This has split the DataFrame into a Pandas GroupBy object. Note that no computation has happened yet; the object simply records how to split the data.
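A minimal sketch of what the GroupBy object holds, using a toy DataFrame with hypothetical station IDs:

```python
import pandas as pd

# Toy DataFrame with a station-ID column (hypothetical stations)
df = pd.DataFrame({'STID': ['VOOR', 'WOLC', 'VOOR', 'WOLC'],
                   'TMPC': [21.5, 19.8, 22.1, 20.4]})

grouped = df.groupby('STID')

# Nothing is computed yet; the object just records the split
print(grouped.ngroups)        # 2
print(list(grouped.groups))   # ['VOOR', 'WOLC']

# Iterating yields (group key, sub-DataFrame) pairs
for stid, sub in grouped:
    print(stid, len(sub))
```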

Apply and Combine:

Now we apply a function to the split groups; Pandas then combines each group key with the values returned by the function into a new DataFrame.

One built-in groupby function is describe. It returns a DataFrame containing summary statistics, grouped by station, for each column that contains numerical values:

df_stns = grouped.describe() # Depending on the size of your DataFrame, this may take a while
df_stns

Analyze DataFrames produced by the S/A/C technique

Tip: This is just another Pandas DataFrame, but with multiple (i.e. hierarchical) column indices. Specifically, each column from the original dataframe now has sub-columns, such as count, mean, and 50%.
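The hierarchical columns can be explored on a toy grouped DataFrame (hypothetical stations and temperatures); either chained selection or a `(variable, statistic)` tuple reaches a single sub-column:

```python
import pandas as pd

df = pd.DataFrame({'STID': ['VOOR', 'VOOR', 'WOLC', 'WOLC'],
                   'TMPC': [20.0, 22.0, 18.0, 20.0]})

stats = df.groupby('STID').describe()

# Columns are a MultiIndex: level 0 is the variable, level 1 the statistic
print(stats.columns.nlevels)    # 2

# Chained selection: first the variable, then the statistic
print(stats['TMPC']['mean'])    # per-station means

# A (variable, statistic) tuple selects one column directly
print(stats[('TMPC', 'max')]['VOOR'])   # 22.0
```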

Five-minute and daily accumulated precip:

df_stns[['PRCP','PTOT']]

Temperature

df_stns['TMPC']

What’s the maximum 5-minute precip value across the entire NYSM network for the month?

df_stns['PRCP']['max'].max()
357.21

Next we’ll select just a couple of rows.

df_stns.loc[['VOOR','WOLC']]

Next, split by station again, apply the max/min functions, and combine into two DataFrames which contain max and min values for all measured variables:

df_maxes = nysm_data.groupby(['STID']).max()
df_mins = nysm_data.groupby(['STID']).min()
df_maxes
Tip: Assign new names to these two Series; otherwise they will both be labeled TMPC!
maxT = df_maxes['TMPC'].rename('TMax')
minT = df_mins['TMPC'].rename('TMin')

Create a DataFrame out of these two Series, using Pandas’ concat function.

pd.concat([maxT,minT],axis='columns')
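The alignment behavior of `concat` can be seen on two hand-built Series (hypothetical per-station extremes) that share an index:

```python
import pandas as pd

# Hypothetical per-station extremes as two Series sharing the same index
maxT = pd.Series([30.1, 28.7], index=['VOOR', 'WOLC'], name='TMax')
minT = pd.Series([12.4, 11.9], index=['VOOR', 'WOLC'], name='TMin')

# axis='columns' aligns on the shared index and uses each Series'
# name as its column label
extremes = pd.concat([maxT, minT], axis='columns')
print(extremes)
#       TMax  TMin
# VOOR  30.1  12.4
# WOLC  28.7  11.9
```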

Group by date/time instead of station id, and then apply/combine.

grouped = nysm_data.groupby(['TIME'])
df_Maxes = grouped.max()
df_Mins = grouped.min()
df_Maxes
Note: This is a bit deceptive, since it looks like YORK had the maximum values for all parameters. In fact, max was applied to the STID column too, and YORK is simply the alphabetically last station ID.
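A toy example (hypothetical stations and values) makes the pitfall concrete: `.max()` is applied column by column, so for a string column it returns the alphabetically last value, not the value belonging to the row with the largest measurement:

```python
import pandas as pd

df = pd.DataFrame({'TIME': ['00:00', '00:00', '00:05', '00:05'],
                   'STID': ['VOOR', 'YORK', 'YORK', 'VOOR'],
                   'TMPC': [21.0, 19.0, 18.5, 20.5]})

# STID's "max" is alphabetical; TMPC's max is numerical, and the two
# need not come from the same row
maxes = df.groupby('TIME').max()
print(maxes)
#        STID  TMPC
# TIME
# 00:00  YORK  21.0
# 00:05  YORK  20.5
```

At 00:00 the maximum temperature (21.0) was actually recorded at VOOR, yet STID shows YORK.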

We could construct a DataFrame containing just the max/mins for some selected parameters for each time.

maxT = df_Maxes[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']]
maxT
Time to re-sort! If we do not re-sort the index of this DataFrame, the next cell won't work as expected.
maxT = maxT.sort_index()
columnMap = {'TMPC': 'TMax',
             'DWPC': 'TdMax',
             'SPED': 'SPDMax',
             'GUMS': 'GUSTMax',
             'PRCP': 'P5MinMax',
             'PTOT': 'PDayMax',
             'PRAT': 'PRateMax'}
maxT.rename(columns=columnMap, inplace=True)
maxT

Do the same, but for minima (obviously, most of the columns will have 0 as their minimum).

minT = df_Mins[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']].sort_index()
columnMap = {'TMPC': 'TMin',
             'DWPC': 'TdMin',
             'SPED': 'SPDMin',
             'GUMS': 'GUSTMin',
             'PRCP': 'P5MinMin',
             'PTOT': 'PDayMin',
             'PRAT': 'PRateMin'}
minT.rename(columns=columnMap, inplace=True)
minT
Note: We don't see a station ID column here because we selected only the measurement columns; in any case, each max and min likely comes from a different station.
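If we did want to know which station produced each per-time maximum, one approach (a sketch on toy data, using `idxmax` rather than anything shown above) is to get the row label of each group's maximum and use it to look up the station ID:

```python
import pandas as pd

df = pd.DataFrame({'TIME': ['00:00', '00:00', '00:05', '00:05'],
                   'STID': ['VOOR', 'YORK', 'VOOR', 'YORK'],
                   'TMPC': [21.0, 19.0, 20.5, 22.3]})

# idxmax on the grouped column returns the row label of each group's
# maximum; those labels can then look up the station ID
rows = df.groupby('TIME')['TMPC'].idxmax()
print(df.loc[rows, ['TIME', 'STID', 'TMPC']])
```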

Summary

  • Pandas’ groupby method forms the basis of this library’s split-apply-combine methodology.

What’s Next?

Next, we will look at tabular datasets that are so large that they are best represented in a format other than plain-text.

Resources and References

  1. MetPy Monday 99
  2. Split/Apply/Combine (Wickham, 2011)
  3. Ch. 10 notebook from the GitHub repository for Wes McKinney’s Python for Data Analysis book
References
  1. Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1). 10.18637/jss.v040.i01