Pandas: Split/Apply/Combine
Overview¶
In this notebook, we will analyze a “relatively” large (~1.3 million-row) DataFrame, consisting of all five-minute NYS Mesonet observations from August 2025. We will make heavy use of Pandas’ powerful groupby method, which will split the dataset into smaller groups; apply functions on the groups; and then combine the results of these functions into a new DataFrame.
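Before loading the full dataset, the split-apply-combine pattern can be illustrated on a toy DataFrame (the two station IDs below are real NYSM sites, but the temperature values are made up for illustration):

```python
import pandas as pd

# A tiny synthetic stand-in for the Mesonet data: two stations, two obs each
df = pd.DataFrame({
    'STID': ['VOOR', 'VOOR', 'WOLC', 'WOLC'],
    'TMPC': [21.0, 23.0, 18.5, 19.5],
})

# Split by station, apply the mean, combine into a new object indexed by STID
means = df.groupby('STID')['TMPC'].mean()
print(means)
```

The same three steps, applied to the real million-row DataFrame, are what the rest of this notebook walks through.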
Prerequisites¶
| Concepts | Importance | Notes |
| --- | --- | --- |
| Pandas | Necessary | |
- Time to learn: 20 minutes
Imports¶
import pandas as pd
import numpy as np
import metpy.calc as mpcalc
from metpy.units import units
from datetime import datetime
Read in NYSM 5-minute data from a past month. Use the default row and column index names, but convert the date/time data from String to datetime. Explicitly set the timezone.¶
nysm_data_file = '/spare11/atm533/data/202508_nysm_merged.csv'
timeFormat = "%Y-%m-%d %H:%M:%S UTC"
nysm_data = pd.read_csv(nysm_data_file, parse_dates=['time'], date_format=timeFormat)
nysm_data['time'] = nysm_data['time'].dt.tz_localize(tz='UTC')  # assign back so the change persists
nysm_data['time']
0 2025-07-31 23:00:00+00:00
1 2025-07-31 23:05:00+00:00
2 2025-07-31 23:10:00+00:00
3 2025-07-31 23:15:00+00:00
4 2025-07-31 23:20:00+00:00
...
1274359 2025-09-03 20:05:00+00:00
1274360 2025-09-03 20:10:00+00:00
1274361 2025-09-03 20:15:00+00:00
1274362 2025-09-03 20:20:00+00:00
1274363 2025-09-03 20:25:00+00:00
Name: time, Length: 1274364, dtype: datetime64[ns, UTC]
Remove unwanted columns
nysm_data.drop(['temp_9m [degC]',
'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')
Each DataFrame has varying column names. Let’s standardize by creating a dictionary that will map current column names to common (and in some cases, much shorter) names.
column_mapping = {'station' : 'STID',
'time': 'TIME',
'temp_2m [degC]': 'TMPC',
'relative_humidity [percent]': 'RELH',
'precip_incremental [mm]': 'PRCP',
'precip_local [mm]': 'PTOT',
'precip_max_intensity [mm/min]': 'PRAT',
'avg_wind_speed_sonic [m/s]': 'SPED',
'max_wind_speed_sonic [m/s]': 'GUMS',
'wind_direction_sonic [degrees]': 'DRCT',
'station_pressure [mbar]': 'PRES',
'stid': 'STID',
'name': 'NAME',
'lat': 'SLAT',
'lon': 'SLON',
'elevation': 'SELV',
'STN': 'STID',
'YYMMDD/HHMM': 'TIME'}
Rename the columns according to our dictionary. Then examine the DataFrame to see how it looks.
nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
The NYSM data does not yet contain dewpoint or sea-level pressure. Calculate and create a column for dewpoint; for the purposes of this notebook, we’ll skip the SLP conversion.
tmpc = nysm_data['TMPC'].values * units('degC')
rh = nysm_data['RELH'].values * units('percent')
tmpc
nysm_data['DWPC'] = mpcalc.dewpoint_from_relative_humidity(tmpc, rh)
Let’s rearrange the columns into a more logical order.
colOrder = ['STID','TIME','TMPC','DWPC','RELH','PRES','DRCT','SPED','GUMS','PRCP','PTOT','PRAT']
nysm_data = nysm_data.reindex(columns=colOrder)
nysm_data
Split-Apply-Combine¶
Now go through the Split-Apply-Combine methodology, as defined by Wickham, 2011 and illustrated by McKinney, 2017 (see refs at end of notebook)
Split: Group by station id¶
grouped = nysm_data.groupby(['STID'])
grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x14be50f64e30>
This has split the DataFrame into a Pandas groupby object.
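A GroupBy object is lazy: nothing is computed until a function is applied to it. It can nevertheless be inspected directly. A sketch on a small synthetic frame (station IDs are real NYSM sites; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'STID': ['VOOR', 'WOLC', 'VOOR'],
                   'TMPC': [21.0, 18.5, 23.0]})
g = df.groupby('STID')

print(g.ngroups)            # number of groups (here, stations)
print(list(g.groups))       # the group keys
print(g.get_group('VOOR'))  # just the rows belonging to one station
```

On the real `grouped` object, `grouped.get_group(('VOOR',))` would likewise pull out a single station’s observations without computing anything for the other 125 sites (note the tuple key, since we grouped on a list of columns).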
Apply and Combine:¶
Now we apply a function on the now-split groups, which then combines the group and the values returned by the function into a new DataFrame.
One built-in groupby function is describe. It returns a DataFrame containing summary statistics, grouped by station, for each column that contains numerical values:
df_stns = grouped.describe() # Depending on the size of your DataFrame, this may take a while
df_stns
Analyze DataFrames produced by the S/A/C technique¶
Each variable gets a set of statistic columns, such as count, mean, and 50% (the median).

Five-minute and daily accumulated precip:
df_stns[['PRCP','PTOT']]
Temperature
df_stns['TMPC']
What’s the maximum 5-minute precip value for the entire NYSM network for the month?
df_stns['PRCP']['max'].max()
357.21
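To find which station recorded that maximum, idxmax returns the index label of the largest value instead of the value itself. A sketch on a synthetic stand-in for the per-station maxima (the precip numbers here are invented, apart from the 357.21 seen above):

```python
import pandas as pd

# Synthetic stand-in for df_stns['PRCP']['max']: one maximum per station
prcp_max = pd.Series({'VOOR': 12.3, 'WOLC': 357.21}, name='max')

# idxmax gives the index label (the station ID) of the largest value
print(prcp_max.idxmax())
```

In the notebook, `df_stns['PRCP']['max'].idxmax()` would name the wettest site directly.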
Next we’ll select just a couple of rows.
df_stns.loc[['VOOR','WOLC']]
Next, split by station again, apply the max/min functions, and combine into two DataFrames which contain max and min values for all measured variables:
df_maxes = nysm_data.groupby(['STID']).max()
df_mins = nysm_data.groupby(['STID']).min()
df_maxes
maxT = df_maxes['TMPC'].rename('TMax')
minT = df_mins['TMPC'].rename('TMin')
Create a DataFrame out of these two Series, using Pandas’ concat function.
pd.concat([maxT,minT],axis='columns')
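As an alternative to two separate groupbys plus a concat, named aggregation computes both extremes in a single pass over the data and labels the output columns at the same time. A minimal sketch on synthetic data (same placeholder values as earlier):

```python
import pandas as pd

df = pd.DataFrame({
    'STID': ['VOOR', 'VOOR', 'WOLC', 'WOLC'],
    'TMPC': [21.0, 23.0, 18.5, 19.5],
})

# Named aggregation: keyword names become the output column names
ranges = df.groupby('STID')['TMPC'].agg(TMax='max', TMin='min')
print(ranges)
```

On the real data, `nysm_data.groupby(['STID'])['TMPC'].agg(TMax='max', TMin='min')` would produce the same DataFrame as the concat above.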
Group by date/time instead of station id, and then apply/combine.
grouped = nysm_data.groupby(['TIME'])
df_Maxes = grouped.max()
df_Mins = grouped.min()
df_Maxes
We could construct a DataFrame containing just the max/mins for some selected parameters for each time.
maxT = df_Maxes[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']]
maxT
maxT = maxT.sort_index()
columnMap = {'TMPC': 'TMax',
'DWPC': 'TdMax',
'SPED': 'SPDMax',
'GUMS': 'GUSTMax',
'PRCP': 'P5MinMax',
'PTOT': 'PDayMax',
'PRAT': 'PRateMax'}
maxT.rename(columns=columnMap, inplace=True)
maxT
Do the same but for minima (obviously, most of the columns will have 0 as a minimum)
minT = df_Mins[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']].sort_index()
columnMap = {'TMPC': 'TMin',
'DWPC': 'TdMin',
'SPED': 'SPDMin',
'GUMS': 'GUSTMin',
'PRCP': 'P5MinMin',
'PTOT': 'PDayMin',
'PRAT': 'PRateMin'}
minT.rename(columns=columnMap, inplace=True)
minT
Summary¶
- Pandas’ groupby method forms the basis of this library’s split-apply-combine methodology.
What’s Next?¶
Next, we will look at tabular datasets that are so large that they are best represented in a format other than plain-text.
Resources and References¶
- Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1). doi:10.18637/jss.v040.i01
- McKinney, W. (2017). Python for Data Analysis, 2nd Edition. O’Reilly Media.