\n",
"\n",
"# Pandas Notebook 6: GroupBy + Split/Apply/Combine\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"1. Read in and customize a DataFrame, using default row index\n",
"1. Split/Apply/Combine\n",
"1. Analyze DataFrames produced by the S/A/C technique\n",
"\n",
"## Prerequisites\n",
"\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| Pandas notebooks 1-8 | Necessary | |\n",
"\n",
"* **Time to learn**: 20 minutes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from datetime import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read in NYSM 5-minute data from a particular day. Use the default row and column index names, but convert the date/time data from `String` to `datetime`. Explicitly set the timezone."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data_file = '/spare11/atm533/data/nysm_data_20210902.csv'\n",
"\n",
"timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n",
"parseTime = lambda x: datetime.strptime(x, timeFormat)\n",
"\n",
"nysm_data = pd.read_csv(nysm_data_file,parse_dates=['time'], date_parser=parseTime)\n",
"nysm_data['time'].dt.tz_localize(tz='UTC')"
]
},
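{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check (via the standard `dtype` attribute), confirm that the `time` column is now a timezone-aware `datetime64` rather than a plain string/object column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Should report datetime64[ns, UTC], not object\n",
"nysm_data['time'].dtype"
]
},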
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove unwanted columns"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.drop(['temp_9m [degC]',\n",
" 'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n",
" 'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n",
" 'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n",
" 'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n",
" 'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n",
" 'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_mapping = {'station' : 'STID',\n",
" 'time': 'TIME',\n",
" 'temp_2m [degC]': 'TMPC',\n",
" 'relative_humidity [percent]': 'RELH',\n",
" 'precip_incremental [mm]': 'PRCP',\n",
" 'precip_local [mm]': 'PTOT',\n",
" 'precip_max_intensity [mm/min]': 'PRAT',\n",
" 'avg_wind_speed_sonic [m/s]': 'SPED',\n",
" 'max_wind_speed_sonic [m/s]': 'GUMS',\n",
" 'wind_direction_sonic [degrees]': 'DRCT', \n",
" 'station_pressure [mbar]': 'PRES',\n",
" 'stid': 'STID',\n",
" 'name': 'NAME',\n",
" 'lat': 'SLAT',\n",
" 'lon': 'SLON',\n",
" 'elevation': 'SELV',\n",
" 'STN': 'STID',\n",
" 'YYMMDD/HHMM': 'TIME'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rename the columns according to our dictionary. Then examine each Dataframe to see how they look."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.rename(columns=column_mapping, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create a column for dewpoint; for the purposes of this notebook, we'll not do the SLP conversion."
]
},
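{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the Magnus approximation (the constants 17.625 and 243.04 follow Alduchov and Eskridge, 1996), with $T$ and $T_d$ in °C and $RH$ in percent:\n",
"\n",
"$$T_d = \\frac{243.04\\left[\\ln\\left(\\frac{RH}{100}\\right) + \\frac{17.625\\,T}{243.04 + T}\\right]}{17.625 - \\ln\\left(\\frac{RH}{100}\\right) - \\frac{17.625\\,T}{243.04 + T}}$$"
]
},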
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmpc = nysm_data['TMPC']\n",
"rh = nysm_data['RELH']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's rearrange the columns into a more logical order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"colOrder = ['STID','TIME','TMPC','DWPC','RELH','PRES','DRCT','SPED','GUMS','PRCP','PTOT','PRAT']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data = nysm_data.reindex(columns=colOrder)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split-Apply-Combine"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now go through the *Split-Apply-Combine* methodology, as defined by Wickham, 2011 and illustrated by McKinney, 2017 (see refs at end of notebook)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Split: Group by station id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"grouped = nysm_data.groupby(['STID'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"grouped"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This has *split* the DataFrame into a Pandas `groupby` object. "
]
},
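{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying anything, we can peek inside the `groupby` object (a quick look via the standard `ngroups` and `groups` attributes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# How many groups (i.e., stations) did the split produce, and what are the first few group keys?\n",
"print(grouped.ngroups)\n",
"list(grouped.groups.keys())[:5]"
]
},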
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Apply and Combine:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we *apply* a function on the now-split groups, which then *combines* the group and the values returned by the function into a new DataFrame. "
]
},
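{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance (a minimal example; `mean` is one of the built-in GroupBy aggregations), applying `mean` to a single column computes the per-station average and combines the results into a new Series indexed by STID:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Per-station mean 2-m temperature for the day\n",
"grouped['TMPC'].mean()"
]
},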
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One built-in `groupby` function is `describe`. It returns a `DataFrame` containing summary statistics, grouped by station, for each column that contains numerical values:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%time df_stns = grouped.describe() # Depending on the size of your DataFrame, this may take a while"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_stns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Analyze DataFrames produced by the S/A/C technique"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Tip: This is just another Pandas DataFrame, but with a multiple (i.e. hierarchical) column index (\"under the hood\", the Multiindexing we demonstrated in Pandas_06 uses `groupby`), and thus we could subset on selected rows and columns / sub-columns from it.
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Five-minute and daily accumulated precip:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df_stns[['PRCP','PTOT']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Temperature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_stns ['TMPC']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the maximum 5-minute precip value for the entire NYSM network for that day?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_stns['PRCP']['max'].max()"
]
},
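{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see *which* station recorded that maximum, `idxmax` (a standard Series method) returns the index label, here the STID, of the largest value:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Station whose 5-minute max precip was the network-wide maximum\n",
"df_stns['PRCP']['max'].idxmax()"
]
},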
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we'll select just a couple of rows. We'll re-set the Pandas' defaults so all rows and columns are displayed as a result of this selection; scroll to the right to see all columns:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.set_option('display.max_rows', None)\n",
"pd.set_option('display.max_columns', None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_stns.loc[['VOOR','WOLC']]"
]
},
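{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Optional cleanup: `pd.reset_option` restores the display defaults, so later cells don't print every row of large DataFrames.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Restore Pandas' default display limits\n",
"pd.reset_option('display.max_rows')\n",
"pd.reset_option('display.max_columns')"
]
},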
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, *split* by station again, *apply* the max/min functions, and *combine* into two DataFrames which contain max and min values for all measured variables:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_maxes = nysm_data.groupby(['STID']).max()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_mins = nysm_data.groupby(['STID']).min()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_maxes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"Tip: Assign new names for these two series; else they will both be labeled TMPC!
\n",
"Note: This is a bit deceptive since it looks like YORK had the maximum values for all parameters. But that is because the station ID, STID got sorted as well.
"
]
},
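{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example (a minimal sketch: `Series.rename` and `pd.concat` are standard Pandas, while the labels `TMAX`/`TMIN` are our own choice), pull TMPC out of each DataFrame and rename it before combining:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the per-station max/min temperature Series, giving each a distinct name\n",
"tmax = df_maxes['TMPC'].rename('TMAX')\n",
"tmin = df_mins['TMPC'].rename('TMIN')\n",
"# Combine side by side; the columns are now distinctly labeled\n",
"pd.concat([tmax, tmin], axis=1)"
]
},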
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could construct a DataFrame containing just the max/mins for some selected parameters for each time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"maxT = df_Maxes[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"maxT"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"Time to re-sort!: If we do not re-sort the index of this DataFrame, the next cell won't work right!