{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\"pandas
\n", "\n", "# Pandas Notebook 6: GroupBy + Split/Apply/Combine\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "1. Read in and customize a DataFrame, using default row index\n", "1. Split/Apply/Combine\n", "1. Analyze DataFrames produced by the S/A/C technique\n", "\n", "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Pandas notebooks 1-8 | Necessary | |\n", "\n", "* **Time to learn**: 20 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Read in NYSM 5-minute data from a particular day. Use the default row and column index names, but convert the date/time data from `String` to `datetime`. Explicitly set the timezone." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data_file = '/spare11/atm533/data/nysm_data_20210902.csv'\n", "\n", "timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n", "parseTime = lambda x: datetime.strptime(x, timeFormat)\n", "\n", "nysm_data = pd.read_csv(nysm_data_file,parse_dates=['time'], date_parser=parseTime)\n", "nysm_data['time'].dt.tz_localize(tz='UTC')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove unwanted columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.drop(['temp_9m [degC]',\n", " 'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n", " 'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n", " 'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n", " 'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n", " 'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n", " 'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column_mapping = {'station' : 'STID',\n", " 'time': 'TIME',\n", " 'temp_2m [degC]': 'TMPC',\n", " 'relative_humidity [percent]': 'RELH',\n", " 'precip_incremental [mm]': 'PRCP',\n", " 'precip_local [mm]': 'PTOT',\n", " 'precip_max_intensity [mm/min]': 'PRAT',\n", " 'avg_wind_speed_sonic [m/s]': 'SPED',\n", " 'max_wind_speed_sonic [m/s]': 'GUMS',\n", " 'wind_direction_sonic [degrees]': 'DRCT', \n", " 'station_pressure [mbar]': 'PRES',\n", " 'stid': 'STID',\n", " 'name': 'NAME',\n", " 'lat': 'SLAT',\n", " 'lon': 'SLON',\n", " 'elevation': 'SELV',\n", " 'STN': 'STID',\n", " 'YYMMDD/HHMM': 'TIME'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rename the columns according to our dictionary. Then examine each Dataframe to see how they look." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.rename(columns=column_mapping, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create a column for dewpoint; for the purposes of this notebook, we'll not do the SLP conversion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmpc = nysm_data['TMPC']\n", "rh = nysm_data['RELH']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's rearrange the columns into a more logical order." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colOrder = ['STID','TIME','TMPC','DWPC','RELH','PRES','DRCT','SPED','GUMS','PRCP','PTOT','PRAT']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data = nysm_data.reindex(columns=colOrder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split-Apply-Combine" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now go through the *Split-Apply-Combine* methodology, as defined by Wickham, 2011 and illustrated by McKinney, 2017 (see refs at end of notebook)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\"SplitApplyCombine\"
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Split: Group by station id" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = nysm_data.groupby(['STID'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This has *split* the DataFrame into a Pandas `groupby` object. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Apply and Combine:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we *apply* a function on the now-split groups, which then *combines* the group and the values returned by the function into a new DataFrame. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One built-in `groupby` function is `describe`. It returns a `DataFrame` containing summary statistics, grouped by station, for each column that contains numerical values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time df_stns = grouped.describe() # Depending on the size of your DataFrame, this may take a while" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_stns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Analyze DataFrames produced by the S/A/C technique" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: This is just another Pandas DataFrame, but with a multiple (i.e. hierarchical) column index (\"under the hood\", the Multiindexing we demonstrated in Pandas_06 uses `groupby`), and thus we could subset on selected rows and columns / sub-columns from it.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Five-minute and daily accumulated precip:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df_stns[['PRCP','PTOT']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Temperature" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_stns ['TMPC']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's the maximum 5-minute precip value for the entire NYSM network for that day?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_stns['PRCP']['max'].max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we'll select just a couple of rows. We'll re-set the Pandas' defaults so all rows and columns are displayed as a result of this selection; scroll to the right to see all columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_stns.loc[['VOOR','WOLC']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, *split* by station again, *apply* the max/min functions, and *combine* into two DataFrames which contain max and min values for all measured variables:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_maxes = nysm_data.groupby(['STID']).max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_mins = nysm_data.groupby(['STID']).min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_maxes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: Assign new names for these two series; else they will both be labeled TMPC!
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT = df_maxes['TMPC'].rename('TMax')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT = df_mins['TMPC'].rename('TMin')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a DataFrame out of these two Series, using Pandas' `concat` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([maxT,minT],axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Group by* date/time instead of station id, and then apply/combine." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = nysm_data.groupby(['TIME'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_Maxes = grouped.max()\n", "df_Mins = grouped.min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_Maxes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note: This is a bit deceptive since it looks like YORK had the maximum values for all parameters. But that is because the station ID, STID got sorted as well.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could construct a DataFrame containing just the max/mins for some selected parameters for each time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT = df_Maxes[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Time to re-sort!: If we do not re-sort the index of this DataFrame, the next cell won't work right!
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT = maxT.sort_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "columnMap = {'TMPC': 'TMax',\n", " 'DWPC': 'TdMax',\n", " 'SPED': 'SPDMax',\n", " 'GUMS': 'GUSTMax',\n", " 'PRCP': 'P5MinMax',\n", " 'PTOT': 'PDayMax',\n", " 'PRAT': 'PRateMax'}\n", "maxT.rename(columns=columnMap, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the same but for minima (obviously, most of the columns will have 0 as a minimum)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT = df_Mins[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']].sort_index()\n", "columnMap = {'TMPC': 'TMin',\n", " 'DWPC': 'TdMin',\n", " 'SPED': 'SPDMin',\n", " 'GUMS': 'GUSTMin',\n", " 'PRCP': 'P5MinMin',\n", " 'PTOT': 'PDayMin',\n", " 'PRAT': 'PRateMin'}\n", "minT.rename(columns=columnMap, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Note: We don't see a station ID column, but that's because each max and min likely comes from a different station.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* Pandas' `groupby` method forms the basis of this library's *split-apply-combine* methodology.\n", "\n", "### What's Next?\n", "Next, we will look at `Dataframe`s with multiple indices.\n", "## Resources and References\n", "1. [MetPy Monday 99](https://www.youtube.com/watch?v=31k53iHE-yw&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=90&ab_channel=Unidata) \n", "1. [Split/Apply/Combine (Wickham, 2011)](http://dx.doi.org/10.18637/jss.v040.i01)\n", "1. [Ch. 10 notebook from Github Repository for Wes McKinney's Pandas for Data Analysis book](https://github.com/wesm/pydata-book) " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }