{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg\" alt=\"pandas Logo\" style=\"width: 800px;\"/></center>\n",
    "\n",
    "# Pandas Notebook 6: GroupBy + Split/Apply/Combine\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "1. Read in and customize a DataFrame, using default row index\n",
    "1. Split/Apply/Combine\n",
    "1. Analyze DataFrames produced by the S/A/C technique\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
    "| Pandas notebooks 1-8 | Necessary | |\n",
    "\n",
    "* **Time to learn**: 20 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "from datetime import datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  Read in NYSM 5-minute data from a particular day. Use the default row and column index names, but convert the date/time data from `String` to `datetime`. Explicitly set the timezone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data_file = '/spare11/atm533/data/nysm_data_20210902.csv'\n",
    "\n",
    "timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n",
    "parseTime = lambda x: datetime.strptime(x, timeFormat)\n",
    "\n",
    "nysm_data = pd.read_csv(nysm_data_file,parse_dates=['time'], date_parser=parseTime)\n",
    "nysm_data['time'].dt.tz_localize(tz='UTC')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove unwanted columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.drop(['temp_9m [degC]',\n",
    "       'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n",
    "       'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n",
    "       'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n",
    "       'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n",
    "       'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n",
    "       'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_mapping = {'station' : 'STID',\n",
    "                  'time': 'TIME',\n",
    "                  'temp_2m [degC]': 'TMPC',\n",
    "                  'relative_humidity [percent]': 'RELH',\n",
    "                  'precip_incremental [mm]': 'PRCP',\n",
    "                  'precip_local [mm]': 'PTOT',\n",
    "                  'precip_max_intensity [mm/min]': 'PRAT',\n",
    "                  'avg_wind_speed_sonic [m/s]': 'SPED',\n",
    "                  'max_wind_speed_sonic [m/s]': 'GUMS',\n",
    "                  'wind_direction_sonic [degrees]': 'DRCT', \n",
    "                  'station_pressure [mbar]': 'PRES',\n",
    "                  'stid': 'STID',\n",
    "                  'name': 'NAME',\n",
    "                  'lat': 'SLAT',\n",
    "                  'lon': 'SLON',\n",
    "                  'elevation': 'SELV',\n",
    "                  'STN':  'STID',\n",
    "                  'YYMMDD/HHMM': 'TIME'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rename the columns according to our dictionary. Then examine each Dataframe to see how they look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.rename(columns=column_mapping, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create a column for dewpoint; for the purposes of this notebook, we'll not do the SLP conversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmpc = nysm_data['TMPC']\n",
    "rh = nysm_data['RELH']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's rearrange the columns into a more logical order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colOrder = ['STID','TIME','TMPC','DWPC','RELH','PRES','DRCT','SPED','GUMS','PRCP','PTOT','PRAT']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data = nysm_data.reindex(columns=colOrder)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split-Apply-Combine"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now go through the *Split-Apply-Combine* methodology, as defined by Wickham, 2011 and illustrated by McKinney, 2017 (see refs at end of notebook)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"https://www.oreilly.com/api/v2/epubs/9781783985128/files/graphics/5128OS_09_01.jpg\" alt=\"SplitApplyCombine\" style=\"width: 350px;\"/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Split: Group by station id"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = nysm_data.groupby(['STID'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This has *split* the DataFrame into a Pandas `groupby` object. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Apply and Combine:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we *apply* a function on the now-split groups, which then *combines* the group and the values returned by the function into a new DataFrame. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One built-in `groupby` function is `describe`. It returns a `DataFrame` containing summary statistics, grouped by station, for each column that contains numerical values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%time df_stns = grouped.describe() # Depending on the size of your DataFrame, this may take a while"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_stns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Analyze DataFrames produced by the S/A/C technique"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>Tip:</b> This is just another Pandas DataFrame, but with a multiple (i.e. <i>hierarchical</i>) column index (\"under the hood\", the Multiindexing we demonstrated in Pandas_06 uses `groupby`), and thus we could subset on selected rows and columns / sub-columns from it.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Five-minute and daily accumulated precip:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df_stns[['PRCP','PTOT']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Temperature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_stns ['TMPC']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What's the maximum 5-minute precip value for the entire NYSM network for that day?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_stns['PRCP']['max'].max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we'll select just a couple of rows. We'll re-set the Pandas' defaults so all rows and columns are displayed as a result of this selection; scroll to the right to see all columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option('display.max_rows', None)\n",
    "pd.set_option('display.max_columns', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_stns.loc[['VOOR','WOLC']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, *split* by station again, *apply* the max/min functions, and *combine* into two DataFrames which contain max and min values for all measured variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_maxes = nysm_data.groupby(['STID']).max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_mins = nysm_data.groupby(['STID']).min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_maxes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>Tip:</b> Assign new names for these two series; else they will both be labeled TMPC!</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT = df_maxes['TMPC'].rename('TMax')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT = df_mins['TMPC'].rename('TMin')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a DataFrame out of these two Series, using Pandas' `concat` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.concat([maxT,minT],axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Group by* date/time instead of station id, and then apply/combine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = nysm_data.groupby(['TIME'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_Maxes = grouped.max()\n",
    "df_Mins = grouped.min()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_Maxes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>Note:</b> This is a bit deceptive since it looks like YORK had the maximum values for all parameters. But that is because the station ID, <b>STID</b> got sorted as well.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We could construct a DataFrame containing just the max/mins for some selected parameters for each time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT = df_Maxes[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>Time to re-sort!:</b> If we do not re-sort the index of this DataFrame, the next cell won't work right!</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT = maxT.sort_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "columnMap = {'TMPC': 'TMax',\n",
    "             'DWPC': 'TdMax',\n",
    "             'SPED': 'SPDMax',\n",
    "             'GUMS': 'GUSTMax',\n",
    "             'PRCP': 'P5MinMax',\n",
    "             'PTOT': 'PDayMax',\n",
    "             'PRAT': 'PRateMax'}\n",
    "maxT.rename(columns=columnMap, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do the same but for minima (obviously, most of the columns will have 0 as a minimum)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT = df_Mins[['TMPC','DWPC','SPED','GUMS','PRCP','PTOT','PRAT']].sort_index()\n",
    "columnMap = {'TMPC': 'TMin',\n",
    "             'DWPC': 'TdMin',\n",
    "             'SPED': 'SPDMin',\n",
    "             'GUMS': 'GUSTMin',\n",
    "             'PRCP': 'P5MinMin',\n",
    "             'PTOT': 'PDayMin',\n",
    "             'PRAT': 'PRateMin'}\n",
    "minT.rename(columns=columnMap, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Note:</b> We don't see a station ID column, but that's because each max and min likely comes from a different station.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "* Pandas' `groupby` method forms the basis of this library's *split-apply-combine* methodology.\n",
    "\n",
    "### What's Next?\n",
    "Next, we will look at `Dataframe`s with multiple indices.\n",
    "## Resources and References\n",
    "1. [MetPy Monday 99](https://www.youtube.com/watch?v=31k53iHE-yw&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=90&ab_channel=Unidata) \n",
    "1. [Split/Apply/Combine (Wickham, 2011)](http://dx.doi.org/10.18637/jss.v040.i01)\n",
    "1. [Ch. 10 notebook from Github Repository for Wes McKinney's Pandas for Data Analysis book](https://github.com/wesm/pydata-book) "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 August 2022 Environment",
   "language": "python",
   "name": "aug22"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}