{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg\" alt=\"pandas Logo\" style=\"width: 800px;\"/></center>\n",
    "\n",
    "# Pandas Notebook 7: Merging DataFrames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook will combine into one dataframe observations from two separate data sources: the NYS Mesonet, and ASOS sites located in New York. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "1. Examine the NYSM and ASOS-derived datafiles for a particular day.\n",
    "1. Remove (*drop*) unwanted columns from a DataFrame.\n",
    "1. Use a dictionary to rename columns.\n",
    "1. Merge columns from one DataFrame into another and re-index.\n",
    "1. Add columns to a DataFrame\n",
    "1. Concatenate one DataFrame into another.\n",
    "1. Analyze the merged DataFrame.\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
    "| Pandas notebooks 1-7 | Necessary | Intro to `dict` |\n",
    "\n",
    "* **Time to learn**: 30 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Examine the NYSM and ASOS-derived datafiles for a particular day"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region) and also read in the NYSM site table so we can extract latitude, longitude, and elevation. Also read in a table that has hourly METAR observations from ASOS sites in New York State. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'\n",
    "nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'\n",
    "nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data = pd.read_csv(nysm_data_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites = pd.read_csv(nysm_sites_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the ASOS data. Note the following differences in this dataset's format as compared to how the NYSM data is structured:\n",
    "1. Its columns are named differently from those of the NYSM\n",
    "1. Its columns are separated using whitespace, not tabs\n",
    "1. Missing values are denoted by -9999.0, not null strings\n",
    "1. The date/time is formatted differently."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! head $nyasos_data_file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data = pd.read_csv(nyasos_data_file,delim_whitespace=True,na_values=[-9999.0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at the first couple of rows in all three DataFrames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Remove (*drop*) unwanted columns from a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our merged data file will contain only a subset of the columns from each data source. First, let's remove the unwanted columns from the NYSM file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `inplace` so the Dataframe object is updated, and specify that we're referring to columns, not the default axis, which is row indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.drop(['temp_9m [degC]','precip_incremental [mm]','precip_local [mm]','precip_max_intensity [mm/min]',\n",
    "       'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n",
    "       'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n",
    "       'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n",
    "       'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n",
    "       'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n",
    "       'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine what columns remain post-drop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, drop any unwanted columns from the ASOS data file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.drop(['P01M'],axis='columns',inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the columns remaining in the ASOS data file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Examine the columns in the NYSM Sites file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Use a dictionary to rename columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "column_mapping = {'station' : 'STID',\n",
    "                  'time': 'TIME',\n",
    "                  'temp_2m [degC]': 'TMPC',\n",
    "                  'relative_humidity [percent]': 'RELH',\n",
    "                  'precip_incremental [mm]': 'PRCP',\n",
    "                  'precip_local [mm]': 'PTOT',\n",
    "                  'avg_wind_speed_sonic [m/s]': 'SPED',\n",
    "                  'max_wind_speed_sonic [m/s]': 'GUMS',\n",
    "                  'wind_direction_sonic [degrees]': 'DRCT', \n",
    "                  'station_pressure [mbar]': 'PRES',\n",
    "                  'stid': 'STID',\n",
    "                  'name': 'NAME',\n",
    "                  'lat': 'SLAT',\n",
    "                  'lon': 'SLON',\n",
    "                  'elevation': 'SELV',\n",
    "                  'STN':  'STID',\n",
    "                  'YYMMDD/HHMM': 'TIME'}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each of the three Dataframes, rename the columns according to our dictionary. Then examine each Dataframe to see how they look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.rename(columns=column_mapping, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites.rename(columns=column_mapping, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.rename(columns=column_mapping, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Merge columns from one DataFrame into another, and then re-index.</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `merge` to add the latitude, longitude, and elevation from the NYSM Sites dataframe to the NYSM Data one.\n",
    "To do this, we take a subset of the nysm_sites dataframe containing the ones we want to merge, plus the station id column, since `merge` requires a common column for both dataframes. Note also that `merge` does not have an `inplace` option so we just redefine the `nysm_data` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Make the NYSM and ASOS dataframes *multi-indexed* according to *site* and *time*. Convert the date/time from string to Datetime objects. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First set the indexes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.set_index(['STID', 'TIME'], inplace = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NYSM date/time strings can be automatically parsed into Datetime objects, recognized as in the UTC time zone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.index.set_levels([nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])],inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ASOS data set needs to have its station id and date/times sorted in order for the multi-indexing to work properly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.sort_values(['STID','TIME'],inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.set_index(['STID', 'TIME'], inplace = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert the date/time strings. Here, we need to specify how the string is formatted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data.index.set_levels([nyasos_data.index.levels[0], pd.to_datetime(nyasos_data.index.levels[1],format=\"%y%m%d/%H%M\",utc=True)],inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nyasos_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add columns to a DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create columns for dewpoint and sea-level pressure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tmpc = nysm_data['TMPC']\n",
    "rh = nysm_data['RELH']\n",
    "pres = nysm_data['PRES']\n",
    "elev = nysm_data['SELV']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sensorHeight = .5\n",
    "# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf \n",
    "nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have calculated sea-level pressure, let's drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data.drop('PRES',inplace=True,axis='columns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concatenate one DataFrame into another"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, the two dataframes can be merged. We'll use the `concat` method to \"glue\" the rows from the ASOS table on to the NYSM one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data = pd.concat([nysm_data,nyasos_data])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's rearrange the columns into a more logical order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data = nymerge_data.reindex(columns=colOrder)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-warning\">\n",
    "<b>Note:</b> Whenever you combine DataFrames, it's a very good idea to re-sort the indices. Although the new DataFrame may \"look\" correct, its indices may not be in what Pandas calls <b>lexical order</b>. We'll use Pandas <code>sort_index</code> for this purpose.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data.sort_index(ascending=True, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "nymerge_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Analyze the merged DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we have what we wanted ... one Dataframe that contains the same variables for both datasets. We can make selections on the multi-index Dataframe as we've done previously."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get some general information about the `DataFrame`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select one station"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data.loc['ROC']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select multiple stations and multiple times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the *cross-section* method to select one index value and then display all values of the other index for selected columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "    <b>Tip:</b> We could, of course, take this merged <code>DataFrame</code> and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "* Pandas can efficiently combine datasets with different attributes into a single `DataFrame`.\n",
    "* Pay attention to the row indexing in the resulting merged `DataFrame`.\n",
    "\n",
    "### What's Next?\n",
    "In the next notebook, we will explore Pandas' `groupby` function.\n",
    "\n",
    "## Resources and References\n",
    "1. [MetPy Monday Episode 94](https://www.youtube.com/watch?v=ncpYohRYG3I&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=85&t=4s)\n",
    "1. [MetPy Monday Episode 97](https://www.youtube.com/watch?v=rj2ZEAIbg1k&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=88&ab_channel=Unidata)\n",
    "1. [MetPy Monday Episode 98](https://www.youtube.com/watch?v=slUGaLyLJX0&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=89&ab_channel=Unidata)\n",
    "1. [National Weather Service Automated Surface Observing Systems (ASOS)](https://www.weather.gov/asos/)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 August 2022 Environment",
   "language": "python",
   "name": "aug22"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}