{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "# Pandas Notebook 7: Merging DataFrames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook combines observations from two separate data sources, the NYS Mesonet (NYSM) and ASOS sites located in New York, into a single DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "\n", "1. Examine the NYSM and ASOS-derived datafiles for a particular day.\n", "1. Remove (*drop*) unwanted columns from a DataFrame.\n", "1. Use a dictionary to rename columns.\n", "1. Merge columns from one DataFrame into another and re-index.\n", "1. Add columns to a DataFrame.\n", "1. Concatenate one DataFrame into another.\n", "1. Analyze the merged DataFrame.\n", "\n", "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Pandas notebooks 1-6 | Necessary | Intro to `dict` |\n", "\n", "* **Time to learn**: 30 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Examine the NYSM and ASOS-derived datafiles for a particular day" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region), along with the NYSM site table so we can extract each station's latitude, longitude, and elevation. Also read in a table of hourly METAR observations from ASOS sites in New York State.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'\n", "nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'\n", "nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data = pd.read_csv(nysm_data_file)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites = pd.read_csv(nysm_sites_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the ASOS data. Note the following differences in this dataset's format as compared to how the NYSM data is structured:\n", "1. Its columns are named differently from those of the NYSM.\n", "1. Its columns are separated by whitespace, not commas.\n", "1. Missing values are denoted by -9999.0, not empty strings.\n", "1. The date/time is formatted differently." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "! head $nyasos_data_file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data = pd.read_csv(nyasos_data_file, delim_whitespace=True, na_values=[-9999.0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the first couple of rows in all three DataFrames."
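A quick aside before inspecting the rows: the `na_values` argument is what turns the ASOS sentinel value into proper missing data. Here is a minimal, self-contained sketch of the same pattern on synthetic data (the column names and values are illustrative, not the real ASOS schema); `sep=r'\s+'` is equivalent to the `delim_whitespace=True` used above.

```python
from io import StringIO

import pandas as pd

# Synthetic sample mimicking the ASOS layout: whitespace-delimited,
# with -9999.0 as the missing-value sentinel (names are illustrative).
sample = StringIO(
    "STN TMPC DWPC\n"
    "ALB 22.5 -9999.0\n"
    "ROC 21.0 18.3\n"
)
df = pd.read_csv(sample, sep=r"\s+", na_values=[-9999.0])
print(df["DWPC"].isna().sum())  # 1 -- the sentinel became NaN
```

Without `na_values`, the -9999.0 rows would silently contaminate any statistics computed on that column.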
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Remove (*drop*) unwanted columns from a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our merged data file will contain only a subset of the columns from each data source. First, let's remove the unwanted columns from the NYSM file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `inplace` so the Dataframe object is updated, and specify that we're referring to columns, not the default axis, which is row indices." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.drop(['temp_9m [degC]','precip_incremental [mm]','precip_local [mm]','precip_max_intensity [mm/min]',\n", " 'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n", " 'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n", " 'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n", " 'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n", " 'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n", " 'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine what columns remain post-drop." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, drop any unwanted columns from the ASOS data file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.drop(['P01M'],axis='columns',inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the columns remaining in the ASOS data file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the columns in the NYSM Sites file" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use a dictionary to rename columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "column_mapping = {'station' : 'STID',\n", " 'time': 'TIME',\n", " 'temp_2m [degC]': 'TMPC',\n", " 'relative_humidity [percent]': 'RELH',\n", " 'precip_incremental [mm]': 'PRCP',\n", " 'precip_local [mm]': 'PTOT',\n", " 'avg_wind_speed_sonic [m/s]': 'SPED',\n", " 'max_wind_speed_sonic [m/s]': 'GUMS',\n", " 'wind_direction_sonic [degrees]': 'DRCT', \n", " 'station_pressure [mbar]': 'PRES',\n", " 'stid': 'STID',\n", " 'name': 'NAME',\n", " 'lat': 'SLAT',\n", " 'lon': 'SLON',\n", " 'elevation': 'SELV',\n", " 'STN': 'STID',\n", " 'YYMMDD/HHMM': 'TIME'}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each of the three Dataframes, rename the columns according to our dictionary. 
Then examine each Dataframe to see how they look." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.rename(columns=column_mapping, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites.rename(columns=column_mapping, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.rename(columns=column_mapping, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Merge columns from one DataFrame into another, and then re-index." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `merge` to add the latitude, longitude, and elevation from the NYSM Sites dataframe to the NYSM Data one.\n", "To do this, we take a subset of the nysm_sites dataframe containing the ones we want to merge, plus the station id column, since `merge` requires a common column for both dataframes. Note also that `merge` does not have an `inplace` option so we just redefine the `nysm_data` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make the NYSM and ASOS dataframes *multi-indexed* according to *site* and *time*. Convert the date/time from string to Datetime objects. 
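As a minimal sketch (with made-up station IDs, times, and temperatures) of what a two-level index buys us:

```python
import pandas as pd

# Toy frame with the same shape of problem: station id plus time
toy = pd.DataFrame({
    "STID": ["ALB", "ALB", "ROC", "ROC"],
    "TIME": ["12:00", "13:00", "12:00", "13:00"],
    "TMPC": [22.5, 23.1, 21.0, 21.4],
}).set_index(["STID", "TIME"])

print(toy.loc["ALB"])                     # every time for one station
print(toy.loc[("ROC", "13:00"), "TMPC"])  # one station at one time -> 21.4
```

Once both levels are set, a single label selects a whole station's time series, while a tuple of labels selects one observation.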
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First set the indexes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.set_index(['STID', 'TIME'], inplace = True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NYSM date/time strings can be automatically parsed into Datetime objects, recognized as in the UTC time zone." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.index.set_levels([nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])],inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ASOS data set needs to have its station id and date/times sorted in order for the multi-indexing to work properly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.sort_values(['STID','TIME'],inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.set_index(['STID', 'TIME'], inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert the date/time strings. Here, we need to specify how the string is formatted." 
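As a quick illustration of how a format string maps onto these timestamps (the sample values below are made up to match the `YYMMDD/HHMM` pattern):

```python
import pandas as pd

# '%y%m%d/%H%M' reads '200804/1715' as 2020-08-04 17:15;
# utc=True attaches the UTC time zone to the result.
ts = pd.to_datetime(["200804/1715", "200804/1800"],
                    format="%y%m%d/%H%M", utc=True)
print(ts[0])  # 2020-08-04 17:15:00+00:00
```

Each directive consumes a fixed piece of the string: `%y` the two-digit year, `%m%d` month and day, then the literal `/`, then `%H%M` for the 24-hour time.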
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data.index.set_levels([nyasos_data.index.levels[0], pd.to_datetime(nyasos_data.index.levels[1],format=\"%y%m%d/%H%M\",utc=True)],inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nyasos_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add columns to a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create columns for dewpoint and sea-level pressure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmpc = nysm_data['TMPC']\n", "rh = nysm_data['RELH']\n", "pres = nysm_data['PRES']\n", "elev = nysm_data['SELV']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sensorHeight = .5\n", "# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf \n", "nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have calculated sea-level pressure, let's drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one." 
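A quick sanity check on the Magnus-type dewpoint expression used above: at 100% relative humidity the dewpoint must equal the temperature, and it must fall below the temperature as the air dries out. The helper function below is just a rewrite of the DWPC line for testing.

```python
import numpy as np

def dewpoint_magnus(tmpc, rh):
    # Same Magnus-type approximation as the DWPC calculation above
    gamma = np.log(rh / 100) + (17.625 * tmpc) / (243.04 + tmpc)
    return 243.04 * gamma / (17.625 - gamma)

print(dewpoint_magnus(20.0, 100.0))  # 20.0: saturated air, dewpoint == temperature
print(dewpoint_magnus(20.0, 50.0))   # ~9.3: drier air, dewpoint < temperature
```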
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data.drop('PRES',inplace=True,axis='columns')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Concatenate one DataFrame into another" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the two dataframes can be merged. We'll use the `concat` method to \"glue\" the rows from the ASOS table on to the NYSM one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data = pd.concat([nysm_data,nyasos_data])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's rearrange the columns into a more logical order." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data = nymerge_data.reindex(columns=colOrder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Note**: Whenever you combine DataFrames, it's a very good idea to re-sort the indices. Although the new DataFrame may \"look\" correct, its indices may not be in what Pandas calls lexicographic (\"lexsorted\") order, which Pandas relies on for efficient label-based selection. We'll use Pandas' `sort_index` method for this purpose.
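To see why, here is a small sketch with synthetic station data: `concat` preserves the input order, so a station that sorts later alphabetically can end up first, and `sort_index` restores lexicographic order.

```python
import pandas as pd

# Two toy station tables; concatenation keeps 'VOOR' rows ahead of 'ALB'
voor = pd.DataFrame({"TMPC": [22.5, 23.0]},
                    index=pd.MultiIndex.from_tuples(
                        [("VOOR", 1), ("VOOR", 2)], names=["STID", "TIME"]))
alb = pd.DataFrame({"TMPC": [19.8, 20.1]},
                   index=pd.MultiIndex.from_tuples(
                       [("ALB", 1), ("ALB", 2)], names=["STID", "TIME"]))

combined = pd.concat([voor, alb])
print(combined.index.is_monotonic_increasing)  # False: 'VOOR' precedes 'ALB'
combined.sort_index(inplace=True)
print(combined.index.is_monotonic_increasing)  # True after sorting
```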
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data.sort_index(ascending=True, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nymerge_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Analyze the merged DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we have what we wanted ... one Dataframe that contains the same variables for both datasets. We can make selections on the multi-index Dataframe as we've done previously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get some general information about the `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select one station" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data.loc['ROC']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select multiple stations and multiple times." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the *cross-section* method to select one index value and then display all values of the other index for selected columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "**Tip**: We could, of course, take this merged DataFrame and create a station plot with it, which would show not only the NYSM data but also the traditional ASOS network of stations!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* Pandas can efficiently combine datasets with different attributes into a single `DataFrame`.\n", "* Pay attention to the row indexing in the resulting merged `DataFrame`.\n", "\n", "### What's Next?\n", "In the next notebook, we will explore Pandas' `groupby` function.\n", "\n", "## Resources and References\n", "1. [MetPy Monday Episode 94](https://www.youtube.com/watch?v=ncpYohRYG3I&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=85&t=4s)\n", "1. [MetPy Monday Episode 97](https://www.youtube.com/watch?v=rj2ZEAIbg1k&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=88&ab_channel=Unidata)\n", "1. [MetPy Monday Episode 98](https://www.youtube.com/watch?v=slUGaLyLJX0&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=89&ab_channel=Unidata)\n", "1. [National Weather Service Automated Surface Observing Systems (ASOS)](https://www.weather.gov/asos/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }