\n",
"\n",
"# Pandas Notebook 7: Merging DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook will combine into one dataframe observations from two separate data sources: the NYS Mesonet, and ASOS sites located in New York. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"\n",
"1. Examine the NYSM and ASOS-derived datafiles for a particular day.\n",
"1. Remove (*drop*) unwanted columns from a DataFrame.\n",
"1. Use a dictionary to rename columns.\n",
"1. Merge columns from one DataFrame into another and re-index.\n",
"1. Add columns to a DataFrame\n",
"1. Concatenate one DataFrame into another.\n",
"1. Analyze the merged DataFrame.\n",
"\n",
"## Prerequisites\n",
"\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| Pandas notebooks 1-7 | Necessary | Intro to `dict` |\n",
"\n",
"* **Time to learn**: 30 minutes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Examine the NYSM and ASOS-derived datafiles for a particular day"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region) and also read in the NYSM site table so we can extract latitude, longitude, and elevation. Also read in a table that has hourly METAR observations from ASOS sites in New York State. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'\n",
"nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'\n",
"nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data = pd.read_csv(nysm_data_file)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites = pd.read_csv(nysm_sites_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the ASOS data. Note the following differences in this dataset's format as compared to how the NYSM data is structured:\n",
"1. Its columns are named differently from those of the NYSM\n",
"1. Its columns are separated using whitespace, not tabs\n",
"1. Missing values are denoted by -9999.0, not null strings\n",
"1. The date/time is formatted differently."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! head $nyasos_data_file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data = pd.read_csv(nyasos_data_file,delim_whitespace=True,na_values=[-9999.0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Look at the first couple of rows in all three DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Remove (*drop*) unwanted columns from a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our merged data file will contain only a subset of the columns from each data source. First, let's remove the unwanted columns from the NYSM file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `inplace` so the Dataframe object is updated, and specify that we're referring to columns, not the default axis, which is row indices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.drop(['temp_9m [degC]','precip_incremental [mm]','precip_local [mm]','precip_max_intensity [mm/min]',\n",
" 'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',\n",
" 'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',\n",
" 'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',\n",
" 'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',\n",
" 'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',\n",
" 'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine what columns remain post-drop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, drop any unwanted columns from the ASOS data file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.drop(['P01M'],axis='columns',inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the columns remaining in the ASOS data file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Examine the columns in the NYSM Sites file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Use a dictionary to rename columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each dataframe has varying column names. Let's standardize by creating a `dictionary` that will map current column names to common (and in some cases, much shorter) names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"column_mapping = {'station' : 'STID',\n",
" 'time': 'TIME',\n",
" 'temp_2m [degC]': 'TMPC',\n",
" 'relative_humidity [percent]': 'RELH',\n",
" 'precip_incremental [mm]': 'PRCP',\n",
" 'precip_local [mm]': 'PTOT',\n",
" 'avg_wind_speed_sonic [m/s]': 'SPED',\n",
" 'max_wind_speed_sonic [m/s]': 'GUMS',\n",
" 'wind_direction_sonic [degrees]': 'DRCT', \n",
" 'station_pressure [mbar]': 'PRES',\n",
" 'stid': 'STID',\n",
" 'name': 'NAME',\n",
" 'lat': 'SLAT',\n",
" 'lon': 'SLON',\n",
" 'elevation': 'SELV',\n",
" 'STN': 'STID',\n",
" 'YYMMDD/HHMM': 'TIME'}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For each of the three Dataframes, rename the columns according to our dictionary. Then examine each Dataframe to see how they look."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.rename(columns=column_mapping, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites.rename(columns=column_mapping, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.rename(columns=column_mapping, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Merge columns from one DataFrame into another, and then re-index."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `merge` to add the latitude, longitude, and elevation from the NYSM Sites dataframe to the NYSM Data one.\n",
"To do this, we take a subset of the nysm_sites dataframe containing the ones we want to merge, plus the station id column, since `merge` requires a common column for both dataframes. Note also that `merge` does not have an `inplace` option so we just redefine the `nysm_data` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])"
]
},
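{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of what `merge` is doing here (a toy example with hypothetical station values): when no `on=` column is given, `merge` performs an inner join on the column(s) the two frames share."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example (hypothetical values): merge joins on the shared STID column,\n",
"# so each observation row gains its site's latitude.\n",
"obs_toy = pd.DataFrame({'STID': ['VOOR', 'TANN'], 'TMPC': [21.5, 19.8]})\n",
"sites_toy = pd.DataFrame({'STID': ['VOOR', 'TANN'], 'SLAT': [42.65, 42.57]})\n",
"pd.merge(obs_toy, sites_toy)"
]
},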
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Make the NYSM and ASOS dataframes *multi-indexed* according to *site* and *time*. Convert the date/time from string to Datetime objects. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First set the indexes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.set_index(['STID', 'TIME'], inplace = True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NYSM date/time strings can be automatically parsed into Datetime objects, recognized as in the UTC time zone."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.index.set_levels([nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])],inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ASOS data set needs to have its station id and date/times sorted in order for the multi-indexing to work properly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.sort_values(['STID','TIME'],inplace=True)"
]
},
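{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal toy sketch (hypothetical values) of why the sort matters: once the rows are sorted and then multi-indexed, the index is monotonic (lexsorted), and label-based lookups and slices behave reliably."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example (hypothetical values): sort first, then set the MultiIndex.\n",
"toy = pd.DataFrame({'STID': ['B', 'A', 'B', 'A'],\n",
"                    'TIME': [2, 1, 1, 2],\n",
"                    'VAL': [10, 20, 30, 40]})\n",
"toy = toy.sort_values(['STID', 'TIME']).set_index(['STID', 'TIME'])\n",
"toy.index.is_monotonic_increasing"
]
},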
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.set_index(['STID', 'TIME'], inplace = True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert the date/time strings. Here, we need to specify how the string is formatted."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data.index.set_levels([nyasos_data.index.levels[0], pd.to_datetime(nyasos_data.index.levels[1],format=\"%y%m%d/%H%M\",utc=True)],inplace=True)"
]
},
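{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how the format string maps onto the ASOS date/times, here is a single hypothetical timestamp parsed with the same codes: `%y` is the 2-digit year, `%m%d` the month and day, `/` a literal character, and `%H%M` the hour and minute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical ASOS-style timestamp string, parsed with the same format codes\n",
"pd.to_datetime('200804/1745', format='%y%m%d/%H%M', utc=True)"
]
},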
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nyasos_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add columns to a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NYSM data does not contain dewpoint nor sea-level pressure yet. Calculate and create columns for dewpoint and sea-level pressure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tmpc = nysm_data['TMPC']\n",
"rh = nysm_data['RELH']\n",
"pres = nysm_data['PRES']\n",
"elev = nysm_data['SELV']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sensorHeight = .5\n",
"# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf \n",
"nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))"
]
},
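{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check on the dewpoint formula above (hypothetical values): at 100% relative humidity, log(rh/100) = log(1) = 0, so the expression reduces to the temperature itself."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (hypothetical values): at rh = 100%, dewpoint should equal temperature\n",
"t_chk, rh_chk = 20.0, 100.0\n",
"dwpc_chk = 243.04*(np.log(rh_chk/100)+((17.625*t_chk)/(243.04+t_chk)))/(17.625-np.log(rh_chk/100)-((17.625*t_chk)/(243.04+t_chk)))\n",
"dwpc_chk"
]
},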
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have calculated sea-level pressure, let's drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data.drop('PRES',inplace=True,axis='columns')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Concatenate one DataFrame into another"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the two dataframes can be merged. We'll use the `concat` method to \"glue\" the rows from the ASOS table on to the NYSM one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data = pd.concat([nysm_data,nyasos_data])"
]
},
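{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal toy sketch (hypothetical values) of what `concat` does here: rows are stacked, and any column absent from one frame is filled with NaN in that frame's rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example (hypothetical values): b lacks a PRCP column, so its row gets NaN there\n",
"a = pd.DataFrame({'TMPC': [20.0], 'PRCP': [0.2]}, index=['VOOR'])\n",
"b = pd.DataFrame({'TMPC': [19.0]}, index=['ALB'])\n",
"pd.concat([a, b])"
]
},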
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's rearrange the columns into a more logical order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data = nymerge_data.reindex(columns=colOrder)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"Note: Whenever you combine DataFrames, it's a very good idea to re-sort the indices. Although the new DataFrame may \"look\" correct, its indices may not be in what Pandas calls lexical order. We'll use Pandas sort_index for this purpose.
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data.sort_index(ascending=True, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nymerge_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyze the merged DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we have what we wanted ... one Dataframe that contains the same variables for both datasets. We can make selections on the multi-index Dataframe as we've done previously."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get some general information about the `DataFrame`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select one station"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data.loc['ROC']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Select multiple stations and multiple times."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the *cross-section* method to select one index value and then display all values of the other index for selected columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Tip: We could, of course, take this merged DataFrame and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Summary\n",
"* Pandas can efficiently combine datasets with different attributes into a single `DataFrame`.\n",
"* Pay attention to the row indexing in the resulting merged `DataFrame`.\n",
"\n",
"### What's Next?\n",
"In the next notebook, we will explore Pandas' `groupby` function.\n",
"\n",
"## Resources and References\n",
"1. [MetPy Monday Episode 94](https://www.youtube.com/watch?v=ncpYohRYG3I&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=85&t=4s)\n",
"1. [MetPy Monday Episode 97](https://www.youtube.com/watch?v=rj2ZEAIbg1k&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=88&ab_channel=Unidata)\n",
"1. [MetPy Monday Episode 98](https://www.youtube.com/watch?v=slUGaLyLJX0&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=89&ab_channel=Unidata)\n",
"1. [National Weather Service Automated Surface Observing Systems (ASOS)](https://www.weather.gov/asos/)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 August 2022 Environment",
"language": "python",
"name": "aug22"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}