Merging Pandas DataFrames
This notebook combines observations from two separate data sources, the NYS Mesonet and ASOS sites located in or near New York State, into one DataFrame.
Overview
Often, we wish to analyze and visualize datasets from multiple sources. Outwardly, the datasets may seem similar, but they may be formatted differently; they may share some common variables, but differ in others; and/or they may use different physical units.
In this notebook, we will read in two separate sources of surface meteorological observations. One is from the New York State Mesonet; the other is from a worldwide network of surface observation sites, mostly though not always located at airports. These latter observations follow an international standard known as METAR. While some METAR sites are staffed by humans who add additional information to the observations, they all report observations using Automated Surface Observing Systems (ASOS).
Once we read in an example set of hourly observations from both sources, we will work toward outputting a merged Pandas DataFrame. We will need to standardize the differing column names and convert units so everything is metric.
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| Pandas | Necessary | |
| MetPy | Necessary | |
| Datetime | Necessary | |
- Time to learn: 30 minutes
Imports
import pandas as pd
import numpy as np
import metpy.calc as mpcalc
from metpy.units import units
Read in and examine the NYSM and ASOS-derived datafiles for a particular hour
Also read in the NYSM site table so we can extract latitude, longitude, and elevation.
nysm_data_file = '/spare11/atm533/data/nysm_2025090320.csv'
nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'
asos_data_file = '/spare11/atm533/data/asos_2025090320.csv'
nysm_data = pd.read_csv(nysm_data_file)
nysm_sites = pd.read_csv(nysm_sites_file)
Examine the ASOS data, using a JupyterLab feature that allows us to execute Linux commands as if we were typing them in the terminal. Note the following differences in this dataset’s format as compared to how the NYSM data is structured:
- Its columns are named differently from those of the NYSM
- Its columns are separated by whitespace, not commas
- Missing values are denoted by -9999.0, not null strings
- The date/time is formatted differently.
Execute a Linux command by prefacing the next code line with an `!`; prefix the Python variable that corresponds to the file we will open with a `$` so the shell can expand it.
! head $asos_data_file
STN YYMMDD/HHMM SLAT SLON SELV PMSL ALTI TMPC DWPC SKNT DRCT GUST WNUM CHC1 CHC2 CHC3 VSBY P03D P03I MSUN SNOW WEQS P24I TDXC TDNC P03C CTYL CTYM CTYH P06I T6XC T6NC CEIL P01I SNEW
CTCK 250903/2000 45.25 -74.97 85.00 1008.30 -9999.00 27.00 9.00 5.00 240.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWBZ 250903/2000 45.12 -74.28 49.00 1008.90 -9999.00 27.00 13.00 2.00 40.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWFQ 250903/2000 45.05 -72.83 152.00 1010.20 -9999.00 25.00 10.00 3.00 150.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWGH 250903/2000 44.42 -75.85 82.00 1009.30 -9999.00 26.00 12.00 4.00 240.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWGL 250903/2000 44.53 -79.22 221.00 1005.60 -9999.00 23.00 14.00 15.00 180.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWIT 250903/2000 45.17 -73.68 52.00 1009.50 -9999.00 25.00 12.00 4.00 120.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWIZ 250903/2000 45.30 -73.35 45.00 -9999.00 -9999.00 26.00 12.00 4.00 180.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWNC 250903/2000 43.95 -78.17 78.00 1008.30 -9999.00 22.00 16.00 2.00 120.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWPC 250903/2000 42.87 -79.25 184.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 7021.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -2.10 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
Open the file with Pandas; as part of the call to `read_csv`, specify that columns are separated by one or more blank spaces, unlike the default (comma-separated values), and treat any cell containing -9999.0 or -9999.00 as missing.
asos_data = pd.read_csv(asos_data_file, sep='\\s+', na_values=['-9999.0', '-9999.00'])
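As a quick illustration of these two options, here is a minimal in-memory sketch (with made-up station values) showing how `sep='\\s+'` splits columns on runs of whitespace and how `na_values` converts sentinel strings to `NaN`:

```python
import io

import pandas as pd

# Hypothetical two-station sample in the same whitespace-separated layout
text = """STN SLAT TMPC
AAA 42.75 27.00
BBB 44.10 -9999.00
"""

df = pd.read_csv(io.StringIO(text), sep='\\s+', na_values=['-9999.00'])
print(df['TMPC'].isna().tolist())  # → [False, True]
```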
Look at the first couple of rows in all three DataFrames.
nysm_data.head(2)
nysm_sites.head(2)
asos_data.head(2)
Remove (drop) unwanted columns from a DataFrame
Our merged data file will contain only a subset of the columns from each data source. First, let’s remove the unwanted columns from the NYSM file.
Use `inplace=True` so the DataFrame object is updated in place.
nysm_data.drop(columns=['temp_9m [degC]','precip_incremental [mm]', 'precip_local [mm]', 'precip_max_intensity [mm/min]',
'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True)
Examine what columns remain post-drop.
nysm_data.columns
Index(['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
'wind_direction_sonic [degrees]', 'station_pressure [mbar]', 'name'],
dtype='object')
Now, drop any unwanted columns from the ASOS data file. The file has 35 columns, so rather than creating a long list of column names to drop, let’s create a list of only the columns we wish to keep.
asos_keep = ['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'PMSL', 'ALTI', 'SKNT', 'GUST', 'DRCT']
asos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'PMSL', 'ALTI', 'TMPC',
'DWPC', 'SKNT', 'DRCT', 'GUST', 'WNUM', 'CHC1', 'CHC2', 'CHC3', 'VSBY',
'P03D', 'P03I', 'MSUN', 'SNOW', 'WEQS', 'P24I', 'TDXC', 'TDNC', 'P03C',
'CTYL', 'CTYM', 'CTYH', 'P06I', 'T6XC', 'T6NC', 'CEIL', 'P01I', 'SNEW'],
dtype='object')
Redefine the ASOS DataFrame, which will now contain only the desired columns.
asos_data = asos_data[asos_keep]
Examine the columns remaining in the ASOS data file
asos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'PMSL',
'ALTI', 'SKNT', 'GUST', 'DRCT'],
dtype='object')
These columns represent the following:
- Station ID
- Date and Time
- Latitude
- Longitude
- Elevation
- 2-meter temperature
- 2-meter dewpoint
- Sea-level pressure in hPa (not all stations report this)
- Altimeter setting in inches of mercury (not all stations report this)
- Wind speed in knots
- Peak wind gust in knots
- Wind direction in degrees
Use a Python dictionary to rename columns
Each DataFrame has different column names. Let’s standardize by creating a dictionary that maps current column names to common (and in some cases, much shorter) names.
column_mapping = {'station' : 'STID',
'time': 'TIME',
'temp_2m [degC]': 'TMPC',
'relative_humidity [percent]': 'RELH',
'precip_incremental [mm]': 'PRCP',
'precip_local [mm]': 'P24M',
'avg_wind_speed_sonic [m/s]': 'SPED',
'max_wind_speed_sonic [m/s]': 'GUMS',
'wind_direction_sonic [degrees]': 'DRCT',
'station_pressure [mbar]': 'PRES',
'stid': 'STID',
'name': 'NAME',
'lat': 'SLAT',
'lon': 'SLON',
'elevation': 'SELV',
'STN': 'STID',
'YYMMDD/HHMM': 'TIME'}
For each of the three DataFrames, rename the columns according to our dictionary. Then examine each DataFrame to see how it looks.
nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
nysm_sites.rename(columns=column_mapping, inplace=True)
nysm_sites.head()
asos_data.rename(columns=column_mapping, inplace=True)
asos_data.head()
Parse the time strings in the NYSM and ASOS DataFrames into their corresponding `datetime` representations.
nysm_data['TIME'] = pd.to_datetime(nysm_data['TIME'],format="%Y-%m-%d %H:%M:%S UTC", utc=True)
asos_data['TIME'] = pd.to_datetime(asos_data['TIME'],format="%y%m%d/%H%M", utc=True)
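To see that the two format strings line up, here is a small sketch parsing one sample timestamp in each format; both resolve to the same UTC instant:

```python
import pandas as pd

# Sample strings in the NYSM and ASOS formats used above
nysm_time = pd.to_datetime('2025-09-03 20:00:00 UTC',
                           format='%Y-%m-%d %H:%M:%S UTC', utc=True)
asos_time = pd.to_datetime('250903/2000', format='%y%m%d/%H%M', utc=True)
print(nysm_time == asos_time)  # → True
```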
For the ASOS data, convert non-metric variables to metric and then remove the non-metric columns from the DataFrame
asos_data['SPED'] = (asos_data['SKNT'] * units('kts').to('m/s'))
asos_data['GUMS'] = (asos_data['GUST'] * units('kts').to('m/s'))
asos_data['ALTM'] = (asos_data['ALTI'] * units('inHg').to('hPa'))
asos_data.drop(columns=['SKNT', 'GUST', 'ALTI'], inplace=True)
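The same conversions can be checked by hand with plain conversion factors (a sketch with hypothetical values, not using MetPy):

```python
# Approximate conversion factors (1 kt in m/s, 1 inHg in hPa)
KT_TO_MS = 0.514444
INHG_TO_HPA = 33.8639

sknt = 10.0    # hypothetical wind speed in knots
alti = 29.92   # hypothetical altimeter setting in inHg

print(round(sknt * KT_TO_MS, 2))     # ≈ 5.14 m/s
print(round(alti * INHG_TO_HPA, 1))  # ≈ 1013.2 hPa
```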
Examine the ASOS data again
asos_data
Merge columns from one DataFrame into another, and then re-index.
Use `merge` to add the latitude, longitude, and elevation from the NYSM sites DataFrame to the NYSM data one. To do this, we take a subset of the `nysm_sites` DataFrame containing the columns we want to merge, plus the station ID column, since `merge` requires a common column in both DataFrames. Note also that `merge` does not have an `inplace` option, so we simply redefine the `nysm_data` object.
nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])
nysm_data.head()
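Here is a toy sketch (hypothetical station IDs and values) of what `merge` does: with no `on` argument, it joins on the column(s) the two frames share, STID in this case, and the default inner join keeps only rows present in both:

```python
import pandas as pd

data = pd.DataFrame({'STID': ['VOOR', 'TANN'], 'TMPC': [24.0, 22.5]})
sites = pd.DataFrame({'STID': ['VOOR', 'TANN', 'ALB'],
                      'SLAT': [42.75, 42.40, 42.69]})

merged = pd.merge(data, sites)
print(list(merged.columns))  # → ['STID', 'TMPC', 'SLAT']
print(len(merged))           # → 2 (ALB has no observation row)
```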
Make the NYSM and ASOS DataFrames multi-indexed according to site and time.
First set the indexes
nysm_data.set_index(['STID', 'TIME'], inplace = True)
asos_data.set_index(['STID', 'TIME'], inplace = True)
nysm_data.head()
asos_data.head()
Add columns to a DataFrame
The NYSM data contains neither dewpoint nor sea-level pressure yet. Calculate and create columns for both.
# Reduce station pressure to sea-level pressure (SLP).
# Source: https://www.sandhurstweather.org.uk/barometric.pdf
elev = nysm_data['SELV']
sensor_height = 0.5  # barometer height above ground (m)
nysm_data['PMSL'] = nysm_data['PRES'] / np.exp(-(elev + sensor_height) / ((nysm_data['TMPC'] + 273.15) * 29.263))
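As a sanity check on the reduction formula, here is a worked example for a single hypothetical station (100 m elevation, 20 °C, 1000 hPa station pressure):

```python
import numpy as np

pres = 1000.0        # station pressure (hPa)
elev = 100.0         # station elevation (m)
tmpc = 20.0          # 2-m temperature (degC)
sensor_height = 0.5  # barometer height above ground (m)

pmsl = pres / np.exp(-(elev + sensor_height) / ((tmpc + 273.15) * 29.263))
print(round(pmsl, 1))  # a bit under 1012 hPa
```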
We’ll use MetPy to calculate dewpoint from temperature and relative humidity. Since MetPy’s calculation library typically requires units to be attached to variables used in the calculations, do so next.
tmpc = nysm_data['TMPC'].values * units('degC')
rh = nysm_data['RELH'].values * units('percent')
nysm_data['DWPC'] = mpcalc.dewpoint_from_relative_humidity(tmpc, rh)
Now that we have calculated sea-level pressure and dewpoint, let’s drop station pressure and relative humidity from the NYSM DataFrame since they do not appear in the ASOS one.
nysm_data.drop(columns=['PRES', 'RELH'], inplace=True)
nysm_data
Concatenate one DataFrame into another
Now, the two DataFrames can be merged. We’ll use the `concat` function to “glue” the rows from the ASOS table onto the NYSM one.
nymerge_data = pd.concat([nysm_data,asos_data])
nymerge_data
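A toy sketch (made-up values) of what `concat` does with row-wise stacking:

```python
import pandas as pd

a = pd.DataFrame({'TMPC': [24.0, 22.5]}, index=['VOOR', 'TANN'])
b = pd.DataFrame({'TMPC': [25.5]}, index=['ALB'])

both = pd.concat([a, b])
print(len(both))                # → 3
print(both.loc['ALB', 'TMPC'])  # → 25.5
```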
Let’s rearrange the columns into a more logical order, and eliminate any that are not common to both of the original NYSM and ASOS dataframes.
colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','PMSL','DRCT','SPED','GUMS']
nymerge_data = nymerge_data.reindex(columns=colOrder)
Sort the rows by the index; we can use `sort_index` for this purpose.
nymerge_data.sort_index(ascending=True, inplace=True)
nymerge_data
Analyze the merged DataFrame
Now, we have what we wanted ... one DataFrame that contains the same variables ... in the same units ... for both datasets. We can make selections on this multi-indexed DataFrame.
Get some general information about the DataFrame:
nymerge_data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280 entries, ('12N', Timestamp('2025-09-03 20:00:00+0000', tz='UTC')) to ('ZER', Timestamp('2025-09-03 20:00:00+0000', tz='UTC'))
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SLAT 280 non-null float64
1 SLON 280 non-null float64
2 SELV 280 non-null float64
3 TMPC 275 non-null float64
4 DWPC 275 non-null float64
5 PMSL 239 non-null float64
6 DRCT 252 non-null float64
7 SPED 270 non-null float64
8 GUMS 154 non-null float64
dtypes: float64(9)
memory usage: 31.0+ KB
Select one station
nymerge_data.loc['VOOR']
Select multiple stations and one time
nymerge_data.loc[(('ALB','VOOR'),('2025-09-03 20:00:00')),('TMPC','DWPC')]
Use the cross-section method (`xs`) to select one index value and then display all values of the other index for selected columns.
nymerge_data.xs('2025-09-03 20:00:00 UTC',level='TIME')[['TMPC','DWPC','SPED']]
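The behavior of `xs` can be seen on a toy two-level index (hypothetical values): fixing the TIME level returns one row per station, with that level dropped from the result.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['VOOR', 'ALB'], ['2025-09-03 20:00']], names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [24.0, 25.5]}, index=idx)

sel = df.xs('2025-09-03 20:00', level='TIME')
print(list(sel.index))  # → ['VOOR', 'ALB']
```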
Since TIME is one level of the `Index` instead of a mere column, selections based on multiple times or even a time range could easily be done.

Later, we can take this merged `DataFrame` and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!