Merging Pandas DataFrames
This notebook combines observations from two separate data sources, the NYS Mesonet and ASOS sites located in or near New York State, into one DataFrame.
Overview
Often, we wish to analyze and visualize datasets from multiple sources. Outwardly, the datasets may seem similar, but they may be formatted differently; they may share some common variables, but differ in others; and/or they may use different physical units.
In this notebook, we will read in two separate sources of surface meteorological observations. One is from the New York State Mesonet; the other is from a worldwide network of surface observation sites, mostly though not always located at airports. These latter observations follow an international standard known as METAR. While some METAR sites are staffed by humans who add additional information to the observations, they all report observations using Automated Surface Observing Systems (ASOS).
Once we read in an example set of hourly observations from both sources, we will work toward outputting a merged Pandas DataFrame. We will need to standardize the differing column names and convert units so everything is metric.
Prerequisites
| Concepts | Importance | Notes |
| --- | --- | --- |
| Pandas | Necessary | |
| MetPy | Necessary | |
| Datetime | Necessary | |
- Time to learn: 30 minutes
Imports
import pandas as pd
import numpy as np
import metpy.calc as mpcalc
from metpy.units import units
Read in and examine the NYSM and ASOS-derived datafiles for a particular hour
Also read in the NYSM site table so we can extract latitude, longitude, and elevation.
nysm_data_file = '/spare11/atm533/data/nysm_2025090320.csv'
nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'
asos_data_file = '/spare11/atm533/data/asos_2025090320.csv'
nysm_data = pd.read_csv(nysm_data_file)
nysm_sites = pd.read_csv(nysm_sites_file)
Examine the ASOS data, using a JupyterLab feature that allows us to execute Linux commands as if we were typing them in the terminal. Note the following differences in this dataset’s format as compared to how the NYSM data is structured:
- Its columns are named differently from those of the NYSM
- Its columns are separated by whitespace, not commas
- Missing values are denoted by -9999.0, not null strings
- The date/time is formatted differently.
Execute a Linux command by prefacing the next code line with an `!`; prefix the Python variable that corresponds to the file we will open with a `$` so the shell can expand it.
! head $asos_data_file
STN YYMMDD/HHMM SLAT SLON SELV PMSL ALTI TMPC DWPC SKNT DRCT GUST WNUM CHC1 CHC2 CHC3 VSBY P03D P03I MSUN SNOW WEQS P24I TDXC TDNC P03C CTYL CTYM CTYH P06I T6XC T6NC CEIL P01I SNEW
CTCK 250903/2000 45.25 -74.97 85.00 1008.30 -9999.00 27.00 9.00 5.00 240.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWBZ 250903/2000 45.12 -74.28 49.00 1008.90 -9999.00 27.00 13.00 2.00 40.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWFQ 250903/2000 45.05 -72.83 152.00 1010.20 -9999.00 25.00 10.00 3.00 150.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWGH 250903/2000 44.42 -75.85 82.00 1009.30 -9999.00 26.00 12.00 4.00 240.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWGL 250903/2000 44.53 -79.22 221.00 1005.60 -9999.00 23.00 14.00 15.00 180.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWIT 250903/2000 45.17 -73.68 52.00 1009.50 -9999.00 25.00 12.00 4.00 120.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWIZ 250903/2000 45.30 -73.35 45.00 -9999.00 -9999.00 26.00 12.00 4.00 180.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWNC 250903/2000 43.95 -78.17 78.00 1008.30 -9999.00 22.00 16.00 2.00 120.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
CWPC 250903/2000 42.87 -79.25 184.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 7021.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -2.10 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00 -9999.00
Open the file with Pandas; as part of the call to `read_csv`, specify that columns are separated by one or more blank spaces, unlike the default (comma-separated values), and treat any cell containing -9999.0 or -9999.00 as missing.
asos_data = pd.read_csv(asos_data_file, sep='\\s+', na_values=['-9999.0', '-9999.00'])
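As a quick illustration of these two options, here is a minimal in-memory sketch (with made-up station values) showing how `sep='\\s+'` splits columns on runs of whitespace and how `na_values` converts sentinel strings to `NaN`:

```python
import io

import pandas as pd

# Hypothetical two-station sample in the same whitespace-separated layout
text = """STN SLAT TMPC
AAA 42.75 27.00
BBB 44.10 -9999.00
"""

df = pd.read_csv(io.StringIO(text), sep='\\s+', na_values=['-9999.00'])
print(df['TMPC'].isna().tolist())  # → [False, True]
```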
Look at the first couple of rows in all three DataFrames.
nysm_data.head(2)
nysm_sites.head(2)
asos_data.head(2)
Remove (drop) unwanted columns from a DataFrame
Our merged data file will contain only a subset of the columns from each data source. First, let’s remove the unwanted columns from the NYSM file.
Use `inplace=True` so the DataFrame object is updated in place.
nysm_data.drop(columns=['temp_9m [degC]','precip_incremental [mm]', 'precip_local [mm]', 'precip_max_intensity [mm/min]',
'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True)
Examine what columns remain post-drop.
nysm_data.columns
Index(['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
'wind_direction_sonic [degrees]', 'station_pressure [mbar]', 'name'],
dtype='object')
Now, drop any unwanted columns from the ASOS data file. The file has 35 columns, so rather than creating a long list of column names to drop, let’s create a list of only the columns we wish to keep.
asos_keep = ['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'PMSL', 'ALTI', 'SKNT', 'GUST', 'DRCT']
asos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'PMSL', 'ALTI', 'TMPC',
'DWPC', 'SKNT', 'DRCT', 'GUST', 'WNUM', 'CHC1', 'CHC2', 'CHC3', 'VSBY',
'P03D', 'P03I', 'MSUN', 'SNOW', 'WEQS', 'P24I', 'TDXC', 'TDNC', 'P03C',
'CTYL', 'CTYM', 'CTYH', 'P06I', 'T6XC', 'T6NC', 'CEIL', 'P01I', 'SNEW'],
dtype='object')
Redefine the ASOS DataFrame, which will now contain only the desired columns.
asos_data = asos_data[asos_keep]
Examine the columns remaining in the ASOS data file
asos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'PMSL',
'ALTI', 'SKNT', 'GUST', 'DRCT'],
dtype='object')
These columns represent the following:
- Station ID
- Date and Time
- Latitude
- Longitude
- Elevation
- 2-meter temperature
- 2-meter dewpoint
- Sea-level pressure in hPa (not all stations report this)
- Altimeter setting in inches of mercury (not all stations report this)
- Wind speed in knots
- Peak wind gust in knots
- Wind direction in degrees
Use a Python dictionary to rename columns
Each DataFrame has different column names. Let’s standardize by creating a dictionary that maps current column names to common (and in some cases, much shorter) names.
column_mapping = {'station' : 'STID',
'time': 'TIME',
'temp_2m [degC]': 'TMPC',
'relative_humidity [percent]': 'RELH',
'precip_incremental [mm]': 'PRCP',
'precip_local [mm]': 'P24M',
'avg_wind_speed_sonic [m/s]': 'SPED',
'max_wind_speed_sonic [m/s]': 'GUMS',
'wind_direction_sonic [degrees]': 'DRCT',
'station_pressure [mbar]': 'PRES',
'stid': 'STID',
'name': 'NAME',
'lat': 'SLAT',
'lon': 'SLON',
'elevation': 'SELV',
'STN': 'STID',
'YYMMDD/HHMM': 'TIME'}
For each of the three DataFrames, rename the columns according to our dictionary. Then examine each DataFrame to see how it looks.
nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
nysm_sites.rename(columns=column_mapping, inplace=True)
nysm_sites.head()
asos_data.rename(columns=column_mapping, inplace=True)
asos_data.head()
Parse the time strings in the NYSM and ASOS DataFrames into their corresponding `datetime` representations.
nysm_data['TIME'] = pd.to_datetime(nysm_data['TIME'],format="%Y-%m-%d %H:%M:%S UTC", utc=True)
asos_data['TIME'] = pd.to_datetime(asos_data['TIME'],format="%y%m%d/%H%M", utc=True)
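To see that the two format strings line up, here is a small sketch parsing one sample timestamp in each format; both resolve to the same UTC instant:

```python
import pandas as pd

# Sample strings in the NYSM and ASOS formats used above
nysm_time = pd.to_datetime('2025-09-03 20:00:00 UTC',
                           format='%Y-%m-%d %H:%M:%S UTC', utc=True)
asos_time = pd.to_datetime('250903/2000', format='%y%m%d/%H%M', utc=True)
print(nysm_time == asos_time)  # → True
```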
For the ASOS data, convert non-metric variables to metric and then remove the non-metric columns from the DataFrame
asos_data['SPED'] = (asos_data['SKNT'] * units('kts').to('m/s'))
asos_data['GUMS'] = (asos_data['GUST'] * units('kts').to('m/s'))
asos_data['ALTM'] = (asos_data['ALTI'] * units('inHg').to('hPa'))
asos_data.drop(columns=['SKNT', 'GUST', 'ALTI'], inplace=True)
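The same conversions can be checked by hand with plain conversion factors (a sketch with hypothetical values, not using MetPy):

```python
# Approximate conversion factors (1 kt in m/s, 1 inHg in hPa)
KT_TO_MS = 0.514444
INHG_TO_HPA = 33.8639

sknt = 10.0    # hypothetical wind speed in knots
alti = 29.92   # hypothetical altimeter setting in inHg

print(round(sknt * KT_TO_MS, 2))     # ≈ 5.14 m/s
print(round(alti * INHG_TO_HPA, 1))  # ≈ 1013.2 hPa
```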
Examine the ASOS data again
asos_data
Merge columns from one DataFrame into another, and then re-index.
Use `merge` to add the latitude, longitude, and elevation from the NYSM sites DataFrame to the NYSM data one. To do this, we take a subset of the `nysm_sites` DataFrame containing the columns we want to merge, plus the station ID column, since `merge` requires a common column in both DataFrames. Note also that `merge` does not have an `inplace` option, so we simply redefine the `nysm_data` object.
nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])
nysm_data.head()
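Here is a toy sketch (hypothetical station IDs and values) of what `merge` does: with no `on` argument, it joins on the column(s) the two frames share, STID in this case, and the default inner join keeps only rows present in both:

```python
import pandas as pd

data = pd.DataFrame({'STID': ['VOOR', 'TANN'], 'TMPC': [24.0, 22.5]})
sites = pd.DataFrame({'STID': ['VOOR', 'TANN', 'ALB'],
                      'SLAT': [42.75, 42.40, 42.69]})

merged = pd.merge(data, sites)
print(list(merged.columns))  # → ['STID', 'TMPC', 'SLAT']
print(len(merged))           # → 2 (ALB has no observation row)
```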
Make the NYSM and ASOS DataFrames multi-indexed according to site and time.
First set the indexes
nysm_data.set_index(['STID', 'TIME'], inplace = True)
asos_data.set_index(['STID', 'TIME'], inplace = True)
nysm_data.head()
asos_data.head()
Add columns to a DataFrame
The NYSM data contains neither dewpoint nor sea-level pressure yet. Calculate and create columns for both.
# Reduce station pressure to sea-level pressure (SLP).
# Source: https://www.sandhurstweather.org.uk/barometric.pdf
elev = nysm_data['SELV']
sensor_height = 0.5  # barometer height above ground (m)
nysm_data['PMSL'] = nysm_data['PRES'] / np.exp(-(elev + sensor_height) / ((nysm_data['TMPC'] + 273.15) * 29.263))
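As a sanity check on the reduction formula, here is a worked example for a single hypothetical station (100 m elevation, 20 °C, 1000 hPa station pressure):

```python
import numpy as np

pres = 1000.0        # station pressure (hPa)
elev = 100.0         # station elevation (m)
tmpc = 20.0          # 2-m temperature (degC)
sensor_height = 0.5  # barometer height above ground (m)

pmsl = pres / np.exp(-(elev + sensor_height) / ((tmpc + 273.15) * 29.263))
print(round(pmsl, 1))  # a bit under 1012 hPa
```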
We’ll use MetPy to calculate dewpoint from temperature and relative humidity. Since MetPy’s calculation library typically requires units to be attached to variables used in the calculations, do so next.
tmpc = nysm_data['TMPC'].values * units('degC')
rh = nysm_data['RELH'].values * units('percent')
nysm_data['DWPC'] = mpcalc.dewpoint_from_relative_humidity(tmpc, rh)
Now that we have calculated sea-level pressure and dewpoint, let’s drop station pressure and relative humidity from the NYSM DataFrame since they do not appear in the ASOS one.
nysm_data.drop(columns=['PRES', 'RELH'], inplace=True)
nysm_data
Concatenate one DataFrame into another
Now, the two DataFrames can be merged. We’ll use the `concat` function to “glue” the rows from the ASOS table onto the NYSM one.
nymerge_data = pd.concat([nysm_data,asos_data])
nymerge_data
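A toy sketch (made-up values) of what `concat` does with row-wise stacking:

```python
import pandas as pd

a = pd.DataFrame({'TMPC': [24.0, 22.5]}, index=['VOOR', 'TANN'])
b = pd.DataFrame({'TMPC': [25.5]}, index=['ALB'])

both = pd.concat([a, b])
print(len(both))                # → 3
print(both.loc['ALB', 'TMPC'])  # → 25.5
```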
Let’s rearrange the columns into a more logical order, and eliminate any that are not common to both of the original NYSM and ASOS dataframes.
colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','PMSL','DRCT','SPED','GUMS']
nymerge_data = nymerge_data.reindex(columns=colOrder)
Sort the rows by the index; we can use `sort_index` for this purpose.
nymerge_data.sort_index(ascending=True, inplace=True)
nymerge_data
Analyze the merged DataFrame
Now, we have what we wanted ... one DataFrame that contains the same variables ... in the same units ... for both datasets. We can make selections on this multi-indexed DataFrame.
Get some general information about the DataFrame:
nymerge_data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 280 entries, ('12N', Timestamp('2025-09-03 20:00:00+0000', tz='UTC')) to ('ZER', Timestamp('2025-09-03 20:00:00+0000', tz='UTC'))
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SLAT 280 non-null float64
1 SLON 280 non-null float64
2 SELV 280 non-null float64
3 TMPC 275 non-null float64
4 DWPC 275 non-null float64
5 PMSL 239 non-null float64
6 DRCT 252 non-null float64
7 SPED 270 non-null float64
8 GUMS 154 non-null float64
dtypes: float64(9)
memory usage: 31.0+ KB
Select one station
nymerge_data.loc['VOOR']
Select multiple stations and one time
nymerge_data.loc[(('ALB','VOOR'),('2025-09-03 20:00:00')),('TMPC','DWPC')]
Use the cross-section method (`xs`) to select one index value and then display all values of the other index for selected columns.
nymerge_data.xs('2025-09-03 20:00:00 UTC',level='TIME')[['TMPC','DWPC','SPED']]
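The behavior of `xs` can be seen on a toy two-level index (hypothetical values): fixing the TIME level returns one row per station, with that level dropped from the result.

```python
import pandas as pd

idx = pd.MultiIndex.from_product(
    [['VOOR', 'ALB'], ['2025-09-03 20:00']], names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [24.0, 25.5]}, index=idx)

sel = df.xs('2025-09-03 20:00', level='TIME')
print(list(sel.index))  # → ['VOOR', 'ALB']
```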
Since TIME is one level of the `Index` instead of a mere column, selections based on multiple times or even a time range could easily be done.

Later, we can take this merged `DataFrame` and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!