
Pandas Notebook 7: Merging DataFrames

This notebook combines observations from two separate data sources, the NYS Mesonet (NYSM) and ASOS sites located in New York, into a single DataFrame.

Overview

  1. Examine the NYSM and ASOS-derived datafiles for a particular day.

  2. Remove (drop) unwanted columns from a DataFrame.

  3. Use a dictionary to rename columns.

  4. Merge columns from one DataFrame into another and re-index.

  5. Add columns to a DataFrame.

  6. Concatenate one DataFrame into another.

  7. Analyze the merged DataFrame.

Prerequisites

Concepts               Importance   Notes
Pandas notebooks 1-7   Necessary    Intro to dict

  • Time to learn: 30 minutes


Imports

import pandas as pd
import numpy as np

Examine the NYSM and ASOS-derived datafiles for a particular day

Read in NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region) and also read in the NYSM site table so we can extract latitude, longitude, and elevation. Also read in a table that has hourly METAR observations from ASOS sites in New York State.

nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'
nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'
nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'
nysm_data = pd.read_csv(nysm_data_file)
nysm_sites = pd.read_csv(nysm_sites_file)

Examine the ASOS data. Note the following differences in this dataset's format as compared to how the NYSM data is structured:

  1. Its columns are named differently from those of the NYSM.

  2. Its columns are separated by whitespace, not commas.

  3. Missing values are denoted by -9999.0, not empty strings.

  4. The date/time is formatted differently.

! head $nyasos_data_file
STN    YYMMDD/HHMM      SLAT     SLON     SELV     TMPC     DWPC     RELH     PMSL    SPED     GUMS     DRCT     P01M
ALB    200804/0000     42.75   -73.80    89.00    25.00    15.60    55.91 1016.50     2.06 -9999.00    10.00 -9999.00
ART    200804/0000     44.00   -76.02    99.00    22.80    16.70    68.47 1015.50     0.00 -9999.00     0.00 -9999.00
BGM    200804/0000     42.22   -75.98   497.00    23.90    15.00    57.47 1016.40     2.57 -9999.00   160.00 -9999.00
BUF    200804/0000     42.93   -78.73   215.00    23.30    17.80    71.22 1015.00     3.60 -9999.00   150.00     0.00
DKK    200804/0000     42.50   -79.28   203.00    20.60    18.30    86.66 1015.10     3.09 -9999.00   150.00     0.00
DSV    200804/0000     42.57   -77.72   209.00    21.70    18.90    84.10 1016.30     4.12 -9999.00   330.00     0.25
ELM    200804/0000     42.17   -76.90   291.00    27.80    15.00    45.60 1014.30     2.57 -9999.00   180.00 -9999.00
ELZ    200804/0000     42.11   -77.99   647.00    22.20    15.00    63.70 1016.40     2.06 -9999.00   160.00 -9999.00
FOK    200804/0000     40.85   -72.63    20.00    25.60    22.80    84.53 1018.00     2.57 -9999.00   220.00 -9999.00
nyasos_data = pd.read_csv(nyasos_data_file,delim_whitespace=True,na_values=[-9999.0])
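These quirks map directly onto read_csv arguments. Note that recent pandas versions deprecate delim_whitespace in favor of the equivalent sep=r'\s+'; a minimal sketch on a made-up two-line excerpt modeled on the ASOS file:

```python
import io

import pandas as pd

# Toy excerpt mimicking the ASOS file's format: whitespace-delimited
# columns with -9999.0 as the missing-value sentinel.
text = "STN SLAT GUMS\nALB 42.75 -9999.00\n"
df = pd.read_csv(io.StringIO(text), sep=r'\s+', na_values=[-9999.0])
print(df['GUMS'].isna().all())  # → True
```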

Look at the first couple of rows in all three DataFrames.

nysm_data.head(2)
station time temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
0 ADDI 2020-08-04 00:00:00 UTC 23.1 23.9 63.7 0.0 0.0 0.0 1.6 3.0 ... NaN 0.0 0.0 0.0 22.6 20.3 19.1 0.07 0.2 0.26
1 ADDI 2020-08-04 00:05:00 UTC 23.4 24.1 61.4 0.0 0.0 0.0 2.2 3.1 ... NaN 0.0 0.0 0.0 22.6 20.3 19.2 0.08 0.2 0.26

2 rows × 30 columns

nysm_sites.head(2)
stid number name lat lon elevation county nearest_city state distance_from_town [km] direction_from_town [degrees] climate_division climate_division_name wfo commissioned decommissioned
0 ADDI 107 Addison 42.04036 -77.23726 507.614 Steuben Addison NY 6.9 S 1 Western Plateau BGM 2016-08-10 18:15:00 UTC NaN
1 ANDE 111 Andes 42.18227 -74.80139 518.282 Delaware Andes NY 1.5 WSW 2 Eastern Plateau BGM 2016-08-04 15:55:00 UTC NaN
nyasos_data.head(2)
STN YYMMDD/HHMM SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT P01M
0 ALB 200804/0000 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0 NaN
1 ART 200804/0000 44.00 -76.02 99.0 22.8 16.7 68.47 1015.5 0.00 NaN 0.0 NaN

Remove (drop) unwanted columns from a DataFrame

Our merged data file will contain only a subset of the columns from each data source. First, let’s remove the unwanted columns from the NYSM file.

Use inplace=True so the DataFrame object is updated in place, and specify axis='columns', since drop defaults to the row axis.

nysm_data.drop(['temp_9m [degC]','precip_incremental [mm]','precip_local [mm]','precip_max_intensity [mm/min]',
       'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
       'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
       'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
       'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
       'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
       'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')

Examine what columns remain post-drop.

nysm_data.columns
Index(['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
       'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
       'wind_direction_sonic [degrees]', 'station_pressure [mbar]'],
      dtype='object')

Now, drop any unwanted columns from the ASOS data file.

nyasos_data.drop(['P01M'],axis='columns',inplace=True)

Examine the columns remaining in the ASOS data file

nyasos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'RELH',
       'PMSL', 'SPED', 'GUMS', 'DRCT'],
      dtype='object')

Examine the columns in the NYSM Sites file

nysm_sites.columns
Index(['stid', 'number', 'name', 'lat', 'lon', 'elevation', 'county',
       'nearest_city', 'state', 'distance_from_town [km]',
       'direction_from_town [degrees]', 'climate_division',
       'climate_division_name', 'wfo', 'commissioned', 'decommissioned'],
      dtype='object')

Use a dictionary to rename columns

Each DataFrame uses different column names. Let's standardize them by creating a dictionary that maps the current column names to common (and in some cases much shorter) names.

column_mapping = {'station' : 'STID',
                  'time': 'TIME',
                  'temp_2m [degC]': 'TMPC',
                  'relative_humidity [percent]': 'RELH',
                  'precip_incremental [mm]': 'PRCP',
                  'precip_local [mm]': 'PTOT',
                  'avg_wind_speed_sonic [m/s]': 'SPED',
                  'max_wind_speed_sonic [m/s]': 'GUMS',
                  'wind_direction_sonic [degrees]': 'DRCT', 
                  'station_pressure [mbar]': 'PRES',
                  'stid': 'STID',
                  'name': 'NAME',
                  'lat': 'SLAT',
                  'lon': 'SLON',
                  'elevation': 'SELV',
                  'STN':  'STID',
                  'YYMMDD/HHMM': 'TIME'}
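One property that makes a single shared mapping safe to reuse across all three DataFrames: rename silently ignores dictionary keys that don't appear as columns, and leaves unmapped columns untouched. A toy sketch (the 'extra' and 'no_such_col' names are hypothetical):

```python
import pandas as pd

# rename returns a new DataFrame; keys not present as columns are ignored,
# and columns not present as keys pass through unchanged.
df = pd.DataFrame({'station': ['ALB'], 'temp_2m [degC]': [25.0], 'extra': [1]})
mapping = {'station': 'STID', 'temp_2m [degC]': 'TMPC', 'no_such_col': 'X'}
renamed = df.rename(columns=mapping)
print(list(renamed.columns))  # → ['STID', 'TMPC', 'extra']
```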

For each of the three DataFrames, rename the columns according to our dictionary. Then examine each DataFrame to see how it looks.

nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
STID TIME TMPC RELH SPED GUMS DRCT PRES
0 ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00
1 ADDI 2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03
2 ADDI 2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98
3 ADDI 2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01
4 ADDI 2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97
nysm_sites.rename(columns=column_mapping, inplace=True)
nysm_sites.head()
STID number NAME SLAT SLON SELV county nearest_city state distance_from_town [km] direction_from_town [degrees] climate_division climate_division_name wfo commissioned decommissioned
0 ADDI 107 Addison 42.04036 -77.23726 507.6140 Steuben Addison NY 6.9 S 1 Western Plateau BGM 2016-08-10 18:15:00 UTC NaN
1 ANDE 111 Andes 42.18227 -74.80139 518.2820 Delaware Andes NY 1.5 WSW 2 Eastern Plateau BGM 2016-08-04 15:55:00 UTC NaN
2 BATA 24 Batavia 43.01994 -78.13566 276.1200 Genesee Batavia NY 4.9 ENE 9 Great Lakes BUF 2016-02-18 18:40:00 UTC NaN
3 BEAC 76 Beacon 41.52875 -73.94527 90.1598 Dutchess Beacon NY 3.3 NE 5 Hudson Valley ALY 2016-08-22 16:45:00 UTC NaN
4 BELD 90 Belden 42.22322 -75.66852 470.3700 Broome Belden NY 2.2 NNE 2 Eastern Plateau BGM 2015-11-30 20:20:00 UTC NaN
nyasos_data.rename(columns=column_mapping, inplace=True)
nyasos_data.head()
STID TIME SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT
0 ALB 200804/0000 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0
1 ART 200804/0000 44.00 -76.02 99.0 22.8 16.7 68.47 1015.5 0.00 NaN 0.0
2 BGM 200804/0000 42.22 -75.98 497.0 23.9 15.0 57.47 1016.4 2.57 NaN 160.0
3 BUF 200804/0000 42.93 -78.73 215.0 23.3 17.8 71.22 1015.0 3.60 NaN 150.0
4 DKK 200804/0000 42.50 -79.28 203.0 20.6 18.3 86.66 1015.1 3.09 NaN 150.0

Merge columns from one DataFrame into another, and then re-index.

Use merge to add the latitude, longitude, and elevation from the NYSM sites DataFrame to the NYSM data one. To do this, we take a subset of the nysm_sites DataFrame containing the columns we want to merge, plus the station ID column, since merge requires a column common to both DataFrames. Note also that merge does not have an inplace option, so we simply redefine the nysm_data object.

nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])
nysm_data.head()
STID TIME TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
0 ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
1 ADDI 2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2 ADDI 2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
3 ADDI 2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
4 ADDI 2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614
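By default, merge joins on all columns the two DataFrames share and performs an inner join; the same call with the defaults spelled out looks like this (a sketch on toy data with hypothetical values):

```python
import pandas as pd

# Two toy frames whose only shared column is STID.
obs = pd.DataFrame({'STID': ['ADDI', 'ADDI', 'VOOR'], 'TMPC': [23.1, 23.4, 22.0]})
sites = pd.DataFrame({'STID': ['ADDI', 'VOOR'], 'SLAT': [42.04, 42.65]})

# Equivalent to pd.merge(obs, sites): join key and join type made explicit.
# Each obs row picks up the site metadata for its station.
merged = pd.merge(obs, sites, on='STID', how='inner')
print(merged.shape)  # → (3, 3)
```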

Make the NYSM and ASOS DataFrames multi-indexed by site and time, and convert the date/times from strings to datetime objects.

First, set the indexes.

nysm_data.set_index(['STID', 'TIME'], inplace = True)
nysm_data.head()
TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
STID TIME
ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614

The NYSM date/time strings can be parsed automatically into datetime objects and recognized as being in the UTC time zone.

nysm_data.index = nysm_data.index.set_levels([nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])])
nysm_data.head()
TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
2020-08-04 00:05:00+00:00 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2020-08-04 00:10:00+00:00 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
2020-08-04 00:15:00+00:00 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
2020-08-04 00:20:00+00:00 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614

The ASOS dataset needs to have its station IDs and date/times sorted in order for the multi-indexing to work properly.

nyasos_data.sort_values(['STID','TIME'],inplace=True)
nyasos_data.set_index(['STID', 'TIME'], inplace = True)

Convert the date/time strings. Here, we need to specify how the string is formatted.

nyasos_data.index = nyasos_data.index.set_levels([nyasos_data.index.levels[0], pd.to_datetime(nyasos_data.index.levels[1], format="%y%m%d/%H%M", utc=True)])
nyasos_data
SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT
STID TIME
ALB 2020-08-04 00:00:00+00:00 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0
2020-08-04 01:00:00+00:00 42.75 -73.80 89.0 23.9 16.1 61.66 1016.6 1.54 NaN 60.0
2020-08-04 02:00:00+00:00 42.75 -73.80 89.0 21.7 16.7 73.22 1017.0 0.00 NaN 0.0
2020-08-04 03:00:00+00:00 42.75 -73.80 89.0 21.7 16.7 73.22 1017.0 0.00 NaN 0.0
2020-08-04 04:00:00+00:00 42.75 -73.80 89.0 22.2 16.7 71.01 1017.0 0.00 NaN 0.0
... ... ... ... ... ... ... ... ... ... ... ...
SYR 2020-08-04 19:00:00+00:00 43.12 -76.12 124.0 21.1 19.4 90.02 1008.2 4.12 NaN 70.0
2020-08-04 20:00:00+00:00 43.12 -76.12 124.0 21.1 19.4 90.02 1006.5 2.57 NaN 60.0
2020-08-04 21:00:00+00:00 43.12 -76.12 124.0 21.1 18.9 87.26 1006.1 3.09 NaN 50.0
2020-08-04 22:00:00+00:00 43.12 -76.12 124.0 22.8 18.9 78.65 1006.5 2.06 NaN 270.0
2020-08-04 23:00:00+00:00 43.12 -76.12 124.0 22.8 18.9 78.65 1006.9 1.54 NaN 280.0

837 rows × 10 columns
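The format string mirrors the compact YYMMDD/HHMM layout of the ASOS timestamps; a one-off check on a single value:

```python
import pandas as pd

# %y = 2-digit year, %m%d = month/day, %H%M = hour/minute after the slash;
# utc=True attaches the UTC time zone to the result.
ts = pd.to_datetime('200804/0000', format='%y%m%d/%H%M', utc=True)
print(ts)  # → 2020-08-04 00:00:00+00:00
```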

Add columns to a DataFrame

The NYSM data does not yet contain dewpoint or sea-level pressure. Calculate each and add them as new columns.

tmpc = nysm_data['TMPC']
rh = nysm_data['RELH']
pres = nysm_data['PRES']
elev = nysm_data['SELV']
sensorHeight = .5
# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf 
nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))
# Compute dewpoint from temperature and RH via a Magnus-type approximation
nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))
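One quick sanity check on the dewpoint expression above (a Magnus-type approximation, here factored into a helper function for readability): at 100% relative humidity the formula reduces exactly to the air temperature, and drier air always yields a lower dewpoint.

```python
import numpy as np

def dewpoint(tmpc, rh):
    # Same Magnus-type formula as above: gamma collects the shared
    # log(RH/100) + b*T/(c+T) term, with b = 17.625 and c = 243.04.
    gamma = np.log(rh / 100) + (17.625 * tmpc) / (243.04 + tmpc)
    return 243.04 * gamma / (17.625 - gamma)

print(round(dewpoint(20.0, 100.0), 6))  # → 20.0 (saturated: dewpoint = temperature)
print(dewpoint(20.0, 50.0) < 20.0)      # → True (drier air, lower dewpoint)
```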

Now that we have calculated sea-level pressure, let’s drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one.

nysm_data.drop('PRES',inplace=True,axis='columns')
nysm_data
TMPC RELH SPED GUMS DRCT SLAT SLON SELV PMSL DWPC
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.7 1.8 3.0 175.0 42.04036 -77.23726 507.614 1014.767745 15.846049
2020-08-04 00:05:00+00:00 23.4 61.4 2.4 3.6 169.0 42.04036 -77.23726 507.614 1014.739386 15.554540
2020-08-04 00:10:00+00:00 23.1 62.2 2.4 3.7 167.0 42.04036 -77.23726 507.614 1014.746538 15.473742
2020-08-04 00:15:00+00:00 23.2 62.6 2.6 4.3 170.0 42.04036 -77.23726 507.614 1014.758278 15.668212
2020-08-04 00:20:00+00:00 23.0 62.8 2.0 3.7 172.0 42.04036 -77.23726 507.614 1014.756017 15.529141
... ... ... ... ... ... ... ... ... ... ... ...
YORK 2020-08-04 23:35:00+00:00 21.7 89.7 1.3 2.3 246.0 42.85504 -77.84776 177.942 1009.391100 19.933345
2020-08-04 23:40:00+00:00 21.7 90.0 1.0 1.6 237.0 42.85504 -77.84776 177.942 1009.452354 19.987260
2020-08-04 23:45:00+00:00 21.7 90.9 1.2 2.2 222.0 42.85504 -77.84776 177.942 1009.574861 20.148065
2020-08-04 23:50:00+00:00 21.8 90.8 1.1 2.1 231.0 42.85504 -77.84776 177.942 1009.639244 20.229084
2020-08-04 23:55:00+00:00 21.7 90.0 0.8 1.9 217.0 42.85504 -77.84776 177.942 1009.687160 19.987260

36288 rows × 10 columns

Concatenate one DataFrame into another

Now the two DataFrames can be combined. We'll use the concat function to "glue" the rows of the ASOS table onto the NYSM one.

nymerge_data = pd.concat([nysm_data,nyasos_data])
nymerge_data
TMPC RELH SPED GUMS DRCT SLAT SLON SELV PMSL DWPC
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.70 1.80 3.0 175.0 42.04036 -77.23726 507.614 1014.767745 15.846049
2020-08-04 00:05:00+00:00 23.4 61.40 2.40 3.6 169.0 42.04036 -77.23726 507.614 1014.739386 15.554540
2020-08-04 00:10:00+00:00 23.1 62.20 2.40 3.7 167.0 42.04036 -77.23726 507.614 1014.746538 15.473742
2020-08-04 00:15:00+00:00 23.2 62.60 2.60 4.3 170.0 42.04036 -77.23726 507.614 1014.758278 15.668212
2020-08-04 00:20:00+00:00 23.0 62.80 2.00 3.7 172.0 42.04036 -77.23726 507.614 1014.756017 15.529141
... ... ... ... ... ... ... ... ... ... ... ...
SYR 2020-08-04 19:00:00+00:00 21.1 90.02 4.12 NaN 70.0 43.12000 -76.12000 124.000 1008.200000 19.400000
2020-08-04 20:00:00+00:00 21.1 90.02 2.57 NaN 60.0 43.12000 -76.12000 124.000 1006.500000 19.400000
2020-08-04 21:00:00+00:00 21.1 87.26 3.09 NaN 50.0 43.12000 -76.12000 124.000 1006.100000 18.900000
2020-08-04 22:00:00+00:00 22.8 78.65 2.06 NaN 270.0 43.12000 -76.12000 124.000 1006.500000 18.900000
2020-08-04 23:00:00+00:00 22.8 78.65 1.54 NaN 280.0 43.12000 -76.12000 124.000 1006.900000 18.900000

37125 rows × 10 columns

Let’s rearrange the columns into a more logical order.

colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']
nymerge_data = nymerge_data.reindex(columns=colOrder)

Note: Whenever you combine DataFrames, it's a very good idea to re-sort the indices. Although the new DataFrame may "look" correct, its indices may not be in what Pandas calls lexical order. We'll use Pandas' sort_index for this purpose.

nymerge_data.sort_index(ascending=True, inplace=True)
nymerge_data
SLAT SLON SELV TMPC DWPC RELH PMSL DRCT SPED GUMS
STID TIME
ADDI 2020-08-04 00:00:00+00:00 42.04036 -77.23726 507.614 23.1 15.846049 63.7 1014.767745 175.0 1.8 3.0
2020-08-04 00:05:00+00:00 42.04036 -77.23726 507.614 23.4 15.554540 61.4 1014.739386 169.0 2.4 3.6
2020-08-04 00:10:00+00:00 42.04036 -77.23726 507.614 23.1 15.473742 62.2 1014.746538 167.0 2.4 3.7
2020-08-04 00:15:00+00:00 42.04036 -77.23726 507.614 23.2 15.668212 62.6 1014.758278 170.0 2.6 4.3
2020-08-04 00:20:00+00:00 42.04036 -77.23726 507.614 23.0 15.529141 62.8 1014.756017 172.0 2.0 3.7
... ... ... ... ... ... ... ... ... ... ... ...
YORK 2020-08-04 23:35:00+00:00 42.85504 -77.84776 177.942 21.7 19.933345 89.7 1009.391100 246.0 1.3 2.3
2020-08-04 23:40:00+00:00 42.85504 -77.84776 177.942 21.7 19.987260 90.0 1009.452354 237.0 1.0 1.6
2020-08-04 23:45:00+00:00 42.85504 -77.84776 177.942 21.7 20.148065 90.9 1009.574861 222.0 1.2 2.2
2020-08-04 23:50:00+00:00 42.85504 -77.84776 177.942 21.8 20.229084 90.8 1009.639244 231.0 1.1 2.1
2020-08-04 23:55:00+00:00 42.85504 -77.84776 177.942 21.7 19.987260 90.0 1009.687160 217.0 0.8 1.9

37125 rows × 10 columns
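A quick way to confirm that a MultiIndex is in sorted order is its is_monotonic_increasing attribute; a minimal sketch on a toy two-row frame (hypothetical values):

```python
import pandas as pd

# After a concat, the combined MultiIndex is generally not in lexical order;
# is_monotonic_increasing reports whether a sort_index is still needed.
idx = pd.MultiIndex.from_tuples([('VOOR', 1), ('ADDI', 2)], names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [22.0, 23.1]}, index=idx)
print(df.index.is_monotonic_increasing)  # → False
df = df.sort_index()
print(df.index.is_monotonic_increasing)  # → True
```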

Analyze the merged DataFrame

Now we have what we wanted: one DataFrame that contains the same variables for both datasets. We can make selections on the multi-indexed DataFrame as we've done previously.

Get some general information about the DataFrame:

nymerge_data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37125 entries, ('ADDI', Timestamp('2020-08-04 00:00:00+0000', tz='UTC')) to ('YORK', Timestamp('2020-08-04 23:55:00+0000', tz='UTC'))
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SLAT    37125 non-null  float64
 1   SLON    37125 non-null  float64
 2   SELV    37125 non-null  float64
 3   TMPC    37125 non-null  float64
 4   DWPC    36141 non-null  float64
 5   RELH    36141 non-null  float64
 6   PMSL    37033 non-null  float64
 7   DRCT    37079 non-null  float64
 8   SPED    37123 non-null  float64
 9   GUMS    36415 non-null  float64
dtypes: float64(10)
memory usage: 3.0+ MB

Select one station

nymerge_data.loc['ROC']
SLAT SLON SELV TMPC DWPC RELH PMSL DRCT SPED GUMS
TIME
2020-08-04 00:00:00+00:00 43.12 -77.67 169.0 21.1 20.6 96.97 1015.7 70.0 3.60 NaN
2020-08-04 01:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1015.5 50.0 2.06 NaN
2020-08-04 02:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1016.0 90.0 4.12 NaN
2020-08-04 03:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1015.9 240.0 1.54 NaN
2020-08-04 04:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1015.4 100.0 1.54 NaN
2020-08-04 05:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1014.7 130.0 2.06 NaN
2020-08-04 06:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1014.1 230.0 1.54 NaN
2020-08-04 07:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1013.4 0.0 0.00 NaN
2020-08-04 08:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1013.3 0.0 0.00 NaN
2020-08-04 09:00:00+00:00 43.12 -77.67 169.0 19.4 19.4 100.00 1012.8 0.0 0.00 NaN
2020-08-04 10:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1012.6 0.0 0.00 NaN
2020-08-04 11:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1012.4 280.0 1.54 NaN
2020-08-04 12:00:00+00:00 43.12 -77.67 169.0 20.6 19.4 92.83 1012.0 0.0 0.00 NaN
2020-08-04 13:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1011.9 0.0 0.00 NaN
2020-08-04 14:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1011.7 0.0 0.00 NaN
2020-08-04 15:00:00+00:00 43.12 -77.67 169.0 21.7 18.9 84.10 1011.4 0.0 0.00 NaN
2020-08-04 16:00:00+00:00 43.12 -77.67 169.0 21.7 20.0 90.06 1010.9 170.0 2.57 NaN
2020-08-04 17:00:00+00:00 43.12 -77.67 169.0 21.7 20.6 93.47 1010.3 150.0 2.06 NaN
2020-08-04 18:00:00+00:00 43.12 -77.67 169.0 22.2 20.0 87.35 1009.6 0.0 0.00 NaN
2020-08-04 19:00:00+00:00 43.12 -77.67 169.0 23.3 20.0 81.71 1008.8 10.0 2.06 NaN
2020-08-04 20:00:00+00:00 43.12 -77.67 169.0 25.0 20.0 73.78 1007.5 20.0 4.12 NaN
2020-08-04 21:00:00+00:00 43.12 -77.67 169.0 21.7 21.1 96.39 1008.3 320.0 3.60 NaN
2020-08-04 22:00:00+00:00 43.12 -77.67 169.0 22.2 20.6 90.66 1008.7 340.0 3.09 NaN
2020-08-04 23:00:00+00:00 43.12 -77.67 169.0 22.8 19.4 81.14 1009.0 320.0 3.09 NaN

Select multiple stations and multiple times.

nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]
TMPC DWPC
STID TIME
ALB 2020-08-04 18:00:00+00:00 19.4 19.400000
VOOR 2020-08-04 17:45:00+00:00 20.2 19.988491
2020-08-04 18:00:00+00:00 20.3 20.071947

Use the cross-section method, xs, to select one value from a single index level and then display all values of the other level for selected columns.

nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]
TMPC DWPC RELH
STID
ADDI 20.0 19.425792 96.50
ALB 19.4 19.400000 100.00
ANDE 19.0 18.578391 97.40
ART 21.1 20.000000 93.44
BATA 23.2 19.517692 79.80
... ... ... ...
WFMB 17.6 NaN NaN
WGAT 19.3 18.844511 97.20
WHIT 21.0 20.588584 97.50
WOLC 21.1 NaN NaN
YORK 22.8 20.315945 85.90

161 rows × 3 columns
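An equivalent cross-section can also be taken with loc and pd.IndexSlice; the difference is that xs drops the selected level while IndexSlice keeps it. A sketch on toy data (hypothetical station IDs and time strings):

```python
import pandas as pd

# Toy two-level frame; the index must be sorted for IndexSlice slicing.
idx = pd.MultiIndex.from_tuples(
    [('ALB', '18:00'), ('ALB', '19:00'), ('ROC', '18:00')],
    names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [19.4, 20.0, 22.2]}, index=idx).sort_index()

cross = df.xs('18:00', level='TIME')           # index is STID only
sliced = df.loc[pd.IndexSlice[:, '18:00'], :]  # index keeps both levels
print(cross.shape, sliced.shape)  # → (2, 1) (2, 1)
```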

Tip: We could, of course, take this merged DataFrame and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!

Summary

  • Pandas can efficiently combine datasets with different attributes into a single DataFrame.

  • Pay attention to the row indexing in the resulting merged DataFrame.

What’s Next?

In the next notebook, we will explore Pandas' groupby method.