
Pandas Notebook 7: Merging DataFrames

This notebook combines observations from two separate data sources, the NYS Mesonet (NYSM) and ASOS sites located in New York, into a single DataFrame.

Overview

  1. Examine the NYSM and ASOS-derived datafiles for a particular day.

  2. Remove (drop) unwanted columns from a DataFrame.

  3. Use a dictionary to rename columns.

  4. Merge columns from one DataFrame into another and re-index.

  5. Add columns to a DataFrame.

  6. Concatenate one DataFrame into another.

  7. Analyze the merged DataFrame.

Prerequisites

Concepts               Importance   Notes
Pandas notebooks 1-7   Necessary    Intro to dict

  • Time to learn: 30 minutes


Imports

import pandas as pd
import numpy as np

Examine the NYSM and ASOS-derived datafiles for a particular day

Read in NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region) and also read in the NYSM site table so we can extract latitude, longitude, and elevation. Also read in a table that has hourly METAR observations from ASOS sites in New York State.

nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'
nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'
nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'
nysm_data = pd.read_csv(nysm_data_file)
nysm_sites = pd.read_csv(nysm_sites_file)

Examine the ASOS data. Note the following differences in this dataset's format as compared to how the NYSM data is structured:

  1. Its columns are named differently from those of the NYSM.

  2. Its columns are separated by whitespace, not commas.

  3. Missing values are denoted by -9999.0, not empty strings.

  4. The date/time is formatted differently.

! head $nyasos_data_file
STN    YYMMDD/HHMM      SLAT     SLON     SELV     TMPC     DWPC     RELH     PMSL    SPED     GUMS     DRCT     P01M
ALB    200804/0000     42.75   -73.80    89.00    25.00    15.60    55.91 1016.50     2.06 -9999.00    10.00 -9999.00
ART    200804/0000     44.00   -76.02    99.00    22.80    16.70    68.47 1015.50     0.00 -9999.00     0.00 -9999.00
BGM    200804/0000     42.22   -75.98   497.00    23.90    15.00    57.47 1016.40     2.57 -9999.00   160.00 -9999.00
BUF    200804/0000     42.93   -78.73   215.00    23.30    17.80    71.22 1015.00     3.60 -9999.00   150.00     0.00
DKK    200804/0000     42.50   -79.28   203.00    20.60    18.30    86.66 1015.10     3.09 -9999.00   150.00     0.00
DSV    200804/0000     42.57   -77.72   209.00    21.70    18.90    84.10 1016.30     4.12 -9999.00   330.00     0.25
ELM    200804/0000     42.17   -76.90   291.00    27.80    15.00    45.60 1014.30     2.57 -9999.00   180.00 -9999.00
ELZ    200804/0000     42.11   -77.99   647.00    22.20    15.00    63.70 1016.40     2.06 -9999.00   160.00 -9999.00
FOK    200804/0000     40.85   -72.63    20.00    25.60    22.80    84.53 1018.00     2.57 -9999.00   220.00 -9999.00
nyasos_data = pd.read_csv(nyasos_data_file,delim_whitespace=True,na_values=[-9999.0])
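These quirks map directly onto read_csv arguments. Note that recent pandas versions deprecate delim_whitespace in favor of the equivalent sep=r'\s+'; a minimal sketch on a made-up two-line excerpt modeled on the ASOS file:

```python
import io

import pandas as pd

# Toy excerpt mimicking the ASOS file's format: whitespace-delimited
# columns with -9999.0 as the missing-value sentinel.
text = "STN SLAT GUMS\nALB 42.75 -9999.00\n"
df = pd.read_csv(io.StringIO(text), sep=r'\s+', na_values=[-9999.0])
print(df['GUMS'].isna().all())  # → True
```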

Look at the first couple of rows in all three DataFrames.

nysm_data.head(2)
station time temp_2m [degC] temp_9m [degC] relative_humidity [percent] precip_incremental [mm] precip_local [mm] precip_max_intensity [mm/min] avg_wind_speed_prop [m/s] max_wind_speed_prop [m/s] ... snow_depth [cm] frozen_soil_05cm [bit] frozen_soil_25cm [bit] frozen_soil_50cm [bit] soil_temp_05cm [degC] soil_temp_25cm [degC] soil_temp_50cm [degC] soil_moisture_05cm [m^3/m^3] soil_moisture_25cm [m^3/m^3] soil_moisture_50cm [m^3/m^3]
0 ADDI 2020-08-04 00:00:00 UTC 23.1 23.9 63.7 0.0 0.0 0.0 1.6 3.0 ... NaN 0.0 0.0 0.0 22.6 20.3 19.1 0.07 0.2 0.26
1 ADDI 2020-08-04 00:05:00 UTC 23.4 24.1 61.4 0.0 0.0 0.0 2.2 3.1 ... NaN 0.0 0.0 0.0 22.6 20.3 19.2 0.08 0.2 0.26

2 rows × 30 columns

nysm_sites.head(2)
stid number name lat lon elevation county nearest_city state distance_from_town [km] direction_from_town [degrees] climate_division climate_division_name wfo commissioned decommissioned
0 ADDI 107 Addison 42.04036 -77.23726 507.614 Steuben Addison NY 6.9 S 1 Western Plateau BGM 2016-08-10 18:15:00 UTC NaN
1 ANDE 111 Andes 42.18227 -74.80139 518.282 Delaware Andes NY 1.5 WSW 2 Eastern Plateau BGM 2016-08-04 15:55:00 UTC NaN
nyasos_data.head(2)
STN YYMMDD/HHMM SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT P01M
0 ALB 200804/0000 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0 NaN
1 ART 200804/0000 44.00 -76.02 99.0 22.8 16.7 68.47 1015.5 0.00 NaN 0.0 NaN

Remove (drop) unwanted columns from a DataFrame

Our merged data file will contain only a subset of the columns from each data source. First, let’s remove the unwanted columns from the NYSM file.

Use inplace=True so the DataFrame object is updated in place, and specify axis='columns', since drop defaults to the row axis.

nysm_data.drop(['temp_9m [degC]','precip_incremental [mm]','precip_local [mm]','precip_max_intensity [mm/min]',
       'avg_wind_speed_prop [m/s]', 'max_wind_speed_prop [m/s]','wind_speed_stddev_prop [m/s]', 'wind_direction_prop [degrees]',
       'wind_direction_stddev_prop [degrees]','wind_speed_stddev_sonic [m/s]','wind_direction_stddev_sonic [degrees]', 'solar_insolation [W/m^2]','snow_depth [cm]', 'frozen_soil_05cm [bit]',
       'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
       'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]',
       'soil_temp_50cm [degC]', 'soil_moisture_05cm [m^3/m^3]',
       'soil_moisture_25cm [m^3/m^3]', 'soil_moisture_50cm [m^3/m^3]'],inplace=True,axis='columns')

Examine what columns remain post-drop.

nysm_data.columns
Index(['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
       'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
       'wind_direction_sonic [degrees]', 'station_pressure [mbar]'],
      dtype='object')

Now, drop any unwanted columns from the ASOS data file.

nyasos_data.drop(['P01M'],axis='columns',inplace=True)

Examine the columns remaining in the ASOS data file

nyasos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'RELH',
       'PMSL', 'SPED', 'GUMS', 'DRCT'],
      dtype='object')

Examine the columns in the NYSM Sites file

nysm_sites.columns
Index(['stid', 'number', 'name', 'lat', 'lon', 'elevation', 'county',
       'nearest_city', 'state', 'distance_from_town [km]',
       'direction_from_town [degrees]', 'climate_division',
       'climate_division_name', 'wfo', 'commissioned', 'decommissioned'],
      dtype='object')

Use a dictionary to rename columns

Each DataFrame uses different column names. Let's standardize them by creating a dictionary that maps the current column names to common (and in some cases much shorter) names.

column_mapping = {'station' : 'STID',
                  'time': 'TIME',
                  'temp_2m [degC]': 'TMPC',
                  'relative_humidity [percent]': 'RELH',
                  'precip_incremental [mm]': 'PRCP',
                  'precip_local [mm]': 'PTOT',
                  'avg_wind_speed_sonic [m/s]': 'SPED',
                  'max_wind_speed_sonic [m/s]': 'GUMS',
                  'wind_direction_sonic [degrees]': 'DRCT', 
                  'station_pressure [mbar]': 'PRES',
                  'stid': 'STID',
                  'name': 'NAME',
                  'lat': 'SLAT',
                  'lon': 'SLON',
                  'elevation': 'SELV',
                  'STN':  'STID',
                  'YYMMDD/HHMM': 'TIME'}
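One property that makes a single shared mapping safe to reuse across all three DataFrames: rename silently ignores dictionary keys that don't appear as columns, and leaves unmapped columns untouched. A toy sketch (the 'extra' and 'no_such_col' names are hypothetical):

```python
import pandas as pd

# rename returns a new DataFrame; keys not present as columns are ignored,
# and columns not present as keys pass through unchanged.
df = pd.DataFrame({'station': ['ALB'], 'temp_2m [degC]': [25.0], 'extra': [1]})
mapping = {'station': 'STID', 'temp_2m [degC]': 'TMPC', 'no_such_col': 'X'}
renamed = df.rename(columns=mapping)
print(list(renamed.columns))  # → ['STID', 'TMPC', 'extra']
```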

For each of the three DataFrames, rename the columns according to our dictionary. Then examine each DataFrame to see how it looks.

nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
STID TIME TMPC RELH SPED GUMS DRCT PRES
0 ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00
1 ADDI 2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03
2 ADDI 2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98
3 ADDI 2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01
4 ADDI 2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97
nysm_sites.rename(columns=column_mapping, inplace=True)
nysm_sites.head()
STID number NAME SLAT SLON SELV county nearest_city state distance_from_town [km] direction_from_town [degrees] climate_division climate_division_name wfo commissioned decommissioned
0 ADDI 107 Addison 42.04036 -77.23726 507.6140 Steuben Addison NY 6.9 S 1 Western Plateau BGM 2016-08-10 18:15:00 UTC NaN
1 ANDE 111 Andes 42.18227 -74.80139 518.2820 Delaware Andes NY 1.5 WSW 2 Eastern Plateau BGM 2016-08-04 15:55:00 UTC NaN
2 BATA 24 Batavia 43.01994 -78.13566 276.1200 Genesee Batavia NY 4.9 ENE 9 Great Lakes BUF 2016-02-18 18:40:00 UTC NaN
3 BEAC 76 Beacon 41.52875 -73.94527 90.1598 Dutchess Beacon NY 3.3 NE 5 Hudson Valley ALY 2016-08-22 16:45:00 UTC NaN
4 BELD 90 Belden 42.22322 -75.66852 470.3700 Broome Belden NY 2.2 NNE 2 Eastern Plateau BGM 2015-11-30 20:20:00 UTC NaN
nyasos_data.rename(columns=column_mapping, inplace=True)
nyasos_data.head()
STID TIME SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT
0 ALB 200804/0000 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0
1 ART 200804/0000 44.00 -76.02 99.0 22.8 16.7 68.47 1015.5 0.00 NaN 0.0
2 BGM 200804/0000 42.22 -75.98 497.0 23.9 15.0 57.47 1016.4 2.57 NaN 160.0
3 BUF 200804/0000 42.93 -78.73 215.0 23.3 17.8 71.22 1015.0 3.60 NaN 150.0
4 DKK 200804/0000 42.50 -79.28 203.0 20.6 18.3 86.66 1015.1 3.09 NaN 150.0

Merge columns from one DataFrame into another, and then re-index.

Use merge to add the latitude, longitude, and elevation from the NYSM sites DataFrame to the NYSM data one. To do this, we take a subset of the nysm_sites DataFrame containing the columns we want to merge, plus the station ID column, since merge requires a column common to both DataFrames. Note also that merge does not have an inplace option, so we simply redefine the nysm_data object.

nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])
nysm_data.head()
STID TIME TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
0 ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
1 ADDI 2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2 ADDI 2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
3 ADDI 2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
4 ADDI 2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614
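By default, merge joins on all columns the two DataFrames share and performs an inner join; the same call with the defaults spelled out looks like this (a sketch on toy data with hypothetical values):

```python
import pandas as pd

# Two toy frames whose only shared column is STID.
obs = pd.DataFrame({'STID': ['ADDI', 'ADDI', 'VOOR'], 'TMPC': [23.1, 23.4, 22.0]})
sites = pd.DataFrame({'STID': ['ADDI', 'VOOR'], 'SLAT': [42.04, 42.65]})

# Equivalent to pd.merge(obs, sites): join key and join type made explicit.
# Each obs row picks up the site metadata for its station.
merged = pd.merge(obs, sites, on='STID', how='inner')
print(merged.shape)  # → (3, 3)
```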

Make the NYSM and ASOS DataFrames multi-indexed by site and time, and convert the date/times from strings to datetime objects.

First, set the indexes.

nysm_data.set_index(['STID', 'TIME'], inplace = True)
nysm_data.head()
TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
STID TIME
ADDI 2020-08-04 00:00:00 UTC 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
2020-08-04 00:05:00 UTC 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2020-08-04 00:10:00 UTC 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
2020-08-04 00:15:00 UTC 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
2020-08-04 00:20:00 UTC 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614

The NYSM date/time strings can be parsed automatically into datetime objects and recognized as being in the UTC time zone.

nysm_data.index = nysm_data.index.set_levels([nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])])
nysm_data.head()
TMPC RELH SPED GUMS DRCT PRES SLAT SLON SELV
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.7 1.8 3.0 175.0 957.00 42.04036 -77.23726 507.614
2020-08-04 00:05:00+00:00 23.4 61.4 2.4 3.6 169.0 957.03 42.04036 -77.23726 507.614
2020-08-04 00:10:00+00:00 23.1 62.2 2.4 3.7 167.0 956.98 42.04036 -77.23726 507.614
2020-08-04 00:15:00+00:00 23.2 62.6 2.6 4.3 170.0 957.01 42.04036 -77.23726 507.614
2020-08-04 00:20:00+00:00 23.0 62.8 2.0 3.7 172.0 956.97 42.04036 -77.23726 507.614

The ASOS dataset needs to have its station IDs and date/times sorted in order for the multi-indexing to work properly.

nyasos_data.sort_values(['STID','TIME'],inplace=True)
nyasos_data.set_index(['STID', 'TIME'], inplace = True)

Convert the date/time strings. Here, we need to specify how the string is formatted.

nyasos_data.index = nyasos_data.index.set_levels([nyasos_data.index.levels[0], pd.to_datetime(nyasos_data.index.levels[1], format="%y%m%d/%H%M", utc=True)])
nyasos_data
SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT
STID TIME
ALB 2020-08-04 00:00:00+00:00 42.75 -73.80 89.0 25.0 15.6 55.91 1016.5 2.06 NaN 10.0
2020-08-04 01:00:00+00:00 42.75 -73.80 89.0 23.9 16.1 61.66 1016.6 1.54 NaN 60.0
2020-08-04 02:00:00+00:00 42.75 -73.80 89.0 21.7 16.7 73.22 1017.0 0.00 NaN 0.0
2020-08-04 03:00:00+00:00 42.75 -73.80 89.0 21.7 16.7 73.22 1017.0 0.00 NaN 0.0
2020-08-04 04:00:00+00:00 42.75 -73.80 89.0 22.2 16.7 71.01 1017.0 0.00 NaN 0.0
... ... ... ... ... ... ... ... ... ... ... ...
SYR 2020-08-04 19:00:00+00:00 43.12 -76.12 124.0 21.1 19.4 90.02 1008.2 4.12 NaN 70.0
2020-08-04 20:00:00+00:00 43.12 -76.12 124.0 21.1 19.4 90.02 1006.5 2.57 NaN 60.0
2020-08-04 21:00:00+00:00 43.12 -76.12 124.0 21.1 18.9 87.26 1006.1 3.09 NaN 50.0
2020-08-04 22:00:00+00:00 43.12 -76.12 124.0 22.8 18.9 78.65 1006.5 2.06 NaN 270.0
2020-08-04 23:00:00+00:00 43.12 -76.12 124.0 22.8 18.9 78.65 1006.9 1.54 NaN 280.0

837 rows × 10 columns
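The format string mirrors the compact YYMMDD/HHMM layout of the ASOS timestamps; a one-off check on a single value:

```python
import pandas as pd

# %y = 2-digit year, %m%d = month/day, %H%M = hour/minute after the slash;
# utc=True attaches the UTC time zone to the result.
ts = pd.to_datetime('200804/0000', format='%y%m%d/%H%M', utc=True)
print(ts)  # → 2020-08-04 00:00:00+00:00
```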

Add columns to a DataFrame

The NYSM data does not yet contain dewpoint or sea-level pressure. Calculate each and add them as new columns.

tmpc = nysm_data['TMPC']
rh = nysm_data['RELH']
pres = nysm_data['PRES']
elev = nysm_data['SELV']
sensorHeight = .5
# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf 
nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))
# Compute dewpoint from temperature and RH via a Magnus-type approximation
nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))
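One quick sanity check on the dewpoint expression above (a Magnus-type approximation, here factored into a helper function for readability): at 100% relative humidity the formula reduces exactly to the air temperature, and drier air always yields a lower dewpoint.

```python
import numpy as np

def dewpoint(tmpc, rh):
    # Same Magnus-type formula as above: gamma collects the shared
    # log(RH/100) + b*T/(c+T) term, with b = 17.625 and c = 243.04.
    gamma = np.log(rh / 100) + (17.625 * tmpc) / (243.04 + tmpc)
    return 243.04 * gamma / (17.625 - gamma)

print(round(dewpoint(20.0, 100.0), 6))  # → 20.0 (saturated: dewpoint = temperature)
print(dewpoint(20.0, 50.0) < 20.0)      # → True (drier air, lower dewpoint)
```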

Now that we have calculated sea-level pressure, let’s drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one.

nysm_data.drop('PRES',inplace=True,axis='columns')
nysm_data
TMPC RELH SPED GUMS DRCT SLAT SLON SELV PMSL DWPC
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.7 1.8 3.0 175.0 42.04036 -77.23726 507.614 1014.767745 15.846049
2020-08-04 00:05:00+00:00 23.4 61.4 2.4 3.6 169.0 42.04036 -77.23726 507.614 1014.739386 15.554540
2020-08-04 00:10:00+00:00 23.1 62.2 2.4 3.7 167.0 42.04036 -77.23726 507.614 1014.746538 15.473742
2020-08-04 00:15:00+00:00 23.2 62.6 2.6 4.3 170.0 42.04036 -77.23726 507.614 1014.758278 15.668212
2020-08-04 00:20:00+00:00 23.0 62.8 2.0 3.7 172.0 42.04036 -77.23726 507.614 1014.756017 15.529141
... ... ... ... ... ... ... ... ... ... ... ...
YORK 2020-08-04 23:35:00+00:00 21.7 89.7 1.3 2.3 246.0 42.85504 -77.84776 177.942 1009.391100 19.933345
2020-08-04 23:40:00+00:00 21.7 90.0 1.0 1.6 237.0 42.85504 -77.84776 177.942 1009.452354 19.987260
2020-08-04 23:45:00+00:00 21.7 90.9 1.2 2.2 222.0 42.85504 -77.84776 177.942 1009.574861 20.148065
2020-08-04 23:50:00+00:00 21.8 90.8 1.1 2.1 231.0 42.85504 -77.84776 177.942 1009.639244 20.229084
2020-08-04 23:55:00+00:00 21.7 90.0 0.8 1.9 217.0 42.85504 -77.84776 177.942 1009.687160 19.987260

36288 rows × 10 columns

Concatenate one DataFrame into another

Now the two DataFrames can be combined. We'll use the concat function to "glue" the rows of the ASOS table onto the NYSM one.

nymerge_data = pd.concat([nysm_data,nyasos_data])
nymerge_data
TMPC RELH SPED GUMS DRCT SLAT SLON SELV PMSL DWPC
STID TIME
ADDI 2020-08-04 00:00:00+00:00 23.1 63.70 1.80 3.0 175.0 42.04036 -77.23726 507.614 1014.767745 15.846049
2020-08-04 00:05:00+00:00 23.4 61.40 2.40 3.6 169.0 42.04036 -77.23726 507.614 1014.739386 15.554540
2020-08-04 00:10:00+00:00 23.1 62.20 2.40 3.7 167.0 42.04036 -77.23726 507.614 1014.746538 15.473742
2020-08-04 00:15:00+00:00 23.2 62.60 2.60 4.3 170.0 42.04036 -77.23726 507.614 1014.758278 15.668212
2020-08-04 00:20:00+00:00 23.0 62.80 2.00 3.7 172.0 42.04036 -77.23726 507.614 1014.756017 15.529141
... ... ... ... ... ... ... ... ... ... ... ...
SYR 2020-08-04 19:00:00+00:00 21.1 90.02 4.12 NaN 70.0 43.12000 -76.12000 124.000 1008.200000 19.400000
2020-08-04 20:00:00+00:00 21.1 90.02 2.57 NaN 60.0 43.12000 -76.12000 124.000 1006.500000 19.400000
2020-08-04 21:00:00+00:00 21.1 87.26 3.09 NaN 50.0 43.12000 -76.12000 124.000 1006.100000 18.900000
2020-08-04 22:00:00+00:00 22.8 78.65 2.06 NaN 270.0 43.12000 -76.12000 124.000 1006.500000 18.900000
2020-08-04 23:00:00+00:00 22.8 78.65 1.54 NaN 280.0 43.12000 -76.12000 124.000 1006.900000 18.900000

37125 rows × 10 columns

Let’s rearrange the columns into a more logical order.

colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']
nymerge_data = nymerge_data.reindex(columns=colOrder)

Note: Whenever you combine DataFrames, it's a very good idea to re-sort the indices. Although the new DataFrame may "look" correct, its indices may not be in what Pandas calls lexical order. We'll use Pandas' sort_index for this purpose.

nymerge_data.sort_index(ascending=True, inplace=True)
nymerge_data
SLAT SLON SELV TMPC DWPC RELH PMSL DRCT SPED GUMS
STID TIME
ADDI 2020-08-04 00:00:00+00:00 42.04036 -77.23726 507.614 23.1 15.846049 63.7 1014.767745 175.0 1.8 3.0
2020-08-04 00:05:00+00:00 42.04036 -77.23726 507.614 23.4 15.554540 61.4 1014.739386 169.0 2.4 3.6
2020-08-04 00:10:00+00:00 42.04036 -77.23726 507.614 23.1 15.473742 62.2 1014.746538 167.0 2.4 3.7
2020-08-04 00:15:00+00:00 42.04036 -77.23726 507.614 23.2 15.668212 62.6 1014.758278 170.0 2.6 4.3
2020-08-04 00:20:00+00:00 42.04036 -77.23726 507.614 23.0 15.529141 62.8 1014.756017 172.0 2.0 3.7
... ... ... ... ... ... ... ... ... ... ... ...
YORK 2020-08-04 23:35:00+00:00 42.85504 -77.84776 177.942 21.7 19.933345 89.7 1009.391100 246.0 1.3 2.3
2020-08-04 23:40:00+00:00 42.85504 -77.84776 177.942 21.7 19.987260 90.0 1009.452354 237.0 1.0 1.6
2020-08-04 23:45:00+00:00 42.85504 -77.84776 177.942 21.7 20.148065 90.9 1009.574861 222.0 1.2 2.2
2020-08-04 23:50:00+00:00 42.85504 -77.84776 177.942 21.8 20.229084 90.8 1009.639244 231.0 1.1 2.1
2020-08-04 23:55:00+00:00 42.85504 -77.84776 177.942 21.7 19.987260 90.0 1009.687160 217.0 0.8 1.9

37125 rows × 10 columns
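A quick way to confirm that a MultiIndex is in sorted order is its is_monotonic_increasing attribute; a minimal sketch on a toy two-row frame (hypothetical values):

```python
import pandas as pd

# After a concat, the combined MultiIndex is generally not in lexical order;
# is_monotonic_increasing reports whether a sort_index is still needed.
idx = pd.MultiIndex.from_tuples([('VOOR', 1), ('ADDI', 2)], names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [22.0, 23.1]}, index=idx)
print(df.index.is_monotonic_increasing)  # → False
df = df.sort_index()
print(df.index.is_monotonic_increasing)  # → True
```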

Analyze the merged DataFrame

Now we have what we wanted: one DataFrame that contains the same variables for both datasets. We can make selections on the multi-indexed DataFrame as we've done previously.

Get some general information about the DataFrame:

nymerge_data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37125 entries, ('ADDI', Timestamp('2020-08-04 00:00:00+0000', tz='UTC')) to ('YORK', Timestamp('2020-08-04 23:55:00+0000', tz='UTC'))
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   SLAT    37125 non-null  float64
 1   SLON    37125 non-null  float64
 2   SELV    37125 non-null  float64
 3   TMPC    37125 non-null  float64
 4   DWPC    36141 non-null  float64
 5   RELH    36141 non-null  float64
 6   PMSL    37033 non-null  float64
 7   DRCT    37079 non-null  float64
 8   SPED    37123 non-null  float64
 9   GUMS    36415 non-null  float64
dtypes: float64(10)
memory usage: 3.0+ MB

Select one station

nymerge_data.loc['ROC']
SLAT SLON SELV TMPC DWPC RELH PMSL DRCT SPED GUMS
TIME
2020-08-04 00:00:00+00:00 43.12 -77.67 169.0 21.1 20.6 96.97 1015.7 70.0 3.60 NaN
2020-08-04 01:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1015.5 50.0 2.06 NaN
2020-08-04 02:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1016.0 90.0 4.12 NaN
2020-08-04 03:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1015.9 240.0 1.54 NaN
2020-08-04 04:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1015.4 100.0 1.54 NaN
2020-08-04 05:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1014.7 130.0 2.06 NaN
2020-08-04 06:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1014.1 230.0 1.54 NaN
2020-08-04 07:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1013.4 0.0 0.00 NaN
2020-08-04 08:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1013.3 0.0 0.00 NaN
2020-08-04 09:00:00+00:00 43.12 -77.67 169.0 19.4 19.4 100.00 1012.8 0.0 0.00 NaN
2020-08-04 10:00:00+00:00 43.12 -77.67 169.0 20.0 19.4 96.34 1012.6 0.0 0.00 NaN
2020-08-04 11:00:00+00:00 43.12 -77.67 169.0 19.4 18.9 96.93 1012.4 280.0 1.54 NaN
2020-08-04 12:00:00+00:00 43.12 -77.67 169.0 20.6 19.4 92.83 1012.0 0.0 0.00 NaN
2020-08-04 13:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1011.9 0.0 0.00 NaN
2020-08-04 14:00:00+00:00 43.12 -77.67 169.0 21.1 20.0 93.44 1011.7 0.0 0.00 NaN
2020-08-04 15:00:00+00:00 43.12 -77.67 169.0 21.7 18.9 84.10 1011.4 0.0 0.00 NaN
2020-08-04 16:00:00+00:00 43.12 -77.67 169.0 21.7 20.0 90.06 1010.9 170.0 2.57 NaN
2020-08-04 17:00:00+00:00 43.12 -77.67 169.0 21.7 20.6 93.47 1010.3 150.0 2.06 NaN
2020-08-04 18:00:00+00:00 43.12 -77.67 169.0 22.2 20.0 87.35 1009.6 0.0 0.00 NaN
2020-08-04 19:00:00+00:00 43.12 -77.67 169.0 23.3 20.0 81.71 1008.8 10.0 2.06 NaN
2020-08-04 20:00:00+00:00 43.12 -77.67 169.0 25.0 20.0 73.78 1007.5 20.0 4.12 NaN
2020-08-04 21:00:00+00:00 43.12 -77.67 169.0 21.7 21.1 96.39 1008.3 320.0 3.60 NaN
2020-08-04 22:00:00+00:00 43.12 -77.67 169.0 22.2 20.6 90.66 1008.7 340.0 3.09 NaN
2020-08-04 23:00:00+00:00 43.12 -77.67 169.0 22.8 19.4 81.14 1009.0 320.0 3.09 NaN

Select multiple stations and multiple times.

nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]
TMPC DWPC
STID TIME
ALB 2020-08-04 18:00:00+00:00 19.4 19.400000
VOOR 2020-08-04 17:45:00+00:00 20.2 19.988491
2020-08-04 18:00:00+00:00 20.3 20.071947

Use the cross-section method, xs, to select one value from a single index level and then display all values of the other level for selected columns.

nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]
TMPC DWPC RELH
STID
ADDI 20.0 19.425792 96.50
ALB 19.4 19.400000 100.00
ANDE 19.0 18.578391 97.40
ART 21.1 20.000000 93.44
BATA 23.2 19.517692 79.80
... ... ... ...
WFMB 17.6 NaN NaN
WGAT 19.3 18.844511 97.20
WHIT 21.0 20.588584 97.50
WOLC 21.1 NaN NaN
YORK 22.8 20.315945 85.90

161 rows × 3 columns
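An equivalent cross-section can also be taken with loc and pd.IndexSlice; the difference is that xs drops the selected level while IndexSlice keeps it. A sketch on toy data (hypothetical station IDs and time strings):

```python
import pandas as pd

# Toy two-level frame; the index must be sorted for IndexSlice slicing.
idx = pd.MultiIndex.from_tuples(
    [('ALB', '18:00'), ('ALB', '19:00'), ('ROC', '18:00')],
    names=['STID', 'TIME'])
df = pd.DataFrame({'TMPC': [19.4, 20.0, 22.2]}, index=idx).sort_index()

cross = df.xs('18:00', level='TIME')           # index is STID only
sliced = df.loc[pd.IndexSlice[:, '18:00'], :]  # index keeps both levels
print(cross.shape, sliced.shape)  # → (2, 1) (2, 1)
```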

Tip: We could, of course, take this merged DataFrame and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!

Summary

  • Pandas can efficiently combine datasets with different attributes into a single DataFrame.

  • Pay attention to the row indexing in the resulting merged DataFrame.

What’s Next?

In the next notebook, we will explore Pandas' groupby method.