Pandas Notebook 7: Merging DataFrames¶
In this notebook, we will combine observations from two separate data sources into one DataFrame: the NYS Mesonet and ASOS sites located in New York.
Overview¶
Examine the NYSM and ASOS-derived datafiles for a particular day.
Remove (drop) unwanted columns from a DataFrame.
Use a dictionary to rename columns.
Merge columns from one DataFrame into another and re-index.
Add columns to a DataFrame.
Concatenate one DataFrame into another.
Analyze the merged DataFrame.
Prerequisites¶
Concepts | Importance | Notes
---|---|---
Pandas Notebooks 1-6 | Necessary | Intro to
Time to learn: 30 minutes
Imports¶
import pandas as pd
import numpy as np
Examine the NYSM and ASOS-derived datafiles for a particular day¶
Read in NYSM 5-minute data from 8/4/2020 (during the passage of Hurricane Isaias through the region) and also read in the NYSM site table so we can extract latitude, longitude, and elevation. Also read in a table that has hourly METAR observations from ASOS sites in New York State.
nysm_data_file = '/spare11/atm533/data/nysm_data_20200804.csv'
nysm_sites_file = '/spare11/atm533/data/nysm_sites.csv'
nyasos_data_file = '/spare11/atm533/data/nyasos_data_20200804.txt'
nysm_data = pd.read_csv(nysm_data_file)
nysm_sites = pd.read_csv(nysm_sites_file)
Examine the ASOS data. Note the following differences in this dataset’s format as compared to how the NYSM data is structured:
Its columns are named differently from those of the NYSM
Its columns are separated using whitespace, not commas
Missing values are denoted by -9999.0, not null strings
The date/time is formatted differently.
! head $nyasos_data_file
STN YYMMDD/HHMM SLAT SLON SELV TMPC DWPC RELH PMSL SPED GUMS DRCT P01M
ALB 200804/0000 42.75 -73.80 89.00 25.00 15.60 55.91 1016.50 2.06 -9999.00 10.00 -9999.00
ART 200804/0000 44.00 -76.02 99.00 22.80 16.70 68.47 1015.50 0.00 -9999.00 0.00 -9999.00
BGM 200804/0000 42.22 -75.98 497.00 23.90 15.00 57.47 1016.40 2.57 -9999.00 160.00 -9999.00
BUF 200804/0000 42.93 -78.73 215.00 23.30 17.80 71.22 1015.00 3.60 -9999.00 150.00 0.00
DKK 200804/0000 42.50 -79.28 203.00 20.60 18.30 86.66 1015.10 3.09 -9999.00 150.00 0.00
DSV 200804/0000 42.57 -77.72 209.00 21.70 18.90 84.10 1016.30 4.12 -9999.00 330.00 0.25
ELM 200804/0000 42.17 -76.90 291.00 27.80 15.00 45.60 1014.30 2.57 -9999.00 180.00 -9999.00
ELZ 200804/0000 42.11 -77.99 647.00 22.20 15.00 63.70 1016.40 2.06 -9999.00 160.00 -9999.00
FOK 200804/0000 40.85 -72.63 20.00 25.60 22.80 84.53 1018.00 2.57 -9999.00 220.00 -9999.00
nyasos_data = pd.read_csv(nyasos_data_file,delim_whitespace=True,na_values=[-9999.0])
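Note: as of pandas 2.2, delim_whitespace is deprecated in favor of a regex separator. A sketch of the equivalent call:
nyasos_data = pd.read_csv(nyasos_data_file, sep=r'\s+', na_values=[-9999.0])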
Look at the first couple of rows in all three DataFrames.
nysm_data.head(2)
station | time | temp_2m [degC] | temp_9m [degC] | relative_humidity [percent] | precip_incremental [mm] | precip_local [mm] | precip_max_intensity [mm/min] | avg_wind_speed_prop [m/s] | max_wind_speed_prop [m/s] | ... | snow_depth [cm] | frozen_soil_05cm [bit] | frozen_soil_25cm [bit] | frozen_soil_50cm [bit] | soil_temp_05cm [degC] | soil_temp_25cm [degC] | soil_temp_50cm [degC] | soil_moisture_05cm [m^3/m^3] | soil_moisture_25cm [m^3/m^3] | soil_moisture_50cm [m^3/m^3] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ADDI | 2020-08-04 00:00:00 UTC | 23.1 | 23.9 | 63.7 | 0.0 | 0.0 | 0.0 | 1.6 | 3.0 | ... | NaN | 0.0 | 0.0 | 0.0 | 22.6 | 20.3 | 19.1 | 0.07 | 0.2 | 0.26 |
1 | ADDI | 2020-08-04 00:05:00 UTC | 23.4 | 24.1 | 61.4 | 0.0 | 0.0 | 0.0 | 2.2 | 3.1 | ... | NaN | 0.0 | 0.0 | 0.0 | 22.6 | 20.3 | 19.2 | 0.08 | 0.2 | 0.26 |
2 rows × 30 columns
nysm_sites.head(2)
stid | number | name | lat | lon | elevation | county | nearest_city | state | distance_from_town [km] | direction_from_town [degrees] | climate_division | climate_division_name | wfo | commissioned | decommissioned | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ADDI | 107 | Addison | 42.04036 | -77.23726 | 507.614 | Steuben | Addison | NY | 6.9 | S | 1 | Western Plateau | BGM | 2016-08-10 18:15:00 UTC | NaN |
1 | ANDE | 111 | Andes | 42.18227 | -74.80139 | 518.282 | Delaware | Andes | NY | 1.5 | WSW | 2 | Eastern Plateau | BGM | 2016-08-04 15:55:00 UTC | NaN |
nyasos_data.head(2)
STN | YYMMDD/HHMM | SLAT | SLON | SELV | TMPC | DWPC | RELH | PMSL | SPED | GUMS | DRCT | P01M | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALB | 200804/0000 | 42.75 | -73.80 | 89.0 | 25.0 | 15.6 | 55.91 | 1016.5 | 2.06 | NaN | 10.0 | NaN |
1 | ART | 200804/0000 | 44.00 | -76.02 | 99.0 | 22.8 | 16.7 | 68.47 | 1015.5 | 0.00 | NaN | 0.0 | NaN |
Remove (drop) unwanted columns from a DataFrame¶
Our merged data file will contain only a subset of the columns from each data source. First, let’s remove the unwanted columns from the NYSM file.
Use inplace=True so the DataFrame object is updated in place, and specify axis='columns', since the default axis refers to row indices.
nysm_data.drop(['temp_9m [degC]', 'precip_incremental [mm]', 'precip_local [mm]',
                'precip_max_intensity [mm/min]', 'avg_wind_speed_prop [m/s]',
                'max_wind_speed_prop [m/s]', 'wind_speed_stddev_prop [m/s]',
                'wind_direction_prop [degrees]', 'wind_direction_stddev_prop [degrees]',
                'wind_speed_stddev_sonic [m/s]', 'wind_direction_stddev_sonic [degrees]',
                'solar_insolation [W/m^2]', 'snow_depth [cm]',
                'frozen_soil_05cm [bit]', 'frozen_soil_25cm [bit]', 'frozen_soil_50cm [bit]',
                'soil_temp_05cm [degC]', 'soil_temp_25cm [degC]', 'soil_temp_50cm [degC]',
                'soil_moisture_05cm [m^3/m^3]', 'soil_moisture_25cm [m^3/m^3]',
                'soil_moisture_50cm [m^3/m^3]'],
               axis='columns', inplace=True)
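Alternatively (a sketch of an equivalent approach, not what we do here), we could select just the columns we want to keep, which is more concise when the keep-list is short:
nysm_data = nysm_data[['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
                       'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
                       'wind_direction_sonic [degrees]', 'station_pressure [mbar]']]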
Examine what columns remain post-drop.
nysm_data.columns
Index(['station', 'time', 'temp_2m [degC]', 'relative_humidity [percent]',
'avg_wind_speed_sonic [m/s]', 'max_wind_speed_sonic [m/s]',
'wind_direction_sonic [degrees]', 'station_pressure [mbar]'],
dtype='object')
Now, drop any unwanted columns from the ASOS data file.
nyasos_data.drop(['P01M'],axis='columns',inplace=True)
Examine the columns remaining in the ASOS data file.
nyasos_data.columns
Index(['STN', 'YYMMDD/HHMM', 'SLAT', 'SLON', 'SELV', 'TMPC', 'DWPC', 'RELH',
'PMSL', 'SPED', 'GUMS', 'DRCT'],
dtype='object')
Examine the columns in the NYSM Sites file.
nysm_sites.columns
Index(['stid', 'number', 'name', 'lat', 'lon', 'elevation', 'county',
'nearest_city', 'state', 'distance_from_town [km]',
'direction_from_town [degrees]', 'climate_division',
'climate_division_name', 'wfo', 'commissioned', 'decommissioned'],
dtype='object')
Use a dictionary to rename columns¶
The three DataFrames use different column names. Let's standardize them by creating a dictionary that maps the current column names to common (and in some cases, much shorter) names.
column_mapping = {'station' : 'STID',
'time': 'TIME',
'temp_2m [degC]': 'TMPC',
'relative_humidity [percent]': 'RELH',
'precip_incremental [mm]': 'PRCP',
'precip_local [mm]': 'PTOT',
'avg_wind_speed_sonic [m/s]': 'SPED',
'max_wind_speed_sonic [m/s]': 'GUMS',
'wind_direction_sonic [degrees]': 'DRCT',
'station_pressure [mbar]': 'PRES',
'stid': 'STID',
'name': 'NAME',
'lat': 'SLAT',
'lon': 'SLON',
'elevation': 'SELV',
'STN': 'STID',
'YYMMDD/HHMM': 'TIME'}
For each of the three DataFrames, rename the columns according to our dictionary. Since rename silently ignores keys that don't match any column, the same mapping works for all three. Then examine each DataFrame to see how it looks.
nysm_data.rename(columns=column_mapping, inplace=True)
nysm_data.head()
STID | TIME | TMPC | RELH | SPED | GUMS | DRCT | PRES | |
---|---|---|---|---|---|---|---|---|
0 | ADDI | 2020-08-04 00:00:00 UTC | 23.1 | 63.7 | 1.8 | 3.0 | 175.0 | 957.00 |
1 | ADDI | 2020-08-04 00:05:00 UTC | 23.4 | 61.4 | 2.4 | 3.6 | 169.0 | 957.03 |
2 | ADDI | 2020-08-04 00:10:00 UTC | 23.1 | 62.2 | 2.4 | 3.7 | 167.0 | 956.98 |
3 | ADDI | 2020-08-04 00:15:00 UTC | 23.2 | 62.6 | 2.6 | 4.3 | 170.0 | 957.01 |
4 | ADDI | 2020-08-04 00:20:00 UTC | 23.0 | 62.8 | 2.0 | 3.7 | 172.0 | 956.97 |
nysm_sites.rename(columns=column_mapping, inplace=True)
nysm_sites.head()
STID | number | NAME | SLAT | SLON | SELV | county | nearest_city | state | distance_from_town [km] | direction_from_town [degrees] | climate_division | climate_division_name | wfo | commissioned | decommissioned | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ADDI | 107 | Addison | 42.04036 | -77.23726 | 507.6140 | Steuben | Addison | NY | 6.9 | S | 1 | Western Plateau | BGM | 2016-08-10 18:15:00 UTC | NaN |
1 | ANDE | 111 | Andes | 42.18227 | -74.80139 | 518.2820 | Delaware | Andes | NY | 1.5 | WSW | 2 | Eastern Plateau | BGM | 2016-08-04 15:55:00 UTC | NaN |
2 | BATA | 24 | Batavia | 43.01994 | -78.13566 | 276.1200 | Genesee | Batavia | NY | 4.9 | ENE | 9 | Great Lakes | BUF | 2016-02-18 18:40:00 UTC | NaN |
3 | BEAC | 76 | Beacon | 41.52875 | -73.94527 | 90.1598 | Dutchess | Beacon | NY | 3.3 | NE | 5 | Hudson Valley | ALY | 2016-08-22 16:45:00 UTC | NaN |
4 | BELD | 90 | Belden | 42.22322 | -75.66852 | 470.3700 | Broome | Belden | NY | 2.2 | NNE | 2 | Eastern Plateau | BGM | 2015-11-30 20:20:00 UTC | NaN |
nyasos_data.rename(columns=column_mapping, inplace=True)
nyasos_data.head()
STID | TIME | SLAT | SLON | SELV | TMPC | DWPC | RELH | PMSL | SPED | GUMS | DRCT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ALB | 200804/0000 | 42.75 | -73.80 | 89.0 | 25.0 | 15.6 | 55.91 | 1016.5 | 2.06 | NaN | 10.0 |
1 | ART | 200804/0000 | 44.00 | -76.02 | 99.0 | 22.8 | 16.7 | 68.47 | 1015.5 | 0.00 | NaN | 0.0 |
2 | BGM | 200804/0000 | 42.22 | -75.98 | 497.0 | 23.9 | 15.0 | 57.47 | 1016.4 | 2.57 | NaN | 160.0 |
3 | BUF | 200804/0000 | 42.93 | -78.73 | 215.0 | 23.3 | 17.8 | 71.22 | 1015.0 | 3.60 | NaN | 150.0 |
4 | DKK | 200804/0000 | 42.50 | -79.28 | 203.0 | 20.6 | 18.3 | 86.66 | 1015.1 | 3.09 | NaN | 150.0 |
Merge columns from one DataFrame into another, and then re-index.¶
Use merge to add the latitude, longitude, and elevation from the NYSM Sites DataFrame to the NYSM data one. To do this, we take a subset of the nysm_sites DataFrame containing the columns we want to merge, plus the station ID column, since merge requires a column common to both DataFrames. Note also that merge does not have an inplace option, so we simply redefine the nysm_data object.
nysm_data = pd.merge(nysm_data,nysm_sites[['STID','SLAT','SLON','SELV']])
nysm_data.head()
STID | TIME | TMPC | RELH | SPED | GUMS | DRCT | PRES | SLAT | SLON | SELV | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ADDI | 2020-08-04 00:00:00 UTC | 23.1 | 63.7 | 1.8 | 3.0 | 175.0 | 957.00 | 42.04036 | -77.23726 | 507.614 |
1 | ADDI | 2020-08-04 00:05:00 UTC | 23.4 | 61.4 | 2.4 | 3.6 | 169.0 | 957.03 | 42.04036 | -77.23726 | 507.614 |
2 | ADDI | 2020-08-04 00:10:00 UTC | 23.1 | 62.2 | 2.4 | 3.7 | 167.0 | 956.98 | 42.04036 | -77.23726 | 507.614 |
3 | ADDI | 2020-08-04 00:15:00 UTC | 23.2 | 62.6 | 2.6 | 4.3 | 170.0 | 957.01 | 42.04036 | -77.23726 | 507.614 |
4 | ADDI | 2020-08-04 00:20:00 UTC | 23.0 | 62.8 | 2.0 | 3.7 | 172.0 | 956.97 | 42.04036 | -77.23726 | 507.614 |
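By default, pd.merge performs an inner join on all identically named columns; here the only shared column is STID. The explicit equivalent of the call above would be (a sketch):
nysm_data = pd.merge(nysm_data, nysm_sites[['STID', 'SLAT', 'SLON', 'SELV']],
                     on='STID', how='inner')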
Make the NYSM and ASOS dataframes multi-indexed according to site and time. Convert the date/time from string to Datetime objects.¶
First, set the indexes.
nysm_data.set_index(['STID', 'TIME'], inplace=True)
nysm_data.head()
TMPC | RELH | SPED | GUMS | DRCT | PRES | SLAT | SLON | SELV | ||
---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | |||||||||
ADDI | 2020-08-04 00:00:00 UTC | 23.1 | 63.7 | 1.8 | 3.0 | 175.0 | 957.00 | 42.04036 | -77.23726 | 507.614 |
2020-08-04 00:05:00 UTC | 23.4 | 61.4 | 2.4 | 3.6 | 169.0 | 957.03 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:10:00 UTC | 23.1 | 62.2 | 2.4 | 3.7 | 167.0 | 956.98 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:15:00 UTC | 23.2 | 62.6 | 2.6 | 4.3 | 170.0 | 957.01 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:20:00 UTC | 23.0 | 62.8 | 2.0 | 3.7 | 172.0 | 956.97 | 42.04036 | -77.23726 | 507.614 |
The NYSM date/time strings can be automatically parsed into Datetime objects, recognized as being in the UTC time zone.
nysm_data.index = nysm_data.index.set_levels(
    [nysm_data.index.levels[0], pd.to_datetime(nysm_data.index.levels[1])])
nysm_data.head()
TMPC | RELH | SPED | GUMS | DRCT | PRES | SLAT | SLON | SELV | ||
---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | |||||||||
ADDI | 2020-08-04 00:00:00+00:00 | 23.1 | 63.7 | 1.8 | 3.0 | 175.0 | 957.00 | 42.04036 | -77.23726 | 507.614 |
2020-08-04 00:05:00+00:00 | 23.4 | 61.4 | 2.4 | 3.6 | 169.0 | 957.03 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:10:00+00:00 | 23.1 | 62.2 | 2.4 | 3.7 | 167.0 | 956.98 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:15:00+00:00 | 23.2 | 62.6 | 2.6 | 4.3 | 170.0 | 957.01 | 42.04036 | -77.23726 | 507.614 | |
2020-08-04 00:20:00+00:00 | 23.0 | 62.8 | 2.0 | 3.7 | 172.0 | 956.97 | 42.04036 | -77.23726 | 507.614 |
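An alternative that avoids modifying index levels entirely (a sketch, assuming we had converted the column before calling set_index):
nysm_data['TIME'] = pd.to_datetime(nysm_data['TIME'], utc=True)
nysm_data.set_index(['STID', 'TIME'], inplace=True)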
The ASOS dataset needs its station IDs and date/times sorted in order for the multi-indexing to work properly.
nyasos_data.sort_values(['STID','TIME'],inplace=True)
nyasos_data.set_index(['STID', 'TIME'], inplace = True)
Convert the date/time strings. Here, we need to specify how the string is formatted.¶
nyasos_data.index = nyasos_data.index.set_levels(
    [nyasos_data.index.levels[0],
     pd.to_datetime(nyasos_data.index.levels[1], format="%y%m%d/%H%M", utc=True)])
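For reference, here is what that format string does to a single value (an illustrative check):
pd.to_datetime('200804/0000', format='%y%m%d/%H%M', utc=True)
# Timestamp('2020-08-04 00:00:00+0000', tz='UTC')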
nyasos_data
SLAT | SLON | SELV | TMPC | DWPC | RELH | PMSL | SPED | GUMS | DRCT | ||
---|---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | ||||||||||
ALB | 2020-08-04 00:00:00+00:00 | 42.75 | -73.80 | 89.0 | 25.0 | 15.6 | 55.91 | 1016.5 | 2.06 | NaN | 10.0 |
2020-08-04 01:00:00+00:00 | 42.75 | -73.80 | 89.0 | 23.9 | 16.1 | 61.66 | 1016.6 | 1.54 | NaN | 60.0 | |
2020-08-04 02:00:00+00:00 | 42.75 | -73.80 | 89.0 | 21.7 | 16.7 | 73.22 | 1017.0 | 0.00 | NaN | 0.0 | |
2020-08-04 03:00:00+00:00 | 42.75 | -73.80 | 89.0 | 21.7 | 16.7 | 73.22 | 1017.0 | 0.00 | NaN | 0.0 | |
2020-08-04 04:00:00+00:00 | 42.75 | -73.80 | 89.0 | 22.2 | 16.7 | 71.01 | 1017.0 | 0.00 | NaN | 0.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
SYR | 2020-08-04 19:00:00+00:00 | 43.12 | -76.12 | 124.0 | 21.1 | 19.4 | 90.02 | 1008.2 | 4.12 | NaN | 70.0 |
2020-08-04 20:00:00+00:00 | 43.12 | -76.12 | 124.0 | 21.1 | 19.4 | 90.02 | 1006.5 | 2.57 | NaN | 60.0 | |
2020-08-04 21:00:00+00:00 | 43.12 | -76.12 | 124.0 | 21.1 | 18.9 | 87.26 | 1006.1 | 3.09 | NaN | 50.0 | |
2020-08-04 22:00:00+00:00 | 43.12 | -76.12 | 124.0 | 22.8 | 18.9 | 78.65 | 1006.5 | 2.06 | NaN | 270.0 | |
2020-08-04 23:00:00+00:00 | 43.12 | -76.12 | 124.0 | 22.8 | 18.9 | 78.65 | 1006.9 | 1.54 | NaN | 280.0 |
837 rows × 10 columns
Add columns to a DataFrame¶
The NYSM data contains neither dewpoint nor sea-level pressure yet. Calculate both and create a column for each.
tmpc = nysm_data['TMPC']
rh = nysm_data['RELH']
pres = nysm_data['PRES']
elev = nysm_data['SELV']
sensorHeight = .5
# Reduce station pressure to SLP. Source: https://www.sandhurstweather.org.uk/barometric.pdf
nysm_data['PMSL'] = pres/np.exp(-1*(elev+sensorHeight)/((tmpc+273.15) * 29.263))
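# Dewpoint (degC) from temperature and RH via a Magnus-type approximation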
nysm_data['DWPC'] = 243.04*(np.log(rh/100)+((17.625*tmpc)/(243.04+tmpc)))/(17.625-np.log(rh/100)-((17.625*tmpc)/(243.04+tmpc)))
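As a quick sanity check on the dewpoint formula (an illustrative example, not part of the workflow): at 100% relative humidity, np.log(rh/100) vanishes and the expression algebraically reduces to the air temperature.
t, r = 20.0, 100.0
# At RH = 100%, dewpoint should equal temperature (up to floating-point rounding)
print(243.04*(np.log(r/100)+((17.625*t)/(243.04+t)))/(17.625-np.log(r/100)-((17.625*t)/(243.04+t))))  # ~20.0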
Now that we have calculated sea-level pressure, let’s drop station pressure from the NYSM DataFrame since it does not appear in the ASOS one.
nysm_data.drop('PRES',inplace=True,axis='columns')
nysm_data
TMPC | RELH | SPED | GUMS | DRCT | SLAT | SLON | SELV | PMSL | DWPC | ||
---|---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | ||||||||||
ADDI | 2020-08-04 00:00:00+00:00 | 23.1 | 63.7 | 1.8 | 3.0 | 175.0 | 42.04036 | -77.23726 | 507.614 | 1014.767745 | 15.846049 |
2020-08-04 00:05:00+00:00 | 23.4 | 61.4 | 2.4 | 3.6 | 169.0 | 42.04036 | -77.23726 | 507.614 | 1014.739386 | 15.554540 | |
2020-08-04 00:10:00+00:00 | 23.1 | 62.2 | 2.4 | 3.7 | 167.0 | 42.04036 | -77.23726 | 507.614 | 1014.746538 | 15.473742 | |
2020-08-04 00:15:00+00:00 | 23.2 | 62.6 | 2.6 | 4.3 | 170.0 | 42.04036 | -77.23726 | 507.614 | 1014.758278 | 15.668212 | |
2020-08-04 00:20:00+00:00 | 23.0 | 62.8 | 2.0 | 3.7 | 172.0 | 42.04036 | -77.23726 | 507.614 | 1014.756017 | 15.529141 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
YORK | 2020-08-04 23:35:00+00:00 | 21.7 | 89.7 | 1.3 | 2.3 | 246.0 | 42.85504 | -77.84776 | 177.942 | 1009.391100 | 19.933345 |
2020-08-04 23:40:00+00:00 | 21.7 | 90.0 | 1.0 | 1.6 | 237.0 | 42.85504 | -77.84776 | 177.942 | 1009.452354 | 19.987260 | |
2020-08-04 23:45:00+00:00 | 21.7 | 90.9 | 1.2 | 2.2 | 222.0 | 42.85504 | -77.84776 | 177.942 | 1009.574861 | 20.148065 | |
2020-08-04 23:50:00+00:00 | 21.8 | 90.8 | 1.1 | 2.1 | 231.0 | 42.85504 | -77.84776 | 177.942 | 1009.639244 | 20.229084 | |
2020-08-04 23:55:00+00:00 | 21.7 | 90.0 | 0.8 | 1.9 | 217.0 | 42.85504 | -77.84776 | 177.942 | 1009.687160 | 19.987260 |
36288 rows × 10 columns
Concatenate one DataFrame into another¶
Now the two DataFrames can be combined. We'll use the concat function to "glue" the rows from the ASOS table onto the NYSM one.
nymerge_data = pd.concat([nysm_data,nyasos_data])
nymerge_data
TMPC | RELH | SPED | GUMS | DRCT | SLAT | SLON | SELV | PMSL | DWPC | ||
---|---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | ||||||||||
ADDI | 2020-08-04 00:00:00+00:00 | 23.1 | 63.70 | 1.80 | 3.0 | 175.0 | 42.04036 | -77.23726 | 507.614 | 1014.767745 | 15.846049 |
2020-08-04 00:05:00+00:00 | 23.4 | 61.40 | 2.40 | 3.6 | 169.0 | 42.04036 | -77.23726 | 507.614 | 1014.739386 | 15.554540 | |
2020-08-04 00:10:00+00:00 | 23.1 | 62.20 | 2.40 | 3.7 | 167.0 | 42.04036 | -77.23726 | 507.614 | 1014.746538 | 15.473742 | |
2020-08-04 00:15:00+00:00 | 23.2 | 62.60 | 2.60 | 4.3 | 170.0 | 42.04036 | -77.23726 | 507.614 | 1014.758278 | 15.668212 | |
2020-08-04 00:20:00+00:00 | 23.0 | 62.80 | 2.00 | 3.7 | 172.0 | 42.04036 | -77.23726 | 507.614 | 1014.756017 | 15.529141 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
SYR | 2020-08-04 19:00:00+00:00 | 21.1 | 90.02 | 4.12 | NaN | 70.0 | 43.12000 | -76.12000 | 124.000 | 1008.200000 | 19.400000 |
2020-08-04 20:00:00+00:00 | 21.1 | 90.02 | 2.57 | NaN | 60.0 | 43.12000 | -76.12000 | 124.000 | 1006.500000 | 19.400000 | |
2020-08-04 21:00:00+00:00 | 21.1 | 87.26 | 3.09 | NaN | 50.0 | 43.12000 | -76.12000 | 124.000 | 1006.100000 | 18.900000 | |
2020-08-04 22:00:00+00:00 | 22.8 | 78.65 | 2.06 | NaN | 270.0 | 43.12000 | -76.12000 | 124.000 | 1006.500000 | 18.900000 | |
2020-08-04 23:00:00+00:00 | 22.8 | 78.65 | 1.54 | NaN | 280.0 | 43.12000 | -76.12000 | 124.000 | 1006.900000 | 18.900000 |
37125 rows × 10 columns
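As a quick consistency check (a sketch), the merged row count should equal the sum of the two inputs: 36288 + 837 = 37125.
assert len(nymerge_data) == len(nysm_data) + len(nyasos_data)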
Let’s rearrange the columns into a more logical order.
colOrder = ['SLAT','SLON','SELV','TMPC','DWPC','RELH','PMSL','DRCT','SPED','GUMS']
nymerge_data = nymerge_data.reindex(columns=colOrder)
Sort the rows by station and time; sort_index does this on the multi-index.
nymerge_data.sort_index(ascending=True, inplace=True)
nymerge_data
SLAT | SLON | SELV | TMPC | DWPC | RELH | PMSL | DRCT | SPED | GUMS | ||
---|---|---|---|---|---|---|---|---|---|---|---|
STID | TIME | ||||||||||
ADDI | 2020-08-04 00:00:00+00:00 | 42.04036 | -77.23726 | 507.614 | 23.1 | 15.846049 | 63.7 | 1014.767745 | 175.0 | 1.8 | 3.0 |
2020-08-04 00:05:00+00:00 | 42.04036 | -77.23726 | 507.614 | 23.4 | 15.554540 | 61.4 | 1014.739386 | 169.0 | 2.4 | 3.6 | |
2020-08-04 00:10:00+00:00 | 42.04036 | -77.23726 | 507.614 | 23.1 | 15.473742 | 62.2 | 1014.746538 | 167.0 | 2.4 | 3.7 | |
2020-08-04 00:15:00+00:00 | 42.04036 | -77.23726 | 507.614 | 23.2 | 15.668212 | 62.6 | 1014.758278 | 170.0 | 2.6 | 4.3 | |
2020-08-04 00:20:00+00:00 | 42.04036 | -77.23726 | 507.614 | 23.0 | 15.529141 | 62.8 | 1014.756017 | 172.0 | 2.0 | 3.7 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
YORK | 2020-08-04 23:35:00+00:00 | 42.85504 | -77.84776 | 177.942 | 21.7 | 19.933345 | 89.7 | 1009.391100 | 246.0 | 1.3 | 2.3 |
2020-08-04 23:40:00+00:00 | 42.85504 | -77.84776 | 177.942 | 21.7 | 19.987260 | 90.0 | 1009.452354 | 237.0 | 1.0 | 1.6 | |
2020-08-04 23:45:00+00:00 | 42.85504 | -77.84776 | 177.942 | 21.7 | 20.148065 | 90.9 | 1009.574861 | 222.0 | 1.2 | 2.2 | |
2020-08-04 23:50:00+00:00 | 42.85504 | -77.84776 | 177.942 | 21.8 | 20.229084 | 90.8 | 1009.639244 | 231.0 | 1.1 | 2.1 | |
2020-08-04 23:55:00+00:00 | 42.85504 | -77.84776 | 177.942 | 21.7 | 19.987260 | 90.0 | 1009.687160 | 217.0 | 0.8 | 1.9 |
37125 rows × 10 columns
Analyze the merged DataFrame¶
Now we have what we wanted … one DataFrame that contains the same variables for both datasets. We can make selections on the multi-indexed DataFrame as we've done previously.
Get some general information about the DataFrame:
nymerge_data.info()
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 37125 entries, ('ADDI', Timestamp('2020-08-04 00:00:00+0000', tz='UTC')) to ('YORK', Timestamp('2020-08-04 23:55:00+0000', tz='UTC'))
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SLAT 37125 non-null float64
1 SLON 37125 non-null float64
2 SELV 37125 non-null float64
3 TMPC 37125 non-null float64
4 DWPC 36141 non-null float64
5 RELH 36141 non-null float64
6 PMSL 37033 non-null float64
7 DRCT 37079 non-null float64
8 SPED 37123 non-null float64
9 GUMS 36415 non-null float64
dtypes: float64(10)
memory usage: 3.0+ MB
Select one station.
nymerge_data.loc['ROC']
SLAT | SLON | SELV | TMPC | DWPC | RELH | PMSL | DRCT | SPED | GUMS | |
---|---|---|---|---|---|---|---|---|---|---|
TIME | ||||||||||
2020-08-04 00:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.1 | 20.6 | 96.97 | 1015.7 | 70.0 | 3.60 | NaN |
2020-08-04 01:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.1 | 20.0 | 93.44 | 1015.5 | 50.0 | 2.06 | NaN |
2020-08-04 02:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1016.0 | 90.0 | 4.12 | NaN |
2020-08-04 03:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1015.9 | 240.0 | 1.54 | NaN |
2020-08-04 04:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1015.4 | 100.0 | 1.54 | NaN |
2020-08-04 05:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1014.7 | 130.0 | 2.06 | NaN |
2020-08-04 06:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1014.1 | 230.0 | 1.54 | NaN |
2020-08-04 07:00:00+00:00 | 43.12 | -77.67 | 169.0 | 19.4 | 18.9 | 96.93 | 1013.4 | 0.0 | 0.00 | NaN |
2020-08-04 08:00:00+00:00 | 43.12 | -77.67 | 169.0 | 19.4 | 18.9 | 96.93 | 1013.3 | 0.0 | 0.00 | NaN |
2020-08-04 09:00:00+00:00 | 43.12 | -77.67 | 169.0 | 19.4 | 19.4 | 100.00 | 1012.8 | 0.0 | 0.00 | NaN |
2020-08-04 10:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.0 | 19.4 | 96.34 | 1012.6 | 0.0 | 0.00 | NaN |
2020-08-04 11:00:00+00:00 | 43.12 | -77.67 | 169.0 | 19.4 | 18.9 | 96.93 | 1012.4 | 280.0 | 1.54 | NaN |
2020-08-04 12:00:00+00:00 | 43.12 | -77.67 | 169.0 | 20.6 | 19.4 | 92.83 | 1012.0 | 0.0 | 0.00 | NaN |
2020-08-04 13:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.1 | 20.0 | 93.44 | 1011.9 | 0.0 | 0.00 | NaN |
2020-08-04 14:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.1 | 20.0 | 93.44 | 1011.7 | 0.0 | 0.00 | NaN |
2020-08-04 15:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.7 | 18.9 | 84.10 | 1011.4 | 0.0 | 0.00 | NaN |
2020-08-04 16:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.7 | 20.0 | 90.06 | 1010.9 | 170.0 | 2.57 | NaN |
2020-08-04 17:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.7 | 20.6 | 93.47 | 1010.3 | 150.0 | 2.06 | NaN |
2020-08-04 18:00:00+00:00 | 43.12 | -77.67 | 169.0 | 22.2 | 20.0 | 87.35 | 1009.6 | 0.0 | 0.00 | NaN |
2020-08-04 19:00:00+00:00 | 43.12 | -77.67 | 169.0 | 23.3 | 20.0 | 81.71 | 1008.8 | 10.0 | 2.06 | NaN |
2020-08-04 20:00:00+00:00 | 43.12 | -77.67 | 169.0 | 25.0 | 20.0 | 73.78 | 1007.5 | 20.0 | 4.12 | NaN |
2020-08-04 21:00:00+00:00 | 43.12 | -77.67 | 169.0 | 21.7 | 21.1 | 96.39 | 1008.3 | 320.0 | 3.60 | NaN |
2020-08-04 22:00:00+00:00 | 43.12 | -77.67 | 169.0 | 22.2 | 20.6 | 90.66 | 1008.7 | 340.0 | 3.09 | NaN |
2020-08-04 23:00:00+00:00 | 43.12 | -77.67 | 169.0 | 22.8 | 19.4 | 81.14 | 1009.0 | 320.0 | 3.09 | NaN |
Select multiple stations and multiple times. (ALB, an hourly ASOS site, has no 17:45 observation, so only its 18:00 row appears.)
nymerge_data.loc[(('ALB','VOOR'),('2020-08-04 17:45:00 UTC','2020-08-04 18:00:00 UTC')),('TMPC','DWPC')]
TMPC | DWPC | ||
---|---|---|---|
STID | TIME | ||
ALB | 2020-08-04 18:00:00+00:00 | 19.4 | 19.400000 |
VOOR | 2020-08-04 17:45:00+00:00 | 20.2 | 19.988491 |
2020-08-04 18:00:00+00:00 | 20.3 | 20.071947 |
Use the cross-section method to select one index value and then display all values of the other index for selected columns.
nymerge_data.xs('2020-08-04 18:00:00 UTC',level='TIME')[['TMPC','DWPC','RELH']]
TMPC | DWPC | RELH | |
---|---|---|---|
STID | |||
ADDI | 20.0 | 19.425792 | 96.50 |
ALB | 19.4 | 19.400000 | 100.00 |
ANDE | 19.0 | 18.578391 | 97.40 |
ART | 21.1 | 20.000000 | 93.44 |
BATA | 23.2 | 19.517692 | 79.80 |
... | ... | ... | ... |
WFMB | 17.6 | NaN | NaN |
WGAT | 19.3 | 18.844511 | 97.20 |
WHIT | 21.0 | 20.588584 | 97.50 |
WOLC | 21.1 | NaN | NaN |
YORK | 22.8 | 20.315945 | 85.90 |
161 rows × 3 columns
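As one more illustrative analysis (a sketch, not part of the original notebook), we could rank stations by their peak wind gust during Isaias's passage:
# Peak gust (m/s) per station across the day, highest first
nymerge_data['GUMS'].groupby(level='STID').max().sort_values(ascending=False).head()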
We could now take this merged DataFrame and create a station plot with it ... which would show not only the NYSM data, but also the traditional ASOS network of stations!