Interactive data visualization in Python: GeoPandas
Prerequisites
Concepts | Importance | Notes
---|---|---
Projections and Features | Required |
Tabular Datasets | Required |
Time to learn: 20 minutes
All of the graphics we have generated so far in the class have been static. In other words, they exist “as-is” … there is no way to interact with them. While this is fine, and even preferable, for traditional publication figures and websites, it would be nice to be able to produce dynamic figures … which one can zoom into/out of, pan around … similar to, say, Google Maps.
We have previously displayed tropical cyclone locations on a static map, using Cartopy, Matplotlib, and (in your HW1 assignment) Pandas. Now, let’s make an interactive map … for that, we will leverage the Geopandas Python package.
Imports
import os
os.environ['USE_PYGEOS'] = '0'  # must be set before GeoPandas is imported

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from cartopy import crs as ccrs
from cartopy import feature as cfeature
import geopandas as gpd
Read in the IBTrACS database
IBTrACS, aka the International Best Track Archive for Climate Stewardship, is a continually updated database of worldwide tropical cyclones.
The IBTrACS datasets have over 100 columns. Let’s focus on just a subset of columns that are most relevant to the North Atlantic/Gulf of Mexico/Caribbean Sea basins.
keepCols = ['SEASON', 'NAME', 'ISO_TIME', 'LAT', 'LON', 'USA_STATUS', 'USA_WIND', 'USA_PRES']
Open a Pandas DataFrame from a URL pointing to the North Atlantic IBTrACS dataset.
df = pd.read_csv("https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.NA.list.v04r00.csv",low_memory=False, skiprows = [1], usecols=keepCols)
Note the additional arguments passed to the read_csv function. You can explore the full suite of options via the Pandas API documentation.
df
  | SEASON | NAME | ISO_TIME | LAT | LON | USA_STATUS | USA_WIND | USA_PRES
---|---|---|---|---|---|---|---|---|
0 | 1851 | NOT_NAMED | 1851-06-23 12:00:00 | 26.1000 | -90.4000 | |||
1 | 1851 | NOT_NAMED | 1851-06-23 15:00:00 | 26.2001 | -90.6999 | |||
2 | 1851 | NOT_NAMED | 1851-06-23 18:00:00 | 26.3000 | -91.0000 | |||
3 | 1851 | NOT_NAMED | 1851-06-23 21:00:00 | 26.3999 | -91.3001 | |||
4 | 1851 | NOT_NAMED | 1851-06-24 00:00:00 | 26.5000 | -91.6000 | |||
... | ... | ... | ... | ... | ... | ... | ... | ... |
126953 | 2023 | SEAN | 2023-10-14 12:00:00 | 15.8000 | -43.9000 | 29 | 1007 | |
126954 | 2023 | SEAN | 2023-10-14 15:00:00 | 16.0650 | -44.2092 | 29 | 1007 | |
126955 | 2023 | SEAN | 2023-10-14 18:00:00 | 16.3000 | -44.5000 | 29 | 1007 | |
126956 | 2023 | SEAN | 2023-10-14 21:00:00 | 16.5076 | -44.7974 | 29 | 1007 | |
126957 | 2023 | SEAN | 2023-10-15 00:00:00 | 16.7000 | -45.1000 | 29 | 1008 |
126958 rows × 8 columns
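As an optional aside, here’s a minimal sketch of layering one more read_csv option onto the call above: parse_dates converts the ISO_TIME strings into true datetimes at read time. The df_times name is just a hypothetical, throwaway variable, so the df used in the rest of the lesson is untouched.
# Optional sketch: same read as above, but parse ISO_TIME as datetimes.
# df_times is a hypothetical name; the lesson's df is left unchanged.
url = ("https://www.ncei.noaa.gov/data/international-best-track-archive-for-"
       "climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.NA.list.v04r00.csv")
df_times = pd.read_csv(url, low_memory=False, skiprows=[1], usecols=keepCols,
                       parse_dates=['ISO_TIME'])
df_times.dtypes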
Create a new DataFrame consisting only of tropical cyclone Franklin (2023).
dfFranklin = df.query('NAME == "FRANKLIN" & SEASON == 2023')
The query function allows for database-like queries on a DataFrame. For more info, check out the documentation for the df.query function.
dfFranklin
  | SEASON | NAME | ISO_TIME | LAT | LON | USA_STATUS | USA_WIND | USA_PRES
---|---|---|---|---|---|---|---|---|
125903 | 2023 | FRANKLIN | 2023-08-18 06:00:00 | 11.6000 | -53.5000 | DB | 15 | |
125904 | 2023 | FRANKLIN | 2023-08-18 09:00:00 | 11.5900 | -53.8225 | DB | 15 | |
125905 | 2023 | FRANKLIN | 2023-08-18 12:00:00 | 11.6000 | -54.2000 | DB | 15 | |
125906 | 2023 | FRANKLIN | 2023-08-18 15:00:00 | 11.6425 | -54.6700 | DB | 17 | |
125907 | 2023 | FRANKLIN | 2023-08-18 18:00:00 | 11.7000 | -55.2000 | DB | 20 | 1008 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
126015 | 2023 | FRANKLIN | 2023-09-01 06:00:00 | 37.4000 | -57.9000 | HU | 70 | 983 |
126016 | 2023 | FRANKLIN | 2023-09-01 09:00:00 | 37.8122 | -57.2549 | HU | 67 | 984 |
126017 | 2023 | FRANKLIN | 2023-09-01 12:00:00 | 38.2000 | -56.6000 | HU | 65 | 985 |
126018 | 2023 | FRANKLIN | 2023-09-01 15:00:00 | 38.5994 | -55.7943 | HU | 67 | 982 |
126019 | 2023 | FRANKLIN | 2023-09-01 18:00:00 | 39.0000 | -54.9000 | EX | 70 | 979 |
117 rows × 8 columns
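As a side note, query can also pull values from Python variables by prefixing them with @ inside the query string. Here’s a small sketch; the storm name and season are arbitrary placeholders, not part of the lesson’s workflow.
# Sketch of df.query with @-prefixed Python variables (arbitrary example values).
storm_name = "IDALIA"
storm_season = 2023
dfStorm = df.query('NAME == @storm_name & SEASON == @storm_season')
dfStorm.head()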
Make our Pandas DataFrame geo-aware. To do this, we create a GeoPandas GeoDataFrame. It adds a geometry column, which may consist of shapes or points. The TC locations are points, so that’s what we’ll use to populate the geometry column.
gdf = gpd.GeoDataFrame(dfFranklin,geometry=gpd.points_from_xy(dfFranklin.LON,dfFranklin.LAT))
gdf
  | SEASON | NAME | ISO_TIME | LAT | LON | USA_STATUS | USA_WIND | USA_PRES | geometry
---|---|---|---|---|---|---|---|---|---|
125903 | 2023 | FRANKLIN | 2023-08-18 06:00:00 | 11.6000 | -53.5000 | DB | 15 | POINT (-53.50000 11.60000) | |
125904 | 2023 | FRANKLIN | 2023-08-18 09:00:00 | 11.5900 | -53.8225 | DB | 15 | POINT (-53.82250 11.59000) | |
125905 | 2023 | FRANKLIN | 2023-08-18 12:00:00 | 11.6000 | -54.2000 | DB | 15 | POINT (-54.20000 11.60000) | |
125906 | 2023 | FRANKLIN | 2023-08-18 15:00:00 | 11.6425 | -54.6700 | DB | 17 | POINT (-54.67000 11.64250) | |
125907 | 2023 | FRANKLIN | 2023-08-18 18:00:00 | 11.7000 | -55.2000 | DB | 20 | 1008 | POINT (-55.20000 11.70000) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126015 | 2023 | FRANKLIN | 2023-09-01 06:00:00 | 37.4000 | -57.9000 | HU | 70 | 983 | POINT (-57.90000 37.40000) |
126016 | 2023 | FRANKLIN | 2023-09-01 09:00:00 | 37.8122 | -57.2549 | HU | 67 | 984 | POINT (-57.25490 37.81220) |
126017 | 2023 | FRANKLIN | 2023-09-01 12:00:00 | 38.2000 | -56.6000 | HU | 65 | 985 | POINT (-56.60000 38.20000) |
126018 | 2023 | FRANKLIN | 2023-09-01 15:00:00 | 38.5994 | -55.7943 | HU | 67 | 982 | POINT (-55.79430 38.59940) |
126019 | 2023 | FRANKLIN | 2023-09-01 18:00:00 | 39.0000 | -54.9000 | EX | 70 | 979 | POINT (-54.90000 39.00000) |
117 rows × 9 columns
Note that geometry appears as a new column (Series).
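If you’d like to poke at the new column, here’s a quick optional sketch using two standard GeoPandas attributes: the geometry Series itself, and the overall bounding box of the points.
# Optional checks on the new geometry column.
gdf.geometry.head()   # the Series of shapely Point objects
gdf.total_bounds      # bounding box as [minx, miny, maxx, maxy]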
We can interactively explore this DataFrame as a map in the browser!
gdf.explore()
Well … we have an interactive frame … and it looks like there’s a track in there … but where is the interactive map??
We still have a little more work to do:
While the points certainly look like latitudes and longitudes, we need to explicitly assign a coordinate reference system (CRS) to the DataFrame before we can view it on a map. One way is to assign a CRS via its EPSG code … in this case, EPSG 4326.
Note the arguments to set_crs:
- epsg=4326: Assign the specified CRS
- inplace=True: The gdf object is updated in place, without the need to assign a new DataFrame object
- allow_override=True: If a CRS had previously been applied, override it with the EPSG value specified
gdf.set_crs(epsg=4326, inplace=True, allow_override=True)
  | SEASON | NAME | ISO_TIME | LAT | LON | USA_STATUS | USA_WIND | USA_PRES | geometry
---|---|---|---|---|---|---|---|---|---|
125903 | 2023 | FRANKLIN | 2023-08-18 06:00:00 | 11.6000 | -53.5000 | DB | 15 | POINT (-53.50000 11.60000) | |
125904 | 2023 | FRANKLIN | 2023-08-18 09:00:00 | 11.5900 | -53.8225 | DB | 15 | POINT (-53.82250 11.59000) | |
125905 | 2023 | FRANKLIN | 2023-08-18 12:00:00 | 11.6000 | -54.2000 | DB | 15 | POINT (-54.20000 11.60000) | |
125906 | 2023 | FRANKLIN | 2023-08-18 15:00:00 | 11.6425 | -54.6700 | DB | 17 | POINT (-54.67000 11.64250) | |
125907 | 2023 | FRANKLIN | 2023-08-18 18:00:00 | 11.7000 | -55.2000 | DB | 20 | 1008 | POINT (-55.20000 11.70000) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
126015 | 2023 | FRANKLIN | 2023-09-01 06:00:00 | 37.4000 | -57.9000 | HU | 70 | 983 | POINT (-57.90000 37.40000) |
126016 | 2023 | FRANKLIN | 2023-09-01 09:00:00 | 37.8122 | -57.2549 | HU | 67 | 984 | POINT (-57.25490 37.81220) |
126017 | 2023 | FRANKLIN | 2023-09-01 12:00:00 | 38.2000 | -56.6000 | HU | 65 | 985 | POINT (-56.60000 38.20000) |
126018 | 2023 | FRANKLIN | 2023-09-01 15:00:00 | 38.5994 | -55.7943 | HU | 67 | 982 | POINT (-55.79430 38.59940) |
126019 | 2023 | FRANKLIN | 2023-09-01 18:00:00 | 39.0000 | -54.9000 | EX | 70 | 979 | POINT (-54.90000 39.00000) |
117 rows × 9 columns
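We can confirm the assignment by inspecting the crs attribute. As an aside, the same CRS could also have been supplied when the GeoDataFrame was first constructed; the sketch below shows that variant (gdf2 is just a hypothetical, throwaway name).
# Confirm the CRS now attached to the GeoDataFrame.
print(gdf.crs)

# Equivalent construction-time assignment (gdf2 is a throwaway name).
gdf2 = gpd.GeoDataFrame(
    dfFranklin,
    geometry=gpd.points_from_xy(dfFranklin.LON, dfFranklin.LAT),
    crs="EPSG:4326",
)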
Now, let’s try the explore function again!
gdf.explore()
We can pan, zoom, and hover over each point … hovering shows the values of all the columns in the DataFrame.
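If the full-column tooltip feels noisy, explore also accepts tooltip (and popup) keyword arguments that take a list of column names. Here’s a minimal sketch; the choice of columns is arbitrary.
# Trim the hover tooltip to a few columns (the column choice here is arbitrary);
# popup=True shows the full record when a point is clicked.
gdf.explore(tooltip=['NAME', 'ISO_TIME', 'USA_WIND'], popup=True)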
Now, let’s select just one column from the Dataframe and explore once again.
gdf.explore(column='USA_WIND')
Notice that the values aren’t in numerical order … they’re being treated as strings. Let’s explicitly set the USA_WIND column to be an integer (a 16-bit size is fine).
gdf['USA_WIND'] = gdf['USA_WIND'].astype('int16')
gdf.explore('USA_WIND')
By default, passing in one column of numerical values will color-code each value!
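The look of the color-coding can be adjusted as well: explore accepts a Matplotlib colormap name via cmap and a legend toggle via legend. A quick sketch, with an arbitrary colormap choice:
# Style the color-coded track: arbitrary colormap, legend left on.
gdf.explore(column='USA_WIND', cmap='plasma', legend=True)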
Things to try
On your own, create Jupyter notebooks and do the following:
1. Select another storm from the historical record and plot its track.
2. Instead of just one specific named storm, create a DataFrame including all storms from a particular year … or multiple years. Then plot the tracks of all storms from that year. Can you find a way to clearly show which track goes with each storm? (A starting sketch follows this list.)
3. Examine the GeoPandas website and experiment with different visualizations, either from the IBTrACS database or another dataset of interest to you.
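As a nudge for the second item, here is a minimal sketch that assumes an arbitrarily chosen season; coloring by the NAME column is one simple way to tell the tracks apart.
# A starting point for exercise 2: all storms from one (arbitrary) season,
# color-coded by storm name so each track is distinguishable.
season = 2020
dfYear = df.query('SEASON == @season')
gdfYear = gpd.GeoDataFrame(
    dfYear,
    geometry=gpd.points_from_xy(dfYear.LON, dfYear.LAT),
    crs="EPSG:4326",
)
gdfYear.explore(column='NAME')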