Skip to article frontmatterSkip to article content

Shapefiles 1: Intro to Shapefiles

Overview:

  1. Read in the state/province Natural Earth shapefile
  2. Explore the shapefile using GeoPandas
  3. Plot the shapefile on a map and overlay with other Natural Earth shapefiles

Prerequisites

ConceptsImportanceNotes
PandasNecessary
CartopyNecessary
  • Time to learn: 30 minutes

What is a shapefile?

From Wikipedia:

The shapefile format is a geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products. The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. Each item usually has attributes that describe it, such as name or temperature.

Rasters versus Shapefiles: By virtue of being a *vector data format*, shapefiles do not becoming fuzzy/pixelated as one zooms in on maps that display features defined by shapefiles. This is in contrast to raster-based data. See the first reference at the end of the notebook for a nice explanation.
Note: We've already been using shapefiles, whenever we've added various Natural Earth backgrounds with Cartopy's feature. We'll examine shapefiles in more detail here, using Cartopy's shapereader.

Imports

import cartopy.crs as ccrs
import cartopy.feature as cfeature
import geopandas as gp
import matplotlib.pyplot as plt
import pandas as pd
import cartopy.io.shapereader as shpreader
from metpy.plots import USCOUNTIES
import glob
import os

Read in a shapefile

Use Cartopy’s shpreader.natural_earth method to read in the highest (10m) resolution states and provinces shapefile. It will be automatically downloaded if not present, and is cached by default in a hidden subdirectory in your home directory.

res = '10m'
category = 'cultural'
name = 'admin_1_states_provinces'
shpfilename = shpreader.natural_earth(res, category, name)
shpfilename
PosixPath('/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.shp')

What files matching this particular political boundary classification do we have?

path = os.path.dirname(shpfilename)
glob.glob(path + "/ne_10m_admin_1_states_provinces.*")
['/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.shp', '/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.dbf', '/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.shx', '/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.prj', '/home11/staff/ktyle/.local/share/cartopy/shapefiles/natural_earth/cultural/ne_10m_admin_1_states_provinces.cpg']

We have five files with different suffixes. The first three must be present (see the second reference at the end of the notebook). Here, we will read in the .shp file, using a Pandas-like library, Geopandas, which creates a geo-referenced Pandas Dataframe.

Explore the shapefile using Geopandas

Display all columns in the dataframe. Then create the dataframe and take a look at the first and last five rows (scroll to see all 84 columns).

pd.set_option('display.max_columns', None)
df = gp.read_file(shpfilename)
df
Loading...

Examine the column names more concisely.

df.columns
Index(['featurecla', 'scalerank', 'adm1_code', 'diss_me', 'iso_3166_2', 'wikipedia', 'iso_a2', 'adm0_sr', 'name', 'name_alt', 'name_local', 'type', 'type_en', 'code_local', 'code_hasc', 'note', 'hasc_maybe', 'region', 'region_cod', 'provnum_ne', 'gadm_level', 'check_me', 'datarank', 'abbrev', 'postal', 'area_sqkm', 'sameascity', 'labelrank', 'name_len', 'mapcolor9', 'mapcolor13', 'fips', 'fips_alt', 'woe_id', 'woe_label', 'woe_name', 'latitude', 'longitude', 'sov_a3', 'adm0_a3', 'adm0_label', 'admin', 'geonunit', 'gu_a3', 'gn_id', 'gn_name', 'gns_id', 'gns_name', 'gn_level', 'gn_region', 'gn_a1_code', 'region_sub', 'sub_code', 'gns_level', 'gns_lang', 'gns_adm1', 'gns_region', 'min_label', 'max_label', 'min_zoom', 'wikidataid', 'name_ar', 'name_bn', 'name_de', 'name_en', 'name_es', 'name_fr', 'name_el', 'name_hi', 'name_hu', 'name_id', 'name_it', 'name_ja', 'name_ko', 'name_nl', 'name_pl', 'name_pt', 'name_ru', 'name_sv', 'name_tr', 'name_vi', 'name_zh', 'ne_id', 'geometry'], dtype='object')
Tip: Countries are in the DataFrame's admin column. We can chain together Pandas` methods such as sort_values and unique to show all of the countries included in the shapefile.
df.admin.sort_values().unique()
array(['Afghanistan', 'Akrotiri Sovereign Base Area', 'Aland', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Anguilla', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Ashmore and Cartier Islands', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Barbados', 'Baykonur Cosmodrome', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'British Indian Ocean Territory', 'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Caribbean Netherlands', 'Cayman Islands', 'Central African Republic', 'Chad', 'Chile', 'China', 'Clipperton Island', 'Colombia', 'Comoros', 'Cook Islands', 'Coral Sea Islands', 'Costa Rica', 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', 'Democratic Republic of the Congo', 'Denmark', 'Dhekelia Sovereign Base Area', 'Djibouti', 'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Falkland Islands', 'Faroe Islands', 'Federated States of Micronesia', 'Fiji', 'Finland', 'France', 'French Polynesia', 'French Southern and Antarctic Lands', 'Gabon', 'Gambia', 'Gaza Strip', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guam', 'Guatemala', 'Guernsey', 'Guinea', 'Guinea Bissau', 'Guyana', 'Haiti', 'Heard Island and McDonald Islands', 'Honduras', 'Hong Kong S.A.R.', 'Hungary', 'Iceland', 'India', 'Indian Ocean Territories', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Ivory Coast', 'Jamaica', 'Japan', 'Jersey', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macau S.A.R', 'Macedonia', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico', 'Moldova', 'Monaco', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Norfolk Island', 'North Korea', 'Northern Cyprus', 'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines', 'Pitcairn Islands', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Serbia', 'Republic of the Congo', 'Romania', 'Russia', 'Rwanda', 'S. Sudan', 'Saint Barthelemy', 'Saint Helena', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Martin', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'San Marino', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Seychelles', 'Siachen Glacier', 'Sierra Leone', 'Singapore', 'Sint Maarten', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'Somaliland', 'South Africa', 'South Georgia and the Islands', 'South Korea', 'Spain', 'Spratly Is.', 'Sri Lanka', 'Sudan', 'Suriname', 'Swaziland', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Thailand', 'The Bahamas', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'US Naval Base Guantanamo Bay', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United Republic of Tanzania', 'United States Minor Outlying Islands', 'United States Virgin Islands', 'United States of America', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Vatican', 'Venezuela', 'Vietnam', 'Wallis and Futuna', 'West Bank', 'Western Sahara', 'Yemen', 'Zambia', 'Zimbabwe'], dtype=object)

Let’s display all rows that are in the USA.

df.loc[df.admin == 'United States of America']
Loading...
Tip: The name column corresponds to the states.

Examine the Geometry column

When Geopandas creates a dataframe from a Shapefile, a key column that gets added is the geometry one. Here we can see that each of the states have Geometry features consisting of Polygons or Multipolygons

df.loc[df.name == 'Colorado']['geometry']
3277 POLYGON ((-109.04633 40.99983, -108.88932 40.9... Name: geometry, dtype: geometry
df.loc[df.name == 'New York']['geometry']
1307 MULTIPOLYGON (((-74.69588 44.99803, -74.59611 ... Name: geometry, dtype: geometry
Question: Notice that Colorado's geometry is POLYGON while New York is MULTIPOLYGON. Can you think of a reason for the distinction? (Hint: are any islands part of the state?)

Make a Geopandas dataframe for New York State

dfNYS = df.loc[df.name == 'New York']

As with any Pandas Dataframe, we can directly call its plot method for a quick view.

dfNYS.plot(figsize=(12,9))
<Axes: >
<Figure size 1200x900 with 1 Axes>

We can create a GeoPandas series consisting only of the Geometry column.

geoNY = df.loc[df.name == 'New York']['geometry']
geoNY
1307 MULTIPOLYGON (((-74.69588 44.99803, -74.59611 ... Name: geometry, dtype: geometry

Examine the values attribute of this Geopandas geometry object:

geoNY.values
<GeometryArray> [<MULTIPOLYGON (((-74.696 44.998, -74.596 44.998, -74.496 44.999, -74.397 44....>] Length: 1, dtype: geometry
Note: The object begins with shapely. It stems from the Shapely library. Although we did not explicitly list this library in our list of imports, it was imported as a result of the import cartopy.io.shapereader as shpreader statement.

If we take just the first (and only element) of this array, what do we get?

nysPoly = geoNY.values[0]
nysPoly
Loading...

Plot the shapefile

We can now add this object to a Cartopy-referenced Matplotlib plot using Matplotlib’s add_geometries method. First we’ll comment out all of the Natural Earth shapefiles we typically use. This will produce just a map of NYS.

fig = plt.figure(figsize=(15,10))
ax = plt.axes(projection=ccrs.PlateCarree())
plt.title('NYS')
ax.add_geometries(nysPoly, ccrs.PlateCarree(),
                  edgecolor='black', facecolor='tab:blue', alpha=0.5)
ax.set_extent([-80, -72,40.5,45.2], ccrs.PlateCarree());
<Figure size 1500x1000 with 1 Axes>

Add additional features to the plot

Now, add some of Cartopy’s Natural Earth features to the plot. We should see that with the exception of the counties (which defaults to a slightly lower resolution (20m) than the others) everything should line up nicely. We’ll slightly expand the map extent so we’ll see neighboring state/province outlines.

res = '10m'
fig = plt.figure(figsize=(15,10))
ax = plt.axes(projection=ccrs.PlateCarree())

plt.title('NYS')
ax.coastlines(resolution=res)
ax.add_geometries(nysPoly, ccrs.PlateCarree(),
                  edgecolor='black', facecolor='tab:blue', alpha=0.5)
ax.add_feature (cfeature.LAKES.with_scale(res), alpha = 0.5)
ax.add_feature (cfeature.STATES.with_scale(res))
ax.add_feature(USCOUNTIES,edgecolor='grey',linewidth=1 );
ax.set_extent([-85, -67,37.5,49.5], ccrs.PlateCarree());
<Figure size 1500x1000 with 1 Axes>
Tip: The USCOUNTIES shapefile also has a 5m version. To use it, append .with_scale('5m') to USCOUNTIES.

Summary

  • Matplotlib and Cartopy, via the Shapely library, provides support for reading and displaying shapefiles.
  • Shapefiles are a form of vector graphics; unlike rasters, they do not degrade as one zooms into them.
  • The Geopandas library treats shapefiles as geo-referenced Pandas dataframes.

What’s Next?

We’ll return to cloud-optimized geotiffs, specifically those that are tile-based, with more detail automatically displayed as one zooms in to a particular region.