
NLDN CSV to Parquet

This notebook converts monthly CSV-formatted files of real-time lightning flashes/strikes as reported by the National Lightning Detection Network (NLDN, originally developed in our department in the 1980s!) into Parquet (a binary format ideally suited for tabular datasets, especially when hosted on cloud platforms) and then visualizes data from the Parquet file.

We will also compare the performance (i.e., time to execute) of reading in these files, and of code cells in general, using one of IPython's cell magics.
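As a quick illustration (a toy example, separate from the NLDN workflow), the %%time cell magic we will use reports a cell's CPU and wall-clock time once it finishes:

%%time
# Toy example: sum ten million integers and report CPU and wall-clock time
total = sum(range(10_000_000))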

Imports

import pandas as pd
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import geopandas as gpd
import polars as pl

Specify the first day of a month.

current = datetime(2025,8,1)
month = current.strftime("%Y%m")
month
'202508'
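Although we only process a single month here, the relativedelta import above comes in handy if you ever want to step through several months. A minimal sketch (not used later in this notebook):

# Advance to the first day of the following month
next_month = current + relativedelta(months=1)
next_month.strftime("%Y%m")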

Define an object that points to the August 2025 NLDN data file in CSV format.

NLDN_csv = f'/spare11/atm533/data/NLDN_{month}.txt'

Take a peek at the first five lines of the file, using the ! directive to execute a Linux command as if we were typing it on the command line.

! head -5 $NLDN_csv
2025-08-01 00:00:00.002955264  28.2501  -81.1814    +5.4  0  C  0.5  0.2  83   0.7
2025-08-01 00:00:00.005760000  31.6814  -92.1643    -5.6  0  C  0.2  0.2 102   0.1
2025-08-01 00:00:00.019553280  31.6591  -92.1704   -14.5  0  G  0.2  0.2  14   0.6
2025-08-01 00:00:00.052648960  43.1471 -119.1931    -8.5  0  G  0.2  0.2 113   0.5
2025-08-01 00:00:00.054249472  31.3856 -109.6574    +6.4  0  C  0.3  0.2 175   0.2

We see that the file has several columns, each separated by one or more blank spaces (often termed whitespace). Since Pandas' read_csv function defaults to expecting commas (,) as the column separator, we will need to explicitly pass "one or more whitespace characters" as the value of the sep argument. Although not intuitive, that value is the regular expression '\\s+'.

There is no header row at the beginning of the file, so we will also need to define a list of abbreviated column names, one for each column.

colNames = ['Date', 'Time', 'Lat', 'Lon', 'Cur', 'Mul', 'CG', 'Chi', 'Maj', 'Axis', 'Min']

Use the %%time cell magic to determine how long the cell takes to execute. As we will see, the file has many millions of rows, so it will take a little while!

%%time
# Read in the CSV file
df = pd.read_csv(NLDN_csv, sep='\\s+', names=colNames)
CPU times: user 48.1 s, sys: 7.47 s, total: 55.6 s
Wall time: 55.8 s
df

Another library gaining traction in the Pangeo ecosystem is Polars. Examine how its read_csv function compares and contrasts with Pandas'.

pl.read_csv?

Let’s try reading in the file with Polars. ChatGPT recommended that we express "one or more blank spaces" as r"\s+", so let’s see if it succeeds. SPOILER ALERT: It will fail!

# Read in the CSV file
df = pl.read_csv(f'/spare11/atm533/data/NLDN_{month}.txt' ,separator=r"\s+", new_columns = colNames, has_header=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 2
      1 # Read in the CSV file
----> 2 df = pl.read_csv(f'/spare11/atm533/data/NLDN_{month}.txt' ,separator=r"\s+", new_columns = colNames, has_header=False)

File /knight/pixi/sep24/sep25_env-5711917176091715257/envs/default/lib/python3.13/site-packages/polars/_utils/deprecation.py:128, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    123 @wraps(function)
    124 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    125     _rename_keyword_argument(
    126         old_name, new_name, kwargs, function.__qualname__, version
    127     )
--> 128     return function(*args, **kwargs)

File /knight/pixi/sep24/sep25_env-5711917176091715257/envs/default/lib/python3.13/site-packages/polars/_utils/deprecation.py:128, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    123 @wraps(function)
    124 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    125     _rename_keyword_argument(
    126         old_name, new_name, kwargs, function.__qualname__, version
    127     )
--> 128     return function(*args, **kwargs)

File /knight/pixi/sep24/sep25_env-5711917176091715257/envs/default/lib/python3.13/site-packages/polars/_utils/deprecation.py:128, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    123 @wraps(function)
    124 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    125     _rename_keyword_argument(
    126         old_name, new_name, kwargs, function.__qualname__, version
    127     )
--> 128     return function(*args, **kwargs)

File /knight/pixi/sep24/sep25_env-5711917176091715257/envs/default/lib/python3.13/site-packages/polars/io/csv/functions.py:279, in read_csv(source, has_header, columns, new_columns, separator, comment_prefix, quote_char, skip_rows, skip_lines, schema, schema_overrides, null_values, missing_utf8_is_empty_string, ignore_errors, try_parse_dates, n_threads, infer_schema, infer_schema_length, batch_size, n_rows, encoding, low_memory, rechunk, use_pyarrow, storage_options, skip_rows_after_header, row_index_name, row_index_offset, sample_size, eol_char, raise_if_empty, truncate_ragged_lines, decimal_comma, glob)
     46 @deprecate_renamed_parameter("dtypes", "schema_overrides", version="0.20.31")
     47 @deprecate_renamed_parameter("row_count_name", "row_index_name", version="0.20.4")
     48 @deprecate_renamed_parameter("row_count_offset", "row_index_offset", version="0.20.4")
   (...)     86     glob: bool = True,
     87 ) -> DataFrame:
     88     r"""
     89     Read a CSV file into a DataFrame.
     90 
   (...)    277     └─────┴─────────┴────────────┘
    278     """
--> 279     _check_arg_is_1byte("separator", separator, can_be_empty=False)
    280     _check_arg_is_1byte("quote_char", quote_char, can_be_empty=True)
    281     _check_arg_is_1byte("eol_char", eol_char, can_be_empty=False)

File /knight/pixi/sep24/sep25_env-5711917176091715257/envs/default/lib/python3.13/site-packages/polars/io/csv/_utils.py:28, in _check_arg_is_1byte(arg_name, arg, can_be_empty)
     23 elif arg_byte_length != 1:
     24     msg = (
     25         f'{arg_name}="{arg}" should be a single byte character, but is'
     26         f" {arg_byte_length} bytes long"
     27     )
---> 28     raise ValueError(msg)

ValueError: separator="\s+" should be a single byte character, but is 3 bytes long

Why did it fail, and will it ever be an option? Check out this GitHub Issue thread with a response from one of the main developers of Polars.
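Until (or unless) multi-character separators are supported, one possible workaround, sketched below, is to let Pandas do the whitespace-delimited parsing and then hand the result to Polars via pl.from_pandas. This is purely illustrative; we do not need it for the rest of the notebook.

# Workaround sketch: parse with Pandas (already done above), then convert to a Polars DataFrame
df_pl = pl.from_pandas(df)
df_pl.head()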

For large tabular datasets, the convenience of having them represented in a human-readable format such as CSV is outweighed by larger file sizes, slower read performance, greater system memory usage, and (as we have just seen) lack of support by Polars when columns are separated in non-standard ways. A binary format that solves these issues is Parquet.

Convert to Parquet format. Set the output directory and file name.

outputDir = '.' # Use the current directory
parqFile = f'{outputDir}/NLDN_{month}_full.parquet'

Use one of Pandas' file output methods (in this case, to_parquet) to perform the conversion and write the result to disk.

%%time
df.to_parquet(parqFile)
CPU times: user 20.4 s, sys: 1.63 s, total: 22 s
Wall time: 30.2 s
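One of the claimed advantages of Parquet is smaller file sizes; we can check that directly with the paths defined earlier (a quick sanity check, not part of the conversion itself):

import os

# Compare the on-disk sizes of the original CSV and the new Parquet file, in megabytes
print(f"CSV:     {os.path.getsize(NLDN_csv) / 1e6:.1f} MB")
print(f"Parquet: {os.path.getsize(parqFile) / 1e6:.1f} MB")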

Read in the Parquet file to verify that all looks good. First, we’ll read it in with Pandas. We’ll redefine the dataframe object so we avoid keeping multiple copies of this fairly large dataset in system memory.

%%time 

df = pd.read_parquet(parqFile)
CPU times: user 17.6 s, sys: 8.66 s, total: 26.3 s
Wall time: 23.7 s
%%time 

df_polars = pl.read_parquet(parqFile)
CPU times: user 4.95 s, sys: 8.29 s, total: 13.2 s
Wall time: 679 ms

How did the load times compare? Try re-running the cells, and also try re-running them in a different order. Does the comparison change as a result?
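For a more robust comparison than a single run, the %timeit line magic repeats a statement and averages the results; the -n and -r options control the number of loops and repeats (a sketch; note that each repetition re-reads the file):

# Time each reader over a few repetitions rather than a single run
%timeit -n 1 -r 3 pd.read_parquet(parqFile)
%timeit -n 1 -r 3 pl.read_parquet(parqFile)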

Examine the representations of these two dataframes. What do you notice that’s similar and different between them?

df
df_polars
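One concrete thing to compare is how each library reports its column data types (a quick check, not part of the original workflow):

# Pandas dtypes vs. the Polars schema
print(df.dtypes)
print(df_polars.schema)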

Next, let’s recast and combine some of the columns, particularly those related to date and time.

%%time

# Combine the date and time columns into a single series, and then drop the original two series.

dateTimeObj = df['Date'] + ' ' + df['Time']

# Create a new column, now a complete DateTime series
df['DateTime'] = pd.to_datetime(dateTimeObj,format="%Y-%m-%d %H:%M:%S.%f")

# Drop the original two series, as well as some others we are not interested in. This also reduces memory usage!

df = df.drop(columns=['Date','Time','Mul','Chi','Maj','Axis','Min'])

# Reorder the columns

cols = df.columns.tolist()
cols = cols[-1:] + cols[:-1]
df = df[cols]

# Convert the `CG` column into booleans, following https://statisticsglobe.com/convert-string-boolean-pandas-dataframe-column-python .

CGstr = df['CG'].str.strip() # Strip off any leading / trailing whitespace
df['CG'] = CGstr.map({'G': True, 'C': False})

# View the modified dataframe
df
date_time = df['DateTime']
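If you’re curious how much memory dropping those columns actually saved, Pandas can report the dataframe’s footprint, including its string columns (a quick check that the rest of the notebook does not depend on):

# Report per-column memory usage, including object (string) columns
df.info(memory_usage='deep')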

Let’s specify a time range over which we wish to sample lightning events.

sd = datetime(2025, 8, 15, 21)
ed = datetime(2025, 8, 15, 22)

From the dataframe, select only those times that fall within that range. We’ll use Pandas' query method, which uses database-like syntax to perform subsetting quickly. We’ll further restrict the selection to cloud-to-ground strikes (as opposed to intra-/inter-cloud flashes). An equivalent boolean-mask version is sketched below for comparison.

%%time
df = df.query('@date_time >= @sd & @date_time <= @ed')
CG = df['CG']
df_subset = df.query('@CG == True')
CPU times: user 574 ms, sys: 15.2 ms, total: 589 ms
Wall time: 532 ms
df_subset
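As promised, here is equivalent subsetting using ordinary boolean masks instead of query; both approaches should return the same rows (a sketch for comparison only):

# Equivalent subsetting with boolean masks rather than query()
df_subset_masked = df[(df['DateTime'] >= sd) & (df['DateTime'] <= ed) & (df['CG'] == True)]
len(df_subset_masked) == len(df_subset)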

Now, georeference this dataframe using GeoPandas.

%%time
lons, lats = df_subset.Lon, df_subset.Lat

gdf = gpd.GeoDataFrame(df_subset,geometry=gpd.points_from_xy(lons,lats))
CPU times: user 78.5 ms, sys: 2.38 ms, total: 80.9 ms
Wall time: 80.6 ms

In order for the data to render properly on an interactive map, we must assign a coordinate reference system to it. Since the coordinates represent latitude and longitude in degrees, we’ll use the WGS84 latitude-longitude coordinate system, a.k.a. EPSG:4326.

gdf.set_crs(epsg=4326, inplace=True, allow_override=True)

# Drop the Lat and Lon columns since they are now handled by the geometry column
gdf = gdf.drop(columns=['Lat', 'Lon'])

gdf

Note the transformed dataframe. What do you think the geometry column represents?

Note! It takes quite a while to render a large number of points on this interactive map. If your subsetted dataframe has more than 100,000 rows, you will definitely want to further restrict your time range, or consider subsetting your geographical extent as well.
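One way to subset the geographical extent is GeoPandas' .cx indexer, which clips a GeoDataFrame to a bounding box. The longitude/latitude bounds below are arbitrary and purely illustrative:

# Clip to a bounding box: .cx[min_lon:max_lon, min_lat:max_lat]
gdf_clip = gdf.cx[-95:-80, 25:40]
len(gdf_clip)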

Finally, let’s interactively visualize the lightning strikes! You can zoom in/out, pan around, and mouse over individual strikes!

gdf.explore()

This is a fairly memory-intensive notebook. Please make sure you close it and shut down its kernel when you are done running it!
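If you want to keep the session open but free up memory first, one option (optional, and only a sketch) is to delete the large objects explicitly and nudge Python's garbage collector:

import gc

# Delete the large dataframes and prompt the garbage collector to reclaim their memory
del df, df_polars, df_subset, gdf
gc.collect()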