{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas 4: Working with date- and time-based data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview \n",
"\n",
"In this notebook, we'll work with Pandas `DataFrame` and `Series` objects to do the following:\n",
"1. Work with Pandas' implementation of methods and attributes from Python's `datetime` library\n",
"1. Relabel a Series from a column whose values are date and time strings\n",
"1. Employ a `lambda` function to convert date/time strings to `datetime` objects\n",
"1. Use Pandas' built-in `plot` function to generate a basic time series plot\n",
"1. Improve the look of the time series plot by using Matplotlib\n",
"\n",
"We'll once again use NYS Mesonet data, but for the entire day of 2 September 2021."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prerequisites\n",
"\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"| Matplotlib | Necessary | |\n",
"| Datetime | Helpful | |\n",
"| Pandas | Necessary | Notebooks 1-3 |\n",
"\n",
"* **Time to learn**: 30 minutes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"from datetime import datetime"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create a `DataFrame` objects from a csv file that contains NYSM observational data. Choose the station ID as the row index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dataFile = '/spare11/atm533/data/nysm_data_20210902.csv'\n",
"nysm_data = pd.read_csv(dataFile,index_col='station')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Examine the `nysm_data` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Work with Pandas' implementation of methods and attributes from Python's `datetime` library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Tip: For a background on the use of datetime in Python, please check out this notebook from ATM350
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Relabel a Series from a column whose values are date and time strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### First, let's load 5-minute accumulated precipitation for the Manhattan site."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the column and row of interest\n",
"prcpMANH = nysm_data['precip_incremental [mm]'].loc['MANH']\n",
"prcpMANH"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Next, let's inspect the column correpsonding to date and time from the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timeSer = nysm_data['time']\n",
"timeSer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The *dtype: object* signifies that the values for *time* are being treated as a *string*. When working with time-based arrays, we want to treat them differently than a generic string type ... instead, let's treat them as `datetime` objects (derived from NumPy: see reference at end of notebook)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's look at the output after converting the `Series` from string to `datetime`. To do that, we'll use the `to_datetime` method in Pandas. We pass in the Series, which consists of an array of strings, and then specify how the strings are *formatted*. See the reference at the end of the notebook for a guide to formatting date/time strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.to_datetime(timeSer, format = \"%Y-%m-%d %H:%M:%S UTC\", utc=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the `dtype` of the Series has changed to `datetime64`, with precision to the nanosecond level and a timezone of UTC."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the use of a `lambda` function, we can accomplish the string-->datetime conversion directly in the call to `read_csv`. We'll also now set the row index to be time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First define the format and then define the function\n",
"format = \"%Y-%m-%d %H:%M:%S UTC\"\n",
"# This function will iterate over each string in a 1-d array \n",
"# and use Pandas' implementation of strptime to convert the string into a datetime object.\n",
"parseTime = lambda x: datetime.strptime(x, format)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remind ourselves of how Pandas' `read_csv` method works:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_csv?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___\n",
"#### Re-create the *nysm_data* `DataFrame`, with appropriate additional arguments to `read_csv` (including our `lambda` function, via the `date_parser` argument)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data = pd.read_csv(dataFile,index_col=1,parse_dates=['time'], date_parser=parseTime)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nysm_data.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Now *time* is the `DataFrame`'s row index. Let's inspect this index; it's much like a generic Pandas `RangeIndex`, but specific for date/time purposes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timeIndx = nysm_data.index\n",
"timeIndx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Note that the `timezone` is missing. The `read_csv` method does not provide a means to specify the timezone. We can take care of that though with the `tz_localize` method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timeIndx = timeIndx.tz_localize(tz='UTC')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timeIndx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### If this were a `Series`, not an index, use this `Series`-specific method instead:\n",
"`timeIndx= timeIndx.dt.tz_localize(tz='UTC')`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Since it's a `datetime` object now, we can apply all sorts of time/date operations to it. For example, let's convert to Eastern time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"timeIndx = timeIndx.tz_convert(tz='US/Eastern')\n",
"timeIndx"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### (Yes, it automatically accounts for Standard or Daylight time!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Use Pandas' built-in `plot` function ... which leverages Matplotlib:\n",
"#### Select all the rows for site MANH"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"condition = nysm_data['station'] == 'MANH'\n",
"MANH = nysm_data.loc[condition]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate a basic time series plot by passing the desired column to Pandas' `plot` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prcp = MANH['precip_incremental [mm]']\n",
"prcp.plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### That was a way to get a quick look at the data and verify it looks reasonable. Now, let's pretty it up by using Matplotlib functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### We'll draw a line plot, passing in time and wind gust speed for the x- and y-axes, respectively. Follow the same procedure as we did in the Matplotlib notebooks from week 3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.style.use(\"seaborn\")\n",
"fig = plt.figure(figsize=(11,8.5))\n",
"ax = fig.add_subplot(1,1,1) \n",
"ax.set_xlabel ('Date and Time')\n",
"ax.set_ylabel ('5-min accum. precip (mm)')\n",
"ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n",
"ax.plot (timeIndx, prcp)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Didn't work!!! Look at the error message above!\n",
"#### This is a *mismatch* between array sizes. The time index is based on the entire Dataframe, which has 12 x 24 x 126 rows, while the Manhattan precip array is only 12 x 24!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Let's set a condition where we match only those times that are in the same row as the Manhattan station id."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"condition = nysm_data['station'] == 'MANH'\n",
"timeIndxMANH = timeIndx[condition]\n",
"timeIndxMANH"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.style.use(\"seaborn\")\n",
"fig = plt.figure(figsize=(11,8.5))\n",
"ax = fig.add_subplot(1,1,1) \n",
"ax.set_xlabel ('Date and Time')\n",
"ax.set_ylabel ('5-min accum. precip (mm)')\n",
"ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n",
"ax.plot (timeIndxMANH, prcp);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### That's looking better! We still have work to do to improve the labeling of the x-axis tick marks, but we'll save that for another time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
" Explore further: Try making plots of other NYSM variables, from different NYSM sites.
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Summary\n",
"\n",
"* Use a `lambda` function to convert Date/time strings into Python `datetime` objects\n",
"* Pandas' `plot` method allows for a quick visualization of `DataFrame` and `Series` objects. \n",
"* x- and y- arrays must be of the same size in order to be plotted.\n",
"\n",
"### What's Next?\n",
"Coming up next week, we will conclude our exploration of Pandas.\n",
"\n",
"## Resources and References\n",
"1. [`datetime`objects in NumPy arrays](https://numpy.org/doc/stable/reference/arrays.datetime.html)\n",
"1. [Date/time string formatting guide](https://strftime.org/)\n",
"1. [Use of a `lambda` function in `read_csv` (Corey Schafer YouTube channel)](https://www.youtube.com/watch?v=UFuo7EHI8zc&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=10&ab_channel=CoreySchafer)\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 August 2022 Environment",
"language": "python",
"name": "aug22"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}