# Pandas 4: Working with date- and time-based data

<center><img src="https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg" alt="pandas Logo" style="width: 800px;"/></center>


---

## Overview 

In this notebook, we'll work with Pandas `DataFrame` and `Series` objects to do the following:
1. Work with Pandas' implementation of methods and attributes from Python's `datetime` library
1. Relabel a Series from a column whose values are date and time strings
1. Employ a `lambda` function to convert date/time strings to `datetime` objects
1. Use Pandas' built-in `plot` function to generate a basic time series plot
1. Improve the look of the time series plot by using Matplotlib

We'll once again use NYS Mesonet data, but for the entire day of 2 September 2021.

## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Matplotlib  | Necessary | |
| Datetime | Helpful | |
| Pandas  | Necessary | Notebooks 1-3 |

* **Time to learn**: 30 minutes

___

## Imports

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

#### Create a `DataFrame` object from a CSV file that contains NYSM observational data. Choose the station ID as the row index.

In [None]:
dataFile = '/spare11/atm533/data/nysm_data_20210902.csv'
nysm_data = pd.read_csv(dataFile,index_col='station')

### Examine the `nysm_data` object.

In [None]:
nysm_data

#### Work with Pandas' implementation of methods and attributes from Python's `datetime` library

<div class="alert alert-info">
    <b>Tip:</b> For background on the use of <code>datetime</code> in Python, please check out <a href="https://www.atmos.albany.edu/facstaff/ktyle/atm350/core/week7/Datetime.html">this notebook from ATM350</a>.</div>

#### Relabel a Series from a column whose values are date and time strings

### First, let's load 5-minute accumulated precipitation for the Manhattan site.

In [None]:
# Select the column and row of interest
prcpMANH = nysm_data['precip_incremental [mm]'].loc['MANH']
prcpMANH
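
The selection pattern above can be tried on a tiny made-up table. The sketch below assumes a hypothetical two-station `DataFrame` named `demo` (all values are invented); it only illustrates that selecting a column returns a `Series` that keeps the station index, and that `.loc` with a repeated label returns every matching row.

```python
import pandas as pd

# Hypothetical miniature of the NYSM table; the values are invented for illustration.
demo = pd.DataFrame(
    {
        "station": ["MANH", "MANH", "ALBA", "ALBA"],
        "time": ["2021-09-02 00:00:00 UTC", "2021-09-02 00:05:00 UTC"] * 2,
        "precip_incremental [mm]": [0.0, 1.2, 0.0, 0.3],
    }
).set_index("station")

# Selecting one column gives a Series that keeps the station index ...
precip = demo["precip_incremental [mm]"]
# ... and .loc with a label that occurs several times returns every matching row.
precip_manh = precip.loc["MANH"]
print(precip_manh)
```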

#### Next, let's inspect the column corresponding to date and time from the DataFrame.

In [None]:
timeSer = nysm_data['time']
timeSer

#### The *dtype: object* signifies that the values for *time* are being treated as *strings*. When working with time-based arrays, we want to treat them differently than generic strings. Instead, let's treat them as `datetime` objects (Pandas builds these on NumPy's `datetime64` type: see reference at end of notebook).

First, let's look at the output after converting the `Series` from string to `datetime`. To do that, we'll use the `to_datetime` function in Pandas. We pass in the Series, which consists of an array of strings, and then specify how the strings are *formatted*. See the reference at the end of the notebook for a guide to formatting date/time strings.

In [None]:
pd.to_datetime(timeSer, format = "%Y-%m-%d %H:%M:%S UTC", utc=True)

Notice that the `dtype` of the Series has changed to `datetime64`, with precision to the nanosecond level and a timezone of UTC.
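
To see what the conversion buys us, here is a minimal sketch on two invented strings that follow the same layout as the NYSM *time* column. The names `raw` and `converted` are illustrative only and are not used elsewhere in this notebook.

```python
import pandas as pd

# Invented sample strings laid out like the values in the NYSM "time" column.
raw = pd.Series(["2021-09-02 00:00:00 UTC", "2021-09-02 00:05:00 UTC"])
print(raw.dtype)        # a string dtype, before conversion

converted = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC", utc=True)
print(converted.dtype)  # a timezone-aware datetime64 dtype

# Datetime values support arithmetic and attribute access that strings do not.
print(converted.iloc[1] - converted.iloc[0])  # the spacing between observations
print(converted.dt.hour.tolist())             # the hour of each observation
```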

With the use of a `lambda` function, we can accomplish the string-->datetime conversion directly in the call to `read_csv`. We'll also now set the row index to be time.

In [None]:
# First define the format string, then define the parsing function
timeFormat = "%Y-%m-%d %H:%M:%S UTC"
# This function converts a single date/time string into a datetime object
# using strptime from Python's datetime library; read_csv applies it to the time column.
parseTime = lambda x: datetime.strptime(x, timeFormat)

Remind ourselves of how Pandas' `read_csv` method works:

In [None]:
pd.read_csv?

___
#### Re-create the *nysm_data* `DataFrame`, with appropriate additional arguments to `read_csv` (including our `lambda` function, via the `date_parser` argument)

In [None]:
nysm_data = pd.read_csv(dataFile,index_col=1,parse_dates=['time'], date_parser=parseTime)

In [None]:
nysm_data.head(2)
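
One caveat: the `date_parser` argument was deprecated in Pandas 2.0 in favor of `date_format`, so the cell above may emit a warning or fail on newer installations. A version-independent alternative, sketched below on an invented in-memory CSV (the `csv_text` and `demo` names are hypothetical), is to read the file first and then convert the column with `to_datetime`. With `utc=True`, the resulting index is already timezone-aware.

```python
import io
import pandas as pd

# Invented CSV text standing in for the NYSM file; column names mirror those used above.
csv_text = """station,time,precip_incremental [mm]
MANH,2021-09-02 00:00:00 UTC,0.0
MANH,2021-09-02 00:05:00 UTC,1.2
ALBA,2021-09-02 00:00:00 UTC,0.0
ALBA,2021-09-02 00:05:00 UTC,0.3
"""

# Read the strings as-is, convert them afterwards, then make time the row index.
demo = pd.read_csv(io.StringIO(csv_text))
demo["time"] = pd.to_datetime(demo["time"], format="%Y-%m-%d %H:%M:%S UTC", utc=True)
demo = demo.set_index("time")
print(demo.index)
```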

#### Now *time* is the `DataFrame`'s row index. Let's inspect this index; it is a `DatetimeIndex`, which plays the same role as a generic Pandas `RangeIndex` but is specialized for date/time purposes:

In [None]:
timeIndx = nysm_data.index
timeIndx

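#### Note that the time zone is missing. The `UTC` in our format string is matched as literal text, so the parsed values are timezone-naive, and the `read_csv` method does not provide a means to specify the time zone. We can take care of that, though, with the `tz_localize` method.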

In [None]:
timeIndx = timeIndx.tz_localize(tz='UTC')

In [None]:
timeIndx

#### <span style="color:red">If this were a `Series`, not an index, we would use this `Series`-specific method (via the `.dt` accessor) instead:</span>
`timeIndx = timeIndx.dt.tz_localize(tz='UTC')`
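
The difference between the index and `Series` forms can be sketched on a small invented range of times. The names below (`naive_index`, `utc_series`, and so on) are illustrative only.

```python
import pandas as pd

# A small timezone-naive index of 5-minute timestamps (invented for illustration).
naive_index = pd.date_range("2021-09-02 00:00", periods=3, freq="5min")
print(naive_index.tz)  # None

# On an index, tz_localize is called directly. It attaches the time zone
# without shifting the clock times.
utc_index = naive_index.tz_localize("UTC")

# On a Series, the same method lives under the .dt accessor.
naive_series = pd.Series(naive_index)
utc_series = naive_series.dt.tz_localize("UTC")
print(utc_index.tz, utc_series.dt.tz)
```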

#### Since it's a `datetime` object now, we can apply all sorts of time/date operations to it. For example, let's convert to Eastern time.

In [None]:
timeIndx = timeIndx.tz_convert(tz='US/Eastern')
timeIndx

#### (Yes, it automatically accounts for Standard or Daylight time!)
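
As a quick check of that claim, the sketch below converts one winter and one summer timestamp (both invented) from UTC to Eastern time. It uses the zone name `America/New_York`, which is assumed here to be interchangeable with `US/Eastern` for this purpose.

```python
import pandas as pd

# Noon UTC on a winter day and on a summer day.
winter = pd.Timestamp("2021-01-15 12:00", tz="UTC")
summer = pd.Timestamp("2021-09-02 12:00", tz="UTC")

winter_local = winter.tz_convert("America/New_York")
summer_local = summer.tz_convert("America/New_York")

print(winter_local)  # 2021-01-15 07:00:00-05:00 (Standard time, UTC-5)
print(summer_local)  # 2021-09-02 08:00:00-04:00 (Daylight time, UTC-4)
```

The conversion changes only how the instant is displayed: `winter_local == winter` is still `True`.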

### Use Pandas' built-in `plot` function ... which leverages Matplotlib:
#### Select all the rows for site MANH

In [None]:
condition = nysm_data['station'] == 'MANH'
MANH = nysm_data.loc[condition]

### Generate a basic time series plot by passing the desired column to Pandas' `plot` method:

In [None]:
prcp = MANH['precip_incremental [mm]']
prcp.plot()

#### That was a way to get a quick look at the data and verify it looks reasonable. Now, let's pretty it up by using Matplotlib functions.

#### We'll draw a line plot, passing in time and 5-minute accumulated precipitation for the x- and y-axes, respectively. Follow the same procedure as we did in the Matplotlib notebooks from week 3.

In [None]:
# Note: this style name was deprecated in Matplotlib 3.6; on newer releases use "seaborn-v0_8"
plt.style.use("seaborn")
fig = plt.figure(figsize=(11,8.5))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel ('Date and Time')
ax.set_ylabel ('5-min accum. precip (mm)')
ax.set_title ("Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida")
ax.plot (timeIndx, prcp)

### <span style="color: red">Didn't work!!! Look at the error message above!</span>
#### This is a *mismatch* between array sizes. The time index is based on the entire `DataFrame`, which has 12 x 24 x 126 rows (12 observations per hour, 24 hours, 126 stations), while the Manhattan precip array has only 12 x 24!

### Let's set a condition where we match only those times that are in the same row as the Manhattan station id.

In [None]:
condition = nysm_data['station'] == 'MANH'
timeIndxMANH = timeIndx[condition]
timeIndxMANH
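
The size mismatch and its fix can be reproduced on a tiny invented table. In this sketch the names `demo`, `mask`, and `times_manh` are hypothetical, and the mask is converted to a plain NumPy array before indexing the index, which is one conservative way to apply it.

```python
import pandas as pd

# Two stations reporting at the same two times (values invented for illustration).
times = pd.to_datetime(["2021-09-02 00:00", "2021-09-02 00:05"] * 2)
demo = pd.DataFrame(
    {"station": ["MANH", "MANH", "ALBA", "ALBA"], "precip": [0.0, 1.2, 0.0, 0.3]},
    index=times,
)

mask = demo["station"] == "MANH"
precip_manh = demo.loc[mask, "precip"]

# The full index covers every station, so it is longer than the one-station Series.
print(len(demo.index), len(precip_manh))  # 4 2

# Applying the same mask to the index yields arrays of matching size.
times_manh = demo.index[mask.to_numpy()]
print(len(times_manh), len(precip_manh))  # 2 2
```

Because the filtered `Series` carries its own index, `precip_manh.index` holds the same times as `times_manh` in this sketch.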

In [None]:
# Note: this style name was deprecated in Matplotlib 3.6; on newer releases use "seaborn-v0_8"
plt.style.use("seaborn")
fig = plt.figure(figsize=(11,8.5))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel ('Date and Time')
ax.set_ylabel ('5-min accum. precip (mm)')
ax.set_title ("Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida")
ax.plot (timeIndxMANH, prcp);

### That's looking better! We still have work to do to improve the labeling of the x-axis tick marks, but we'll save that for another time.

<div class="alert alert-warning">
    <b>Explore further:</b> Try making plots of other NYSM variables, from different NYSM sites.</div>

---
## Summary

* Use a `lambda` function to convert date/time strings into Python `datetime` objects.
* Pandas' `plot` method allows for a quick visualization of `DataFrame` and `Series` objects.
* x- and y- arrays must be of the same size in order to be plotted.

### What's Next?
Coming up next week, we will conclude our exploration of Pandas.

## Resources and References
1. [`datetime` objects in NumPy arrays](https://numpy.org/doc/stable/reference/arrays.datetime.html)
1. [Date/time string formatting guide](https://strftime.org/)
1. [Use of a `lambda` function in `read_csv` (Corey Schafer YouTube channel)](https://www.youtube.com/watch?v=UFuo7EHI8zc&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=10&ab_channel=CoreySchafer)

