{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas 4: Working with date- and time-based data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg\" alt=\"pandas Logo\" style=\"width: 800px;\"/></center>\n",
    "\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview \n",
    "\n",
    "In this notebook, we'll work with Pandas `DataFrame` and `Series` objects  to do the following:\n",
    "1. Work with  Pandas' implementation of methods and attributes from Python's `datetime` library\n",
    "1. Relabel a Series from a column whose values are date and time strings\n",
    "1. Employ a `lambda` function to convert date/time strings to `datetime` objects\n",
    "1. Use Pandas' built-in `plot` function to generate a basic time series plot\n",
    "1. Improve the look of the time series plot by using Matplotlib\n",
    "\n",
    "We'll once again use NYS Mesonet data, but for the entire day of 2 September 2021."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
    "| Matplotlib  | Necessary | |\n",
    "| Datetime | Helpful | |\n",
    "| Pandas  | Necessary | Notebooks 1-3 |\n",
    "\n",
    "* **Time to learn**: 30 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from datetime import datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Create a `DataFrame` objects from a csv file that contains NYSM observational data. Choose the station ID as the row index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataFile = '/spare11/atm533/data/nysm_data_20210902.csv'\n",
    "nysm_data = pd.read_csv(dataFile,index_col='station')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examine the `nysm_data` object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Work with Pandas' implementation of methods and attributes from Python's `datetime` library"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "    <b>Tip:</b> For a background on the use of <code>datetime</code> in Python, please check out <a href=\"https://www.atmos.albany.edu/facstaff/ktyle/atm350/core/week7/Datetime.html\">this notebook from ATM350<a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Relabel a Series from a column whose values are date and time strings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### First, let's load 5-minute accumulated precipitation for the Manhattan site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select the column and row of interest\n",
    "prcpMANH = nysm_data['precip_incremental [mm]'].loc['MANH']\n",
    "prcpMANH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Next, let's inspect the column correpsonding to date and time from the DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timeSer = nysm_data['time']\n",
    "timeSer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The *dtype: object* signifies that the values for *time* are being treated as a *string*. When working with time-based arrays, we want to treat them differently than a generic string type ... instead, let's treat them as `datetime` objects (derived from NumPy: see reference at end of notebook)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's look at the output after converting the `Series` from string to `datetime`. To do that, we'll use the `to_datetime` method in Pandas. We pass in the Series, which consists of an array of strings, and then specify how the strings are *formatted*. See the reference at the end of the notebook for a guide to formatting date/time strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.to_datetime(timeSer, format = \"%Y-%m-%d %H:%M:%S UTC\", utc=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the `dtype` of the Series has changed to `datetime64`, with precision to the nanosecond level and a timezone of UTC."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the use of a `lambda` function, we can accomplish the string-->datetime conversion directly in the call to `read_csv`. We'll also now set the row index to be time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First define the format and then define the function\n",
    "format = \"%Y-%m-%d %H:%M:%S UTC\"\n",
    "# This function will iterate over each string in a 1-d array \n",
    "# and use Pandas' implementation of strptime to convert the string into a datetime object.\n",
    "parseTime = lambda x: datetime.strptime(x, format)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remind ourselves of how Pandas' `read_csv` method works:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_csv?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "___\n",
    "#### Re-create the *nysm_data* `DataFrame`, with appropriate additional arguments to `read_csv` (including our `lambda` function, via the `date_parser` argument)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data = pd.read_csv(dataFile,index_col=1,parse_dates=['time'], date_parser=parseTime)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "nysm_data.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Now *time* is the `DataFrame`'s row index. Let's inspect this index; it's much like a generic Pandas `RangeIndex`, but specific for date/time purposes:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timeIndx = nysm_data.index\n",
    "timeIndx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Note that the `timezone` is missing. The `read_csv` method does not provide a means to specify the timezone. We can take care of that though with the `tz_localize` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timeIndx = timeIndx.tz_localize(tz='UTC')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timeIndx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### <span style=\"color:red\"> If this were a `Series`, not an index, use this `Series`-specific method instead:\n",
    "`timeIndx= timeIndx.dt.tz_localize(tz='UTC')`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Since it's a `datetime` object now, we can apply all sorts of time/date operations to it. For example, let's convert to Eastern time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "timeIndx = timeIndx.tz_convert(tz='US/Eastern')\n",
    "timeIndx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### (Yes, it automatically accounts for Standard or Daylight time!)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Use Pandas' built-in `plot` function ... which leverages Matplotlib:\n",
    "#### Select all the rows for site MANH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "condition = nysm_data['station'] == 'MANH'\n",
    "MANH = nysm_data.loc[condition]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate a basic time series plot by passing the desired column to Pandas' `plot` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prcp = MANH['precip_incremental [mm]']\n",
    "prcp.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### That was a way to get a quick look at the data and verify it looks reasonable. Now, let's pretty it up by using Matplotlib functions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### We'll draw a line plot, passing in time and wind gust speed for the x- and y-axes, respectively. Follow the same procedure as we did in the Matplotlib notebooks from week 3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use(\"seaborn\")\n",
    "fig = plt.figure(figsize=(11,8.5))\n",
    "ax = fig.add_subplot(1,1,1) \n",
    "ax.set_xlabel ('Date and Time')\n",
    "ax.set_ylabel ('5-min accum. precip (mm)')\n",
    "ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n",
    "ax.plot (timeIndx, prcp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### <span style=\"color: red\">Didn't work!!! Look at the error message above!</span>\n",
    "#### This is a *mismatch* between array sizes. The time index is based on the entire Dataframe, which has 12 x 24 x 126 rows, while the Manhattan precip array is only 12 x 24!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's set a condition where we match only those times that are in the same row as the Manhattan station id."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "condition = nysm_data['station'] == 'MANH'\n",
    "timeIndxMANH = timeIndx[condition]\n",
    "timeIndxMANH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use(\"seaborn\")\n",
    "fig = plt.figure(figsize=(11,8.5))\n",
    "ax = fig.add_subplot(1,1,1) \n",
    "ax.set_xlabel ('Date and Time')\n",
    "ax.set_ylabel ('5-min accum. precip (mm)')\n",
    "ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n",
    "ax.plot (timeIndxMANH, prcp);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### That's looking better! We still have work to do to improve the labeling of the x-axis tick marks, but we'll save that for another time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-warning\">\n",
    "    <b>Explore further:</b> Try making plots of other NYSM variables, from different NYSM sites.</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "\n",
    "* Use a `lambda` function to convert Date/time strings into Python `datetime` objects\n",
    "* Pandas' `plot` method allows for a quick visualization of `DataFrame` and `Series` objects. \n",
    "* x- and y- arrays must be of the same size in order to be plotted.\n",
    "\n",
    "### What's Next?\n",
    "Coming up next week, we will conclude our exploration of Pandas.\n",
    "\n",
    "## Resources and References\n",
    "1. [`datetime`objects in NumPy arrays](https://numpy.org/doc/stable/reference/arrays.datetime.html)\n",
    "1. [Date/time string formatting guide](https://strftime.org/)\n",
    "1. [Use of a `lambda` function in `read_csv` (Corey Schafer YouTube channel)](https://www.youtube.com/watch?v=UFuo7EHI8zc&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=10&ab_channel=CoreySchafer)\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 August 2022 Environment",
   "language": "python",
   "name": "aug22"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}