{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas 4: Working with date- and time-based data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\"pandas
\n", "\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview \n", "\n", "In this notebook, we'll work with Pandas `DataFrame` and `Series` objects to do the following:\n", "1. Work with Pandas' implementation of methods and attributes from Python's `datetime` library\n", "1. Relabel a Series from a column whose values are date and time strings\n", "1. Employ a `lambda` function to convert date/time strings to `datetime` objects\n", "1. Use Pandas' built-in `plot` function to generate a basic time series plot\n", "1. Improve the look of the time series plot by using Matplotlib\n", "\n", "We'll once again use NYS Mesonet data, but for the entire day of 2 September 2021." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Matplotlib | Necessary | |\n", "| Datetime | Helpful | |\n", "| Pandas | Necessary | Notebooks 1-3 |\n", "\n", "* **Time to learn**: 30 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from datetime import datetime" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create a `DataFrame` objects from a csv file that contains NYSM observational data. Choose the station ID as the row index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataFile = '/spare11/atm533/data/nysm_data_20210902.csv'\n", "nysm_data = pd.read_csv(dataFile,index_col='station')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examine the `nysm_data` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Work with Pandas' implementation of methods and attributes from Python's `datetime` library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: For a background on the use of datetime in Python, please check out this notebook from ATM350
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Relabel a Series from a column whose values are date and time strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### First, let's load 5-minute accumulated precipitation for the Manhattan site." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Select the column and row of interest\n", "prcpMANH = nysm_data['precip_incremental [mm]'].loc['MANH']\n", "prcpMANH" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Next, let's inspect the column correpsonding to date and time from the DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeSer = nysm_data['time']\n", "timeSer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The *dtype: object* signifies that the values for *time* are being treated as a *string*. When working with time-based arrays, we want to treat them differently than a generic string type ... instead, let's treat them as `datetime` objects (derived from NumPy: see reference at end of notebook)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's look at the output after converting the `Series` from string to `datetime`. To do that, we'll use the `to_datetime` method in Pandas. We pass in the Series, which consists of an array of strings, and then specify how the strings are *formatted*. See the reference at the end of the notebook for a guide to formatting date/time strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.to_datetime(timeSer, format = \"%Y-%m-%d %H:%M:%S UTC\", utc=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the `dtype` of the Series has changed to `datetime64`, with precision to the nanosecond level and a timezone of UTC." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the use of a `lambda` function, we can accomplish the string-->datetime conversion directly in the call to `read_csv`. We'll also now set the row index to be time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First define the format and then define the function\n", "format = \"%Y-%m-%d %H:%M:%S UTC\"\n", "# This function will iterate over each string in a 1-d array \n", "# and use Pandas' implementation of strptime to convert the string into a datetime object.\n", "parseTime = lambda x: datetime.strptime(x, format)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remind ourselves of how Pandas' `read_csv` method works:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.read_csv?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "#### Re-create the *nysm_data* `DataFrame`, with appropriate additional arguments to `read_csv` (including our `lambda` function, via the `date_parser` argument)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data = pd.read_csv(dataFile,index_col=1,parse_dates=['time'], date_parser=parseTime)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "nysm_data.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Now *time* is the `DataFrame`'s row index. Let's inspect this index; it's much like a generic Pandas `RangeIndex`, but specific for date/time purposes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeIndx = nysm_data.index\n", "timeIndx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Note that the `timezone` is missing. The `read_csv` method does not provide a means to specify the timezone. We can take care of that though with the `tz_localize` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeIndx = timeIndx.tz_localize(tz='UTC')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeIndx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### If this were a `Series`, not an index, use this `Series`-specific method instead:\n", "`timeIndx= timeIndx.dt.tz_localize(tz='UTC')`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Since it's a `datetime` object now, we can apply all sorts of time/date operations to it. For example, let's convert to Eastern time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeIndx = timeIndx.tz_convert(tz='US/Eastern')\n", "timeIndx" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### (Yes, it automatically accounts for Standard or Daylight time!)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Pandas' built-in `plot` function ... which leverages Matplotlib:\n", "#### Select all the rows for site MANH" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "condition = nysm_data['station'] == 'MANH'\n", "MANH = nysm_data.loc[condition]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate a basic time series plot by passing the desired column to Pandas' `plot` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "prcp = MANH['precip_incremental [mm]']\n", "prcp.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### That was a way to get a quick look at the data and verify it looks reasonable. Now, let's pretty it up by using Matplotlib functions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### We'll draw a line plot, passing in time and wind gust speed for the x- and y-axes, respectively. Follow the same procedure as we did in the Matplotlib notebooks from week 3." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.style.use(\"seaborn\")\n", "fig = plt.figure(figsize=(11,8.5))\n", "ax = fig.add_subplot(1,1,1) \n", "ax.set_xlabel ('Date and Time')\n", "ax.set_ylabel ('5-min accum. precip (mm)')\n", "ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n", "ax.plot (timeIndx, prcp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Didn't work!!! Look at the error message above!\n", "#### This is a *mismatch* between array sizes. The time index is based on the entire Dataframe, which has 12 x 24 x 126 rows, while the Manhattan precip array is only 12 x 24!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's set a condition where we match only those times that are in the same row as the Manhattan station id." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "condition = nysm_data['station'] == 'MANH'\n", "timeIndxMANH = timeIndx[condition]\n", "timeIndxMANH" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.style.use(\"seaborn\")\n", "fig = plt.figure(figsize=(11,8.5))\n", "ax = fig.add_subplot(1,1,1) \n", "ax.set_xlabel ('Date and Time')\n", "ax.set_ylabel ('5-min accum. precip (mm)')\n", "ax.set_title (\"Manhattan, NY 5-minute accumulated precip associated with the remnants of Hurricane Ida\")\n", "ax.plot (timeIndxMANH, prcp);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### That's looking better! We still have work to do to improve the labeling of the x-axis tick marks, but we'll save that for another time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Explore further: Try making plots of other NYSM variables, from different NYSM sites.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "\n", "* Use a `lambda` function to convert Date/time strings into Python `datetime` objects\n", "* Pandas' `plot` method allows for a quick visualization of `DataFrame` and `Series` objects. \n", "* x- and y- arrays must be of the same size in order to be plotted.\n", "\n", "### What's Next?\n", "Coming up next week, we will conclude our exploration of Pandas.\n", "\n", "## Resources and References\n", "1. [`datetime`objects in NumPy arrays](https://numpy.org/doc/stable/reference/arrays.datetime.html)\n", "1. [Date/time string formatting guide](https://strftime.org/)\n", "1. [Use of a `lambda` function in `read_csv` (Corey Schafer YouTube channel)](https://www.youtube.com/watch?v=UFuo7EHI8zc&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=10&ab_channel=CoreySchafer)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }