{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\"pandas
\n", "\n", "# Pandas 9: Multi-index DataFrames\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In many instances, tabular data may best be represented with more than a single row index. For the case of NYSM hourly data files, which have 126 stations each with 13 sets of 5-minute obs, we definitely want to take advantage of using Multi-index Dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "1. Open an hour's worth of NYSM data using a single row index dataframe\n", "1. Review selection and conditional selection methodology\n", "1. Create a dataframe with station ID and date/time as the two row indices\n", "1. Use selection and conditions on a multi-index dataframe\n", "1. Briefly define tuples as opposed to lists in Python\n", "1. Work with the time index as a Datetime object\n", "\n", "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Pandas notebooks 1-8 | Necessary | |\n", "\n", "* **Time to learn**: 15 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datetime import datetime " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Open an hour's worth of NYSM data using a single row index dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data_file = '/spare11/atm533/data/nysm_data_2021090202a.csv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(nysm_data_file)\n", "# Look at the first few rows\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remind ourselves that the default `DataFrame`'s index is a special type of Python object, called a *RangeIndex*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Review selection and conditional selection methodology" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we have multiple stations and multiple times. We can select rows/columns, set conditions, and make further selections based on those conditions as we did in the **03_Pandas_IndexSubsetsConditonals** notebook. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[df['station']=='VOOR'] # Voorheesville, NY" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a second dataframe that first contains just VOOR rows, then further subset by choosing one time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = df[df['station']=='VOOR']\n", "df2[df2['time'] == '2021-09-02 02:50:00 UTC'] # we haven't recast this as a Datetime object ... it's just a string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, there is a better way! Since our DataFrame consists of one station after another, each associated with a number of discrete times, let's create a **multi-indexed** DataFrame, that has *station* as its outer row index and *time* as its inner." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create a dataframe with station ID and date/time as the two row indices" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.set_index(['station', 'time'], inplace = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: inplace means that the df object gets re-created with its new indexes. It's convenient, but if at any point we wish to re-run cells beyond this point, it is better to start from the beginning to ensure that the df object conforms to whatever code cell operates on it.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The date/time index is currently a string. Let's do this nifty trick to convert it to a `datetime` object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.index = df.index.set_levels([df.index.levels[0], pd.to_datetime(df.index.levels[1])])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Examine first three row indices\n", "df.index[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Use selection and conditions on a multi-index dataframe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a column in the usual way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc['VOOR']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `loc` to select not only the station id, but also a specific time: we pass in a `tuple` to the `loc` method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Briefly define **tuples** as opposed to **lists** in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `tuple` is a core Python class, similar to but distinct from a `list`\n", "1. Tuples are enclosed in parentheses (); lists are enclosed in brackets []\n", "1. Elements in a `tuple` are *immutable* (i.e. can't be changed), but `list` elements can be changed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = (4,7,9) # a tuple\n", "print (a)\n", "print(a[2])\n", "# this next line won't work; comment it out to see:\n", "# a[2] = 8" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = [4, 7, 9]\n", "print (b)\n", "print (b[2])\n", "# this will work:\n", "b[2] = 8\n", "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Some Pandas methods accept tuples as their arguments, although lists will usually work too. When in doubt, append a ? to the method call to see the documentation.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can be confusing is that just because something is enclosed in brackets, that doesn't mean it's a Python list object. In terms of Pandas, the DataFrame's `loc` method typically expects a string that's enclosed in brackets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Note: we can pass in the requested time as a string ... it will work!
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[('VOOR','2021-09-02 02:45:00 UTC')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As above, but also select a particular column:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[('VOOR','2021-09-02 02:45:00 UTC'),'temp_2m [degC]' ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Taking advantage of *multi-indexing* allows us to write selection criteria that is more intuitive than how we've previously defined and utilized conditonal statements in our selection criteria (as in our cell earlier in the notebook with nested data frame object names). It also executes quicker!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass in multiple columns (either as a list or a tuple) ... get back a `DataFrame` whose columns are the ones we selected" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[('VOOR','2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pass in multiple indexes and multiple columns also returns `DataFrame`s:\n", "1. Two outer and one inner index:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[(('VOOR','KIND'),'2021-09-02 02:45:00 UTC'),('temp_2m [degC]','precip_incremental [mm]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Two outer and two inner:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[(('VOOR','KIND'),('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. One outer and two inner:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[('VOOR',('2021-09-02 02:45:00 UTC','2021-09-02 02:55:00 UTC')),('temp_2m [degC]','precip_incremental [mm]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A more efficient way to get all stations at a particular time is via Pandas' `xs` DataFrame method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Pass in one index value and what index it belongs to\n", "df.xs('2021-09-02 02:45:00 UTC', level='time')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Include just a *list* of columns (a *tuple* won't work here)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.xs('2021-09-02 02:45:00 UTC', level = 'time')[['temp_2m [degC]','temp_9m [degC]']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Work with the time index as a Datetime object:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas' handling of `datetime` objects is incredibly powerful (see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) . Below, we select 3 times using the `date_range` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "timeRange = pd.date_range('2021-09-02 02:25', periods=3, freq='5min')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.loc[('MANH',timeRange.values),('max_wind_speed_prop [m/s]', 'max_wind_speed_sonic [m/s]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* One can recast a Pandas `DataFrame` so it leverages multi-indexing\n", "* Selecting and specifying conditions on a multi-index `DataFrame` work just as in a single-index `DataFrame`\n", "* Pandas has a large set of utilities to work with time-series based data. `xs` is one such method.\n", "* In Python, **tuples** are akin to **lists**, but a tuple's elements are *immutable*.\n", "\n", "### What's Next?\n", "In the next notebook, we will merge DataFrames that have different row indexes.\n", "## Resources and References\n", "1. [MetPy Monday Episode 96](https://www.youtube.com/watch?v=yQ5IxnZouKo&list=PLQut5OXpV-0ir4IdllSt1iEZKTwFBa7kO&index=87&ab_channel=Unidata)\n", "1. https://www.geeksforgeeks.org/python-difference-between-list-and-tuple/\n", "1. https://stackoverflow.com/questions/8900166/whats-the-difference-between-lists-enclosed-by-square-brackets-and-parentheses\n", "1. [Time series in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }