{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas Notebook 1, ATM350 Spring 2024\n", "***" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2023, previously downloaded and reformatted from the [xmACIS2](https://xmacis.rcc-acis.org) climate data portal.\n", "\n", "We will use the Pandas library to read and analyze the data. We will also use the Matplotlib package to visualize it." ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Motivating Science Questions:\n", "1. How can we analyze and display *tabular climate data* for a site?\n", "2. What was the yearly trace of max/min temperatures for Albany, NY last year?\n", "3. What was the most common 10-degree maximum temperature range for Albany, NY last year?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "# import Pandas and Numpy, and use their conventional two-letter abbreviations when we\n", "# use methods from these packages. Also, import matplotlib's plotting package, using its \n", "# standard abbreviation.\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists. \n", "#### Note that in a Jupyter notebook, we can simply use the ! directive to \"call\" a Linux command. \n", "#### Also notice how we refer to a Python variable name when passing it to a Linux command line in this way ... we enclose it in braces!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "file = '/spare11/atm350/common/data/climo_alb_2023.csv'\n", "! ls -l {file}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use pandas' `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).\n", "### Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(file, dtype='string')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## By simply typing the name of the dataframe object, we can get some of its contents to be \"pretty-printed\" to the notebook!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.\n", "#### Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!\n", "#### Look at the syntax we use below to print out the (integer) value of nRows ... it's another example of **string formating**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nRows = df.shape[0]\n", "print (\"Number of rows = %d\" % nRows )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's do the same for the # of columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nCols = df.shape[1]\n", "print (\"Number of columns = %d\" % nCols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just **Date** ... but on the right side of the assignment statement, we have to use the exact name of the column. \n", "\n", "Print out what this object looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Date = df['DATE']\n", "print (Date)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #'s accompanying a Series are numbered consecutively, starting with 0 (since Python's convention is to use zero-based indexing )." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let's print out the first 30 rows of the dates." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (Date[:30])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Similarly, let's print out the last, or 364th row (Why is it 364, not 365???)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(Date[364])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that using -1 as the last index doesn't work!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(Date[-1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, using a negative value as part of a *slice* does work:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(Date[-9:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EXERCISE: Now, let's create new Series objects; one for Max Temp (name it *maxT*), and the other for Min Temp (name it *minT*)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
TIP: After you have tried on your own, you can uncomment the first line of the cell below and re-run to load the solution.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load /spare11/atm350/common/feb22/01a.py\n", "maxT = df['MAX']\n", "minT = df['MIN']\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hotDays = maxT >= 90" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to string! We need to change the type of maxT to a numerical value. Let's use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We'll do the same for the minimum temp." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT = maxT.astype(\"float32\")\n", "minT = minT.astype(\"float32\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hotDays = maxT >= 90" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now, the test works. What does this data series look like? It actually is a table of booleans ... i.e., true/false values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (hotDays)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### As the default output only includes the first and last 5 rows , let's `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (hotDays[180:195])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now, let's get a count of the # of days meeting this temperature criterion. Note carefully that we first have to express our set of days exceeding the threshold as a Pandas series. Then, recall that to get a count of the # of rows, we take the first (0th) element of the array returned by a call to the `shape` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[maxT >= 90]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[maxT >= 90].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's reverse the sense of the test, and get its count. The two counts should add up to 365!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[maxT < 90].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can combine a test of two different thresholds. Let's get a count of days where the max. temperature was in the 70s or 80s." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[(maxT< 90) & (maxT>=70)].shape[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's show all the climate data for all these \"pleasantly warm\" days!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pleasant = df[(maxT< 90) & (maxT>=70)]\n", "pleasant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option ('display.max_rows', None)\n", "pleasant" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Now let's visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib's plotting package." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The data plotted fine, but the look could be better. First, let's import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib's graphs look better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Info on seaborn: https://seaborn.pydata.org/index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "sns.set()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT.plot()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minT.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Next, let's plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We'll also add some helpful labels and expand the size of the figure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,10))\n", "ax.plot (Date, maxT, color='red')\n", "ax.plot (Date, minT, color='blue')\n", "ax.set_title (\"ALB Year 2023\")\n", "ax.set_xlabel('Day of Year')\n", "ax.set_ylabel('Temperature (°F')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We will deal with this by using one of Pandas' methods that take strings and convert them to a special type of data ... not strings nor numbers, but datetime objects. Note carefully how we do this here ... it is not terribly intuitive, but we'll explain it more in an upcoming lecture/notebook on `datetime`. You will see though that the output column now looks a bit more *date-like*, with a four-digit year followed by two-digit month and date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Date = pd.to_datetime(Date,format=\"%Y-%m-%d\")\n", "Date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib will recognize this array as being date/time-related, and when we pass it in as the x-axis, the graphic appears faster, and we also have a more meaningful x-axis label." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,10))\n", "ax.plot (Date, maxT, color='red')\n", "ax.plot (Date, minT, color='blue')\n", "ax.set_title (\"ALB Year 2023\")\n", "ax.set_xlabel('Day of Year')\n", "ax.set_ylabel('Temperature (°F)')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We'll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,10))\n", "ax.plot (Date, maxT, color='red',label = \"Max T\")\n", "ax.plot (Date, minT, color='blue', label = \"Min T\")\n", "ax.set_title (\"ALB Year 2023\")\n", "ax.set_xlabel('Date')\n", "ax.set_ylabel('Temperature (°F)' )\n", "ax.xaxis.set_major_locator(MonthLocator(interval=1))\n", "dateFmt = DateFormatter('%b %d')\n", "ax.xaxis.set_major_formatter(dateFmt)\n", "ax.legend (loc=\"best\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's save our beautiful graphic to disk." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig.savefig ('albTemps2023.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Now, let's answer the question, \"what was the most common range of maximum temperatures last year in Albany?\" via a histogram. We use `matplotlib`'s `hist` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a figure and size it.\n", "fig, ax = plt.subplots(figsize=(15,10))\n", "# Create a histogram of our data series and divide it in to 10 bins.\n", "ax.hist(maxT, bins=10, color='k', alpha=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ok, but the 10 bins were autoselected. Let's customize our call to the `hist` method by specifying the bounds of each of our bins.\n", "### How can we learn more about how to customize this call? Append a `?` to the name of the method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax.hist?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram's bins." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(15,10))\n", "ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)\n", "ax.xaxis.set_major_locator(plt.MultipleLocator(10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save this histogram to disk." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig.savefig(\"maxT_hist.png\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `describe` method on the maximum temperature series to reveal some simple statistical properties." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "maxT.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Project Pythia: Pandas](https://foundations.projectpythia.org/core/pandas.html)\n", "2. [The Carpentries: Pandas](https://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html)\n", "3. [Matplotlib: Setting x/y-axis tick label properties](https://jakevdp.github.io/PythonDataScienceHandbook/04.10-customizing-ticks.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 Jan. 2024 Environment", "language": "python", "name": "jan24" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 4 }