{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas Notebook 1, ATM350 Spring 2024\n",
    "***"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2023, previously downloaded and reformatted from the [xmACIS2](https://xmacis.rcc-acis.org) climate data portal.\n",
    "\n",
    "We will use the <a href = \"https://pandas.pydata.org/\">Pandas</a> library to read and analyze the data. We will also use the <a href=\"https://matplotlib.org/\">Matplotlib</a> package to visualize it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Motivating Science Questions:\n",
    "1. How can we analyze and display *tabular climate data* for a site?\n",
    "2. What was the yearly trace of max/min temperatures for Albany, NY last year?\n",
    "3. What was the most common 10-degree maximum temperature range for Albany, NY last year?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# import Pandas and Numpy, and use their conventional two-letter abbreviations when we\n",
    "# use methods from these packages. Also, import matplotlib's plotting package, using its \n",
    "# standard abbreviation.\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Specify the location of the file that contains the climo data. Use the linux <b>ls</b> command to verify it exists. \n",
    "#### Note that in a Jupyter notebook, we can simply use the <i>!</i> directive to \"call\" a Linux command. \n",
    "#### Also notice how we refer to a Python variable name when passing it to a Linux command line in this way ... we enclose it in braces!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file = '/spare11/atm350/common/data/climo_alb_2023.csv'\n",
    "! ls -l {file}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use pandas' `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).\n",
    "### Once this call succeeds, it returns a <i>Pandas Dataframe</i> object which we reference as `df`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(file, dtype='string')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## By simply typing the name of the dataframe object, we can get some of its contents to be \"pretty-printed\" to the notebook!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### It will be useful to have a variable (more accurately, an <i>object</i> ) that holds the value of the number of rows, and another for the number of columns.\n",
    "#### Remember that Python is a language that uses <i>zero-based</i> indexing, so the first value is accessed as element 0, and the second as element 1!\n",
    "#### Look at the syntax we use below to print out the (integer) value of nRows ... it's another example of **string formating**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nRows = df.shape[0]\n",
    "print (\"Number of rows = %d\" % nRows )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's do the same for the # of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nCols = df.shape[1]\n",
    "print (\"Number of columns = %d\" % nCols)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just **Date** ... but on the right side of the assignment statement, we have to use the exact name of the column. \n",
    "\n",
    "Print out what this object looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Date = df['DATE']\n",
    "print (Date)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Each column of a Pandas dataframe is known as a <i>series</i>. It is basically an array of values, each of which has a corresponding row #. By default, row #'s accompanying a Series are numbered consecutively, starting with 0 (since Python's convention is to use <i>zero-based indexing </i>)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let's print out the first 30 rows of the dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print (Date[:30])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Similarly, let's print out the last, or 364th row (Why is it 364, not 365???)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Date[364])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that using -1 as the last index doesn't work!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Date[-1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, using a negative value as part of a *slice* does work:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Date[-9:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### EXERCISE: Now, let's create new Series objects; one for Max Temp (name it *maxT*), and the other for Min Temp (name it *minT*)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-success\"> <b>TIP:</b> After you have tried on your own, you can uncomment the first line of the cell below and re-run to <i>load</i> the solution.</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %load /spare11/atm350/common/feb22/01a.py\n",
    "maxT = df['MAX']\n",
    "minT = df['MIN']\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hotDays = maxT >= 90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to <i>string</i>! We need to change the type of maxT to a numerical value. Let's use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We'll do the same for the minimum temp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT = maxT.astype(\"float32\")\n",
    "minT = minT.astype(\"float32\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hotDays = maxT >= 90"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now, the test works. What does this data series look like? It actually is a table of <i>booleans</i> ... i.e., true/false values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print (hotDays)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### As the default output only includes the first and last 5 rows , let's `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print (hotDays[180:195])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Now, let's get a count of the # of days meeting this temperature criterion. Note carefully that we first have to express our set of days exceeding the threshold as a Pandas series. Then, recall that to get a count of the # of rows, we take the first (0th) element of the array returned by a call to the `shape` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[maxT >= 90]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[maxT >= 90].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's reverse the sense of the test, and get its count. The two counts should add up to 365!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[maxT < 90].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We can combine a test of two different thresholds. Let's get a count of days where the max. temperature was in the 70s or 80s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[(maxT< 90) & (maxT>=70)].shape[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's show all the climate data for all these \"pleasantly warm\" days!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pleasant = df[(maxT< 90) & (maxT>=70)]\n",
    "pleasant"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.set_option ('display.max_rows', None)\n",
    "pleasant"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Now let's visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib's plotting package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### The data plotted fine, but the look could be better. First, let's import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib's graphs look better."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Info on seaborn: https://seaborn.pydata.org/index.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "sns.set()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT.plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "minT.plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Next, let's plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We'll also add some helpful labels and expand the size of the figure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(15,10))\n",
    "ax.plot (Date, maxT, color='red')\n",
    "ax.plot (Date, minT, color='blue')\n",
    "ax.set_title (\"ALB Year 2023\")\n",
    "ax.set_xlabel('Day of Year')\n",
    "ax.set_ylabel('Temperature (°F')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We will deal with this by using one of Pandas' methods that take strings and convert them to a special type of data ... not strings nor numbers, but <i>datetime</i> objects. Note carefully how we do this here ... it is not terribly intuitive, but we'll explain it more in an upcoming lecture/notebook on `datetime`. You will see though that the output column now looks a bit more *date-like*, with a four-digit year followed by two-digit month and date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Date = pd.to_datetime(Date,format=\"%Y-%m-%d\")\n",
    "Date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Matplotlib will recognize this array as being date/time-related, and when we pass it in as the x-axis, the graphic appears faster, and we also have a more meaningful x-axis label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(15,10))\n",
    "ax.plot (Date, maxT, color='red')\n",
    "ax.plot (Date, minT, color='blue')\n",
    "ax.set_title (\"ALB Year 2023\")\n",
    "ax.set_xlabel('Day of Year')\n",
    "ax.set_ylabel('Temperature (°F)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### We'll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(15,10))\n",
    "ax.plot (Date, maxT, color='red',label = \"Max T\")\n",
    "ax.plot (Date, minT, color='blue', label = \"Min T\")\n",
    "ax.set_title (\"ALB Year 2023\")\n",
    "ax.set_xlabel('Date')\n",
    "ax.set_ylabel('Temperature (°F)' )\n",
    "ax.xaxis.set_major_locator(MonthLocator(interval=1))\n",
    "dateFmt = DateFormatter('%b %d')\n",
    "ax.xaxis.set_major_formatter(dateFmt)\n",
    "ax.legend (loc=\"best\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's save our beautiful graphic to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig.savefig ('albTemps2023.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Now, let's answer the question, \"what was the most common range of maximum temperatures last year in Albany?\" via a histogram. We use `matplotlib`'s `hist` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create a figure and size it.\n",
    "fig, ax = plt.subplots(figsize=(15,10))\n",
    "# Create a histogram of our data series and divide it in to 10 bins.\n",
    "ax.hist(maxT, bins=10, color='k', alpha=0.3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ok, but the 10 bins were autoselected. Let's customize our call to the `hist` method by specifying the bounds of each of our bins.\n",
    "### How can we learn more about how to customize this call? Append a `?` to the name of the method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax.hist?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram's bins."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(15,10))\n",
    "ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)\n",
    "ax.xaxis.set_major_locator(plt.MultipleLocator(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save this histogram to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig.savefig(\"maxT_hist.png\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the `describe` method on the maximum temperature series to reveal some simple statistical properties."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "maxT.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. [Project Pythia: Pandas](https://foundations.projectpythia.org/core/pandas.html)\n",
    "2. [The Carpentries: Pandas](https://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html)\n",
    "3. [Matplotlib: Setting x/y-axis tick label properties](https://jakevdp.github.io/PythonDataScienceHandbook/04.10-customizing-ticks.html)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 Jan. 2024 Environment",
   "language": "python",
   "name": "jan24"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}