# Pandas Notebook 1, ATM350 Spring 2025
***

Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2024, previously downloaded and reformatted from the [xmACIS2](https://xmacis.rcc-acis.org) climate data portal.

We will use the <a href = "https://pandas.pydata.org/">Pandas</a> library to read and analyze the data. We will also use the <a href="https://matplotlib.org/">Matplotlib</a> package to visualize it.

## Motivating Science Questions:
1. How can we analyze and display *tabular climate data* for a site?
2. What was the yearly trace of max/min temperatures for Albany, NY last year?
3. What was the most common 10-degree maximum temperature range for Albany, NY last year?

In [None]:
# import Pandas and Numpy, and use their conventional two-letter abbreviations when we
# use methods from these packages. Also, import matplotlib's plotting package, using its 
# standard abbreviation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Specify the location of the file that contains the climo data. Use the linux <b>ls</b> command to verify it exists. 
#### Note that in a Jupyter notebook, we can simply use the <i>!</i> directive to "call" a Linux command. 
#### Also notice how we refer to a Python variable name when passing it to a Linux command line in this way ... we enclose it in braces!

In [None]:
year = 2024
file = f'/spare11/atm350/common/data/climo_alb_{year}.csv'
! ls -l {file}
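
The same check can also be made in pure Python, which helps if the code ever runs outside Jupyter, where the `!` syntax is unavailable. The sketch below is illustrative only: it uses the standard-library `pathlib` module, and the temporary file is a hypothetical stand-in for the real climo file.

```python
from pathlib import Path
import tempfile

# Hypothetical stand-in for the climo file; substitute the real path in practice.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_file = Path(tmpdir) / "climo_alb_2024.csv"
    demo_file.write_text("DATE,MAX,MIN\n2024-01-01,38,29\n")

    # exists() returns True or False rather than printing a directory listing.
    found = demo_file.exists()
    size_bytes = demo_file.stat().st_size
    print(f"{demo_file.name}: exists={found}, size={size_bytes} bytes")

# The temporary directory is gone now, so this path no longer exists.
missing = Path(tmpdir) / "no_such_file.csv"
print(missing.exists())
```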

## Use pandas' `read_csv` function to open the file. Specify that the data is to be read in as strings (neither integers nor floating-point numbers).
### Once this call succeeds, it returns a <i>Pandas Dataframe</i> object which we reference as `df`

In [None]:
df = pd.read_csv(file, dtype='string')
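
To see what `dtype='string'` does without needing the course data file, here is a minimal sketch that reads a small in-memory sample instead. The three rows are made up for illustration; only the column names `DATE`, `MAX`, and `MIN` come from this notebook.

```python
import io
import pandas as pd

# Hypothetical three-day sample that mimics the layout of the climo file.
sample = io.StringIO(
    "DATE,MAX,MIN\n"
    "2024-01-01,38,29\n"
    "2024-01-02,35,27\n"
    "2024-01-03,37,30\n"
)

demo_df = pd.read_csv(sample, dtype="string")

# Every column holds strings, so no numeric inference has happened yet.
print(demo_df.dtypes)
print(demo_df.shape)
```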

## By simply typing the name of the dataframe object, we can get some of its contents to be "pretty-printed" to the notebook!

In [None]:
df

### Our dataframe has 366 rows (one for each day of 2024, which was a leap year) and 10 columns that contain data. This is expressed by the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.

In [None]:
df.shape

### It will be useful to have a variable (more accurately, an <i>object</i> ) that holds the value of the number of rows, and another for the number of columns.
#### Remember that Python is a language that uses <i>zero-based</i> indexing, so the first value is accessed as element 0, and the second as element 1!
#### Look at the syntax we use below to print out the (integer) value of nRows ... it's another example of **string formatting**.

In [None]:
nRows = df.shape[0]
print (f"Number of rows = {nRows}")

### Let's do the same for the # of columns.

In [None]:
nCols = df.shape[1]
print (f"Number of columns = {nCols}")
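
Because `shape` is a two-element tuple, both values can also be pulled out in a single statement. This is a small sketch on a made-up two-row dataframe; the names `n_rows` and `n_cols` are illustrative.

```python
import pandas as pd

# Hypothetical two-row frame; the real dataframe has one row per day.
demo_df = pd.DataFrame(
    {
        "DATE": ["2024-01-01", "2024-01-02"],
        "MAX": ["38", "35"],
        "MIN": ["29", "27"],
    }
)

# shape is a (rows, columns) tuple, so both values can be unpacked at once.
n_rows, n_cols = demo_df.shape
print(f"Number of rows = {n_rows}, number of columns = {n_cols}")
```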

### To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just **Date** ... but on the right side of the assignment statement, we have to use the exact name of the column. 

Print out what this object looks like.

In [None]:
Date = df['DATE']
print (Date)

### Each column of a Pandas dataframe is known as a <i>series</i>. It is basically an array of values, each of which has a corresponding row #. By default, row #'s accompanying a Series are numbered consecutively, starting with 0 (since Python's convention is to use <i>zero-based indexing </i>).

### We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let's print out the first 30 rows of the dates.

In [None]:
print (Date[:30])

### Similarly, let's print out the last row, which has index 365 (why is it 365, not 366?)

In [None]:
print(Date[365])

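Note that using -1 as the index doesn't work here! With the default integer row labels, Pandas looks for a row *labeled* -1 rather than counting back from the end, so it raises a `KeyError`.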

In [None]:
print(Date[-1])

However, using a negative value as part of a *slice* does work:

In [None]:
print(Date[-9:])
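
When a single value is needed by position rather than by label, the `iloc` indexer is the usual tool. A minimal sketch with three made-up date strings:

```python
import pandas as pd

dates = pd.Series(["2024-12-29", "2024-12-30", "2024-12-31"])

# dates[-1] would look for the *label* -1, which does not exist here.
# .iloc indexes by position, so negative values count back from the end.
last = dates.iloc[-1]
print(last)

# .iloc also accepts slices, just like the bracket notation.
print(dates.iloc[-2:].tolist())
```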

### EXERCISE: Now, let's create new Series objects; one for Max Temp (name it *maxT*), and the other for Min Temp (name it *minT*).

<div class="alert alert-success"> <b>TIP:</b> After you have tried on your own, you can uncomment the first line of the cell below and re-run to <i>load</i> the solution.</div>

In [None]:
# %load /spare11/atm350/common/mar04/01a.py
maxT = df['MAX']
minT = df['MIN']


In [None]:
maxT

In [None]:
minT

## Let's now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!

In [None]:
hotDays = maxT >= 90

### Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to <i>string</i>! We need to change the type of maxT to a numerical value. Let's use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We'll do the same for the minimum temp.

In [None]:
maxT = maxT.astype("float32")
minT = minT.astype("float32")
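
`astype` raises an error if any entry cannot be parsed as a number. If a column might contain non-numeric flags, `pd.to_numeric` with `errors="coerce"` is one possible alternative. The sketch below assumes a hypothetical `"M"` flag for missing data; whether the climo file contains such flags is not established in this notebook.

```python
import pandas as pd

# Hypothetical values; "M" stands in for a missing-data flag (an assumption
# about the file, not something confirmed by this notebook).
raw_max = pd.Series(["85", "91", "M", "90"], dtype="string")

# errors="coerce" turns anything unparseable into NaN instead of raising.
clean_max = pd.to_numeric(raw_max, errors="coerce").astype("float32")

print(clean_max.isna().sum())
print((clean_max >= 90).sum())
```

Comparisons against NaN evaluate to `False`, so a missing day is never counted as meeting a threshold.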

In [None]:
maxT

In [None]:
hotDays = maxT >= 90

### Now, the test works. What does this data series look like? It actually is a table of <i>booleans</i> ... i.e., true/false values.

In [None]:
print (hotDays)

### As the default output only includes the first and last 5 rows, let's `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!

In [None]:
print (hotDays[180:195])

## Now, let's get a count of the # of days meeting this temperature criterion. Note carefully that we first use the boolean series as a filter on the dataframe, which returns a new dataframe holding only the days that meet the threshold. Then, recall that to get a count of the # of rows, we take the first (0th) element of the tuple returned by the `shape` attribute.

In [None]:
df[maxT >= 90]

In [None]:
df[maxT >= 90].shape[0]

### Let's reverse the sense of the test, and get its count. The two counts should add up to the total number of days in the year!

In [None]:
df[maxT < 90].shape[0]

### We can combine a test of two different thresholds. Let's get a count of days where the max. temperature was in the 70s or 80s.

In [None]:
df[(maxT< 90) & (maxT>=70)].shape[0]
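
A boolean series can also be counted directly, without building the filtered dataframe first. This sketch uses five made-up temperatures; `n_hot` and `n_pleasant` are illustrative names.

```python
import pandas as pd

max_t = pd.Series([68.0, 72.0, 85.0, 91.0, 95.0], dtype="float32")

# True counts as 1 and False as 0, so sum() counts the matching days.
n_hot = (max_t >= 90).sum()
n_pleasant = ((max_t < 90) & (max_t >= 70)).sum()
print(n_hot, n_pleasant)
```

The parentheses around each comparison matter, since `&` binds more tightly than `<` and `>=` in Python.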

## Let's show all the climate data for all these "pleasantly warm" days!

In [None]:
pleasant = df[(maxT< 90) & (maxT>=70)]
pleasant

## Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.

In [None]:
pd.set_option ('display.max_rows', None)
pleasant
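
`pd.set_option` changes the display limit for the rest of the session. If the change should apply to one output only, `pd.option_context` is one way to scope it. A minimal sketch on a made-up 100-row dataframe:

```python
import pandas as pd

demo_df = pd.DataFrame({"MAX": range(100)})
default_limit = pd.get_option("display.max_rows")

# Inside the block there is no row limit; the previous setting returns afterwards.
with pd.option_context("display.max_rows", None):
    full_text = repr(demo_df)

truncated_text = repr(demo_df)
print(len(full_text.splitlines()), len(truncated_text.splitlines()))
```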

### Now let's visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib's plotting package.

In [None]:
maxT.plot()

In [None]:
minT.plot()

### The data plotted fine, but the look could be better. First, let's import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib's graphs look better.

Info on seaborn: https://seaborn.pydata.org/index.html

In [None]:
import seaborn as sns
sns.set()

In [None]:
maxT.plot()

In [None]:
minT.plot()

### Next, let's plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also ensure a single y-axis that will encompass the range of temperature values). We'll also add some helpful labels and expand the size of the figure.

### You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed! 

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red')
ax.plot (Date, minT, color='blue')
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Day of Year')
ax.set_ylabel('Temperature (°F)')


### We will deal with this by using one of Pandas' methods that take strings and convert them to a special type of data ... not strings nor numbers, but <i>datetime</i> objects. Note carefully how we do this here ... it is not terribly intuitive, but we'll explain it more in an upcoming lecture/notebook on `datetime`. You will see though that the output column now looks a bit more *date-like*, with a four-digit year followed by two-digit month and date.

In [None]:
Date = pd.to_datetime(Date,format="%Y-%m-%d")
Date
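
As a small preview of why `datetime` values are more useful than strings, the sketch below converts three made-up date strings and then reads calendar fields through the `.dt` accessor.

```python
import pandas as pd

date_strings = pd.Series(["2024-02-28", "2024-02-29", "2024-03-01"], dtype="string")

# format tells Pandas exactly how each string is laid out (year-month-day).
dates = pd.to_datetime(date_strings, format="%Y-%m-%d")

# The .dt accessor exposes calendar fields that plain strings do not have.
print(dates.dt.month.tolist())
print(dates.dt.dayofyear.tolist())
```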

### Matplotlib will recognize this array as being date/time-related, and when we pass it in as the x-axis, the graphic appears faster, and we also have a more meaningful x-axis label.

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red')
ax.plot (Date, minT, color='blue')
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Day of Year')
ax.set_ylabel('Temperature (°F)')


### We'll further refine the look of the plot by adding a legend and placing vertical grid lines at one-month intervals.

In [None]:
from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red',label = "Max T")
ax.plot (Date, minT, color='blue', label = "Min T")
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature (°F)' )
ax.xaxis.set_major_locator(MonthLocator(interval=1))
dateFmt = DateFormatter('%b %d')
ax.xaxis.set_major_formatter(dateFmt)
ax.legend (loc="best")

### Let's save our beautiful graphic to disk.

In [None]:
fig.savefig (f'albTemps{year}.png')

## Now, let's answer the question, "what was the most common range of maximum temperatures last year in Albany?" via a histogram. We use `matplotlib`'s `hist` method.

In [None]:
# %load '/spare11/atm350/common/mar04/01b.py'
# Create a figure and size it.
fig, ax = plt.subplots(figsize=(15,10))
# Create a histogram of our data series and divide it in to 10 bins.
ax.hist(maxT, bins=10, color='k', alpha=0.3)


## Ok, but the 10 bins were autoselected. Let's customize our call to the `hist` method by specifying the bounds of each of our bins.
### How can we learn more about how to customize this call? Append a `?` to the name of the method.

In [None]:
ax.hist?

### Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram's bins.

In [None]:
# %load '/spare11/atm350/common/mar04/01c.py'
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)
ax.xaxis.set_major_locator(plt.MultipleLocator(10))
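
The histogram shows the answer visually. To get the busiest bin as a number, the same bin edges can be passed to NumPy's `histogram` function. This sketch uses made-up temperatures rather than the real `maxT` series.

```python
import numpy as np

# Hypothetical maximum temperatures (°F); the real values come from maxT.
sample_max = np.array([28, 34, 41, 47, 52, 58, 63, 71, 74, 76, 79, 83, 88, 92])
bin_edges = np.arange(0, 101, 10)

# counts[i] is the number of values from edges[i] up to, but not including,
# edges[i + 1]; only the final bin also includes its right edge.
counts, edges = np.histogram(sample_max, bins=bin_edges)

busiest = counts.argmax()
print(f"Most common range: {edges[busiest]}-{edges[busiest + 1]} °F ({counts[busiest]} days)")
```

If two bins tie, `argmax` reports the first one, so it is worth inspecting `counts` as a whole.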


### Save this histogram to disk.

In [None]:
fig.savefig("maxT_hist.png")

Use the `describe` method on the maximum temperature series to reveal some simple statistical properties.

In [None]:
maxT.describe()

## References

1. [Project Pythia: Pandas](https://foundations.projectpythia.org/core/pandas.html)
2. [The Carpentries: Pandas](https://swcarpentry.github.io/python-novice-gapminder/07-reading-tabular/index.html)
3. [Matplotlib: Setting x/y-axis tick label properties](https://jakevdp.github.io/PythonDataScienceHandbook/04.10-customizing-ticks.html)