Pandas Notebook 1, ATM350 Spring 2025

Contents

Pandas Notebook 1, ATM350 Spring 2025#


Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2024, previously downloaded and reformatted from the xmACIS2 climate data portal.

We will use the Pandas library to read and analyze the data. We will also use the Matplotlib package to visualize it.

Motivating Science Questions:#

  1. How can we analyze and display tabular climate data for a site?

  2. What was the yearly trace of max/min temperatures for Albany, NY last year?

  3. What was the most common 10-degree maximum temperature range for Albany, NY last year?

# import Pandas and Numpy, and use their conventional two-letter abbreviations when we
# use methods from these packages. Also, import matplotlib's plotting package, using its 
# standard abbreviation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.#

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.#

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!#

year = 2024
file = f'/spare11/atm350/common/data/climo_alb_{year}.csv'
! ls -l {file}
-rw-r--r-- 1 ktyle faculty 15184 Mar  5 19:20 /spare11/atm350/common/data/climo_alb_2024.csv

Use pandas’ read_csv method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).#

Once this call succeeds, it returns a Pandas Dataframe object which we reference as df#

df = pd.read_csv(file, dtype='string')

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!#

df
DATE MAX MIN AVG DEP PCP SNW DPTH HDD CDD
0 2024-01-01 36 25 30.5 4.2 0.00 0.0 0 34 0
1 2024-01-02 38 24 31.0 4.9 0.00 0.0 0 34 0
2 2024-01-03 40 32 36.0 10.1 0.00 0.0 0 29 0
3 2024-01-04 43 19 31.0 5.3 T T 0 34 0
4 2024-01-05 31 16 23.5 -2.0 T T 0 41 0
... ... ... ... ... ... ... ... ... ... ...
361 2024-12-27 27 10 18.5 -8.9 0.03 0.3 2 46 0
362 2024-12-28 35 22 28.5 1.3 0.04 0.0 2 36 0
363 2024-12-29 57 33 45.0 18.1 0.04 0.0 1 20 0
364 2024-12-30 58 44 51.0 24.3 0.17 0.0 0 14 0
365 2024-12-31 51 30 40.5 14.0 T 0.0 0 24 0

366 rows × 10 columns

Our dataframe has 365 or 366 rows (corresponding to all the days in the year) and 10 columns that contain data. This is expressed by calling the shape attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.#

df.shape
(366, 10)

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.#

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!#

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.#

nRows = df.shape[0]
print (f"Number of rows = {nRows}")
Number of rows = 366

Let’s do the same for the # of columns.#

nCols = df.shape[1]
print (f"Number of columns = {nCols}")
Number of columns = 10

To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just Date … but on the right side of the assignment statement, we have to use the exact name of the column.#

Print out what this object looks like.

Date = df['DATE']
print (Date)
0      2024-01-01
1      2024-01-02
2      2024-01-03
3      2024-01-04
4      2024-01-05
          ...    
361    2024-12-27
362    2024-12-28
363    2024-12-29
364    2024-12-30
365    2024-12-31
Name: DATE, Length: 366, dtype: string

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).#

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.#

print (Date[:30])
0     2024-01-01
1     2024-01-02
2     2024-01-03
3     2024-01-04
4     2024-01-05
5     2024-01-06
6     2024-01-07
7     2024-01-08
8     2024-01-09
9     2024-01-10
10    2024-01-11
11    2024-01-12
12    2024-01-13
13    2024-01-14
14    2024-01-15
15    2024-01-16
16    2024-01-17
17    2024-01-18
18    2024-01-19
19    2024-01-20
20    2024-01-21
21    2024-01-22
22    2024-01-23
23    2024-01-24
24    2024-01-25
25    2024-01-26
26    2024-01-27
27    2024-01-28
28    2024-01-29
29    2024-01-30
Name: DATE, dtype: string

Similarly, let’s print out the last, or 365th row (Why is it 365, not 366???)#

print(Date[365])
2024-12-31

Note that using -1 as the last index doesn’t work!

print(Date[-1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 print(Date[-1])

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:1121, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1

However, using a negative value as part of a slice does work:

print(Date[-9:])
357    2024-12-23
358    2024-12-24
359    2024-12-25
360    2024-12-26
361    2024-12-27
362    2024-12-28
363    2024-12-29
364    2024-12-30
365    2024-12-31
Name: DATE, dtype: string

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).#

TIP: After you have tried on your own, you can uncomment the first line of the cell below and re-run to load the solution.
# %load /spare11/atm350/common/mar04/01a.py
maxT = df['MAX']
minT = df['MIN']
maxT
0      36
1      38
2      40
3      43
4      31
       ..
361    27
362    35
363    57
364    58
365    51
Name: MAX, Length: 366, dtype: string
minT
0      25
1      24
2      32
3      19
4      16
       ..
361    10
362    22
363    33
364    44
365    30
Name: MIN, Length: 366, dtype: string

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!#

hotDays = maxT >= 90
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 hotDays = maxT >= 90

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:6119, in Series._cmp_method(self, other, op)
   6116 lvalues = self._values
   6117 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 6119 res_values = ops.comparison_op(lvalues, rvalues, op)
   6121 return self._construct_result(res_values, name=res_name)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:330, in comparison_op(left, right, op)
    321         raise ValueError(
    322             "Lengths must match to compare", lvalues.shape, rvalues.shape
    323         )
    325 if should_extension_dispatch(lvalues, rvalues) or (
    326     (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
    327     and lvalues.dtype != object
    328 ):
    329     # Call the method on lvalues
--> 330     res_values = op(lvalues, rvalues)
    332 elif is_scalar(rvalues) and isna(rvalues):  # TODO: but not pd.NA?
    333     # numpy does not like comparisons vs None
    334     if op is operator.ne:

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arrays/string_.py:593, in StringArray._cmp_method(self, other, op)
    590 else:
    591     # logical
    592     result = np.zeros(len(self._ndarray), dtype="bool")
--> 593     result[valid] = op(self._ndarray[valid], other)
    594     return BooleanArray(result, mask)

TypeError: '>=' not supported between instances of 'str' and 'int'

Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to string! We need to change the type of maxT to a numerical value. Let’s use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We’ll do the same for the minimum temp.#

maxT = maxT.astype("float32")
minT = minT.astype("float32")
maxT
0      36.0
1      38.0
2      40.0
3      43.0
4      31.0
       ... 
361    27.0
362    35.0
363    57.0
364    58.0
365    51.0
Name: MAX, Length: 366, dtype: float32
hotDays = maxT >= 90

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.#

print (hotDays)
0      False
1      False
2      False
3      False
4      False
       ...  
361    False
362    False
363    False
364    False
365    False
Name: MAX, Length: 366, dtype: bool

As the default output only includes the first and last 5 rows , let’s slice and pull out a period in the middle of the year, where we might be more likely to get some Trues!#

print (hotDays[180:195])
180    False
181    False
182    False
183    False
184    False
185    False
186    False
187     True
188    False
189     True
190     True
191     True
192    False
193    False
194     True
Name: MAX, dtype: bool

Now, let’s get a count of the # of days meeting this temperature criterion. Note carefully that we first have to express our set of days exceeding the threshold as a Pandas series. Then, recall that to get a count of the # of rows, we take the first (0th) element of the array returned by a call to the shape method.#

df[maxT >= 90]
DATE MAX MIN AVG DEP PCP SNW DPTH HDD CDD
169 2024-06-18 94 70 82.0 12.8 0.00 0.0 0 0 17
170 2024-06-19 94 76 85.0 15.5 T 0.0 0 0 20
171 2024-06-20 96 73 84.5 14.8 0.01 0.0 0 0 20
187 2024-07-06 93 72 82.5 9.7 0.16 0.0 0 0 18
189 2024-07-08 92 65 78.5 5.5 T 0.0 0 0 14
190 2024-07-09 95 76 85.5 12.4 T 0.0 0 0 21
191 2024-07-10 94 76 85.0 11.8 T 0.0 0 0 20
194 2024-07-13 91 70 80.5 7.1 0.00 0.0 0 0 16
195 2024-07-14 91 63 77.0 3.6 0.00 0.0 0 0 12
196 2024-07-15 94 74 84.0 10.6 0.01 0.0 0 0 19
197 2024-07-16 93 72 82.5 9.0 0.60 0.0 0 0 18
211 2024-07-30 91 71 81.0 7.9 0.00 0.0 0 0 16
213 2024-08-01 92 69 80.5 7.5 0.00 0.0 0 0 16
df[maxT >= 90].shape[0]
13

Let’s reverse the sense of the test, and get its count. The two counts should add up to the total number of days in the year!#

df[maxT < 90].shape[0]
353

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.#

df[(maxT< 90) & (maxT>=70)].shape[0]
149

Let’s show all the climate data for all these “pleasantly warm” days!#

pleasant = df[(maxT< 90) & (maxT>=70)]
pleasant
DATE MAX MIN AVG DEP PCP SNW DPTH HDD CDD
99 2024-04-09 77 37 57.0 11.9 0.00 0.0 0 8 0
113 2024-04-23 70 33 51.5 0.1 T 0.0 0 13 0
118 2024-04-28 75 48 61.5 8.0 0.05 0.0 0 3 0
120 2024-04-30 74 51 62.5 8.2 0.43 0.0 0 2 0
121 2024-05-01 75 54 64.5 9.8 T 0.0 0 0 0
... ... ... ... ... ... ... ... ... ... ...
303 2024-10-30 74 54 64.0 17.7 T 0.0 0 1 0
304 2024-10-31 79 55 67.0 21.1 0.00 0.0 0 0 2
305 2024-11-01 74 48 61.0 15.4 0.02 0.0 0 4 0
309 2024-11-05 72 52 62.0 17.8 0.00 0.0 0 3 0
310 2024-11-06 77 53 65.0 21.2 T 0.0 0 0 0

149 rows × 10 columns

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.#

pd.set_option ('display.max_rows', None)
pleasant
DATE MAX MIN AVG DEP PCP SNW DPTH HDD CDD
99 2024-04-09 77 37 57.0 11.9 0.00 0.0 0 8 0
113 2024-04-23 70 33 51.5 0.1 T 0.0 0 13 0
118 2024-04-28 75 48 61.5 8.0 0.05 0.0 0 3 0
120 2024-04-30 74 51 62.5 8.2 0.43 0.0 0 2 0
121 2024-05-01 75 54 64.5 9.8 T 0.0 0 0 0
122 2024-05-02 76 55 65.5 10.4 0.00 0.0 0 0 1
123 2024-05-03 73 50 61.5 6.0 0.00 0.0 0 3 0
124 2024-05-04 74 53 63.5 7.6 0.00 0.0 0 1 0
126 2024-05-06 78 52 65.0 8.4 T 0.0 0 0 0
127 2024-05-07 79 44 61.5 4.6 0.00 0.0 0 3 0
128 2024-05-08 82 52 67.0 9.7 0.48 0.0 0 0 2
133 2024-05-13 71 39 55.0 -3.9 0.02 0.0 0 10 0
134 2024-05-14 83 58 70.5 11.3 0.00 0.0 0 0 6
136 2024-05-16 73 56 64.5 4.7 T 0.0 0 0 0
137 2024-05-17 76 50 63.0 2.9 0.00 0.0 0 2 0
138 2024-05-18 77 56 66.5 6.1 0.00 0.0 0 0 2
139 2024-05-19 80 57 68.5 7.8 0.00 0.0 0 0 4
140 2024-05-20 86 58 72.0 11.0 0.00 0.0 0 0 7
141 2024-05-21 85 63 74.0 12.8 0.39 0.0 0 0 9
142 2024-05-22 89 64 76.5 15.0 0.00 0.0 0 0 12
143 2024-05-23 80 56 68.0 6.2 0.04 0.0 0 0 3
144 2024-05-24 82 52 67.0 4.9 0.00 0.0 0 0 2
145 2024-05-25 81 51 66.0 3.6 T 0.0 0 0 1
146 2024-05-26 84 60 72.0 9.3 0.00 0.0 0 0 7
147 2024-05-27 81 69 75.0 12.1 0.56 0.0 0 0 10
148 2024-05-28 79 59 69.0 5.8 T 0.0 0 0 4
149 2024-05-29 76 57 66.5 3.0 0.00 0.0 0 0 2
150 2024-05-30 71 52 61.5 -2.3 T 0.0 0 3 0
151 2024-05-31 76 47 61.5 -2.6 0.00 0.0 0 3 0
152 2024-06-01 85 48 66.5 2.1 0.00 0.0 0 0 2
153 2024-06-02 84 51 67.5 2.9 0.00 0.0 0 0 3
154 2024-06-03 87 64 75.5 10.6 0.00 0.0 0 0 11
155 2024-06-04 88 58 73.0 7.8 0.00 0.0 0 0 8
156 2024-06-05 87 66 76.5 11.0 0.00 0.0 0 0 12
157 2024-06-06 83 63 73.0 7.2 0.32 0.0 0 0 8
158 2024-06-07 78 55 66.5 0.4 0.51 0.0 0 0 2
159 2024-06-08 73 54 63.5 -2.9 T 0.0 0 1 0
160 2024-06-09 72 56 64.0 -2.7 0.73 0.0 0 1 0
161 2024-06-10 71 52 61.5 -5.5 0.00 0.0 0 3 0
163 2024-06-12 74 54 64.0 -3.5 0.00 0.0 0 1 0
164 2024-06-13 86 51 68.5 0.7 0.00 0.0 0 0 4
165 2024-06-14 86 66 76.0 7.9 0.05 0.0 0 0 11
166 2024-06-15 79 56 67.5 -0.9 0.00 0.0 0 0 3
167 2024-06-16 77 47 62.0 -6.7 0.00 0.0 0 3 0
168 2024-06-17 88 63 75.5 6.6 T 0.0 0 0 11
172 2024-06-21 89 69 79.0 9.0 0.32 0.0 0 0 14
173 2024-06-22 81 70 75.5 5.3 0.53 0.0 0 0 11
174 2024-06-23 85 71 78.0 7.5 0.24 0.0 0 0 13
175 2024-06-24 76 61 68.5 -2.2 0.23 0.0 0 0 4
176 2024-06-25 88 57 72.5 1.6 T 0.0 0 0 8
177 2024-06-26 86 69 77.5 6.3 T 0.0 0 0 13
178 2024-06-27 79 60 69.5 -1.9 0.00 0.0 0 0 5
179 2024-06-28 77 50 63.5 -8.1 0.00 0.0 0 1 0
180 2024-06-29 76 59 67.5 -4.3 0.32 0.0 0 0 3
181 2024-06-30 87 64 75.5 3.5 0.02 0.0 0 0 11
182 2024-07-01 79 61 70.0 -2.1 0.12 0.0 0 0 5
183 2024-07-02 84 54 69.0 -3.3 0.00 0.0 0 0 4
184 2024-07-03 85 64 74.5 2.1 0.00 0.0 0 0 10
185 2024-07-04 85 70 77.5 4.9 T 0.0 0 0 13
186 2024-07-05 87 74 80.5 7.8 T 0.0 0 0 16
188 2024-07-07 89 68 78.5 5.6 0.00 0.0 0 0 14
192 2024-07-11 86 75 80.5 7.2 0.30 0.0 0 0 16
193 2024-07-12 85 73 79.0 5.7 0.00 0.0 0 0 14
198 2024-07-17 88 68 78.0 4.5 0.53 0.0 0 0 13
199 2024-07-18 83 65 74.0 0.5 T 0.0 0 0 9
200 2024-07-19 83 59 71.0 -2.5 0.00 0.0 0 0 6
201 2024-07-20 86 59 72.5 -1.0 0.00 0.0 0 0 8
202 2024-07-21 85 61 73.0 -0.5 0.05 0.0 0 0 8
203 2024-07-22 87 64 75.5 2.1 T 0.0 0 0 11
204 2024-07-23 83 69 76.0 2.6 0.12 0.0 0 0 11
205 2024-07-24 83 70 76.5 3.1 0.02 0.0 0 0 12
206 2024-07-25 83 66 74.5 1.1 T 0.0 0 0 10
207 2024-07-26 85 61 73.0 -0.3 0.00 0.0 0 0 8
208 2024-07-27 86 59 72.5 -0.8 0.00 0.0 0 0 8
209 2024-07-28 87 62 74.5 1.3 0.00 0.0 0 0 10
210 2024-07-29 84 69 76.5 3.3 0.01 0.0 0 0 12
212 2024-07-31 89 74 81.5 8.5 0.65 0.0 0 0 17
214 2024-08-02 87 68 77.5 4.6 0.72 0.0 0 0 13
215 2024-08-03 86 73 79.5 6.7 0.48 0.0 0 0 15
216 2024-08-04 86 73 79.5 6.7 0.11 0.0 0 0 15
217 2024-08-05 87 67 77.0 4.3 0.02 0.0 0 0 12
218 2024-08-06 73 61 67.0 -5.6 0.63 0.0 0 0 2
219 2024-08-07 78 61 69.5 -3.0 0.18 0.0 0 0 5
220 2024-08-08 74 59 66.5 -5.9 0.64 0.0 0 0 2
221 2024-08-09 82 67 74.5 2.2 2.91 0.0 0 0 10
222 2024-08-10 83 62 72.5 0.3 0.00 0.0 0 0 8
223 2024-08-11 81 56 68.5 -3.6 0.07 0.0 0 0 4
224 2024-08-12 74 55 64.5 -7.5 T 0.0 0 0 0
225 2024-08-13 82 55 68.5 -3.4 0.00 0.0 0 0 4
226 2024-08-14 86 60 73.0 1.2 0.00 0.0 0 0 8
227 2024-08-15 86 59 72.5 0.8 T 0.0 0 0 8
228 2024-08-16 86 64 75.0 3.4 0.00 0.0 0 0 10
229 2024-08-17 80 70 75.0 3.5 T 0.0 0 0 10
230 2024-08-18 76 67 71.5 0.2 0.28 0.0 0 0 7
231 2024-08-19 82 59 70.5 -0.7 1.89 0.0 0 0 6
233 2024-08-21 74 52 63.0 -7.9 T 0.0 0 2 0
234 2024-08-22 75 56 65.5 -5.2 0.00 0.0 0 0 1
235 2024-08-23 81 53 67.0 -3.6 0.00 0.0 0 0 2
236 2024-08-24 83 55 69.0 -1.4 0.00 0.0 0 0 4
237 2024-08-25 83 59 71.0 0.8 0.02 0.0 0 0 6
238 2024-08-26 84 61 72.5 2.5 0.00 0.0 0 0 8
239 2024-08-27 87 59 73.0 3.2 0.00 0.0 0 0 8
240 2024-08-28 88 65 76.5 6.9 T 0.0 0 0 12
241 2024-08-29 73 61 67.0 -2.3 0.00 0.0 0 0 2
242 2024-08-30 77 66 71.5 2.4 0.00 0.0 0 0 7
243 2024-08-31 76 63 69.5 0.7 0.15 0.0 0 0 5
244 2024-09-01 87 67 77.0 8.4 T 0.0 0 0 12
245 2024-09-02 74 51 62.5 -5.8 0.00 0.0 0 2 0
246 2024-09-03 75 45 60.0 -8.0 0.00 0.0 0 5 0
247 2024-09-04 78 50 64.0 -3.7 0.00 0.0 0 1 0
248 2024-09-05 80 53 66.5 -0.9 0.00 0.0 0 0 2
249 2024-09-06 79 49 64.0 -3.1 0.00 0.0 0 1 0
250 2024-09-07 75 54 64.5 -2.3 0.53 0.0 0 0 0
252 2024-09-09 73 48 60.5 -5.6 0.04 0.0 0 4 0
253 2024-09-10 73 51 62.0 -3.8 0.00 0.0 0 3 0
254 2024-09-11 77 47 62.0 -3.4 0.00 0.0 0 3 0
255 2024-09-12 81 51 66.0 0.9 0.00 0.0 0 0 1
256 2024-09-13 84 59 71.5 6.8 0.00 0.0 0 0 7
257 2024-09-14 85 59 72.0 7.7 0.00 0.0 0 0 7
258 2024-09-15 85 59 72.0 8.1 0.00 0.0 0 0 7
259 2024-09-16 84 57 70.5 7.0 0.00 0.0 0 0 6
260 2024-09-17 80 54 67.0 3.9 0.00 0.0 0 0 2
261 2024-09-18 79 54 66.5 3.8 0.00 0.0 0 0 2
262 2024-09-19 85 56 70.5 8.2 0.00 0.0 0 0 6
263 2024-09-20 80 57 68.5 6.6 0.00 0.0 0 0 4
264 2024-09-21 79 57 68.0 6.5 0.00 0.0 0 0 3
265 2024-09-22 76 53 64.5 3.5 0.00 0.0 0 0 0
266 2024-09-23 70 55 62.5 1.9 0.00 0.0 0 2 0
267 2024-09-24 71 47 59.0 -1.2 0.00 0.0 0 6 0
270 2024-09-27 76 59 67.5 8.6 0.00 0.0 0 0 3
271 2024-09-28 80 58 69.0 10.5 0.00 0.0 0 0 4
272 2024-09-29 72 56 64.0 6.0 0.00 0.0 0 1 0
273 2024-09-30 77 55 66.0 8.4 0.00 0.0 0 0 1
274 2024-10-01 74 54 64.0 6.8 0.00 0.0 0 1 0
276 2024-10-03 74 54 64.0 7.6 0.00 0.0 0 1 0
277 2024-10-04 74 49 61.5 5.6 0.02 0.0 0 3 0
278 2024-10-05 71 49 60.0 4.5 0.02 0.0 0 5 0
279 2024-10-06 72 41 56.5 1.4 0.00 0.0 0 8 0
280 2024-10-07 70 49 59.5 4.8 T 0.0 0 5 0
292 2024-10-19 71 34 52.5 2.4 0.00 0.0 0 12 0
293 2024-10-20 70 33 51.5 1.7 0.00 0.0 0 13 0
294 2024-10-21 80 42 61.0 11.6 0.00 0.0 0 4 0
295 2024-10-22 80 47 63.5 14.4 0.00 0.0 0 1 0
296 2024-10-23 78 55 66.5 17.8 T 0.0 0 0 2
303 2024-10-30 74 54 64.0 17.7 T 0.0 0 1 0
304 2024-10-31 79 55 67.0 21.1 0.00 0.0 0 0 2
305 2024-11-01 74 48 61.0 15.4 0.02 0.0 0 4 0
309 2024-11-05 72 52 62.0 17.8 0.00 0.0 0 3 0
310 2024-11-06 77 53 65.0 21.2 T 0.0 0 0 0

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.#

maxT.plot()
<Axes: >
../../_images/0a7afe288da006d029d96c607152bef13a91be7e20cf333fe29e427cda6ea0c3.png
minT.plot()
<Axes: >
../../_images/64b1672d30e27a7ee1289b697e38d7043455b50f7a2519ad1e6a2f3c4c9b8b8a.png

The data plotted fine, but the look could be better. First, let’s import a package, seaborn, that when imported and set using its own method, makes matplotlib’s graphs look better.#

Info on seaborn: https://seaborn.pydata.org/index.html

import seaborn as sns
sns.set()
maxT.plot()
<Axes: >
../../_images/3ea3a2d6d39707f78d7801587f9117671db8c0d106287e3d9644c864ec0114ad.png
minT.plot()
<Axes: >
../../_images/19b8e48c4e01d624a0750f222f0982db6931ecf94bfa7869cfa7ea3f0d4b52ad.png

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.#

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!#

fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red')
ax.plot (Date, minT, color='blue')
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Day of Year')
ax.set_ylabel('Temperature (°F')
Text(0, 0.5, 'Temperature (°F')
../../_images/ac6551c66cd27852ac1564cbf9900288ab686256ae3e1abdfef61f9f7bf33249.png

We will deal with this by using one of Pandas’ methods that take strings and convert them to a special type of data … not strings nor numbers, but datetime objects. Note carefully how we do this here … it is not terribly intuitive, but we’ll explain it more in an upcoming lecture/notebook on datetime. You will see though that the output column now looks a bit more date-like, with a four-digit year followed by two-digit month and date.#

Date = pd.to_datetime(Date,format="%Y-%m-%d")
Date
0     2024-01-01
1     2024-01-02
2     2024-01-03
3     2024-01-04
4     2024-01-05
5     2024-01-06
6     2024-01-07
7     2024-01-08
8     2024-01-09
9     2024-01-10
10    2024-01-11
11    2024-01-12
12    2024-01-13
13    2024-01-14
14    2024-01-15
15    2024-01-16
16    2024-01-17
17    2024-01-18
18    2024-01-19
19    2024-01-20
20    2024-01-21
21    2024-01-22
22    2024-01-23
23    2024-01-24
24    2024-01-25
25    2024-01-26
26    2024-01-27
27    2024-01-28
28    2024-01-29
29    2024-01-30
30    2024-01-31
31    2024-02-01
32    2024-02-02
33    2024-02-03
34    2024-02-04
35    2024-02-05
36    2024-02-06
37    2024-02-07
38    2024-02-08
39    2024-02-09
40    2024-02-10
41    2024-02-11
42    2024-02-12
43    2024-02-13
44    2024-02-14
45    2024-02-15
46    2024-02-16
47    2024-02-17
48    2024-02-18
49    2024-02-19
50    2024-02-20
51    2024-02-21
52    2024-02-22
53    2024-02-23
54    2024-02-24
55    2024-02-25
56    2024-02-26
57    2024-02-27
58    2024-02-28
59    2024-02-29
60    2024-03-01
61    2024-03-02
62    2024-03-03
63    2024-03-04
64    2024-03-05
65    2024-03-06
66    2024-03-07
67    2024-03-08
68    2024-03-09
69    2024-03-10
70    2024-03-11
71    2024-03-12
72    2024-03-13
73    2024-03-14
74    2024-03-15
75    2024-03-16
76    2024-03-17
77    2024-03-18
78    2024-03-19
79    2024-03-20
80    2024-03-21
81    2024-03-22
82    2024-03-23
83    2024-03-24
84    2024-03-25
85    2024-03-26
86    2024-03-27
87    2024-03-28
88    2024-03-29
89    2024-03-30
90    2024-03-31
91    2024-04-01
92    2024-04-02
93    2024-04-03
94    2024-04-04
95    2024-04-05
96    2024-04-06
97    2024-04-07
98    2024-04-08
99    2024-04-09
100   2024-04-10
101   2024-04-11
102   2024-04-12
103   2024-04-13
104   2024-04-14
105   2024-04-15
106   2024-04-16
107   2024-04-17
108   2024-04-18
109   2024-04-19
110   2024-04-20
111   2024-04-21
112   2024-04-22
113   2024-04-23
114   2024-04-24
115   2024-04-25
116   2024-04-26
117   2024-04-27
118   2024-04-28
119   2024-04-29
120   2024-04-30
121   2024-05-01
122   2024-05-02
123   2024-05-03
124   2024-05-04
125   2024-05-05
126   2024-05-06
127   2024-05-07
128   2024-05-08
129   2024-05-09
130   2024-05-10
131   2024-05-11
132   2024-05-12
133   2024-05-13
134   2024-05-14
135   2024-05-15
136   2024-05-16
137   2024-05-17
138   2024-05-18
139   2024-05-19
140   2024-05-20
141   2024-05-21
142   2024-05-22
143   2024-05-23
144   2024-05-24
145   2024-05-25
146   2024-05-26
147   2024-05-27
148   2024-05-28
149   2024-05-29
150   2024-05-30
151   2024-05-31
152   2024-06-01
153   2024-06-02
154   2024-06-03
155   2024-06-04
156   2024-06-05
157   2024-06-06
158   2024-06-07
159   2024-06-08
160   2024-06-09
161   2024-06-10
162   2024-06-11
163   2024-06-12
164   2024-06-13
165   2024-06-14
166   2024-06-15
167   2024-06-16
168   2024-06-17
169   2024-06-18
170   2024-06-19
171   2024-06-20
172   2024-06-21
173   2024-06-22
174   2024-06-23
175   2024-06-24
176   2024-06-25
177   2024-06-26
178   2024-06-27
179   2024-06-28
180   2024-06-29
181   2024-06-30
182   2024-07-01
183   2024-07-02
184   2024-07-03
185   2024-07-04
186   2024-07-05
187   2024-07-06
188   2024-07-07
189   2024-07-08
190   2024-07-09
191   2024-07-10
192   2024-07-11
193   2024-07-12
194   2024-07-13
195   2024-07-14
196   2024-07-15
197   2024-07-16
198   2024-07-17
199   2024-07-18
200   2024-07-19
201   2024-07-20
202   2024-07-21
203   2024-07-22
204   2024-07-23
205   2024-07-24
206   2024-07-25
207   2024-07-26
208   2024-07-27
209   2024-07-28
210   2024-07-29
211   2024-07-30
212   2024-07-31
213   2024-08-01
214   2024-08-02
215   2024-08-03
216   2024-08-04
217   2024-08-05
218   2024-08-06
219   2024-08-07
220   2024-08-08
221   2024-08-09
222   2024-08-10
223   2024-08-11
224   2024-08-12
225   2024-08-13
226   2024-08-14
227   2024-08-15
228   2024-08-16
229   2024-08-17
230   2024-08-18
231   2024-08-19
232   2024-08-20
233   2024-08-21
234   2024-08-22
235   2024-08-23
236   2024-08-24
237   2024-08-25
238   2024-08-26
239   2024-08-27
240   2024-08-28
241   2024-08-29
242   2024-08-30
243   2024-08-31
244   2024-09-01
245   2024-09-02
246   2024-09-03
247   2024-09-04
248   2024-09-05
249   2024-09-06
250   2024-09-07
251   2024-09-08
252   2024-09-09
253   2024-09-10
254   2024-09-11
255   2024-09-12
256   2024-09-13
257   2024-09-14
258   2024-09-15
259   2024-09-16
260   2024-09-17
261   2024-09-18
262   2024-09-19
263   2024-09-20
264   2024-09-21
265   2024-09-22
266   2024-09-23
267   2024-09-24
268   2024-09-25
269   2024-09-26
270   2024-09-27
271   2024-09-28
272   2024-09-29
273   2024-09-30
274   2024-10-01
275   2024-10-02
276   2024-10-03
277   2024-10-04
278   2024-10-05
279   2024-10-06
280   2024-10-07
281   2024-10-08
282   2024-10-09
283   2024-10-10
284   2024-10-11
285   2024-10-12
286   2024-10-13
287   2024-10-14
288   2024-10-15
289   2024-10-16
290   2024-10-17
291   2024-10-18
292   2024-10-19
293   2024-10-20
294   2024-10-21
295   2024-10-22
296   2024-10-23
297   2024-10-24
298   2024-10-25
299   2024-10-26
300   2024-10-27
301   2024-10-28
302   2024-10-29
303   2024-10-30
304   2024-10-31
305   2024-11-01
306   2024-11-02
307   2024-11-03
308   2024-11-04
309   2024-11-05
310   2024-11-06
311   2024-11-07
312   2024-11-08
313   2024-11-09
314   2024-11-10
315   2024-11-11
316   2024-11-12
317   2024-11-13
318   2024-11-14
319   2024-11-15
320   2024-11-16
321   2024-11-17
322   2024-11-18
323   2024-11-19
324   2024-11-20
325   2024-11-21
326   2024-11-22
327   2024-11-23
328   2024-11-24
329   2024-11-25
330   2024-11-26
331   2024-11-27
332   2024-11-28
333   2024-11-29
334   2024-11-30
335   2024-12-01
336   2024-12-02
337   2024-12-03
338   2024-12-04
339   2024-12-05
340   2024-12-06
341   2024-12-07
342   2024-12-08
343   2024-12-09
344   2024-12-10
345   2024-12-11
346   2024-12-12
347   2024-12-13
348   2024-12-14
349   2024-12-15
350   2024-12-16
351   2024-12-17
352   2024-12-18
353   2024-12-19
354   2024-12-20
355   2024-12-21
356   2024-12-22
357   2024-12-23
358   2024-12-24
359   2024-12-25
360   2024-12-26
361   2024-12-27
362   2024-12-28
363   2024-12-29
364   2024-12-30
365   2024-12-31
Name: DATE, dtype: datetime64[ns]

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.#

from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator
fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red',label = "Max T")
ax.plot (Date, minT, color='blue', label = "Min T")
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature (°F)' )
ax.xaxis.set_major_locator(MonthLocator(interval=1))
dateFmt = DateFormatter('%b %d')
ax.xaxis.set_major_formatter(dateFmt)
ax.legend (loc="best")
<matplotlib.legend.Legend at 0x146ec84bf020>
../../_images/4f69ad623b2556a2709edebe669127f814053ed4c6b7706d6432813f045d2725.png

Let’s save our beautiful graphic to disk.#

fig.savefig (f'albTemps{year}.png')

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use matplotlib’s hist method.#

# %load '/spare11/atm350/common/mar04/01b.py'
# Create a figure and size it.
fig, ax = plt.subplots(figsize=(15,10))
# Create a histogram of our data series and divide it in to 10 bins.
ax.hist(maxT, bins=10, color='k', alpha=0.3)
(array([ 2.,  9., 31., 42., 49., 40., 35., 57., 79., 22.]),
 array([11. , 19.5, 28. , 36.5, 45. , 53.5, 62. , 70.5, 79. , 87.5, 96. ]),
 <BarContainer object of 10 artists>)
../../_images/905bef964f106ab7bca79c5bcf8bfe660accf40380b1ccbd76f38f40337c4df9.png

Ok, but the 10 bins were autoselected. Let’s customize our call to the hist method by specifying the bounds of each of our bins.#

How can we learn more about how to customize this call? Append a ? to the name of the method.#

ax.hist?

Revise the call to ax.hist, and also draw tick marks that align with the bounds of the histogram’s bins.#

# %load '/spare11/atm350/common/mar04/01c.py'
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)
ax.xaxis.set_major_locator(plt.MultipleLocator(10))
../../_images/e7f29ee919b2cd113c6e9fdb586feb7aa989dcd72a958b7f3f9f1e27023d8d73.png

Save this histogram to disk.#

fig.savefig("maxT_hist.png")

Use the describe method on the maximum temperature series to reveal some simple statistical properties.

maxT.describe()
count    366.000000
mean      62.338799
std       19.533743
min       11.000000
25%       46.000000
50%       65.000000
75%       79.750000
max       96.000000
Name: MAX, dtype: float64

References#

  1. Project Pythia: Pandas

  2. The Carpentries: Pandas

  3. Matplotlib: Setting x/y-axis tick label properties