Pandas Notebook 1, ATM350 Spring 2025

Pandas Notebook 1, ATM350 Spring 2025#

Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2024, previously downloaded and reformatted from the xmACIS2 climate data portal.

We will use the Pandas library to read and analyze the data. We will also use the Matplotlib package to visualize it.

Motivating Science Questions:#

How can we analyze and display tabular climate data for a site?
What was the yearly trace of max/min temperatures for Albany, NY last year?
What was the most common 10-degree maximum temperature range for Albany, NY last year?

# import Pandas and Numpy, and use their conventional two-letter abbreviations when we
# use methods from these packages. Also, import matplotlib's plotting package, using its 
# standard abbreviation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.#

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.#

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!#

year = 2024
file = f'/spare11/atm350/common/data/climo_alb_{year}.csv'
! ls -l {file}

-rw-r--r-- 1 ktyle faculty 15184 Mar  5 19:20 /spare11/atm350/common/data/climo_alb_2024.csv

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).#

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`#

df = pd.read_csv(file, dtype='string')

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!#

df

	DATE	MAX	MIN	AVG	DEP	PCP	SNW	DPTH	HDD	CDD
0	2024-01-01	36	25	30.5	4.2	0.00	0.0	0	34	0
1	2024-01-02	38	24	31.0	4.9	0.00	0.0	0	34	0
2	2024-01-03	40	32	36.0	10.1	0.00	0.0	0	29	0
3	2024-01-04	43	19	31.0	5.3	T	T	0	34	0
4	2024-01-05	31	16	23.5	-2.0	T	T	0	41	0
...	...	...	...	...	...	...	...	...	...	...
361	2024-12-27	27	10	18.5	-8.9	0.03	0.3	2	46	0
362	2024-12-28	35	22	28.5	1.3	0.04	0.0	2	36	0
363	2024-12-29	57	33	45.0	18.1	0.04	0.0	1	20	0
364	2024-12-30	58	44	51.0	24.3	0.17	0.0	0	14	0
365	2024-12-31	51	30	40.5	14.0	T	0.0	0	24	0

366 rows × 10 columns

Our dataframe has 365 or 366 rows (corresponding to all the days in the year) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.#

df.shape

(366, 10)

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.#

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!#

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.#

nRows = df.shape[0]
print (f"Number of rows = {nRows}")

Number of rows = 366

Let’s do the same for the # of columns.#

nCols = df.shape[1]
print (f"Number of columns = {nCols}")

Number of columns = 10

To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just Date … but on the right side of the assignment statement, we have to use the exact name of the column.#

Print out what this object looks like.

Date = df['DATE']
print (Date)

    2024-01-01
    2024-01-02
    2024-01-03
    2024-01-04
    2024-01-05
          ...    
  2024-12-27
  2024-12-28
  2024-12-29
  2024-12-30
  2024-12-31
Name: DATE, Length: 366, dtype: string

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).#

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.#

print (Date[:30])

   2024-01-01
   2024-01-02
   2024-01-03
   2024-01-04
   2024-01-05
   2024-01-06
   2024-01-07
   2024-01-08
   2024-01-09
   2024-01-10
  2024-01-11
  2024-01-12
  2024-01-13
  2024-01-14
  2024-01-15
  2024-01-16
  2024-01-17
  2024-01-18
  2024-01-19
  2024-01-20
  2024-01-21
  2024-01-22
  2024-01-23
  2024-01-24
  2024-01-25
  2024-01-26
  2024-01-27
  2024-01-28
  2024-01-29
  2024-01-30
Name: DATE, dtype: string

Similarly, let’s print out the last, or 365th row (Why is it 365, not 366???)#

print(Date[365])

2024-12-31

Note that using -1 as the last index doesn’t work!

print(Date[-1])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key)
    412 try:
--> 413     return self._range.index(new_key)
    414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 print(Date[-1])

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:1121, in Series.__getitem__(self, key)
   1118     return self._values[key]
   1120 elif key_is_scalar:
-> 1121     return self._get_value(key)
   1123 # Convert generator to list before going through hashable part
   1124 # (We will iterate through the generator there to check for slices)
   1125 if is_iterator(key):

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:1237, in Series._get_value(self, label, takeable)
   1234     return self._values[label]
   1236 # Similar to Index.get_value, but we do not fall back to positional
-> 1237 loc = self.index.get_loc(label)
   1239 if is_integer(loc):
   1240     return self._values[loc]

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key)
    413         return self._range.index(new_key)
    414     except ValueError as err:
--> 415         raise KeyError(key) from err
    416 if isinstance(key, Hashable):
    417     raise KeyError(key)

KeyError: -1

However, using a negative value as part of a slice does work:

print(Date[-9:])

  2024-12-23
  2024-12-24
  2024-12-25
  2024-12-26
  2024-12-27
  2024-12-28
  2024-12-29
  2024-12-30
  2024-12-31
Name: DATE, dtype: string

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).#

TIP: After you have tried on your own, you can uncomment the first line of the cell below and re-run to load the solution.

# %load /spare11/atm350/common/mar04/01a.py
maxT = df['MAX']
minT = df['MIN']

maxT

    36
    38
    40
    43
    31
       ..
  27
  35
  57
  58
  51
Name: MAX, Length: 366, dtype: string

minT

    25
    24
    32
    19
    16
       ..
  10
  22
  33
  44
  30
Name: MIN, Length: 366, dtype: string

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!#

hotDays = maxT >= 90

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 hotDays = maxT >= 90

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/series.py:6119, in Series._cmp_method(self, other, op)
   6116 lvalues = self._values
   6117 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 6119 res_values = ops.comparison_op(lvalues, rvalues, op)
   6121 return self._construct_result(res_values, name=res_name)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:330, in comparison_op(left, right, op)
    321         raise ValueError(
    322             "Lengths must match to compare", lvalues.shape, rvalues.shape
    323         )
    325 if should_extension_dispatch(lvalues, rvalues) or (
    326     (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
    327     and lvalues.dtype != object
    328 ):
    329     # Call the method on lvalues
--> 330     res_values = op(lvalues, rvalues)
    332 elif is_scalar(rvalues) and isna(rvalues):  # TODO: but not pd.NA?
    333     # numpy does not like comparisons vs None
    334     if op is operator.ne:

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/jan25/envs/jan25_env/lib/python3.12/site-packages/pandas/core/arrays/string_.py:593, in StringArray._cmp_method(self, other, op)
    590 else:
    591     # logical
    592     result = np.zeros(len(self._ndarray), dtype="bool")
--> 593     result[valid] = op(self._ndarray[valid], other)
    594     return BooleanArray(result, mask)

TypeError: '>=' not supported between instances of 'str' and 'int'

Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to string! We need to change the type of maxT to a numerical value. Let’s use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We’ll do the same for the minimum temp.#

maxT = maxT.astype("float32")
minT = minT.astype("float32")

maxT

    36.0
    38.0
    40.0
    43.0
    31.0
       ... 
  27.0
  35.0
  57.0
  58.0
  51.0
Name: MAX, Length: 366, dtype: float32

hotDays = maxT >= 90

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.#

print (hotDays)

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: MAX, Length: 366, dtype: bool

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!#

print (hotDays[180:195])

  False
  False
  False
  False
  False
  False
  False
   True
  False
   True
   True
   True
  False
  False
   True
Name: MAX, dtype: bool

Now, let’s get a count of the # of days meeting this temperature criterion. Note carefully that we first have to express our set of days exceeding the threshold as a Pandas series. Then, recall that to get a count of the # of rows, we take the first (0th) element of the array returned by a call to the `shape` method.#

df[maxT >= 90]

	DATE	MAX	MIN	AVG	DEP	PCP	CDD
169	2024-06-18	94	70	82.0	12.8	0.00	17
170	2024-06-19	94	76	85.0	15.5	T	20
171	2024-06-20	96	73	84.5	14.8	0.01	20
187	2024-07-06	93	72	82.5	9.7	0.16	18
189	2024-07-08	92	65	78.5	5.5	T	14
190	2024-07-09	95	76	85.5	12.4	T	21
191	2024-07-10	94	76	85.0	11.8	T	20
194	2024-07-13	91	70	80.5	7.1	0.00	16
195	2024-07-14	91	63	77.0	3.6	0.00	12
196	2024-07-15	94	74	84.0	10.6	0.01	19
197	2024-07-16	93	72	82.5	9.0	0.60	18
211	2024-07-30	91	71	81.0	7.9	0.00	16
213	2024-08-01	92	69	80.5	7.5	0.00	16

df[maxT >= 90].shape[0]

Let’s reverse the sense of the test, and get its count. The two counts should add up to the total number of days in the year!#

df[maxT < 90].shape[0]

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.#

df[(maxT< 90) & (maxT>=70)].shape[0]

Let’s show all the climate data for all these “pleasantly warm” days!#

pleasant = df[(maxT< 90) & (maxT>=70)]
pleasant

	DATE	MAX	MIN	AVG	DEP	PCP	SNW	DPTH	HDD	CDD
99	2024-04-09	77	37	57.0	11.9	0.00	0.0	0	8	0
113	2024-04-23	70	33	51.5	0.1	T	0.0	0	13	0
118	2024-04-28	75	48	61.5	8.0	0.05	0.0	0	3	0
120	2024-04-30	74	51	62.5	8.2	0.43	0.0	0	2	0
121	2024-05-01	75	54	64.5	9.8	T	0.0	0	0	0
...	...	...	...	...	...	...	...	...	...	...
303	2024-10-30	74	54	64.0	17.7	T	0.0	0	1	0
304	2024-10-31	79	55	67.0	21.1	0.00	0.0	0	0	2
305	2024-11-01	74	48	61.0	15.4	0.02	0.0	0	4	0
309	2024-11-05	72	52	62.0	17.8	0.00	0.0	0	3	0
310	2024-11-06	77	53	65.0	21.2	T	0.0	0	0	0

149 rows × 10 columns

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.#

pd.set_option ('display.max_rows', None)
pleasant

	DATE	MAX	MIN	AVG	DEP	PCP	HDD	CDD
99	2024-04-09	77	37	57.0	11.9	0.00	8	0
113	2024-04-23	70	33	51.5	0.1	T	13	0
118	2024-04-28	75	48	61.5	8.0	0.05	3	0
120	2024-04-30	74	51	62.5	8.2	0.43	2	0
121	2024-05-01	75	54	64.5	9.8	T	0	0
122	2024-05-02	76	55	65.5	10.4	0.00	0	1
123	2024-05-03	73	50	61.5	6.0	0.00	3	0
124	2024-05-04	74	53	63.5	7.6	0.00	1	0
126	2024-05-06	78	52	65.0	8.4	T	0	0
127	2024-05-07	79	44	61.5	4.6	0.00	3	0
128	2024-05-08	82	52	67.0	9.7	0.48	0	2
133	2024-05-13	71	39	55.0	-3.9	0.02	10	0
134	2024-05-14	83	58	70.5	11.3	0.00	0	6
136	2024-05-16	73	56	64.5	4.7	T	0	0
137	2024-05-17	76	50	63.0	2.9	0.00	2	0
138	2024-05-18	77	56	66.5	6.1	0.00	0	2
139	2024-05-19	80	57	68.5	7.8	0.00	0	4
140	2024-05-20	86	58	72.0	11.0	0.00	0	7
141	2024-05-21	85	63	74.0	12.8	0.39	0	9
142	2024-05-22	89	64	76.5	15.0	0.00	0	12
143	2024-05-23	80	56	68.0	6.2	0.04	0	3
144	2024-05-24	82	52	67.0	4.9	0.00	0	2
145	2024-05-25	81	51	66.0	3.6	T	0	1
146	2024-05-26	84	60	72.0	9.3	0.00	0	7
147	2024-05-27	81	69	75.0	12.1	0.56	0	10
148	2024-05-28	79	59	69.0	5.8	T	0	4
149	2024-05-29	76	57	66.5	3.0	0.00	0	2
150	2024-05-30	71	52	61.5	-2.3	T	3	0
151	2024-05-31	76	47	61.5	-2.6	0.00	3	0
152	2024-06-01	85	48	66.5	2.1	0.00	0	2
153	2024-06-02	84	51	67.5	2.9	0.00	0	3
154	2024-06-03	87	64	75.5	10.6	0.00	0	11
155	2024-06-04	88	58	73.0	7.8	0.00	0	8
156	2024-06-05	87	66	76.5	11.0	0.00	0	12
157	2024-06-06	83	63	73.0	7.2	0.32	0	8
158	2024-06-07	78	55	66.5	0.4	0.51	0	2
159	2024-06-08	73	54	63.5	-2.9	T	1	0
160	2024-06-09	72	56	64.0	-2.7	0.73	1	0
161	2024-06-10	71	52	61.5	-5.5	0.00	3	0
163	2024-06-12	74	54	64.0	-3.5	0.00	1	0
164	2024-06-13	86	51	68.5	0.7	0.00	0	4
165	2024-06-14	86	66	76.0	7.9	0.05	0	11
166	2024-06-15	79	56	67.5	-0.9	0.00	0	3
167	2024-06-16	77	47	62.0	-6.7	0.00	3	0
168	2024-06-17	88	63	75.5	6.6	T	0	11
172	2024-06-21	89	69	79.0	9.0	0.32	0	14
173	2024-06-22	81	70	75.5	5.3	0.53	0	11
174	2024-06-23	85	71	78.0	7.5	0.24	0	13
175	2024-06-24	76	61	68.5	-2.2	0.23	0	4
176	2024-06-25	88	57	72.5	1.6	T	0	8
177	2024-06-26	86	69	77.5	6.3	T	0	13
178	2024-06-27	79	60	69.5	-1.9	0.00	0	5
179	2024-06-28	77	50	63.5	-8.1	0.00	1	0
180	2024-06-29	76	59	67.5	-4.3	0.32	0	3
181	2024-06-30	87	64	75.5	3.5	0.02	0	11
182	2024-07-01	79	61	70.0	-2.1	0.12	0	5
183	2024-07-02	84	54	69.0	-3.3	0.00	0	4
184	2024-07-03	85	64	74.5	2.1	0.00	0	10
185	2024-07-04	85	70	77.5	4.9	T	0	13
186	2024-07-05	87	74	80.5	7.8	T	0	16
188	2024-07-07	89	68	78.5	5.6	0.00	0	14
192	2024-07-11	86	75	80.5	7.2	0.30	0	16
193	2024-07-12	85	73	79.0	5.7	0.00	0	14
198	2024-07-17	88	68	78.0	4.5	0.53	0	13
199	2024-07-18	83	65	74.0	0.5	T	0	9
200	2024-07-19	83	59	71.0	-2.5	0.00	0	6
201	2024-07-20	86	59	72.5	-1.0	0.00	0	8
202	2024-07-21	85	61	73.0	-0.5	0.05	0	8
203	2024-07-22	87	64	75.5	2.1	T	0	11
204	2024-07-23	83	69	76.0	2.6	0.12	0	11
205	2024-07-24	83	70	76.5	3.1	0.02	0	12
206	2024-07-25	83	66	74.5	1.1	T	0	10
207	2024-07-26	85	61	73.0	-0.3	0.00	0	8
208	2024-07-27	86	59	72.5	-0.8	0.00	0	8
209	2024-07-28	87	62	74.5	1.3	0.00	0	10
210	2024-07-29	84	69	76.5	3.3	0.01	0	12
212	2024-07-31	89	74	81.5	8.5	0.65	0	17
214	2024-08-02	87	68	77.5	4.6	0.72	0	13
215	2024-08-03	86	73	79.5	6.7	0.48	0	15
216	2024-08-04	86	73	79.5	6.7	0.11	0	15
217	2024-08-05	87	67	77.0	4.3	0.02	0	12
218	2024-08-06	73	61	67.0	-5.6	0.63	0	2
219	2024-08-07	78	61	69.5	-3.0	0.18	0	5
220	2024-08-08	74	59	66.5	-5.9	0.64	0	2
221	2024-08-09	82	67	74.5	2.2	2.91	0	10
222	2024-08-10	83	62	72.5	0.3	0.00	0	8
223	2024-08-11	81	56	68.5	-3.6	0.07	0	4
224	2024-08-12	74	55	64.5	-7.5	T	0	0
225	2024-08-13	82	55	68.5	-3.4	0.00	0	4
226	2024-08-14	86	60	73.0	1.2	0.00	0	8
227	2024-08-15	86	59	72.5	0.8	T	0	8
228	2024-08-16	86	64	75.0	3.4	0.00	0	10
229	2024-08-17	80	70	75.0	3.5	T	0	10
230	2024-08-18	76	67	71.5	0.2	0.28	0	7
231	2024-08-19	82	59	70.5	-0.7	1.89	0	6
233	2024-08-21	74	52	63.0	-7.9	T	2	0
234	2024-08-22	75	56	65.5	-5.2	0.00	0	1
235	2024-08-23	81	53	67.0	-3.6	0.00	0	2
236	2024-08-24	83	55	69.0	-1.4	0.00	0	4
237	2024-08-25	83	59	71.0	0.8	0.02	0	6
238	2024-08-26	84	61	72.5	2.5	0.00	0	8
239	2024-08-27	87	59	73.0	3.2	0.00	0	8
240	2024-08-28	88	65	76.5	6.9	T	0	12
241	2024-08-29	73	61	67.0	-2.3	0.00	0	2
242	2024-08-30	77	66	71.5	2.4	0.00	0	7
243	2024-08-31	76	63	69.5	0.7	0.15	0	5
244	2024-09-01	87	67	77.0	8.4	T	0	12
245	2024-09-02	74	51	62.5	-5.8	0.00	2	0
246	2024-09-03	75	45	60.0	-8.0	0.00	5	0
247	2024-09-04	78	50	64.0	-3.7	0.00	1	0
248	2024-09-05	80	53	66.5	-0.9	0.00	0	2
249	2024-09-06	79	49	64.0	-3.1	0.00	1	0
250	2024-09-07	75	54	64.5	-2.3	0.53	0	0
252	2024-09-09	73	48	60.5	-5.6	0.04	4	0
253	2024-09-10	73	51	62.0	-3.8	0.00	3	0
254	2024-09-11	77	47	62.0	-3.4	0.00	3	0
255	2024-09-12	81	51	66.0	0.9	0.00	0	1
256	2024-09-13	84	59	71.5	6.8	0.00	0	7
257	2024-09-14	85	59	72.0	7.7	0.00	0	7
258	2024-09-15	85	59	72.0	8.1	0.00	0	7
259	2024-09-16	84	57	70.5	7.0	0.00	0	6
260	2024-09-17	80	54	67.0	3.9	0.00	0	2
261	2024-09-18	79	54	66.5	3.8	0.00	0	2
262	2024-09-19	85	56	70.5	8.2	0.00	0	6
263	2024-09-20	80	57	68.5	6.6	0.00	0	4
264	2024-09-21	79	57	68.0	6.5	0.00	0	3
265	2024-09-22	76	53	64.5	3.5	0.00	0	0
266	2024-09-23	70	55	62.5	1.9	0.00	2	0
267	2024-09-24	71	47	59.0	-1.2	0.00	6	0
270	2024-09-27	76	59	67.5	8.6	0.00	0	3
271	2024-09-28	80	58	69.0	10.5	0.00	0	4
272	2024-09-29	72	56	64.0	6.0	0.00	1	0
273	2024-09-30	77	55	66.0	8.4	0.00	0	1
274	2024-10-01	74	54	64.0	6.8	0.00	1	0
276	2024-10-03	74	54	64.0	7.6	0.00	1	0
277	2024-10-04	74	49	61.5	5.6	0.02	3	0
278	2024-10-05	71	49	60.0	4.5	0.02	5	0
279	2024-10-06	72	41	56.5	1.4	0.00	8	0
280	2024-10-07	70	49	59.5	4.8	T	5	0
292	2024-10-19	71	34	52.5	2.4	0.00	12	0
293	2024-10-20	70	33	51.5	1.7	0.00	13	0
294	2024-10-21	80	42	61.0	11.6	0.00	4	0
295	2024-10-22	80	47	63.5	14.4	0.00	1	0
296	2024-10-23	78	55	66.5	17.8	T	0	2
303	2024-10-30	74	54	64.0	17.7	T	1	0
304	2024-10-31	79	55	67.0	21.1	0.00	0	2
305	2024-11-01	74	48	61.0	15.4	0.02	4	0
309	2024-11-05	72	52	62.0	17.8	0.00	3	0
310	2024-11-06	77	53	65.0	21.2	T	0	0

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.#

maxT.plot()

<Axes: >

../../_images/0a7afe288da006d029d96c607152bef13a91be7e20cf333fe29e427cda6ea0c3.png

minT.plot()

<Axes: >

../../_images/64b1672d30e27a7ee1289b697e38d7043455b50f7a2519ad1e6a2f3c4c9b8b8a.png

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.#

Info on seaborn: https://seaborn.pydata.org/index.html

import seaborn as sns
sns.set()

maxT.plot()

<Axes: >

../../_images/3ea3a2d6d39707f78d7801587f9117671db8c0d106287e3d9644c864ec0114ad.png

minT.plot()

<Axes: >

../../_images/19b8e48c4e01d624a0750f222f0982db6931ecf94bfa7869cfa7ea3f0d4b52ad.png

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.#

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!#

fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red')
ax.plot (Date, minT, color='blue')
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Day of Year')
ax.set_ylabel('Temperature (°F')

Text(0, 0.5, 'Temperature (°F')

../../_images/ac6551c66cd27852ac1564cbf9900288ab686256ae3e1abdfef61f9f7bf33249.png

We will deal with this by using one of Pandas’ methods that take strings and convert them to a special type of data … not strings nor numbers, but datetime objects. Note carefully how we do this here … it is not terribly intuitive, but we’ll explain it more in an upcoming lecture/notebook on `datetime`. You will see though that the output column now looks a bit more date-like, with a four-digit year followed by two-digit month and date.#

Date = pd.to_datetime(Date,format="%Y-%m-%d")
Date

   2024-01-01
   2024-01-02
   2024-01-03
   2024-01-04
   2024-01-05
   2024-01-06
   2024-01-07
   2024-01-08
   2024-01-09
   2024-01-10
  2024-01-11
  2024-01-12
  2024-01-13
  2024-01-14
  2024-01-15
  2024-01-16
  2024-01-17
  2024-01-18
  2024-01-19
  2024-01-20
  2024-01-21
  2024-01-22
  2024-01-23
  2024-01-24
  2024-01-25
  2024-01-26
  2024-01-27
  2024-01-28
  2024-01-29
  2024-01-30
  2024-01-31
  2024-02-01
  2024-02-02
  2024-02-03
  2024-02-04
  2024-02-05
  2024-02-06
  2024-02-07
  2024-02-08
  2024-02-09
  2024-02-10
  2024-02-11
  2024-02-12
  2024-02-13
  2024-02-14
  2024-02-15
  2024-02-16
  2024-02-17
  2024-02-18
  2024-02-19
  2024-02-20
  2024-02-21
  2024-02-22
  2024-02-23
  2024-02-24
  2024-02-25
  2024-02-26
  2024-02-27
  2024-02-28
  2024-02-29
  2024-03-01
  2024-03-02
  2024-03-03
  2024-03-04
  2024-03-05
  2024-03-06
  2024-03-07
  2024-03-08
  2024-03-09
  2024-03-10
  2024-03-11
  2024-03-12
  2024-03-13
  2024-03-14
  2024-03-15
  2024-03-16
  2024-03-17
  2024-03-18
  2024-03-19
  2024-03-20
  2024-03-21
  2024-03-22
  2024-03-23
  2024-03-24
  2024-03-25
  2024-03-26
  2024-03-27
  2024-03-28
  2024-03-29
  2024-03-30
  2024-03-31
  2024-04-01
  2024-04-02
  2024-04-03
  2024-04-04
  2024-04-05
  2024-04-06
  2024-04-07
  2024-04-08
  2024-04-09
 2024-04-10
 2024-04-11
 2024-04-12
 2024-04-13
 2024-04-14
 2024-04-15
 2024-04-16
 2024-04-17
 2024-04-18
 2024-04-19
 2024-04-20
 2024-04-21
 2024-04-22
 2024-04-23
 2024-04-24
 2024-04-25
 2024-04-26
 2024-04-27
 2024-04-28
 2024-04-29
 2024-04-30
 2024-05-01
 2024-05-02
 2024-05-03
 2024-05-04
 2024-05-05
 2024-05-06
 2024-05-07
 2024-05-08
 2024-05-09
 2024-05-10
 2024-05-11
 2024-05-12
 2024-05-13
 2024-05-14
 2024-05-15
 2024-05-16
 2024-05-17
 2024-05-18
 2024-05-19
 2024-05-20
 2024-05-21
 2024-05-22
 2024-05-23
 2024-05-24
 2024-05-25
 2024-05-26
 2024-05-27
 2024-05-28
 2024-05-29
 2024-05-30
 2024-05-31
 2024-06-01
 2024-06-02
 2024-06-03
 2024-06-04
 2024-06-05
 2024-06-06
 2024-06-07
 2024-06-08
 2024-06-09
 2024-06-10
 2024-06-11
 2024-06-12
 2024-06-13
 2024-06-14
 2024-06-15
 2024-06-16
 2024-06-17
 2024-06-18
 2024-06-19
 2024-06-20
 2024-06-21
 2024-06-22
 2024-06-23
 2024-06-24
 2024-06-25
 2024-06-26
 2024-06-27
 2024-06-28
 2024-06-29
 2024-06-30
 2024-07-01
 2024-07-02
 2024-07-03
 2024-07-04
 2024-07-05
 2024-07-06
 2024-07-07
 2024-07-08
 2024-07-09
 2024-07-10
 2024-07-11
 2024-07-12
 2024-07-13
 2024-07-14
 2024-07-15
 2024-07-16
 2024-07-17
 2024-07-18
 2024-07-19
 2024-07-20
 2024-07-21
 2024-07-22
 2024-07-23
 2024-07-24
 2024-07-25
 2024-07-26
 2024-07-27
 2024-07-28
 2024-07-29
 2024-07-30
 2024-07-31
 2024-08-01
 2024-08-02
 2024-08-03
 2024-08-04
 2024-08-05
 2024-08-06
 2024-08-07
 2024-08-08
 2024-08-09
 2024-08-10
 2024-08-11
 2024-08-12
 2024-08-13
 2024-08-14
 2024-08-15
 2024-08-16
 2024-08-17
 2024-08-18
 2024-08-19
 2024-08-20
 2024-08-21
 2024-08-22
 2024-08-23
 2024-08-24
 2024-08-25
 2024-08-26
 2024-08-27
 2024-08-28
 2024-08-29
 2024-08-30
 2024-08-31
 2024-09-01
 2024-09-02
 2024-09-03
 2024-09-04
 2024-09-05
 2024-09-06
 2024-09-07
 2024-09-08
 2024-09-09
 2024-09-10
 2024-09-11
 2024-09-12
 2024-09-13
 2024-09-14
 2024-09-15
 2024-09-16
 2024-09-17
 2024-09-18
 2024-09-19
 2024-09-20
 2024-09-21
 2024-09-22
 2024-09-23
 2024-09-24
 2024-09-25
 2024-09-26
 2024-09-27
 2024-09-28
 2024-09-29
 2024-09-30
 2024-10-01
 2024-10-02
 2024-10-03
 2024-10-04
 2024-10-05
 2024-10-06
 2024-10-07
 2024-10-08
 2024-10-09
 2024-10-10
 2024-10-11
 2024-10-12
 2024-10-13
 2024-10-14
 2024-10-15
 2024-10-16
 2024-10-17
 2024-10-18
 2024-10-19
 2024-10-20
 2024-10-21
 2024-10-22
 2024-10-23
 2024-10-24
 2024-10-25
 2024-10-26
 2024-10-27
 2024-10-28
 2024-10-29
 2024-10-30
 2024-10-31
 2024-11-01
 2024-11-02
 2024-11-03
 2024-11-04
 2024-11-05
 2024-11-06
 2024-11-07
 2024-11-08
 2024-11-09
 2024-11-10
 2024-11-11
 2024-11-12
 2024-11-13
 2024-11-14
 2024-11-15
 2024-11-16
 2024-11-17
 2024-11-18
 2024-11-19
 2024-11-20
 2024-11-21
 2024-11-22
 2024-11-23
 2024-11-24
 2024-11-25
 2024-11-26
 2024-11-27
 2024-11-28
 2024-11-29
 2024-11-30
 2024-12-01
 2024-12-02
 2024-12-03
 2024-12-04
 2024-12-05
 2024-12-06
 2024-12-07
 2024-12-08
 2024-12-09
 2024-12-10
 2024-12-11
 2024-12-12
 2024-12-13
 2024-12-14
 2024-12-15
 2024-12-16
 2024-12-17
 2024-12-18
 2024-12-19
 2024-12-20
 2024-12-21
 2024-12-22
 2024-12-23
 2024-12-24
 2024-12-25
 2024-12-26
 2024-12-27
 2024-12-28
 2024-12-29
 2024-12-30
 2024-12-31
Name: DATE, dtype: datetime64[ns]

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.#

from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator

fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red',label = "Max T")
ax.plot (Date, minT, color='blue', label = "Min T")
ax.set_title (f"ALB Year {year}")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature (°F)' )
ax.xaxis.set_major_locator(MonthLocator(interval=1))
dateFmt = DateFormatter('%b %d')
ax.xaxis.set_major_formatter(dateFmt)
ax.legend (loc="best")

<matplotlib.legend.Legend at 0x146ec84bf020>

../../_images/4f69ad623b2556a2709edebe669127f814053ed4c6b7706d6432813f045d2725.png

Let’s save our beautiful graphic to disk.#

fig.savefig (f'albTemps{year}.png')

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.#

# %load '/spare11/atm350/common/mar04/01b.py'
# Create a figure and size it.
fig, ax = plt.subplots(figsize=(15,10))
# Create a histogram of our data series and divide it in to 10 bins.
ax.hist(maxT, bins=10, color='k', alpha=0.3)

(array([ 2.,  9., 31., 42., 49., 40., 35., 57., 79., 22.]),
 array([11. , 19.5, 28. , 36.5, 45. , 53.5, 62. , 70.5, 79. , 87.5, 96. ]),
 <BarContainer object of 10 artists>)

../../_images/905bef964f106ab7bca79c5bcf8bfe660accf40380b1ccbd76f38f40337c4df9.png

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.#

How can we learn more about how to customize this call? Append a `?` to the name of the method.#

ax.hist?

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.#

# %load '/spare11/atm350/common/mar04/01c.py'
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)
ax.xaxis.set_major_locator(plt.MultipleLocator(10))

../../_images/e7f29ee919b2cd113c6e9fdb586feb7aa989dcd72a958b7f3f9f1e27023d8d73.png

Save this histogram to disk.#

fig.savefig("maxT_hist.png")

Use the describe method on the maximum temperature series to reveal some simple statistical properties.

maxT.describe()

count    366.000000
mean      62.338799
std       19.533743
min       11.000000
25%       46.000000
50%       65.000000
75%       79.750000
max       96.000000
Name: MAX, dtype: float64

Pandas Notebook 1, ATM350 Spring 2025

Contents

Pandas Notebook 1, ATM350 Spring 2025#

Motivating Science Questions:#

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.#

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.#

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!#

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).#

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`#

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!#

Our dataframe has 365 or 366 rows (corresponding to all the days in the year) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.#

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.#

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!#

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.#

Let’s do the same for the # of columns.#

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).#

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.#

Similarly, let’s print out the last, or 365th row (Why is it 365, not 366???)#

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).#

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!#

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.#

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!#

Let’s reverse the sense of the test, and get its count. The two counts should add up to the total number of days in the year!#

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.#

Let’s show all the climate data for all these “pleasantly warm” days!#

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.#

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.#

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.#

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.#

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!#

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.#

Let’s save our beautiful graphic to disk.#

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.#

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.#

How can we learn more about how to customize this call? Append a `?` to the name of the method.#

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.#

Save this histogram to disk.#

References#

Pandas Notebook 1, ATM350 Spring 2025

Contents

Pandas Notebook 1, ATM350 Spring 2025#

Motivating Science Questions:#

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.#

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.#

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!#

Use pandas’ read_csv method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).#

Once this call succeeds, it returns a Pandas Dataframe object which we reference as df#

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!#

Our dataframe has 365 or 366 rows (corresponding to all the days in the year) and 10 columns that contain data. This is expressed by calling the shape attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.#

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.#

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!#

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.#

Let’s do the same for the # of columns.#

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).#

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.#

Similarly, let’s print out the last, or 365th row (Why is it 365, not 366???)#

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).#

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!#

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.#

As the default output only includes the first and last 5 rows , let’s slice and pull out a period in the middle of the year, where we might be more likely to get some Trues!#

Let’s reverse the sense of the test, and get its count. The two counts should add up to the total number of days in the year!#

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.#

Let’s show all the climate data for all these “pleasantly warm” days!#

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.#

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.#

The data plotted fine, but the look could be better. First, let’s import a package, seaborn, that when imported and set using its own method, makes matplotlib’s graphs look better.#

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.#

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!#

Matplotlib will recognize this array as being date/time-related, and when we pass it in as the x-axis, the graphic appears faster, and we also have a more meaningful x-axis label.#

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.#

Let’s save our beautiful graphic to disk.#

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use matplotlib’s hist method.#

Ok, but the 10 bins were autoselected. Let’s customize our call to the hist method by specifying the bounds of each of our bins.#

How can we learn more about how to customize this call? Append a ? to the name of the method.#

Revise the call to ax.hist, and also draw tick marks that align with the bounds of the histogram’s bins.#

Save this histogram to disk.#

References#

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).#

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`#

Our dataframe has 365 or 366 rows (corresponding to all the days in the year) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.#

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!#

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.#

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.#

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.#

How can we learn more about how to customize this call? Append a `?` to the name of the method.#

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.#