Pandas Notebook 1, ATM350 Spring 2024¶

Here, we read in a text file that has climatological data compiled at the National Weather Service in Albany NY for 2023, previously downloaded and reformatted from the xmACIS2 climate data portal.

We will use the Pandas library to read and analyze the data. We will also use the Matplotlib package to visualize it.

Motivating Science Questions:¶

How can we analyze and display tabular climate data for a site?
What was the yearly trace of max/min temperatures for Albany, NY last year?
What was the most common 10-degree maximum temperature range for Albany, NY last year?

# import Pandas and Numpy, and use their conventional two-letter abbreviations when we
# use methods from these packages. Also, import matplotlib's plotting package, using its 
# standard abbreviation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.¶

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.¶

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!¶

file = '/spare11/atm350/common/data/climo_alb_2023.csv'
! ls -l {file}

-rw-r--r-- 1 ktyle faculty 15229 Feb 21 17:10 /spare11/atm350/common/data/climo_alb_2023.csv

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).¶

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`¶

df = pd.read_csv(file, dtype='string')

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!¶

df

	DATE	MAX	MIN	AVG	DEP	HDD	CDD	PCP	SNW	DPT
0	2023-01-01	51	34	42.5	16.2	22	0	0.00	0.0	0
1	2023-01-02	51	29	40.0	13.9	25	0	0.00	0.0	0
2	2023-01-03	37	27	32.0	6.1	33	0	0.30	0.0	0
3	2023-01-04	43	37	40.0	14.3	25	0	0.08	0.0	0
4	2023-01-05	49	38	43.5	18.0	21	0	0.03	0.0	0
...	...	...	...	...	...	...	...	...	...	...
360	2023-12-27	51	44	47.5	20.1	17	0	0.30	0.0	0
361	2023-12-28	49	44	46.5	19.3	18	0	0.17	0.0	0
362	2023-12-29	48	42	45.0	18.1	20	0	0.15	0.0	0
363	2023-12-30	45	35	40.0	13.3	25	0	0.05	0.0	0
364	2023-12-31	37	33	35.0	8.5	30	0	T	T	0

365 rows × 10 columns

Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.¶

df.shape

(365, 10)

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.¶

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!¶

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.¶

nRows = df.shape[0]
print ("Number of rows = %d" % nRows )

Number of rows = 365

Let’s do the same for the # of columns.¶

nCols = df.shape[1]
print ("Number of columns = %d" % nCols)

Number of columns = 10

To access the values in a particular column, we reference it with its column name as a string. The next cell pulls in all values of the year-month-date column, and assigns it to an object of the same name. We could have named the object anything we wanted, not just Date … but on the right side of the assignment statement, we have to use the exact name of the column.¶

Print out what this object looks like.

Date = df['DATE']
print (Date)

    2023-01-01
    2023-01-02
    2023-01-03
    2023-01-04
    2023-01-05
          ...    
  2023-12-27
  2023-12-28
  2023-12-29
  2023-12-30
  2023-12-31
Name: DATE, Length: 365, dtype: string

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).¶

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.¶

print (Date[:30])

   2023-01-01
   2023-01-02
   2023-01-03
   2023-01-04
   2023-01-05
   2023-01-06
   2023-01-07
   2023-01-08
   2023-01-09
   2023-01-10
  2023-01-11
  2023-01-12
  2023-01-13
  2023-01-14
  2023-01-15
  2023-01-16
  2023-01-17
  2023-01-18
  2023-01-19
  2023-01-20
  2023-01-21
  2023-01-22
  2023-01-23
  2023-01-24
  2023-01-25
  2023-01-26
  2023-01-27
  2023-01-28
  2023-01-29
  2023-01-30
Name: DATE, dtype: string

Similarly, let’s print out the last, or 364th row (Why is it 364, not 365???)¶

print(Date[364])

2023-12-31

Note that using -1 as the last index doesn’t work!

print(Date[-1])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/indexes/range.py:414, in RangeIndex.get_loc(self, key)
    413 try:
--> 414     return self._range.index(new_key)
    415 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[11], line 1
----> 1 print(Date[-1])

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/series.py:1040, in Series.__getitem__(self, key)
   1037     return self._values[key]
   1039 elif key_is_scalar:
-> 1040     return self._get_value(key)
   1042 # Convert generator to list before going through hashable part
   1043 # (We will iterate through the generator there to check for slices)
   1044 if is_iterator(key):

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/series.py:1156, in Series._get_value(self, label, takeable)
   1153     return self._values[label]
   1155 # Similar to Index.get_value, but we do not fall back to positional
-> 1156 loc = self.index.get_loc(label)
   1158 if is_integer(loc):
   1159     return self._values[loc]

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/indexes/range.py:416, in RangeIndex.get_loc(self, key)
    414         return self._range.index(new_key)
    415     except ValueError as err:
--> 416         raise KeyError(key) from err
    417 if isinstance(key, Hashable):
    418     raise KeyError(key)

KeyError: -1

However, using a negative value as part of a slice does work:

print(Date[-9:])

  2023-12-23
  2023-12-24
  2023-12-25
  2023-12-26
  2023-12-27
  2023-12-28
  2023-12-29
  2023-12-30
  2023-12-31
Name: DATE, dtype: string

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).¶

TIP: After you have tried on your own, you can uncomment the first line of the cell below and re-run to load the solution.

# %load /spare11/atm350/common/feb22/01a.py
maxT = df['MAX']
minT = df['MIN']

maxT

    51
    51
    37
    43
    49
       ..
  51
  49
  48
  45
  37
Name: MAX, Length: 365, dtype: string

minT

    34
    29
    27
    37
    38
       ..
  44
  44
  42
  35
  33
Name: MIN, Length: 365, dtype: string

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!¶

hotDays = maxT >= 90

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 hotDays = maxT >= 90

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/series.py:5803, in Series._cmp_method(self, other, op)
   5800 lvalues = self._values
   5801 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 5803 res_values = ops.comparison_op(lvalues, rvalues, op)
   5805 return self._construct_result(res_values, name=res_name)

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:332, in comparison_op(left, right, op)
    323         raise ValueError(
    324             "Lengths must match to compare", lvalues.shape, rvalues.shape
    325         )
    327 if should_extension_dispatch(lvalues, rvalues) or (
    328     (isinstance(rvalues, (Timedelta, BaseOffset, Timestamp)) or right is NaT)
    329     and lvalues.dtype != object
    330 ):
    331     # Call the method on lvalues
--> 332     res_values = op(lvalues, rvalues)
    334 elif is_scalar(rvalues) and isna(rvalues):  # TODO: but not pd.NA?
    335     # numpy does not like comparisons vs None
    336     if op is operator.ne:

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/arraylike.py:60, in OpsMixin.__ge__(self, other)
     58 @unpack_zerodim_and_defer("__ge__")
     59 def __ge__(self, other):
---> 60     return self._cmp_method(other, operator.ge)

File /knight/mamba_aug23/envs/jan24_env/lib/python3.11/site-packages/pandas/core/arrays/string_.py:581, in StringArray._cmp_method(self, other, op)
    578 else:
    579     # logical
    580     result = np.zeros(len(self._ndarray), dtype="bool")
--> 581     result[valid] = op(self._ndarray[valid], other)
    582     return BooleanArray(result, mask)

TypeError: '>=' not supported between instances of 'str' and 'int'

Why did it fail? Remember, when we read in the file, we had Pandas assign the type of every column to string! We need to change the type of maxT to a numerical value. Let’s use a 32-bit floating point #, as that will be more than enough precision for this type of measurement. We’ll do the same for the minimum temp.¶

maxT = maxT.astype("float32")
minT = minT.astype("float32")

maxT

    51.0
    51.0
    37.0
    43.0
    49.0
       ... 
  51.0
  49.0
  48.0
  45.0
  37.0
Name: MAX, Length: 365, dtype: float32

hotDays = maxT >= 90

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.¶

print (hotDays)

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: MAX, Length: 365, dtype: bool

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!¶

print (hotDays[180:195])

  False
  False
  False
  False
  False
   True
   True
  False
  False
  False
  False
   True
  False
   True
  False
Name: MAX, dtype: bool

Now, let’s get a count of the # of days meeting this temperature criterion. Note carefully that we first have to express our set of days exceeding the threshold as a Pandas series. Then, recall that to get a count of the # of rows, we take the first (0th) element of the array returned by a call to the `shape` method.¶

df[maxT >= 90]

	DATE	MAX	MIN	AVG	DEP	CDD	PCP
147	2023-05-28	90	53	71.5	8.3	7	0.00
151	2023-06-01	92	58	75.0	10.6	10	0.00
152	2023-06-02	93	61	77.0	12.4	12	0.00
185	2023-07-05	93	66	79.5	6.8	15	0.00
186	2023-07-06	92	69	80.5	7.7	16	0.00
191	2023-07-11	90	64	77.0	3.7	12	0.36
193	2023-07-13	90	67	78.5	5.1	14	0.92
208	2023-07-28	92	69	80.5	7.3	16	0.00
247	2023-09-05	91	68	79.5	12.1	15	0.00
248	2023-09-06	93	70	81.5	14.4	17	0.00
249	2023-09-07	93	70	81.5	14.7	17	0.51

df[maxT >= 90].shape[0]

Let’s reverse the sense of the test, and get its count. The two counts should add up to 365!¶

df[maxT < 90].shape[0]

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.¶

df[(maxT< 90) & (maxT>=70)].shape[0]

Let’s show all the climate data for all these “pleasantly warm” days!¶

pleasant = df[(maxT< 90) & (maxT>=70)]
pleasant

	DATE	MAX	MIN	AVG	DEP	HDD	CDD	PCP	SNW	DPT
90	2023-04-01	74	41	57.5	15.9	7	0	0.45	0.0	0
100	2023-04-11	74	37	55.5	9.5	9	0	0.00	0.0	0
101	2023-04-12	78	56	67.0	20.5	0	2	0.00	0.0	0
102	2023-04-13	89	51	70.0	23.1	0	5	0.00	0.0	0
103	2023-04-14	89	58	73.5	26.1	0	9	0.00	0.0	0
...	...	...	...	...	...	...	...	...	...	...
278	2023-10-06	74	62	68.0	12.9	0	3	0.00	0.0	0
297	2023-10-25	73	45	59.0	11.0	6	0	0.00	0.0	0
298	2023-10-26	78	58	68.0	20.3	0	3	T	0.0	0
299	2023-10-27	80	58	69.0	21.7	0	4	0.00	0.0	0
300	2023-10-28	74	48	61.0	14.0	4	0	0.01	0.0	0

137 rows × 10 columns

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.¶

pd.set_option ('display.max_rows', None)
pleasant

	DATE	MAX	MIN	AVG	DEP	HDD	CDD	PCP
90	2023-04-01	74	41	57.5	15.9	7	0	0.45
100	2023-04-11	74	37	55.5	9.5	9	0	0.00
101	2023-04-12	78	56	67.0	20.5	0	2	0.00
102	2023-04-13	89	51	70.0	23.1	0	5	0.00
103	2023-04-14	89	58	73.5	26.1	0	9	0.00
104	2023-04-15	84	52	68.0	20.1	0	3	0.00
105	2023-04-16	78	57	67.5	19.2	0	3	T
110	2023-04-21	79	45	62.0	11.4	3	0	0.00
125	2023-05-06	72	39	55.5	-1.1	9	0	0.00
126	2023-05-07	76	41	58.5	1.6	6	0	T
127	2023-05-08	71	47	59.0	1.7	6	0	0.00
129	2023-05-10	72	36	54.0	-3.9	11	0	0.00
130	2023-05-11	80	43	61.5	3.3	3	0	0.00
131	2023-05-12	84	50	67.0	8.4	0	2	0.01
132	2023-05-13	79	54	66.5	7.6	0	2	0.00
134	2023-05-15	74	37	55.5	-4.0	9	0	0.00
135	2023-05-16	80	47	63.5	3.7	1	0	0.00
138	2023-05-19	72	48	60.0	-0.7	5	0	0.00
139	2023-05-20	70	55	62.5	1.5	2	0	0.31
140	2023-05-21	73	50	61.5	0.3	3	0	0.00
141	2023-05-22	74	49	61.5	0.0	3	0	0.00
142	2023-05-23	77	46	61.5	-0.3	3	0	0.00
143	2023-05-24	75	45	60.0	-2.1	5	0	0.02
145	2023-05-26	75	40	57.5	-5.2	7	0	0.00
146	2023-05-27	81	44	62.5	-0.4	2	0	0.00
148	2023-05-29	85	52	68.5	5.0	0	4	0.00
149	2023-05-30	81	55	68.0	4.2	0	3	0.00
150	2023-05-31	88	54	71.0	6.9	0	6	0.00
153	2023-06-03	70	51	60.5	-4.4	4	0	T
155	2023-06-05	77	51	64.0	-1.5	1	0	0.00
156	2023-06-06	71	50	60.5	-5.3	4	0	0.11
160	2023-06-10	75	46	60.5	-6.5	4	0	T
161	2023-06-11	85	50	67.5	0.2	0	3	0.00
162	2023-06-12	85	64	74.5	7.0	0	10	T
163	2023-06-13	76	57	66.5	-1.3	0	2	0.34
164	2023-06-14	75	52	63.5	-4.6	1	0	0.17
165	2023-06-15	81	59	70.0	1.6	0	5	0.01
166	2023-06-16	71	57	64.0	-4.7	1	0	0.08
168	2023-06-18	75	60	67.5	-1.7	0	3	0.00
169	2023-06-19	78	56	67.0	-2.5	0	2	0.00
170	2023-06-20	82	59	70.5	0.8	0	6	0.00
171	2023-06-21	80	60	70.0	0.0	0	5	0.00
172	2023-06-22	82	53	67.5	-2.7	0	3	0.00
173	2023-06-23	87	63	75.0	4.5	0	10	T
174	2023-06-24	78	68	73.0	2.3	0	8	0.54
175	2023-06-25	89	65	77.0	6.1	0	12	T
176	2023-06-26	84	67	75.5	4.3	0	11	1.14
177	2023-06-27	79	64	71.5	0.1	0	7	0.38
178	2023-06-28	77	64	70.5	-1.1	0	6	0.09
179	2023-06-29	83	60	71.5	-0.3	0	7	0.00
180	2023-06-30	86	60	73.0	1.0	0	8	0.00
181	2023-07-01	87	65	76.0	3.9	0	11	0.00
182	2023-07-02	77	65	71.0	-1.3	0	6	1.17
183	2023-07-03	87	69	78.0	5.6	0	13	0.39
184	2023-07-04	89	68	78.5	5.9	0	14	0.07
187	2023-07-07	87	71	79.0	6.1	0	14	0.03
188	2023-07-08	87	69	78.0	5.0	0	13	0.00
189	2023-07-09	86	69	77.5	4.4	0	13	0.60
190	2023-07-10	74	65	69.5	-3.7	0	5	1.65
192	2023-07-12	86	67	76.5	3.2	0	12	0.02
194	2023-07-14	86	67	76.5	3.1	0	12	0.43
195	2023-07-15	88	68	78.0	4.6	0	13	0.03
196	2023-07-16	85	70	77.5	4.0	0	13	1.43
197	2023-07-17	87	67	77.0	3.5	0	12	0.00
198	2023-07-18	87	68	77.5	4.0	0	13	2.31
199	2023-07-19	85	66	75.5	2.0	0	11	0.00
200	2023-07-20	86	61	73.5	0.0	0	9	0.00
201	2023-07-21	82	67	74.5	1.0	0	10	0.18
202	2023-07-22	83	63	73.0	-0.4	0	8	0.00
203	2023-07-23	87	57	72.0	-1.4	0	7	0.00
204	2023-07-24	86	63	74.5	1.1	0	10	0.47
205	2023-07-25	83	63	73.0	-0.4	0	8	0.00
206	2023-07-26	87	66	76.5	3.2	0	12	0.00
207	2023-07-27	87	74	80.5	7.2	0	16	0.34
209	2023-07-29	86	61	73.5	0.3	0	9	0.25
210	2023-07-30	79	57	68.0	-5.1	0	3	0.04
211	2023-07-31	80	55	67.5	-5.5	0	3	0.01
212	2023-08-01	77	56	66.5	-6.5	0	2	0.00
213	2023-08-02	78	52	65.0	-7.9	0	0	0.00
214	2023-08-03	80	60	70.0	-2.8	0	5	0.04
215	2023-08-04	77	62	69.5	-3.3	0	5	0.45
216	2023-08-05	82	61	71.5	-1.2	0	7	0.00
217	2023-08-06	82	57	69.5	-3.1	0	5	0.00
218	2023-08-07	76	65	70.5	-2.0	0	6	1.06
219	2023-08-08	81	68	74.5	2.1	0	10	0.07
220	2023-08-09	83	63	73.0	0.7	0	8	T
221	2023-08-10	79	60	69.5	-2.7	0	5	0.35
222	2023-08-11	81	60	70.5	-1.6	0	6	0.00
223	2023-08-12	81	55	68.0	-4.0	0	3	0.00
224	2023-08-13	83	64	73.5	1.6	0	9	0.43
225	2023-08-14	84	62	73.0	1.2	0	8	0.02
226	2023-08-15	73	67	70.0	-1.7	0	5	0.53
227	2023-08-16	82	68	75.0	3.4	0	10	T
228	2023-08-17	81	71	76.0	4.5	0	11	0.01
229	2023-08-18	79	63	71.0	-0.3	0	6	0.72
230	2023-08-19	76	57	66.5	-4.7	0	2	T
231	2023-08-20	79	55	67.0	-4.0	0	2	0.00
232	2023-08-21	84	63	73.5	2.6	0	9	T
233	2023-08-22	80	55	67.5	-3.2	0	3	0.00
234	2023-08-23	80	57	68.5	-2.1	0	4	0.00
235	2023-08-24	71	57	64.0	-6.4	1	0	0.35
236	2023-08-25	79	66	72.5	2.3	0	8	0.13
237	2023-08-26	84	67	75.5	5.5	0	11	0.00
238	2023-08-27	79	62	70.5	0.7	0	6	0.00
239	2023-08-28	77	57	67.0	-2.6	0	2	0.46
240	2023-08-29	84	67	75.5	6.2	0	11	0.00
241	2023-08-30	80	62	71.0	1.9	0	6	0.51
242	2023-08-31	73	55	64.0	-4.8	1	0	0.00
243	2023-09-01	77	51	64.0	-4.6	1	0	0.00
244	2023-09-02	81	55	68.0	-0.3	0	3	0.02
245	2023-09-03	86	64	75.0	7.0	0	10	0.00
246	2023-09-04	89	68	78.5	10.8	0	14	0.00
250	2023-09-08	88	68	78.0	11.5	0	13	0.79
251	2023-09-09	83	66	74.5	8.4	0	10	0.01
252	2023-09-10	78	66	72.0	6.2	0	7	1.35
253	2023-09-11	76	64	70.0	4.6	0	5	0.01
254	2023-09-12	82	64	73.0	7.9	0	8	0.00
255	2023-09-13	75	63	69.0	4.3	0	4	0.19
256	2023-09-14	73	53	63.0	-1.3	2	0	T
258	2023-09-16	76	56	66.0	2.5	0	1	0.00
259	2023-09-17	74	50	62.0	-1.1	3	0	0.20
261	2023-09-19	71	52	61.5	-0.8	3	0	T
262	2023-09-20	72	49	60.5	-1.4	4	0	0.00
263	2023-09-21	72	47	59.5	-2.0	5	0	0.00
264	2023-09-22	74	48	61.0	0.0	4	0	0.00
269	2023-09-27	71	45	58.0	-0.9	7	0	0.00
270	2023-09-28	72	45	58.5	0.0	6	0	0.00
273	2023-10-01	79	53	66.0	8.8	0	1	0.00
274	2023-10-02	78	56	67.0	10.2	0	2	0.00
275	2023-10-03	85	52	68.5	12.1	0	4	0.00
276	2023-10-04	86	57	71.5	15.6	0	7	0.00
277	2023-10-05	82	57	69.5	14.0	0	5	0.00
278	2023-10-06	74	62	68.0	12.9	0	3	0.00
297	2023-10-25	73	45	59.0	11.0	6	0	0.00
298	2023-10-26	78	58	68.0	20.3	0	3	T
299	2023-10-27	80	58	69.0	21.7	0	4	0.00
300	2023-10-28	74	48	61.0	14.0	4	0	0.01

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.¶

maxT.plot()

<Axes: >

../../_images/4c67bd5dbed88e9646a44840c584068c628bd73991bfab1946021bbc365f60ef.png

minT.plot()

<Axes: >

../../_images/889850dd7ffef4f45b9c1095f623fb849c99e8adfc93ab80f96547cecee6a885.png

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.¶

Info on seaborn: https://seaborn.pydata.org/index.html

import seaborn as sns
sns.set()

maxT.plot()

<Axes: >

../../_images/b54dbcaeec1eb15c5b8d6d6ea5ec5020a82e1925d062c539952262c037795cb6.png

minT.plot()

<Axes: >

../../_images/70d3e73af65aeb0bed0a6ee46c62d168cbaf596afcee693638621451154ee146.png

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.¶

fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red')
ax.plot (Date, minT, color='blue')
ax.set_title ("ALB Year 2023")
ax.set_xlabel('Day of Year')
ax.set_ylabel('Temperature (°F')

Text(0, 0.5, 'Temperature (°F')

../../_images/046cd6dd7e10c569e68adffb191458bdb2e0e315168e303025fa8235b165409f.png

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!¶

We will deal with this by using one of Pandas’ methods that take strings and convert them to a special type of data … not strings nor numbers, but datetime objects. Note carefully how we do this here … it is not terribly intuitive, but we’ll explain it more in an upcoming lecture/notebook on `datetime`. You will see though that the output column now looks a bit more date-like, with a four-digit year followed by two-digit month and date.¶

Date = pd.to_datetime(Date,format="%Y-%m-%d")
Date

   2023-01-01
   2023-01-02
   2023-01-03
   2023-01-04
   2023-01-05
   2023-01-06
   2023-01-07
   2023-01-08
   2023-01-09
   2023-01-10
  2023-01-11
  2023-01-12
  2023-01-13
  2023-01-14
  2023-01-15
  2023-01-16
  2023-01-17
  2023-01-18
  2023-01-19
  2023-01-20
  2023-01-21
  2023-01-22
  2023-01-23
  2023-01-24
  2023-01-25
  2023-01-26
  2023-01-27
  2023-01-28
  2023-01-29
  2023-01-30
  2023-01-31
  2023-02-01
  2023-02-02
  2023-02-03
  2023-02-04
  2023-02-05
  2023-02-06
  2023-02-07
  2023-02-08
  2023-02-09
  2023-02-10
  2023-02-11
  2023-02-12
  2023-02-13
  2023-02-14
  2023-02-15
  2023-02-16
  2023-02-17
  2023-02-18
  2023-02-19
  2023-02-20
  2023-02-21
  2023-02-22
  2023-02-23
  2023-02-24
  2023-02-25
  2023-02-26
  2023-02-27
  2023-02-28
  2023-03-01
  2023-03-02
  2023-03-03
  2023-03-04
  2023-03-05
  2023-03-06
  2023-03-07
  2023-03-08
  2023-03-09
  2023-03-10
  2023-03-11
  2023-03-12
  2023-03-13
  2023-03-14
  2023-03-15
  2023-03-16
  2023-03-17
  2023-03-18
  2023-03-19
  2023-03-20
  2023-03-21
  2023-03-22
  2023-03-23
  2023-03-24
  2023-03-25
  2023-03-26
  2023-03-27
  2023-03-28
  2023-03-29
  2023-03-30
  2023-03-31
  2023-04-01
  2023-04-02
  2023-04-03
  2023-04-04
  2023-04-05
  2023-04-06
  2023-04-07
  2023-04-08
  2023-04-09
  2023-04-10
 2023-04-11
 2023-04-12
 2023-04-13
 2023-04-14
 2023-04-15
 2023-04-16
 2023-04-17
 2023-04-18
 2023-04-19
 2023-04-20
 2023-04-21
 2023-04-22
 2023-04-23
 2023-04-24
 2023-04-25
 2023-04-26
 2023-04-27
 2023-04-28
 2023-04-29
 2023-04-30
 2023-05-01
 2023-05-02
 2023-05-03
 2023-05-04
 2023-05-05
 2023-05-06
 2023-05-07
 2023-05-08
 2023-05-09
 2023-05-10
 2023-05-11
 2023-05-12
 2023-05-13
 2023-05-14
 2023-05-15
 2023-05-16
 2023-05-17
 2023-05-18
 2023-05-19
 2023-05-20
 2023-05-21
 2023-05-22
 2023-05-23
 2023-05-24
 2023-05-25
 2023-05-26
 2023-05-27
 2023-05-28
 2023-05-29
 2023-05-30
 2023-05-31
 2023-06-01
 2023-06-02
 2023-06-03
 2023-06-04
 2023-06-05
 2023-06-06
 2023-06-07
 2023-06-08
 2023-06-09
 2023-06-10
 2023-06-11
 2023-06-12
 2023-06-13
 2023-06-14
 2023-06-15
 2023-06-16
 2023-06-17
 2023-06-18
 2023-06-19
 2023-06-20
 2023-06-21
 2023-06-22
 2023-06-23
 2023-06-24
 2023-06-25
 2023-06-26
 2023-06-27
 2023-06-28
 2023-06-29
 2023-06-30
 2023-07-01
 2023-07-02
 2023-07-03
 2023-07-04
 2023-07-05
 2023-07-06
 2023-07-07
 2023-07-08
 2023-07-09
 2023-07-10
 2023-07-11
 2023-07-12
 2023-07-13
 2023-07-14
 2023-07-15
 2023-07-16
 2023-07-17
 2023-07-18
 2023-07-19
 2023-07-20
 2023-07-21
 2023-07-22
 2023-07-23
 2023-07-24
 2023-07-25
 2023-07-26
 2023-07-27
 2023-07-28
 2023-07-29
 2023-07-30
 2023-07-31
 2023-08-01
 2023-08-02
 2023-08-03
 2023-08-04
 2023-08-05
 2023-08-06
 2023-08-07
 2023-08-08
 2023-08-09
 2023-08-10
 2023-08-11
 2023-08-12
 2023-08-13
 2023-08-14
 2023-08-15
 2023-08-16
 2023-08-17
 2023-08-18
 2023-08-19
 2023-08-20
 2023-08-21
 2023-08-22
 2023-08-23
 2023-08-24
 2023-08-25
 2023-08-26
 2023-08-27
 2023-08-28
 2023-08-29
 2023-08-30
 2023-08-31
 2023-09-01
 2023-09-02
 2023-09-03
 2023-09-04
 2023-09-05
 2023-09-06
 2023-09-07
 2023-09-08
 2023-09-09
 2023-09-10
 2023-09-11
 2023-09-12
 2023-09-13
 2023-09-14
 2023-09-15
 2023-09-16
 2023-09-17
 2023-09-18
 2023-09-19
 2023-09-20
 2023-09-21
 2023-09-22
 2023-09-23
 2023-09-24
 2023-09-25
 2023-09-26
 2023-09-27
 2023-09-28
 2023-09-29
 2023-09-30
 2023-10-01
 2023-10-02
 2023-10-03
 2023-10-04
 2023-10-05
 2023-10-06
 2023-10-07
 2023-10-08
 2023-10-09
 2023-10-10
 2023-10-11
 2023-10-12
 2023-10-13
 2023-10-14
 2023-10-15
 2023-10-16
 2023-10-17
 2023-10-18
 2023-10-19
 2023-10-20
 2023-10-21
 2023-10-22
 2023-10-23
 2023-10-24
 2023-10-25
 2023-10-26
 2023-10-27
 2023-10-28
 2023-10-29
 2023-10-30
 2023-10-31
 2023-11-01
 2023-11-02
 2023-11-03
 2023-11-04
 2023-11-05
 2023-11-06
 2023-11-07
 2023-11-08
 2023-11-09
 2023-11-10
 2023-11-11
 2023-11-12
 2023-11-13
 2023-11-14
 2023-11-15
 2023-11-16
 2023-11-17
 2023-11-18
 2023-11-19
 2023-11-20
 2023-11-21
 2023-11-22
 2023-11-23
 2023-11-24
 2023-11-25
 2023-11-26
 2023-11-27
 2023-11-28
 2023-11-29
 2023-11-30
 2023-12-01
 2023-12-02
 2023-12-03
 2023-12-04
 2023-12-05
 2023-12-06
 2023-12-07
 2023-12-08
 2023-12-09
 2023-12-10
 2023-12-11
 2023-12-12
 2023-12-13
 2023-12-14
 2023-12-15
 2023-12-16
 2023-12-17
 2023-12-18
 2023-12-19
 2023-12-20
 2023-12-21
 2023-12-22
 2023-12-23
 2023-12-24
 2023-12-25
 2023-12-26
 2023-12-27
 2023-12-28
 2023-12-29
 2023-12-30
 2023-12-31
Name: DATE, dtype: datetime64[ns]

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.¶

from matplotlib.dates import DateFormatter, AutoDateLocator,HourLocator,DayLocator,MonthLocator

fig, ax = plt.subplots(figsize=(15,10))
ax.plot (Date, maxT, color='red',label = "Max T")
ax.plot (Date, minT, color='blue', label = "Min T")
ax.set_title ("ALB Year 2023")
ax.set_xlabel('Date')
ax.set_ylabel('Temperature (°F)' )
ax.xaxis.set_major_locator(MonthLocator(interval=1))
dateFmt = DateFormatter('%b %d')
ax.xaxis.set_major_formatter(dateFmt)
ax.legend (loc="best")

<matplotlib.legend.Legend at 0x154267d66690>

../../_images/bd7d2cc96c1a7c0d609c86dd1ecf7d18da42774d74527c813cc462b05bd3c174.png

Let’s save our beautiful graphic to disk.¶

fig.savefig ('albTemps2023.png')

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.¶

# Create a figure and size it.
fig, ax = plt.subplots(figsize=(15,10))
# Create a histogram of our data series and divide it in to 10 bins.
ax.hist(maxT, bins=10, color='k', alpha=0.3)

(array([ 2.,  6., 49., 47., 43., 33., 39., 46., 62., 38.]),
 array([18. , 25.5, 33. , 40.5, 48. , 55.5, 63. , 70.5, 78. , 85.5, 93. ]),
 <BarContainer object of 10 artists>)

../../_images/342fad385cfab19018805166f1346d207af4c24e23b36e94d981306da3623226.png

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.¶

How can we learn more about how to customize this call? Append a `?` to the name of the method.¶

ax.hist?

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.¶

fig, ax = plt.subplots(figsize=(15,10))
ax.hist(maxT, bins=(0,10,20,30,40,50,60,70,80,90,100), color='k', alpha=0.3)
ax.xaxis.set_major_locator(plt.MultipleLocator(10))

../../_images/036457c9df1f71fe191a7993d6b809b2e5d3deadf0d44acf3738c298740f9d76.png

Save this histogram to disk.¶

fig.savefig("maxT_hist.png")

Use the describe method on the maximum temperature series to reveal some simple statistical properties.

maxT.describe()

count    365.000000
mean      61.879452
std       18.183035
min       18.000000
25%       46.000000
50%       63.000000
75%       79.000000
max       93.000000
Name: MAX, dtype: float64

ATM350 Spring 2024

Pandas Notebook 1, ATM350 Spring 2024

Contents

Pandas Notebook 1, ATM350 Spring 2024¶

Motivating Science Questions:¶

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.¶

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.¶

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!¶

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).¶

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`¶

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!¶

Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.¶

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.¶

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!¶

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.¶

Let’s do the same for the # of columns.¶

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).¶

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.¶

Similarly, let’s print out the last, or 364th row (Why is it 364, not 365???)¶

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).¶

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!¶

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.¶

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!¶

Let’s reverse the sense of the test, and get its count. The two counts should add up to 365!¶

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.¶

Let’s show all the climate data for all these “pleasantly warm” days!¶

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.¶

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.¶

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.¶

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.¶

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!¶

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.¶

Let’s save our beautiful graphic to disk.¶

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.¶

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.¶

How can we learn more about how to customize this call? Append a `?` to the name of the method.¶

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.¶

Save this histogram to disk.¶

References¶

ATM350 Spring 2024

Pandas Notebook 1, ATM350 Spring 2024

Contents

Pandas Notebook 1, ATM350 Spring 2024¶

Motivating Science Questions:¶

Specify the location of the file that contains the climo data. Use the linux ls command to verify it exists.¶

Note that in a Jupyter notebook, we can simply use the ! directive to “call” a Linux command.¶

Also notice how we refer to a Python variable name when passing it to a Linux command line in this way … we enclose it in braces!¶

Use pandas’ read_csv method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).¶

Once this call succeeds, it returns a Pandas Dataframe object which we reference as df¶

By simply typing the name of the dataframe object, we can get some of its contents to be “pretty-printed” to the notebook!¶

Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the shape attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.¶

It will be useful to have a variable (more accurately, an object ) that holds the value of the number of rows, and another for the number of columns.¶

Remember that Python is a language that uses zero-based indexing, so the first value is accessed as element 0, and the second as element 1!¶

Look at the syntax we use below to print out the (integer) value of nRows … it’s another example of string formating.¶

Let’s do the same for the # of columns.¶

Each column of a Pandas dataframe is known as a series. It is basically an array of values, each of which has a corresponding row #. By default, row #’s accompanying a Series are numbered consecutively, starting with 0 (since Python’s convention is to use zero-based indexing ).¶

We can reference a particular value, or set of values, of a Series by using array-based notation. Below, let’s print out the first 30 rows of the dates.¶

Similarly, let’s print out the last, or 364th row (Why is it 364, not 365???)¶

EXERCISE: Now, let’s create new Series objects; one for Max Temp (name it maxT), and the other for Min Temp (name it minT).¶

Let’s now list all the days that the high temperature was >= 90. Note carefully how we express this test. It will fail!¶

Now, the test works. What does this data series look like? It actually is a table of booleans … i.e., true/false values.¶

As the default output only includes the first and last 5 rows , let’s slice and pull out a period in the middle of the year, where we might be more likely to get some Trues!¶

Let’s reverse the sense of the test, and get its count. The two counts should add up to 365!¶

We can combine a test of two different thresholds. Let’s get a count of days where the max. temperature was in the 70s or 80s.¶

Let’s show all the climate data for all these “pleasantly warm” days!¶

Notice that after a certain point, not all the rows are displayed to the notebook. We can eliminate the limit of maximum rows and thus show all of the matching days.¶

Now let’s visualize the temperature trace over the year! Pandas has a method that directly calls Matplotlib’s plotting package.¶

The data plotted fine, but the look could be better. First, let’s import a package, seaborn, that when imported and set using its own method, makes matplotlib’s graphs look better.¶

Next, let’s plot the two traces simultaneously on the graph so we can better discern max and min temps (this will also enure a single y-axis that will encompass the range of temperature values). We’ll also add some helpful labels and expand the size of the figure.¶

You will notice that this graphic took some time to render. Note that the x-axis label is virtually unreadable. This is because every date is being printed!¶

Matplotlib will recognize this array as being date/time-related, and when we pass it in as the x-axis, the graphic appears faster, and we also have a more meaningful x-axis label.¶

We’ll further refine the look of the plot by adding a legend and have vertical grid lines on a frequency of one month.¶

Let’s save our beautiful graphic to disk.¶

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use matplotlib’s hist method.¶

Ok, but the 10 bins were autoselected. Let’s customize our call to the hist method by specifying the bounds of each of our bins.¶

How can we learn more about how to customize this call? Append a ? to the name of the method.¶

Revise the call to ax.hist, and also draw tick marks that align with the bounds of the histogram’s bins.¶

Save this histogram to disk.¶

References¶

Use pandas’ `read_csv` method to open the file. Specify that the data is to be read in as strings (not integers nor floating points).¶

Once this call succeeds, it returns a Pandas Dataframe object which we reference as `df`¶

Our dataframe has 365 rows (corresponding to all the days in the year 2023) and 10 columns that contain data. This is expressed by calling the `shape` attribute of the dataframe. The first number in the pair is the # of rows, while the second is the # of columns.¶

As the default output only includes the first and last 5 rows , let’s `slice` and pull out a period in the middle of the year, where we might be more likely to get some `True`s!¶

The data plotted fine, but the look could be better. First, let’s import a package, `seaborn`, that when imported and `set` using its own method, makes matplotlib’s graphs look better.¶

Now, let’s answer the question, “what was the most common range of maximum temperatures last year in Albany?” via a histogram. We use `matplotlib`’s `hist` method.¶

Ok, but the 10 bins were autoselected. Let’s customize our call to the `hist` method by specifying the bounds of each of our bins.¶

How can we learn more about how to customize this call? Append a `?` to the name of the method.¶

Revise the call to `ax.hist`, and also draw tick marks that align with the bounds of the histogram’s bins.¶