# Pandas 1: Introduction to Pandas

<center><img src="https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg" alt="pandas Logo" style="width: 800px;"/></center>

---

## Overview
### `Pandas`, along with `Matplotlib` and `NumPy`, forms the *Great Triumvirate* of the scientific Python ecosystem. Its features, as cited in <a href="https://pandas.pydata.org/about/">https://pandas.pydata.org/about/</a>, include:

1. A fast and efficient DataFrame object for data manipulation with integrated indexing;

2. Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;

3. Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;

4. Flexible reshaping and pivoting of data sets;

5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;

6. Columns can be inserted and deleted from data structures for size mutability;

7. Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;

8. High performance merging and joining of data sets;

9. Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;

10. Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;

11. Highly optimized for performance, with critical code paths written in Cython or C.

12. Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more (<b><i>such as atmospheric science!</i></b>).

## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Python basics | Necessary | |
| NumPy basics | Helpful | |

* **Time to learn**: 30 minutes

___

## Imports

To begin using Pandas, simply import it. You will often see the nickname `pd` used as an abbreviation for pandas in the import statement, just like `numpy` is often imported as `np`. 

In [None]:
import pandas as pd

Typically, one uses Pandas to read from and/or write to files containing *tabular data*, e.g., text files consisting of rows and columns. For this notebook, let's use a file containing NYS Mesonet (NYSM) data from 0200 UTC 2 September 2021.

First, let's view the first and last five lines of this data file as if we were using the Linux command-line interface.

<div class="alert alert-block alert-info">
<b>Tip:</b> In a Jupyter notebook, you can invoke Linux commands by prepending each Linux command with a <code>!</code></div>

In [None]:
# Directly run the Linux `head` and `tail` commands to display the first five lines and last five lines from the data file.
dataFile = '/spare11/atm533/data/nysm_data_2021090202.csv'
!head -5 {dataFile}
!echo .
!echo .
!echo .
!tail -5 {dataFile}

___

We can see that this file has *comma-separated values*, hence the `csv` suffix in its name.
It has a line, or *row*, at the top identifying what each *column* contains. Then there follow 126 rows, one for each of the 126 NYS Mesonet sites, in alphabetical order.

<div class="alert alert-warning">
    <b>Note:</b> Occasionally, some columns may have <i>missing</i> data. For an example of this, change the <code>dataFile</code>'s file name so it references 0000 UTC Sep. 11, 2020, and then rerun the cell. Examine <i>Wolcott</i>'s (<b>WOLC</b>) values. Change back to 0200 UTC 2 Sep. 2021 and re-run before you proceed!</div>

Although there is a lot of interesting data in this file, it's all currently in a text-based form, which is not terribly conducive to data analysis or visualization. **Pandas** to the rescue!

Let's introduce ourselves to Pandas' two core objects: the `DataFrame` and the `Series`. 

## The pandas [`DataFrame`](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)...
... is a **labeled**, two-dimensional columnar structure similar to a table, an Excel-like spreadsheet, or the R language's `data.frame`.

![dataframe schematic](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_dataframe.svg "Schematic of a pandas DataFrame")

The `columns` that make up our `DataFrame` can be lists, dictionaries, NumPy arrays, pandas `Series`, or more. Within these `columns`, our data can be text, numbers, dates and times, or many other data types you may have encountered in **Python** and **NumPy**. Shown here on the left in dark gray, our very first `column` is uniquely referred to as an `Index`, and it contains information characterizing each row of our `DataFrame`. Similar to any other `column`, the `index` can label our rows by text, numbers, `datetime`s (a popular one!), or more.

It turns out that a Pandas `DataFrame` consists of one or more Pandas `Series`. We'll discuss the latter in a moment, but for now, let's create a `DataFrame` from our text-based data file.

We can read the data into a Pandas `DataFrame` object by calling Pandas' `read_csv` function, since the data file consists of comma-separated values.

In [None]:
df = pd.read_csv(dataFile)

<div class="alert alert-block alert-info">
<b>Tip:</b> We have used a generic object name, <code>df</code>, to store the resulting <code>DataFrame</code>. We are free to choose any valid Python object name. For example, we could have named it <code>nysmData21090202</code> (note that Python object names cannot start with a number).</div>
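
If the course data file is not available on your system, you can still try `read_csv` on a tiny in-memory table. The following is only a sketch: the station IDs, the relative-humidity column name, and all of the values are made up for illustration, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV text; the rows and values are invented for illustration.
csv_text = """station,temp_2m [degC],relative_humidity [percent]
ADDI,14.2,88.1
ANDE,12.7,93.4
BATA,15.0,85.6
"""

# read_csv accepts a file path or a file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```

As with the real file, the first line of the text supplies the column names.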

By simply typing the name of the `DataFrame` object, we can see its contents displayed in a browser-friendly format. Since we passed no arguments besides the name of the `csv` file, the `DataFrame` has the following default properties:
1. Only the first and last *five* rows are displayed, and a wide table is likewise truncated to its first and last few columns
1. The *column* names arise from the first line in the file
1. The *row* names (or more precisely, row *index* names) are numbered sequentially, beginning at 0.

In [None]:
df

Pandas allows us to use its `set_option` function to override the default display settings. Let's use it so we see all rows and columns.

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
df

For a relatively small `DataFrame` such as ours, this is OK, but you would definitely want to return to a stricter limit for larger `DataFrame`s (Pandas can support millions of rows and/or columns!). Let's restrict the display back down to 10 rows and columns (five at the start, five at the end) now.

In [None]:
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)
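
If you only need the wider view for a single cell, one alternative is `pd.option_context`, which changes a display setting temporarily. The sketch below uses a throwaway `DataFrame` named `demo` (not part of our NYSM data) to show that the earlier setting comes back once the `with` block ends.

```python
import pandas as pd

# A throwaway 70-row DataFrame, invented purely for illustration.
demo = pd.DataFrame({"value": range(70)})

before = pd.get_option("display.max_rows")

# The override applies only inside the `with` block.
with pd.option_context("display.max_rows", None):
    inside = pd.get_option("display.max_rows")
    print(demo)  # all 70 rows are printed

after = pd.get_option("display.max_rows")
print(before == after)  # the earlier setting is restored
```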

<div class="alert alert-warning"><b>Note: </b>Recall that occasionally, there may be some missing data. In Pandas, these are denoted as <b>NaN</b> ... literally, "Not a Number".</div>

A Pandas `DataFrame` is a 2-dimensional array of **rows** and **columns**. To get the array size, print out the `shape` attribute. The first element is the number of rows, while the second is the number of columns. The following cell prints out the number of rows and columns in this particular `DataFrame`:

In [None]:
# shape is a tuple: (number of rows, number of columns)
print(df.shape)
nRows = df.shape[0]
nColumns = df.shape[1]
print(f"There are {nRows} rows and {nColumns} columns in this DataFrame.")
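
Because `shape` is an ordinary Python tuple, both values can also be unpacked in a single step. Here is a minimal sketch using a small made-up `DataFrame` (the names `demo`, `n_rows`, and `n_cols` are just for illustration):

```python
import pandas as pd

# A tiny DataFrame invented for illustration: 3 rows, 2 columns.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple in one step.
n_rows, n_cols = demo.shape
print(f"There are {n_rows} rows and {n_cols} columns.")

# len() of a DataFrame is its number of rows.
print(len(demo) == n_rows)
```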

Pandas refers to the column and row names as `Index`es, which are 1-d(imensional) arrays. Display the names of the columns:

In [None]:
colNames = df.columns
colNames

___ 
You might think that the row index would have a similar **attribute**, but it doesn't:

In [None]:
rowNames = df.rows  # Raises an AttributeError: a DataFrame has no `rows` attribute
rowNames

___
We actually use the `index` attribute to get at the row names. It's a special type of object, known as a `RangeIndex`.

In [None]:
rowNames = df.index
rowNames

___
We can view this `RangeIndex` as a Python `list` as follows:

In [None]:
list(rowNames)
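
A `RangeIndex` behaves much like Python's built-in `range`: it is described by a start, a stop, and a step. As a small sketch on a made-up four-row `DataFrame`:

```python
import pandas as pd

# A tiny DataFrame invented for illustration; no index is specified,
# so pandas assigns the default RangeIndex.
demo = pd.DataFrame({"value": [10, 20, 30, 40]})
idx = demo.index

print(type(idx).__name__)
print(idx.start, idx.stop, idx.step)
print(list(idx))
```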

___
Why are the row indices a sequence of integers beginning at 0, and not the first column (in this case, **station**) of the `DataFrame`? As we noted above, that is just the default behavior. We can specify which column to use for the row index as an additional argument to `pd.read_csv`:

In [None]:
df2 = pd.read_csv(dataFile, index_col=0)

<div class="alert alert-block alert-info">
<b>Tip:</b> We assign the resulting <code>DataFrame</code> to a different object, to distinguish it from the first one. Once again, we could use any valid object name we want.</div>

In [None]:
df2

In [None]:
df2.index
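
The `index_col` argument also accepts a column *name* instead of a position. The sketch below assumes a tiny made-up CSV (invented station rows and values, read through `io.StringIO`) and checks that both forms give the same result:

```python
import io
import pandas as pd

# Hypothetical CSV text; the rows and values are invented for illustration.
csv_text = """station,temp_2m [degC]
ADDI,14.2
ANDE,12.7
BATA,15.0
"""

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
# ... or by name.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="station")

print(by_name.index)
print(by_position.equals(by_name))
```

Using the name can make the intent clearer to anyone reading the code later.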

___
Now, let's examine the *2-meter temperature* column, and thus, begin our exploration of Pandas' second core object, the `Series`.

## The pandas [`Series`](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)...

... is essentially any one of the columns of our `DataFrame`, with its accompanying `Index` to provide a label for each value in our column.

![pandas Series](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_series.svg "Schematic of a pandas Series")

The pandas `Series` is a fast and capable 1-dimensional array of nearly any data type we could want, and it can behave very similarly to a NumPy `ndarray` or a Python `dict`. You can take a look at any of the `Series` that make up your `DataFrame` with its label and the Python `dict` notation, or (if permitted), with dot-shorthand:

1. Python `dict` notation, using brackets:

In [None]:
t2m = df['temp_2m [degC]'] # Note: the column name must be typed exactly as it is named, so watch out for spaces!

2. As a shorthand, we might treat the column as an **attribute** and use *dot notation* to access it, but only in certain circumstances. These do *not* include the following case, due to the presence of spaces and other special characters in this particular column's name:

In [None]:
#t2m = df.'temp_2m [degC]' # commented out since this will fail!

<div class="alert alert-block alert-info">
<b>Tip:</b> It's never wrong to use the dictionary-based technique, so we'll use it in most of the examples in this and subsequent notebooks that use <b>Pandas</b>!</div>
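
To see when dot notation does work, here is a small sketch on a made-up `DataFrame` (the `elevation` column and all values are invented for illustration). Dot notation is available only when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method such as `shape`.

```python
import pandas as pd

# A tiny DataFrame invented for illustration.
demo = pd.DataFrame({
    "elevation": [507.6, 518.3, 276.1],    # a valid Python identifier
    "temp_2m [degC]": [14.2, 12.7, 15.0],  # contains a space and brackets
})

# Both forms return the same Series when the name is a valid identifier.
print(demo["elevation"].equals(demo.elevation))

# Only bracket notation works for this column name.
print(demo["temp_2m [degC]"])
```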

Let's view this `Series` object:

In [None]:
t2m

### A `Series` is a 1-dimensional array, but with the `DataFrame`'s `Index` attached. To represent it as a `NumPy` array, we use its `values` attribute.

In [None]:
t2m.values

<div class="alert alert-block alert-info">
    <b>Tip:</b> In this case, we must use <i>dot notation</i>, but this is because <code>values</code> is not a column name, but a particular <code>attribute</code> of this <code>Series</code> object.</div>
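
Besides the `values` attribute, a `Series` also has a `to_numpy()` method that returns a NumPy array. The sketch below builds a small made-up `Series` (invented station labels and temperatures) and confirms that both routes yield the same array, while the labels and name stay with the `Series` itself.

```python
import numpy as np
import pandas as pd

# A tiny Series invented for illustration.
temps = pd.Series(
    [14.2, 12.7, 15.0],
    index=["ADDI", "ANDE", "BATA"],
    name="temp_2m [degC]",
)

arr = temps.to_numpy()
print(type(arr))
print(np.array_equal(arr, temps.values))

# The labels and the name live on the Series, not on the bare array.
print(temps.index.tolist(), temps.name)
```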

Notice that there is **metadata** (i.e., *data about the data*) attached to this data series, in the form of the column name. Without it, we'd have no idea what the data represents or what units it's in.

<div class="alert alert-block alert-info">
    <b>Tip:</b> Once we start working with data in <b>NetCDF</b> format, as part of the <b>Xarray</b> library, we will see that NetCDF has even more advanced support for including metadata.</div>

There are several interesting methods available for `Series`. One is `describe`, which returns summary statistics for numerical `Series` objects:

In [None]:
t2m.describe()

<div class="alert alert-block alert-info">
    <b>Tip:</b> Yet another Pythonic nuance here ... note that we follow <code>describe</code> with a set of parentheses <code>()</code>. In this case, <code>describe</code> is a particular <i>method</i>, or <i>function</i> that is available for a Pandas <code>Series</code>.</div>
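
The result of `describe()` is itself a `Series`, so individual statistics can be pulled out by label. As a minimal sketch on a made-up `Series` of four values:

```python
import pandas as pd

# A tiny Series invented for illustration.
temps = pd.Series([10.0, 12.0, 14.0, 16.0], name="temp_2m [degC]")
stats = temps.describe()

# The labels of the summary statistics.
print(stats.index.tolist())

# Pick out single statistics using dict-style notation.
print(stats["mean"], stats["max"])
```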

<div class="alert alert-warning">
    <b>Exercise:</b> Now define a <code>Series</code> object called <b>RH</b> and populate it with the column from the <code>DataFrame</code> corresponding to <i>Relative Humidity</i>. Print out its values and get its summary statistics.</div>

In [None]:
# Write your code below. 
# After you have done so, you can compare your code to the solution by uncommenting the line in the cell below.


In [None]:
# %load /spare11/atm533/common/pandas/01a.py

<div class="alert alert-warning">
<b>Question: </b>Was the <b>count</b>, obtained when you ran the summary statistics method, the same as for 2-meter temperature? If not, why?</div>

In [None]:
# Uncomment the line below after you have considered the question.

In [None]:
# %load /spare11/atm533/common/week4/01b.py

---
## Summary
* Pandas is a very powerful tool for working with tabular (i.e. spreadsheet-style) data
* Pandas' core objects are the `DataFrame` and the `Series`
* A Pandas `DataFrame` consists of one or more `Series`
* Pandas can be helpful for exploratory data analysis, such as basic statistics

### What's Next?
In the next notebook, we will use Pandas to further examine meteorological data from the [New York State Mesonet](https://www2.nysmesonet.org) and display it on a map.

## Resources and References
1. [Getting Started with Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)