{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas 1: Introduction to Pandas\n", "\n", "
\"pandas
\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Overview\n", "### `Pandas`, along with `Matplotlib` and `Numpy`, forms the *Great Triumvirate* of the scientific Python ecosystem. Its features, as cited in https://pandas.pydata.org/about/, include:\n", "\n", "1. A fast and efficient DataFrame object for data manipulation with integrated indexing;\n", "\n", "2. Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;\n", "\n", "3. Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;\n", "\n", "4. Flexible reshaping and pivoting of data sets;\n", "\n", "5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;\n", "\n", "6. Columns can be inserted and deleted from data structures for size mutability;\n", "\n", "7. Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;\n", "\n", "8. High performance merging and joining of data sets;\n", "\n", "9. Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;\n", "\n", "10. Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;\n", "\n", "11. Highly optimized for performance, with critical code paths written in Cython or C.\n", "\n", "12. Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more (such as atmospheric science!)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Python basics| Necessary | |\n", "| Numpy basics | Helpful | |\n", "\n", "* **Time to learn**: 30 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin using Pandas, simply import it. You will often see the nickname `pd` used as an abbreviation for pandas in the import statement, just like `numpy` is often imported as `np`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically, one uses Pandas to read from and/or write to files containing *tabular data* ... e.g., text files consisting of rows and columns. Let's use for this notebook a file containing NYS Mesonet (NYSM) data from 0200 UTC 2 September 2021." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's view the first and last five lines of this data file as if we were using the Linux command-line interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: In a Jupyter notebook, you can invoke Linux commands by prepending each Linux command with a !
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Directly run the Linux `head` and `tail` commands to display the first five lines and last five lines from the data file.\n", "dataFile = '/spare11/atm533/data/nysm_data_2021090202.csv'\n", "!head -5 {dataFile}\n", "!echo .\n", "!echo .\n", "!echo .\n", "!tail -5 {dataFile}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this file has *comma-separated values*, hence the `csv` suffix is used for naming.\n", "It has a line, or *row* at the top identifying what each *column* corresponds to, data-wise. Then, there follows 126 rows, in alphabetical order for each of the 126 NYS Mesonet sites. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: Occasionally, some columns may have missing data. For an example of this, change the dataFile's file name so it references 0000 UTC Sep. 11, 2020, and then rerun the cell. Examine Wolcott's (WOLC) values. Change back to 0200 UTC 2 Sep. 2021 and re-run before you proceed!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although there is a lot of interesting data in this file, it's all currently in a text-based form, not terribly conducive to data analysis nor visualization. **Pandas** to the rescue!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's introduce ourselves to Pandas' two core objects: the `DataFrame` and the `Series`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The pandas [`DataFrame`](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)...\n", "... is a **labeled**, two dimensional columnal structure similar to a table, Excel-like spreadsheet, or the R language's `data.frame`.\n", "\n", "![dataframe schematic](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_dataframe.svg \"Schematic of a pandas DataFrame\")\n", "\n", "The `columns` that make up our `DataFrame` can be lists, dictionaries, NumPy arrays, pandas `Series`, or more. Within these `columns` our data can be any texts, numbers, dates and times, or many other data types you may have encountered in **Python** and **NumPy**. Shown here on the left in dark gray, our very first `column` is uniquely referrred to as an `Index`, and this contains information characterizing each row of our `DataFrame`. Similar to any other `column`, the `index` can label our rows by text, numbers, `datetime`s (a popular one!), or more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that a Pandas `DataFrame` consists of one or more Pandas `Series`. We'll discuss the latter in a moment, but for now, let's create a `DataFrame` from our text-based data file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can read the data into a Pandas `DataFrame` object by calling Pandas' `read_csv` method, since the data file consists of comma-separated values. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(dataFile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: We have used a generic object name, df to store the resulting DataFrame. We are free to choose any valid Python object name. For example, we could have named it nysmData21090200 (note that Python object names cannot start with a number).
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By simply typing the name of the `DataFrame` object, we can see its contents displayed in a browser-friendly format. Since we passed no arguments besides the name of the `csv` file, the `DataFrame` has the following default properties:\n", "1. The first and last *five* rows and columns are displayed\n", "1. The *column* names arise from the first line in the file\n", "1. The *row* names (or more precisely, row *index* names) are numbered sequentially, beginning at 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas allows us to use its `set_option` method to override the default settings. Let's use it so we see all rows and columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a relatively small `DataFrame` as ours, this is ok, but you definitely would want to return to a stricter limit for larger `DataFrame`s (Pandas can support millions of rows and/or columns!) Let's restrict back down to 10 rows and columns (five at the start, five at the end) now." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', 10)\n", "pd.set_option('display.max_columns', 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note: Recall that occasionally, there may be some missing data. In Pandas, these are denoted as NaN ... literally, \"Not a Number\".
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Pandas `DataFrame` is a 2-dimensional array of **rows** and **columns**. To get the array size, print out the `shape` attribute. The first element is the number of rows, while the second is the number of columns. The following cell prints out the number of rows and columns in this particular `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (df.shape)\n", "nRows = df.shape[0]\n", "nColumns = df.shape[1]\n", "print (\"There are %d rows and %d columns in this DataFrame.\" % (nRows, nColumns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas refers to the column and row names as `Index`es, which are 1-d(imensional) arrays. Display the names of the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colNames = df.columns\n", "colNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___ \n", "You might think that the row index would have a similar **attribute**, but it doesn't:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rowNames = df.rows\n", "rowNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "We actually use the `index` attribute to get at the row names. It's a special type of object, known as a `RangeIndex`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rowNames = df.index\n", "rowNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "We can view this `RangeIndex` as a Python `list` as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(rowNames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "Why are the row indices a sequence of integers beginning at 0, and not the first column (in this case, **station**) of the `DataFrame`? As we noted above, that is just the default behavior. We can specify what column to use for the row index as an additional argument to `pd.read_csv` :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_csv(dataFile,index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: We assign the resulting `DataFrame` to a different object, to distinguish it from the first one. Once again, we could use any valid object name we want.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "Now, let's examine the *2-meter temperature* column, and thus, begin our exploration of Pandas' second core object, the `Series`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The pandas [`Series`](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)...\n", "\n", "... is essentially any one of the columns of our `DataFrame`, with its accompanying `Index` to provide a label for each value in our column.\n", "\n", "![pandas Series](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_series.svg \"Schematic of a pandas Series\")\n", "\n", "The pandas `Series` is a fast and capable 1-dimensional array of nearly any data type we could want, and it can behave very similarly to a NumPy `ndarray` or a Python `dict`. You can take a look at any of the `Series` that make up your `DataFrame` with its label and the Python `dict` notation, or (if permitted), with dot-shorthand:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Python `dict` notation, using brackets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m = df['temp_2m [degC]'] # Note: column name must typed exactly as it is named, so watch out for spaces!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
As a shorthand, we might treat the column as an **attribute** and use *dot notation* to access it, but only in certain circumstances, which does *not* include the following, due to the presence of spaces and other special characters in this particular column's name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#t2m = df.'temp_2m [degC]' # commented out since this will fail!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
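The rule of thumb: dot notation works only when the column name is a valid Python identifier (and doesn't clash with an existing `DataFrame` attribute or method). A sketch with a hypothetical two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({'temp': [15.1, 13.4],
                    'temp_2m [degC]': [15.1, 13.4]})  # hypothetical values

a = toy['temp']            # bracket (dict-style) access: always works
b = toy.temp               # dot access: OK, 'temp' is a valid identifier
c = toy['temp_2m [degC]']  # brackets cope with spaces and brackets in the name

print(a.equals(b))  # True: the same column, accessed two ways
```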
\n", "Tip: It's never wrong to use the dictionary-based technique, so we'll use it in most of the examples in this and subsequent notebooks that use Pandas!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view this `Series` object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A `Series` is a 1-dimensional array, but with the `DataFrame`'s `Index` attached. To represent it as a `Numpy` array, we use its `values` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: In this case, we must use dot notation, but this is because values is not a column name, but a particular attribute of this Series object.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there is **metadata** ... i.e., *data about the data*, attached to this data series ... in the form of the column index name. Without it, we'd have no idea what the data represents nor what units its in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Once we start working with data in NetCDF format, as part of the Xarray library, we will see that NetCDF has even more advanced support for including metadata.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several interesting methods available for `Series`. One is `describe`, which prints summary statistics on numerical `Series` objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Yet another Pythonic nuance here ... note that we follow describe with a set of parentheses (). In this case, describe is a particular method, or function that is available for a Pandas Series.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Now define a Series object called RH and populate it with the column from the DataFrame\n", " corresponding to Relative Humidity. Print out its values and get its summary statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code below. \n", "# After you have done so, you can compare your code to the solution by uncommenting the line in the cell below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load /spare11/atm533/common/pandas/01a.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question: Was the count, obtained when you ran the summary statistics method, the same as for 2-meter temperature? If not, why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the line below after you have considered the question." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load /spare11/atm533/common/week4/01b.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* Pandas is a very powerful tool for working with tabular (i.e. spreadsheet-style) data\n", "* Pandas core objects are the `DataFrame` and the `Series`\n", "* A Pandas `DataFrame` consists of one or more `Series`\n", "* Pandas can be helpful for exploratory data analysis, such as basic statistics\n", "\n", "### What's Next?\n", "In the next notebook, we will use Pandas to further examine meteorological data from the [New York State Mesonet](https://www2.nysmesonet.org) and display it on a map.\n", "\n", "## Resources and References\n", "1. [Getting Started with Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n", "1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }