{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas 1: Introduction to Pandas\n", "\n", "
\"pandas
\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Overview\n", "### `Pandas`, along with `Matplotlib` and `Numpy`, forms the *Great Triumvirate* of the scientific Python ecosystem. Its features, as cited in https://pandas.pydata.org/about/, include:\n", "\n", "1. A fast and efficient DataFrame object for data manipulation with integrated indexing;\n", "\n", "2. Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;\n", "\n", "3. Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;\n", "\n", "4. Flexible reshaping and pivoting of data sets;\n", "\n", "5. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;\n", "\n", "6. Columns can be inserted and deleted from data structures for size mutability;\n", "\n", "7. Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;\n", "\n", "8. High performance merging and joining of data sets;\n", "\n", "9. Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;\n", "\n", "10. Time series-functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;\n", "\n", "11. Highly optimized for performance, with critical code paths written in Cython or C.\n", "\n", "12. Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more (such as atmospheric science!)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| Python basics| Necessary | |\n", "| Numpy basics | Helpful | |\n", "\n", "* **Time to learn**: 30 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin using Pandas, simply import it. You will often see the nickname `pd` used as an abbreviation for pandas in the import statement, just like `numpy` is often imported as `np`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Typically, one uses Pandas to read from and/or write to files containing *tabular data* ... e.g., text files consisting of rows and columns. Let's use for this notebook a file containing NYS Mesonet (NYSM) data from 0200 UTC 2 September 2021." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's view the first and last five lines of this data file as if we were using the Linux command-line interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: In a Jupyter notebook, you can invoke Linux commands by prepending each Linux command with a !
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Directly run the Linux `head` and `tail` commands to display the first five lines and last five lines from the data file.\n", "dataFile = '/spare11/atm533/data/nysm_data_2021090202.csv'\n", "!head -5 {dataFile}\n", "!echo .\n", "!echo .\n", "!echo .\n", "!tail -5 {dataFile}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that this file has *comma-separated values*, hence the `csv` suffix is used for naming.\n", "It has a line, or *row* at the top identifying what each *column* corresponds to, data-wise. Then, there follows 126 rows, in alphabetical order for each of the 126 NYS Mesonet sites. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Note: Occasionally, some columns may have missing data. For an example of this, change the dataFile's file name so it references 0000 UTC Sep. 11, 2020, and then rerun the cell. Examine Wolcott's (WOLC) values. Change back to 0200 UTC 2 Sep. 2021 and re-run before you proceed!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although there is a lot of interesting data in this file, it's all currently in a text-based form, not terribly conducive to data analysis nor visualization. **Pandas** to the rescue!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's introduce ourselves to Pandas' two core objects: the `DataFrame` and the `Series`. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The pandas [`DataFrame`](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe)...\n", "... is a **labeled**, two dimensional columnal structure similar to a table, Excel-like spreadsheet, or the R language's `data.frame`.\n", "\n", "![dataframe schematic](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_dataframe.svg \"Schematic of a pandas DataFrame\")\n", "\n", "The `columns` that make up our `DataFrame` can be lists, dictionaries, NumPy arrays, pandas `Series`, or more. Within these `columns` our data can be any texts, numbers, dates and times, or many other data types you may have encountered in **Python** and **NumPy**. Shown here on the left in dark gray, our very first `column` is uniquely referrred to as an `Index`, and this contains information characterizing each row of our `DataFrame`. Similar to any other `column`, the `index` can label our rows by text, numbers, `datetime`s (a popular one!), or more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It turns out that a Pandas `DataFrame` consists of one or more Pandas `Series`. We'll discuss the latter in a moment, but for now, let's create a `DataFrame` from our text-based data file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can read the data into a Pandas `DataFrame` object by calling Pandas' `read_csv` method, since the data file consists of comma-separated values. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(dataFile)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: We have used a generic object name, df to store the resulting DataFrame. We are free to choose any valid Python object name. For example, we could have named it nysmData21090200 (note that Python object names cannot start with a number).
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By simply typing the name of the `DataFrame` object, we can see its contents displayed in a browser-friendly format. Since we passed no arguments besides the name of the `csv` file, the `DataFrame` has the following default properties:\n", "1. The first and last *five* rows and columns are displayed\n", "1. The *column* names arise from the first line in the file\n", "1. The *row* names (or more precisely, row *index* names) are numbered sequentially, beginning at 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas allows us to use its `set_option` method to override the default settings. Let's use it so we see all rows and columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a relatively small `DataFrame` as ours, this is ok, but you definitely would want to return to a stricter limit for larger `DataFrame`s (Pandas can support millions of rows and/or columns!) Let's restrict back down to 10 rows and columns (five at the start, five at the end) now." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_rows', 10)\n", "pd.set_option('display.max_columns', 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note: Recall that occasionally, there may be some missing data. In Pandas, these are denoted as NaN ... literally, \"Not a Number\".
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A Pandas `DataFrame` is a 2-dimensional array of **rows** and **columns**. To get the array size, print out the `shape` attribute. The first element is the number of rows, while the second is the number of columns. The following cell prints out the number of rows and columns in this particular `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print (df.shape)\n", "nRows = df.shape[0]\n", "nColumns = df.shape[1]\n", "print (\"There are %d rows and %d columns in this DataFrame.\" % (nRows, nColumns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas refers to the column and row names as `Index`es, which are 1-d(imensional) arrays. Display the names of the columns:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "colNames = df.columns\n", "colNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___ \n", "You might think that the row index would have a similar **attribute**, but it doesn't:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rowNames = df.rows\n", "rowNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "We actually use the `index` attribute to get at the row names. It's a special type of object, known as a `RangeIndex`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rowNames = df.index\n", "rowNames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "We can view this `RangeIndex` as a Python `list` as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(rowNames)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "Why are the row indices a sequence of integers beginning at 0, and not the first column (in this case, **station**) of the `DataFrame`? As we noted above, that is just the default behavior. We can specify what column to use for the row index as an additional argument to `pd.read_csv` :" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2 = pd.read_csv(dataFile,index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: We assign the resulting `DataFrame` to a different object, to distinguish it from the first one. Once again, we could use any valid object name we want.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df2.index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "___\n", "Now, let's examine the *2-meter temperature* column, and thus, begin our exploration of Pandas' second core object, the `Series`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The pandas [`Series`](https://pandas.pydata.org/docs/user_guide/dsintro.html#series)...\n", "\n", "... is essentially any one of the columns of our `DataFrame`, with its accompanying `Index` to provide a label for each value in our column.\n", "\n", "![pandas Series](https://github.com/pandas-dev/pandas/raw/master/doc/source/_static/schemas/01_table_series.svg \"Schematic of a pandas Series\")\n", "\n", "The pandas `Series` is a fast and capable 1-dimensional array of nearly any data type we could want, and it can behave very similarly to a NumPy `ndarray` or a Python `dict`. You can take a look at any of the `Series` that make up your `DataFrame` with its label and the Python `dict` notation, or (if permitted), with dot-shorthand:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Python `dict` notation, using brackets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m = df['temp_2m [degC]'] # Note: column name must typed exactly as it is named, so watch out for spaces!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
As a shorthand, we might treat the column as an **attribute** and use *dot notation* to access it, but only in certain circumstances, which does *not* include the following, due to the presence of spaces and other special characters in this particular column's name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#t2m = df.'temp_2m [degC]' # commented out since this will fail!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
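The rule of thumb: dot notation works only when the column name is a valid Python identifier (and doesn't clash with an existing `DataFrame` attribute or method). A sketch with a hypothetical two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({'temp': [15.1, 13.4],
                    'temp_2m [degC]': [15.1, 13.4]})  # hypothetical values

a = toy['temp']            # bracket (dict-style) access: always works
b = toy.temp               # dot access: OK, 'temp' is a valid identifier
c = toy['temp_2m [degC]']  # brackets cope with spaces and brackets in the name

print(a.equals(b))  # True: the same column, accessed two ways
```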
\n", "Tip: It's never wrong to use the dictionary-based technique, so we'll use it in most of the examples in this and subsequent notebooks that use Pandas!
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view this `Series` object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A `Series` is a 1-dimensional array, but with the `DataFrame`'s `Index` attached. To represent it as a `Numpy` array, we use its `values` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: In this case, we must use dot notation, but this is because values is not a column name, but a particular attribute of this Series object.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there is **metadata** ... i.e., *data about the data*, attached to this data series ... in the form of the column index name. Without it, we'd have no idea what the data represents nor what units its in." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Once we start working with data in NetCDF format, as part of the Xarray library, we will see that NetCDF has even more advanced support for including metadata.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several interesting methods available for `Series`. One is `describe`, which prints summary statistics on numerical `Series` objects:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t2m.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Tip: Yet another Pythonic nuance here ... note that we follow describe with a set of parentheses (). In this case, describe is a particular method, or function that is available for a Pandas Series.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Exercise: Now define a Series object called RH and populate it with the column from the DataFrame\n", " corresponding to Relative Humidity. Print out its values and get its summary statistics." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Write your code below. \n", "# After you have done so, you can compare your code to the solution by uncommenting the line in the cell below.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load /spare11/atm533/common/pandas/01a.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Question: Was the count, obtained when you ran the summary statistics method, the same as for 2-meter temperature? If not, why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the line below after you have considered the question." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load /spare11/atm533/common/week4/01b.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* Pandas is a very powerful tool for working with tabular (i.e. spreadsheet-style) data\n", "* Pandas core objects are the `DataFrame` and the `Series`\n", "* A Pandas `DataFrame` consists of one or more `Series`\n", "* Pandas can be helpful for exploratory data analysis, such as basic statistics\n", "\n", "### What's Next?\n", "In the next notebook, we will use Pandas to further examine meteorological data from the [New York State Mesonet](https://www2.nysmesonet.org) and display it on a map.\n", "\n", "## Resources and References\n", "1. [Getting Started with Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)\n", "1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }