<img src="http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png" align="center" width="50%">

# Xarray 2: DataArrays
---

## Overview
1. Why not use a Pandas DataFrame for gridded data?
1. Anatomy of a DataArray created from a gridded NetCDF data file 

## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| Python basics| Necessary | |
| Numpy basics| Necessary | |
| Pandas | Necessary | |
| Xarray 1 Intro | Necessary | |

* **Time to learn**: 15 minutes

## Imports

In [None]:
import xarray as xr
import pandas as pd

## Why not use a Pandas DataFrame for gridded data?

 
Here is a Pandas DataFrame representation of the 500 hPa geopotential height for 0000 UTC 30 October 2012, from the [ERA-5 reanalysis](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5):

In [None]:
df = pd.read_csv('/spare11/atm533/data/2012103000_z500_era5.csv',index_col='latitude')

In [None]:
df

In [None]:
df.info()

Output the geopotential height at the grid point closest to Albany (42.75°N, 73.75°W). Note that we have expressed the longitude in degrees East, i.e. 286.25.

In [None]:
df.loc[[42.75],['286.25']]

While Pandas is great for many purposes, extending its inherently 2-d data representation to a multidimensional dataset, such as 4-dimensional (time, vertical level, longitude (x) and latitude (y)) gridded numerical weather prediction (NWP) or climate model output, is unwise.
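
To make the awkwardness concrete, here is a minimal sketch of what squeezing a 4-dimensional grid into Pandas might look like. The grid size, coordinate values, and variable names are all made up for illustration; they are not taken from the ERA-5 file.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-D grid: 2 times x 3 pressure levels x 4 latitudes x 5 longitudes.
times = pd.to_datetime(["2012-10-30 00:00", "2012-10-30 06:00"])
levels = [250, 500, 850]                      # hPa
lats = [42.0, 43.0, 44.0, 45.0]               # degrees North
lons = [284.0, 285.0, 286.0, 287.0, 288.0]    # degrees East

values = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)

# Pandas has no 4-D container, so the grid is flattened into one long column
# indexed by every (time, level, latitude, longitude) combination.
index = pd.MultiIndex.from_product(
    [times, levels, lats, lons],
    names=["time", "level", "latitude", "longitude"],
)
flat = pd.Series(values.ravel(), index=index, name="z")

# Selecting a single grid point now needs a full four-part key.
point = flat.loc[(times[0], 500, 43.0, 286.0)]

print(flat.shape)   # one row per grid point
print(point)
```

The grid structure survives only implicitly in the index, which is one reason a dedicated multidimensional container is a better fit.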

#### Gridded datasets and Xarray

The Xarray package is ideally suited for gridded data, in particular NetCDF. It builds on and extends the multi-dimensional data structure in NumPy. We will see that some of the same methods we've used for Pandas have analogues in Xarray.

#### Anatomy of a DataArray

Let's look at the same 500 hPa geopotential height field, but this time we'll use Xarray to open the NetCDF representation.

In [None]:
da = xr.open_dataarray('/spare11/atm533/data/2012103000_z500_era5.nc')

First, let's compare the size of the gridded field as represented in a plain-text CSV file and in NetCDF.

In [None]:
! ls -lh /spare11/atm533/data/2012103000_z500_era5.csv
! ls -lh /spare11/atm533/data/2012103000_z500_era5.nc

The NetCDF-formatted file is smaller. While this particular grid is less than 10 MB either way, the space savings become significant as you scale up!

Just as Pandas has its `Series` and `DataFrame` core data structures, Xarray has two "workhorses": the `DataArray` and the `Dataset`. And just as a Pandas `DataFrame` consists of multiple `Series`, an Xarray `Dataset` is made up of `DataArray` objects. Let's first look at our `DataArray`.

In [None]:
# Similar to a Pandas DataFrame, we get a nice (and even more interactive) HTML representation of the object.
da

***
#### The DataArray has the following properties:
1. It is a named *data variable*: 'z'
1. It has three named *dimensions*, in order: time, latitude, and longitude.
1. It has three *coordinate variables*, corresponding to the *dimensions*.
1. It has *attributes* which are the data variable's *metadata*.
1. Its coordinate variables may have their own metadata as well.
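
The five properties listed above can be sketched by building a small `DataArray` by hand. This is only an illustration: the grid, the values, and the attribute strings are placeholders, not the contents of the ERA-5 file.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 1 x 3 x 4 grid standing in for the real field.
da_demo = xr.DataArray(
    np.zeros((1, 3, 4)),
    name="z",                                    # 1. named data variable
    dims=("time", "latitude", "longitude"),      # 2. named dimensions, in order
    coords={                                     # 3. one coordinate per dimension
        "time": pd.to_datetime(["2012-10-30 00:00"]),
        # 5. a (dims, values, attrs) tuple attaches metadata to a coordinate
        "latitude": ("latitude", [43.0, 42.75, 42.5], {"units": "degrees_north"}),
        "longitude": ("longitude", [286.0, 286.25, 286.5, 286.75], {"units": "degrees_east"}),
    },
    attrs={"units": "placeholder units"},        # 4. metadata of the data variable
)

print(da_demo.name, da_demo.dims)
print(da_demo.attrs)
print(da_demo.latitude.attrs)
```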

### Let's examine each of these five properties.

#### 1. The *data variable*, *z* in this case, is represented by the `DataArray` object itself. We can query various properties of it, with methods similar to Pandas.

In [None]:
# Akin to column and row indices in Pandas:
da.indexes

In [None]:
da.mean()

In [None]:
da.max()

In [None]:
da.min()

### This invocation will return the lat, lon, and time of the maximum value in the DataArray (source: https://stackoverflow.com/questions/40179593/how-to-get-the-coordinates-of-the-maximum-in-xarray)

In [None]:
da.where(da==da.max(), drop=True).coords
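
To see what the `where` idiom does, it may help to try it on a tiny field whose maximum is known in advance. The sketch below uses a made-up 1 x 3 x 4 array (not the ERA-5 data) and also shows one possible alternative, `argmax` over a list of dimensions followed by `isel`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 1 x 3 x 4 field with a single, known maximum (99.0).
da_demo = xr.DataArray(
    np.array([[[1.0, 2.0, 3.0, 4.0],
               [5.0, 6.0, 99.0, 7.0],
               [8.0, 9.0, 10.0, 11.0]]]),
    dims=("time", "latitude", "longitude"),
    coords={
        "time": pd.to_datetime(["2012-10-30 00:00"]),
        "latitude": [43.0, 42.75, 42.5],
        "longitude": [286.0, 286.25, 286.5, 286.75],
    },
    name="z",
)

# Same idiom as above: mask everything except the maximum, then drop the rest.
peak = da_demo.where(da_demo == da_demo.max(), drop=True)
peak_lat = float(peak.latitude.values[0])
peak_lon = float(peak.longitude.values[0])

# Alternative: argmax over a list of dimensions returns a dict of integer
# positions, which can be passed straight to isel().
peak_idx = da_demo.argmax(dim=["time", "latitude", "longitude"])
same_peak = da_demo.isel(peak_idx)

print(peak_lat, peak_lon)
print(same_peak.item(), same_peak.latitude.item(), same_peak.longitude.item())
```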

In [None]:
# We can use loc to select via dimension values, but note that order of indices must follow dimension order. 
da.loc['2012-10-30 00:00',42.75,286.25]

We can alternatively use Xarray's `sel` method, where we specify the names of the dimensions along with the values we are selecting. With `sel`, the dimensions can be given in any order.

In [None]:
da.sel(latitude=42.75, longitude = 286.25)
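
The difference between `loc` and `sel` can be sketched on a small synthetic field. The 2-d grid below is hypothetical and leaves out the time dimension for brevity. The last line also shows `method="nearest"`, which may be handy when the point of interest does not fall exactly on a grid point.

```python
import numpy as np
import xarray as xr

# Hypothetical 3 x 4 grid; values 0..11 make it easy to see which cell is picked.
da_demo = xr.DataArray(
    np.arange(12, dtype=float).reshape(3, 4),
    dims=("latitude", "longitude"),
    coords={
        "latitude": [42.5, 42.75, 43.0],
        "longitude": [286.0, 286.25, 286.5, 286.75],
    },
)

# loc: values must be given in dimension order (latitude first, then longitude).
by_position = da_demo.loc[42.75, 286.25]

# sel: dimensions are named, so the order of the keywords does not matter.
by_name = da_demo.sel(longitude=286.25, latitude=42.75)

# sel with method="nearest": pick the grid point closest to an off-grid location.
nearest = da_demo.sel(latitude=42.7, longitude=286.2, method="nearest")

print(by_position.item(), by_name.item(), nearest.item())
```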

#### 2. Dimension names
In Xarray, dimensions can be thought of as extensions of Pandas' 2-d row/column indices (aka *axes*). We can assign names, or *labels*, to Pandas indexes; in Xarray, these *labeled axes* are a necessary (and excellent) feature.

In [None]:
da.dims
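
One practical payoff of named dimensions is that reductions can refer to a dimension by name rather than by axis number. A minimal sketch, using a made-up 2 x 3 x 4 array:

```python
import numpy as np
import xarray as xr

values = np.arange(24, dtype=float).reshape(2, 3, 4)
da_demo = xr.DataArray(values, dims=("time", "latitude", "longitude"))

# NumPy: the reader has to remember that axis 2 is longitude.
numpy_mean = values.mean(axis=2)

# Xarray: the dimension name documents the intent.
zonal_mean = da_demo.mean(dim="longitude")

print(zonal_mean.dims)
```

Both calls compute the same numbers; the Xarray version simply keeps track of which dimensions remain.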

#### 3. Coordinates

*Coordinate variables* in Xarray are 1-dimensional arrays that correspond to the *Data variable*'s dimensions.
In this case, `z` has dimension coordinates of longitude, latitude, and time; each of these three dimension coordinates consists of an array of values, plus metadata.

In [None]:
da.coords

We can assign each coordinate variable to its own object.

In [None]:
lons = da.longitude

In [None]:
lats = da.latitude

In [None]:
times = da.time

In [None]:
lons
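
A coordinate pulled out this way is itself a 1-dimensional `DataArray`, so it supports arithmetic. As a small sketch on a hypothetical grid (not the ERA-5 file), here is one way to convert longitudes from degrees East (0 to 360) to signed longitudes (-180 to 180):

```python
import numpy as np
import xarray as xr

da_demo = xr.DataArray(
    np.zeros((2, 3)),
    dims=("latitude", "longitude"),
    coords={"latitude": [43.0, 42.75], "longitude": [286.0, 286.25, 286.5]},
)

lons_demo = da_demo.longitude      # a 1-D DataArray, not a bare NumPy array
lons_array = lons_demo.values      # the underlying NumPy array

# Convert degrees East (0 to 360) to signed longitudes (-180 to 180).
lons_signed = (lons_demo + 180) % 360 - 180

print(lons_signed.values)
```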

#### 4. The data variables will typically have attributes (metadata) attached to them.

In [None]:
da.attrs

In [None]:
da.units

#### 5. The coordinate variables will likely have metadata as well.

In [None]:
times.attrs

In [None]:
lats.attrs

### Just as with Pandas, Xarray has a built-in hook to Matplotlib so we can take a quick look at our data.

In [None]:
da.plot()

#### The NetCDF file that we read in had just a single data variable, at one time and one vertical level. Typically, gridded model or reanalysis data in NetCDF format will contain multiple variables, i.e., multiple Xarray `DataArray`s, which together make up a `Dataset`. In the rest of our Xarray notebooks, we will read in, analyze, and visualize examples of `Dataset`s.
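
As a rough sketch of how the two structures relate, the snippet below bundles two hypothetical `DataArray`s that share the same grid into a `Dataset` and then pulls one back out. The variable names `z` and `t` and all values are placeholders.

```python
import numpy as np
import xarray as xr

# Two made-up variables on the same 3 x 2 grid.
dims = ("latitude", "longitude")
coords = {"latitude": [43.0, 42.75, 42.5], "longitude": [286.0, 286.25]}

z = xr.DataArray(np.full((3, 2), 5500.0), dims=dims, coords=coords, name="z")
t = xr.DataArray(np.full((3, 2), 250.0), dims=dims, coords=coords, name="t")

# A Dataset is a dict-like collection of DataArrays on shared coordinates.
ds = xr.Dataset({"z": z, "t": t})

# Indexing by variable name returns a DataArray again.
z_again = ds["z"]

print(list(ds.data_vars))
print(type(z_again).__name__)
```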

---
## Summary
* Xarray builds upon the data models provided by NumPy and Pandas.
* Xarray provides read and write methods for a variety of gridded datasets, such as NetCDF, GRIB, and Zarr.
* Xarray `Dataset`s consist of one or more `DataArray`s.
* A `DataArray` typically contains one data variable, with labeled dimensions and coordinates.


### What's Next?
In the next notebook, we'll work with an Xarray `Dataset` and make some plots from it.

## Resources and References
1. [Xarray Documentation](http://xarray.pydata.org)
1. [SciPy 2020 Xarray Tutorial](https://github.com/xarray-contrib/xarray-tutorial/tree/master/scipy-tutorial)