{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center><img src=\"https://github.com/pandas-dev/pandas/raw/main/web/pandas/static/img/pandas.svg\" alt=\"pandas Logo\" style=\"width: 800px;\"/></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pandas 5: Combining and Writing\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Previously, we worked with NYS Mesonet data during the period when the remnants of Hurricane Ida was impacting parts of the region. We needed to use site location data that existed in a separate file in order to geolocate NYSM sites and to reduce station pressure to sea-level. It would be more convenient if we could have all the necessary data in a single file.\n",
    "#### We'll once again use NYS Mesonet data, but now we'll use real-time info!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "1. Select the most recent time from a current NYSM file.\n",
    "1. Add columns from the NYSM Site Locations file to the new DataFrame.\n",
    "1. Write out the new DataFrame to a csv file.\n",
    "\n",
    "\n",
    "## Prerequisites\n",
    "\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
    "|Pandas Notebooks 1-5 | Necessary | |\n",
    "\n",
    "* **Time to learn**: 30 minutes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from datetime import datetime "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select the most recent time from a current NYSM dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Open the file containing the most recent hour's worth of NYSM data (12 5-minute timesteps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a Linux shell command to take a peek at the current CSV file\n",
    "! head -5 /data1/nysm/latest.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### As we have done before, we will employ a `lambda` function to interpret the data in the time column as Datetime objects, but keep the default row index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First define the format and then define the function\n",
    "timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n",
    "\n",
    "# This function will iterate over each string in a 1-d array\n",
    "# and use Pandas' implementation of strptime to convert the string into a datetime object.\n",
    "parseTime = lambda x: datetime.strptime(x, timeFormat)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data = pd.read_csv('/data1/nysm/latest.csv',parse_dates=['time'], date_parser=parseTime)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to select data corresponding to the most recent time, we treat \"maximum\" as \"latest\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "most_recent = nysm_data.time.max()\n",
    "most_recent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set a *condition* on which we will select matching rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "condition = nysm_data.time == most_recent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data[condition]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above cell returns another `DataFrame`. Note its row indices reflect their values in the original `DataFrame`. We need to **reset** this index. For this, we'll use Pandas `reset_index` method."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-info\">\n",
    "<b>Tip:</b> Remember, you can typically view information about objects, functions and methods in Python by appending a <b>?</b>. For example, you could create an execute a cell below that contains the line <code>nysm_data.reset_index?</code></div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment the next line if you wish to get info on the `reset_index` method\n",
    "# nysm_data.reset_index?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data[condition].reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice the second column, called `index`? That contains the row numbers from the original **nysm_data** `DataFrame` which we don't want. Let's repeat the `reset_index` call, but also include the `drop=True` argument to `reset_index`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_data[condition].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have established how to reset the row index, create a new DataFrame object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest = nysm_data[condition].reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Add columns from the NYSM Site Locations file to the new DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites = pd.read_csv('/spare11/atm533/data/nysm.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nysm_sites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To add a new column to a `DataFrame`, just create a new object for it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest['lat'] = nysm_sites['lat']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest['lon'] = nysm_sites['lon']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest['elevation'] = nysm_sites['elevation']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest['name'] = nysm_sites['name']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at what we have now. If you scroll over to the right, you will see the four columns have been appended to the `df_latest` `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "df_latest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at all the column names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Write out the new DataFrame to a csv file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll employ Pandas' `to_csv` method, and specify `index=False` to avoid the creation of a new *dummy* column. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_latest.to_csv('nysm_latest.csv',index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Verify that we can read in the new csv file with Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv('nysm_latest.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "## Summary\n",
    "* Pandas includes functions to combine columns from multiple `DataFrame`s.\n",
    "* Often, it is necessary to re-index the modified `DataFrame.\n",
    "* Modifed `DataFrame`s are easily saved as **csv** files.\n",
    "\n",
    "### What's Next?\n",
    "In the next notebook, we will create **multiple indexes** on NYSM `DataFrame`s.\n",
    "\n",
    "## Resources and References\n",
    "1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 August 2022 Environment",
   "language": "python",
   "name": "aug22"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}