{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\"pandas
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas 5: Combining and Writing\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Previously, we worked with NYS Mesonet data during the period when the remnants of Hurricane Ida was impacting parts of the region. We needed to use site location data that existed in a separate file in order to geolocate NYSM sites and to reduce station pressure to sea-level. It would be more convenient if we could have all the necessary data in a single file.\n", "#### We'll once again use NYS Mesonet data, but now we'll use real-time info!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview\n", "1. Select the most recent time from a current NYSM file.\n", "1. Add columns from the NYSM Site Locations file to the new DataFrame.\n", "1. Write out the new DataFrame to a csv file.\n", "\n", "\n", "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "|Pandas Notebooks 1-5 | Necessary | |\n", "\n", "* **Time to learn**: 30 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from datetime import datetime " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select the most recent time from a current NYSM dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Open the file containing the most recent hour's worth of NYSM data (12 5-minute timesteps)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use a Linux shell command to take a peek at the current CSV file\n", "! head -5 /data1/nysm/latest.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As we have done before, we will employ a `lambda` function to interpret the data in the time column as Datetime objects, but keep the default row index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# First define the format and then define the function\n", "timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n", "\n", "# This function will iterate over each string in a 1-d array\n", "# and use Pandas' implementation of strptime to convert the string into a datetime object.\n", "parseTime = lambda x: datetime.strptime(x, timeFormat)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data = pd.read_csv('/data1/nysm/latest.csv',parse_dates=['time'], date_parser=parseTime)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to select data corresponding to the most recent time, we treat \"maximum\" as \"latest\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "most_recent = nysm_data.time.max()\n", "most_recent" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set a *condition* on which we will select matching rows." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "condition = nysm_data.time == most_recent" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data[condition]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above cell returns another `DataFrame`. Note its row indices reflect their values in the original `DataFrame`. We need to **reset** this index. For this, we'll use Pandas `reset_index` method." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "Tip: Remember, you can typically view information about objects, functions and methods in Python by appending a ?. For example, you could create an execute a cell below that contains the line nysm_data.reset_index?
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Uncomment the next line if you wish to get info on the `reset_index` method\n", "# nysm_data.reset_index?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data[condition].reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the second column, called `index`? That contains the row numbers from the original **nysm_data** `DataFrame` which we don't want. Let's repeat the `reset_index` call, but also include the `drop=True` argument to `reset_index`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_data[condition].reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have established how to reset the row index, create a new DataFrame object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest = nysm_data[condition].reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Add columns from the NYSM Site Locations file to the new DataFrame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites = pd.read_csv('/spare11/atm533/data/nysm.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nysm_sites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To add a new column to a `DataFrame`, just create a new object for it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest['lat'] = nysm_sites['lat']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest['lon'] = nysm_sites['lon']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest['elevation'] = nysm_sites['elevation']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest['name'] = nysm_sites['name']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at what we have now. If you scroll over to the right, you will see the four columns have been appended to the `df_latest` `DataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df_latest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at all the column names." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Write out the new DataFrame to a csv file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll employ Pandas' `to_csv` method, and specify `index=False` to avoid the creation of a new *dummy* column. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_latest.to_csv('nysm_latest.csv',index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify that we can read in the new csv file with Pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('nysm_latest.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Summary\n", "* Pandas includes functions to combine columns from multiple `DataFrame`s.\n", "* Often, it is necessary to re-index the modified `DataFrame.\n", "* Modifed `DataFrame`s are easily saved as **csv** files.\n", "\n", "### What's Next?\n", "In the next notebook, we will create **multiple indexes** on NYSM `DataFrame`s.\n", "\n", "## Resources and References\n", "1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 August 2022 Environment", "language": "python", "name": "aug22" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }