"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas 5: Combining and Writing\n",
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Previously, we worked with NYS Mesonet data during the period when the remnants of Hurricane Ida was impacting parts of the region. We needed to use site location data that existed in a separate file in order to geolocate NYSM sites and to reduce station pressure to sea-level. It would be more convenient if we could have all the necessary data in a single file.\n",
"#### We'll once again use NYS Mesonet data, but now we'll use real-time info!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview\n",
"1. Select the most recent time from a current NYSM file.\n",
"1. Add columns from the NYSM Site Locations file to the new DataFrame.\n",
"1. Write out the new DataFrame to a csv file.\n",
"\n",
"\n",
"## Prerequisites\n",
"\n",
"| Concepts | Importance | Notes |\n",
"| --- | --- | --- |\n",
"|Pandas Notebooks 1-5 | Necessary | |\n",
"\n",
"* **Time to learn**: 30 minutes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from datetime import datetime "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Select the most recent time from a current NYSM dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the file containing the most recent hour's worth of NYSM data (12 5-minute timesteps)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Use a Linux shell command to take a peek at the current CSV file\n",
"! head -5 /data1/nysm/latest.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### As we have done before, we will employ a `lambda` function to interpret the data in the time column as Datetime objects, but keep the default row index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First define the format and then define the function\n",
"timeFormat = \"%Y-%m-%d %H:%M:%S UTC\"\n",
"\n",
"# This function will iterate over each string in a 1-d array\n",
"# and use Pandas' implementation of strptime to convert the string into a datetime object.\n",
"parseTime = lambda x: datetime.strptime(x, timeFormat)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data = pd.read_csv('/data1/nysm/latest.csv',parse_dates=['time'], date_parser=parseTime)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to select data corresponding to the most recent time, we treat \"maximum\" as \"latest\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"most_recent = nysm_data.time.max()\n",
"most_recent"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set a *condition* on which we will select matching rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"condition = nysm_data.time == most_recent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data[condition]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above cell returns another `DataFrame`. Note its row indices reflect their values in the original `DataFrame`. We need to **reset** this index. For this, we'll use Pandas `reset_index` method."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"Tip: Remember, you can typically view information about objects, functions and methods in Python by appending a ?. For example, you could create an execute a cell below that contains the line nysm_data.reset_index?
"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uncomment the next line if you wish to get info on the `reset_index` method\n",
"# nysm_data.reset_index?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data[condition].reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the second column, called `index`? That contains the row numbers from the original **nysm_data** `DataFrame` which we don't want. Let's repeat the `reset_index` call, but also include the `drop=True` argument to `reset_index`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_data[condition].reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have established how to reset the row index, create a new DataFrame object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest = nysm_data[condition].reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Add columns from the NYSM Site Locations file to the new DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites = pd.read_csv('/spare11/atm533/data/nysm.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nysm_sites"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To add a new column to a `DataFrame`, just create a new object for it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest['lat'] = nysm_sites['lat']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest['lon'] = nysm_sites['lon']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest['elevation'] = nysm_sites['elevation']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest['name'] = nysm_sites['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at what we have now. If you scroll over to the right, you will see the four columns have been appended to the `df_latest` `DataFrame`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df_latest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at all the column names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Write out the new DataFrame to a csv file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll employ Pandas' `to_csv` method, and specify `index=False` to avoid the creation of a new *dummy* column. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_latest.to_csv('nysm_latest.csv',index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Verify that we can read in the new csv file with Pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('nysm_latest.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Summary\n",
"* Pandas includes functions to combine columns from multiple `DataFrame`s.\n",
"* Often, it is necessary to re-index the modified `DataFrame.\n",
"* Modifed `DataFrame`s are easily saved as **csv** files.\n",
"\n",
"### What's Next?\n",
"In the next notebook, we will create **multiple indexes** on NYSM `DataFrame`s.\n",
"\n",
"## Resources and References\n",
"1. [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html#user-guide)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 August 2022 Environment",
"language": "python",
"name": "aug22"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}