{ "cells": [ { "cell_type": "markdown", "id": "6e124235-3846-4fcb-b533-10fa5856b4b4", "metadata": {}, "source": [ "\n", "# Pandas: Working with a JSON file" ] }, { "cell_type": "markdown", "id": "6e75bb80-da84-47a9-ae2d-210eb06d492e", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "9d226e98-85a1-4197-a757-448bd2bf4563", "metadata": {}, "source": [ "## Overview\n", "In this notebook, we will create a [Pandas Dataframe](https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe) from a remotely-served [JSON](https://www.json.org/) file. This particular file contains forecasted [solar wind](https://www.swpc.noaa.gov/phenomena/solar-wind) parameters from NOAA's [Space Weather Prediction Center](https://www.swpc.noaa.gov).\n", "\n", "1. Read in a JSON file\n", "1. Reformat the `Dataframe`\n", "1. Visualize the dataset" ] }, { "cell_type": "markdown", "id": "daeabc7d-d4f6-4f9f-aad7-679123b9d2fb", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "| Concepts | Importance | Notes |\n", "| --- | --- | --- |\n", "| [Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) | Necessary | |\n", "\n", "- **Time to learn**: 10 minutes\n" ] }, { "cell_type": "markdown", "id": "7b875611-44ef-4b1f-9453-f2fa84bb4d82", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "c732c9d1-0e00-4d9b-8e29-73ce71c99499", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "4d48491f-4332-4eff-af72-0382c6c5794a", "metadata": { "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "690e69eb-5ec0-458d-962f-a25cdf8983dc", "metadata": {}, "source": [ "## Read in a JSON file" ] }, { "cell_type": "markdown", "id": "630ace3c-7e81-45a8-a884-a045d7afbae6", "metadata": {}, "source": [ "### NOAA's SWPC has a variety of forecast output in JSON format. Here, we create a `Dataframe` Pandas' `read_json` method from the current 1-day plasma forecast." ] }, { "cell_type": "code", "execution_count": null, "id": "3be45f7e-8f1b-4645-8eff-6d7d0c6976a4", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = pd.read_json(\"https://services.swpc.noaa.gov/products/solar-wind/plasma-1-day.json\")" ] }, { "cell_type": "markdown", "id": "2128b21f-7d34-45f6-b08f-23287a7761ea", "metadata": {}, "source": [ "Examine the `Dataframe`" ] }, { "cell_type": "code", "execution_count": null, "id": "deea1ad0-78ef-4b47-8ed4-951b2f8e2a5b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "1b34e2ce-19a9-46b6-a346-da9cc814a8a6", "metadata": {}, "source": [ "## Reformat the `Dataframe`" ] }, { "cell_type": "markdown", "id": "61ff9856-2a64-4606-8c13-4cc1bbe6f384", "metadata": {}, "source": [ "Notice that the column headers look to be in the `Dataframe`'s first row. Let's modify it." ] }, { "cell_type": "code", "execution_count": null, "id": "5d198bce-c48e-4c77-9785-ab26bfaac669", "metadata": { "tags": [] }, "outputs": [], "source": [ "# Set the columns to be the values of the first row. Then drop that first row.\n", "df = df.rename(columns=df.iloc[0]).drop(df.index[0])" ] }, { "cell_type": "markdown", "id": "e3c8f341-a5f6-46f0-ac1d-2d1feeb4631d", "metadata": {}, "source": [ "Examine the reformatted `Dataframe`" ] }, { "cell_type": "code", "execution_count": null, "id": "7a457cb3-6203-45d5-9467-c8c5ffc07e52", "metadata": { "tags": [] }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "43bfed63-0fa2-4d3a-8a0b-162386dace6f", "metadata": {}, "source": [ "### Set the `Dataframe`'s index to the timestamped column." ] }, { "cell_type": "code", "execution_count": null, "id": "5e1f2933-6193-4f2d-b012-7c5f18a3999b", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "markdown", "id": "4b3ade9e-eebd-49d0-a321-624939d42d7d", "metadata": {}, "source": [ "Currently, the `Dataframe` has a *default index* (i.e., a range of integers). For time series data (i.e., time is the independent variable), it is [good practice](https://pandas.pydata.org/docs/user_guide/timeseries.html) to use a time-based column as the index." ] }, { "cell_type": "code", "execution_count": null, "id": "83cdc00a-2a38-4ef7-ac33-c2f6875c2dc3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df = df.set_index('time_tag')" ] }, { "cell_type": "code", "execution_count": null, "id": "9abc6605-9676-48ef-a9c5-57ef13b21eed", "metadata": { "tags": [] }, "outputs": [], "source": [ "df" ] }, { "cell_type": "markdown", "id": "7d4e6843-638b-4801-af35-9ddbc191378a", "metadata": {}, "source": [ "### Check and edit the `dtypes` of the independent and dependent variables" ] }, { "cell_type": "markdown", "id": "7fc15d05-b3cb-4575-a4c5-c25123613bf7", "metadata": {}, "source": [ "In this case, the `Dataframe`'s index corresponds to the independent variable, and the columns correspond to the dependent variables." ] }, { "cell_type": "code", "execution_count": null, "id": "b8341c2d-7d4b-4960-93e6-b90670aab9a3", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.index" ] }, { "cell_type": "code", "execution_count": null, "id": "0e165dc3-1a51-406d-9b31-b92a16721a1d", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "id": "90c69195-8111-4c0f-bb6f-17aa036b7147", "metadata": { "tags": [] }, "source": [ "They are all `object`s ... and as a result won't be amenable to typical time-series visualization methods. Change them to more appropriate `dtype`s ... `float32` for the dependent variables, and `datetime64` for the time-based one." ] }, { "cell_type": "code", "execution_count": null, "id": "32beea01-3390-4eb6-9d62-090ad70ba2a9", "metadata": { "tags": [] }, "outputs": [], "source": [ "for col in df.columns:\n", " df[col] = df[col].astype(\"float32\")\n", "df.index = pd.to_datetime(df.index)" ] }, { "cell_type": "code", "execution_count": null, "id": "eba917fd-6167-4726-ae0f-79c064d4ef72", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.index = pd.to_datetime(df.index)" ] }, { "cell_type": "markdown", "id": "38611e12-3f66-4e39-ac83-2021df2bf63e", "metadata": {}, "source": [ "## Visualize the dataset" ] }, { "cell_type": "code", "execution_count": null, "id": "634749f1-f613-4e90-aff8-6efdd038d251", "metadata": { "tags": [] }, "outputs": [], "source": [ "df.temperature.plot(figsize=(10, 8));" ] }, { "cell_type": "markdown", "id": "5d085493-0e98-4e6f-a190-a948bd44f53a", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "id": "5a43f367-1bdf-46b6-b253-f40ff965ff02", "metadata": {}, "source": [ "## Summary\n", "Pandas has several reader functions for differently-formatted tabular datasets. In this notebook, we created a `Dataframe` via Pandas `read_json` function, and then manipulated the `Dataframe` to allow for a useful time-series visualization." ] }, { "cell_type": "markdown", "id": "419af5fe-4cfc-466a-a4b0-1df66ddae8f8", "metadata": {}, "source": [ "
Note:
\n", " There is no strict format specification for JSON files. The strategy we followed to create and reformat theDataframe
in this notebook will likely need to change for other JSON-formatted datasets you may encounter!\n",
"