Manipulating and Explaining Weather Data

We will be working with data from a multitude of different weather stations from around the world that have kept track of the average temperature per month for many years. We will be using this data to answer a few questions.

Importing Data

To begin, we will first import all the neccessary packages as well as the needed files. Note that “temps.csv” is a downloaded file and is not coming from a link.

import sqlite3 # to work with databases
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from plotly import express as px # to make nice plots
import calendar # easy date manipulation
from sklearn.linear_model import LinearRegression # for doing linear regression

First, we will create a connection with an SQL database called “temps.db”.

conn = sqlite3.connect("temps.db") # create or open a database named "temps.db"

Let’s now import our data on countries and their FIPS 10-4 codes; we will do it on all the weather stations as well.

countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries = countries.rename(columns={"Name":"Country"})
# Import the data into our database:
countries.to_sql("countries", conn, if_exists="replace", index=False) 

#R Repeat for weather stations:
stations_url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(stations_url)
stations.to_sql("stations", conn, if_exists="replace", index=False)

Before inputting the temperature data into a database table, we will manipulate it so that it is easier to view and work with.

def manip_df(df):
    """
    Takes a pandas dataframe and alters columns
    Returns pandas dataframe
    Adds column for FIPS 10-4 code, changes months to columns
    """
    df = df.set_index(keys=["ID", "Year"])
    df = df.stack()
    df = df.reset_index()
    df = df.rename(columns={"level_2":"Month", 0:"Temp"})
    df["Temp"] = df["Temp"]/100 # Gives temp in Celsius
    df["Month"] = df["Month"].str[5:].astype(int) # replaces VALUE_ with _, where _ is an int
    df["FIPS 10-4"] = df["ID"].str[0:2]
    return df

Now, we put this data into a table called “temperatures” in bite-sized chunks of 100,000 rows.

df_iter = pd.read_csv("temps.csv", chunksize=100000)
for df_chunk in df_iter:
    df_chunk = manip_df(df_chunk)
    df_chunk.to_sql("temperatures", conn, if_exists="append", index=False)

And just like that, all of our data is stored nice and neatly in our database!

Analyzing Change In Temperature

For our first problem, we will analyze how the temperature changes on average within a given country. We will use the following function to aid in retrieving the relevant data from our database.

def query_climate_database(country, year_begin, year_end, month):
    """
    Input: string, int, int, int
    Retrieves the name, lat, lon, country, year, month, temp
    from the database with the provided country name, between the two years inclusive,
    and for that month only. returns dataframe with that data
    """
    
    conn = sqlite3.connect("temps.db")
    cmd = \
    f"""
    SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Country, T.Year, T.Month, T.Temp
    FROM temperatures T
    LEFT JOIN stations S ON T.ID=S.ID
    LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
    WHERE C.Country = '{country}' AND T.Year>={year_begin} AND T.Year<={year_end} AND T.month={month}
    """
    df = pd.read_sql_query(cmd,conn)
    conn.close()
    return df

Next, we will use a model to estimate the change in temperature per year using linear regression.

def coef(data_group):
    """
    Applies a linear regression model to a df
    returns the slope of the line
    """
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

Lastly, we define a user-friendly way of creating a geographical plot.

def temperature_coefficient_plot(country, year_begin, year_end, 
                                 month, min_obs, **kwargs):
    """
    Input: string, int, int, int, int, **kwargs
    min_obs is the min number of observations for a specific station
    all kwargs are used for the px.scatter_mapbox() method
    returns px.scatter_mapbox()
    """
    
    df = query_climate_database(country, year_begin, year_end, month)
    df["Count"] = df.groupby("NAME")["Year"].transform(len)
    df = df[ df["Count"] >= min_obs]
    df = df.drop(columns = ["Count"])
    
    coefs = df.groupby(["NAME", "Month"]).apply(coef)
    coefs = coefs.reset_index()
    coefs = coefs.round(4)
    
    df = pd.merge(df, coefs, on = ["NAME"]).dropna()
    #print(coefs)
    df = df.rename(columns={0:"Estimated Yearly Change (°C)"})
    #print(df)
    fig = px.scatter_mapbox(df, lat = "LATITUDE", lon = "LONGITUDE",
                            color="Estimated Yearly Change (°C)",
                            hover_name = "NAME", color_continuous_midpoint=0, 
                            title= f"Estimated Yearly Change in Temperature (°C) " 
                            f"in {calendar.month_name[month]} <br>from Stations in {country} "
                            f"from {year_begin}-{year_end}", **kwargs)
    return fig

Now, lets test it out! We know get an interactive map of the weather stations and the estimated annual change in temperature.

color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()
from plotly.io import write_html

Highest and Lowest Temperatures

In any given country, what is the highest and lowest temperature during a given time period? The following functions will help answer that question, giving the user the ability to select whichever country and time period they please and analyze the highest and lowest temperatures in some country.

The following two functions aid in retrieving data from our database and then manipulating it so that it is in a form we can easily use when plotting our data.

The third and final function returns a figure that displays the visualization. The function also allows the user to input any keywords they like that will be passed to the px.line() plot.

def query_climate_database2(country, year_begin, year_end):
    """
    Input: string, int, int
    Retrieves all relevent data with the min and max temperature
    of the country in a particular year
    returns a dataframe with that data
    """
    conn = sqlite3.connect("temps.db")
    cmd = \
    f"""
    SELECT C.Country, T.Year, MIN(T.Temp) AS `Min. Temp`, MAX(T.Temp) as `Max. Temp`
    FROM temperatures T
    LEFT JOIN stations S ON T.ID=S.ID
    LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
    WHERE T.Year>={year_begin} AND T.Year<={year_end}
    GROUP BY C.Country, T.Year
    """
    df = pd.read_sql_query(cmd,conn)
    conn.close()
    df = df[df["Country"]==country]
    return df

def manip_df2(df):
    """
    Input: dataframe
    returns: dataframe
    Adds the min and max temp to the df 
    so that it can easily be used with a facet
    """
    df = df.reset_index()
    df = df.iloc[np.arange(len(df)).repeat(2)]
    df = df.reset_index()
    df["Heat"] = np.where(df.index % 2, 'Hottest', 'Coldest')
    df["Min. Temp"] = np.where(df.index % 2, df["Max. Temp"], df["Min. Temp"])
    df = df.drop(columns = ["Max. Temp", "level_0","index"])
    df = df.rename(columns={"Min. Temp":"Temp"})
    return df

def extreme_temp_line(country, year_begin, year_end, **kwargs):
    """
    Input: string, int, int, **kwargs
    returns: px.line() object
    Takes the inputs and makes a line plot
    all kwargs are used in px.line()
    """
    df = query_climate_database2(country, year_begin, year_end)
    df = manip_df2(df)
    fig = px.line(df, x="Year", y="Temp", facet_col="Heat", height=400, width=800, 
                  title=f"Hottest and Coldest Temperature (°C) in {country} from {year_begin} to {year_end}", 
                  color="Heat", **kwargs)
    return fig

Let’s see it in action! We can see the two temperatures side by side and change the scales/axes however we like.

fig = extreme_temp_line("Yemen", 1900, 2020)
fig.show()

Testing Temperature From the Equator

Does the temperature of a location depend on its distance from the equator? We know that locations on the equator are usually hot, and those at the poles are quite cold. So, we will be testing if there is any correlation between latitude and average temperature. The following function filters out our temperature data to include everything in the range of years and only that month.

def query_climate_database3(year_begin, year_end, month):
    """
    input: int, int, int
    return: dataframe
    Retrieves the releveant info based only on year period and month
    """
    conn = sqlite3.connect("temps.db")
    cmd = \
    f"""
    SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Country, T.Year, T.Month, T.Temp
    FROM temperatures T
    LEFT JOIN stations S ON T.ID=S.ID
    LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
    WHERE T.Year>={year_begin} AND T.Year<={year_end} AND T.month={month}
    """
    df = pd.read_sql_query(cmd,conn)
    conn.close()
    return df

Next, we must manipulate our data so that it is ready to be plotted.

def manip_df3(df):
    """
    Input: dataframe
    returns: dataframe
    Takes a df and adds a column with the average temp at that latitude
    """
    df_lat = df.groupby("LATITUDE", as_index=False)["Temp"].mean()
    df = pd.merge(df, df_lat, on = ["LATITUDE"]).dropna()
    df = df.rename(columns = {"Temp_x": "Station Temp", "Temp_y":"Temp. at Lat."})
    return df

Lastly, we make a nice user interface for the user to interact with.

def worldwide_scatter(year_begin, year_end, month, **kwargs):
    """
    Input: int, int, int, **kwargs
    returns: px.scatter() object
    Takes inputs, gets the dataframe from above and makes the plot
    """
    df = query_climate_database3(year_begin, year_end, month)
    df = manip_df3(df)
    fig = px.scatter(data_frame = df, x="LATITUDE", y="Temp. at Lat.", color = "Temp. at Lat.",
                color_continuous_midpoint=0,
                title=f"Worldwide Temperatures (°C) in {calendar.month_name[month]} from {year_begin}-{year_end}",
                **kwargs)
    fig.update_layout(xaxis_title="Latitude", yaxis_title="Temperature (°C)")
    return fig

Let’s try it now! We’ll test from February 2010-2015.

fig = worldwide_scatter(2010,2015,2)
fig.show()

Clearly, we see that the locations near the equator are far hotter than those near the poles. Our suspicions seem to be correct!

Written on October 15, 2021