Manipulating and Explaining Weather Data
We will be working with data from a multitude of different weather stations from around the world that have kept track of the average temperature per month for many years. We will be using this data to answer a few questions.
Importing Data
To begin, we will first import all the neccessary packages as well as the needed files. Note that “temps.csv” is a downloaded file and is not coming from a link.
import sqlite3 # to work with databases
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from plotly import express as px # to make nice plots
import calendar # easy date manipulation
from sklearn.linear_model import LinearRegression # for doing linear regression
First, we will create a connection with an SQL database called “temps.db”.
conn = sqlite3.connect("temps.db") # create or open a database named "temps.db"
Let’s now import our data on countries and their FIPS 10-4 codes; we will do it on all the weather stations as well.
countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries = countries.rename(columns={"Name":"Country"})
# Import the data into our database:
countries.to_sql("countries", conn, if_exists="replace", index=False)
#R Repeat for weather stations:
stations_url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv"
stations = pd.read_csv(stations_url)
stations.to_sql("stations", conn, if_exists="replace", index=False)
Before inputting the temperature data into a database table, we will manipulate it so that it is easier to view and work with.
def manip_df(df):
"""
Takes a pandas dataframe and alters columns
Returns pandas dataframe
Adds column for FIPS 10-4 code, changes months to columns
"""
df = df.set_index(keys=["ID", "Year"])
df = df.stack()
df = df.reset_index()
df = df.rename(columns={"level_2":"Month", 0:"Temp"})
df["Temp"] = df["Temp"]/100 # Gives temp in Celsius
df["Month"] = df["Month"].str[5:].astype(int) # replaces VALUE_ with _, where _ is an int
df["FIPS 10-4"] = df["ID"].str[0:2]
return df
Now, we put this data into a table called “temperatures” in bite-sized chunks of 100,000 rows.
df_iter = pd.read_csv("temps.csv", chunksize=100000)
for df_chunk in df_iter:
df_chunk = manip_df(df_chunk)
df_chunk.to_sql("temperatures", conn, if_exists="append", index=False)
And just like that, all of our data is stored nice and neatly in our database!
Analyzing Change In Temperature
For our first problem, we will analyze how the temperature changes on average within a given country. We will use the following function to aid in retrieving the relevant data from our database.
def query_climate_database(country, year_begin, year_end, month):
"""
Input: string, int, int, int
Retrieves the name, lat, lon, country, year, month, temp
from the database with the provided country name, between the two years inclusive,
and for that month only. returns dataframe with that data
"""
conn = sqlite3.connect("temps.db")
cmd = \
f"""
SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Country, T.Year, T.Month, T.Temp
FROM temperatures T
LEFT JOIN stations S ON T.ID=S.ID
LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
WHERE C.Country = '{country}' AND T.Year>={year_begin} AND T.Year<={year_end} AND T.month={month}
"""
df = pd.read_sql_query(cmd,conn)
conn.close()
return df
Next, we will use a model to estimate the change in temperature per year using linear regression.
def coef(data_group):
"""
Applies a linear regression model to a df
returns the slope of the line
"""
x = data_group[["Year"]] # 2 brackets because X should be a df
y = data_group["Temp"] # 1 bracket because y should be a series
LR = LinearRegression()
LR.fit(x, y)
return LR.coef_[0]
Lastly, we define a user-friendly way of creating a geographical plot.
def temperature_coefficient_plot(country, year_begin, year_end,
month, min_obs, **kwargs):
"""
Input: string, int, int, int, int, **kwargs
min_obs is the min number of observations for a specific station
all kwargs are used for the px.scatter_mapbox() method
returns px.scatter_mapbox()
"""
df = query_climate_database(country, year_begin, year_end, month)
df["Count"] = df.groupby("NAME")["Year"].transform(len)
df = df[ df["Count"] >= min_obs]
df = df.drop(columns = ["Count"])
coefs = df.groupby(["NAME", "Month"]).apply(coef)
coefs = coefs.reset_index()
coefs = coefs.round(4)
df = pd.merge(df, coefs, on = ["NAME"]).dropna()
#print(coefs)
df = df.rename(columns={0:"Estimated Yearly Change (°C)"})
#print(df)
fig = px.scatter_mapbox(df, lat = "LATITUDE", lon = "LONGITUDE",
color="Estimated Yearly Change (°C)",
hover_name = "NAME", color_continuous_midpoint=0,
title= f"Estimated Yearly Change in Temperature (°C) "
f"in {calendar.month_name[month]} <br>from Stations in {country} "
f"from {year_begin}-{year_end}", **kwargs)
return fig
Now, lets test it out! We know get an interactive map of the weather stations and the estimated annual change in temperature.
color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
from plotly.io import write_html
Highest and Lowest Temperatures
In any given country, what is the highest and lowest temperature during a given time period? The following functions will help answer that question, giving the user the ability to select whichever country and time period they please and analyze the highest and lowest temperatures in some country.
The following two functions aid in retrieving data from our database and then manipulating it so that it is in a form we can easily use when plotting our data.
The third and final function returns a figure that displays the visualization. The function also allows the user to input any keywords they like that will be passed to the px.line()
plot.
def query_climate_database2(country, year_begin, year_end):
"""
Input: string, int, int
Retrieves all relevent data with the min and max temperature
of the country in a particular year
returns a dataframe with that data
"""
conn = sqlite3.connect("temps.db")
cmd = \
f"""
SELECT C.Country, T.Year, MIN(T.Temp) AS `Min. Temp`, MAX(T.Temp) as `Max. Temp`
FROM temperatures T
LEFT JOIN stations S ON T.ID=S.ID
LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
WHERE T.Year>={year_begin} AND T.Year<={year_end}
GROUP BY C.Country, T.Year
"""
df = pd.read_sql_query(cmd,conn)
conn.close()
df = df[df["Country"]==country]
return df
def manip_df2(df):
"""
Input: dataframe
returns: dataframe
Adds the min and max temp to the df
so that it can easily be used with a facet
"""
df = df.reset_index()
df = df.iloc[np.arange(len(df)).repeat(2)]
df = df.reset_index()
df["Heat"] = np.where(df.index % 2, 'Hottest', 'Coldest')
df["Min. Temp"] = np.where(df.index % 2, df["Max. Temp"], df["Min. Temp"])
df = df.drop(columns = ["Max. Temp", "level_0","index"])
df = df.rename(columns={"Min. Temp":"Temp"})
return df
def extreme_temp_line(country, year_begin, year_end, **kwargs):
"""
Input: string, int, int, **kwargs
returns: px.line() object
Takes the inputs and makes a line plot
all kwargs are used in px.line()
"""
df = query_climate_database2(country, year_begin, year_end)
df = manip_df2(df)
fig = px.line(df, x="Year", y="Temp", facet_col="Heat", height=400, width=800,
title=f"Hottest and Coldest Temperature (°C) in {country} from {year_begin} to {year_end}",
color="Heat", **kwargs)
return fig
Let’s see it in action! We can see the two temperatures side by side and change the scales/axes however we like.
fig = extreme_temp_line("Yemen", 1900, 2020)
fig.show()
Testing Temperature From the Equator
Does the temperature of a location depend on its distance from the equator? We know that locations on the equator are usually hot, and those at the poles are quite cold. So, we will be testing if there is any correlation between latitude and average temperature. The following function filters out our temperature data to include everything in the range of years and only that month.
def query_climate_database3(year_begin, year_end, month):
"""
input: int, int, int
return: dataframe
Retrieves the releveant info based only on year period and month
"""
conn = sqlite3.connect("temps.db")
cmd = \
f"""
SELECT S.NAME, S.LATITUDE, S.LONGITUDE, C.Country, T.Year, T.Month, T.Temp
FROM temperatures T
LEFT JOIN stations S ON T.ID=S.ID
LEFT JOIN countries C ON T.`FIPS 10-4` = C.`FIPS 10-4`
WHERE T.Year>={year_begin} AND T.Year<={year_end} AND T.month={month}
"""
df = pd.read_sql_query(cmd,conn)
conn.close()
return df
Next, we must manipulate our data so that it is ready to be plotted.
def manip_df3(df):
"""
Input: dataframe
returns: dataframe
Takes a df and adds a column with the average temp at that latitude
"""
df_lat = df.groupby("LATITUDE", as_index=False)["Temp"].mean()
df = pd.merge(df, df_lat, on = ["LATITUDE"]).dropna()
df = df.rename(columns = {"Temp_x": "Station Temp", "Temp_y":"Temp. at Lat."})
return df
Lastly, we make a nice user interface for the user to interact with.
def worldwide_scatter(year_begin, year_end, month, **kwargs):
"""
Input: int, int, int, **kwargs
returns: px.scatter() object
Takes inputs, gets the dataframe from above and makes the plot
"""
df = query_climate_database3(year_begin, year_end, month)
df = manip_df3(df)
fig = px.scatter(data_frame = df, x="LATITUDE", y="Temp. at Lat.", color = "Temp. at Lat.",
color_continuous_midpoint=0,
title=f"Worldwide Temperatures (°C) in {calendar.month_name[month]} from {year_begin}-{year_end}",
**kwargs)
fig.update_layout(xaxis_title="Latitude", yaxis_title="Temperature (°C)")
return fig
Let’s try it now! We’ll test from February 2010-2015.
fig = worldwide_scatter(2010,2015,2)
fig.show()