Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set
In this post I will outline how to create several interesting, interactive data graphics using the NOAA climate data set.
Downloading the Data
As always, the first thing we have to do before constructing any sort of data visualization is to download and read in the data. To read in the temperature, station-metadata, and countries data, I will use the pandas package and run the following lines of code (here, data_path is the path to the folder containing the downloaded CSV files):
import pandas as pd
temperatures = pd.read_csv(data_path+"temps.csv")
stations = pd.read_csv(data_path+"station-metadata.csv")
countries = pd.read_csv(data_path+"countries.csv")
After reading the data in, it is always good practice to make sure that the data has been read correctly by using the .head()
method to check the first few rows of each data frame and the .shape
attribute to check its dimensions.
# Checking the temperatures data
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
# Checking the shape of the temperatures data frame
temperatures.shape
(1359937, 14)
# Checking the stations data
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
# Checking the shape of the stations data frame
stations.shape
(27585, 5)
# Checking the countries data
countries.head()
FIPS 10-4 | ISO 3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
# Checking the shape of the countries data frame
countries.shape
(279, 3)
Creating a Database to Store the Data
Notice that the temperatures
and stations
data frames are rather large (the temperatures table alone has over 1.3 million rows). Generally speaking, large data sets are noticeably slow to load into memory. While these particular data sets are not too large to fit in memory, they are large enough that loading them into pandas takes a while.
And yet, even though these data sets are large, it is rare that we would have to operate on an entire data set at once. Instead, we usually work with only part of the data at any given time. Hence, it is convenient for us to create a database using Python's sqlite3
module, which provides a structured way to move subsets of the data from storage into memory.
import sqlite3
# create a database in current directory called data.db
# to store temperatures, stations, and countries tables
conn = sqlite3.connect("data.db")
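As an aside: if a CSV file were too large to read into memory all at once, pandas can stream it into the database in chunks via the chunksize argument of pd.read_csv. Here is a minimal sketch of that pattern, using an in-memory database and a toy CSV string as stand-ins for data.db and temps.csv:

```python
import sqlite3
import pandas as pd
from io import StringIO

# Toy stand-in for a large CSV file; in practice this would be
# pd.read_csv(data_path + "temps.csv", chunksize=100000)
csv_data = StringIO("ID,Year,VALUE1\nACW00011604,1961,-89.0\nACW00011604,1962,113.0\n")

conn = sqlite3.connect(":memory:")  # stand-in for "data.db"

# Read the CSV one chunk at a time and append each chunk to the table,
# so only one chunk ever needs to fit in memory
for chunk in pd.read_csv(csv_data, chunksize=1):
    chunk.to_sql("temperatures", conn, if_exists="append", index=False)

n_rows = pd.read_sql_query("SELECT COUNT(*) AS n FROM temperatures", conn)["n"][0]
print(n_rows)  # 2
```

For the data sets in this post, reading each CSV whole (as above) is perfectly fine; chunking only becomes necessary when the raw file exceeds available memory.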
Now, before appending our data to this database, let us first clean and manipulate it a little bit to make it more readable! First consider the following temperatures data frame:
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
The first thing we want to do is explicitly extract the country code, i.e., the first two letters of each ID
. We can do so by executing the following chunk of code, which slices the ID
string:
temperatures['ID'] = temperatures["ID"].str[0:2]
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AC | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | AC | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | AC | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | AC | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | AC | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
Now, notice that each of the twelve VALUE
columns in the above table corresponds to a month of the year. To make this data set a little more readable, we might want to create a separate month column and stack the temperatures accordingly.
temperatures = temperatures.set_index(keys=["ID", "Year"])
temperatures = temperatures.stack() # stacking the months and temperatures
temperatures = temperatures.reset_index()
temperatures.head()
ID | Year | level_2 | 0 | |
---|---|---|---|---|
0 | AC | 1961 | VALUE1 | -89.0 |
1 | AC | 1961 | VALUE2 | 236.0 |
2 | AC | 1961 | VALUE3 | 472.0 |
3 | AC | 1961 | VALUE4 | 773.0 |
4 | AC | 1961 | VALUE5 | 1128.0 |
While stacking the temperatures, pandas automatically renamed the columns of the data frame. For the sake of readability, let us rename the level_2
and 0
columns.
temperatures = temperatures.rename(columns = {"level_2" : "Month", 0 : "Temp"})
temperatures.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | AC | 1961 | VALUE1 | -89.0 |
1 | AC | 1961 | VALUE2 | 236.0 |
2 | AC | 1961 | VALUE3 | 472.0 |
3 | AC | 1961 | VALUE4 | 773.0 |
4 | AC | 1961 | VALUE5 | 1128.0 |
Notice that our Month column comprises the values: VALUE1, ..., VALUE12
. While it is possible to work out that “VALUE1” corresponds to “January”, “VALUE2” to “February”, and so on, the data frame becomes more interpretable if we replace each of those strings with the integer corresponding to each month (i.e., 1 corresponds to “January”, 2 to “February”, and so on).
temperatures["Month"] = temperatures["Month"].str[5:].astype(int)
# The raw temperatures are recorded in hundredths of a degree Celsius,
# so we divide by 100 to convert them to °C
temperatures["Temp"] = temperatures["Temp"] / 100
temperatures.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | AC | 1961 | 1 | -0.89 |
1 | AC | 1961 | 2 | 2.36 |
2 | AC | 1961 | 3 | 4.72 |
3 | AC | 1961 | 4 | 7.73 |
4 | AC | 1961 | 5 | 11.28 |
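As a side note, the set_index / stack / rename pipeline above can also be expressed in a single call to pd.melt; here is a sketch on a toy two-column slice of the raw data (the toy frame is my own stand-in, not the real table):

```python
import pandas as pd

# Toy stand-in for the raw temperatures table (only two VALUE columns)
raw = pd.DataFrame({"ID": ["AC"], "Year": [1961],
                    "VALUE1": [-89.0], "VALUE2": [236.0]})

# melt gathers the VALUE columns into (Month, Temp) pairs in one step
tidy = raw.melt(id_vars=["ID", "Year"], var_name="Month", value_name="Temp")
tidy["Month"] = tidy["Month"].str[5:].astype(int)  # "VALUE3" -> 3
tidy["Temp"] = tidy["Temp"] / 100                  # hundredths of °C -> °C
print(tidy)
```

Both approaches produce the same tidy table; stack is convenient when the identifier columns are already an index, while melt works directly on the flat data frame.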
Now that we have successfully cleaned and manipulated the temperatures
data set, we can append it to the database as follows:
temperatures.to_sql("temperatures", conn, if_exists = "append", index=False)
It is good practice to check whether the data frame has been added to the database! So, let’s do that:
# Checking that we have a "temperatures" table within the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',)]
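Beyond listing table names, we can also pull just a few rows with pd.read_sql_query to eyeball a table's contents without loading the whole table into memory. A sketch, using a toy in-memory database as a stand-in for data.db:

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for data.db
conn = sqlite3.connect(":memory:")
pd.DataFrame({"ID": ["AC"], "Year": [1961], "Month": [1], "Temp": [-0.89]}) \
  .to_sql("temperatures", conn, if_exists="append", index=False)

# Peek at the first few rows only, much like .head() on a data frame
preview = pd.read_sql_query("SELECT * FROM temperatures LIMIT 5", conn)
print(preview)
```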
Now, consider the stations-metadata data frame.
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
As we did with the temperatures data set, the first thing we want to do is explicitly extract the country code, i.e., the first two letters of each ID
number, which will make it easier for us to compare table values later on.
stations['ID'] = stations["ID"].str[0:2]
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | AC | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AE | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AE | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AE | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
Now, let’s go ahead and add this data set to the database as well!
stations.to_sql("stations", conn, if_exists = "replace", index=False)
Lastly, we will consider the countries data set.
countries.head()
FIPS 10-4 | ISO 3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
The first thing we want to do is remove the ISO 3166
column, which holds an alternative country code. We are only interested in the FIPS 10-4
column, which contains the country code matching the two-letter prefix of each station ID, and the Name
of the country.
# dropping the ISO 3166 column
countries = countries.drop('ISO 3166', axis=1)
countries.head()
FIPS 10-4 | Name | |
---|---|---|
0 | AF | Afghanistan |
1 | AX | Akrotiri |
2 | AL | Albania |
3 | AG | Algeria |
4 | AQ | American Samoa |
Let us rename the FIPS 10-4
column to ID, to improve the readability of the data set and to match the other tables.
countries = countries.rename(columns = {"FIPS 10-4" : "ID"})
countries.head()
ID | Name | |
---|---|---|
0 | AF | Afghanistan |
1 | AX | Akrotiri |
2 | AL | Albania |
3 | AG | Algeria |
4 | AQ | American Samoa |
Finally, we are ready to add our last data set to the database!
countries.to_sql("countries", conn, if_exists="replace", index=False)
When we check our database, we should be able to see three separate data tables: temperatures, stations, and countries.
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]
To be safe, let us also look at the columns within each of these tables:
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
"Year" INTEGER,
"Month" INTEGER,
"Temp" REAL
)
CREATE TABLE "stations" (
"ID" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL,
"STNELEV" REAL,
"NAME" TEXT
)
CREATE TABLE "countries" (
"ID" TEXT,
"Name" TEXT
)
Now that we have finished constructing and checking our database, it is good practice to close the database connection using the .close()
method.
conn.close()
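If we want the connection to be closed even when an error interrupts our work partway through, we can wrap it in contextlib.closing (note that using a sqlite3 connection directly as a context manager only manages transactions; it does not close the connection). A sketch, with ":memory:" standing in for "data.db":

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs, even if a query raises an error
with closing(sqlite3.connect(":memory:")) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = cursor.fetchall()

print(tables)  # the toy in-memory database has no tables yet
```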
Writing a Query Function
The prior section of this blog post outlined how to read in the appropriate data and create a database, "data.db"
. Using this database, we now want to write a function that will query the appropriate temperature data for us. In other words, we will write a function that returns a pandas DataFrame of temperature readings for a specified country, within a specified date range, in a specified month of those years.
As a general note, always remember to document your function appropriately so that other users find it easy to understand and implement the function you have written!
def query_climate_database(country, year_begin, year_end, month):
"""Function returns a dataframe of temperature readings for the
specified country, in the specific date range, in the specified month
of the year
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the latest year for which data should be returned
month : integer
month of the year for which data should be returned
Returns
----------
DataFrame
Pandas dataframe of temperature readings for the specified country,
in the specified date range, in the specified month of the year
"""
# opening the database
conn = sqlite3.connect("data.db")
cmd = \
f"""
SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country, t.Year, t.Month, t.Temp
FROM temperatures t
LEFT JOIN countries c ON t.ID = c.ID
LEFT JOIN stations s ON t.ID = s.ID
WHERE t.Year>={str(year_begin)} AND t.Year<={str(year_end)} AND t.Month={str(month)} AND c.Name='{country}'
"""
df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
conn.close() # closing the database
return df
Whenever we write a function, we always need to make sure it works properly! Let’s go ahead and test the above function!
# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1 # The month of January
query_climate_database(country, year_begin, year_end, month)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | MINICOYOBSY | 8.3000 | 73.0000 | India | 1980 | 1 | 2.08 |
1 | MINICOY | 8.3000 | 73.1500 | India | 1980 | 1 | 2.08 |
2 | THIRUVANANTHAPURAM | 8.4670 | 76.9500 | India | 1980 | 1 | 2.08 |
3 | THIRUVANANTHAPURAM | 8.4830 | 76.9500 | India | 1980 | 1 | 2.08 |
4 | TRIVANDRUM | 8.5000 | 77.0000 | India | 1980 | 1 | 2.08 |
... | ... | ... | ... | ... | ... | ... | ... |
371931 | LUDHIANA | 30.9333 | 75.8667 | India | 2020 | 1 | 29.35 |
371932 | SHIMLA | 31.1000 | 77.1670 | India | 2020 | 1 | 29.35 |
371933 | AMRITSAR | 31.7100 | 74.7970 | India | 2020 | 1 | 29.35 |
371934 | SRINAGAR | 34.0830 | 74.8330 | India | 2020 | 1 | 29.35 |
371935 | LEH_KASHMIR | 34.2000 | 77.7000 | India | 2020 | 1 | 29.35 |
371936 rows × 7 columns
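One caveat about the query above: interpolating country into the SQL string with an f-string breaks if the name contains a quote, and is unsafe for untrusted input. sqlite3 supports ?-placeholders instead, which pd.read_sql_query exposes through its params argument. Here is a sketch of the same query in that style (the function name and the db_path parameter are my own additions; the original hardcodes "data.db"):

```python
import sqlite3
import pandas as pd

def query_climate_database_params(db_path, country, year_begin, year_end, month):
    """Same query as query_climate_database, but with ?-placeholders
    instead of f-string interpolation."""
    conn = sqlite3.connect(db_path)
    cmd = """
    SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country,
           t.Year, t.Month, t.Temp
    FROM temperatures t
    LEFT JOIN countries c ON t.ID = c.ID
    LEFT JOIN stations s ON t.ID = s.ID
    WHERE t.Year >= ? AND t.Year <= ? AND t.Month = ? AND c.Name = ?
    """
    # sqlite3 substitutes each ? with the matching value, handling quoting
    df = pd.read_sql_query(cmd, conn,
                           params=(year_begin, year_end, month, country))
    conn.close()
    return df
```

Called as query_climate_database_params("data.db", "India", 1980, 2020, 1), this returns the same data frame as the test case above.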
Writing a Geographic Scatter Function for Yearly Temperature Increases
Now that we have written a function that will help us query the appropriate data, we want to write a function that will help us address the following question: “How does the average yearly change in temperature vary within a given country?”
In other words, we would like to write a function that returns an interactive geographic scatterplot in which each point is a station, colored by the estimated yearly change in temperature during the specified month and time period at that station. Luckily for us, we have a very handy library, Plotly Express, to help us with the visualization. To estimate the average yearly change in temperature (in a given month) at a given station, we can compute the slope (i.e., the first coefficient) of a linear regression of temperature against year. To construct our linear regression model we can use the scikit-learn library!
from plotly import express as px
from sklearn.linear_model import LinearRegression
# Other potentially useful libraries
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
Before jumping to our main function that will generate the geographic scatterplot, we first write a helper function coef
that estimates the average yearly temperature change in a given month, using linear regression (the slope of temperature regressed on year).
def coef(data_group):
# extracting x and y for linear regression
x = data_group[['Year']]
y = data_group['Temp']
# The linear regression model
LR = LinearRegression()
LR.fit(x,y)
# returning the coefficients
return LR.coef_[0]
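To sanity-check coef, we can feed it a tiny synthetic data frame whose slope we know in advance (the toy data below is made up purely for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def coef(data_group):  # same helper as above
    x = data_group[['Year']]
    y = data_group['Temp']
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

# Synthetic station record that warms by exactly 0.5 °C per year
toy = pd.DataFrame({"Year": [2000, 2001, 2002, 2003],
                    "Temp": [10.0, 10.5, 11.0, 11.5]})
print(coef(toy))  # slope ≈ 0.5
```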
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
"""Function generates an interactive geographic scatterplot,
with a point for each station, such that the color of the point reflects
an estimate of the yearly change in temperature during the specified
month and time period at that station.
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the latest year for which data should be returned
month : integer
month of the year for which data should be returned
min_obs : integer
minimum required number of years of data for any given station
**kwargs
additional keyword arguments passed to px.scatter_mapbox().
These can be used to control the colormap used, the mapbox style, etc.
Returns
----------
scatterplot
An interactive geographic scatterplot, constructed using Plotly Express,
with a point for each station, such that the color of the point reflects
an estimate of the yearly change in temperature during the specified
month and time period at that station
"""
# extracting the temperature reading data frame for the given
# country, month, and years
df = query_climate_database(country, year_begin, year_end, month)
# extracting a list of unique station names
station_names = df['NAME'].unique()
stations_to_keep = []
for name in station_names:
temp_df = df[df['NAME'] == name]
# checking the years of data available for
# the specified month
number_of_years = temp_df['Year'].nunique()
if number_of_years >= min_obs:
stations_to_keep.append(name)
# Now that we have a list of the stations to keep
# we create another data frame
df_new = pd.DataFrame()
for station_name in stations_to_keep:
tempdf = df[df['NAME'] == station_name]
# concatenating the temporary data frame
# with the new data frame
df_new = pd.concat([df_new, tempdf])
# Extracting the average temperatures per month, per year
df_new = df_new.groupby(["NAME", "LATITUDE", "LONGITUDE", "Country", 'Year', "Month"])["Temp"].aggregate(np.mean)
df_new = pd.DataFrame(df_new) # appending the above to a data frame
df_new = df_new.reset_index() # and resetting the index appropriately
# array of station names from the filtered data frame
new_station_names = df_new['NAME'].unique()
# Arrays to store the coefficients, latitudes, and longitudes
coefs = np.array([])
lats = np.array([])
longs = np.array([])
# loop over the filtered station names, so that every df_temp is non-empty
for sname in new_station_names:
df_temp = df_new[df_new['NAME'] == sname]
lats = np.append(lats, df_temp['LATITUDE'].unique()[0])
longs = np.append(longs, df_temp['LONGITUDE'].unique()[0])
coefs = np.append(coefs, coef(df_temp))
# Creating a data frame of coefficients representing yearly temp change
coef_df = pd.DataFrame()
coef_df['NAME'] = new_station_names
coef_df['LATITUDE'] = lats
coef_df['LONGITUDE'] = longs
coef_df['Estimated Yearly Increase (°C)'] = np.round(coefs, 5)
# list of months
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
# Generating the scatter plot
fig = px.scatter_mapbox(coef_df,
title = "Estimated Yearly Increase in Temperature in "+months[month-1],
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated Yearly Increase (°C)",
**kwargs)
fig.update(layout={"title": {"x": 0.85, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":10})
return fig
As always, it is good practice to test our functions to make sure that they are working properly!
# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1
min_obs = 10
# Implementing the function
fig = temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, zoom = 3,
mapbox_style = "carto-positron")
fig.show()
Creating Other Interesting Visualizations
Plotly Express not only helps us with interactive geographic scatterplots (depicted above); we can also use it to draw the usual graphs (histograms, line graphs, scatterplots, etc.) that we would normally make with matplotlib or seaborn. The interesting thing about Plotly is that we can take these usual visualizations to the next level by making them interactive!
To illustrate this capability, let us draw a regular scatterplot of temperature readings depicting the yearly temperature trend (note that a positive latitude means a station lies north of the equator, while a negative latitude means it lies south of the equator). We first want to rewrite our query function to extract only the data we need!
def query_climate_database_v2(country, year_begin, year_end):
"""Function returns a dataframe of temperature readings for the
specified country, in the specific date range, in the specified month
of the year
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the year up to (but not including) which data should be returned
Returns
----------
DataFrame
Pandas dataframe of temperature readings for the specified country,
in the specified date range.
"""
# opening the database
conn = sqlite3.connect("data.db")
cmd = \
f"""
SELECT s.LATITUDE, s.LONGITUDE, c.Name AS Country, s.NAME, t.Year, t.Month, t.Temp
FROM temperatures t
LEFT JOIN countries c ON t.ID = c.ID
LEFT JOIN stations s ON t.ID = s.ID
WHERE t.Year>={str(year_begin)} AND t.Year<{str(year_end)} AND c.Name='{country}'
"""
df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
conn.close() # closing the database
return df
# TEST CASE: Let us pick a country that falls on the equator, like Brazil!
df2 = query_climate_database_v2("Brazil", 1990, 2020)
df2
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | -33.517 | -53.350 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 18.78 |
1 | -32.017 | -52.083 | Brazil | RIO_GRANDE | 1990 | 1 | 18.78 |
2 | -31.780 | -52.410 | Brazil | PELOTAS | 1990 | 1 | 18.78 |
3 | -31.333 | -54.100 | Brazil | BAGE | 1990 | 1 | 18.78 |
4 | -30.830 | -55.600 | Brazil | SANTANA_DO_LIVRAMENTO | 1990 | 1 | 18.78 |
... | ... | ... | ... | ... | ... | ... | ... |
22481195 | 0.610 | -69.180 | Brazil | IAUARETE | 2019 | 12 | 31.10 |
22481196 | 1.833 | -61.133 | Brazil | CARACARAI | 2019 | 12 | 31.10 |
22481197 | 2.067 | -50.850 | Brazil | AMAPA | 2019 | 12 | 31.10 |
22481198 | 2.817 | -60.650 | Brazil | BOA_VISTA | 2019 | 12 | 31.10 |
22481199 | 2.833 | -60.700 | Brazil | BOA_VISTAAEROPORT | 2019 | 12 | 31.10 |
22481200 rows × 7 columns
# Extracting the average temperatures per month, per year
df2 = df2.groupby(["LATITUDE", "LONGITUDE", "Country", "NAME", 'Year', "Month"])["Temp"].aggregate(np.mean)
df2 = pd.DataFrame(df2) # appending the above to a data frame
df2 = df2.reset_index() # and resetting the index appropriately
df2
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 25.940840 |
1 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 2 | 25.650976 |
2 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 3 | 25.886124 |
3 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 4 | 25.287054 |
4 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 5 | 22.841311 |
... | ... | ... | ... | ... | ... | ... | ... |
125995 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 8 | 23.764634 |
125996 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 9 | 25.871429 |
125997 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 10 | 26.625412 |
125998 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 11 | 26.701905 |
125999 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 12 | 26.719643 |
126000 rows × 7 columns
# Add a column to determine whether the point is above or
# below the equator
latitudes = np.array(df2['LATITUDE'])
# above, below, or on the equator?
equator = np.select([latitudes < 0, latitudes > 0],
['below', 'above'], default='on')
df2['Position Relative to Equator (0°N)'] = equator
df2.head()
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | Position Relative to Equator (0°N) | |
---|---|---|---|---|---|---|---|---|
0 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 25.940840 | below |
1 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 2 | 25.650976 | below |
2 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 3 | 25.886124 | below |
3 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 4 | 25.287054 | below |
4 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 5 | 22.841311 | below |
Now that we have a clean data set with all the information we need, we are ready to create our visualization using Plotly!
fig = px.scatter(df2,
title = "Yearly Temperature Trends (1990 - 2020)"
x = "Month",
y = "Temp",
color = "Year",
width = 800,
height = 400,
opacity = 0.5,
facet_col = "Position Relative to Equator (0°N)")
# reduce whitespace
fig.update(layout={"title": {"x": 0.5, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
# show the plot
fig.show()
This visualization is quite useful, as it depicts several features of the temperature data. For one thing, as the year increases from 1990 to 2020, the average temperature per year also tends to increase (which is expected, given global warming). Additionally, we can clearly see the seasonal temperature cycle over the year (higher temperatures around December-January and lower temperatures around June-July, as we would expect for a country that lies mostly in the Southern Hemisphere). Interestingly, the temperatures do not differ much with latitude - in this case, this is to be expected, since Brazil straddles the equator and largely lies in an equatorial climate zone.
Another way of visualizing the temperature data in Brazil would be through a histogram, as shown below!
fig = px.histogram(df2,
title = "Distribution of Temperatures Below & Above the Equator",
x = "Temp",
color = "Position Relative to Equator (0°N)",
opacity = 0.2,
nbins = 35,
barmode='stack',
width = 700,
height = 500)
# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.98}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
Interestingly, we note that the temperatures above the equator seem to be a little higher than the temperatures below the equator - an observation we were not able to make from the previous visualization! We also note that temperatures close to 26°C are by far the most common, which is to be expected, since Brazil straddles the equator.
We might also consider viewing the temperature data using a box plot, as shown below!
fig = px.box(df2,
title = "Temperature Distributions Below & Above the Equator Over Ten Years",
x = "Year",
y = "Temp",
color = "Position Relative to Equator (0°N)",
width = 1000,
height = 400)
# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
The above interactive visualization is quite useful, as we can clearly see the upward trend in average temperatures - again, this is to be expected, as average temperatures have risen on account of global warming. However, unlike the histogram above, this plot does not let us easily see that the temperatures at stations above the equator might be slightly higher than those at stations below the equator.
As far as this particular project was concerned, undoubtedly the most tedious part involved manipulating and querying the appropriate data. And though this part of the process was a tad challenging, there is no doubt that generating visualizations - even interactive ones - is far easier when the data has been cleaned and queried appropriately.
Additionally, I found querying the data a little challenging in the sense that it was important for me to plan out my visualizations before extracting the data. Often, I had to re-query the data because I had not acquired everything I needed for the visualization. To be sure, a lesson learned is to plan out your visualizations in detail before querying the data and constructing the plots.
Disclaimer
This blog post was created for my PIC16B class at UCLA (Previously titled: Blog Post 1 - Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set).