Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set

In this post I will outline how to create several interesting, interactive data graphics using the NOAA climate data set.

Downloading the Data

As always, the first thing we have to do before constructing any sort of data visualization is to download and read in the data. To read in the temperature, station-metadata, and countries data, I will use the Pandas package and run the following lines of code:

import pandas as pd

# data_path is the path to the folder containing the downloaded CSV files
temperatures = pd.read_csv(data_path+"temps.csv")
stations = pd.read_csv(data_path+"station-metadata.csv")
countries = pd.read_csv(data_path+"countries.csv")

After reading the data into the Jupyter Notebook, it is always good practice to make sure that the data has been read in correctly by using the .head() method to check the first few rows of each data frame and the .shape attribute to check its dimensions.

# Checking the temperatures data
temperatures.head()
ID Year VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8 VALUE9 VALUE10 VALUE11 VALUE12
0 ACW00011604 1961 -89.0 236.0 472.0 773.0 1128.0 1599.0 1570.0 1481.0 1413.0 1174.0 510.0 -39.0
1 ACW00011604 1962 113.0 85.0 -154.0 635.0 908.0 1381.0 1510.0 1393.0 1163.0 994.0 323.0 -126.0
2 ACW00011604 1963 -713.0 -553.0 -99.0 541.0 1224.0 1627.0 1620.0 1596.0 1332.0 940.0 566.0 -108.0
3 ACW00011604 1964 62.0 -85.0 55.0 738.0 1219.0 1442.0 1506.0 1557.0 1221.0 788.0 546.0 112.0
4 ACW00011604 1965 44.0 -105.0 38.0 590.0 987.0 1500.0 1487.0 1477.0 1377.0 974.0 31.0 -178.0
# Checking the shape of the temperatures data frame
temperatures.shape
(1359937, 14)
# Checking the stations data
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR
# Checking the shape of the stations data frame
stations.shape
(27585, 5)
# Checking the countries data
countries.head()
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
# Checking the shape of the countries data frame
countries.shape
(279, 3)

Creating a Database to Store the Data

Notice that the temperatures and stations data frames are rather large. Generally speaking, larger data sets are noticeably slower to load into memory. While these particular data sets are not too large to fit in memory, they are large enough that loading them into Pandas takes a while.

And yet, while these data sets are large, it is rare that we need to operate on the entire data set at once; instead, we usually work with only part of the data at any given time. Hence, it is easier for us to create a database using Python’s sqlite3 module, which provides a structured way to move subsets of the data from storage into memory.

import sqlite3
# create a database in current directory called data.db
# to store temperatures, stations, and countries tables
conn = sqlite3.connect("data.db") 

Now, before appending our data to this database, let us first clean and manipulate it a little bit to make it more readable! First consider the following temperatures data frame:

temperatures.head()
ID Year VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8 VALUE9 VALUE10 VALUE11 VALUE12
0 ACW00011604 1961 -89.0 236.0 472.0 773.0 1128.0 1599.0 1570.0 1481.0 1413.0 1174.0 510.0 -39.0
1 ACW00011604 1962 113.0 85.0 -154.0 635.0 908.0 1381.0 1510.0 1393.0 1163.0 994.0 323.0 -126.0
2 ACW00011604 1963 -713.0 -553.0 -99.0 541.0 1224.0 1627.0 1620.0 1596.0 1332.0 940.0 566.0 -108.0
3 ACW00011604 1964 62.0 -85.0 55.0 738.0 1219.0 1442.0 1506.0 1557.0 1221.0 788.0 546.0 112.0
4 ACW00011604 1965 44.0 -105.0 38.0 590.0 987.0 1500.0 1487.0 1477.0 1377.0 974.0 31.0 -178.0

The first thing we want to do is explicitly extract the country code, i.e., the first two characters of each station ID. We can do so by executing the following chunk of code, which slices the ID string:

temperatures['ID'] = temperatures["ID"].str[0:2]
temperatures.head()
ID Year VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8 VALUE9 VALUE10 VALUE11 VALUE12
0 AC 1961 -89.0 236.0 472.0 773.0 1128.0 1599.0 1570.0 1481.0 1413.0 1174.0 510.0 -39.0
1 AC 1962 113.0 85.0 -154.0 635.0 908.0 1381.0 1510.0 1393.0 1163.0 994.0 323.0 -126.0
2 AC 1963 -713.0 -553.0 -99.0 541.0 1224.0 1627.0 1620.0 1596.0 1332.0 940.0 566.0 -108.0
3 AC 1964 62.0 -85.0 55.0 738.0 1219.0 1442.0 1506.0 1557.0 1221.0 788.0 546.0 112.0
4 AC 1965 44.0 -105.0 38.0 590.0 987.0 1500.0 1487.0 1477.0 1377.0 974.0 31.0 -178.0

Now, notice that each of the twelve VALUE columns in the above table corresponds to a month of the year. To make this data set a little more readable, we might want to create a separate month column and stack the temperatures accordingly.

temperatures = temperatures.set_index(keys=["ID", "Year"])
temperatures = temperatures.stack() # stacking the months and temperatures
temperatures = temperatures.reset_index()
temperatures.head()
ID Year level_2 0
0 AC 1961 VALUE1 -89.0
1 AC 1961 VALUE2 236.0
2 AC 1961 VALUE3 472.0
3 AC 1961 VALUE4 773.0
4 AC 1961 VALUE5 1128.0

While stacking the temperatures, Pandas automatically generated new column names for the data frame. For the sake of readability, let us rename the level_2 and 0 columns.

temperatures = temperatures.rename(columns = {"level_2" : "Month", 0 : "Temp"})
temperatures.head()
ID Year Month Temp
0 AC 1961 VALUE1 -89.0
1 AC 1961 VALUE2 236.0
2 AC 1961 VALUE3 472.0
3 AC 1961 VALUE4 773.0
4 AC 1961 VALUE5 1128.0

Notice that our Month column comprises the values VALUE1, ..., VALUE12. While it is possible to work out that “VALUE1” corresponds to “January”, “VALUE2” corresponds to “February”, and so on, the data frame becomes more interpretable if we replace each of those values with the integer corresponding to each month (i.e., 1 corresponds to “January”, 2 corresponds to “February”, and so on).

temperatures["Month"] = temperatures["Month"].str[5:].astype(int)
# Dividing the temperatures by 100 to extract temperatures of the correct unit.
temperatures["Temp"]  = temperatures["Temp"] / 100
temperatures.head()
ID Year Month Temp
0 AC 1961 1 -0.89
1 AC 1961 2 2.36
2 AC 1961 3 4.72
3 AC 1961 4 7.73
4 AC 1961 5 11.28

Now that we have cleaned and manipulated the temperatures data set, we can append it to the database as follows:

temperatures.to_sql("temperatures", conn, if_exists = "append", index=False)
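As a side note, if the raw temps.csv file were too large to read into memory all at once, we could instead stream it into the database chunk by chunk, applying the same cleaning steps to each chunk. Below is a minimal sketch of that alternative workflow; the helper name prepare_df and the chunk size are just illustrative choices, and the cleaning mirrors the steps shown above:

def prepare_df(df):
    # illustrative helper: apply the same cleaning steps shown above to one chunk
    df["ID"] = df["ID"].str[0:2]                  # keep only the country code
    df = df.set_index(keys=["ID", "Year"])
    df = df.stack()                               # stack the twelve VALUE columns
    df = df.reset_index()
    df = df.rename(columns={"level_2": "Month", 0: "Temp"})
    df["Month"] = df["Month"].str[5:].astype(int)
    df["Temp"] = df["Temp"] / 100                 # hundredths of °C to °C
    return df

# stream the CSV into the database 100,000 rows at a time
df_iter = pd.read_csv(data_path + "temps.csv", chunksize=100000)
for df in df_iter:
    prepare_df(df).to_sql("temperatures", conn, if_exists="append", index=False)

Because if_exists="append" adds rows every time it runs, this loop (like the single to_sql call above) should only be executed once per database.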

It is good practice to check whether the data frame has been added to the database! So, let’s do that:

# Checking that we have a "temperatures" table within the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',)]
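Beyond confirming that the table exists, we can also pull a few rows straight back out of the database to make sure the cleaned columns look the way we expect; for example:

# peeking at the first few rows of the temperatures table stored in the database
pd.read_sql_query("SELECT * FROM temperatures LIMIT 5", conn)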

Now, consider the stations-metadata data frame.

stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

As we did with the temperatures data set, the first thing we want to do is explicitly extract the country code, i.e., the first two characters of each station ID, which will make it easier for us to join the tables later on.

stations['ID'] = stations["ID"].str[0:2]
stations.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 AC 57.7667 11.8667 18.0 SAVE
1 AE 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AE 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AE 25.2550 55.3640 10.4 DUBAI_INTL
4 AE 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR

Now, let’s go ahead and add this data set to the database as well!

stations.to_sql("stations", conn, if_exists = "replace", index=False)

Lastly, we will consider the countries data set.

countries.head()
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa

The first thing we want to do is remove the ISO 3166 column, which holds an alternative set of country codes that we will not use. We are only interested in the FIPS 10-4 column, which matches the two-character country code we extracted from the station IDs, and the Name of the country.

# dropping the ISO 3166 column
countries = countries.drop('ISO 3166', axis=1)
countries.head()
FIPS 10-4 Name
0 AF Afghanistan
1 AX Akrotiri
2 AL Albania
3 AG Algeria
4 AQ American Samoa

Let us rename the FIPS 10-4 column to ID so that it matches the country-code column in the other tables.

countries = countries.rename(columns = {"FIPS 10-4" : "ID"})
countries.head()
ID Name
0 AF Afghanistan
1 AX Akrotiri
2 AL Albania
3 AG Algeria
4 AQ American Samoa

Finally, we are ready to add our last data set to the database!

countries.to_sql("countries", conn, if_exists="replace", index=False)

When we check our database, we should be able to see three separate tables: temperatures, stations, and countries.

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]

To be safe, let us also look at the columns within each of these tables:

cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")

for result in cursor.fetchall():
    print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
  "Year" INTEGER,
  "Month" INTEGER,
  "Temp" REAL
)
CREATE TABLE "stations" (
"ID" TEXT,
  "LATITUDE" REAL,
  "LONGITUDE" REAL,
  "STNELEV" REAL,
  "NAME" TEXT
)
CREATE TABLE "countries" (
"False" INTEGER,
  "ID" TEXT,
  "Name" TEXT
)
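As an optional extra step, since our later queries will repeatedly filter the temperatures table by country code, year, and month, we could also ask SQLite to build an index on those columns to speed up lookups. A minimal sketch, reusing the cursor from above (the index name is just an example):

# optional: index the columns we will filter on most often
cursor.execute("CREATE INDEX IF NOT EXISTS idx_temperatures_id_year_month ON temperatures (ID, Year, Month)")
conn.commit()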

Now that we have finished constructing and checking our database, it is good practice to close the database connection using the .close() method.

conn.close()
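As a side note, if we want to guarantee that the connection gets closed even when an error occurs partway through, we can wrap it in contextlib.closing; a small sketch of that pattern (the query here is just an example):

from contextlib import closing

# the connection is closed automatically when the with-block exits,
# even if an error is raised inside it
with closing(sqlite3.connect("data.db")) as conn:
    preview = pd.read_sql_query("SELECT * FROM temperatures LIMIT 5", conn)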

Writing a Query Function

The prior section of this blog post outlined how we might read in the appropriate data and create a database, "data.db". Using this database, we now want to write a function that will query the appropriate temperature data for us. In other words, we will write a function that returns a Pandas DataFrame of temperature readings for a specified country, within a specified date range, and in the specified month of the given years.

As a general note, always remember to document your function appropriately so that other users find it easy to understand and implement the function you have written!

def query_climate_database(country, year_begin, year_end, month):
    """Function returns a dataframe of temperature readings for the 
    specified country, in the specific date range, in the specified month
    of the year
    
    Parameters
    ----------
    country : string
        the name of a country for which data should be returned
    year_begin : integer
        the earliest year for which data should be returned
    year_end : integer
        the latest year for which data should be returned
    month : integer
        month of the year for which data should be returned
    
    Returns
    ----------
    DataFrame 
        Pandas dataframe of temperature readings for the specified country, 
        in the specified date range, in the specified month of the year
    """
    # opening the database
    conn = sqlite3.connect("data.db")
    cmd = \
    f"""
    SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country, t.Year, t.Month, t.Temp
    FROM temperatures t
    LEFT JOIN countries c ON t.ID = c.ID
    LEFT JOIN stations s ON t.ID = s.ID
    WHERE t.Year>={str(year_begin)} AND t.Year<={str(year_end)} AND t.Month={str(month)} AND c.Name='{country}'
    """
    df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
    conn.close() # closing the database
    
    return df
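As a side note, rather than interpolating the arguments into the SQL string with an f-string, the query body inside this function could equivalently pass them as parameters, which avoids quoting issues. A sketch of the same query using sqlite3's ? placeholders:

cmd = """
SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country, t.Year, t.Month, t.Temp
FROM temperatures t
LEFT JOIN countries c ON t.ID = c.ID
LEFT JOIN stations s ON t.ID = s.ID
WHERE t.Year >= ? AND t.Year <= ? AND t.Month = ? AND c.Name = ?
"""
# the parameters are substituted safely by the database driver
df = pd.read_sql_query(cmd, conn, params=(year_begin, year_end, month, country))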

Whenever we write a function, we always need to make sure it works properly! Let’s go ahead and test the above function!

# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1 # The month of January

query_climate_database(country, year_begin, year_end, month)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 MINICOYOBSY 8.3000 73.0000 India 1980 1 2.08
1 MINICOY 8.3000 73.1500 India 1980 1 2.08
2 THIRUVANANTHAPURAM 8.4670 76.9500 India 1980 1 2.08
3 THIRUVANANTHAPURAM 8.4830 76.9500 India 1980 1 2.08
4 TRIVANDRUM 8.5000 77.0000 India 1980 1 2.08
... ... ... ... ... ... ... ...
371931 LUDHIANA 30.9333 75.8667 India 2020 1 29.35
371932 SHIMLA 31.1000 77.1670 India 2020 1 29.35
371933 AMRITSAR 31.7100 74.7970 India 2020 1 29.35
371934 SRINAGAR 34.0830 74.8330 India 2020 1 29.35
371935 LEH_KASHMIR 34.2000 77.7000 India 2020 1 29.35

371936 rows × 7 columns

Writing a Geographic Scatter Function for Yearly Temperature Increases

Now that we have written a function that will help us query the appropriate data, we want to write a function that will help us address the following question: “How does the average yearly change in temperature vary within a given country?”

In other words, we would like to write a function that returns an interactive geographic scatterplot in which each point reflects an estimate of the yearly change in temperature during the specified month and time period at that station. Luckily for us, we have a very handy library, Plotly Express, to help us with our visualization. To estimate the average yearly change in temperature (in a given month) at a given station, we can compute the slope (i.e., the first coefficient) of a linear regression of temperature against year. To construct our linear regression model, we can use the Scikit-Learn library!

from plotly import express as px
from sklearn.linear_model import LinearRegression

# Other potentially useful libraries
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

Before jumping to our main function that will generate the geographic scatterplot, we first want to write a helper function, coef, that estimates the average yearly temperature change (in a given month) at a single station using linear regression.

def coef(data_group):
    # extracting x (Year) and y (Temp) for the linear regression
    x = data_group[['Year']]
    y = data_group['Temp']
    # fitting the linear regression model Temp ~ Year
    LR = LinearRegression()
    LR.fit(x,y)
    # returning the slope, i.e., the estimated yearly change in temperature (°C per year)
    return LR.coef_[0]

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    """Function generates an interactive geographic scatterplot,
    with a point for each station, such that the color of the point reflects 
    an estimate of the yearly change in temperature during the specified 
    month and time period at that station.
    
    Parameters
    ----------
    country : string
        the name of a country for which data should be returned
    year_begin : integer
        the earliest year for which data should be returned
    year_end : integer
        the latest year for which data should be returned
    month : integer
        month of the year for which data should be returned
    min_obs : integer
        minimum required number of years of data for any given station
    **kwargs 
        additional keyword arguments passed to px.scatter_mapbox(). 
        These can be used to control the colormap used, the mapbox style, etc.
    
    Returns
    ----------
    scatterplot
        An interactive geographic scatterplot, constructed using Plotly Express, 
        with a point for each station, such that the color of the point reflects 
        an estimate of the yearly change in temperature during the specified 
        month and time period at that station
    """
    # extracting the temperature reading data frame for the given
    # country, month, and years
    df = query_climate_database(country, year_begin, year_end, month)
    
    # extracting a list of unique station names
    station_names = df['NAME'].unique() 
    stations_to_keep = []
    
    for name in station_names: 
        temp_df = df[df['NAME'] == name]
        # checking the years of data available for
        # the specified month
        number_of_years = temp_df['Year'].nunique()
        if number_of_years >= min_obs:
            stations_to_keep.append(name)
    
    # Now that we have a list of the stations to keep
    # we create another data frame 
    df_new = pd.DataFrame()
    for station_name in stations_to_keep:
        tempdf = df[df['NAME'] == station_name]
        # concatenating the temporary data frame
        # with the new data frame
        df_new = pd.concat([df_new, tempdf])
        
    # Extracting the average temperatures per month, per year
    df_new = df_new.groupby(["NAME", "LATITUDE", "LONGITUDE", "Country", 'Year', "Month"])["Temp"].aggregate(np.mean)
    df_new = pd.DataFrame(df_new) # appending the above to a data frame
    df_new = df_new.reset_index() # and resetting the index appropriately
    
    # array of station names from the new data frame
    new_station_names = df_new['NAME'].unique()
    # Arrays to store the coefficients, latitudes, and longitudes
    coefs = np.array([])
    lats = np.array([])
    longs = np.array([])
    
    for sname in new_station_names:
        df_temp = df_new[df_new['NAME'] == sname]
        lats = np.append(lats, df_temp['LATITUDE'].unique()[0])
        longs = np.append(longs, df_temp['LONGITUDE'].unique()[0])
        coefs = np.append(coefs, coef(df_temp))
    
    # Creating a data frame of coefficients representing yearly temp change
    coef_df = pd.DataFrame()
    coef_df['NAME'] = new_station_names
    coef_df['LATITUDE'] = lats
    coef_df['LONGITUDE'] = longs
    coef_df['Estimated Yearly Increase (°C)'] = np.round(coefs, 5)
    
    # list of months
    months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
    
    # Generating the scatter plot
    fig = px.scatter_mapbox(coef_df, 
                            title = "Estimated Yearly Increase in Temperature in "+months[month-1],
                            lat = "LATITUDE",
                            lon = "LONGITUDE", 
                            hover_name = "NAME", 
                            color = "Estimated Yearly Increase (°C)",
                            **kwargs)

    fig.update(layout={"title": {"x": 0.85, "y": 0.97}})
    fig.update_layout(margin={"r":0,"t":30,"l":0,"b":10})
    return fig
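As an aside, the station-filtering loop inside temperature_coefficient_plot could also be written more compactly with a pandas groupby transform. This is just an alternative sketch (not the version used above), assuming df is the data frame returned by query_climate_database and min_obs is the same threshold:

# count the distinct years of data available at each station, then
# keep only the rows belonging to stations with at least min_obs years
years_per_station = df.groupby("NAME")["Year"].transform("nunique")
df_filtered = df[years_per_station >= min_obs]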

As always, it is good practice to test our functions to make sure that they are working properly!

# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1
min_obs = 10

# Implementing the function
fig = temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, zoom = 3, 
                                   mapbox_style = "carto-positron")
fig.show()
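If you would like to embed the interactive figure in a web page (such as a blog post), Plotly can export it as a standalone HTML file. A minimal sketch, where the output file name is just an example:

from plotly.io import write_html

# save the interactive figure so it can be embedded elsewhere
write_html(fig, "temperature_coefficient_plot.html")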

Creating Other Interesting Visualizations

Plotly Express not only helps us create interactive geographic scatterplots (depicted above); it can also produce the usual graphs (histograms, line graphs, scatterplots, etc.) that we would normally make with Matplotlib or Seaborn. The interesting thing about Plotly, however, is that it takes these familiar visualizations to the next level by making them interactive!

To illustrate this capability, let us make a regular scatter plot of temperature readings depicting the yearly temperature trend (note that a latitude above 0° places a station north of, i.e. above, the equator, while a latitude below 0° places it south of, i.e. below, the equator). We first want to rewrite our query function to extract only the data we need!

def query_climate_database_v2(country, year_begin, year_end):
    """Function returns a dataframe of temperature readings for the 
    specified country, in the specific date range, in the specified month
    of the year
    
    Parameters
    ----------
    country : string
        the name of a country for which data should be returned
    year_begin : integer
        the earliest year for which data should be returned
    year_end : integer
        the latest year for which data should be returned
    
    Returns
    ----------
    DataFrame 
        Pandas dataframe of temperature readings for the specified country, 
        in the specified date range. 
    """
    # opening the database
    conn = sqlite3.connect("data.db")
    cmd = \
    f"""
    SELECT s.LATITUDE, s.LONGITUDE, c.Name AS Country, s.NAME, t.Year, t.Month, t.Temp
    FROM temperatures t
    LEFT JOIN countries c ON t.ID = c.ID
    LEFT JOIN stations s ON t.ID = s.ID
    WHERE t.Year>={str(year_begin)} AND t.Year<{str(year_end)} AND c.Name='{country}'
    """
    df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
    conn.close() # closing the database
    
    return df
# TEST CASE: Let us pick a country that falls on the equator, like Brazil!
df2 = query_climate_database_v2("Brazil", 1990, 2020)
df2
LATITUDE LONGITUDE Country NAME Year Month Temp
0 -33.517 -53.350 Brazil STVITORIA_DO_PALMAR 1990 1 18.78
1 -32.017 -52.083 Brazil RIO_GRANDE 1990 1 18.78
2 -31.780 -52.410 Brazil PELOTAS 1990 1 18.78
3 -31.333 -54.100 Brazil BAGE 1990 1 18.78
4 -30.830 -55.600 Brazil SANTANA_DO_LIVRAMENTO 1990 1 18.78
... ... ... ... ... ... ... ...
22481195 0.610 -69.180 Brazil IAUARETE 2019 12 31.10
22481196 1.833 -61.133 Brazil CARACARAI 2019 12 31.10
22481197 2.067 -50.850 Brazil AMAPA 2019 12 31.10
22481198 2.817 -60.650 Brazil BOA_VISTA 2019 12 31.10
22481199 2.833 -60.700 Brazil BOA_VISTAAEROPORT 2019 12 31.10

22481200 rows × 7 columns

# Extracting the average temperatures per month, per year
df2 = df2.groupby(["LATITUDE", "LONGITUDE", "Country", "NAME", 'Year', "Month"])["Temp"].aggregate(np.mean)
df2 = pd.DataFrame(df2) # appending the above to a data frame
df2 = df2.reset_index() # and resetting the index appropriately
df2
LATITUDE LONGITUDE Country NAME Year Month Temp
0 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 1 25.940840
1 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 2 25.650976
2 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 3 25.886124
3 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 4 25.287054
4 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 5 22.841311
... ... ... ... ... ... ... ...
125995 2.833 -60.70 Brazil BOA_VISTAAEROPORT 2019 8 23.764634
125996 2.833 -60.70 Brazil BOA_VISTAAEROPORT 2019 9 25.871429
125997 2.833 -60.70 Brazil BOA_VISTAAEROPORT 2019 10 26.625412
125998 2.833 -60.70 Brazil BOA_VISTAAEROPORT 2019 11 26.701905
125999 2.833 -60.70 Brazil BOA_VISTAAEROPORT 2019 12 26.719643

126000 rows × 7 columns

# Add a column to determine whether the point is above or
# below the equator
latitudes = np.array(df2['LATITUDE'])

# above or below or on equator? 
equator = np.array([])

for lat in latitudes:
    if lat < 0:
        equator = np.append(equator, 'below')
    elif lat > 0: 
        equator = np.append(equator, 'above')
    else: 
        equator = np.append(equator, 'on')
        
df2['Position Relative to Equator (0°N)'] = equator
df2.head()
LATITUDE LONGITUDE Country NAME Year Month Temp Position Relative to Equator (0°N)
0 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 1 25.940840 below
1 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 2 25.650976 below
2 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 3 25.886124 below
3 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 4 25.287054 below
4 -33.517 -53.35 Brazil STVITORIA_DO_PALMAR 1990 5 22.841311 below
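As a side note, the classification loop above could also be vectorized with NumPy, which avoids growing an array inside a Python loop. A short sketch that produces the same column, assuming df2 as constructed above:

# classify every latitude at once: negative -> below, positive -> above, zero -> on
conditions = [df2["LATITUDE"] < 0, df2["LATITUDE"] > 0]
df2["Position Relative to Equator (0°N)"] = np.select(conditions, ["below", "above"], default="on")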

Now that we have a clean data set with all the information we need, we are ready to create our visualization using Plotly!

fig = px.scatter(df2,
                 title = "Yearly Temperature Trends (1990 - 2020)"
                 x = "Month", 
                 y = "Temp",
                 color = "Year",
                 width = 800,
                 height = 400,
                 opacity = 0.5, 
                 facet_col = "Position Relative to Equator (0°N)")


# reduce whitespace
fig.update(layout={"title": {"x": 0.5, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
# show the plot
fig.show()

This visualization is quite useful as it depicts a number of things about the temperature data. For one thing, as the year increases (from 1990 to 2020), the average temperature also increases (which is expected given global warming). Additionally, we are clearly able to see how the average temperature changes over the course of the year (higher temperatures around December-January and lower temperatures around June-July, as expected for a country that lies mostly in the Southern Hemisphere). Interestingly, we note that the temperature does not vary much with position relative to the equator - in this case, this is to be expected since Brazil straddles the equator and much of the country lies in a tropical climate zone.

Another way of visualizing the temperature data in Brazil would be through a histogram, as shown below!

fig = px.histogram(df2,
                   title = "Distribution of Temperatures Below & Above the Equator",
                   x = "Temp", 
                   color = "Position Relative to Equator (0°N)",
                   opacity = 0.2, 
                   nbins = 35, 
                   barmode='stack',
                   width = 700,
                   height = 500)

# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.98}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()

Interestingly, we note that the temperatures above the equator seem to be a little higher than the temperatures below the equator - an observation we were not able to derive from the previous visualization! We also note that temperatures close to 26°C are by far the most common, which is to be expected since much of Brazil lies in the tropics.

We might also consider viewing the temperature data using a box plot, as shown below!

fig = px.box(df2, 
             title = "Temperature Distributions Below & Above the Equator Over Ten Years",
             x = "Year", 
             y = "Temp",
             color = "Position Relative to Equator (0°N)",
             width = 1000,
             height = 400)

# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()

The above interactive visualization is extremely useful, as we can clearly see the upward trend in average temperatures - again, this is to be expected since average temperatures have increased on account of global warming. However, unlike with the histogram above, we are not able to clearly determine whether the temperatures at stations located above the equator are slightly higher than the temperatures at stations located below the equator.

As far as this particular project was concerned, the most tedious part undoubtedly involved manipulating the data and querying the appropriate subsets. And though this part of the process was a tad challenging, there is no doubt that generating visualizations - even interactive ones - is far easier when the data has been manipulated and queried appropriately.

Additionally, I found querying the data a little challenging in the sense that it was important for me to plan out my visualizations before extracting the data. Often, I had to re-query the data because I had not acquired all the data I needed for the visualization. The lesson learned: plan out your visualizations in detail before attempting to query the data and construct them.

Disclaimer
This blog post was created for my PIC16B class at UCLA (previously titled: Blog Post 1 - Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set).

Written on October 14, 2021