Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set
In this post I will outline how to create several interesting, interactive data graphics using the NOAA climate data set.
Downloading the Data
As always, the first thing we have to do before constructing any sort of data visualization is to download and read in the data. To read in the temperature, station-metadata, and countries data, I will use the pandas package and run the following lines of code (here, data_path is the path to the folder containing the downloaded CSV files):
import pandas as pd
temperatures = pd.read_csv(data_path+"temps.csv")
stations = pd.read_csv(data_path+"station-metadata.csv")
countries = pd.read_csv(data_path+"countries.csv")
After reading the data in, it is always good practice to make sure that the data has been read correctly by using the .head()
method to check the first few rows of each data frame and the .shape
attribute to check its dimensions.
# Checking the temperatures data
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
# Checking the shape of the temperatures data frame
temperatures.shape
(1359937, 14)
# Checking the stations data
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
# Checking the shape of the stations data frame
stations.shape
(27585, 5)
# Checking the countries data
countries.head()
FIPS 10-4 | ISO 3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
# Checking the shape of the countries data frame
countries.shape
(279, 3)
Creating a Database to Store the Data
Notice that the temperatures
and stations
data frames are rather large (the temperatures table alone has over 1.3 million rows). Generally speaking, large data sets are noticeably slow to load into memory. While these particular data sets are not too large to fit in memory, they are large enough that loading them into pandas takes a while.
And yet, even though these data sets are large, it is rare that we would have to operate on an entire data set at once. Instead, we usually work with only part of the data at any given time. Hence, it is convenient for us to create a database using Python's sqlite3
module, which provides a structured way to move subsets of the data from storage into memory.
import sqlite3
# create a database in current directory called data.db
# to store temperatures, stations, and countries tables
conn = sqlite3.connect("data.db")
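As an aside: if a CSV file were too large to read into memory all at once, pandas can stream it into the database in chunks via the chunksize argument of pd.read_csv. Here is a minimal sketch of that pattern, using an in-memory database and a toy CSV string as stand-ins for data.db and temps.csv:

```python
import sqlite3
import pandas as pd
from io import StringIO

# Toy stand-in for a large CSV file; in practice this would be
# pd.read_csv(data_path + "temps.csv", chunksize=100000)
csv_data = StringIO("ID,Year,VALUE1\nACW00011604,1961,-89.0\nACW00011604,1962,113.0\n")

conn = sqlite3.connect(":memory:")  # stand-in for "data.db"

# Read the CSV one chunk at a time and append each chunk to the table,
# so only one chunk ever needs to fit in memory
for chunk in pd.read_csv(csv_data, chunksize=1):
    chunk.to_sql("temperatures", conn, if_exists="append", index=False)

n_rows = pd.read_sql_query("SELECT COUNT(*) AS n FROM temperatures", conn)["n"][0]
print(n_rows)  # 2
```

For the data sets in this post, reading each CSV whole (as above) is perfectly fine; chunking only becomes necessary when the raw file exceeds available memory.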
Now, before appending our data to this database, let us first clean and manipulate it a little bit to make it more readable! First consider the following temperatures data frame:
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
The first thing we want to do is explicitly extract the country code, i.e., the first two letters of each ID
. We can do so by executing the following chunk of code, which slices the ID
string:
temperatures['ID'] = temperatures["ID"].str[0:2]
temperatures.head()
ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AC | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
1 | AC | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
2 | AC | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
3 | AC | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
4 | AC | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
Now, notice that each of the twelve VALUE
columns in the above table corresponds to a month of the year. To make this data set a little more readable, we might want to create a separate month column and stack the temperatures accordingly.
temperatures = temperatures.set_index(keys=["ID", "Year"])
temperatures = temperatures.stack() # stacking the months and temperatures
temperatures = temperatures.reset_index()
temperatures.head()
ID | Year | level_2 | 0 | |
---|---|---|---|---|
0 | AC | 1961 | VALUE1 | -89.0 |
1 | AC | 1961 | VALUE2 | 236.0 |
2 | AC | 1961 | VALUE3 | 472.0 |
3 | AC | 1961 | VALUE4 | 773.0 |
4 | AC | 1961 | VALUE5 | 1128.0 |
While stacking the temperatures, pandas automatically renamed the columns of the data frame. For the sake of readability, let us rename the level_2
and 0
columns.
temperatures = temperatures.rename(columns = {"level_2" : "Month", 0 : "Temp"})
temperatures.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | AC | 1961 | VALUE1 | -89.0 |
1 | AC | 1961 | VALUE2 | 236.0 |
2 | AC | 1961 | VALUE3 | 472.0 |
3 | AC | 1961 | VALUE4 | 773.0 |
4 | AC | 1961 | VALUE5 | 1128.0 |
Notice that our Month column comprises the values: VALUE1, ..., VALUE12
. While it is possible to work out that “VALUE1” corresponds to “January”, “VALUE2” to “February”, and so on, the data frame becomes more interpretable if we replace each of those strings with the integer corresponding to each month (i.e., 1 corresponds to “January”, 2 to “February”, and so on).
temperatures["Month"] = temperatures["Month"].str[5:].astype(int)
# The raw temperatures are recorded in hundredths of a degree Celsius,
# so we divide by 100 to convert them to °C
temperatures["Temp"] = temperatures["Temp"] / 100
temperatures.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | AC | 1961 | 1 | -0.89 |
1 | AC | 1961 | 2 | 2.36 |
2 | AC | 1961 | 3 | 4.72 |
3 | AC | 1961 | 4 | 7.73 |
4 | AC | 1961 | 5 | 11.28 |
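As a side note, the set_index / stack / rename pipeline above can also be expressed in a single call to pd.melt; here is a sketch on a toy two-column slice of the raw data (the toy frame is my own stand-in, not the real table):

```python
import pandas as pd

# Toy stand-in for the raw temperatures table (only two VALUE columns)
raw = pd.DataFrame({"ID": ["AC"], "Year": [1961],
                    "VALUE1": [-89.0], "VALUE2": [236.0]})

# melt gathers the VALUE columns into (Month, Temp) pairs in one step
tidy = raw.melt(id_vars=["ID", "Year"], var_name="Month", value_name="Temp")
tidy["Month"] = tidy["Month"].str[5:].astype(int)  # "VALUE3" -> 3
tidy["Temp"] = tidy["Temp"] / 100                  # hundredths of °C -> °C
print(tidy)
```

Both approaches produce the same tidy table; stack is convenient when the identifier columns are already an index, while melt works directly on the flat data frame.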
Now that we have successfully cleaned and manipulated the temperatures
data set, we can append it to the database as follows:
temperatures.to_sql("temperatures", conn, if_exists = "append", index=False)
It is good practice to check whether the data frame has been added to the database! So, let’s do that:
# Checking that we have a "temperatures" table within the database
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',)]
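Beyond listing table names, we can also pull just a few rows with pd.read_sql_query to eyeball a table's contents without loading the whole table into memory. A sketch, using a toy in-memory database as a stand-in for data.db:

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for data.db
conn = sqlite3.connect(":memory:")
pd.DataFrame({"ID": ["AC"], "Year": [1961], "Month": [1], "Temp": [-0.89]}) \
  .to_sql("temperatures", conn, if_exists="append", index=False)

# Peek at the first few rows only, much like .head() on a data frame
preview = pd.read_sql_query("SELECT * FROM temperatures LIMIT 5", conn)
print(preview)
```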
Now, consider the stations-metadata data frame.
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
As we did with the temperatures data set, the first thing we want to do is explicitly extract the country code, i.e., the first two letters of each ID
number, which will make it easier for us to compare table values later on.
stations['ID'] = stations["ID"].str[0:2]
stations.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | AC | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AE | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AE | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AE | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
Now, let’s go ahead and add this data set to the database as well!
stations.to_sql("stations", conn, if_exists = "replace", index=False)
Lastly, we will consider the countries data set.
countries.head()
FIPS 10-4 | ISO 3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
The first thing we want to do is remove the ISO 3166
column, which holds an alternative country code. We are only interested in the FIPS 10-4
column, which contains the country code matching the two-letter prefix of each station ID, and the Name
of the country.
# dropping the ISO 3166 column
countries = countries.drop('ISO 3166', axis=1)
countries.head()
FIPS 10-4 | Name | |
---|---|---|
0 | AF | Afghanistan |
1 | AX | Akrotiri |
2 | AL | Albania |
3 | AG | Algeria |
4 | AQ | American Samoa |
Let us rename the FIPS 10-4
column to ID, to improve the readability of the data set and to match the other tables.
countries = countries.rename(columns = {"FIPS 10-4" : "ID"})
countries.head()
ID | Name | |
---|---|---|
0 | AF | Afghanistan |
1 | AX | Akrotiri |
2 | AL | Albania |
3 | AG | Algeria |
4 | AQ | American Samoa |
Finally, we are ready to add our last data set to the database!
countries.to_sql("countries", conn, if_exists="replace", index=False)
When we check our database, we should be able to see three separate data tables: temperatures, stations, and countries.
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]
To be safe, let us also look at the columns within each of these tables:
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table';")
for result in cursor.fetchall():
print(result[0])
CREATE TABLE "temperatures" (
"ID" TEXT,
"Year" INTEGER,
"Month" INTEGER,
"Temp" REAL
)
CREATE TABLE "stations" (
"ID" TEXT,
"LATITUDE" REAL,
"LONGITUDE" REAL,
"STNELEV" REAL,
"NAME" TEXT
)
CREATE TABLE "countries" (
"ID" TEXT,
"Name" TEXT
)
Now that we have finished constructing and checking our database, it is good practice to close the database connection using the .close()
method.
conn.close()
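If we want the connection to be closed even when an error interrupts our work partway through, we can wrap it in contextlib.closing (note that using a sqlite3 connection directly as a context manager only manages transactions; it does not close the connection). A sketch, with ":memory:" standing in for "data.db":

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs, even if a query raises an error
with closing(sqlite3.connect(":memory:")) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = cursor.fetchall()

print(tables)  # the toy in-memory database has no tables yet
```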
Writing a Query Function
The prior section of this blog post outlined how to read in the appropriate data and create a database, "data.db"
. Using this database, we now want to write a function that will query the appropriate temperature data for us. In other words, we will write a function that returns a pandas DataFrame of temperature readings for a specified country, within a specified date range, in a specified month of those years.
As a general note, always remember to document your function appropriately so that other users find it easy to understand and implement the function you have written!
def query_climate_database(country, year_begin, year_end, month):
"""Function returns a dataframe of temperature readings for the
specified country, in the specific date range, in the specified month
of the year
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the latest year for which data should be returned
month : integer
month of the year for which data should be returned
Returns
----------
DataFrame
Pandas dataframe of temperature readings for the specified country,
in the specified date range, in the specified month of the year
"""
# opening the database
conn = sqlite3.connect("data.db")
cmd = \
f"""
SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country, t.Year, t.Month, t.Temp
FROM temperatures t
LEFT JOIN countries c ON t.ID = c.ID
LEFT JOIN stations s ON t.ID = s.ID
WHERE t.Year>={str(year_begin)} AND t.Year<={str(year_end)} AND t.Month={str(month)} AND c.Name='{country}'
"""
df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
conn.close() # closing the database
return df
Whenever we write a function, we always need to make sure it works properly! Let’s go ahead and test the above function!
# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1 # The month of January
query_climate_database(country, year_begin, year_end, month)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | MINICOYOBSY | 8.3000 | 73.0000 | India | 1980 | 1 | 2.08 |
1 | MINICOY | 8.3000 | 73.1500 | India | 1980 | 1 | 2.08 |
2 | THIRUVANANTHAPURAM | 8.4670 | 76.9500 | India | 1980 | 1 | 2.08 |
3 | THIRUVANANTHAPURAM | 8.4830 | 76.9500 | India | 1980 | 1 | 2.08 |
4 | TRIVANDRUM | 8.5000 | 77.0000 | India | 1980 | 1 | 2.08 |
... | ... | ... | ... | ... | ... | ... | ... |
371931 | LUDHIANA | 30.9333 | 75.8667 | India | 2020 | 1 | 29.35 |
371932 | SHIMLA | 31.1000 | 77.1670 | India | 2020 | 1 | 29.35 |
371933 | AMRITSAR | 31.7100 | 74.7970 | India | 2020 | 1 | 29.35 |
371934 | SRINAGAR | 34.0830 | 74.8330 | India | 2020 | 1 | 29.35 |
371935 | LEH_KASHMIR | 34.2000 | 77.7000 | India | 2020 | 1 | 29.35 |
371936 rows × 7 columns
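One caveat about the query above: interpolating country into the SQL string with an f-string breaks if the name contains a quote, and is unsafe for untrusted input. sqlite3 supports ?-placeholders instead, which pd.read_sql_query exposes through its params argument. Here is a sketch of the same query in that style (the function name and the db_path parameter are my own additions; the original hardcodes "data.db"):

```python
import sqlite3
import pandas as pd

def query_climate_database_params(db_path, country, year_begin, year_end, month):
    """Same query as query_climate_database, but with ?-placeholders
    instead of f-string interpolation."""
    conn = sqlite3.connect(db_path)
    cmd = """
    SELECT s.NAME, s.LATITUDE, s.LONGITUDE, c.Name AS Country,
           t.Year, t.Month, t.Temp
    FROM temperatures t
    LEFT JOIN countries c ON t.ID = c.ID
    LEFT JOIN stations s ON t.ID = s.ID
    WHERE t.Year >= ? AND t.Year <= ? AND t.Month = ? AND c.Name = ?
    """
    # sqlite3 substitutes each ? with the matching value, handling quoting
    df = pd.read_sql_query(cmd, conn,
                           params=(year_begin, year_end, month, country))
    conn.close()
    return df
```

Called as query_climate_database_params("data.db", "India", 1980, 2020, 1), this returns the same data frame as the test case above.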
Writing a Geographic Scatter Function for Yearly Temperature Increases
Now that we have written a function that will help us query the appropriate data, we want to write a function that will help us address the following question: “How does the average yearly change in temperature vary within a given country?”
In other words, we would like to write a function that returns an interactive geographic scatterplot in which each point is a station, colored by the estimated yearly change in temperature during the specified month and time period at that station. Luckily for us, we have a very handy library, Plotly Express, to help us with the visualization. To estimate the average yearly change in temperature (in a given month) at a given station, we can compute the slope (i.e., the first coefficient) of a linear regression of temperature against year. To construct our linear regression model we can use the scikit-learn library!
from plotly import express as px
from sklearn.linear_model import LinearRegression
# Other potentially useful libraries
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
Before jumping to our main function that will generate the geographic scatterplot, we first write a helper function coef
that estimates the average yearly temperature change in a given month, using linear regression (the slope of temperature regressed on year).
def coef(data_group):
# extracting x and y for linear regression
x = data_group[['Year']]
y = data_group['Temp']
# The linear regression model
LR = LinearRegression()
LR.fit(x,y)
# returning the coefficients
return LR.coef_[0]
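To sanity-check coef, we can feed it a tiny synthetic data frame whose slope we know in advance (the toy data below is made up purely for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def coef(data_group):  # same helper as above
    x = data_group[['Year']]
    y = data_group['Temp']
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

# Synthetic station record that warms by exactly 0.5 °C per year
toy = pd.DataFrame({"Year": [2000, 2001, 2002, 2003],
                    "Temp": [10.0, 10.5, 11.0, 11.5]})
print(coef(toy))  # slope ≈ 0.5
```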
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
"""Function generates an interactive geographic scatterplot,
with a point for each station, such that the color of the point reflects
an estimate of the yearly change in temperature during the specified
month and time period at that station.
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the latest year for which data should be returned
month : integer
month of the year for which data should be returned
min_obs : integer
minimum required number of years of data for any given station
**kwargs
additional keyword arguments passed to px.scatter_mapbox().
These can be used to control the colormap used, the mapbox style, etc.
Returns
----------
scatterplot
An interactive geographic scatterplot, constructed using Plotly Express,
with a point for each station, such that the color of the point reflects
an estimate of the yearly change in temperature during the specified
month and time period at that station
"""
# extracting the temperature reading data frame for the given
# country, month, and years
df = query_climate_database(country, year_begin, year_end, month)
# extracting a list of unique station names
station_names = df['NAME'].unique()
stations_to_keep = []
for name in station_names:
temp_df = df[df['NAME'] == name]
# checking the years of data available for
# the specified month
number_of_years = temp_df['Year'].nunique()
if number_of_years >= min_obs:
stations_to_keep.append(name)
# Now that we have a list of the stations to keep
# we create another data frame
df_new = pd.DataFrame()
for station_name in stations_to_keep:
tempdf = df[df['NAME'] == station_name]
# concatenating the temporary data frame
# with the new data frame
df_new = pd.concat([df_new, tempdf])
# Extracting the average temperatures per month, per year
df_new = df_new.groupby(["NAME", "LATITUDE", "LONGITUDE", "Country", 'Year', "Month"])["Temp"].aggregate(np.mean)
df_new = pd.DataFrame(df_new) # appending the above to a data frame
df_new = df_new.reset_index() # and resetting the index appropriately
# array of station names from the filtered data frame
new_station_names = df_new['NAME'].unique()
# Arrays to store the coefficients, latitudes, and longitudes
coefs = np.array([])
lats = np.array([])
longs = np.array([])
# loop over the filtered station names, so that every df_temp is non-empty
for sname in new_station_names:
df_temp = df_new[df_new['NAME'] == sname]
lats = np.append(lats, df_temp['LATITUDE'].unique()[0])
longs = np.append(longs, df_temp['LONGITUDE'].unique()[0])
coefs = np.append(coefs, coef(df_temp))
# Creating a data frame of coefficients representing yearly temp change
coef_df = pd.DataFrame()
coef_df['NAME'] = new_station_names
coef_df['LATITUDE'] = lats
coef_df['LONGITUDE'] = longs
coef_df['Estimated Yearly Increase (°C)'] = np.round(coefs, 5)
# list of months
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
# Generating the scatter plot
fig = px.scatter_mapbox(coef_df,
title = "Estimated Yearly Increase in Temperature in "+months[month-1],
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated Yearly Increase (°C)",
**kwargs)
fig.update(layout={"title": {"x": 0.85, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":30,"l":0,"b":10})
return fig
As always, it is good practice to test our functions to make sure that they are working properly!
# TEST CASE
country = "India"
year_begin = 1980
year_end = 2020
month = 1
min_obs = 10
# Implementing the function
fig = temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, zoom = 3,
mapbox_style = "carto-positron")
fig.show()
Creating Other Interesting Visualizations
Plotly Express not only helps us with interactive geographic scatterplots (depicted above); we can also use it to draw the usual graphs (histograms, line graphs, scatterplots, etc.) that we would normally make with matplotlib or seaborn. The interesting thing about Plotly is that we can take these usual visualizations to the next level by making them interactive!
To illustrate this capability, let us draw a regular scatterplot of temperature readings depicting the yearly temperature trend (note that a positive latitude means a station lies north of the equator, while a negative latitude means it lies south of the equator). We first want to rewrite our query function to extract only the data we need!
def query_climate_database_v2(country, year_begin, year_end):
"""Function returns a dataframe of temperature readings for the
specified country, in the specific date range, in the specified month
of the year
Parameters
----------
country : string
the name of a country for which data should be returned
year_begin : integer
the earliest year for which data should be returned
year_end : integer
the year up to (but not including) which data should be returned
Returns
----------
DataFrame
Pandas dataframe of temperature readings for the specified country,
in the specified date range.
"""
# opening the database
conn = sqlite3.connect("data.db")
cmd = \
f"""
SELECT s.LATITUDE, s.LONGITUDE, c.Name AS Country, s.NAME, t.Year, t.Month, t.Temp
FROM temperatures t
LEFT JOIN countries c ON t.ID = c.ID
LEFT JOIN stations s ON t.ID = s.ID
WHERE t.Year>={str(year_begin)} AND t.Year<{str(year_end)} AND c.Name='{country}'
"""
df = pd.read_sql_query(cmd, conn) # extracting the appropriate dataframe
conn.close() # closing the database
return df
# TEST CASE: Let us pick a country that falls on the equator, like Brazil!
df2 = query_climate_database_v2("Brazil", 1990, 2020)
df2
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | -33.517 | -53.350 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 18.78 |
1 | -32.017 | -52.083 | Brazil | RIO_GRANDE | 1990 | 1 | 18.78 |
2 | -31.780 | -52.410 | Brazil | PELOTAS | 1990 | 1 | 18.78 |
3 | -31.333 | -54.100 | Brazil | BAGE | 1990 | 1 | 18.78 |
4 | -30.830 | -55.600 | Brazil | SANTANA_DO_LIVRAMENTO | 1990 | 1 | 18.78 |
... | ... | ... | ... | ... | ... | ... | ... |
22481195 | 0.610 | -69.180 | Brazil | IAUARETE | 2019 | 12 | 31.10 |
22481196 | 1.833 | -61.133 | Brazil | CARACARAI | 2019 | 12 | 31.10 |
22481197 | 2.067 | -50.850 | Brazil | AMAPA | 2019 | 12 | 31.10 |
22481198 | 2.817 | -60.650 | Brazil | BOA_VISTA | 2019 | 12 | 31.10 |
22481199 | 2.833 | -60.700 | Brazil | BOA_VISTAAEROPORT | 2019 | 12 | 31.10 |
22481200 rows × 7 columns
# Extracting the average temperatures per month, per year
df2 = df2.groupby(["LATITUDE", "LONGITUDE", "Country", "NAME", 'Year', "Month"])["Temp"].aggregate(np.mean)
df2 = pd.DataFrame(df2) # appending the above to a data frame
df2 = df2.reset_index() # and resetting the index appropriately
df2
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 25.940840 |
1 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 2 | 25.650976 |
2 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 3 | 25.886124 |
3 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 4 | 25.287054 |
4 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 5 | 22.841311 |
... | ... | ... | ... | ... | ... | ... | ... |
125995 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 8 | 23.764634 |
125996 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 9 | 25.871429 |
125997 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 10 | 26.625412 |
125998 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 11 | 26.701905 |
125999 | 2.833 | -60.70 | Brazil | BOA_VISTAAEROPORT | 2019 | 12 | 26.719643 |
126000 rows × 7 columns
# Add a column to determine whether the point is above or
# below the equator
latitudes = np.array(df2['LATITUDE'])
# above, below, or on the equator?
equator = np.select([latitudes < 0, latitudes > 0],
['below', 'above'], default='on')
df2['Position Relative to Equator (0°N)'] = equator
df2.head()
LATITUDE | LONGITUDE | Country | NAME | Year | Month | Temp | Position Relative to Equator (0°N) | |
---|---|---|---|---|---|---|---|---|
0 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 1 | 25.940840 | below |
1 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 2 | 25.650976 | below |
2 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 3 | 25.886124 | below |
3 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 4 | 25.287054 | below |
4 | -33.517 | -53.35 | Brazil | STVITORIA_DO_PALMAR | 1990 | 5 | 22.841311 | below |
Now that we have a clean data set with all the information we need, we are ready to create our visualization using Plotly!
fig = px.scatter(df2,
title = "Yearly Temperature Trends (1990 - 2020)"
x = "Month",
y = "Temp",
color = "Year",
width = 800,
height = 400,
opacity = 0.5,
facet_col = "Position Relative to Equator (0°N)")
# reduce whitespace
fig.update(layout={"title": {"x": 0.5, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0})
# show the plot
fig.show()
This visualization is quite useful, as it depicts several features of the temperature data. For one thing, as the year increases from 1990 to 2020, the average temperature per year also tends to increase (which is expected, given global warming). Additionally, we can clearly see the seasonal temperature cycle over the year (higher temperatures around December-January and lower temperatures around June-July, as we would expect for a country that lies mostly in the Southern Hemisphere). Interestingly, the temperatures do not differ much with latitude - in this case, this is to be expected, since Brazil straddles the equator and largely lies in an equatorial climate zone.
Another way of visualizing the temperature data in Brazil would be through a histogram, as shown below!
fig = px.histogram(df2,
title = "Distribution of Temperatures Below & Above the Equator",
x = "Temp",
color = "Position Relative to Equator (0°N)",
opacity = 0.2,
nbins = 35,
barmode='stack',
width = 700,
height = 500)
# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.98}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
Interestingly, we note that the temperatures above the equator seem to be a little higher than the temperatures below the equator - an observation we were not able to make from the previous visualization! We also note that temperatures close to 26°C are by far the most common, which is to be expected, since Brazil straddles the equator.
We might also consider viewing the temperature data using a box plot, as shown below!
fig = px.box(df2,
title = "Temperature Distributions Below & Above the Equator Over Ten Years",
x = "Year",
y = "Temp",
color = "Position Relative to Equator (0°N)",
width = 1000,
height = 400)
# reduce whitespace
fig.update(layout={"title": {"x": 0.7, "y": 0.97}})
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
# show the plot
fig.show()
The above interactive visualization is quite useful, as we can clearly see the upward trend in average temperatures - again, this is to be expected, as average temperatures have risen on account of global warming. However, unlike the histogram above, this plot does not let us easily see that the temperatures at stations above the equator might be slightly higher than those at stations below the equator.
As far as this particular project was concerned, undoubtedly the most tedious part involved manipulating and querying the appropriate data. And though this part of the process was a tad challenging, there is no doubt that generating visualizations - even interactive ones - is far easier when the data has been cleaned and queried appropriately.
Additionally, I found querying the data a little challenging in the sense that it was important for me to plan out my visualizations before extracting the data. Often, I had to re-query the data because I had not acquired everything I needed for the visualization. To be sure, a lesson learned is to plan out your visualizations in detail before querying the data and constructing the plots.
Disclaimer
This blog post was created for my PIC16B class at UCLA (Previously titled: Blog Post 1 - Conquering Databases and Interactive Visualizations ft. NOAA Climate Data Set).