Visualizing the Palmer Penguins Dataset

In this post, I will outline how to construct a data visualization of the Palmer Penguins data set.

Downloading the Data

This first step might seem a little obvious, but before constructing any sort of visualization of a data set, we must first download and read in the data. To do so, I will use the pandas package and run the following lines of code:

import pandas as pd
import tabulate
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

Now, we may check the penguins data set we have read into Python. This step is not strictly necessary, but it is good practice to view the data set to make sure it was read in correctly.

penguins.head()
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 11/16/07 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 11/16/07 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
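Beyond head(), a quick look at the shape of the table and the number of missing values per column can help confirm that the file was parsed as expected. The following is a minimal sketch, assuming a tiny made-up table called sample (not the real data) so that it runs without a network connection; the same methods could be called on penguins.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the penguins table (values are illustrative only)
sample = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo"],
    "Culmen Length (mm)": [39.1, np.nan, 47.2],
    "Sex": ["MALE", None, "FEMALE"],
})

# (number of rows, number of columns)
print(sample.shape)

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Columns with many missing values are worth noting at this stage, since they affect how many rows survive the cleaning steps later on.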

Cleaning the Dataset

Once we have read in the data set, we must clean it before we start constructing a data visualization. This might involve removing rows with NaN values, dropping unnecessary or constant columns, and improving readability.
Specifically for the Palmer Penguins data set, I will be dropping the following columns: studyName, Sample Number, Individual ID, Clutch Completion, Date Egg, and Comments. Useful information could potentially be derived from these variables, but I do not need it for the visualization I have in mind.

columns_to_drop = ['studyName', 'Sample Number', 'Individual ID', 'Clutch Completion', 'Date Egg', 'Comments']
# creating a new sub-data set with the above columns dropped
pen = penguins.drop(columns_to_drop, axis=1)
# viewing the new data set
pen.head()
Species Region Island Stage Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage 39.1 18.7 181.0 3750.0 MALE NaN NaN
1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage NaN NaN NaN NaN NaN NaN NaN
4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
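As a side note, drop also accepts a columns= keyword, which some readers may find easier to read than axis=1. Here is a small sketch on a made-up two-row table (toy is a hypothetical name, and only a few of the original column names are used):

```python
import pandas as pd

# Hypothetical two-row table with a few of the original column names
toy = pd.DataFrame({
    "studyName": ["PAL0708", "PAL0708"],
    "Species": ["Adelie", "Adelie"],
    "Comments": [None, "Adult not sampled."],
})

# columns= names the axis explicitly and is equivalent to passing axis=1
trimmed = toy.drop(columns=["studyName", "Comments"])
print(list(trimmed.columns))

# drop returns a new DataFrame; the original table is left unchanged
print(list(toy.columns))
```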

Now that we have dropped some of the unnecessary columns, we need to search for any constant columns (i.e., columns with only one unique value) that do not add any useful information to the data set.

columns_to_drop = []
for column in pen.columns:
    # If a column has only one unique value (NaN is not counted),
    # record it as a constant column
    if pen[column].nunique() == 1:
        columns_to_drop.append(column)
# Viewing the constant columns
columns_to_drop
['Region', 'Stage']
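The same search can be written without an explicit loop, since nunique can also be called on a whole DataFrame. This is only a sketch on a made-up table (toy and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table: Region is constant, Stage is constant apart from a missing value
toy = pd.DataFrame({
    "Region": ["Anvers", "Anvers", "Anvers"],
    "Island": ["Torgersen", "Biscoe", "Dream"],
    "Stage": ["Adult, 1 Egg Stage", np.nan, "Adult, 1 Egg Stage"],
})

# DataFrame.nunique() counts distinct values per column, ignoring NaN by default
unique_counts = toy.nunique()

# Keep the names of the columns that have exactly one distinct value
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```

Because NaN is ignored by default, a column holding one value plus some missing entries is still reported as constant. Passing dropna=False to nunique would count NaN as its own value instead.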

Since Region and Stage are constant columns, we may also remove those columns from the data set as follows.

pen = pen.drop(columns_to_drop, axis=1)
# Viewing the data set
pen.head()
Species Island Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.1 18.7 181.0 3750.0 MALE NaN NaN
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
3 Adelie Penguin (Pygoscelis adeliae) Torgersen NaN NaN NaN NaN NaN NaN NaN
4 Adelie Penguin (Pygoscelis adeliae) Torgersen 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426

The next step in the data cleaning process is to remove any rows containing NaN values from the data set. Luckily for us, there is a very handy pandas method, dropna, that lets us do this:

pen = pen.dropna()
pen.head()
Species Island Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
1 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
2 Adelie Penguin (Pygoscelis adeliae) Torgersen 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
4 Adelie Penguin (Pygoscelis adeliae) Torgersen 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
5 Adelie Penguin (Pygoscelis adeliae) Torgersen 39.3 20.6 190.0 3650.0 MALE 8.66496 -25.29805
6 Adelie Penguin (Pygoscelis adeliae) Torgersen 38.9 17.8 181.0 3625.0 FEMALE 9.18718 -25.21799
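One thing to keep in mind is that dropna() with no arguments removes a row if any of its columns is missing. If only a few columns matter for the plot, the subset parameter restricts the check to those columns. The sketch below uses a made-up table (toy is hypothetical) to show the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical table: the first row is missing only the isotope measurement
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5, np.nan],
    "Culmen Depth (mm)": [18.7, 17.4, np.nan],
    "Delta 15 N (o/oo)": [np.nan, 8.94956, np.nan],
})

# Drops every row that has a NaN in any column
dropped_any = toy.dropna()

# Drops only rows that have a NaN in the two culmen columns
dropped_subset = toy.dropna(subset=["Culmen Length (mm)", "Culmen Depth (mm)"])

print(len(dropped_any), len(dropped_subset))
```

In this post I drop on all columns, which is the simpler choice; the subset version is just an option when keeping more rows matters.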

Now, it would be a good idea to reset the index of the data set, as we have dropped a good number of rows and the remaining index labels have gaps.

pen = pen.reset_index(drop=True)
pen.tail() # checking the last few rows of the data set
Species Island Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
320 Gentoo penguin (Pygoscelis papua) Biscoe 47.2 13.7 214.0 4925.0 FEMALE 7.99184 -26.20538
321 Gentoo penguin (Pygoscelis papua) Biscoe 46.8 14.3 215.0 4850.0 FEMALE 8.41151 -26.13832
322 Gentoo penguin (Pygoscelis papua) Biscoe 50.4 15.7 222.0 5750.0 MALE 8.30166 -26.04117
323 Gentoo penguin (Pygoscelis papua) Biscoe 45.2 14.8 212.0 5200.0 FEMALE 8.24246 -26.11969
324 Gentoo penguin (Pygoscelis papua) Biscoe 49.9 16.1 213.0 5400.0 MALE 8.36390 -26.15531
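To see what reset_index(drop=True) is doing here, consider this small sketch on a made-up column (toy is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
toy = pd.DataFrame({"Body Mass (g)": [3750.0, np.nan, 3250.0, np.nan, 3450.0]})

# After dropna, the surviving rows keep their old labels, so the index has gaps
cleaned = toy.dropna()
print(list(cleaned.index))

# drop=True discards the old labels instead of keeping them as a new column
renumbered = cleaned.reset_index(drop=True)
print(list(renumbered.index))
```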

Now that we have dropped all the columns and rows we needed to and reset the index, the last step in our data cleaning process is to improve the readability of our data set. Specifically, I want to improve the readability of the columns with string data: Species, Island, and Sex. Consider the unique elements of the Species column:

pen['Species'].unique() #extracting the unique elements of the species column
array(['Adelie Penguin (Pygoscelis adeliae)',
       'Chinstrap penguin (Pygoscelis antarctica)',
       'Gentoo penguin (Pygoscelis papua)'], dtype=object)

Notice that the individual strings are rather long. Instead, we could replace these strings with something a bit more readable. For instance, I will replace 'Adelie Penguin (Pygoscelis adeliae)' with 'Adelie', and so on.

# We will replace the appropriate Species names with 'Adelie', 'Chinstrap', and 'Gentoo'
pen['Species'] = pen['Species'].replace('Adelie Penguin (Pygoscelis adeliae)', 'Adelie')
pen['Species'] = pen['Species'].replace('Chinstrap penguin (Pygoscelis antarctica)', 'Chinstrap')
pen['Species'] = pen['Species'].replace('Gentoo penguin (Pygoscelis papua)', 'Gentoo')
pen['Species'].unique() # checking the unique species
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
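The three replace calls above could also be collapsed into one. Below is a sketch of two possible alternatives on a stand-alone Series (species, short_names, by_mapping, and by_split are names I made up for illustration):

```python
import pandas as pd

species = pd.Series([
    "Adelie Penguin (Pygoscelis adeliae)",
    "Chinstrap penguin (Pygoscelis antarctica)",
    "Gentoo penguin (Pygoscelis papua)",
])

# Option 1: a single replace call with a dictionary of old -> new values
short_names = {
    "Adelie Penguin (Pygoscelis adeliae)": "Adelie",
    "Chinstrap penguin (Pygoscelis antarctica)": "Chinstrap",
    "Gentoo penguin (Pygoscelis papua)": "Gentoo",
}
by_mapping = species.replace(short_names)

# Option 2: keep only the first whitespace-separated word of each string
by_split = species.str.split().str[0]

print(by_mapping.tolist())
print(by_split.tolist())
```

The dictionary version is explicit about every value it changes, while the split version relies on the short name always being the first word, which happens to hold for these three strings.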

Similar to what we have done with the Species column, we will also investigate the unique elements of the Island and Sex columns.

# Checking the unique elements of the Island column
pen['Island'].unique()
array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

The readability of the unique elements in the Island column is pretty good as it is, so we do not need to make any changes!

# Checking the unique elements of the Sex column
pen['Sex'].unique()
array(['FEMALE', 'MALE', '.'], dtype=object)

Here we have the standard 'FEMALE' and 'MALE' sex elements, but we also have a third element, '.'. For the sake of simplicity, I will be replacing the '.' element with NaN and removing the corresponding rows from the data set.

# Replacing '.' in the Sex column with NaN for simplicity
pen['Sex'] = pen['Sex'].replace('.', float('NaN'))
pen = pen.dropna() # removing the NaN values
pen # Checking the dataset
Species Island Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
0 Adelie Torgersen 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454
1 Adelie Torgersen 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302
2 Adelie Torgersen 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426
3 Adelie Torgersen 39.3 20.6 190.0 3650.0 MALE 8.66496 -25.29805
4 Adelie Torgersen 38.9 17.8 181.0 3625.0 FEMALE 9.18718 -25.21799
... ... ... ... ... ... ... ... ... ...
320 Gentoo Biscoe 47.2 13.7 214.0 4925.0 FEMALE 7.99184 -26.20538
321 Gentoo Biscoe 46.8 14.3 215.0 4850.0 FEMALE 8.41151 -26.13832
322 Gentoo Biscoe 50.4 15.7 222.0 5750.0 MALE 8.30166 -26.04117
323 Gentoo Biscoe 45.2 14.8 212.0 5200.0 FEMALE 8.24246 -26.11969
324 Gentoo Biscoe 49.9 16.1 213.0 5400.0 MALE 8.36390 -26.15531

324 rows × 9 columns
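Note that the output above still ends at index 324 while reporting 324 rows, which suggests that one index label is now skipped. An alternative sketch, assuming a made-up table (toy is hypothetical), filters on the accepted values directly and resets the index in the same step:

```python
import pandas as pd

# Hypothetical table with one unrecorded sex value
toy = pd.DataFrame({
    "Species": ["Adelie", "Gentoo", "Gentoo"],
    "Sex": ["FEMALE", ".", "MALE"],
})

# Keep only rows whose Sex is one of the two expected values,
# then renumber the rows so the index has no gaps
filtered = toy[toy["Sex"].isin(["FEMALE", "MALE"])].reset_index(drop=True)
print(filtered)
```

This avoids the round trip through NaN, and it only touches rows based on the Sex column rather than on missing values anywhere in the table.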

We have now successfully cleaned our data set!

Constructing the Visualization

The first step before constructing the visualization is to import the relevant visualization packages: Matplotlib and Seaborn.

import seaborn as sns
from matplotlib import pyplot as plt

The type of visualization I am interested in constructing involves a few of the characteristic features of penguins: Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Delta 15 N (o/oo), Delta 13 C (o/oo). I will have to pick two of these features for my visualization. To do so, I will construct a pairplot of the data set and choose the two features with the most interesting set of data clusters:

sns.set_theme(style="ticks")
sns.pairplot(pen, hue="Species")
plt.suptitle('Palmer Penguins Pairplot', x=0.5, y=1.0)
plt.savefig('palmer-penguins-pairplot.png')

palmer-penguins-pairplot.png

The pairplot is one useful way of visualizing the Palmer Penguins data set, as it shows the relationships between the various features or columns. However, while we retain the information on the Species of penguins, we end up leaving out the Island and Sex information.
After studying the pairplot, I thought that the plot between Culmen Length (mm) and Culmen Depth (mm) was pretty cool, as it has three distinct clusters, one corresponding to each species. As such, I have chosen those two features for my visualization.

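# relplot creates its own figure, so a separate plt.figure call is not needed
g = sns.relplot(
    data=pen, x="Culmen Length (mm)", y="Culmen Depth (mm)",
    col="Island", hue="Species", style="Sex",
    kind="scatter"
)
plt.suptitle('Classification of Palmer Penguins by Island, Sex, and Species', x=0.5, y=1.05)
# bbox_inches='tight' keeps the title, which sits above the figure's top edge, in the saved image
plt.savefig('palmer-penguins-classification.png', bbox_inches='tight')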

palmer-penguins-classification.png
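Which species show up in which panel can also be cross-checked with a table of counts. The sketch below uses pd.crosstab on a made-up table (toy and its rows are assumptions that only mirror the species and island names, not the real counts):

```python
import pandas as pd

# Hypothetical rows, one per species/island combination seen in the panels
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "Island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Dream"],
})

# Rows are species, columns are islands, cells are the number of penguins
counts = pd.crosstab(toy["Species"], toy["Island"])
print(counts)
```

A zero in a cell means that species was not recorded on that island, which corresponds to a species missing from that panel of the plot.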

With the final visualization above, I have three plots (one for each Island) depicting the relationship between Culmen Length (mm) and Culmen Depth (mm). I have also varied the markers to differentiate between the Sex of the penguins.
We are able to extract some interesting insights from this particular visualization; for instance, only Adelie penguins are found on Torgersen Island, Adelie and Gentoo are found on Biscoe Island, and Adelie and Chinstrap are found on Dream Island. We are also able to extract useful insights about the various Species of penguins from the data clusters!
The best part about this visualization is that we could swap the x and y axes for other features and extract useful insights from those clusters as well!
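One way to make that swap easy is to wrap the plotting call in a small function that takes the two feature names as arguments. This is only a sketch: plot_feature_pair is a hypothetical helper, the toy table and its values are made up, and the Agg backend is selected just so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import pandas as pd
import seaborn as sns


def plot_feature_pair(data, x, y):
    """Scatter x against y with one panel per island (hypothetical helper)."""
    return sns.relplot(
        data=data, x=x, y=y,
        col="Island", hue="Species", style="Sex",
        kind="scatter",
    )


# Hypothetical rows with the same column names as the cleaned data set
toy = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "Island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "Sex": ["FEMALE", "MALE", "FEMALE", "MALE"],
    "Flipper Length (mm)": [186.0, 190.0, 214.0, 196.0],
    "Body Mass (g)": [3800.0, 3650.0, 4925.0, 3700.0],
})

# Any two numeric columns can be passed in place of the culmen measurements
grid = plot_feature_pair(toy, "Flipper Length (mm)", "Body Mass (g)")
print(grid.axes.shape)
```

With the cleaned data set, the same function could be called on pen with any pair of numeric columns, and the returned grid could then be titled and saved in the same way as the figure above.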

Something I found really troublesome in this project was the data exploration and cleaning phase. However, though it was a tad bit taxing, I found that it really helped me in the end as I was able to generate my data visualization super easily. To be sure, a lesson learned the hard way is to always properly clean (and profile) the data set before attempting anything else!

Disclaimer
This blog post was created for my PIC16B class at UCLA (Previously titled: Blog Post 0 - Visualizing the Palmer Penguins Dataset).

Written on October 5, 2021