Visualizing the Palmer Penguins Dataset
In this post I will outline how to construct a data visualisation of the Palmer Penguins data set.
Downloading the Data
This first step might seem a little obvious, but the first thing we have to do before constructing any sort of data visualisation of a data set, we must first download and read in the data. To do so, I will be used the Pandas package and running the following lines of code:
import pandas as pd
import tabulate
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)
Now, we may check the penguins dataset we have read into Python. This step may not be wholly necessary, however it is good practice to view the data set to make sure it is read in correctly.
penguins.head()
studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
Cleaning the Dataset
Once we have read in the data set, we must clean it before we start constructing a data visualization. This might involve removing rows with NaN
values, dropping unecessary or constant columns, and improving readability.
Specifically for the Palmer Penguins data set, I will be dropping the following columns: studyName, Sample Number, Individual ID, Clutch Completion, Date Egg, and Comments.
The reason for doing so is because even though useful information could be potentially derived from these variables, I personally do not find this information as useful.
columns_to_drop = ['studyName', 'Sample Number', 'Individual ID', 'Clutch Completion', 'Date Egg', 'Comments']
# creating a new sub-data set with the above columns dropped
pen = penguins.drop(columns_to_drop, axis=1)
# viewing the new data set
pen.head()
Species | Region | Island | Stage | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN |
1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 |
2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 |
3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 |
Now that we have dropped some of the unecessary columns, we now need to search for any constant columns (i.e., columns with only one unique value) that do not add any useful information to the data set
columns_to_drop = []
for column in pen.columns:
# If the number of unique elements in a column
# is equal to 1, drop the column
if pen[column].nunique() == 1:
columns_to_drop.append(column)
# Viewing the constant columns
columns_to_drop
['Region', 'Stage']
Since Region
and Stage
are constant columns, we may also remove those columns from the data set as follows.
pen = pen.drop(columns_to_drop, axis=1)
# Viewing the data set
pen.head()
Species | Island | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | |
---|---|---|---|---|---|---|---|---|---|
0 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN |
1 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 |
2 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 |
3 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 |
The next step in the data cleaning process is to remove any NaN
values from the data set. Luckily for us, there is a very handy pandas function i.e., dropna
that let’s us do this:
pen = pen.dropna()
pen.head()
Species | Island | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | |
---|---|---|---|---|---|---|---|---|---|
1 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 |
2 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 |
4 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 |
5 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE | 8.66496 | -25.29805 |
6 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | FEMALE | 9.18718 | -25.21799 |
Now, it would be a good idea to reset the index of the data set as we have dropped a good number of rows
pen = pen.reset_index(drop=True)
pen.tail() # checking the last few rows of the data set
Species | Island | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | |
---|---|---|---|---|---|---|---|---|---|
320 | Gentoo penguin (Pygoscelis papua) | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | FEMALE | 7.99184 | -26.20538 |
321 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE | 8.41151 | -26.13832 |
322 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE | 8.30166 | -26.04117 |
323 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE | 8.24246 | -26.11969 |
324 | Gentoo penguin (Pygoscelis papua) | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE | 8.36390 | -26.15531 |
Now that we have dropped all the columns and rows we needed to and reset the index, the last step in our data cleaning process is to improve upon the readability of our dataset. Specifically, I want to improve on the readability of the columns with string data: Species, Island, and Sex
columns. Consider the unique elements of the Species
column
pen['Species'].unique() #extracting the unique elements of the species column
array(['Adelie Penguin (Pygoscelis adeliae)',
'Chinstrap penguin (Pygoscelis antarctica)',
'Gentoo penguin (Pygoscelis papua)'], dtype=object)
Notice that the individual strings are rather long. Instead, we could replace these strings with something a bit more readable. For instance, I would replace 'Adelie Penguin (Pygoscelis adeliae)'
with 'Adelie'
, and so on.
# We will replace the appropriate Species names with 'Adelie', 'Chinstrap', and 'Gentoo'
pen['Species'] = pen['Species'].replace('Adelie Penguin (Pygoscelis adeliae)', 'Adelie')
pen['Species'] = pen['Species'].replace('Chinstrap penguin (Pygoscelis antarctica)', 'Chinstrap')
pen['Species'] = pen['Species'].replace('Gentoo penguin (Pygoscelis papua)', 'Gentoo')
pen['Species'].unique() # checking the unique species
array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)
Similar to what we have done with the Species
column, we will also investigate the unique elements of the Island
and Sex
columns.
# Checking the unqiue elements of the Island column
pen['Island'].unique()
array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)
The readability of the unique elements in the Island
column is pretty good as it is, so we do not need to make any changes!
# Checking the unique elements of the Sex column
pen['Sex'].unique()
array(['FEMALE', 'MALE', '.'], dtype=object)
Here we have the standard 'FEMALE'
and 'MALE'
sex elements, but we also have a third element '.'
. For the sake of simplicity, I will be replacing the '.'
element with NaN
and removing the appropriate rows from the data set
# Removing all the '.'from the Sex column for simplicity
pen['Sex'] = pen['Sex'].replace('.', float('NaN'))
pen = pen.dropna() # removing the NaN values
pen # Checking the dataset
Species | Island | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | |
---|---|---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 |
1 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 |
2 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 |
3 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | MALE | 8.66496 | -25.29805 |
4 | Adelie | Torgersen | 38.9 | 17.8 | 181.0 | 3625.0 | FEMALE | 9.18718 | -25.21799 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
320 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | FEMALE | 7.99184 | -26.20538 |
321 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE | 8.41151 | -26.13832 |
322 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE | 8.30166 | -26.04117 |
323 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE | 8.24246 | -26.11969 |
324 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | MALE | 8.36390 | -26.15531 |
324 rows × 9 columns
We have now successfully cleaned our data set!
Constructing the Visualization
The first step before constructing the visualization is to import the relevant visualization packages: Matplotlib, Seaborn, etc.
import seaborn as sns
from matplotlib import pyplot as plt
The type of visualization I am interested in constructing involves a few of the characteristic features of penguins: Culmen Length (mm), Culmen Depth (mm), Flipper Length (mm), Body Mass (g), Delta 15 N (o/oo), Delta 13 C (o/oo)
. I will have to pick two of these features for my visualization. To do so, I will construct a pairplot of the data set and choose the two features with the most interesting set of data clusters:
sns.set_theme(style="ticks")
sns.pairplot(pen, hue="Species")
plt.suptitle('Palmer Penguins Pairplot', x=0.5, y=1.0)
plt.savefig('palmer-penguins-pairplot.png')
The pairplot is one useful way of visualizaing the Palmer Penguins data set as it shows data relationships between the various features or columns. However, while we retain the information on the Species
of penguins, we also do end up leaving out the Island
and Sex
information.
After studying the pairplot, I thought that the plot between Culmen Length (mm)
and Culmen Depth (mm)
was pretty cool as it has three distinct clusters, one corresponding to each species. As such, I have chosen those two features for my visualization.
plt.figure(figsize=(20, 10))
fig = sns.relplot(
data=pen, x="Culmen Length (mm)", y="Culmen Depth (mm)",
col="Island", hue="Species", style="Sex",
kind="scatter"
)
plt.suptitle('Classification of Palmer Penguins by Island, Sex, and Species', x=0.5,y=1.05)
plt.savefig('palmer-penguins-classification.png')
As such, with the above final visualization, I have three plots (one corresponding to each Island
) depicting the relationship between Culmen Length (mm)
and Culmen Depth (mm)
. Additionally, I have also changed up the markers to differentiate between the Sex
of the penguins.
We are able to extract some interesting insights from this particular visualization; for instance, we note that only Adelie penguins are found in the Torgersen Island, Adelie and Gentoo are found in Biscoe Island, and Adelie and Chinstrap are found on Dream Island. We are also able to extract useful insights about the various Species of penguins from the data clusters!
The best part about this visualization is that we could switch up the x
and y
axes with other features and extract useful insights from those clusters as well!
Something I found really troublesome in this project was the data exploration and cleaning phase. However, though it was a tad bit taxing, I found that it really helped me in the end as I was able to generate my data visualization super easily. To be sure, a lesson learned the hard way is to always properly clean (and profile) the data set before attempting anything else!
Disclaimer
This blog post was created for my PIC16B class at UCLA (Previously titled: Blog Post 0 - Visualizing the Palmer Penguins Dataset).