Data Visualization with `pandas`

In this tutorial, we will learn how to use the pandas, seaborn and matplotlib.pyplot packages to manipulate data and present a simple data visualization with the Palmer Penguins data set.

Introduction

First, we will import the packages and read in the data:

import pandas as pd
import numpy as np
# The following packages contain useful plotting tools:
import seaborn as sns
from matplotlib import pyplot as plt
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/palmer_penguins.csv"
penguins = pd.read_csv(url)

A sample of the data has the following format:

penguins.head(3)
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN

Data Manipulation

For the purpose of this visualization, we will change the species’ name to their shortened, informal name. This will make all data visualizations look cleaner (ie. in the legend on our plots, we will have “Adelie” rather than “Adelie Penguin (Pygoscelis adeliae)”, making it much neater).

penguins["Species"] = penguins["Species"].replace({"Adelie Penguin (Pygoscelis adeliae)":"Adelie", 
                            "Chinstrap penguin (Pygoscelis antarctica)": "Chinstrap",
                            "Gentoo penguin (Pygoscelis papua)":"Gentoo"})
penguins.head(3)

The “Species” column is now changed to what we expected:

studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 11/11/07 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 11/11/07 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 11/16/07 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN

Creating A Visualization

For the data visualization, we will be examining how the bill length and bill depth compare against different species of penguins. To do so, we will create a scatterplot using seaborn and matplotlib.pyplot, as well as creating a line of best fit for each species to show a correlation between Bill Length and Bill Depth. Note that Culmen is another word for the penguin’s bill, so to make the visual a bit more readable, we change all instances of “Culmen” to “Bill” in the last line of the following code block.

# Set the size of the plot to (x,y),
# where x is width, y is height:
plt.subplots(figsize=(8,5))

# Make the scatterplot with data from the penguins DataFrame,
# using the listed columns as the x, y axes:
# Hue denotes what differentiates the dots, and s is dot size.
sns.scatterplot(data=penguins,
                x = "Culmen Length (mm)", 
                y = "Culmen Depth (mm)", 
                hue = "Species", 
                s = 30)

# Create a best fit line for each species to show a correlation:
# ci=False means to not show a range of certainty in the plot.
# penguins["Species"].unique() gives a list of the (3) unique species.
for species in penguins["Species"].unique():
    sns.regplot(data=penguins[penguins["Species"]== species],
                x = "Culmen Length (mm)", 
                y = "Culmen Depth (mm)",
                ci=False)

# Rename the title, xaxis, yaxis respectively:
plt.gca().set(title="Bill Length vs. Bill Depth of Adelie,\n Chinstrap and Gentoo Penguin Species", 
              xlabel="Bill Length (mm)", ylabel = "Bill Depth (mm)")

PalmerPenguins_plot.jpg

And there we have it! We now have a simple data visualization showing the correlation between Bill Length and Depth for each of the three penguin species!

Though this is a rather simple and straightfoward visualization, pandas and matplotlib.pyplot allow for an extraordinary amount of different kinds of visuals one can produce.

Written on October 5, 2021