Section: Quantitative data | Data Visualization on Seaborn

Definition
  
  What are quantitative data?
  
  Quantitative (or numerical) data are, as the name suggests, measurable data expressed as numbers.
  
  We will present and test the different types of graphs that can be created with Seaborn.

Relplot

The relplot() function allows you to create scatter plots and line plots.

Here is the function signature:

Signature relplot

There is of course documentation available online, so we will only go over the most essential elements in order to display what we need as quickly as possible, namely:

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
hue	Maps a variable to the color of the points	String corresponding to a variable	hue="age"
size	Maps a variable to the size of the points	String corresponding to a variable	size="money"
style	Maps a variable to the marker style of the points	String corresponding to a variable	style="sex"
row	Creates a grid of plots, with one row per value of the variable	String corresponding to a variable	row="category"
col	Creates a grid of plots, with one column per value of the variable	String corresponding to a variable	col="job"
kind	Type of plot you want	String corresponding to a type	kind="scatter" or kind="line"

Here is an example with the following code:

import matplotlib.pyplot as plt
import seaborn as sns

data = sns.load_dataset("penguins")
sns.relplot(data=data, x="bill_length_mm", y="bill_depth_mm", hue="species", style="sex", size="body_mass_g", col="island")
plt.show()

Which produces the following result:

relplot scatter plot

We can see that col allows us to create different plots within the same figure.
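To make the role of col more concrete, here is a minimal sketch on a small synthetic table (the column names and values are invented for illustration, not taken from the penguins dataset). It relies on relplot() returning a FacetGrid, whose axes_dict attribute maps each facet value to its subplot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for a real dataset (hypothetical column names).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "weight": rng.normal(70, 10, 90),
    "height": rng.normal(170, 8, 90),
    "job": np.repeat(["baker", "nurse", "pilot"], 30),
})

# col="job" asks for one subplot per distinct value of "job".
g = sns.relplot(data=toy, x="weight", y="height", col="job", kind="scatter")

# relplot returns a FacetGrid; axes_dict maps each facet value to its Axes.
facet_names = sorted(g.axes_dict)
n_facets = g.axes.size
print(facet_names, n_facets)
```

Each entry of g.axes_dict can then be customized individually, for example g.axes_dict["nurse"].set_title("Nurses").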

We can also add ellipses to relplot() scatter plots; to draw on them, we need to retrieve the axis (ax):

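import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.patches import Ellipse

df = sns.load_dataset("penguins")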
df = df[["species", "bill_length_mm", "body_mass_g"]].dropna()

g = sns.relplot(
    data=df,
    x="bill_length_mm",
    y="body_mass_g",
    hue="species",
    kind="scatter",
    height=5
)
ax = g.ax

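def add_confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    # Draw the covariance ellipse of (x, y) at n_std standard deviations.
    cov = np.cov(x, y)
    mean = np.mean(x), np.mean(y)
    # The eigenvectors give the ellipse axes, the eigenvalues their variances.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = eigvals.argsort()[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Orientation of the major axis, in degrees.
    angle = np.degrees(np.arctan2(*eigvecs[:, 0][::-1]))
    # Full axis lengths: n_std standard deviations on each side of the mean.
    width, height = 2 * n_std * np.sqrt(eigvals)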
    ellipse = Ellipse(
        xy=mean,
        width=width,
        height=height,
        angle=angle,
        fill=False,
        **kwargs
    )
    ax.add_patch(ellipse)

palette = sns.color_palette()

for i, species in enumerate(df["species"].unique()):
    subset = df[df["species"] == species]
    add_confidence_ellipse(
        subset["bill_length_mm"],
        subset["body_mass_g"],
        ax,
        edgecolor=palette[i],
        linewidth=2
    )

ax.set_xlabel("Bill length (mm)")
ax.set_ylabel("Body mass (g)")
ax.set_title("2-standard-deviation ellipse by species (penguins)")

plt.show()

Ellipse comes from matplotlib.patches.

And here is the result of this code:

ellipse plot
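The n_std argument decides how much of each group the ellipse encloses. As a rough guide, and assuming each group is approximately bivariate normal (an assumption made here for illustration), the enclosed fraction is 1 - exp(-n_std² / 2). The helper names below are hypothetical; the sketch also checks the formula by simulation.

```python
import numpy as np

def coverage(n_std):
    """Fraction of a bivariate normal inside the n_std covariance ellipse."""
    return 1.0 - np.exp(-n_std**2 / 2.0)

def n_std_for(level):
    """Scale factor whose ellipse contains `level` of a bivariate normal."""
    return np.sqrt(-2.0 * np.log(1.0 - level))

cov_2std = coverage(2.0)    # coverage of the n_std=2.0 ellipse
n_std_95 = n_std_for(0.95)  # scale factor needed for 95% coverage

# Monte Carlo check: share of standard normal points within radius 2.
rng = np.random.default_rng(0)
pts = rng.standard_normal((200_000, 2))
empirical = np.mean(np.sum(pts**2, axis=1) <= 2.0**2)

print(round(cov_2std, 3), round(n_std_95, 3), round(empirical, 3))
```

Under this assumption, n_std=2.0 encloses roughly 86% of the points, and an ellipse meant to enclose 95% would need n_std close to 2.45.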

We can also display lines by changing the kind:

df = sns.load_dataset("penguins")
sns.relplot(data=df,x="bill_length_mm",y="bill_depth_mm",hue="species",style="sex",col="island",kind="line")
plt.show()

size is not used here: with line plots it would set the width of the lines rather than the size of points. Here is the result of the code:

line plot

Displot

The displot() function allows you to display different types of distributions.

signature displot

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
hue	Maps a variable to different colors	String corresponding to a variable	hue="age"
row	Creates a grid of plots, with one row per value of the variable	String corresponding to a variable	row="category"
col	Creates a grid of plots, with one column per value of the variable	String corresponding to a variable	col="job"
kind	Type of plot you want	String corresponding to a type	kind="hist", kind="kde" or kind="ecdf"
rug	Shows individual data points along the axes	Boolean	rug=True

Here is an example code that creates a histogram:

data = sns.load_dataset("penguins")
sns.displot(data=data, x="bill_length_mm", rug=True, hue="sex", bins=20)
plt.show()

If you don’t specify the data for the y-axis, it will represent the number of occurrences, and if you don’t specify the kind, it defaults to a histogram. The bins argument controls the number of bars.
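To see what bins=20 means numerically, the following sketch counts occurrences per bin with NumPy on synthetic values. It only illustrates the binning idea; it is not a claim about how Seaborn computes its histogram internally.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(44, 5, 300)  # synthetic stand-in for a numeric column

# 20 equal-width bins between the minimum and the maximum of the data.
counts, edges = np.histogram(values, bins=20)

print(len(counts), len(edges), counts.sum())
```

Twenty bins are delimited by twenty-one edges, and the counts add up to the number of observations, which is what the bar heights represent when no y variable is given.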

We also have access to kernel density estimation (KDE) to estimate a distribution. Here is an example of how to use it:

data = sns.load_dataset("penguins")
sns.displot(data=data,x="bill_length_mm", rug=True, hue="sex", kind="kde")
plt.show()

one-dimensional KDE
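As an illustration of the idea behind KDE, the sketch below builds a density estimate by hand: one Gaussian bump is centered on each observation and the bumps are averaged. The data are synthetic and the bandwidth follows Scott's rule, which is assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, 200)  # synthetic observations

# Bandwidth from Scott's rule (an assumption made for this sketch).
n = sample.size
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid slightly wider than the data.
grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 500)
z = (grid[:, None] - sample[None, :]) / bandwidth
kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
density = kernels.mean(axis=1)  # average of the per-observation bumps

# A density should integrate to about 1 (simple Riemann sum).
dx = grid[1] - grid[0]
area = density.sum() * dx
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve.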

If you specify a variable for the y-axis:

data = sns.load_dataset("penguins")
sns.displot(data=data,x="bill_length_mm", y="bill_depth_mm", rug=True, hue="sex", kind="kde")
plt.show()

two-dimensional KDE

The rug parameter allows you to display individual observations along the axes of the plot.

The last type of distribution available is the ECDF (empirical cumulative distribution function). You cannot specify a y variable for this distribution since it is univariate.

data = sns.load_dataset("penguins")
sns.displot(data=data, x="body_mass_g", rug=True, hue="sex", kind="ecdf", row="species", col="sex", height=5)
plt.show()

ECDF

The row parameter allows you to display additional plots based on another variable in the dataset. The height parameter controls the height of the plots.
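For readers unfamiliar with the ECDF, here is a minimal hand computation on a few made-up values: at each point, the curve gives the proportion of observations less than or equal to that value. The helper ecdf_at is a hypothetical name used only for this sketch.

```python
import numpy as np

masses = np.array([3200.0, 3800.0, 3500.0, 4100.0, 3500.0])  # made-up values

# Sorted values on the x-axis, cumulative proportions on the y-axis.
x = np.sort(masses)
y = np.arange(1, x.size + 1) / x.size

def ecdf_at(value):
    """Proportion of observations less than or equal to `value`."""
    return np.mean(masses <= value)

print(x, y, ecdf_at(3500.0))
```

The curve always starts at 0 to the left of the smallest value and reaches 1 at the largest, which is why it needs no y variable.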

Boxplot and Violinplot

A common graphical representation of data is the box plot, which can be accessed using boxplot().

signature boxplot

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
hue	Maps a variable to different colors	String corresponding to a variable	hue="age"
dodge	Whether the boxes of the different hue groups are placed side by side (True) or overlap (False)	Boolean	dodge=False
width	Controls the width of the boxes	Float	width=0.5
gap	Controls the gap between the boxes	Float	gap=0.1

Here is an example of code:

data = sns.load_dataset("penguins")
sns.boxplot(data=data, x="bill_length_mm", hue="island", dodge=True, width=0.5)
plt.show()

boxplot without gap

By default, gap is set to 0. Seaborn infers the orientation automatically; when both x and y are given, it can be set manually with the orient parameter.

data = sns.load_dataset("penguins")
sns.boxplot(data=data, x="bill_length_mm", hue="island", dodge=True, width=0.5, gap=0.1, log_scale=True)
plt.show()

boxplot gap

log_scale puts the axis on a logarithmic scale. True uses base 10, and a numeric value sets the base. If the plot is two-dimensional, a pair of values can be provided, one for each axis.
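As background for reading the box plots above, the sketch below computes by hand the quantities a box plot summarizes: the quartiles, the whiskers, and the points flagged as outliers. It assumes the common convention of whiskers reaching the last observation within 1.5 times the interquartile range, and uses made-up values.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)  # made-up data

# The box spans the first to the third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers stop at the last observations within 1.5 * IQR of the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = values[(values >= low_fence) & (values <= high_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual point.
outliers = values[(values < low_fence) | (values > high_fence)]
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here the value 100 lies far beyond the upper fence, so it would appear as an isolated point rather than stretch the whisker.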

The violin plot is also accessible via violinplot().

signature violinplot

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
hue	Maps a variable to different colors	String corresponding to a variable	hue="age"
inner	Chooses the representation drawn inside the violin	String corresponding to a representation type	inner="box", inner="quart", inner="point"
split	Whether to show two hue groups as the two halves of the same violin	Boolean	split=True
width	Controls the width of the violins	Float	width=0.5
dodge	Whether the violins of the different hue groups are placed side by side (True) or overlap (False)	Boolean	dodge=False
gap	Controls the gap between the violins when dodge is True	Float	gap=0.1

Here is an example of code:

data = sns.load_dataset("penguins")
sns.violinplot(data=data, x="bill_length_mm", hue="sex", dodge=True, linewidth=3, split=True, inner="point")
plt.show()

violinplot point

We can choose to display a small box plot within the violin plot:

data = sns.load_dataset("penguins")
sns.violinplot(data=data, x="bill_length_mm", hue="sex",dodge=True,linewidth=3, split=True, inner="box")
plt.show()

violin plot box

Regplot and Lmplot

If you want to perform linear regressions, Seaborn provides a dedicated function: regplot().

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
ci	Size of the confidence interval displayed, in percent	Integer between 0 and 100, or None	ci=99
n_boot	Number of bootstrap resamples used to estimate the confidence interval	Integer	n_boot=100
seed	Seed for the resampling, which makes the result reproducible	Integer	seed=42
logistic	Fits a logistic regression	Boolean	logistic=True
lowess	Fits a LOWESS regression	Boolean	lowess=True
robust	Fits a robust regression	Boolean	robust=True

regplot() also allows you to display the confidence interval, which is set to 95% by default.

Here is an example of code:

sns.regplot(data=data, x="bill_length_mm",y="bill_depth_mm", ci=70)
plt.show()

regplot simple
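To show how ci, n_boot and seed fit together, here is a simplified sketch of a bootstrap confidence interval, applied to the slope of a straight-line fit on synthetic data. It illustrates the resampling idea only and is not Seaborn's actual implementation; the variable names mirror the parameters for readability.

```python
import numpy as np

rng = np.random.default_rng(42)  # plays the role of seed=42
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 2.0, x.size)  # synthetic data, true slope 2

# Ordinary least-squares line on the full sample.
slope, intercept = np.polyfit(x, y, deg=1)

n_boot, ci = 1000, 95
boot_slopes = np.empty(n_boot)
for b in range(n_boot):
    # Resample the (x, y) pairs with replacement and refit the line.
    idx = rng.integers(0, x.size, x.size)
    boot_slopes[b], _ = np.polyfit(x[idx], y[idx], deg=1)

# Percentile interval: the middle ci% of the bootstrap estimates.
lower, upper = np.percentile(boot_slopes, [(100 - ci) / 2, 100 - (100 - ci) / 2])
print(round(slope, 2), round(lower, 2), round(upper, 2))
```

A larger n_boot gives a more stable interval at the cost of computation time, and fixing the seed makes the interval identical from one run to the next.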

We can change the type of regression by setting the corresponding parameter to True, for example lowess:

sns.regplot(data=data, x="bill_length_mm", y="bill_depth_mm", ci=99, lowess=True)
plt.show()

lowess regplot

The confidence interval is not displayed when using LOWESS, so ci has no effect here. LOWESS, like the robust and logistic options, requires the statsmodels package.

Another option is lmplot(), which is more suitable for performing regressions across multiple plots.

signature lmplot

Parameter name	Description	Format	Example
data	The dataframe you are working on	DataFrame, Series, dict, array, or list of arrays	data=table
x	Variable for the x-axis	String corresponding to a variable	x="weight"
y	Variable for the y-axis	String corresponding to a variable	y="height"
hue	Maps a variable to different colors	String corresponding to a variable	hue="age"
row	Creates a grid of plots, with one row per value of the variable	String corresponding to a variable	row="category"
col	Creates a grid of plots, with one column per value of the variable	String corresponding to a variable	col="job"
ci	Size of the confidence interval displayed, in percent	Integer between 0 and 100, or None	ci=99
n_boot	Number of bootstrap resamples used to estimate the confidence interval	Integer	n_boot=100
lowess	Fits a LOWESS regression	Boolean	lowess=True

Here is an example of code:

sns.lmplot(data=data, x="bill_length_mm", y="bill_depth_mm", ci=95, hue="island", robust=True, col="sex")
plt.show()

lmplot

Robust and logistic regressions are also available, as with regplot(). n_boot and seed are also available.
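The following sketch combines faceting with the bootstrap parameters on a small synthetic table (column names and values are invented for illustration). It assumes lmplot() returns a FacetGrid, as the other figure-level functions do.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: a noisy linear relation observed in two groups.
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, 80),
    "group": np.tile(["a", "b"], 40),
})
toy["y"] = 1.5 * toy["x"] + rng.normal(0, 1, 80)

# One regression panel per group, 90% interval from 200 seeded resamples.
g = sns.lmplot(data=toy, x="x", y="y", col="group", ci=90, n_boot=200, seed=42)
n_panels = g.axes.size
print(n_panels)
```

Because the seed is fixed, the shaded interval is the same on every run.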

Heatmap and Clustermap

Seaborn also allows you to create heatmaps using heatmap().

Parameter name	Description	Format	Example
data	The rectangular (2D) dataset to display	DataFrame or 2D array	data=table
cmap	Heatmap colors: either a Matplotlib colormap name or a custom colormap	String corresponding to a colormap, or a Seaborn color_palette	cmap="viridis" or cmap=sns.color_palette("light:blue", as_cmap=True)
annot	Whether to display the values inside the cells	Boolean	annot=True (default is False)
vmin	Minimum value taken into account for the colormap	Float	vmin=30.6
vmax	Maximum value taken into account for the colormap	Float	vmax=42
linecolor	Color of the lines between the cells	String corresponding to a color	linecolor="blue"
linewidths	Thickness of the lines between the cells	Float	linewidths=0.2 or linewidths=10
mask	Hides the cells where the mask is True	Boolean array or DataFrame with the same shape as data	mask=table_mask

Here is an example of code:

glue=sns.load_dataset("glue").pivot(index="Model",columns="Task",values="Score")
sns.heatmap(glue)

We use pivot to format the data in the order we want:

index defines the variable for the y-axis (ordinates).
columns defines the variable for the x-axis (abscissas).
values must be a numerical variable, and it is what the heatmap will use for coloring.
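The reshaping done by pivot can be seen on a tiny long-format table. The model names, task names and scores below are invented for illustration and are not taken from the glue dataset.

```python
import pandas as pd

# Long format: one row per (Model, Task) pair.
long_df = pd.DataFrame({
    "Model": ["A", "A", "B", "B"],
    "Task": ["t1", "t2", "t1", "t2"],
    "Score": [70.0, 55.5, 81.2, 64.3],
})

# Wide format: one row per Model, one column per Task, Score in the cells.
wide = long_df.pivot(index="Model", columns="Task", values="Score")
print(wide)
```

The resulting table has exactly the layout heatmap() expects: row labels on the y-axis, column labels on the x-axis, and numeric cells to color.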

heatmap simple

sns.heatmap(glue,cmap="viridis",annot=True,vmin=20,vmax=80,linecolor="red",linewidths=0.1)

With the vmin and vmax parameters, we choose the range of values covered by the colormap; values outside that range take the color of the nearest end. linecolor and linewidths are visual options for the lines between the cells.

modified heatmap
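The mask parameter listed in the table is not used in the examples above. A common use, sketched here on a synthetic correlation matrix (the data and column names are invented), is to hide the upper triangle, which duplicates the lower one.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(4)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = toy.corr()  # symmetric 4x4 correlation matrix

# True marks the cells to hide: here everything above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

ax = sns.heatmap(corr, mask=mask, annot=True, vmin=-1, vmax=1, cmap="viridis")
n_hidden = int(mask.sum())
print(n_hidden)
```

The mask must have the same shape as the data, and only cells where it is False are drawn.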

If we need clustering in the heatmap, we can use Seaborn’s clustermap(). This function requires SciPy, so it must be installed in the environment you are working in. On Colab no installation is needed, since SciPy is already available there.

signature clustermap

Parameter name	Description	Format	Example
data	The rectangular (2D) dataset to cluster	DataFrame or 2D array	data=table
method	SciPy linkage method used to perform clustering	String corresponding to a SciPy method	method="centroid"
metric	SciPy distance metric used for clustering	String corresponding to a SciPy metric	metric="jaccard"
z_score	Centers and standardizes the data (z-scores)	0 to standardize rows, 1 to standardize columns	z_score=0
standard_scale	Rescales the data by subtracting the minimum and dividing by the maximum	0 to rescale rows, 1 to rescale columns	standard_scale=1
row_cluster, col_cluster	Parameters used to choose the clustering axes.	Boolean	row_cluster=False
figsize	Parameter controlling the size of the figure.	tuple (width, height)	figsize=(4,4)
dendrogram_ratio	Parameter controlling the size ratio of the dendrograms.	tuple (row ratio, column ratio)	dendrogram_ratio=(0.2,0.1)
cbar_pos	Parameter controlling the position of the color bar.	tuple (left, bottom, width, height)	cbar_pos=(0,0.1,0.05,0.6)

Here is an example of code:

iris = sns.load_dataset("iris")
species = iris.pop("species")
sns.clustermap(iris)

basic clustermap

Now let’s explore different parameters:

lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)
sns.clustermap(iris,row_cluster=True,dendrogram_ratio=(0.2,0.1),row_colors=row_colors,method="weighted",metric="correlation",z_score=1,annot=True,figsize=(3,9),cbar_pos=(0,0.1,0.02,0.8))

row_cluster groups rows by similarity in order to reveal clusters.
dendrogram_ratio controls the share of the figure given to the dendrograms: the first value applies to the one on the left, and the second to the one at the top.
row_colors adds a color indicator next to the rows; with the settings above, it shows the species of each row.
metric defines the similarity (distance) measure used, and method specifies the algorithm used for clustering.
Setting z_score to 1 standardizes the data column by column.
cbar_pos sets the position of the color bar.

more complex clustermap
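To make the difference between z_score and standard_scale concrete, here is a minimal sketch of the two column-wise transformations in pandas, on made-up values. It assumes that z_score=1 amounts to subtracting each column's mean and dividing by its standard deviation, and that standard_scale=1 amounts to min-max rescaling of each column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "length": [5.0, 6.0, 7.0, 8.0],
    "width": [1.0, 1.5, 3.0, 2.5],
})  # made-up measurements

# Column-wise z-scores: each column gets mean 0 and standard deviation 1.
z_cols = (toy - toy.mean()) / toy.std()

# Column-wise min-max scaling: each column is mapped onto the range 0 to 1.
scaled_cols = (toy - toy.min()) / (toy.max() - toy.min())

print(z_cols.round(2))
print(scaled_cols.round(2))
```

Either transformation keeps a column with large values from dominating the distances used for clustering.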
