Section outline

        • What are quantitative data? 

          Quantitative (numerical) data, as the name suggests, are measurable values expressed as numbers.

          This section presents and tests the different types of plots that Seaborn can create for such data.

           

        • The relplot() function allows you to create scatter plots and line plots.

          Here is the function signature:

          [Image: relplot() signature]

          The full documentation is available online, so we only cover the most essential parameters here, in order to display what we need as quickly as possible:

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          hue | Maps a variable to the color of the points | String naming a variable | hue="age"
          size | Maps a variable to the size of the points | String naming a variable | size="money"
          style | Maps a variable to the marker style of the points | String naming a variable | style="sex"
          row | Creates a grid of plots, with one row per value of the variable | String naming a variable | row="category"
          col | Creates a grid of plots, with one column per value of the variable | String naming a variable | col="job"
          kind | Type of plot you want | String naming a plot type | kind="scatter" or kind="line"

          Here is an example with the following code:

          import matplotlib.pyplot as plt
          import seaborn as sns

          data = sns.load_dataset("penguins")
          sns.relplot(data=data, x="bill_length_mm", y="bill_depth_mm", hue="species", style="sex", size="body_mass_g", col="island")
          plt.show()

          Which produces the following result:

          [Figure: relplot() scatter plot]

          We can see that col allows us to create different plots within the same figure.

          We can also draw ellipses on a relplot() scatter plot. To draw on the figure, we first need to retrieve its Matplotlib axes (ax):

          import matplotlib.pyplot as plt
          import numpy as np
          import seaborn as sns
          from matplotlib.patches import Ellipse

          df = sns.load_dataset("penguins")
          df = df[["species", "bill_length_mm", "body_mass_g"]].dropna()

          g = sns.relplot(
              data=df,
              x="bill_length_mm",
              y="body_mass_g",
              hue="species",
              kind="scatter",
              height=5
          )
          ax = g.ax  # single Axes of the figure (no row/col facets here)

          def add_confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
              # Draw the covariance ellipse of (x, y) at n_std standard deviations
              cov = np.cov(x, y)
              mean = np.mean(x), np.mean(y)
              # Eigenvectors give the ellipse axes, eigenvalues their variances
              eigvals, eigvecs = np.linalg.eigh(cov)
              order = eigvals.argsort()[::-1]
              eigvals, eigvecs = eigvals[order], eigvecs[:, order]
              angle = np.degrees(np.arctan2(*eigvecs[:, 0][::-1]))
              width, height = 2 * n_std * np.sqrt(eigvals)
              ellipse = Ellipse(
                  xy=mean,
                  width=width,
                  height=height,
                  angle=angle,
                  fill=False,
                  **kwargs
              )
              ax.add_patch(ellipse)

          palette = sns.color_palette()

          # One ellipse per species, using the same color order as hue
          for i, species in enumerate(df["species"].unique()):
              subset = df[df["species"] == species]
              add_confidence_ellipse(
                  subset["bill_length_mm"],
                  subset["body_mass_g"],
                  ax,
                  edgecolor=palette[i],
                  linewidth=2
              )

          ax.set_xlabel("Bill length (mm)")
          ax.set_ylabel("Body mass (g)")
          # n_std=2.0 covers about 86% of a bivariate normal sample, not 95%
          ax.set_title("2-standard-deviation ellipse by species (penguins)")

          plt.show()

          Ellipse comes from matplotlib.patches.

          And here is the result of this code:

          [Figure: scatter plot with one ellipse per species]
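A note on the n_std argument used above, as a minimal sketch that assumes the points of each group are roughly bivariate normal: under that assumption, the share of points expected inside an ellipse drawn at n_std standard deviations is 1 - exp(-n_std^2 / 2). The two helper names below are illustrative and belong neither to Seaborn nor to Matplotlib.

```python
import math

def ellipse_coverage(n_std):
    """Expected share of a bivariate normal sample inside an n_std ellipse."""
    return 1.0 - math.exp(-n_std**2 / 2.0)

def n_std_for_coverage(p):
    """Number of standard deviations needed to cover a share p of the points."""
    return math.sqrt(-2.0 * math.log(1.0 - p))

print(round(ellipse_coverage(2.0), 4))     # 0.8647
print(round(n_std_for_coverage(0.95), 4))  # 2.4477
```

So, under this normality assumption, n_std=2.0 encloses about 86% of the points, and a 95% ellipse needs n_std of roughly 2.45.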


          We can also draw lines by changing kind:

          df = sns.load_dataset("penguins")
          sns.relplot(data=df, x="bill_length_mm", y="bill_depth_mm", hue="species", style="sex", col="island", kind="line")
          plt.show()

          size is not used in this example: with kind="line", it would control the width of the lines rather than the size of points. Here is the result of the code:

          [Figure: relplot() line plot]

           

           
           
        • The displot() function allows you to display different types of distributions.

          [Image: displot() signature]

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          hue | Maps a variable to different colors | String naming a variable | hue="age"
          row | Creates a grid of plots, with one row per value of the variable | String naming a variable | row="category"
          col | Creates a grid of plots, with one column per value of the variable | String naming a variable | col="job"
          kind | Type of plot you want | String naming a plot type | kind="hist", kind="kde" or kind="ecdf"
          rug | Shows individual observations as ticks along the axes | Boolean | rug=True

          Here is an example that creates a histogram:

          data = sns.load_dataset("penguins")
          sns.displot(data=data, x="bill_length_mm", rug=True, hue="sex", bins=20)
          plt.show()

          [Figure: histogram]

          If you do not specify a variable for the y-axis, it shows the number of occurrences, and if you do not specify kind, displot() draws a histogram. The bins argument sets the number of bins (bars).
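To make the role of bins concrete, here is a small sketch using NumPy's own histogram function on synthetic values (an assumption of this example, standing in for a numeric column). It only illustrates what the bars count and is not a description of Seaborn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a numeric column such as a bill length
values = rng.normal(loc=44.0, scale=5.0, size=300)

counts, edges = np.histogram(values, bins=20)

print(len(counts))   # 20 bars
print(len(edges))    # 21 bin edges
print(counts.sum())  # 300: every observation falls in exactly one bin
```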


          We also have access to kernel density estimation (KDE) to estimate a distribution. Here is an example of how to use it:

          data = sns.load_dataset("penguins")
          sns.displot(data=data, x="bill_length_mm", rug=True, hue="sex", kind="kde")
          plt.show()

          [Figure: one-dimensional KDE]

          If you also specify a variable for the y-axis, the KDE becomes two-dimensional:

          data = sns.load_dataset("penguins")
          sns.displot(data=data, x="bill_length_mm", y="bill_depth_mm", rug=True, hue="sex", kind="kde")
          plt.show()

          [Figure: two-dimensional KDE]

          The rug parameter displays individual observations as small ticks along the axes of the plot.
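The idea behind a one-dimensional KDE can be sketched by hand: each observation contributes a small Gaussian bump, and the estimate is the average of those bumps. The function name, the synthetic data, and the bandwidth rule (Scott's rule of thumb) are assumptions of this sketch, not a reproduction of what Seaborn computes.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)

rng = np.random.default_rng(1)
samples = rng.normal(loc=44.0, scale=5.0, size=200)  # synthetic data
grid = np.linspace(samples.min() - 20.0, samples.max() + 20.0, 2000)

# Scott's rule of thumb for the bandwidth (an assumption of this sketch)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to about 1 (checked with a simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve.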


           

          The last kind of distribution available is the ECDF (empirical cumulative distribution function). It is univariate, so you cannot specify both an x and a y variable.

          data = sns.load_dataset("penguins")
          sns.displot(data=data, x="body_mass_g", rug=True, hue="sex", kind="ecdf", row="species", col="sex", height=5)
          plt.show()

          [Figure: ECDF]

          The row parameter allows you to display additional plots based on another variable in the dataset. The height parameter controls the height of the plots.
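What an ECDF represents can be shown with a few lines of NumPy: for each sorted value, the curve gives the proportion of observations that are less than or equal to it. The function name and the four sample values below are illustrative only.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the proportion of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x, y = ecdf([4200, 3500, 5000, 3900])
print(x.tolist())  # [3500.0, 3900.0, 4200.0, 5000.0]
print(y.tolist())  # [0.25, 0.5, 0.75, 1.0]
```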

           

           
           
        • A common graphical representation of data is the box plot, which can be accessed using boxplot().

          [Image: boxplot() signature]

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          hue | Maps a variable to different colors | String naming a variable | hue="age"
          dodge | Whether the boxes of the different hue levels are shifted side by side (True) or overlap (False) | Boolean | dodge=False
          width | Width of the boxes | Float | width=0.5
          gap | Gap between the different boxes | Float | gap=0.1

          Here is an example of code:

          data = sns.load_dataset("penguins")
          sns.boxplot(data=data, x="bill_length_mm", hue="island", dodge=True, width=0.5)
          plt.show()
          

          [Figure: box plot without gap]

          By default, gap is set to 0. Seaborn infers the orientation automatically, but when both x and y are given, it can be chosen manually with the orient parameter.

          data = sns.load_dataset("penguins")
          sns.boxplot(data=data, x="bill_length_mm", hue="island", dodge=True, width=0.5, gap=0.1, log_scale=True)
          plt.show()

          [Figure: box plot with gap]

          log_scale switches the axis to a logarithmic scale. True uses base 10, and a numeric value sets the base. If the plot is two-dimensional, a pair of values can be provided, one for each axis.
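For reference, the quantities a box plot summarizes can be computed by hand. The sketch below assumes the usual convention of whiskers reaching the most extreme observations within 1.5 times the interquartile range of the box; the function name and the sample values are illustrative.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers as a box plot would show them."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    # Whiskers stop at the most extreme observations within whis * IQR of the box
    low = values[values >= q1 - whis * iqr].min()
    high = values[values <= q3 + whis * iqr].max()
    outliers = values[(values < low) | (values > high)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whiskers": (float(low), float(high)),
        "outliers": outliers.tolist(),
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats["q1"], stats["median"], stats["q3"])  # 3.25 5.5 7.75
print(stats["whiskers"])                          # (1.0, 9.0)
print(stats["outliers"])                          # [30.0]
```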


          The violin plot is also accessible via violinplot().

          [Image: violinplot() signature]

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          hue | Maps a variable to different colors | String naming a variable | hue="age"
          inner | Representation drawn inside the violin | String naming a representation type | inner="box", inner="quart", inner="point"
          split | Whether to show two data groups on the two halves of the same violin | Boolean | split=True
          width | Width of the violins | Float | width=0.5
          dodge | Whether the violins of the different hue levels are shifted side by side (True) or overlap (False) | Boolean | dodge=False
          gap | Gap between the different violins when dodge is True | Float | gap=0.1

          Here is an example of code:

          data = sns.load_dataset("penguins")
          sns.violinplot(data=data, x="bill_length_mm", hue="sex", dodge=True, linewidth=3, split=True, inner="point")
          plt.show()

          [Figure: violin plot with inner="point"]

          We can choose to display a small box plot within the violin plot:

          data = sns.load_dataset("penguins")
          sns.violinplot(data=data, x="bill_length_mm", hue="sex", dodge=True, linewidth=3, split=True, inner="box")
          plt.show()

          [Figure: violin plot with inner="box"]

           
           
        • If you want to perform linear regressions, Seaborn provides a dedicated function: regplot().

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          ci | Size of the confidence interval displayed, in percent | Integer between 0 and 100, or None | ci=99
          n_boot | Number of bootstrap resamples used to estimate the confidence interval | Integer | n_boot=100
          seed | Seed for the resampling, for reproducibility | Integer | seed=42
          logistic | Fits a logistic regression | Boolean | logistic=True
          lowess | Fits a LOWESS regression | Boolean | lowess=True
          robust | Fits a robust regression | Boolean | robust=True

          regplot() also allows you to display the confidence interval, which is set to 95% by default.


          Here is an example of code:

          sns.regplot(data=data, x="bill_length_mm", y="bill_depth_mm", ci=70)
          plt.show()

          [Figure: simple regplot()]

          We can change the type of regression by setting the corresponding parameter to True, for example lowess:

          sns.regplot(data=data, x="bill_length_mm", y="bill_depth_mm", ci=99, lowess=True)
          plt.show()

          [Figure: regplot() with LOWESS]

          The confidence interval is not displayed when using LOWESS.
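To see how ci, n_boot and seed interact for an ordinary linear fit, here is a hand-written bootstrap sketch in NumPy. The function name, the synthetic data and the percentile method are assumptions of this example; it illustrates the principle rather than reproducing Seaborn's code.

```python
import numpy as np

def bootstrap_regression_band(x, y, grid, ci=95, n_boot=1000, seed=None):
    """Percentile band of the fitted line over bootstrap resamples of (x, y)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    preds = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        preds[b] = slope * grid + intercept
    lower_q = (100 - ci) / 2
    lower, upper = np.percentile(preds, [lower_q, 100 - lower_q], axis=0)
    return lower, upper

# Synthetic data with a known linear trend plus noise
rng = np.random.default_rng(0)
x = rng.uniform(30, 60, size=80)
y = 25 - 0.1 * x + rng.normal(scale=1.5, size=80)
grid = np.linspace(30, 60, 50)

low70, high70 = bootstrap_regression_band(x, y, grid, ci=70, n_boot=200, seed=42)
low99, high99 = bootstrap_regression_band(x, y, grid, ci=99, n_boot=200, seed=42)
print((high70 - low70).mean() < (high99 - low99).mean())  # True
```

A larger ci widens the band, a larger n_boot makes its edges more stable at the cost of computation time, and a fixed seed makes the result identical from one run to the next.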


          Another option is lmplot(), which is more suitable for performing regressions across multiple plots.

          [Image: lmplot() signature]

          Parameter | Description | Format | Example
          data | The dataset you are working on | DataFrame, Series, dict, array, or list of arrays | data=table
          x | Variable for the x-axis | String naming a variable | x="weight"
          y | Variable for the y-axis | String naming a variable | y="height"
          hue | Maps a variable to different colors | String naming a variable | hue="age"
          row | Creates a grid of plots, with one row per value of the variable | String naming a variable | row="category"
          col | Creates a grid of plots, with one column per value of the variable | String naming a variable | col="job"
          ci | Size of the confidence interval displayed, in percent | Integer between 0 and 100, or None | ci=99
          n_boot | Number of bootstrap resamples used to estimate the confidence interval | Integer | n_boot=100
          lowess | Fits a LOWESS regression | Boolean | lowess=True
          Here is an example of code:

          sns.lmplot(data=data, x="bill_length_mm", y="bill_depth_mm", ci=95, hue="island", robust=True, col="sex")
          plt.show()

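          [Figure: lmplot()]

          Robust and logistic regressions are also available, as with regplot(), and so are n_boot and seed. The robust, logistic, and lowess options rely on the statsmodels package, which must be installed.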

           
           
           
        • Seaborn also allows you to create heatmaps using heatmap().

          Parameter | Description | Format | Example
          data | The dataset you are working on | 2D dataset such as a DataFrame or an array | data=table
          cmap | Heatmap colors, either a Matplotlib palette or a custom one | String naming a palette, or a Seaborn color_palette | cmap="viridis" or cmap=sns.color_palette("light:blue", as_cmap=True)
          annot | Whether to display the values inside the cells | Boolean | annot=True (default is False)
          vmin | Minimum value taken into account for the colormap | Float | vmin=30.6
          vmax | Maximum value taken into account for the colormap | Float | vmax=42
          linecolor | Color of the lines between the cells | String naming a color | linecolor="blue"
          linewidths | Thickness of the lines between the cells | Float | linewidths=0.2 or linewidths=10
          mask | Hides cells: data is not shown where the mask is True | Boolean array or DataFrame with the same shape as data | mask=table_mask

          Here is an example of code:

          glue = sns.load_dataset("glue").pivot(index="Model", columns="Task", values="Score")
          sns.heatmap(glue)

          We use pivot to reshape the data into the matrix layout we want:

          • index defines the variable for the y-axis (ordinates).
          • columns defines the variable for the x-axis (abscissas).
          • values must be a numerical variable; it is what the heatmap uses for coloring.

          [Figure: simple heatmap]
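The effect of pivot is easier to see on a tiny table. In this sketch the model names, task names and scores are made up for illustration; only the column names mirror the ones used above.

```python
import pandas as pd

# Long format: one row per (Model, Task) pair, with made-up scores
long_table = pd.DataFrame({
    "Model": ["A", "A", "B", "B"],
    "Task": ["T1", "T2", "T1", "T2"],
    "Score": [70.0, 55.5, 81.2, 60.3],
})

# Wide format: one row per Model, one column per Task, Score in the cells
wide_table = long_table.pivot(index="Model", columns="Task", values="Score")
print(wide_table)
print(wide_table.shape)  # (2, 2)
```

Each cell of the wide table then becomes one colored cell of the heatmap.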

          sns.heatmap(glue, cmap="viridis", annot=True, vmin=20, vmax=80, linecolor="red", linewidths=0.1)
          

          With the vmin and vmax parameters, we can choose the range of values over which the heatmap will apply, and we also have visual options such as linecolor and linewidths for the lines between the cells.

          [Figure: customized heatmap]
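The mask parameter is not used in the examples above. A common use, sketched here on a synthetic correlation matrix (the data and column names are assumptions of this example), is to hide the redundant upper triangle. The mask is built with NumPy only; it would then be passed to the heatmap as mask=mask.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
table = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("abcd"))
corr = table.corr()  # symmetric 4 x 4 matrix

# True means "hide this cell": np.triu keeps the diagonal and everything above it
mask = np.triu(np.ones(corr.shape, dtype=bool))
print(mask)
print(int(mask.sum()))  # 10 hidden cells out of 16
```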


          If we need clustering in the heatmap, we can use Seaborn's clustermap(). This function requires SciPy, so it must be installed in the environment you are working in. On Colab no installation is needed, since SciPy can be imported directly.

          [Image: clustermap() signature]

          Parameter | Description | Format | Example
          data | The dataset you are working on | 2D dataset such as a DataFrame or an array | data=table
          method | SciPy linkage method used to perform the clustering | String naming a SciPy method | method='centroid'
          metric | SciPy distance metric used for the clustering | String naming a SciPy metric | metric='jaccard'
          z_score | Centers and standardizes the data (subtract the mean, divide by the standard deviation) | 0 to standardize rows, 1 to standardize columns | z_score=0
          standard_scale | Rescales the data to the range 0 to 1 (subtract the minimum, divide by the maximum) | 0 to rescale rows, 1 to rescale columns | standard_scale=1
          row_cluster, col_cluster | Whether to cluster the rows and the columns | Boolean | row_cluster=False
          figsize | Size of the figure | tuple (width, height) | figsize=(4,4)
          dendrogram_ratio | Share of the figure given to the dendrograms | tuple (row ratio, column ratio) | dendrogram_ratio=(0.2,0.1)
          cbar_pos | Position of the color bar | tuple (left, bottom, width, height) | cbar_pos=(0,0.1,0.05,0.6)


           

          Here is an example of code:

          iris = sns.load_dataset("iris")
          species = iris.pop("species")
          sns.clustermap(iris)

          [Figure: basic clustermap]
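The row ordering of a clustermap comes from hierarchical clustering. The following sketch calls SciPy directly on synthetic data, with method="average" and metric="euclidean" (the defaults documented for clustermap()), to show how a dendrogram reorders rows. It illustrates the principle only; the data are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of rows, interleaved on purpose
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 3))
group_b = rng.normal(loc=10.0, scale=0.1, size=(5, 3))
data = np.empty((10, 3))
data[0::2] = group_a  # even row numbers
data[1::2] = group_b  # odd row numbers

# Build the dendrogram, then read the row order along its leaves
tree = linkage(data, method="average", metric="euclidean")
order = leaves_list(tree)
print(order)
```

In the printed order, the even-numbered rows end up next to each other and so do the odd-numbered ones, which is the regrouping a clustermap makes visible.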

          Now let’s explore different parameters:

          lut = dict(zip(species.unique(), "rbg"))
          row_colors = species.map(lut)
          sns.clustermap(iris, row_cluster=True, dendrogram_ratio=(0.2, 0.1), row_colors=row_colors, method="weighted", metric="correlation", z_score=1, annot=True, figsize=(3, 9), cbar_pos=(0, 0.1, 0.02, 0.8))

          row_cluster groups rows by similarity in order to reveal clusters. dendrogram_ratio controls the share of the figure given to the dendrograms: the first value applies to the row dendrogram on the left, and the second to the column dendrogram at the top. row_colors adds a color indicator next to the rows; here, it shows the species of each row. metric defines the distance measure used, and method specifies the linkage algorithm used for clustering. Setting z_score to 1 standardizes each column (0 would standardize each row). cbar_pos sets the position of the color bar.

          [Figure: more complex clustermap]
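The difference between z_score and standard_scale is easiest to see on numbers. This sketch reproduces the arithmetic described in the parameter table with pandas, applied per column (the equivalent of passing 1); the small table is made up, and the code is an illustration rather than Seaborn's own implementation.

```python
import pandas as pd

table = pd.DataFrame({"length": [1.0, 2.0, 3.0], "width": [10.0, 20.0, 60.0]})

# Per-column z-score: subtract the mean, divide by the standard deviation
z = (table - table.mean()) / table.std()

# Per-column standard scale: subtract the minimum, divide by the resulting maximum
scaled = (table - table.min()) / (table.max() - table.min())

print(z["length"].tolist())  # [-1.0, 0.0, 1.0]
print(scaled)
```

After the z-score each column has mean 0 and standard deviation 1, whereas after standard scaling each column runs from 0 to 1.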