{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "oZ511soZ8yH3",
      "metadata": {
        "id": "oZ511soZ8yH3"
      },
      "source": [
        "# **Data Visualization on Seaborn**\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "---\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "PMlRa4bO9igp",
      "metadata": {
        "id": "PMlRa4bO9igp"
      },
      "source": [
        "# Introduction\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "00ec26b2",
      "metadata": {
        "id": "00ec26b2"
      },
      "source": [
        "In order to use seaborn we will first import the necessary libraries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a0640849",
      "metadata": {
        "id": "a0640849"
      },
      "outputs": [],
      "source": [
        "import seaborn as sns\n",
        "import seaborn.objects as so\n",
        "import matplotlib.pyplot as plt\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "from matplotlib.patches import Ellipse"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8a76a386",
      "metadata": {
        "id": "8a76a386"
      },
      "source": [
        "We will load a dataset from the freely available datasets provided by seaborn."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6c0ba412",
      "metadata": {
        "id": "6c0ba412"
      },
      "outputs": [],
      "source": [
        "data=sns.load_dataset(\"penguins\")\n",
        "print(data.isnull())\n",
        "print(data.isnull().any())"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e78894dd",
      "metadata": {
        "id": "e78894dd"
      },
      "source": [
        "If we want to work with data points containing all the variables, we can use the dropna function in order to get rid of incomplete data points"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "20cfdb49",
      "metadata": {
        "id": "20cfdb49"
      },
      "outputs": [],
      "source": [
        "data.dropna(subset=[\"body_mass_g\"])"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "yu0zO1IZ9zCw",
      "metadata": {
        "id": "yu0zO1IZ9zCw"
      },
      "source": [
        "# Numerical plots\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "363c0bdd",
      "metadata": {
        "id": "363c0bdd"
      },
      "source": [
        "We will first work with relplots. These are the kind of plot allowing us to make cloud points graphics, and seaborn allows us to choose the x-axis and y-axis variables, and add representation of other variables thanks to the hue, style and size parameters. The col parameter allows us to separate the data in different graphics depending on a specific varaible."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6188b0a4",
      "metadata": {
        "id": "6188b0a4"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.relplot(data=data,x=\"bill_length_mm\",y=\"bill_depth_mm\",hue=\"species\",style=\"sex\",size=\"body_mass_g\",col=\"island\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "05c34982",
      "metadata": {
        "id": "05c34982"
      },
      "source": [
        "We have 3 graphics, one for each island, the color represents the specie, the type of point the sex and the size of the point the body_mass.\n",
        "\n",
        "If we want some statistical representation of our data in addition to the points, we can use matplot lib to plot a conficence ellipse. The following function add_confidence_ellipse uses the Ellipse from matplotlib in order to plot it. The n_std parameter of the function allow us to choose the confidence interval we want."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d0cf097c",
      "metadata": {
        "id": "d0cf097c"
      },
      "outputs": [],
      "source": [
        "df = sns.load_dataset(\"penguins\")\n",
        "df = df[[\"species\", \"bill_length_mm\", \"body_mass_g\"]].dropna()\n",
        "\n",
        "g = sns.relplot(\n",
        "    data=df,\n",
        "    x=\"bill_length_mm\",\n",
        "    y=\"body_mass_g\",\n",
        "    hue=\"species\",\n",
        "    kind=\"scatter\",\n",
        "    height=5\n",
        ")\n",
        "ax = g.ax\n",
        "\n",
        "def add_confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):\n",
        "    cov = np.cov(x, y)\n",
        "    mean = np.mean(x), np.mean(y)\n",
        "    eigvals, eigvecs = np.linalg.eigh(cov)\n",
        "    order = eigvals.argsort()[::-1]\n",
        "    eigvals, eigvecs = eigvals[order], eigvecs[:, order]\n",
        "    angle = np.degrees(np.arctan2(*eigvecs[:, 0][::-1]))\n",
        "    width, height = 2 * n_std * np.sqrt(eigvals)\n",
        "    ellipse = Ellipse(\n",
        "        xy=mean,\n",
        "        width=width,\n",
        "        height=height,\n",
        "        angle=angle,\n",
        "        fill=False,\n",
        "        **kwargs\n",
        "    )\n",
        "    ax.add_patch(ellipse)\n",
        "\n",
        "palette = sns.color_palette()\n",
        "\n",
        "for i, species in enumerate(df[\"species\"].unique()):\n",
        "    subset = df[df[\"species\"] == species]\n",
        "    add_confidence_ellipse(\n",
        "        subset[\"bill_length_mm\"],\n",
        "        subset[\"body_mass_g\"],\n",
        "        ax,\n",
        "        edgecolor=palette[i],\n",
        "        linewidth=2\n",
        "    )\n",
        "\n",
        "ax.set_xlabel(\"Bill length (mm)\")\n",
        "ax.set_ylabel(\"Body mass (g)\")\n",
        "ax.set_title(\"95% Confidence Ellipse by Species — Penguins\")\n",
        "\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "620810b1",
      "metadata": {
        "id": "620810b1"
      },
      "source": [
        "We can choose the \"line\" kind in order to represent the data with lines."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "abcf072d",
      "metadata": {
        "id": "abcf072d"
      },
      "outputs": [],
      "source": [
        "df = sns.load_dataset(\"penguins\")\n",
        "sns.relplot(data=df,x=\"bill_length_mm\",y=\"bill_depth_mm\",hue=\"species\",style=\"sex\",col=\"island\",kind=\"line\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "dac5e185",
      "metadata": {
        "id": "dac5e185"
      },
      "source": [
        "The data isn't the most well-suited for using the \"line\" kind, but we notice that the size parameter isn't used anymore, that's because it doesn't work with lines.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "wYNdo3mWDDD5",
      "metadata": {
        "id": "wYNdo3mWDDD5"
      },
      "source": [
        "We can use the displot function of seaborn in order to plot statistical distribution of our data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "H4aqV1o_zONV",
      "metadata": {
        "id": "H4aqV1o_zONV"
      },
      "outputs": [],
      "source": [
        "sns.displot(data=data, x=\"bill_length_mm\", rug=True, hue=\"sex\", bins=20)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "M4Y41U1ozeUw",
      "metadata": {
        "id": "M4Y41U1ozeUw"
      },
      "source": [
        "For exemple with the kind=\"kde\" we get a kernel distribution estimation of the data. The rug parameter allow to show the individual data points on the axis. The bins parameter allows to choose the number of bars that will be displayed. Hence, a higher number means an increase in the accuracy of the distribution plotted."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "eba75690",
      "metadata": {
        "id": "eba75690"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.displot(data=data,x=\"bill_length_mm\", rug=True, hue=\"sex\", kind=\"kde\")\n",
        "plt.show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "5a4557a8",
      "metadata": {
        "id": "5a4557a8"
      },
      "source": [
        "If we assign a variable to the y axis we get a 2D KDE plot. It essentialy works as a level curve. A curve correspond to datapoints with similaire density of probability. The center of a group of curve is a high density zone."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "36fb9c95",
      "metadata": {
        "id": "36fb9c95"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.displot(data=data,x=\"bill_length_mm\", y=\"bill_depth_mm\", rug=True, hue=\"sex\", kind=\"kde\")\n",
        "plt.show()\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "472dcabf",
      "metadata": {
        "id": "472dcabf"
      },
      "source": [
        "The other type of distribution plot is the empirical distribution function(ecdf). This distribution is monovariational, so only the x axis must be assigned."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "8bfcab9d",
      "metadata": {
        "id": "8bfcab9d"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.displot(data=data, x=\"body_mass_g\", rug=True, hue=\"sex\", kind=\"ecdf\", row=\"species\", col=\"sex\", height=5)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8ChAwF7wFGL3",
      "metadata": {
        "id": "8ChAwF7wFGL3"
      },
      "source": [
        "As expected we have 3 species and it is assigned as row so we have 3 rows, and we have 2 sex category so we have 2 columns."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "378ed865",
      "metadata": {
        "id": "378ed865"
      },
      "source": [
        "Another type of graphic is the boxplot, a classical statistical tool. The dodge parameter allow to ensure that the different boxes don't overlap"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "033dc1c0",
      "metadata": {
        "id": "033dc1c0"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.boxplot(data=data, x=\"bill_length_mm\", hue=\"island\", dodge=True, width=0.5)\n",
        "plt.show()\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "DO8wl8D1FZIu",
      "metadata": {
        "id": "DO8wl8D1FZIu"
      },
      "source": [
        "With the use of the dodge parameter, the plots are not overlapping. You can check this by turning dodge to False."
      ]
    },
    {
      "cell_type": "markdown",
      "id": "64ebd4b1",
      "metadata": {
        "id": "64ebd4b1"
      },
      "source": [
        "The gap parameter allows to control the distance between boxes and the log_scale parameter allow to enable the log scale"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "fb8f7460",
      "metadata": {
        "id": "fb8f7460"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.boxplot(data=data, x=\"bill_length_mm\", hue=\"island\", dodge=True, width=0.5, gap=0.1, log_scale=True)\n",
        "plt.show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c091c72e",
      "metadata": {
        "id": "c091c72e"
      },
      "source": [
        "Another classical statistical tool is the violin plot. The inner parameter allows to control the representation of data in the plot.The split parameter allows to put 2 different violin plots on the same one since they are symetrical."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f97f5700",
      "metadata": {
        "id": "f97f5700"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.violinplot(data=data, x=\"bill_length_mm\", hue=\"sex\", dodge=True, linewidth=3, split=True, inner=\"point\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "436f570f",
      "metadata": {
        "id": "436f570f"
      },
      "outputs": [],
      "source": [
        "data = sns.load_dataset(\"penguins\")\n",
        "sns.violinplot(data=data, x=\"bill_length_mm\", hue=\"sex\",dodge=True,linewidth=3, split=True, inner=\"box\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c88cad20",
      "metadata": {
        "id": "c88cad20"
      },
      "source": [
        "The regplots are tools allowing to make regression on data points, with an incertainty interval of the curve visible. Here for example we are confident at 70% that the true curve is within the interval on the graphic."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "92d5c163",
      "metadata": {
        "id": "92d5c163"
      },
      "outputs": [],
      "source": [
        "sns.regplot(data=data, x=\"bill_length_mm\",y=\"bill_depth_mm\", ci=70)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bce25baa",
      "metadata": {
        "id": "bce25baa"
      },
      "source": [
        "Different types of regression are available, here for exemple the lowess type."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "42e18e0e",
      "metadata": {
        "id": "42e18e0e"
      },
      "outputs": [],
      "source": [
        "sns.regplot(data=data, x=\"bill_length_mm\", y=\"bill_depth_mm\",\n",
        " ci=99, lowess=True)\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "91602aa3",
      "metadata": {
        "id": "91602aa3"
      },
      "source": [
        "We can choose a robust regression, which is more precise, but it is more computationally intensive."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "9e59df9c",
      "metadata": {
        "id": "9e59df9c"
      },
      "outputs": [],
      "source": [
        "sns.lmplot(data=data, x=\"bill_length_mm\", y=\"bill_depth_mm\", ci=95, hue=\"island\", robust=True, col=\"sex\")\n",
        "plt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "b91b2b75",
      "metadata": {
        "id": "b91b2b75"
      },
      "source": [
        "Another representation available is the heatmap, which requires a specific configuration for the data, which pivots allows us to get. Index is the y axis, columns the x axis and values is the variable we're looking at. We will use another dataset, glue, which represents scores of AI models on different tasks."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e7025494",
      "metadata": {
        "id": "e7025494"
      },
      "outputs": [],
      "source": [
        "glue=sns.load_dataset(\"glue\").pivot(index=\"Model\",columns=\"Task\",values=\"Score\")\n",
        "sns.heatmap(glue)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3424b97f",
      "metadata": {
        "id": "3424b97f"
      },
      "source": [
        "We can choose the range of values on which the heatmap will be used with vmin and vmax, and choose a colormap and the line colors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a5910782",
      "metadata": {
        "id": "a5910782"
      },
      "outputs": [],
      "source": [
        "sns.heatmap(glue,cmap=\"viridis\",annot=True,vmin=20,vmax=80,linecolor=\"red\",linewidths=0.1)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "31787103",
      "metadata": {
        "id": "31787103"
      },
      "source": [
        "For clustering purposes we have the clustermap. For clustering we need to work with numerical values only."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f06e2421",
      "metadata": {
        "id": "f06e2421"
      },
      "outputs": [],
      "source": [
        "iris = sns.load_dataset(\"iris\")[:100]\n",
        "species = iris.pop(\"species\")\n",
        "sns.clustermap(iris)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3266aa33",
      "metadata": {
        "id": "3266aa33"
      },
      "source": [
        "We can use the row_colors parameter in order to add back the species that we removed from the data, it will allow us to check if species are indeed close together in the clustering, considering that we have access to different clustering methods and results may vary depending on them. method and metric allow us to try different clusterings. Then we have visual parameters such as dendrogram_ratio, figsize and cbar_pos. z_score allows us to choose wherever we can't to calculate the zscore for the rows or the columns."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "072f091c",
      "metadata": {
        "id": "072f091c"
      },
      "outputs": [],
      "source": [
        "lut = dict(zip(species.unique(), \"rbg\"))\n",
        "row_colors = species.map(lut)\n",
        "sns.clustermap(iris,row_cluster=True,dendrogram_ratio=(0.3,0.1),row_colors=row_colors,method=\"weighted\",metric=\"correlation\",z_score=1,annot=True,figsize=(7,12),cbar_pos=(0,0.1,0.02,0.8))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "wMbexlD5-HD5",
      "metadata": {
        "id": "wMbexlD5-HD5"
      },
      "source": [
        "# Multiple graphics on a figure\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "725f13f6",
      "metadata": {
        "id": "725f13f6"
      },
      "source": [
        "Seaborn allows for easy visual representation of differents types of graphics on the same figure. For exemple we have the jointplot to plot a 2 dimensional figure, such as a 2D KDE and the monodimensional KDE correspond on each axis. The second part of the code allows for the plotting of datapoints."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "27a7919a",
      "metadata": {
        "id": "27a7919a"
      },
      "outputs": [],
      "source": [
        "g=sns.jointplot(data=data, x=\"bill_length_mm\", y=\"bill_depth_mm\",\n",
        " kind=\"kde\", hue=\"species\")\n",
        "\n",
        "for species in data[\"species\"].unique():\n",
        "    subset = data[data[\"species\"] == species]\n",
        "    g.ax_joint.scatter(\n",
        "        subset[\"bill_length_mm\"],\n",
        "        subset[\"bill_depth_mm\"],\n",
        "        s=subset[\"body_mass_g\"]/500,\n",
        "        label=species,\n",
        "        alpha=0.7)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "050fac76",
      "metadata": {
        "id": "050fac76"
      },
      "source": [
        "JointGrid permits the same things as JointPair but with added control on the figure. We can for example choose different types of plots for the 2D and 1D plot. Note that we can use parameters of the chosen plot type such as fill=True for the kdeplot."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "3bdd9b16",
      "metadata": {
        "id": "3bdd9b16"
      },
      "outputs": [],
      "source": [
        "g5 = sns.JointGrid(data=data, x=\"bill_length_mm\", y=\"bill_depth_mm\",\n",
        " hue=\"species\")\n",
        "g5.plot_joint(sns.scatterplot)\n",
        "g5.plot_marginals(sns.kdeplot, fill=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8c2153cc",
      "metadata": {
        "id": "8c2153cc"
      },
      "source": [
        "Another useful type of representation is the pairplot. It will make a figure with graphics that correspond to all combination of variables on the data. The kind parameter allows to choose the marginal graphics, and the diag_kind the diagonal ones."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "09092554",
      "metadata": {
        "id": "09092554"
      },
      "outputs": [],
      "source": [
        "data=sns.load_dataset(\"penguins\")\n",
        "sns.pairplot(data=data, hue=\"species\",kind=\"kde\",diag_kind=\"hist\")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "3b6f885c",
      "metadata": {
        "id": "3b6f885c"
      },
      "source": [
        "As JointGrid, PairGrid permits to do the same things as PairPlot but with more control. We can for example choose what kind of graphics we want on the diagonal, the lower part and the upper part of the figure."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6648da1a",
      "metadata": {
        "id": "6648da1a"
      },
      "outputs": [],
      "source": [
        "g = sns.PairGrid(data, hue=\"species\")\n",
        "g.map_upper(sns.histplot)\n",
        "g.map_lower(sns.kdeplot, fill=True)\n",
        "g.map_diag(sns.histplot, kde=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "abd0154e",
      "metadata": {
        "id": "abd0154e"
      },
      "source": [
        "The FacetGrid allows for great control of the figure. You can make different graphics and modify specific graphics of the figure by using the different values of variable."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f68cd3cd",
      "metadata": {
        "id": "f68cd3cd"
      },
      "outputs": [],
      "source": [
        "tips = sns.load_dataset(\"tips\")\n",
        "g = sns.FacetGrid(tips,col=\"sex\",row=\"time\",margin_titles=True,\n",
        "     despine=False,sharex=False)\n",
        "g.figure.subplots_adjust(wspace=0.05, hspace=0.2)\n",
        "for (row_val, col_val), ax in g.axes_dict.items():\n",
        "     subset = tips[(tips[\"time\"] == row_val) & (tips[\"sex\"] == col_val)]\n",
        "     if row_val == \"Lunch\" and col_val == \"Female\":\n",
        "         subset = subset[subset[\"tip\"] > 2]\n",
        "         sns.scatterplot(data=subset,x=\"total_bill\",y=\"tip\",hue=\"smoker\",\n",
        "         ax=ax)\n",
        "         ax.set_facecolor(\".3\")\n",
        "         ax.set_xlim(0,20)\n",
        "         ax.set_ylim(0,30)\n",
        "         ax.set_xticks([0, 5, 15])\n",
        "         ax.set_xlabel(\"tip>2\")\n",
        "         ax.grid(True, color=\"gray\", linestyle=\"-\", linewidth=0.5)\n",
        "         ax.spines[\"top\"].set_color(\"red\")\n",
        "         ax.spines[\"right\"].set_color(\"red\")\n",
        "         ax.spines[\"bottom\"].set_color(\"red\")\n",
        "         ax.spines[\"left\"].set_color(\"red\")\n",
        "         ax.tick_params(axis=\"x\", colors=\"blue\")\n",
        "     else:\n",
        "         ax.set_facecolor((0, 0, 0, 0))\n",
        "         sns.scatterplot(data=subset,x=\"total_bill\",y=\"tip\",hue=\"smoker\",\n",
        "         ax=ax)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0c31fad8",
      "metadata": {
        "id": "0c31fad8"
      },
      "source": [
        "You can either choose to save on the window that pops when you show your figure with the button, or save directly into your computer with the savefig function, which allows you to choose the definition of the figure with the dpi, and whever or not you want the background to be transparent to add it to a website for example."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5bfbc1c5",
      "metadata": {
        "id": "5bfbc1c5"
      },
      "outputs": [],
      "source": [
        "g.savefig(\"img\",dpi=200,transparent=True)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "FOSBnBTd-hB9",
      "metadata": {
        "id": "FOSBnBTd-hB9"
      },
      "source": [
        "# Seaborn Objects\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "03f55bda",
      "metadata": {
        "id": "03f55bda"
      },
      "source": [
        "Another paradigm has been added to Seaborn in 2022, it's the object representation inspired by ggplot2. It allows for construction of figures in a different and more sequential manner. You first choose the data you want to plot, then the type of plot you want such as :"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c58f2a69",
      "metadata": {
        "id": "c58f2a69"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,x=\"total_bill\").add(so.Bar(),so.Hist()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d06cc890",
      "metadata": {
        "id": "d06cc890"
      },
      "source": [
        "We have access to different options for different plots, for scatterplots with so.Dot() we have for example the Jitter() to avoid overlapping points. The color parameter is for all intents and purposes the equivalent of hue in the preceding paradigm, and marker allows us again to visualize other variables."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "01ac7697",
      "metadata": {
        "id": "01ac7697"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,x=\"smoker\",y=\"tip\").add(so.Dot(),so.Jitter(),color=\"day\",\n",
        "marker=\"time\").facet(\"sex\").limit(y=(4,11)).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e61cf6ed",
      "metadata": {
        "id": "e61cf6ed"
      },
      "source": [
        "For the case of scatterplots you can add a linear regression by using the Line() and the Polyfit(). There are diffent types of lines and not only Polyfit ones."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "41ce2fb6",
      "metadata": {
        "id": "41ce2fb6"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips, x=\"total_bill\", y=\"tip\").add(so.Dot(), color=\"time\", marker=\"day\").facet(\"sex\").add(so.Line(), so.PolyFit(), color=\"time\").show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "ac7b0a53",
      "metadata": {
        "id": "ac7b0a53"
      },
      "source": [
        "There also is the possibility of making queries with the query function on the data itself. They work using classical logic with and and or, != and == statements. We then use the pipe method in order to add the graphic we want to the data we have queried. We can then add what we want to the figure. Here we add an aggregation line and an error estimation with Band and Est."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f06afe1e",
      "metadata": {
        "id": "f06afe1e"
      },
      "outputs": [],
      "source": [
        "diamonds=sns.load_dataset(\"diamonds\")\n",
        "diamonds.query(\"cut == 'Ideal' and (color == 'D' or color == 'F')\").pipe(so.Plot, \"depth\", \"price\",linestyle=\"color\").add(so.Line(color=\".1\",linewidth=1),so.Agg()).add(so.Band(), so.Est(),group=\"color\",color=\"color\").show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d1e43255",
      "metadata": {
        "id": "d1e43255"
      },
      "source": [
        "Path() is an alternative to Line() when you want the graphic to strictly follow the order of datapoints. It is useful when you want to plot a trajectory."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "5911ac6f",
      "metadata": {
        "id": "5911ac6f"
      },
      "outputs": [],
      "source": [
        "healthexp=sns.load_dataset(\"healthexp\")\n",
        "p = so.Plot(healthexp, \"Spending_USD\", \"Life_Expectancy\", color=\"Country\").add(so.Path()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1f91ae65",
      "metadata": {
        "id": "1f91ae65"
      },
      "source": [
        "The facet method allows for figures with multiple graphics corresponding to different values of a variable,wrap controls the numbers of graphics on a row. The Area() will fill the surface below the line that the data would make."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2fae7bc9",
      "metadata": {
        "id": "2fae7bc9"
      },
      "outputs": [],
      "source": [
        "so.Plot(healthexp,\"Year\",\"Spending_USD\").facet(\"Country\",wrap=3).add(so.Area(),color=\"Country\",legend=False).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "abf0d5d3",
      "metadata": {
        "id": "abf0d5d3"
      },
      "source": [
        "The Stack() method will simply stack all the graphics on the same one."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2ac5ed60",
      "metadata": {
        "id": "2ac5ed60"
      },
      "outputs": [],
      "source": [
        "so.Plot(healthexp,\"Year\",\"Spending_USD\",color=\"Country\").add(so.Area(),\n",
        "so.Stack()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9236b753",
      "metadata": {
        "id": "9236b753"
      },
      "source": [
        "Here we use Agg() to aggregate all values on a point, essentialy plotting the mean of the variable and Range() and Est() plot the uncertainty of this value, which we have control of since we can choose the amount of standard deviation used."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ovoqPswDF_Oe",
      "metadata": {
        "id": "ovoqPswDF_Oe"
      },
      "outputs": [],
      "source": [
        "df = pd.DataFrame({\n",
        "    \"x\": [1, 2, 3],\n",
        "    \"y\": [10, 15, 20],\n",
        "    \"ymin\": [8, 12, 17],\n",
        "    \"ymax\": [12, 18, 23]\n",
        "})\n",
        "(\n",
        "    so.Plot(df, x=\"x\", y=\"y\")\n",
        "    .add(so.Range(), ymin=\"ymin\", ymax=\"ymax\")\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d6466d66",
      "metadata": {
        "id": "d6466d66"
      },
      "outputs": [],
      "source": [
        "so.Plot(data,x=\"sex\",y=\"flipper_length_mm\",linestyle=\"species\").facet(\"species\").add(so.Line(marker=\"o\"), so.Agg()).add(so.Range(), so.Est(errorbar=(\"sd\",1))).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "d979836b",
      "metadata": {
        "id": "d979836b"
      },
      "source": [
        "Example of different kind of plots on the same graphic, thanks to a double .add()."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c5389698",
      "metadata": {
        "id": "c5389698"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,x=\"total_bill\").facet(row=\"time\",col=\"sex\").add(so.Bar(),so.Hist(stat=\"density\")).add(so.Line(color=\"red\"),so.KDE()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "60846eb9",
      "metadata": {
        "id": "60846eb9"
      },
      "source": [
        "The Dodge() method serves the same purpose as the dodge parameter from the previous paradigm."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "eb8559f0",
      "metadata": {
        "id": "eb8559f0"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips, \"total_bill\", \"smoker\", color=\"sex\").add(so.Bar(alpha=.5), so.Agg(), so.Dodge()).add(so.Range(), so.Est(errorbar=\"sd\"), so.Dodge()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6d27008e",
      "metadata": {
        "id": "6d27008e"
      },
      "source": [
        "Count() does something different from Agg(): as its name implies, it counts the number of observations in each group, here the number of smokers and non-smokers for each day."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c25d5366",
      "metadata": {
        "id": "c25d5366"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,y=\"day\",color=\"smoker\").add(so.Bar(),so.Count(),so.Stack()).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "f95a52c7",
      "metadata": {
        "id": "f95a52c7"
      },
      "source": [
        "We can also plot percentiles with seaborn.objects. Passing a list to Perc() selects exactly the percentiles we want to show. If we call Perc() without arguments, it uses k=5 and computes five evenly spaced percentiles: 0, 25, 50, 75 and 100."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "862c96e1",
      "metadata": {
        "id": "862c96e1"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,\"smoker\",\"total_bill\").add(so.Dot(marker=\"s\"),so.Perc([10,50,90])).show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "63a13551",
      "metadata": {
        "id": "63a13551"
      },
      "source": [
        "Here is an example of Dots(), which is designed to plot a large number of points efficiently. Jitter() spreads the points out so that each category looks like a band. We then draw ranges spanning the 0th to 25th and the 75th to 100th percentiles. Shift() on y offsets these ranges so that they sit below the points. With .scale() we can choose the scale of a specific axis, here a logarithmic scale for x."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f50bd0ac",
      "metadata": {
        "id": "f50bd0ac"
      },
      "outputs": [],
      "source": [
        "(\n",
        "    so.Plot(diamonds, \"price\", \"cut\")\n",
        "    .add(so.Dots(pointsize=1, alpha=.2), so.Jitter(.3))\n",
        "    .add(so.Range(color=\"k\"), so.Perc([0, 25]), so.Shift(y=.2))\n",
        "    .add(so.Range(color=\"k\"), so.Perc([75, 100]), so.Shift(y=.2))\n",
        "    .scale(x=\"log\")\n",
        "    .show()\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "6bed1d4e",
      "metadata": {
        "id": "6bed1d4e"
      },
      "source": [
        "Norm() rescales the values relative to a baseline, and its where parameter sets the condition that selects the baseline. Here the baseline is the row where x == x.min(), that is, the earliest year in the data. With percent=True, each value is expressed as a percentage of that baseline."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "692a4167",
      "metadata": {
        "id": "692a4167"
      },
      "outputs": [],
      "source": [
        "so.Plot(healthexp, x=\"Year\", y=\"Spending_USD\", color=\"Country\").add(so.Lines(), so.Norm(where=\"x == x.min()\",percent=True)).label(y=\"Percent change in spending from 1970 baseline\").show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "1abe31f5",
      "metadata": {
        "id": "1abe31f5"
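      },
      "source": [
        "Here we use Norm() with percent=True and no where condition to show the variation over the years. Without where, each value is divided by the result of func (\"max\" by default) applied to its group, so the lines show horsepower as a percentage of each group's maximum."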
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "2bc7b257",
      "metadata": {
        "id": "2bc7b257"
      },
      "outputs": [],
      "source": [
        "mpg=sns.load_dataset(\"mpg\")\n",
        "so.Plot(mpg,\"model_year\",\"horsepower\").add(so.Line(),so.Agg(),so.Norm(percent=True),color=\"origin\").show()\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "fXGsGzLPCrkV",
      "metadata": {
        "id": "fXGsGzLPCrkV"
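      },
      "source": [
        "Several marks have a plural variant ending in s (for example Dots, Bars and Lines). These variants are better suited to plotting large amounts of data. The next two cells draw the same histogram with Bars() and then with Bar() for comparison."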
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "sDom0hjWCHph",
      "metadata": {
        "id": "sDom0hjWCHph"
      },
      "outputs": [],
      "source": [
        "diamonds = sns.load_dataset(\"diamonds\")\n",
        "so.Plot(diamonds,\"price\",color=\"cut\").add(so.Bars(),so.Hist(bins=50)).scale(x=\"log\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "DbvYbfO8BBIn",
      "metadata": {
        "id": "DbvYbfO8BBIn"
      },
      "outputs": [],
      "source": [
        "diamonds = sns.load_dataset(\"diamonds\")\n",
        "so.Plot(diamonds,\"price\",color=\"cut\").add(so.Bar(),so.Hist(bins=50)).scale(x=\"log\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "0d84c0b5",
      "metadata": {
        "id": "0d84c0b5"
      },
      "outputs": [],
      "source": [
        "(\n",
        "    so.Plot(tips, x=\"total_bill\", y=\"tip\")\n",
        "    .add(so.Dots(), so.Jitter(0.5), color=\"day\", marker=\"time\")\n",
        "    .facet(\"sex\")\n",
        "    .scale(x=so.Continuous(trans=\"log10\"), y=so.Continuous(trans=\"sqrt\"))\n",
        "    .show()\n",
        ")"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "T6SEKBuycDOj",
      "metadata": {
        "id": "T6SEKBuycDOj"
      },
      "source": [
        "You can also control the appearance of your plots, for example by assigning specific markers to the points of a scatter plot."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "NcqAdH9HSJpx",
      "metadata": {
        "id": "NcqAdH9HSJpx"
      },
      "outputs": [],
      "source": [
        "tips=sns.load_dataset(\"tips\")\n",
        "so.Plot(tips, x=\"total_bill\", y=\"tip\") \\\n",
        "  .add(so.Dots(), so.Jitter(0.5),\n",
        "       color=\"day\", marker=\"time\") \\\n",
        "  .scale(marker={\n",
        "      \"Lunch\": \"o\",\n",
        "      \"Dinner\": \"1\"\n",
        "  }) \\\n",
        "  .show()\n",
        "\n",
        "so.Plot(tips, x=\"total_bill\", y=\"tip\") \\\n",
        "  .add(so.Dots(marker=\"x\"), so.Jitter(0.5)) \\\n",
        "  .show()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "Tirsrbrodsez",
      "metadata": {
        "id": "Tirsrbrodsez"
      },
      "source": [
        "You can also choose the line style."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "M7YuXg5zdv7s",
      "metadata": {
        "id": "M7YuXg5zdv7s"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips,\"total_bill\",\"tip\").add(so.Lines(linestyle='--'))"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "eL8eZKzvgCZP",
      "metadata": {
        "id": "eL8eZKzvgCZP"
      },
      "source": [
        "We can also choose the colors manually:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "tRgvD0LKgFGU",
      "metadata": {
        "id": "tRgvD0LKgFGU"
      },
      "outputs": [],
      "source": [
        "so.Plot(tips, \"total_bill\", \"tip\",\n",
        "        color=\"sex\", linestyle=\"sex\").add(so.Lines()).scale(color={\"Male\": \"blue\",\"Female\": \"red\"},\n",
        "      linestyle={\"Male\": \"--\",\"Female\": (4,4)}\n",
        "  ).show()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "k2RauHVkjVe_",
      "metadata": {
        "id": "k2RauHVkjVe_"
      },
      "outputs": [],
      "source": [
        "so.Plot(healthexp, x=\"Year\", y=\"Spending_USD\", color=\"Country\").add(so.Lines(), so.Norm(where=\"x == x.min()\", percent=True)).scale(color=\"rocket\") \\\n",
        "  .label(y=\"Percent change in spending from 1970 baseline\")"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "venv (3.12.3)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}