diff --git a/README.md b/README.md index 7c6072c..1cbca3a 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,59 @@ -# module-python-visualization +# Python for data visualization and model interpretation + Training module on data visualization using Python tools + +## Downloading the data + +For this tutorial, we will use two different datasets. + +### Datasaurus (optional) + +We will use this dataset solely to illustrate certain theoretical concepts. Downloading this dataset is therefore optional. + +To download the data: + +https://www.kaggle.com/code/tombutton/datasaurus-dozen/input + +### Brain development fMRI + +We will download this dataset using `nilearn`. Follow the instructions in the notebook. + +## Installation + +1. Install `uv` + +```bash +curl -LsSf https://astral.sh/uv/install.sh | sh +``` + +2. Check `uv` installation + +```bash +uv -h +``` + +3. Create a virtual environment + +```bash +uv venv visu-env --python=3.10 +``` + +4. Activate your environment + +```bash +source visu-env/bin/activate +``` + +5. Install the dependencies + +```bash +uv pip install -r requirements.txt +``` + +## Configure the virtual environment as a Jupyter notebook kernel + +In your terminal: + +```bash +python -m ipykernel install --user --name=visu-env --display-name "Python cours visu" +``` \ No newline at end of file diff --git a/exercise_visu_en.ipynb b/exercise_visu_en.ipynb new file mode 100644 index 0000000..2bc0181 --- /dev/null +++ b/exercise_visu_en.ipynb @@ -0,0 +1,489 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "016a2a06-8ad9-4988-8d98-f03211e10d1d", + "metadata": {}, + "source": [ + "# Exercises : Data Visualization in Python\n", + "\n", + "This exercise aims to test your data visualization skills in Python\n", + "\n", + "**Context**\n", + "\n", + "To complete the exercises, you will have to download the **ADHD200** dataset. Both the phenotypic data and the resting-state fMRI data will be used.\n", + "\n", + "**Instructions**\n", + "\n", + "1. 
**Compliance**: Do not modify the names of the output variables (e.g., q1_n_sujets) given in the commented lines.\n", + "2. **Execution**: Ensure that your code runs without errors from top to bottom.\n", + "3. **Parameters**: Strictly follow the parameters specified in each question.\n", + "4. **Data**: Only use the `pheno`, `func`, and `confounds` variables provided in the configuration cell (Section 0), which are already aligned with each other.\n", + "\n", + "**Module validation**\n", + "To validate the module, please complete the following questions:\n", + "- Question 2: Relationship Between Age and Movement\n", + "- Question 3: Outliers and Participant Number\n", + "- Question 5: Visualize the Atlas\n", + "- Question 6: Time Series\n", + " \n", + "\n", + "The rest of the questions are bonus exercises." + ] + }, + { + "cell_type": "markdown", + "id": "3ebae5d5-4530-43b7-9dc0-47da8c1c6c45", + "metadata": {}, + "source": [ + "## Section 0: Configuration and Data Loading\n", + "\n", + "Execute these configuration cells. They download the data and prepare the variables you will use in the exercise. 
**Do not modify them.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c7a4822-2bef-4332-8652-67bff094d073", + "metadata": {}, + "outputs": [], + "source": [ + "# Configuration — do not modify\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "import os\n", + "import statsmodels\n", + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "import plotly.express as px\n", + "\n", + "from cmcrameri import cm\n", + "from nilearn import datasets, plotting\n", + "from nilearn.image import math_img\n", + "from nilearn.maskers import NiftiLabelsMasker\n", + "from matplotlib.colors import ListedColormap\n", + "from nilearn.connectome import ConnectivityMeasure\n", + "\n", + "print(\"Packages successfully imported.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f7c7842-8168-4426-8f02-081bf39a26c8", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Loading ADHD200 dataset — do not modify\n", + "data_dir = './nilearn_data'\n", + "\n", + "adhd_dataset = datasets.fetch_adhd(n_subjects=40, data_dir=data_dir)\n", + "\n", + "# Raw phenotypic data (might not cover all participants)\n", + "pheno_raw = adhd_dataset.phenotypic\n", + "\n", + "print(f\"Dataset loaded : {len(adhd_dataset.func)} functional images\")\n", + "print(f\"Raw phenotypic data : {pheno_raw.shape[0]} subjects\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b19534c-82f4-4565-a114-7d80c7001da1", + "metadata": {}, + "outputs": [], + "source": [ + "# Aligning data — do not modify\n", + "#\n", + "# Some subjects have fMRI data but no phenotypic data.\n", + "# We are keeping only subjects that have both.\n", + "\n", + "func_ids = [int(os.path.basename(f).split('_')[0]) for f in adhd_dataset.func]\n", + "pheno_ids = set(pheno_raw['Subject'].values)\n", + "matched_idx = [i for i, fid in enumerate(func_ids) if fid in 
pheno_ids]\n", + "matched_fids = [func_ids[i] for i in matched_idx]\n", + "\n", + "# Aligned data ready to use\n", + "pheno = pheno_raw.set_index('Subject').loc[matched_fids].reset_index()\n", + "func = [adhd_dataset.func[i] for i in matched_idx]\n", + "confounds = [adhd_dataset.confounds[i] for i in matched_idx]\n", + "\n", + "print(f\"Subjects with both imaging AND phenotypic data: {len(func)}\")\n", + "print(f\"Available phenotypic columns: {list(pheno.columns)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0b7786cc-c2d1-4578-8599-28c27bfde924", + "metadata": {}, + "source": [ + "# Part 1 : Exploring phenotypic data\n", + "\n", + "In this section, you will explore the clinical and demographic variables of the ADHD200 dataset." + ] + }, + { + "cell_type": "markdown", + "id": "28419906-7dec-4373-93c9-2ecb67e13706", + "metadata": {}, + "source": [ + "### Question 1: Age Distribution by Sex\n", + "\n", + "Visualize the `age` variable in the `pheno` dataframe based on the `sex` variable. Complete the `histplot` function below by specifying the correct variables. You must specify the following parameters:\n", + "\n", + "- Specify the data structure (dataframe) to use in the function.\n", + "- The `age` variable must be on the x-axis.\n", + "- Age distributions must be separated by the `sex` variable. This must be specified within the same `histplot` function below.\n", + "- Overlay a Kernel Density Estimate (`kdeplot`) on the histogram.\n", + "\n", + "You will need to specify the values for four parameters. **Leave the default values for other variables in the function.**\n", + "\n", + "If needed, refer to the [`seaborn` documentation](https://seaborn.pydata.org/generated/seaborn.histplot.html)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c388267-d49d-4155-bcd5-c5f49bbf6707", + "metadata": {}, + "outputs": [], + "source": [ + "ax_q1 = plt.subplot()\n", + "\n", + "sns.histplot(\n", + " # ...\n", + " ax=ax_q1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fc8e289c-388e-4c90-ae08-58c996f939a4", + "metadata": {}, + "source": [ + "### Question 2: Relationship Between Age and Movement\n", + "\n", + "Look at the relationship between the participant's age (`age`) and the movement levels (`MeanFD`) during the fMRI acquisition. Specifically, you will have to:\n", + "\n", + "- Choose the appropriate function from the `seaborn` library to visualize only the relationship between the two variables, while including a regression model.\n", + "- Specify the data structure (dataframe) to use in the function.\n", + "- `age` should be treated as your independent variable.\n", + "- `MeanFD` should be treated as the dependent variable." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "972ea435-ad5d-4c42-ba15-685e42ad50fb", + "metadata": {}, + "outputs": [], + "source": [ + "ax_q2 = plt.subplot()\n", + "\n", + "# Modify the lines below. Once you have replaced `` with the correct function\n", + "# and replaced the `...` with the correct variables, uncomment the following lines:\n", + "\n", + "#sns.(\n", + "# ...,\n", + "# ax=ax_q2\n", + "#)" + ] + }, + { + "cell_type": "markdown", + "id": "f2d47307-f7d8-4007-b52b-204f5166dd3b", + "metadata": {}, + "source": [ + "### Question 3: Outliers and Participant Number\n", + "\n", + "While creating the figure showing the relationship between age and movement, you notice a participant with a potentially aberrant mean FD value. Extract the participant id for that participant directly from your figure.\n", + "\n", + "Reproduce the figure from the previous question using `plotly.express`. For this question, the Plotly function is indicated. 
When you modify and run the cell below, you should be able to see the values for the variables `age`, `MeanFD`, and `Subject` when hovering your cursor over a point.\n", + "\n", + "💡 To visualize the regression, use the Ordinary Least Squares (OLS) method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf30bb7e-895c-42f4-904d-b3f53b6a1fca", + "metadata": {}, + "outputs": [], + "source": [ + "q3 = px.scatter(\n", + " # ...\n", + ")\n", + "\n", + "q3.show()" + ] + }, + { + "cell_type": "markdown", + "id": "925e1729-4238-4e30-b321-7d3573aae8e6", + "metadata": {}, + "source": [ + "#### Question 3: Outliers and Participant Number (continued)\n", + "\n", + "Now that you can access the participant number directly from your figure, store this number in the variable `q3_outlier` in the cell below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65d021f0-9811-4341-8946-7a48ab9167fa", + "metadata": {}, + "outputs": [], + "source": [ + "# REPLACE THE ... WITH YOUR ANSWER\n", + "q3_outlier = ...\n", + "print(f'Subject {q3_outlier} seems to have an aberrant mean FD value.')" + ] + }, + { + "cell_type": "markdown", + "id": "5d9a8c66-929b-4c18-8f6e-e31cd9f87ea1", + "metadata": {}, + "source": [ + "### Question 4: Site Effect\n", + "\n", + "Check the distribution of participants according to the acquisition site. Complete the cell below by adding the following parameters:\n", + "- Your figure size should have a width of 10 and a height of 3\n", + "- Visualize the sites on the x-axis\n", + "- The number of participants per site should be separated by group\n", + "- Change the y-axis name to the one stored in the variable `y_axis_name` in the cell below. 💡 You will need to modify an attribute of the variable `ax_q4`.\n", + "\n", + "Modify the `...` in the cell below as requested." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7da4baa-88d9-468c-a505-9f1583933733", + "metadata": {}, + "outputs": [], + "source": [ + "# Specify the figure size\n", + "fig_q4, ax_q4 = plt.subplots(1, ...)\n", + "\n", + "# Specify the requested parameters in the function below\n", + "sns.countplot(\n", + " ...,\n", + " ax=ax_q4\n", + ")\n", + "\n", + "# Modify the axis name\n", + "y_axis_name = 'Number of participants' # DO NOT MODIFY\n", + "ax_q4. ..." + ] + }, + { + "cell_type": "markdown", + "id": "f2420ea7-f25f-44cc-805d-440c8415c39b", + "metadata": {}, + "source": [ + "# Part 2 : Exploring fMRI data\n", + "\n", + "In this section, you will explore the functional fMRI data from the ADHD200 dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87946a87-25cc-4449-9f80-611778cddcce", + "metadata": {}, + "outputs": [], + "source": [ + "# Loading the BASC 12 ROIs atlas — do not modify\n", + "basc = datasets.fetch_atlas_basc_multiscale_2015(\n", + " data_dir=data_dir, resolution=12\n", + ")\n", + "labels = basc.labels\n", + "atlas_img = basc.maps\n", + "print(f\"Atlas loaded : {atlas_img}\")\n", + "print(f\"Number of regions: {len(labels)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "03081687-0895-48c2-b1e7-24fae50f6746", + "metadata": {}, + "source": [ + "### Question 5: Visualize the Atlas\n", + "\n", + "Visualize the Regions of Interest (ROIs) using the `plot_roi` function from the `nilearn` library. Visualize the atlas according to the following instructions:\n", + "\n", + "- Remove the cross showing the coordinate position for each slice.\n", + "- Add the title `Atlas BASC - 12 ROIs`\n", + "- Show the atlas at coordinates (0, 0, 0)\n", + "\n", + "💡 Consult the documentation for the [`plot_roi`](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_roi.html) function to determine which parameters to include." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a281fac8-18ff-43f1-bb33-4043444efbaf", + "metadata": {}, + "outputs": [], + "source": [ + "title_q5 = 'Atlas BASC - 12 ROIs'\n", + "\n", + "fig_q5 = plotting.plot_roi(\n", + " ...\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d1ae58ea-5127-4970-91d1-1bfc2188da1d", + "metadata": {}, + "source": [ + "### Question 6: Time Series\n", + "\n", + "Compare the temporal signals of **the first brain region** of the atlas between the subject with the lowest movement and the one with the highest movement (see figure generated in Question 3). Create a figure with 1 row and 2 columns. In the left column, visualize your ROI. In the right column, visualize the time series for both subjects.\n", + "\n", + "**Left Column:**\n", + "- Visualize your ROI using `nilearn`'s `plot_roi` function\n", + "- Remove annotations\n", + "- Remove the colorbar\n", + "- Visualize only the axial slice according to the coordinates specified in the `coords` variable in the cell below\n", + "- Don't forget to specify your axis in the subplot!\n", + "\n", + "💡 Consult the [`plot_roi`](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_roi.html) documentation for parameter details.\n", + "\n", + "**Right column:**\n", + "- Visualize the time series of both subjects for your ROI using `matplotlib`'s `plot` function\n", + "- Specify the following labels for each subject: `'low'` for the subject with the least movement and `'high'` for the subject with the most movement\n", + "- Insert a legend at the bottom left of your subplot to indicate which color belongs to which subject\n", + "- The time series for the subject with the most movement (`'high'`) should be dashed (`'--'`)\n", + "- Add the title `'Volumes'` to your x-axis and set the font size to `14`\n", + "- Your line width should be `3`\n", + "\n", + "💡 Refer to the visualization course tutorial and the [matplotlib 
documentation](https://matplotlib.org/stable/) if needed." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d59b59ae-e889-47cd-897f-ea9bfd798254", + "metadata": {}, + "outputs": [], + "source": [ + "# Create your figure layout\n", + "fig_q6, axes_q6 = plt.subplots(1, 2, figsize=(20, 5), width_ratios=[1, 3]) # DO NOT MODIFY\n", + "\n", + "# Specify subject numbers\n", + "id_low_q6 = ... # TO MODIFY\n", + "id_high_q6 = ... # TO MODIFY\n", + "\n", + "# Fetch data for the specified subjects\n", + "img_low_q6 = [f for f in func if str(id_low_q6) in f][0] # DO NOT MODIFY\n", + "img_high_q6 = [f for f in func if str(id_high_q6) in f][0] # DO NOT MODIFY\n", + "\n", + "# Instantiate the masker\n", + "# DO NOT MODIFY\n", + "mask = NiftiLabelsMasker(\n", + " labels_img=atlas_img,\n", + " labels = labels\n", + ")\n", + "# Extract time series for the specified subjects\n", + "# DO NOT MODIFY\n", + "timeserie_low_q6=mask.fit_transform(\n", + " img_low_q6\n", + ")\n", + "timeserie_high_q6=mask.fit_transform(\n", + " img_high_q6\n", + ")\n", + "\n", + "# Fetch time series for the first ROI\n", + "roi_id_q6 = ... 
# TO MODIFY\n", + "roi_low_q6 = timeserie_low_q6[:, roi_id_q6] # DO NOT MODIFY\n", + "roi_high_q6 = timeserie_high_q6[:, roi_id_q6] # DO NOT MODIFY\n", + "\n", + "# Select specific region in the atlas\n", + "roi_atlas = math_img('img1==1', img1=atlas_img) # DO NOT MODIFY\n", + "\n", + "# Add ROI to the figure\n", + "coords = (0,)\n", + "\n", + "# TO MODIFY: SPECIFY REQUIRED PARAMETERS IN THE FUNCTION BELOW\n", + "plotting.plot_roi(\n", + " ...\n", + ")\n", + "\n", + "# Add time series to the figure\n", + "# TO MODIFY: SPECIFY AXIS, FUNCTION AND PARAMETERS FOR THE LOW MOVEMENT PARTICIPANT\n", + "axes_q6...\n", + "# TO MODIFY: SPECIFY AXIS, FUNCTION AND PARAMETERS FOR THE HIGH MOVEMENT PARTICIPANT\n", + "axes_q6...\n", + "# TO MODIFY: ADD LEGEND AND NECESSARY PARAMETERS\n", + "axes_q6...\n", + "# TO MODIFY: APPLY REQUESTED CHANGES TO THE X-AXIS\n", + "axes_q6..." + ] + }, + { + "cell_type": "markdown", + "id": "9a041d64-9445-4d96-96ab-90a85f84894a", + "metadata": {}, + "source": [ + "### Question 7: Carpet plot\n", + "\n", + "Compare the time series across all voxels for those two subjects. Use `nilearn`'s `plot_carpet` function. Specifically, you will need to:\n", + "\n", + "- Create a figure with two rows (1 column)\n", + "- The top row should contain the carpet plot for the subject with the least movement (lowest Mean FD value)\n", + "- The bottom row should contain the carpet plot for the subject with the most movement (highest Mean FD value)\n", + "- For each subplot, add a title. For the top figure, use 'Participant with the lowest Mean FD value'. For the bottom figure, use 'Participant with the highest Mean FD value'\n", + "- Separate the carpet plot by atlas regions. **Hint:** you will need to use the `mask_img` and `mask_labels` parameters of the `plot_carpet` function\n", + "\n", + "💡 The data to visualize has already been stored in variables in the previous question. 
To determine which variable to use for the mask, carefully check the expected variable type in the [nilearn documentation](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_carpet.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f8a7bf0f-51ac-4e6d-b6ef-919024587bde", + "metadata": {}, + "outputs": [], + "source": [ + "labels_dict_q7 = {item: int(item) for item in labels if item != 'Background'} # DO NOT MODIFY\n", + "\n", + "fig_q7, axes_q7 = plt.subplots(2, 1, figsize=(16, 12))\n", + "\n", + "display_low_q7 = plotting.plot_carpet(\n", + " ...\n", + ")\n", + "display_high_q7 = plotting.plot_carpet(\n", + " ...\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python cours visu", + "language": "python", + "name": "visu-env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/exercise_visu_fr.ipynb b/exercise_visu_fr.ipynb new file mode 100644 index 0000000..590d1e3 --- /dev/null +++ b/exercise_visu_fr.ipynb @@ -0,0 +1,479 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "016a2a06-8ad9-4988-8d98-f03211e10d1d", + "metadata": {}, + "source": [ + "# Exercices: Visualisation de données en python\n", + "\n", + "Cet exercice vise à tester vos compétences en visualisation de données en python.\n", + "\n", + "**Contexte**\n", + "\n", + "Pour compléter ces exercices, vous devrez utiliser le jeu de données **ADHD200**. Plus précisément, vous devrez visualiser les données phénotypiques et les données d'IRM fonctionnelle au repos.\n", + "\n", + "**Instructions :**\n", + "\n", + "1. **Conformité** : Ne modifiez pas les noms des variables de résultat (ex: q1_n_sujets) donnés dans les lignes commentées. \n", + "2. 
**Exécution** : Assurez-vous que votre code s'exécute sans erreur de haut en bas.\n", + "3. **Paramètres** : Respectez exactement les paramètres spécifiés dans chaque question.\n", + "4. **Données** : Utilisez uniquement les variables `pheno`, `func` et `confounds` fournies dans la cellule de configuration (Section 0), qui sont déjà alignées entre elles." + ] + }, + { + "cell_type": "markdown", + "id": "3ebae5d5-4530-43b7-9dc0-47da8c1c6c45", + "metadata": {}, + "source": [ + "## Section 0: Configuration et chargement des données\n", + "\n", + "Exécutez ces cellules de configuration. Elles téléchargent les données et préparent les variables que vous utiliserez dans l'exercice. **Ne les modifiez pas.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c7a4822-2bef-4332-8652-67bff094d073", + "metadata": {}, + "outputs": [], + "source": [ + "# Configuration — ne pas modifier\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "\n", + "import os\n", + "import statsmodels\n", + "import numpy as np\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "import plotly.express as px\n", + "\n", + "from cmcrameri import cm\n", + "from nilearn import datasets, plotting\n", + "from nilearn.image import math_img\n", + "from nilearn.maskers import NiftiLabelsMasker\n", + "from matplotlib.colors import ListedColormap\n", + "from nilearn.connectome import ConnectivityMeasure\n", + "\n", + "print(\"Bibliothèques importées avec succès.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1f7c7842-8168-4426-8f02-081bf39a26c8", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# Chargement du dataset ADHD200 — ne pas modifier\n", + "data_dir = './nilearn_data'\n", + "\n", + "adhd_dataset = datasets.fetch_adhd(n_subjects=40, data_dir=data_dir)\n", + "\n", + "# Les données phénotypiques brutes (peuvent ne pas couvrir tous les sujets)\n", + "pheno_brute = 
adhd_dataset.phenotypic\n", + "\n", + "print(f\"Dataset chargé : {len(adhd_dataset.func)} images fonctionnelles\")\n", + "print(f\"Données phénotypiques brutes : {pheno_brute.shape[0]} sujets\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b19534c-82f4-4565-a114-7d80c7001da1", + "metadata": {}, + "outputs": [], + "source": [ + "# Alignement des données — ne pas modifier\n", + "#\n", + "# Certains sujets ont des images IRMf mais pas de données phénotypiques.\n", + "# On conserve uniquement les sujets présents dans les deux sources.\n", + "\n", + "func_ids = [int(os.path.basename(f).split('_')[0]) for f in adhd_dataset.func]\n", + "pheno_ids = set(pheno_brute['Subject'].values)\n", + "matched_idx = [i for i, fid in enumerate(func_ids) if fid in pheno_ids]\n", + "matched_fids = [func_ids[i] for i in matched_idx]\n", + "\n", + "# Données alignées et prêtes à l'emploi\n", + "pheno = pheno_brute.set_index('Subject').loc[matched_fids].reset_index()\n", + "func = [adhd_dataset.func[i] for i in matched_idx]\n", + "confounds = [adhd_dataset.confounds[i] for i in matched_idx]\n", + "\n", + "print(f\"Sujets avec données d'imagerie ET phénotypiques : {len(func)}\")\n", + "print(f\"Colonnes phénotypiques disponibles : {list(pheno.columns)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "0b7786cc-c2d1-4578-8599-28c27bfde924", + "metadata": {}, + "source": [ + "# Partie 1 : Exploration des données phénotypiques\n", + "\n", + "Dans cette section, vous allez explorer les variables cliniques et démographiques du jeu de données ADHD200." + ] + }, + { + "cell_type": "markdown", + "id": "28419906-7dec-4373-93c9-2ecb67e13706", + "metadata": {}, + "source": [ + "### Question 1: Distribution de l'âge selon le sexe\n", + "\n", + "Visualisez la variable `age` dans le *dataframe* `pheno` en fonction de la variable `sex`. Remplissez la fonction `histplot` ci-dessous en précisant les bonnes variables. 
Vous devrez préciser les paramètres suivants:\n", + "\n", + "- Spécifiez la structure de données (dataframe) à utiliser dans la fonction.\n", + "- La variable `age` doit figurer sur l'axe des x.\n", + "- Les distributions de l'âge doivent être séparées selon la variable `sex`. Cela doit être spécifié dans la même fonction `histplot` ci-dessous.\n", + "- Superposez une estimation de la densité par noyau (`kdeplot`) à l'histogramme.\n", + "\n", + "Vous devrez donc spécifier la valeur de quatre paramètres. **Laissez les valeurs par défaut pour les autres variables de la fonction.**\n", + "\n", + "Au besoin, référez-vous à la documentation de [`seaborn`](https://seaborn.pydata.org/generated/seaborn.histplot.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c388267-d49d-4155-bcd5-c5f49bbf6707", + "metadata": {}, + "outputs": [], + "source": [ + "ax_q1 = plt.subplot()\n", + "\n", + "sns.histplot(\n", + " # ...\n", + " ax=ax_q1\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fc8e289c-388e-4c90-ae08-58c996f939a4", + "metadata": {}, + "source": [ + "### Question 2: Relation entre l'âge et le mouvement\n", + "\n", + "Explorez la relation entre l'âge des participant.e.s (`age`) et les niveaux de mouvement (`MeanFD`) lors des acquisitions d'IRMf. Plus précisément, vous devrez:\n", + "\n", + "- Choisir la fonction adéquate de la librairie `seaborn` vous permettant de visualiser uniquement la relation entre les deux variables, tout en y incluant un modèle de régression.\n", + "- Spécifier la structure de données (dataframe) à utiliser dans la fonction.\n", + "- L'âge doit être traité comme votre variable indépendante.\n", + "- Le mouvement doit être traité comme variable dépendante." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "972ea435-ad5d-4c42-ba15-685e42ad50fb", + "metadata": {}, + "outputs": [], + "source": [ + "ax_q2 = plt.subplot()\n", + "\n", + "# Modifiez les lignes ci-dessous. 
Une fois que vous aurez remplacé `` par la bonne fonction\n", + "# et que vous aurez remplacé les `...` par les bonnes variables, décommentez les lignes suivantes:\n", + "\n", + "#sns.(\n", + "# ...,\n", + "# ax=ax_q2\n", + "#)" + ] + }, + { + "cell_type": "markdown", + "id": "f2d47307-f7d8-4007-b52b-204f5166dd3b", + "metadata": {}, + "source": [ + "### Question 3: Valeurs aberrantes et numéro de participant.e\n", + "\n", + "Vous effectuez la figure montrant la relation entre l'âge et le mouvement et vous remarquez qu'il semble y avoir un.e participant.e ayant une valeur de FD moyenne potentiellement aberrante. Extrayez le numéro de participant.e directement à partir de votre figure. \n", + "\n", + "Reproduisez la figure que vous avez faite à la question précédente avec `plotly.express`. Pour cette question, la fonction de plotly est indiquée. Lorsque vous allez modifier puis rouler la cellule ci-dessous, vous devrez être en mesure de voir les valeurs des variables `age`, `MeanFD` et `Subject` lorsque vous placez votre curseur au-dessus d'un point.\n", + "\n", + "💡 Pour visualiser la régression, utilisez la méthode des moindres carrés ordinaires." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bf30bb7e-895c-42f4-904d-b3f53b6a1fca", + "metadata": {}, + "outputs": [], + "source": [ + "q3 = px.scatter(\n", + " # ...\n", + ")\n", + "\n", + "q3.show()" + ] + }, + { + "cell_type": "markdown", + "id": "925e1729-4238-4e30-b321-7d3573aae8e6", + "metadata": {}, + "source": [ + "#### Question 3: Valeurs aberrantes et numéro de participant.e (suite)\n", + "\n", + "Maintenant que vous pouvez avoir accès au numéro de participant.e directement à partir de votre figure, stockez ce numéro dans la variable `q3_outlier` dans la cellule ci-dessous." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65d021f0-9811-4341-8946-7a48ab9167fa", + "metadata": {}, + "outputs": [], + "source": [ + "# REMPLACEZ LES ... 
PAR VOTRE RÉPONSE\n", + "q3_outlier = ...\n", + "print(f'Le sujet {q3_outlier} semble avoir une valeur de FD moyenne aberrante.')" + ] + }, + { + "cell_type": "markdown", + "id": "5d9a8c66-929b-4c18-8f6e-e31cd9f87ea1", + "metadata": {}, + "source": [ + "### Question 4: Effet du site\n", + "\n", + "Regardez la répartition des participant.e.s en fonction du site d'acquisition. Complétez la cellule ci-dessous en y ajoutant les paramètres suivants:\n", + "- La taille de votre figure devrait avoir une largeur de 10 et une hauteur de 3.\n", + "- Visualisez les sites sur l'axe des x\n", + "- Le nombre de participant.e.s par site doit être séparé par groupe\n", + "- Modifiez le nom de l'axe des y pour celui stocké dans la variable `y_axis_name` dans la cellule ci-dessous. 💡 Vous devrez modifier un attribut de la variable `ax_q4`.\n", + "\n", + "Modifiez les `...` dans la cellule ci-dessous selon ce qui a été demandé." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7da4baa-88d9-468c-a505-9f1583933733", + "metadata": {}, + "outputs": [], + "source": [ + "# Spécifiez la taille de la figure\n", + "fig_q4, ax_q4 = plt.subplots(1, ...)\n", + "\n", + "# Spécifiez les paramètres demandés dans la fonction ci-dessous\n", + "sns.countplot(\n", + " ...,\n", + " ax=ax_q4\n", + ")\n", + "\n", + "# Modifiez le nom de l'axe\n", + "y_axis_name = 'Nombre de participants' # NE PAS MODIFIER\n", + "ax_q4. ..." + ] + }, + { + "cell_type": "markdown", + "id": "f2420ea7-f25f-44cc-805d-440c8415c39b", + "metadata": {}, + "source": [ + "# Partie 2 : Exploration des données d'IRM fonctionnelle\n", + "\n", + "Dans cette section, vous allez explorer les données d'IRM fonctionnelle du jeu de données ADHD200." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "87946a87-25cc-4449-9f80-611778cddcce", + "metadata": {}, + "outputs": [], + "source": [ + "# Chargement de l'atlas BASC 12 ROIs — ne pas modifier\n", + "basc = datasets.fetch_atlas_basc_multiscale_2015(\n", + " data_dir=data_dir, resolution=12\n", + ")\n", + "labels = basc.labels\n", + "atlas_img = basc.maps\n", + "print(f\"Atlas chargé : {atlas_img}\")\n", + "print(f\"Nombre de régions: {len(labels)}\")" + ] + }, + { + "cell_type": "markdown", + "id": "03081687-0895-48c2-b1e7-24fae50f6746", + "metadata": {}, + "source": [ + "### Question 5: Visualiser l'atlas\n", + "\n", + "Visualisez les régions d'intérêt (*Regions of Interest*; ROIs) grâce à la fonction `plot_roi` de la librairie `nilearn`. Visualisez l'atlas selon les instructions suivantes:\n", + "\n", + "- Enlevez la croix montrant la position des coordonnées pour chaque coupe\n", + "- Ajoutez le titre `Atlas BASC - 12 ROIs`\n", + "- Montrez l'atlas aux coordonnées (0, 0, 0)\n", + "\n", + "💡 Consultez la documentation de la fonction [`plot_roi`](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_roi.html) pour déterminer quels paramètres inclure dans la fonction." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a281fac8-18ff-43f1-bb33-4043444efbaf", + "metadata": {}, + "outputs": [], + "source": [ + "title_q5 = 'Atlas BASC - 12 ROIs'\n", + "\n", + "fig_q5 = plotting.plot_roi(\n", + " ...\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d1ae58ea-5127-4970-91d1-1bfc2188da1d", + "metadata": {}, + "source": [ + "### Question 6: Séries temporelles\n", + "\n", + "Comparez les signaux temporels de la première région cérébrale de l'atlas entre le sujet ayant le moins de mouvement et celui ayant le plus de mouvement (voir figure générée à la question 3). Créez une figure avec 1 ligne et deux colonnes. Dans la colonne de gauche, vous devrez visualiser votre région d'intérêt. 
Dans la colonne de droite, vous devrez visualiser la série temporelle pour vos deux sujets.\n", + "\n", + "**Colonne de gauche:**\n", + "- Visualisez votre région d'intérêt grâce à la fonction `plot_roi` de `nilearn`\n", + "- Enlevez les annotations\n", + "- Enlevez la barre de couleur (`colorbar`)\n", + "- Visualisez uniquement la coupe axiale selon les coordonnées spécifiées dans la variable `coords` dans la cellule ci-dessous\n", + "- N'oubliez pas de spécifier votre axe dans le sous-graphique !\n", + "\n", + "💡 Consultez la documentation de la fonction [`plot_roi`](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_roi.html) pour déterminer quels paramètres inclure dans la fonction.\n", + "\n", + "**Colonne de droite:**\n", + "- Visualisez les séries temporelles de vos deux sujets pour votre région d'intérêt grâce à la fonction `plot` de `matplotlib`\n", + "- Spécifiez les étiquettes suivantes pour chacun de vos sujets: `'low'` pour votre sujet avec le moins de mouvement et `'high'` pour votre sujet avec le plus de mouvement\n", + "- Insérez une légende en bas à gauche de votre sous-graphique pour indiquer quelle couleur appartient à quel sujet.\n", + "- La série temporelle pour le sujet ayant le plus de mouvement (`'high'`) devrait être en pointillé (`'--'`)\n", + "- Ajoutez le titre `'Volumes'` à votre axe des x et mettez la taille de la police pour ce titre à `14`\n", + "- L'épaisseur de vos lignes devrait être de `3`.\n", + "\n", + "💡 Référez-vous au tutoriel du cours de visualisation et à [la documentation de matplotlib](https://matplotlib.org/stable/) au besoin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d59b59ae-e889-47cd-897f-ea9bfd798254", + "metadata": {}, + "outputs": [], + "source": [ + "# Créez le layout de votre figure\n", + "fig_q6, axes_q6 = plt.subplots(1, 2, figsize=(20, 5), width_ratios=[1, 3]) # NE PAS MODIFIER\n", + "\n", + "# Indiquez les numéros de sujet\n", + "id_low_q6 = ... 
# TO MODIFY\n", + "id_high_q6 = ... # TO MODIFY\n", + "\n", + "# Retrieve the data for the participants specified above\n", + "img_low_q6 = [f for f in func if str(id_low_q6) in f][0] # DO NOT MODIFY\n", + "img_high_q6 = [f for f in func if str(id_high_q6) in f][0] # DO NOT MODIFY\n", + "\n", + "# Instantiate the masker\n", + "# DO NOT MODIFY\n", + "mask = NiftiLabelsMasker(\n", + " labels_img=atlas_img,\n", + " labels = labels\n", + ")\n", + "# Extract the time series for the participants specified above\n", + "# DO NOT MODIFY\n", + "timeserie_low_q6=mask.fit_transform(\n", + " img_low_q6\n", + ")\n", + "timeserie_high_q6=mask.fit_transform(\n", + " img_high_q6\n", + ")\n", + "\n", + "# Retrieve the time series for the first region of interest\n", + "roi_id_q6 = ... # TO MODIFY\n", + "roi_low_q6 = timeserie_low_q6[:, roi_id_q6] # DO NOT MODIFY\n", + "roi_high_q6 = timeserie_high_q6[:, roi_id_q6] # DO NOT MODIFY\n", + "\n", + "# Select the specific region in the atlas\n", + "roi_atlas = math_img('img1==1', img1=atlas_img) # DO NOT MODIFY\n", + "\n", + "# Add the region of interest to the figure\n", + "coords = (0,)\n", + "\n", + "# TO MODIFY: SPECIFY THE REQUIRED PARAMETERS IN THE FUNCTION BELOW\n", + "plotting.plot_roi(\n", + " ...\n", + ")\n", + "\n", + "# Add the time series to the figure\n", + "# TO MODIFY: SPECIFY THE AXIS, THE FUNCTION, AND THE REQUIRED PARAMETERS FOR THE PARTICIPANT WITH THE LEAST MOVEMENT\n", + "axes_q6...\n", + "# TO MODIFY: SPECIFY THE AXIS, THE FUNCTION, AND THE REQUIRED PARAMETERS FOR THE PARTICIPANT WITH THE MOST MOVEMENT\n", + "axes_q6...\n", + "# TO MODIFY: ADD THE LEGEND AND THE NECESSARY PARAMETERS\n", + "axes_q6...\n", + "# TO MODIFY: APPLY THE REQUESTED CHANGES TO THE X-AXIS\n", + "axes_q6..." 
+ ] + }, + { + "cell_type": "markdown", + "id": "9a041d64-9445-4d96-96ab-90a85f84894a", + "metadata": {}, + "source": [ + "### Question 7: Carpet plot\n", + "\n", + "Compare the time series of all voxels (rather than of a single region of interest) for these two participants. To do so, use the `plot_carpet` function from `nilearn`. More specifically, you will need to:\n", + "\n", + "- Create a figure with two rows (1 column)\n", + "- The top row should contain the *carpet plot* for the participant with the least movement (lowest mean FD value; see the previous question).\n", + "- The bottom row should contain the *carpet plot* for the participant with the most movement (highest mean FD value; see the previous question).\n", + "- Add a title to each subplot. For the top subplot, use the title 'Participant with the lowest Mean FD value'. For the bottom subplot, use the title 'Participant with the highest Mean FD value'\n", + "- Split the *carpet plot* according to the regions of your atlas. **Hint:** you will need to use the `mask_img` and `mask_labels` parameters of the `plot_carpet` function.\n", + "\n", + "💡 The data to visualize were already stored in variables in the previous question. 
To determine which variable to use for the mask, look carefully at the expected variable type in [the nilearn documentation](https://nilearn.github.io/dev/modules/generated/nilearn.plotting.plot_carpet.html).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f8a7bf0f-51ac-4e6d-b6ef-919024587bde", + "metadata": {}, + "outputs": [], + "source": [ + "labels_dict_q7 = {item: int(item) for item in labels if item != 'Background'} # DO NOT MODIFY\n", + "\n", + "fig_q7, axes_q7 = plt.subplots(2, 1, figsize=(16, 12))\n", + "\n", + "display_low_q7 = plotting.plot_carpet(\n", + " ...\n", + ")\n", + "display_high_q7 = plotting.plot_carpet(\n", + " ...\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python cours visu", + "language": "python", + "name": "visu-env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/visualization_en.ipynb b/notebooks/visualization_en.ipynb new file mode 100644 index 0000000..03b2317 --- /dev/null +++ b/notebooks/visualization_en.ipynb @@ -0,0 +1,2737 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fe3c968-1410-49ba-ac5c-3c3789dc1e1f", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "# Visualization and Model Interpretation\n", + "\n", + "## Learning Objectives\n", + "👀 Understand the purpose of visualization\n", + "
📈 Understand which type of graph to use according to the data type\n", + "
🎨 Adequately choose your color palette\n", + "
🔃 Learn how to modify different elements of our figures\n", + "
🧠 Use `nilearn` for neuroimaging data visualization\n", + "
🤖 Understand how visualization can help us interpret our machine learning models\n", + "\n", + "## Tutorial Organization\n", + "\n", + "The session will consist of both theoretical and practical sections:\n", + "\n", + "**Theory**\n", + "\n", + "We will discuss the basic theoretical principles of visualization. The following principles will be covered:\n", + "- Graph types (tabular data): univariate vs. bivariate; categorical vs. continuous\n", + "- Color palettes: perceptually uniform vs. non-uniform; discrete vs. continuous\n", + "- Graph types (neuroimaging data): statistical maps, connectomes, etc.\n", + "\n", + "This part includes several interactive elements to lead students to form their own understanding of the material. The code is already provided, but we will not dwell on it.\n", + "\n", + "**Practice**\n", + "\n", + "We will put the theoretical principles into practice as we go. We will use the following libraries:\n", + "- matplotlib\n", + "- seaborn\n", + "- ptitprince\n", + "- plotly\n", + "- nilearn\n", + "\n", + "Students will be required to modify the provided code to understand the role of various visualization parameters and to answer specific questions based on what we cover in class or from provided references (e.g., matplotlib documentation)." + ] + }, + { + "cell_type": "markdown", + "id": "70ffeea7-819f-4607-9e78-b43cc095fe0c", + "metadata": {}, + "source": [ + "
\n", + "The yellow boxes contain questions/exercises for students to answer.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "d19ef6c6-9843-4a8b-bf8d-0af3049447a7", + "metadata": {}, + "source": [ + "
\n", + "The blue boxes contain additional information about the datasets and functions used.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7237bc73-6409-4a39-8149-c4987f29e5c1", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import nibabel as nib\n", + "import seaborn as sns\n", + "import ptitprince as pt\n", + "import ipywidgets as widgets\n", + "import matplotlib.pyplot as plt\n", + "import plotly.express as px\n", + "\n", + "from cmcrameri import cm\n", + "from nilearn import datasets, plotting, image\n", + "from nilearn.input_data import NiftiLabelsMasker\n", + "from matplotlib import colormaps, colors\n", + "from colorspacious import cspace_converter, cspace_convert" + ] + }, + { + "cell_type": "markdown", + "id": "d702f598", + "metadata": {}, + "source": [ + "## Load the data\n", + "\n", + "### Brain development fMRI dataset\n", + "For this tutorial, we will use the brain development fMRI dataset, which includes phenotypic data as well as fMRI data collected from children and adults during movie watching ([Richardson et al., 2018](https://doi.org/10.1038/s41467-018-03399-2)).\n", + "\n", + "**Note:** if you have already downloaded the data, you can change the path in the cell below to point to the appropriate directory." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8f0b0b9", + "metadata": {}, + "outputs": [], + "source": [ + "development_dataset = datasets.fetch_development_fmri(data_dir='data/')" + ] + }, + { + "cell_type": "markdown", + "id": "5041274a-1dbe-4f1f-9208-6b705c01d684", + "metadata": {}, + "source": [ + "### Datasaurus\n", + "In the first part of the tutorial, we will also briefly use the Datasaurus dataset. 
If you want to be able to execute the cells using that dataset, you will have to download the data from [Kaggle](https://www.kaggle.com/datasets/tombutton/datasaurusdozen).\n", + "\n", + "**Note:** change the path in the cell below to point to the directory where you downloaded the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46a73b81-6371-4ac2-98fb-a444d3c7504d", + "metadata": {}, + "outputs": [], + "source": [ + "# Credit: Alberto Cairo (original datasaurus), and Justin Matejka and George Fitzmaurice (datasaurus dozen)\n", + "data = pd.read_csv(\"data/datasaurus.csv\") " + ] + }, + { + "cell_type": "markdown", + "id": "cba69723-44c3-4508-a5ee-a515193d8c9d", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "## A picture is worth a thousand words\n", + "- Visualizations help us better understand the complexity of our data\n", + "- They allow us to support our findings\n", + "- And they help us share a message" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc862f23-1dfb-47b1-918c-147e6f2569a0", + "metadata": {}, + "outputs": [], + "source": [ + "# rcParams allows us to modify the global parameters of our figures\n", + "# Here, we turn off offset and scientific notation on the axes: `axes.formatter.limits` holds powers of ten,\n", + "# so such extreme bounds mean the switch to scientific notation is never triggered for our values\n", + "plt.rcParams['axes.formatter.useoffset'] = False\n", + "plt.rcParams['axes.formatter.limits'] = (-8000000, 8000000) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1479b22-5c32-4de9-a894-35160a480ea5", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_category(category):\n", + " plt.scatter(data[data['dataset'] == category]['x'], data[data['dataset'] == category]['y'])\n", + " plt.xlabel('x')\n", + " plt.ylabel('y')\n", + " max_x = data[data['dataset'] == category]['x'].max()\n", + " 
max_y = data[data['dataset'] == category]['y'].max()\n", + " plt.text(max_x-0.22*max_x, 
max_y+0.10*max_y, f\"Mean: ({data[data['dataset'] == category]['x'].mean().round(2)}, {data[data['dataset'] == category]['y'].mean().round(2)})\")\n", + " plt.text(max_x-0.22*max_x, max_y+0.06*max_y, f\"Std: ({data[data['dataset'] == category]['x'].std().round(2)}, {data[data['dataset'] == category]['y'].std().round(2)})\")\n", + "\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_category,\n", + " category=widgets.Dropdown(\n", + " options=sorted(data['dataset'].unique()),\n", + " description='Dataset:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "5d19d31b-3a97-474c-bebb-0f13ac04c85b", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "### \"Never trust summary statistics alone; always visualize your data\"\n", + "

Alberto Cairo

\n", + "\n", + "Python offers a wide range of libraries for visualizing our data:\n", + "- High-level vs. low-level\n", + "- Static images vs. interactive plots\n", + "- General-purpose libraries (e.g., Matplotlib, Seaborn, Bokeh, and Plotly)\n", + "- Domain-specific libraries (e.g., Nilearn)\n", + "\n", + "But with great power comes great responsibility..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc2c873e-d608-4022-a380-b2cd0cd6c04f", + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.DataFrame(\n", + " {\n", + " 'Categories': ['A', 'B'],\n", + " 'Values': [6000000, 7066000]\n", + " },\n", + " \n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ee868f9-7c04-4ddc-a068-46b7289077bf", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "plt.ylim([5500000, 8000000])\n", + "plt.yticks([])\n", + "plt.title(\"What could you say about 'A' and 'B' if you only look at the figure?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9c4c673-1a48-4f9f-97b6-d29bd084c39e", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "plt.ylim([5500000, 8000000])\n", + " \n", + "plt.title(\"What do you observe?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82aaaa58-6c8f-4f2f-9634-f58ade71b8fd", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "\n", + "# Add the values above the bars\n", + "for i, v in enumerate(data['Values']):\n", + " plt.text(i, v + 1, str(v), ha='center', va='bottom')\n", + " \n", + "plt.title(\"Is that better?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "6e55598d-529e-41d9-bde9-72f3dfae8f89", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "## A graph for 
every data type!\n", + "\n", + "There are several types of graphs. The right chart type depends on the variables you want to visualize:\n", + "- Univariate visualization: **continuous variable** vs **categorical variable**\n", + "- Bivariate visualization: **categorical x categorical** vs **categorical x continuous** vs **continuous x continuous**" + ] + }, + { + "cell_type": "markdown", + "id": "829dcb49-628f-4b61-81e7-de20c0f29ec7", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "### Univariate visualizations\n", + "\n", + "For a **continuous variable**, we can visualize its distribution:\n", + "- Histogram\n", + "- KDE plot\n", + "- Strip plot\n", + " \n", + "For a **categorical variable**, we can visualize the number of observations in each category:\n", + "- Bar plot" + ] + }, + { + "cell_type": "markdown", + "id": "019ba4e7-660d-4dd9-bf09-f54bf67fd8e4", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "
\n", + "If the `brain development fMRI dataset` is not under the data/ folder, change the path in the cell below for the appropriate one.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "473b14ef-f075-40d5-a5ae-7183f8c50e1e", + "metadata": {}, + "outputs": [], + "source": [ + "# Retrieve participants data\n", + "participants = pd.read_csv('data/development_fmri/development_fmri/participants.tsv', sep='\\t')\n", + "\n", + "# Let's check what we have here\n", + "participants.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b261b1ed-b7ae-4d5c-9511-9de670e7d3d9", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's check our data types\n", + "participants.dtypes" + ] + }, + { + "cell_type": "markdown", + "id": "fb599c4d-23a7-4368-a7b3-f234b42597c1", + "metadata": {}, + "source": [ + "
\n", + "The dataset contains continuous variables (e.g., `Age`, `ToM Booklet-Matched`, `FB_Composite`) and categorical variables (e.g., `AgeGroup`, `Child_Adult`, `Gender`). `ToM Booklet-Matched` represents a score on a task designed to assess Theory of Mind—the ability to attribute mental states (beliefs, desires, emotions, intentions) to oneself or others.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20965f72-f1d2-4ecb-94b4-0cfcfc8455c1", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's look at the descriptive statistics related to the `Age` variable\n", + "print(participants['Age'].describe())" + ] + }, + { + "cell_type": "markdown", + "id": "60726fed-f93c-4639-ac98-85ffb6ec8e80", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "#### Histogram\n", + "\n", + "A histogram allows us to visualize the distribution of a given variable in a discrete manner by grouping its values into consecutive intervals (bins). This provides the frequency (i.e., the number of observations) within each of these intervals.\n", + "\n", + "You can generate a histogram in Matplotlib using the [hist function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c332d133-f1a8-42b9-a29f-b20f8e73f9c7", + "metadata": { + "jp-MarkdownHeadingCollapsed": true + }, + "outputs": [], + "source": [ + "# Let's visualize the distribution of `Age`\n", + "plt.hist(participants['Age'])\n", + "# Adding a title\n", + "plt.title(\"Age Distribution\")\n", + "# Adding a title for he x and y label\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')" + ] + }, + { + "cell_type": "markdown", + "id": "eca45e76-c80e-45b7-a19d-71914416945d", + "metadata": {}, + "source": [ + "
\n", + "The science behind the number of bins – Part 1\n", + "
The `hist` function in `matplotlib` groups data into 10 bins by default. However, an inappropriate number of bins (either too few or too many) will not represent the distribution accurately.\n", + "
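One data-driven way to pick the number of bins is the Freedman-Diaconis rule, which sets the bin width to 2 × IQR / n^(1/3). The sketch below computes it by hand on synthetic ages (a hypothetical stand-in for the `Age` column, not the real data) and compares the result with NumPy's `histogram_bin_edges`, which, to our knowledge, is what `matplotlib` relies on when `bins` is given as a string.

```python
import numpy as np

# Hypothetical stand-in for the `Age` column (synthetic values, not the real data)
rng = np.random.default_rng(0)
ages = rng.uniform(3, 40, size=155)

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
q75, q25 = np.percentile(ages, [75, 25])
bin_width = 2 * (q75 - q25) * ages.size ** (-1 / 3)
n_bins_manual = int(np.ceil((ages.max() - ages.min()) / bin_width))

# NumPy applies the same rule when given the string 'fd'
edges_fd = np.histogram_bin_edges(ages, bins='fd')
# Sturges' rule is shown for comparison; it only depends on the sample size
edges_sturges = np.histogram_bin_edges(ages, bins='sturges')

print(f"Freedman-Diaconis (by hand): {n_bins_manual} bins")
print(f"Freedman-Diaconis (NumPy): {len(edges_fd) - 1} bins")
print(f"Sturges (NumPy): {len(edges_sturges) - 1} bins")
```

Different rules can suggest quite different bin counts for the same data, so it remains worth looking at the resulting histogram rather than trusting a single rule.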
" + ] + }, + { + "cell_type": "markdown", + "id": "2de74802-65da-4fa8-8267-2c36f94555d0", + "metadata": {}, + "source": [ + "
\n", + "Bins in practice!\n", + "
Modify the value of the `bins` parameter in the cell below to determine how many bins would most accurately reflect our `Age` variable.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dcd2ae46-2746-4a0a-b3b8-e19c811ac281", + "metadata": {}, + "outputs": [], + "source": [ + "# Change the value of `bins`\n", + "plt.hist(participants['Age'], bins=10)\n", + "# Adding a title\n", + "plt.title(\"Age Distribution\")\n", + "# Adding labels for the x and y axes\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')" + ] + }, + { + "cell_type": "markdown", + "id": "8c227775-9b11-4d3f-8576-78c45d00c174", + "metadata": {}, + "source": [ + "
\n", + "The science behind the number of bins – Part 2\n", + "
Several different rules exist for calculating the optimal number of bins, as well as the width of the intervals to be used for each. You can use these rules directly within the `matplotlib` `hist` function! Simply pass the name of a rule (e.g., `'fd'` for Freedman-Diaconis or `'sturges'`) as the `bins` parameter.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bd1a025-ef3f-4433-b781-624db50519fb", + "metadata": {}, + "outputs": [], + "source": [ + "help(plt.hist)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c9b964-315f-4b07-b3e4-5ebb0d6a75db", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's visualize the distribution of `Age`\n", + "plt.hist(participants['Age'], bins='fd')\n", + "# Adding a title\n", + "plt.title(\"Age Distribution\")\n", + "# Adding labels for the x and y axes\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')" + ] + }, + { + "cell_type": "markdown", + "id": "465b2338-d6fe-49dd-9021-493fa165ac75", + "metadata": {}, + "source": [ + "#### KDE plot\n", + "\n", + "**Kernel Density Estimation (KDE)** allows us to visualize the distribution of our variable in a continuous (rather than discrete) manner by estimating a density function. However, much like bin size in a histogram, kernel density estimation is sensitive to the bandwidth!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bdc96ada-1839-4426-b625-ddbca5fe21b2", + "metadata": {}, + "outputs": [], + "source": [ + "# To visualize the kernel density estimation, we will use the `kdeplot` function in `seaborn`\n", + "sns.kdeplot(participants['Age'])" + ] + }, + { + "cell_type": "markdown", + "id": "85ca2aac-7306-472d-a7c8-54fd3d1babd7", + "metadata": {}, + "source": [ + "
\n", + "KDE plot y axis - Part 1\n", + "
You’ve likely noticed that the y-axis values now range from 0 to 0.08 (as opposed to 0 to 50 for our histogram). In a KDE plot, the y-axis represents density, which is the probability per unit of the variable on the x-axis. In other words, it shows how \"dense\" the data is for a given x-value. Therefore, the peaks in this type of figure represent a higher density of points for a specific range of values (i.e., a higher probability of observing a value), while the troughs represent a lower density of points.\n", + "
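To make the idea of density more concrete, here is a minimal NumPy sketch of a Gaussian KDE on synthetic ages (hypothetical values, not the real `Age` column). It assumes Scott's rule for the bandwidth and a hand-picked `bin_width`; it is an illustration only, not `seaborn`'s implementation. It checks that the area under the density curve is about 1, and shows that multiplying the density by the number of observations and a bin width puts it on the same scale as histogram counts.

```python
import numpy as np

# Hypothetical stand-in for the `Age` column
rng = np.random.default_rng(1)
ages = rng.uniform(3, 40, size=155)
n = ages.size

# Bandwidth from Scott's rule (an assumption made for this sketch)
bandwidth = ages.std(ddof=1) * n ** (-1 / 5)

# Evaluate the KDE on a regular grid that extends past the data on both sides
grid = np.linspace(ages.min() - 5 * bandwidth, ages.max() + 5 * bandwidth, 2000)
z = (grid[:, None] - ages[None, :]) / bandwidth
# Average of Gaussian kernels centred on each observation
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density integrates to 1, which is why its values are so small
dx = grid[1] - grid[0]
area = density.sum() * dx

# Rescaling by n * bin_width moves the curve to the "counts" scale of a histogram
bin_width = 2.0  # hypothetical bin width
count_scale_curve = density * n * bin_width

print(f"Peak density: {density.max():.3f}")
print(f"Area under the density curve: {area:.3f}")
print(f"Peak of the rescaled curve: {count_scale_curve.max():.1f}")
```

The density values depend on the units of the x-axis: expressing age in months instead of years would divide them by 12, while the area under the curve would stay at 1.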
" + ] + }, + { + "cell_type": "markdown", + "id": "d45d0798-be51-4176-9e3d-85b49577f8db", + "metadata": {}, + "source": [ + "
\n", + "Take note!\n", + "
Since this type of graph provides a continuous estimation, it might suggest that certain data points exist when, in reality, they do not.\n", + "
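The sketch below illustrates this pitfall with a hand-written Gaussian KDE. The ages are hypothetical and deliberately leave a gap between 5 and 30 years, and the bandwidth follows Scott's rule, which is an assumption of this example.

```python
import numpy as np

# Hypothetical sample with nobody aged between 5 and 30
ages = np.array([3.0, 3.5, 4.0, 4.5, 5.0, 30.0, 31.0, 32.0])
bandwidth = ages.std(ddof=1) * ages.size ** (-1 / 5)  # Scott's rule (assumed)


def kde(x):
    """Gaussian kernel density estimate at a single value x."""
    z = (x - ages) / bandwidth
    norm = ages.size * bandwidth * np.sqrt(2 * np.pi)
    return float(np.exp(-0.5 * z**2).sum() / norm)


density_at_cluster = kde(4.0)   # where most observations are
density_in_gap = kde(17.5)      # no observation anywhere near this age
density_below_min = kde(0.0)    # younger than anyone in the sample

print(f"Density at 4 years: {density_at_cluster:.4f}")
print(f"Density at 17.5 years: {density_in_gap:.4f}")
print(f"Density at 0 years: {density_below_min:.4f}")
```

The estimate is strictly positive in the gap and below the youngest participant, even though the sample contains no such values, so it helps to look at the raw data alongside a KDE.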
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92db2e5e-a084-441a-85ac-5a837018d35b", + "metadata": {}, + "outputs": [], + "source": [ + "# We can also overlap a histogram with a kde plot in `seaborn` using the `histplot` function\n", + "sns.histplot(\n", + " participants['Age'], \n", + " kde=True, \n", + " bins='fd', \n", + " edgecolor=None\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5c5d021b-4948-48ea-b62b-78632220a5f4", + "metadata": {}, + "source": [ + "
\n", + "KDE plot y axis - Part 2\n", + "
When overlaying a histogram with a KDE using `seaborn`, you will notice that the y-axis shows the frequency. However, the KDE curve remains a density curve. This occurs because `seaborn` scales the density curve to match the histogram by multiplying the curve by the number of observations and the bin width.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "8e8d64cd-6efe-4e03-ae67-73c03a2fbc3c", + "metadata": {}, + "source": [ + "#### Univariate strip plot\n", + "\n", + "**Histograms** and **KDE plots** allow us to visualize a variable's distribution in a discrete or continuous manner, respectively. However, these types of visualizations do not allow us to see the raw data itself.\n", + "\n", + "The **strip plot** allows us to visualize each individual data point. This can make it easier to identify the presence of outliers within our data. However, a strip plot is not suitable if we have too many data points, because the points start to overlap." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16d41fcd-dc9b-45e5-bf9f-a76964422182", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "69d2b0df-522a-495f-9eb1-aebe6711f8b1", + "metadata": {}, + "source": [ + "
\n", + "Strip plot reproducibility\n", + "
Try to reproduce the strip plot for the `Age` variable in the two cells below. What do you notice?\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e05fc9b4-700b-4720-84c0-d4db11e062a3", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "689c73ec-a827-4d98-a225-8fe5c0842bd2", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fac872ac-025d-44b7-adec-9d77afe6c2f2", + "metadata": {}, + "source": [ + "
\n", + "The `jitter` parameter\n", + "
You might have noticed that the two strip plots you generated from the same variable are not exactly identical. This happens because `seaborn` uses `numpy.random` to calculate the jitter. To make the jitter calculation reproducible, you can set a seed beforehand.\n", + "
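The effect of the seed can be seen without any plotting. In the sketch below, `draw_jitter` is a hypothetical helper that draws random offsets with `numpy.random`; the offset range is arbitrary and is not meant to reproduce how `seaborn` computes its jitter.

```python
import numpy as np


def draw_jitter(n_points, seed=None):
    """Draw one random offset per data point (hypothetical helper)."""
    if seed is not None:
        # Resetting the global seed makes the next draws repeatable
        np.random.seed(seed)
    return np.random.uniform(-0.1, 0.1, size=n_points)


# Without a seed, two consecutive draws differ
unseeded_a = draw_jitter(5)
unseeded_b = draw_jitter(5)

# With the same seed set right before each draw, the offsets are identical
seeded_a = draw_jitter(5, seed=12)
seeded_b = draw_jitter(5, seed=12)

print("Unseeded draws identical:", np.array_equal(unseeded_a, unseeded_b))
print("Seeded draws identical:", np.array_equal(seeded_a, seeded_b))
```

Note that the seed has to be set right before each call: any other random draw made in between shifts the sequence and changes the offsets.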
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84c59cd1-d49e-422a-9155-a314b3bd7453", + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(12)\n", + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68f63b4d-6662-457e-96c8-ed8054169d79", + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(12)\n", + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a9df3eac-a369-49e7-8595-53b53f5cf41d", + "metadata": {}, + "source": [ + "#### Bar plots\n", + "\n", + "The **bar plot** allows you to visualize and compare categorical variables by showing the frequency of different values or simply the values themselves. This type of graph is useful if you want to, for example, compare the number of people per group." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71c0dd71-b9e3-48a4-a072-9b17eac911c7", + "metadata": {}, + "outputs": [], + "source": [ + "participants['AgeGroup'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97616227-7a10-4f63-b1f3-2a35feb4eb3a", + "metadata": {}, + "outputs": [], + "source": [ + "participants['AgeGroup'].value_counts().plot(kind='bar')\n", + "plt.ylabel('Counts')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e958cc7-8381-470f-a570-72335dc5fef3", + "metadata": {}, + "outputs": [], + "source": [ + "order = sorted(participants.AgeGroup.unique())\n", + "\n", + "sns.countplot(\n", + " data=participants,\n", + " x='AgeGroup',\n", + " order=order,\n", + " hue='Gender',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "56d7fc7c-704c-4079-98f6-ac5ef455610f", + "metadata": {}, + "source": [ + "### Summary\n", + "\n", + "| Graph type | Data type | Summary |\n", + "| --- | --- | --- |\n", + "| Histogram | Continuous | To visualize our data distribution in a discrete way |\n", + 
"| KDE plot | Continuous | To visualize our data distribution in a continuous way |\n", + "| Strip plot | Continuous | To visualize each data point individually |\n", + "| Bar plot | Categorical | To compare frequencies between groups/categories |\n" + ] + }, + { + "cell_type": "markdown", + "id": "88f04f1d-183d-40e7-ba61-95f2433e3d42", + "metadata": {}, + "source": [ + "### Bivariate Visualizations\n", + "\n", + "For a **continuous variable** x **continuous variable**, we can use:\n", + "- Scatter plot\n", + "- Bivariate KDE\n", + "- Hexplot\n", + "- Joint plot\n", + "- Heat map\n", + "\n", + "For a **categorical variable** x **continuous variable**, we can use:\n", + "- Box plot\n", + "- Violin plot\n", + "- Scatter plot\n", + "- Point plot\n", + "- Rain cloud plot\n", + "- Bar plot... (?)" + ] + }, + { + "cell_type": "markdown", + "id": "034b2e27-ccc2-40bb-9412-f26d5689d692", + "metadata": {}, + "source": [ + "#### Scatter plot - Continuous variable x continuous variable" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eae06367-0a2a-460a-96d9-cb773e0fa2ec", + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "plt.xlabel('Age')\n", + "plt.ylabel('ToM Booklet-Matched')" + ] + }, + { + "cell_type": "markdown", + "id": "8e4fa34f-9852-432b-829b-8885396a175c", + "metadata": {}, + "source": [ + "
\n", + "What do you notice?\n", + "
Take the time to observe the figure that we have just generated.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45922286-dd55-4a5a-b9eb-3b39a2fbe615", + "metadata": {}, + "outputs": [], + "source": [ + "participants.groupby(['Child_Adult'])['ToM Booklet-Matched'].mean()" + ] + }, + { + "cell_type": "markdown", + "id": "7824b776-175f-4c46-b14f-fa56025616e3", + "metadata": {}, + "source": [ + "
\n", + "Missing values\n", + "
Both the `scatter` function in `matplotlib` and the `regplot` function in `seaborn` automatically ignore observations with missing values.\n", + "
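Because this happens silently, it can be useful to count how many observations are actually drawn. The sketch below uses hypothetical values in which the two oldest participants have no score; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: the two oldest participants have no score
age = np.array([3.5, 4.2, 5.1, 7.8, 9.0, 24.0, 31.5])
score = np.array([0.20, 0.35, 0.50, 0.80, 0.90, np.nan, np.nan])

# A point can only be drawn if both of its coordinates are available
keep = ~(np.isnan(age) | np.isnan(score))

n_total = age.size
n_plotted = int(keep.sum())
print(f"{n_plotted} of {n_total} observations have both values")
print(f"Oldest participant that can be plotted: {age[keep].max()} years")
```

In this toy example the plotted age range is much narrower than the recorded one, which a figure alone would not reveal.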
" + ] + }, + { + "cell_type": "markdown", + "id": "03df6418-ec9a-41be-9d5f-2eec7cafb9ee", + "metadata": {}, + "source": [ + "
\n", + "The `regplot` function\n", + "
The `regplot` function in `seaborn` allows you to visualize the scatter plot while fitting a linear regression model to the data!\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4685cfb0-afe3-45a3-91ed-0e6417f5ceb3", + "metadata": {}, + "outputs": [], + "source": [ + "sns.regplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4a1c8517-5a9f-4e36-9a8c-d253602e9e3b", + "metadata": {}, + "source": [ + "
\n", + "Fitting the regression model\n", + "
A linear regression does not seem to be the best way to model the relationship between our `Age` variable and our `ToM Booklet-Matched` variable. Change the value of the order parameter in the regplot function to 2 to check the fit of this model.\n", + "
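To see what changing the polynomial order does numerically, the sketch below fits a straight line and a second-order polynomial with `np.polyfit` and compares their residuals. The data are synthetic (a saturating curve plus noise, chosen only for illustration), and we do not claim that this mirrors `seaborn`'s internal fitting code.

```python
import numpy as np

# Synthetic data: a score that rises quickly with age and then levels off
rng = np.random.default_rng(3)
age = rng.uniform(3, 12, size=100)
score = 1 - np.exp(-0.5 * (age - 2)) + rng.normal(0, 0.05, size=100)


def sum_squared_residuals(order):
    """Fit a polynomial of the given order and return its squared error."""
    coefs = np.polyfit(age, score, deg=order)
    fitted = np.polyval(coefs, age)
    return float(np.sum((score - fitted) ** 2))


sse_linear = sum_squared_residuals(1)
sse_quadratic = sum_squared_residuals(2)

print(f"Sum of squared residuals, order 1: {sse_linear:.3f}")
print(f"Sum of squared residuals, order 2: {sse_quadratic:.3f}")
```

A higher order can never fit the same data worse in the least-squares sense, so a smaller residual alone does not prove that the more flexible model is the better description.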
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f791b6f2-eafb-44c7-beec-0c4fa479777e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.regplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " order=2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "80cf7823-9d62-4c19-ba13-0211f9d71885", + "metadata": {}, + "source": [ + "
\n", + "What if we add another variable?\n", + "
We can visualize interactions between multiple variables via the `hue` parameter in the `lmplot` function in `seaborn`. In the cell below, we will explore the relation between `Age` (x axis) and `ToM Booklet-Matched` (y axis) based on the gender (hue).\n", + "
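Conceptually, grouping by `hue` amounts to fitting one regression line per group. The sketch below does this by hand with `np.polyfit` on synthetic data; the group labels, slope, and noise level are all hypothetical.

```python
import numpy as np

# Synthetic data with two hypothetical groups
rng = np.random.default_rng(4)
age = rng.uniform(3, 12, size=60)
group = np.where(np.arange(60) % 2 == 0, 'F', 'M')
score = 0.1 * age + rng.normal(0, 0.05, size=60)

# Fit one straight line per group
fits = {}
for level in np.unique(group):
    in_group = group == level
    slope, intercept = np.polyfit(age[in_group], score[in_group], deg=1)
    fits[str(level)] = (float(slope), float(intercept))

for level, (slope, intercept) in fits.items():
    print(f"{level}: score = {slope:.3f} * age + {intercept:.3f}")
```

Here both groups share the same underlying slope by construction, so the two fitted lines should be close; clearly different slopes between groups would hint at an interaction.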
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07ddd9fb-1611-46cc-b21c-5584d4406346", + "metadata": {}, + "outputs": [], + "source": [ + "sns.lmplot(\n", + " x='Age', \n", + " y='ToM Booklet-Matched',\n", + " data=participants,\n", + " hue='Gender'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "511d7c87-1965-441b-a97a-8dce2cac0255", + "metadata": {}, + "source": [ + "#### Bivariate KDE plot and hex plot - Continuous Variable x continuous Variable\n", + "\n", + "When we have a large number of data points and want to visualize the distribution of points across two variables, we risk having significant overplotting. In this case, it can be difficult to properly visualize the distribution of our data with a scatter plot. It is therefore possible to use other types of graphs:\n", + "\n", + "The **bivariate KDE plot** allows us to visualize how two variables are distributed in a two-dimensional space. Each contour represents a density zone. The closer the contours are, the higher the density—meaning where the data is most concentrated.\n", + "\n", + "The **hex plot** allows us to visualize the data point density in a discrete manner. It is essentially the equivalent of a histogram, but for visualizing two variables instead of one. Darker areas represent zones of high density.\n", + "\n", + "⚠ Bivariate kernel density estimation is more computationally demanding compared to the hex plot." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "257d7f46-88f9-45ea-9ae3-9363f17b5b73", + "metadata": {}, + "outputs": [], + "source": [ + "sns.kdeplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " fill=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99ff567f-0637-4164-8925-852c797cb4c1", + "metadata": {}, + "outputs": [], + "source": [ + "sns.jointplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " kind='hex'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77fff24f-fe04-4fa1-9cdf-88355c5a8ee8", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_jointplot(kind):\n", + " sns.jointplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " kind=kind\n", + " )\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_jointplot,\n", + " kind=widgets.Dropdown(\n", + " options=['hex', 'reg', 'kde', 'scatter'],\n", + " description='Type:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57b35514-a4c9-4c35-9dd6-e6445026e1ac", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_jointplot(kind):\n", + " mean, cov = [0, 1], [(1, .5), (.5, 1)]\n", + " x, y = np.random.multivariate_normal(mean, cov, 1000).T\n", + " \n", + " sns.jointplot(\n", + " x=x, \n", + " y=y,\n", + " kind=kind\n", + " )\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_jointplot,\n", + " kind=widgets.Dropdown(\n", + " options=['scatter', 'hex', 'reg', 'kde'],\n", + " description='Type:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "ab3e0b7c-df14-4aa0-b259-794e29af2687", + "metadata": {}, + "source": [ + "#### Strip plot - Continuous variable x categorical variable\n", + "\n", + "We have already discussed the **strip plot** in the context of univariate visualization, but we can also use it 
to look at multiple groups/categories simultaneously." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df0a447c-3f56-4681-9d60-7f2527cbc39a", + "metadata": {}, + "outputs": [], + "source": [ + "order = sorted(participants.AgeGroup.unique())[:-1]\n", + "\n", + "sns.stripplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "88888956-20fc-428a-be59-58a29bc50cef", + "metadata": {}, + "source": [ + "#### Box plot - Continuous variable x categorical variable\n", + "\n", + "\n", + "Box plots allow us to visualize the distributions of one or multiple groups of continuous variables (e.g., age distributions across different experimental groups). The different components of the box plot represent different descriptive statistics:\n", + "- The line in the box represents the median.\n", + "- The edges of the box (the lower quartile Q1 and the upper quartile Q3) delimit the range where the middle 50% of the data is located.\n", + "- The whiskers (i.e., the lines outside of the box) capture the range of the remaining non-outlier data.\n", + "- The points represent outliers, i.e., values greater than Q3 + 1.5 x interquartile range (IQR = Q3 - Q1) or lower than Q1 - 1.5 x IQR." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67371457-ec27-4c1a-9481-e1f35b42d31e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.boxplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8b37171f-00e5-488c-a326-2289e724b14f", + "metadata": {}, + "source": [ + "#### Violin plot - Continuous variable x categorical variable\n", + "\n", + "**Violin plots** allow us to visualize the data distribution by using density curves (aka **kde** curves). 
The width of each curve corresponds to the estimated density of the points at each value on the y-axis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9a38695-6210-448a-8ca4-f9940388af28", + "metadata": {}, + "outputs": [], + "source": [ + "sns.violinplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "15efa9b0-e35c-4997-a20d-330ab3ad487a", + "metadata": {}, + "source": [ + "#### Raincloud - Continuous variable x categorical variable\n", + "\n", + "Why choose between a strip plot, a box plot, and a violin plot when we can have them all at once? That's what **raincloud plots** allow us to do.\n", + "\n", + "*Note:* this type of graph is not natively integrated into `seaborn` or `matplotlib`. We will need to use the `ptitprince` library that we imported at the beginning of the tutorial (`import ptitprince as pt`).\n", + "\n", + "*Resource:* for more examples using raincloud plots, please refer to this [ptitprince tutorial](https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c04c3c78-75cd-47dd-9708-93a2344fb54d", + "metadata": {}, + "outputs": [], + "source": [ + "pt.RainCloud(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order,\n", + " bw=0.6\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d6128c77-7e2c-41ad-aa46-6df0019cfd90", + "metadata": {}, + "source": [ + "#### Point plot - Continuous variable x categorical variable\n", + "\n", + "The **point plot** allows us to compare means (or any other descriptive statistic) between groups while showing the uncertainty (e.g., 95% CI, standard deviation, etc.). 
This type of graph is useful to show trends between different categories or between different time points (e.g., if we have longitudinal measurements)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3b7f45b-6493-404b-bf56-1f5721d354e9", + "metadata": {}, + "outputs": [], + "source": [ + "sns.pointplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " estimator='mean',\n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order,\n", + " errorbar=('ci', 95)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6d6cae11-c355-4da1-8b13-f036366719bd", + "metadata": {}, + "source": [ + "#### Bar plots for bivariate visualizations?\n", + "\n", + "You may have already seen bar plots used to represent a continuous variable. However, this type of visualization is not recommended for this kind of variable:\n", + "- It only allows for the visualization of certain descriptive statistics (e.g., the mean) without providing any information about the distribution of our variable.\n", + "- If you absolutely must use a bar plot, overlay it with a scatter plot!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13184fef-746a-44cf-a3d0-4eebfc25458e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.barplot(\n", + " x='AgeGroup',\n", + " y = 'ToM Booklet-Matched',\n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " order = order, palette='Blues'\n", + ")\n", + "\n", + "\n", + "sns.stripplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched',\n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " jitter=True,\n", + " order = order, color = 'black')\n" + ] + }, + { + "cell_type": "markdown", + "id": "8a976f2a-4944-4b7b-a4dd-71f3a852e3ca", + "metadata": {}, + "source": [ + "## And what if we added a little color to our figures?" 
+ ] + }, + { + "cell_type": "markdown", + "id": "68e1f171-876e-4c86-8b24-33917a0b1f37", + "metadata": {}, + "source": [ + "### Perceptually Uniform vs. Non-Uniform Color Palettes\n", + "- Colors are perceived based on their hue (orange, red, green, etc.) and their luminosity (lightness vs. darkness of a hue).\n", + "- The characteristics of our photoreceptors mean that we do not process the light spectrum uniformly.\n", + "- The majority of the photoreceptors that allow us to see colors (cones) process long wavelengths (i.e., red, orange, yellow).\n", + "- Therefore, we do not perceive variations in green-blue hues as well as we do perceive variations in yellow-red hues." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b21cf7da-33cb-4334-93f4-9f1a9a4fa476", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['hsv']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eafe14c7-4b28-4ea4-be59-2aec8530738f", + "metadata": {}, + "outputs": [], + "source": [ + "cm.batlow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eef2848b-7792-4c1f-aaf3-ccc339a129eb", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from PIL import Image\n", + "from io import BytesIO" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7611dee9-0528-4899-ab4c-1bda9a658206", + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.get('https://raw.githubusercontent.com/matplotlib/matplotlib/main/doc/_static/stinkbug.png')\n", + "img = np.asarray(Image.open(BytesIO(response.content)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8735a4ea-0afe-46f6-8889-70d24a0dca48", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_img(cmap):\n", + " lum_img = img[:, :, 0]\n", + " if cmap == 'noir et blanc':\n", + " plt.imshow(img)\n", + " elif cmap == 'hsv':\n", + " plt.imshow(lum_img, cmap=\"hsv\")\n", + " elif cmap == 'batlow':\n", + " 
plt.imshow(lum_img, cmap=cm.batlow)\n", + "\n", + "widgets.interact(\n", + " plot_img,\n", + " cmap=widgets.Dropdown(\n", + " options=['noir et blanc', 'hsv', 'batlow'],\n", + " value='noir et blanc',\n", + " description='Palette:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "f77abfa1-7763-457f-a22a-9c6a75ac7d1f", + "metadata": {}, + "source": [ + "
\n", + "What do you notice?\n", + "
Compare the colored image using the `hsv` and `batlow` color palettes with the black and white (grayscale) version.\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "fef3aad2-789e-4620-a780-037a7800dbe4", + "metadata": {}, + "source": [ + "
\n", + "You don't have to throw away the rainbow\n", + "
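Google developed an improved rainbow colormap called `turbo`, which is available in `matplotlib`. It is smoother and more perceptually consistent than classic rainbow colormaps such as `jet`, although it is not strictly perceptually uniform. For more details about this colormap, please refer to this Google Research blog.\n", + "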
" + ] + }, + { + "cell_type": "markdown", + "id": "012e371d-c345-4963-9aae-ad511922c297", + "metadata": {}, + "source": [ + "### Discrete vs. Continuous Color Palettes\n", + "\n", + "Color palettes can be either continuous (like those we saw above) or discrete. Discrete color palettes are used to visualize categorical variables—where categories have no inherent order (e.g., Children with COVID vs. children without COVID vs. adults with COVID vs. adults without COVID)." + ] + }, + { + "cell_type": "markdown", + "id": "84fb045d-b59c-427f-9f9f-56fe762a5a57", + "metadata": {}, + "source": [ + "#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b726eb33-b475-4aff-bfab-c1773e1683c2", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['Set2']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a094205a-c558-49d5-9625-86a7df7357a7", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['Set3']" + ] + }, + { + "cell_type": "markdown", + "id": "09d84ac8-02cf-463c-8c72-32d45b3bac17", + "metadata": {}, + "source": [ + "#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad365aaf-eb1f-4874-8958-7cb5ec561ea5", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette('rocket', 10)" + ] + }, + { + "cell_type": "markdown", + "id": "4aa57803-e66a-45eb-872c-261623c47429", + "metadata": {}, + "source": [ + "#### [cmcrameri](https://s-ink.org/scientific-colour-maps)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df1db6fb-6ad3-4f4e-9375-b3dc915eff63", + "metadata": {}, + "outputs": [], + "source": [ + "cm.batlow.resampled(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9dd84b40-8d10-43d1-8a5f-71a8e2a03adf", + "metadata": {}, + "outputs": [], + "source": [ + "cm.lipari.resampled(10)" 
+ ] + }, + { + "cell_type": "markdown", + "id": "27cc49be-02ac-4ace-a104-752878b6139e", + "metadata": {}, + "source": [ + "
\n", + "Discretizing Continuous Palettes\n", + "
In the cells above, we saw that it is possible to discretize a continuous palette by specifying the number of colors we want. However, as mentioned, discrete palettes are typically used to visualize categories that have no inherent order. Therefore, using an ordered discrete palette is not always necessary, and the next cell shuffles the colors to remove that ordering.\n", + "
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "feb640e9-10aa-4d28-9605-16e18b484b0d", + "metadata": {}, + "outputs": [], + "source": [ + "nb_colors = 10\n", + "#np.random.seed(10)\n", + "\n", + "original_cmap = cm.lipari.resampled(nb_colors)\n", + "\n", + "original_colors = original_cmap(np.arange(nb_colors))\n", + "\n", + "shuffled_colors = original_colors.copy()\n", + "np.random.shuffle(shuffled_colors)\n", + "\n", + "colors.ListedColormap(shuffled_colors)" + ] + }, + { + "cell_type": "markdown", + "id": "f49a1daf-c542-45b1-8ffc-2a43e50da03a", + "metadata": {}, + "source": [ + "### Diverging color palettes\n", + "\n", + "Diverging color palettes are useful when we have an interpretable central value. Diverging palettes can be applied to both discrete and continuous scales. For example:\n", + "- **Discrete diverging palettes**: If we have data collected using a Likert scale. Our values might range from 'Strongly Disagree' to 'Strongly Agree,' with 'Neutral' as the central value. In this case, we could visualize the values on the left (from 'Strongly Disagree' to 'Neutral') in shades of blue and the values on the right (from 'Neutral' to 'Strongly Agree') in shades of orange/red.\n", + "- **Continuous diverging palettes**: If we have negative and positive values (e.g., correlation coefficients), we could use zero as the central value. Negative values could be represented in shades of blue and positive values in shades of orange/red." 
+ ] + }, + { + "cell_type": "markdown", + "id": "cca903e7-4cf5-4499-9e35-02f5dcd8600d", + "metadata": {}, + "source": [ + "#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bf82533-bdd6-4361-9feb-44519d62d569", + "metadata": {}, + "outputs": [], + "source": [ + "# Discrete diverging palette\n", + "colormaps['bwr'].resampled(9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e54cd44-d098-4487-9182-b9653620db9a", + "metadata": {}, + "outputs": [], + "source": [ + "# Continuous diverging palette\n", + "colormaps['bwr']" + ] + }, + { + "cell_type": "markdown", + "id": "5e03d59c-ba94-4b20-9779-d1d9fc710b4c", + "metadata": {}, + "source": [ + "#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46ba2353-d279-4de2-8726-7ab94132f221", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"coolwarm\", 9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ceb3a71d-46eb-4777-99ae-6c00159236f7", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"coolwarm\", as_cmap=True)" + ] + }, + { + "cell_type": "markdown", + "id": "3d4b7e41-7ffa-4aa9-a7ad-db81756df281", + "metadata": {}, + "source": [ + "#### [cmcrameri](https://s-ink.org/scientific-colour-maps)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2228a30e-51ed-4a89-89f7-b96345237d9b", + "metadata": {}, + "outputs": [], + "source": [ + "cm.vik.resampled(9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4de0331-45db-4944-9b74-b3cfe56db752", + "metadata": {}, + "outputs": [], + "source": [ + "cm.vik" + ] + }, + { + "cell_type": "markdown", + "id": "dbb42716-b3b4-47c3-85e0-453b4b27e53c", + "metadata": {}, + "source": [ + "### Universally interpretable color palettes\n", + "\n", + "We talked 
about the perceptual uniformity of color palettes, but it is also important to choose colors that everyone can distinguish. Certain color combinations make it difficult for people with color vision deficiency to tell data points apart. A few tips:\n", + "- Avoid red-green combinations, since red-green deficiency is the most common form of color blindness\n", + "- Vary lightness and saturation to make sure the data remains readable even in greyscale\n", + "- Use the colorblind-friendly palettes provided by `matplotlib`, `seaborn` and `cmcrameri`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7b4ac11-efa3-432b-b7f8-7b2f84de212d", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"colorblind\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3fc57c58-68e1-4b30-adcb-9632db371c29", + "metadata": {}, + "outputs": [], + "source": [ + "converters = {\n", + " 'deuter50_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"deuteranomaly\",\n", + " \"severity\": 50\n", + " },\n", + " 'deuter100_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"deuteranomaly\",\n", + " \"severity\": 100\n", + " },\n", + " 'prot50_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"protanomaly\",\n", + " \"severity\": 50\n", + " },\n", + " 'prot100_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"protanomaly\",\n", + " \"severity\": 100\n", + " },\n", + " 'trit50_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"tritanomaly\",\n", + " \"severity\": 50\n", + " },\n", + " 'trit100_space': {\n", + " \"name\": \"sRGB1+CVD\",\n", + " \"cvd_type\": \"tritanomaly\",\n", + " \"severity\": 100\n", + " }\n", + "}\n", + "\n", + "def show_palettes(cmap, n_colors=10):\n", + " f, ax = plt.subplots(7, 1, figsize=(10, 6)) #, layout=\"constrained\")\n", + " plt.subplots_adjust(hspace=0.6)\n", + " \n", + " gradient = 
np.arange(n_colors).reshape(1, -1)\n", + "\n", + " if cmap in ['viridis', 'coolwarm', 'colorblind', 'pastel', 'muted']:\n", + " cmap_sns = sns.color_palette(cmap, n_colors)\n", + " cmap_m = colors.ListedColormap(sns.color_palette(cmap, n_colors=n_colors))\n", + " if cmap in ['batlow', 'vik']:\n", + " cmap_m = getattr(cm, cmap).resampled(n_colors)\n", + " cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors)))\n", + " if cmap in ['bwr', 'hsv', 'PuOr', 'summer']:\n", + " cmap_m = colormaps[cmap].resampled(n_colors)\n", + " cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors)))\n", + "\n", + " # Original palette\n", + " ax[0].imshow(gradient, aspect='auto', cmap=cmap_m)\n", + " ax[0].set_title('Original')\n", + " for idx, converter in enumerate(converters.keys()):\n", + " # Palette converted to simulate the color vision deficiency\n", + " converted_palette = cspace_convert(cmap_sns, converters[converter], \"sRGB1\")\n", + " converted_palette = np.clip(converted_palette, 0, 1)\n", + "\n", + " ax[idx+1].imshow(gradient, aspect='auto', cmap=colors.ListedColormap(converted_palette))\n", + " ax[idx+1].set_title(f\"{converters[converter]['cvd_type']}-{converters[converter]['severity']}\")\n", + "\n", + " for a in ax.flatten():\n", + " a.set_yticks([])\n", + " a.set_xticks([])\n", + " \n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " show_palettes,\n", + " cmap=widgets.Dropdown(\n", + " options=['pastel', 'viridis', 'coolwarm', 'batlow', 'bwr', 'colorblind', 'hsv', 'PuOr', 'summer', 'vik', 'muted'],\n", + " value='viridis',\n", + " description='Palette:'\n", + " ),\n", + ");\n" + ] + }, + { + "cell_type": "markdown", + "id": "12287644-403b-4d4f-af4c-24df7a1aa031", + "metadata": {}, + "source": [ + "
\n", + "More than colors\n", + "
Choosing the right color palette is important, but there are also other strategies we can use to make our figures more accessible and easily interpretable. Instead of using only different colors to distinguish between categories, we could also use different marker shapes (e.g., circle vs. triangle). If our figure includes lines, for example to illustrate the trajectory of a variable over time, we could use different line styles (e.g., solid vs. dashed). For more examples, see \"The best charts for color blind viewers\".\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "3dc5003b-553d-40e7-b882-c41ae62bff96", + "metadata": {}, + "source": [ + "## Anatomy of a figure\n", + "\n", + "We have already seen how to add or modify certain elements of our figures, such as the title and axis labels. In this section, we will discuss the art of figure-making in more detail. We will cover:\n", + "- Subplots\n", + "- Spines \n", + "- Ticks\n", + "- Grid\n", + "- Legend" + ] + }, + { + "cell_type": "markdown", + "id": "4012f001-b49c-4352-8bad-87b4844f7db8", + "metadata": {}, + "source": [ + "### Subplots\n", + "\n", + "Up until now, we have primarily created our figures by directly calling certain functions from matplotlib and seaborn. However, if we want to create a figure with multiple panels (i.e., columns and/or rows), we will need to use subplots." + ] + }, + { + "cell_type": "markdown", + "id": "41de742c-fbff-49e7-9669-f778e75f185a", + "metadata": {}, + "source": [ + "#### matplotlib" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41725d76-fb65-4505-812d-721c117c126c", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's look at what we've used before. 
This code generates two separate figures.\n", + "plt.hist(participants['Age'], bins='fd')\n", + "plt.title(\"Age Distribution\")\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Frequency')\n", + "plt.show()\n", + "\n", + "plt.scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "plt.xlabel('Age')\n", + "plt.ylabel('ToM Booklet-Matched')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e58ebbf4-8f10-4858-9de8-972b35295f64", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(plt.subplots)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a9f9845-1b82-4723-bdfd-75b14590237a", + "metadata": {}, + "outputs": [], + "source": [ + "# If we want to produce a single figure with these two graphs, we will have to use subplots\n", + "# Function syntax: plt.subplots(nrows, ncols, ...)\n", + "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", + "axes[0].hist(participants['Age'], bins='fd')\n", + "axes[0].set_title(\"Age Distribution\")\n", + "axes[0].set_xlabel('Age')\n", + "axes[0].set_ylabel('Frequency')\n", + "\n", + "axes[1].scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "axes[1].set_title(\"Relationship between Age and ToM scores\")\n", + "axes[1].set_xlabel('Age')\n", + "axes[1].set_ylabel('ToM Booklet-Matched')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab909a51-6c79-414b-86cf-f14ad0898ba5", + "metadata": {}, + "outputs": [], + "source": [ + "# Subplots can also be used when we only have one graph\n", + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age Distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Frequency')" + ] + }, + { + "cell_type": "markdown", + "id": "1f84023b-a40a-4107-96a4-9170f4ac64f2", + "metadata": {}, + "source": [ + "#### seaborn" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"97aa3481-127b-49dc-ba4b-3869ee7bccbc", + "metadata": {}, + "outputs": [], + "source": [ + "# With seaborn\n", + "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", + "\n", + "order = sorted(participants[participants.AgeGroup!='Adult']['AgeGroup'].unique())\n", + "#order.sort()\n", + "\n", + "sns.boxplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " order=order,\n", + " ax=axes[0]\n", + ")\n", + "\n", + "sns.violinplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " order=order,\n", + " ax=axes[1]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b8684899-02a2-47f3-ba98-dd524fd0211f", + "metadata": {}, + "source": [ + "### Spines\n", + "Spines (or borders) are the lines around the plotting area." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1f875ee-d881-469c-98ad-6971db15c32d", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age Distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Frequency')\n", + "\n", + "for spine in ax.spines.values():\n", + " spine.set_color(\"red\")\n", + " spine.set_linewidth(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f51d8b30-3181-4812-ae57-2bac77001efd", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age Distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Frequency')\n", + "\n", + "ax.spines[['right', 'top', 'left', 'bottom']].set_visible(False) # Single spine: ax.spines.top.set_visible(False)" + ] + }, + { + "cell_type": "markdown", + "id": "e8d250f6-6e14-46ba-8559-0827304b01d4", + "metadata": {}, + "source": [ + "### Ticks" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "id": "b4ab9580-9f60-45e4-9915-733d806b14f7", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age Distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Frequency')\n", + "\n", + "ax.tick_params(\n", + " axis='both', # Modification applied to both axes (x and y)\n", + " which='major', # Applies to 'major', 'minor' or 'both' ticks\n", + " length=10, # Tick length\n", + " width=2, # Tick width\n", + " color='purple', # Tick color\n", + " labelsize=12, # Label size\n", + " labelcolor='darkorange', # Label color\n", + " labelrotation=45 # Label rotation (degrees)\n", + ")\n", + "\n", + "#from matplotlib.ticker import MultipleLocator\n", + "#ax.xaxis.set_minor_locator(MultipleLocator(2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd1952eb-ad95-4d44-97f5-8bd486549769", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age Distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Frequency')\n", + "\n", + "ax.tick_params(\n", + " axis='x',\n", + " which='both', \n", + " bottom=False, \n", + " top=False, \n", + " labelbottom=False)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "eee08e4e-2473-4075-baf3-efb25dbe6899", + "metadata": {}, + "source": [ + "
\n", + "Modify the code\n", + "
Based on the code provided in the previous cells, modify the cell below to generate a figure with the following characteristics:\n", + "
  • \n", + " No top or right spines\n", + "
  • \n", + "
  • \n", + " No tick marks on the y-axis\n", + "
  • \n", + "
  • \n", + " Minor ticks on the x-axis at intervals of 1\n", + "
  • \n", + "
  • \n", + " X-axis labels rotated to 90 degrees\n", + "
  • \n", + "
  • \n", + " Titles for all axes as well as for the figure\n", + "
  • " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b755dea-75e4-41c3-82e5-57d0ce1571a4", + "metadata": {}, + "outputs": [], + "source": [ + "# Add your code here\n", + "# ..." + ] + }, + { + "cell_type": "markdown", + "id": "c9f7193a-d10f-491d-81d5-7b0226810bb7", + "metadata": {}, + "source": [ + "### Grid" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58c7a538-2139-483b-8f77-d3aa7d016b25", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age Distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Frequency')\n", + "\n", + "\n", + "ax.grid(\n", + " True,\n", + " color='gray',\n", + " linestyle='--',\n", + " linewidth=1\n", + ")\n", + "\n", + "ax.set_axisbelow(True) # Draw the grid below the plotted data (similar in effect to lowering its `zorder`)" + ] + }, + { + "cell_type": "markdown", + "id": "9750cf87-7b19-40e5-9e94-1a6c8fd58f15", + "metadata": {}, + "source": [ + "### Legend" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19b8c1c5-70eb-4f1d-abce-f208a97e2b42", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "sns.scatterplot(data=participants, x='Age', y='ToM Booklet-Matched', hue='Gender', ax=ax)\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('ToM Booklet-Matched')\n", + "\n", + "legend = ax.legend(fontsize=12, loc='lower right')\n", + "\n", + "legend.get_frame().set_edgecolor(\"black\")\n", + "legend.get_frame().set_linewidth(2)\n", + "legend.get_frame().set_facecolor(\"white\")" + ] + }, + { + "cell_type": "markdown", + "id": "15fb179d-a48e-4267-9654-59e3543f1528", + "metadata": {}, + "source": [ + "### Default parameters\n", + "\n", + "The default parameters for all elements of our figures (e.g., font, text size, line width, etc.) are defined in the `rcParams` object. 
These values can be modified, allowing us to apply a consistent figure style from one figure to another." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ec3305f-d81d-4c18-8086-6a7573aace18", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "plt.rcParams" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8647130-227d-406f-9100-0fab187465c6", + "metadata": {}, + "outputs": [], + "source": [ + "STYLE = {\n", + " \"font.cursive\": \"Comic Sans MS\",\n", + " \"font.size\": 8,\n", + " \"axes.linewidth\": 1.2,\n", + " \"axes.grid\": True,\n", + " \"axes.grid.axis\": 'both',\n", + " 'axes.axisbelow': True,\n", + " \"figure.dpi\": 150\n", + "}\n", + "\n", + "plt.rcParams.update(STYLE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d965ff6-cad7-4d68-8353-77d87ebe8bfc", + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age Distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Frequency')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e84e6d32-3d96-4782-ba09-67a04a692464", + "metadata": {}, + "outputs": [], + "source": [ + "# To reset the default values\n", + "plt.rcdefaults()" + ] + }, + { + "cell_type": "markdown", + "id": "c272b2dd-e0d0-42e6-b0a3-f2ef53852154", + "metadata": {}, + "source": [ + "
    \n", + "Choose your style\n", + "
    It is also possible to use predefined styles using the `style.use` function in `matplotlib`. You can consult the list of available styles here. See also the matplotlib documentation to learn more about style sheets and `rcParams`.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3e7bef12-e3d7-4fee-9ab2-74a1911c6fa7", + "metadata": {}, + "outputs": [], + "source": [ + "print(plt.style.available)" + ] + }, + { + "cell_type": "markdown", + "id": "00dbfb21-c6b7-4dee-ad67-aa6db3feca0d", + "metadata": {}, + "source": [ + "
    \n", + "For example, 'ggplot' is a style that we can use to obtain figures similar to those generated by the `ggplot` library in R.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "498fa45c-2693-4364-aa26-1b88a58b745c", + "metadata": {}, + "outputs": [], + "source": [ + "plt.style.use('ggplot')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80353500-b6d6-4362-a63a-4a2d70164616", + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age Distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Frequency')" + ] + }, + { + "cell_type": "markdown", + "id": "f21c21df-af8d-4735-83d2-885fe5f8c46b", + "metadata": {}, + "source": [ + "## Interactive graphs\n", + "\n", + "Static figures are good, but interactive figures are better! In fact, interactive graphics offer us the opportunity to explore our data in ways that wouldn't be possible with static figures and to create dashboards (see [documentation](https://dash.plotly.com/?_gl=1*j1jto0*_gcl_au*MTY3MTkxNTc4MS4xNzczOTMwNTAw*_ga*MTM4NzMyMTAxMy4xNzczOTMwNTAy*_ga_6G7EE0JNSC*czE3NzM5MzA1MDIkbzEkZzAkdDE3NzM5MzA1MTAkajUyJGwwJGgw)). We can add information to our figures without cluttering them.\n", + "\n", + "The two main Python libraries that allow us to create interactive figures are `plotly` and `bokeh`. The `plotly` library offers a high-level interface (`plotly.express`) that allows us to create figures in a single line of code, while `bokeh` offers more customization options. 
In this tutorial, we will only discuss `plotly.express`, but if you want to try `bokeh`, you can consult [the documentation](https://docs.bokeh.org/en/latest/)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18a2b0a9-8299-4887-9207-255f679babb0", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(px.scatter)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfab865d-3208-40a3-84f7-ea2c52f97d90", + "metadata": {}, + "outputs": [], + "source": [ + "fig = px.scatter(\n", + " data_frame=participants, \n", + " x='Age', \n", + " y='ToM Booklet-Matched',\n", + " hover_data=['participant_id', 'Gender'],\n", + " color='Handedness',\n", + " symbol='Handedness'\n", + ")\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcf186a9-7b00-44ae-8843-5cfa0f2a7520", + "metadata": {}, + "outputs": [], + "source": [ + "participants.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2f9bc8e-0847-4926-b058-02ec9d192877", + "metadata": {}, + "outputs": [], + "source": [ + "fig = px.scatter(\n", + " data_frame=participants, \n", + " x='Age', \n", + " y='ToM Booklet-Matched',\n", + " hover_data=['participant_id', 'Gender'],\n", + " color='Handedness',\n", + " symbol='Handedness'\n", + ")\n", + "fig.update_xaxes(range=[int(participants['Age'].min()-1), int(participants[participants['Child_Adult']=='child']['Age'].max()+1)])\n", + "fig.update_traces(marker_size=10)\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "7828124c-a55c-4b6f-9a50-b52cd6b2c5eb", + "metadata": {}, + "source": [ + "## Visualizing brain images!\n", + "\n", + "The `nilearn` library supports multiple functions to plot brain images (see [the list of functions](https://nilearn.github.io/stable/plotting/index.html)). 
In this section, we will see some of those functions.\n", + "\n", + "---\n", + "\n", + "Based on the [MAIN tutorial](https://main-educational.github.io/intro_nilearn/machine-learning-with-nilearn.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "163b13a4-3e50-4557-91d6-696d58977595", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's get our data\n", + "data = development_dataset.func\n", + "confounds = development_dataset.confounds\n", + "pheno = pd.DataFrame(development_dataset.phenotypic)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8de6b379-c131-4eec-8af4-2afbcf9487c0", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54ece402-fcd2-4131-8968-abd847afc8f0", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's try visualizing our first file\n", + "plotting.view_img(data[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d722995-2f57-4725-9230-5aeeb4068fed", + "metadata": {}, + "outputs": [], + "source": [ + "img = nib.load(data[0])\n", + "img.shape" + ] + }, + { + "cell_type": "markdown", + "id": "ec879242-9800-4e81-afbe-142ce494cf97", + "metadata": {}, + "source": [ + "
    \n", + "Oops!\n", + "
    If we try to visualize our BOLD images, we get an error! This is perfectly normal, as our file contains 4D data, meaning that for every voxel (3D), we have a value for several points in time (repetition time; +1D). If we absolutely want to visualize these files, two options are available to us:\n", + "
      \n", + " 1. We can look at the activity across the whole brain for a given time point\n", + "
    \n", + "
      \n", + " 2. We can look at the time series for a given voxel/parcel\n", + "
    \n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fc331ce-94b0-4de9-a2c7-e7f0cccbe8ef", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's retrieve our first volume for our first participant\n", + "first_volume = image.index_img(data[0], 0)\n", + "\n", + "plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d59f34df-4e5a-4bab-a7f3-76de70365e74", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(plotting.plot_stat_map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74a4affe-7a9b-4d73-b5f8-7e9136c3e2ce", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_stat_map(\n", + " first_volume, \n", + " draw_cross=False,\n", + " #cut_coords=(0, 4, 22),\n", + " display_mode='tiled'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db301379-df28-49e0-96f8-299391dd1b24", + "metadata": {}, + "outputs": [], + "source": [ + "multiscale = datasets.fetch_atlas_basc_multiscale_2015(resolution=64, data_dir='../data')\n", + "atlas_filename = multiscale.maps\n", + "\n", + "# initialize masker (change verbosity)\n", + "masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True,\n", + " memory='nilearn_cache', resampling_target=\"data\",\n", + " detrend=True, verbose=0)\n", + "# Extract the timeseries for our first participant\n", + "time_series = masker.fit_transform(data[0], confounds=confounds[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c7ef848-e75a-4769-a960-83a973d14a69", + "metadata": {}, + "outputs": [], + "source": [ + "time_series.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d348d1fc-99ff-427b-8251-06eb459780fb", + "metadata": {}, + "outputs": [], + "source": [ + "parcel = 0\n", + "plt.figure(figsize=(12,4))\n", + "plt.plot(time_series.T[parcel])\n", +
"plt.title(f'Time series for parcel {parcel}')\n", + "plt.xlabel('Volumes')\n", + "plt.ylabel('Amplitude')" + ] + }, + { + "cell_type": "markdown", + "id": "7da0ce19-dbaa-4976-ad03-e5b1aa88c76d", + "metadata": {}, + "source": [ + "
    \n", + "Visually compare the timeseries\n", + "
    Based on the code provided in the previous cell, modify the cell below to compare the time series of parcel 0 and parcel 1.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d12883c4-398b-4c4d-afdd-5be5a78c7532", + "metadata": {}, + "outputs": [], + "source": [ + "help(plt.legend)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7a91dc2-7ad5-4feb-9383-5bd1cd1d529b", + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(12,4))\n", + "plt.plot(time_series.T[0], label='Parcel 0', ls='--', lw=2, alpha=0.4)\n", + "plt.plot(time_series.T[1], label='Parcel 1', lw=2, zorder=1)\n", + "plt.title(f'Timeseries for parcels 0 and 1')\n", + "plt.xlabel('Volumes')\n", + "plt.ylabel('Amplitude')\n", + "plt.legend(loc='lower right')" + ] + }, + { + "cell_type": "markdown", + "id": "fe979fb9-12db-417e-9c22-89b8107fd2c2", + "metadata": {}, + "source": [ + "## Machine learning models interpretation via visualization" + ] + }, + { + "cell_type": "markdown", + "id": "19d43830-3293-4014-9cf5-1102f2762320", + "metadata": {}, + "source": [ + "### Load the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab274378-13fe-49cf-b98b-8f6f25af5837", + "metadata": {}, + "outputs": [], + "source": [ + "from nilearn.connectome import ConnectivityMeasure\n", + "\n", + "correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True,\n", + " discard_diagonal=True)\n", + "\n", + "\n", + "all_features = [] # here is where we will put the data (a container)\n", + "\n", + "for i,sub in enumerate(data[:66]):\n", + " # extract the timeseries from the ROIs in the atlas\n", + " time_series = masker.fit_transform(sub, confounds=confounds[i])\n", + " # create a region x region correlation matrix\n", + " correlation_matrix = correlation_measure.fit_transform([time_series])[0]\n", + " # add to our container\n", + " all_features.append(correlation_matrix)\n", + " # keep track of status\n", + " print('finished %s of %s'%(i+1,len(data[:66])))\n", + "\n", + "np.savez_compressed('data/MAIN_BASC064_subsamp_features', 
 a=all_features)" + ] + }, + { + "cell_type": "markdown", + "id": "25228970-3a18-44cd-a019-27cc2263627d", + "metadata": {}, + "source": [ + "
    \n", + "If your data is not located in the data/ folder, change the path in the cell below.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88c238d8-cb8c-4f8c-8a74-0a0bfb8b8dd0", + "metadata": {}, + "outputs": [], + "source": [ + "y_ageclass = pheno.head(66)['Child_Adult']\n", + "\n", + "feat_file = 'data/MAIN_BASC064_subsamp_features.npz'\n", + "X_features = np.load(feat_file)['a']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ee2373d-20a1-4a5f-b4e2-b283c344fafd", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Shape X: {X_features.shape}\")\n", + "print(f\"Shape y: {y_ageclass.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80129398-0ddd-4153-afb9-9c66b142dff9", + "metadata": {}, + "outputs": [], + "source": [ + "sns.countplot(x = y_ageclass)" + ] + }, + { + "cell_type": "markdown", + "id": "9a9702ec-bdfb-4e2a-93db-7a2fca786ac3", + "metadata": {}, + "source": [ + "### Train the model" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0735997-5c51-4af3-a8eb-5da9a7560227", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Split the sample to training/test and\n", + "# stratify by age class, and also shuffle the data.\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X_features, # x\n", + " y_ageclass, # y\n", + " test_size = 0.2, # 80%/20% split \n", + " shuffle = True, # shuffle dataset\n", + " # before splitting\n", + " stratify = y_ageclass, # keep\n", + " # distribution\n", + " # of ageclass\n", + " # consistent\n", + " # betw. 
train\n", + " # & test sets.\n", + " random_state = 123 # same shuffle each\n", + " # time\n", + " )\n", + "\n", + "from sklearn.svm import SVC\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.metrics import classification_report, confusion_matrix\n", + "\n", + "scaler = StandardScaler().fit(X_train)\n", + "X_train_scl = scaler.transform(X_train)\n", + "X_test_scl = scaler.transform(X_test)\n", + "\n", + "l_svc = SVC(kernel='linear', class_weight='balanced')\n", + "\n", + "l_svc.fit(X_train_scl, y_train) # fit to training data\n", + "y_pred = l_svc.predict(X_test_scl) # classify age class using testing data\n", + "\n", + "acc = l_svc.score(X_test_scl, y_test) # get accuracy\n", + "cr = classification_report(y_pred=y_pred, y_true=y_test) # get prec., recall & f1\n", + "cm = confusion_matrix(y_pred=y_pred, y_true=y_test) # get confusion matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f4345ab-fee5-4db0-86b3-bcd6ebbdf050", + "metadata": {}, + "outputs": [], + "source": [ + "print(cr)" + ] + }, + { + "cell_type": "markdown", + "id": "ea103a73-8c20-4287-99e5-df10d6cc5147", + "metadata": {}, + "source": [ + "### Visualizing the coefficients\n", + "\n", + "We have trained our model and obtained a fairly high prediction score, which indicates that there is likely something in our data that is systematically linked to age." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e847f9d9-6c23-4ad6-835e-e9af08324a4f", + "metadata": {}, + "outputs": [], + "source": [ + "print(l_svc.coef_.shape)\n", + "print(l_svc.coef_)" + ] + }, + { + "cell_type": "markdown", + "id": "5fb261c4-e6cd-43cf-b119-1888d6a7beeb", + "metadata": {}, + "source": [ + "#### Correlation matrix\n", + "The features of our model correspond to the correlation between each pair of regions that we extracted. The coefficients of our model therefore represent the weight of each of these pairs of regions in predicting the age group. 
We can thus use a correlation matrix to visualize these weights." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd4ab8b7-2ddd-4a63-9c5e-127ce5b4f4bb", + "metadata": {}, + "outputs": [], + "source": [ + "feat_exp_matrix = correlation_measure.inverse_transform(l_svc.coef_)[0]\n", + "\n", + "plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), \n", + " labels=range(feat_exp_matrix.shape[0]),\n", + " reorder='average',\n", + " tri='lower', vmax=0.01, vmin=-0.01)" + ] + }, + { + "cell_type": "markdown", + "id": "0fbc6f06-6625-4806-81fa-6d0dd48be515", + "metadata": {}, + "source": [ + "#### Connectome\n", + "\n", + "We can also directly visualize the weight of our features on a brain!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02ca50c0-40d4-4c70-b262-287268035303", + "metadata": {}, + "outputs": [], + "source": [ + "# Regions coordinates\n", + "coords = plotting.find_parcellation_cut_coords(atlas_filename)\n", + "print(coords.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec4555da-ab29-4e8e-bcb8-5dd8a617cf17", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e02adcdf-3e75-42fb-8b03-a23dd990d097", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.006)" + ] + }, + { + "cell_type": "markdown", + "id": "6491c5fb-f6d7-46b0-9b13-45a2ebfc4309", + "metadata": {}, + "source": [ + "
    \n", + "Let's add some motion!\n", + "
    Nilearn has a function that allows us to visualize our connectome interactively! This makes it much easier for us to examine the weights of our features.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65e8661f-619e-48ba-ada4-4892a3405d40", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_connectome(feat_exp_matrix, coords, edge_threshold='90%')" + ] + }, + { + "cell_type": "markdown", + "id": "c9ab0638-3fdd-4e05-82c3-ece605ea2827", + "metadata": {}, + "source": [ + "
    \n", + "Gray features...\n", + "
    You may have noticed that our feature weights are plotted in gray. Why do you think that is?\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bf0eb66-a0e0-4887-883a-00be14565747", + "metadata": {}, + "outputs": [], + "source": [ + "feat_exp_matrix_rm_diag = feat_exp_matrix.copy()\n", + "feat_exp_matrix_rm_diag[feat_exp_matrix==1] = 0\n", + "plotting.view_connectome(feat_exp_matrix_rm_diag, coords, edge_threshold='90%')" + ] + }, + { + "cell_type": "markdown", + "id": "d5c60913-4db1-49f6-8abf-780dc8c5db24", + "metadata": {}, + "source": [ + "We have a model that predicts the age group with very high predictive performance. We can see that the features allowing us to make this prediction are distributed throughout the brain. Can we publish our results now?\n", + "
    \n", + "
    No! We need to explore further to see if our model is biologically plausible... To do this, we are going to visualize our brain images for each of our groups." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e25d6ed-1e05-4293-bd67-260475a570b2", + "metadata": {}, + "outputs": [], + "source": [ + "children, adults = data[33:66], data[0:33]\n", + "avg_children, avg_adults = [], []\n", + "\n", + "# For each participant, we are going to average the brain activity across all our measurement points to obtain a 3D image.\n", + "for child, adult in zip(children, adults):\n", + " avg_adults.append(image.mean_img(adult))\n", + " avg_children.append(image.mean_img(child))\n", + "\n", + "# We are going to average our individual 3D images for each participant, doing so for each of our groups separately.\n", + "avg_children = image.mean_img(avg_children)\n", + "avg_adults = image.mean_img(avg_adults)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "288b7f4d-ee28-4fb6-9172-7327b1a8269e", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_img(avg_children, black_bg=False, cut_coords=(0,-16,16), cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa9254cc-ad38-4c84-8ab1-6cce2b446810", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_img(avg_adults, black_bg=False, cut_coords=(0,-16,16), cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "markdown", + "id": "0eebd450-fa9a-43da-bcb6-86602dbb8605", + "metadata": {}, + "source": [ + "
    \n", + "What do you notice?\n", + "
    Look at the averaged images for each of the groups. Can you observe any differences?\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6191daf0-0a2d-4ddd-a653-fddf4ca18048", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's retrieve our first volume for our first participant\n", + "first_volume = image.index_img(data[0], 0)\n", + "\n", + "plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4a6ca30-7117-4084-bdbd-0d460dc509ff", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's retrieve our first volume for our 39th participant\n", + "first_volume = image.index_img(data[40], 0)\n", + "\n", + "plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "markdown", + "id": "c71007ad-6b51-4798-beab-abdab0d6f442", + "metadata": {}, + "source": [ + "## Supplementary resources\n", + "\n", + "- [`seaborn` tutorial on visualizing distributions of data](https://seaborn.pydata.org/tutorial/distributions.html)\n", + "- [Python Graph Gallery](https://python-graph-gallery.com/)\n", + "- [Mastering data charts: A comprehensive guide to visualization](https://www.atlassian.com/data/charts)\n", + "- [Common caveats to avoid](https://www.data-to-viz.com/caveats.html)\n", + "- [Nilearn visualization functions - (f)MRI data](https://nilearn.github.io/dev/plotting/index.html)\n", + "- [MNE python visualization tutorials - EEG/MEG data](https://mne.tools/stable/auto_tutorials/visualization/index.html)\n", + "- [DIPY visualization tutorials - Diffusion MRI data](https://docs.dipy.org/stable/examples_built/index#visualization)" + ] + } + ], + "metadata": { + "@deathbeds/jupyterlab-fonts": { + "fontLicenses": {}, + "fonts": {}, + "styles": { + ":root": {} + } + }, + "jupytext": { + "formats": "ipynb,md" + }, + "kernelspec": { + "display_name": "Python cours visu", + "language": "python", + "name": "visu-env" + }, + "language_info": { + "codemirror_mode": { + "name": 
"ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/visualization_en.md b/notebooks/visualization_en.md new file mode 100644 index 0000000..b1c55b5 --- /dev/null +++ b/notebooks/visualization_en.md @@ -0,0 +1,1526 @@ +--- +jupyter: + jupytext: + formats: ipynb,md + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.19.1 + kernelspec: + display_name: Python cours visu + language: python + name: visu-env +--- + + +# Visualization and Model Interpretation + +## Learning Objectives +👀 Understand the purpose of visualization +
    📈 Understand which type of graph to use according to the data type +
    🎨 Adequately choose your color palette +
    🔃 Learn how to modify different elements of our figures +
    🧠 Use `nilearn` for neuroimaging data visualization +
    🤖 Understand how visualization can help us interpret our machine learning models + +## Tutorial Organization + +The session will consist of both theoretical and practical sections: + +**Theory** + +We will discuss the basic theoretical principles of visualization. The following principles will be covered: +- Graph types (tabular data): univariate vs. bivariate; categorical vs. continuous +- Color palettes: perceptually uniform vs. non-uniform; discrete vs. continuous +- Graph types (neuroimaging data): statistical maps, connectomes, etc. + +This part includes several interactive elements to lead students to form their own understanding of the material. The code is already provided, but we will not dwell on it. + +**Practice** + +We will put the theoretical principles into practice as we go. We will use the following libraries: +- matplotlib +- seaborn +- ptitprince +- plotly +- nilearn + +Students will be required to modify the provided code to understand the role of various visualization parameters and to answer specific questions based on what we cover in class or from provided references (e.g., matplotlib documentation). + + +
    +The yellow boxes contain questions/exercises for students to answer. +
    + + +
    +The blue boxes contain additional information about the datasets and functions used. +
    + +```python editable=true slideshow={"slide_type": ""} +import numpy as np +import pandas as pd +import nibabel as nib +import seaborn as sns +import ptitprince as pt +import ipywidgets as widgets +import matplotlib.pyplot as plt +import plotly.express as px + +from cmcrameri import cm +from nilearn import datasets, plotting, image +from nilearn.input_data import NiftiLabelsMasker +from matplotlib import colormaps, colors +from colorspacious import cspace_converter, cspace_convert +``` + +## Load the data + +### Brain development fMRI dataset +For this tutorial, we will use the brain development fMRI dataset, which includes phenotypic data as well as fMRI data collected from children and adults during movie watching ([Richardson et al., 2018](https://doi.org/10.1038/s41467-018-03399-2)). + +**Note:** if you have already downloaded the data, you can change the path in the cell below to point to the appropriate directory. + +```python +development_dataset = datasets.fetch_development_fmri(data_dir='data/') +``` + +### Datasaurus +In the first part of the tutorial, we will also briefly use the Datasaurus dataset. If you want to be able to execute the cells using that dataset, you will have to download the data from [Kaggle](https://www.kaggle.com/datasets/tombutton/datasaurusdozen). + +**Note:** change the path in the cell below to point to the directory where you downloaded the data. 
+ +```python +# Credit: Alberto Cairo (original datasaurus), and Justin Matejka and George Fitzmaurice (datasaurus dozen) +data = pd.read_csv("data/datasaurus.csv") +``` + + +## A picture is worth a thousand words +- Visualizations help us better understand the complexity of our data +- They allow us to support our findings +- And they help us share a message + + +```python +# rcParams allows us to modify the global parameters of our figures +# Here, we are simply ensuring that values between (-8,000,000, 8,000,000) are not shown in scientific notation +plt.rcParams['axes.formatter.useoffset'] = False +plt.rcParams['axes.formatter.limits'] = (-8000000, 8000000) +``` + +```python +def plot_category(category): + plt.scatter(data[data['dataset'] == category]['x'], data[data['dataset'] == category]['y']) + plt.xlabel('x') + plt.ylabel('y') + max_x = data[data['dataset'] == category]['x'].max() + max_y = data[data['dataset'] == category]['y'].max() + plt.text(max_x-0.22*max_x, max_y+0.10*max_y, f"Mean: ({data[data['dataset'] == category]['x'].mean().round(2)}, {data[data['dataset'] == category]['y'].mean().round(2)})") + plt.text(max_x-0.22*max_x, max_y+0.06*max_y, f"Std: ({data[data['dataset'] == category]['x'].std().round(2)}, {data[data['dataset'] == category]['y'].std().round(2)})") + + plt.show() + +widgets.interact( + plot_category, + category=widgets.Dropdown( + options=sorted(data['dataset'].unique()), + description='Dataset:' + ) +); +``` + + +### "Never trust summary statistics alone; always visualize your data" +

    Alberto Cairo

    + +Python offers a wide range of libraries for visualizing our data: +- High-level vs. low-level +- Static images vs. interactive plots +- General-purpose libraries (e.g., Matplotlib, Seaborn, Bokeh, and Plotly) +- Domain-specific libraries (e.g., Nilearn) + +But with great power comes great responsibility... + + +```python +data = pd.DataFrame( + { + 'Categories': ['A', 'B'], + 'Values': [6000000, 7066000] + }, + +) +``` + +```python +plt.bar(data['Categories'], data['Values']) +plt.ylim([5500000, 8000000]) +plt.yticks([]) +plt.title("What could you say about 'A' and 'B' if you only look at the figure?") +plt.show() +``` + +```python +plt.bar(data['Categories'], data['Values']) +plt.ylim([5500000, 8000000]) + +plt.title("What do you observe?") +plt.show() +``` + +```python +plt.bar(data['Categories'], data['Values']) + +# Add the values above the bars +for i, v in enumerate(data['Values']): + plt.text(i, v + 1, str(v), ha='center', va='bottom') + +plt.title("Is that better?") +plt.show() +``` + + +## A graph for every data type! + +There are several types of graphs. The chosen chart type depends on the variables you want to visualize: +- Univariate visualization: **continuous variable** vs **categorical variable** +- Bivariate visualization: **categorical x categorical** vs **categorical x continuous** vs **continuous x continuous** + + + +### Univariate visualizations + +For a **continuous variable**, we can visualize its distribution: +- Histogram +- KDE plot +- Strip plot + +For a **categorical variable**, we can visualize the quantity of observations for each category: +- Bar plot + + + +
    +If the `brain development fMRI dataset` is not under the data/ folder, change the path in the cell below to the appropriate one. +
    + + +```python +# Retrieve participants data +participants = pd.read_csv('data/development_fmri/development_fmri/participants.tsv', sep='\t') + +# Let's check what we have here +participants.head() +``` + +```python +# Let's check our data types +participants.dtypes +``` + +
    +The dataset contains continuous variables (e.g., `Age`, `ToM Booklet-Matched`, `FB_Composite`) and categorical variables (e.g., `AgeGroup`, `Child_Adult`, `Gender`). `ToM Booklet-Matched` represents a score on a task designed to assess Theory of Mind—the ability to attribute mental states (beliefs, desires, emotions, intentions) to oneself or others. +
    + +```python +# Let's look at the descriptive statistics related to the `Age` variable +print(participants['Age'].describe()) +``` + + +#### Histogram + +A histogram allows us to visualize the distribution of a given variable in a discrete manner by grouping its values into consecutive intervals (bins). This provides the frequency (i.e., the number of observations) within each of these intervals. + +You can generate a histogram in Matplotlib using the [hist function](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html). + + +```python jp-MarkdownHeadingCollapsed=true +# Let's visualize the distribution of `Age` +plt.hist(participants['Age']) +# Adding a title +plt.title("Age Distribution") +# Adding labels for the x and y axes +plt.xlabel('Age') +plt.ylabel('Frequency') +``` + +
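As a rough illustration of what a histogram computes before anything is drawn, the sketch below bins a small list of ages with `np.histogram`. The `ages` values are invented for this example and are not taken from the dataset.

```python
import numpy as np

# Hypothetical ages, invented for illustration (not from the dataset)
ages = np.array([3.5, 4.1, 4.8, 5.2, 7.0, 8.3, 9.9, 12.4, 24.0, 31.5])

# Group the values into 4 equal-width bins spanning [min, max].
# `counts` holds the number of observations per bin, `edges` the bin limits.
counts, edges = np.histogram(ages, bins=4)

# Note: every bin is half-open [left, right) except the last one,
# which also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"from {left:.1f} to {right:.1f}: {count} observation(s)")
```

The bar heights drawn by a histogram correspond to `counts`, and the bar limits correspond to `edges`.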
    +The science behind the number of bins – Part 1 +
    The hist function in matplotlib groups data into 10 bins by default. However, an inappropriate number of bins (either too small or too large) will not represent the distribution accurately. +
    + + +
    +Bins in practice! +
    Modify the value of the `bins` parameter in the cell below to determine how many bins would most accurately reflect our `Age` variable. +
    + +```python +# Change the value of `bins` +plt.hist(participants['Age'], bins=10) +# Adding a title +plt.title("Age Distribution") +# Adding labels for the x and y axes +plt.xlabel('Age') +plt.ylabel('Frequency') +``` + +
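One data-driven way to choose the number of bins is the Freedman-Diaconis rule, which NumPy exposes as `bins='fd'`. The sketch below is a minimal re-implementation for illustration only: `fd_bin_count` is a hypothetical helper name, and the data is synthetic rather than the tutorial's `Age` column. The rule sets the bin width to 2 · IQR · n^(-1/3) and derives the bin count from the data range.

```python
import numpy as np

def fd_bin_count(values):
    """Bin count suggested by the Freedman-Diaconis rule (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    # Bin width: 2 * IQR * n^(-1/3)
    width = 2.0 * iqr * values.size ** (-1.0 / 3.0)
    if width == 0:
        return 1  # degenerate case: no spread in the central half of the data
    return int(np.ceil((values.max() - values.min()) / width))

# Synthetic data standing in for a continuous variable
rng = np.random.default_rng(0)
simulated_ages = rng.normal(loc=10, scale=3, size=200)

print("Hand-computed bin count:", fd_bin_count(simulated_ages))
print("NumPy 'fd' bin count:   ", len(np.histogram_bin_edges(simulated_ages, bins='fd')) - 1)
```

Because the width depends on the interquartile range rather than the standard deviation, this rule tends to be less affected by a few extreme values.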
    +The science behind the number of bins – Part 2 +
    Several different rules exist for calculating the optimal number of bins, as well as the width of the intervals to be used for each. You can use these rules directly within the matplotlib hist function! Simply pass one of the valid rules as the bins parameter. +
    + +```python +help(plt.hist) +``` + +```python +# Let's visualize the distribution of `Age` +plt.hist(participants['Age'], bins='fd') +# Adding a title +plt.title("Age Distribution") +# Adding labels for the x and y axes +plt.xlabel('Age') +plt.ylabel('Frequency') +``` + +#### KDE plot + +**Kernel density estimation (KDE)** allows us to visualize the distribution of our variable in a continuous (rather than discrete) manner by estimating a density function. However, much like bin size in a histogram, kernel density estimation is sensitive to the bandwidth! + +```python +# To visualize the kernel density estimation, we will use the `kdeplot` function in `seaborn` +sns.kdeplot(participants['Age']) +``` + +
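To make the bandwidth sensitivity concrete, here is a minimal sketch of a Gaussian KDE written by hand with NumPy. It is not how `seaborn` implements `kdeplot`; `gaussian_kde_1d` is a hypothetical helper, and the bimodal sample is synthetic. The estimate is simply the average of one Gaussian bump per observation, and the bandwidth sets the width of each bump.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid`: the mean of one Gaussian bump per sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# Synthetic bimodal sample (two groups with different centers and spreads)
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(5, 1, 100), rng.normal(25, 3, 50)])
grid = np.linspace(-40, 80, 1201)
step = grid[1] - grid[0]

densities = {}
for bandwidth in (0.5, 2.0, 8.0):
    density = gaussian_kde_1d(samples, grid, bandwidth)
    densities[bandwidth] = density
    # Count strict local maxima of the estimated curve
    inner = density[1:-1]
    n_peaks = int(np.sum((inner > density[:-2]) & (inner > density[2:])))
    # The area under a density curve should be close to 1
    area = density.sum() * step
    print(f"bandwidth={bandwidth}: {n_peaks} local maxima, "
          f"max density={density.max():.3f}, area={area:.3f}")
```

A small bandwidth follows every cluster of points and produces a bumpy curve, while a large one smooths the structure away. In `seaborn`, the `bw_adjust` parameter of `kdeplot` scales the bandwidth in the same spirit.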
    +KDE plot y axis - Part 1 +
    You’ve likely noticed that the y-axis values now range from 0 to 0.08 (as opposed to 0 to 50 for our histogram). In a KDE plot, the y-axis represents density, which is the probability per unit of the variable on the x-axis. In other words, it shows how "dense" the data is for a given x-value. Therefore, the peaks in this type of figure represent a higher density of points for a specific range of values (i.e., a higher probability of observing a value), while the troughs represent a lower density of points. +
    + + +
    +Take note! +
    Since this type of graph provides a continuous estimation, it might suggest that certain data points exist when, in reality, they do not. +
    + +```python +# We can also overlap a histogram with a kde plot in `seaborn` using the `histplot` function +sns.histplot( + participants['Age'], + kde=True, + bins='fd', + edgecolor=None +) +``` + +
    +KDE plot y axis - Part 2 +
    When overlaying a histogram with a KDE using `seaborn`, you will notice that the y-axis shows the frequency. However, the KDE curve remains a density curve. This occurs because `seaborn` scales the density curve to match the histogram by multiplying the curve by the number of observations and the bin width. +
    + + +#### Univariate strip plot + +**Histograms** and **KDE plots** allow us to visualize a variable's distribution in a discrete or continuous manner, respectively. However, these types of visualizations do not allow us to see the raw data itself. + +The **strip plot** allows us to visualize each individual data point. This can make it easier to identify the presence of outliers within our data. However, a strip plot is not suitable if we have too many data points. + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +
    +Strip plot reproducibility +
    Try to reproduce the strip plot for the `Age` variable in the two cells below. What do you notice? +
    + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +
    +The `jitter` parameter +
    You might have noticed that the two strip plots you generated from the same variable are not exactly identical. This happens because seaborn uses numpy.random to calculate the jitter. To make the jitter calculation reproducible, you can set a seed beforehand. +
    + +```python +np.random.seed(12) +sns.stripplot( + x=participants['Age'] +) +``` + +```python +np.random.seed(12) +sns.stripplot( + x=participants['Age'] +) +``` + +#### Bar plots + +The **bar plot** allows you to visualize and compare categorical variables by showing the frequency of different values or simply the values themselves. This type of graph is useful if you want to, for example, compare the number of people per group. + +```python +participants['AgeGroup'].value_counts() +``` + +```python +participants['AgeGroup'].value_counts().plot(kind='bar') +plt.ylabel('Counts') +``` + +```python +order = sorted(participants.AgeGroup.unique()) + +sns.countplot( + data=participants, + x='AgeGroup', + order=order, + hue='Gender', +) +``` + +### Summary + +| Graph type | Data type | Summary | +| --- | --- | --- | +| Histogram | Continuous | To visualize our data distribution in a discrete way | +| KDE plot | Continuous | To visualize our data distribution in a continuous way | +| Strip plot | Continuous | To visualize each data point individually | +| Bar plot | Categorical | To compare frequencies between groups/categories | + + + +### Bivariate Visualizations + +For a **continuous variable** x **continuous variable**, we can use: +- Scatter plot +- Bivariate KDE +- Hexplot +- Joint plot +- Heat map + +For a **categorical variable** x **continuous variable**, we can use: +- Box plot +- Violin plot +- Scatter plot +- Point plot +- Rain cloud plot +- Bar plot... (?) + + +#### Scatter plot - Continuous variable x continuous variable + +```python +plt.scatter(participants['Age'], participants['ToM Booklet-Matched']) +plt.xlabel('Age') +plt.ylabel('ToM Booklet-Matched') +``` + +
    +What do you notice? +
    Take the time to observe the figure that we have just generated. +
    + +```python +participants.groupby(['Child_Adult'])['ToM Booklet-Matched'].mean() +``` + +
    +Missing values +
    The `scatter` function in `matplotlib`, as well as the `regplot` function in `seaborn` automatically remove the missing values. +
    + + +
    +The `regplot` function +
    The `regplot` function in `seaborn` allows you to visualize the scatter plot while fitting a linear regression model to the data! +
    + +```python +sns.regplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'] +) +``` + +
    +Fitting the regression model +
    A linear regression does not seem to be the best way to model the relationship between our `Age` variable and our `ToM Booklet-Matched` variable. Change the value of the order parameter in the regplot function to 2 to check the fit of this model. +
    + +```python +sns.regplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + order=2 +) +``` + +
    +What if we add another variable? +
    We can visualize interactions between multiple variables via the `hue` parameter of the `lmplot` function in `seaborn`. In the cell below, we explore the relationship between `Age` (x-axis) and `ToM Booklet-Matched` (y-axis), split by gender (`hue`). +
    + +```python +sns.lmplot( + x='Age', + y='ToM Booklet-Matched', + data=participants, + hue='Gender' +) +``` + +#### Bivariate KDE plot and hex plot - Continuous Variable x continuous Variable + +When we have a large number of data points and want to visualize the distribution of points across two variables, we risk having significant overplotting. In this case, it can be difficult to properly visualize the distribution of our data with a scatter plot. It is therefore possible to use other types of graphs: + +The **bivariate KDE plot** allows us to visualize how two variables are distributed in a two-dimensional space. Each contour represents a density zone. The closer the contours are, the higher the density—meaning where the data is most concentrated. + +The **hex plot** allows us to visualize the data point density in a discrete manner. It is essentially the equivalent of a histogram, but for visualizing two variables instead of one. Darker areas represent zones of high density. + +⚠ Bivariate kernel density estimation is more computationally demanding compared to the hex plot. 
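To make the histogram analogy concrete, here is a rough sketch of two-dimensional binning in plain Python. It assumes square bins rather than hexagons and uses simulated data; the function `bin_2d` and its `bin_width` parameter are hypothetical names chosen for illustration.

```python
import math
import random
from collections import Counter

random.seed(0)
# Hypothetical correlated toy data standing in for two continuous variables.
xs = [random.gauss(0, 1) for _ in range(1000)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]


def bin_2d(x, y, bin_width=0.5):
    """Count how many points fall in each square cell of side `bin_width`."""
    counts = Counter()
    for a, b in zip(x, y):
        cell = (math.floor(a / bin_width), math.floor(b / bin_width))
        counts[cell] += 1
    return counts


counts = bin_2d(xs, ys)
densest_cell, n_points = counts.most_common(1)[0]
print(f"Densest cell: {densest_cell} with {n_points} points")
```

Counting points per cell takes a single pass over the data, which is roughly why binned plots are cheaper than a kernel density estimate, where every evaluation point accumulates contributions from the data points.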
+ +```python +sns.kdeplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + fill=True +) +``` + +```python +sns.jointplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + kind='hex' +) +``` + +```python +def plot_jointplot(kind): + sns.jointplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + kind=kind + ) + plt.show() + +widgets.interact( + plot_jointplot, + kind=widgets.Dropdown( + options=['hex', 'reg', 'kde', 'scatter'], + description='Type:' + ) +); +``` + +```python +def plot_jointplot(kind): + mean, cov = [0, 1], [(1, .5), (.5, 1)] + x, y = np.random.multivariate_normal(mean, cov, 1000).T + + sns.jointplot( + x=x, + y=y, + kind=kind + ) + plt.show() + +widgets.interact( + plot_jointplot, + kind=widgets.Dropdown( + options=['scatter', 'hex', 'reg', 'kde'], + description='Type:' + ) +); +``` + +#### Strip plot - Continuous variable x categorical variable + +We have already discussed the **strip plot** in the context of univariate visualization, but we can also use it to look at multiple groups/categories simultaneously. + +```python +order = sorted(participants.AgeGroup.unique())[:-1] + +sns.stripplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + + +#### Box plot - Continuous variable x categorical variable + + +Box plots allow us to visualize the distributions of one or multiple groups of continuous variables (e.g., age distributions across different experimental groups). The different components of the box plot represent different descriptive statistics: +- The line in the box represents the median. +- The edges of the box (the lower quartile Q1 and the upper quartile Q3) delimit the range containing the middle 50% of the data. +- The whiskers (i.e., the lines outside of the box) capture the range of the rest of the data. 
+- The points represent outliers: values greater than Q3 + 1.5 x interquartile range (i.e., Q3 - Q1) or lower than Q1 - 1.5 x interquartile range. + + +```python +sns.boxplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + +#### Violin plot - Continuous variable x categorical variable + +**Violin plots** allow us to visualize the data distribution by using density curves (aka **kde** curves). The width of each curve corresponds to the approximate frequency of the points in each region (values on the y-axis). + +```python +sns.violinplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + +#### Raincloud - Continuous variable x categorical variable + +Why choose between a strip plot, a box plot and a violin plot when we can have them all at once? That's what **raincloud plots** allow us to do. + +*Note:* this type of graph is not natively integrated into `seaborn` or `matplotlib`. We will need to use the `ptitprince` library that we imported at the beginning of the tutorial (`import ptitprince as pt`). + +*Resource:* for more examples using raincloud plots, please refer to this [ptitprince tutorial](https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb). + +```python +pt.RainCloud( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + bw=0.6 +) +``` + +#### Point plot - Continuous variable x categorical variable + +The **point plot** allows us to compare means (or any other descriptive statistic) between groups while showing the uncertainty (e.g., 95% CI, standard deviation, etc.). This type of graph is useful to show trends between different categories or between different time points (e.g., if we have longitudinal measurements). 
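As an illustration of what a 95% CI error bar represents, the sketch below computes a percentile bootstrap interval around a mean in plain Python. To our understanding, `seaborn` also relies on bootstrapping for `errorbar=('ci', 95)`, but this is not its implementation: the helper `bootstrap_ci` and the toy scores are hypothetical.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Approximate percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and store the mean of each resample.
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Read off the percentiles that leave (100 - level) / 2 percent in each tail.
    tail = (100 - level) / 200
    lower = boot_means[int(tail * n_boot)]
    upper = boot_means[int((1 - tail) * n_boot) - 1]
    return lower, upper


# Hypothetical scores for a single group.
scores = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.5, 0.6, 0.3]
mean_score = statistics.fmean(scores)
ci_low, ci_high = bootstrap_ci(scores)
print(f"mean = {mean_score:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

The interval narrows as the group gets larger or less variable, which is what the length of the error bars conveys in a point plot.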
+ +```python +sns.pointplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + estimator='mean', + data = participants[participants.AgeGroup!='Adult'], + order=order, + errorbar=('ci', 95) +) +``` + +#### Bar plots for bivariate visualizations? + +You may have already seen bar plots used to represent a continuous variable. However, this type of visualization is not recommended for this kind of variable: +- It only allows for the visualization of certain descriptive statistics (e.g., the mean) without providing any information about the distribution of our variable. +- If you absolutely must use a bar plot, overlay it with a scatter plot! + +```python +sns.barplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order = order, palette='Blues' +) + + +sns.stripplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + jitter=True, + order = order, color = 'black') + +``` + +## And what if we added a little color to our figures? + + +### Perceptually Uniform vs. Non-Uniform Color Palettes +- Colors are perceived based on their hue (orange, red, green, etc.) and their luminosity (lightness vs. darkness of a hue). +- The characteristics of our photoreceptors mean that we do not process the light spectrum uniformly. +- The majority of the photoreceptors that allow us to see colors (cones) process long wavelengths (i.e., red, orange, yellow). +- Therefore, we do not perceive variations in green-blue hues as well as we do perceive variations in yellow-red hues. 
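One way to see this non-uniformity is to compute an approximate luminance for evenly spaced hues. The sketch below uses only the standard library; it approximates a rainbow palette with `colorsys.hsv_to_rgb` rather than reading `matplotlib`'s `hsv` colormap, and the luminance formula (Rec. 709 weights on linearized sRGB) is a common approximation, not a full perceptual model.

```python
import colorsys


def relative_luminance(r, g, b):
    """Approximate relative luminance of an sRGB color (channels in [0, 1])."""
    def linearize(channel):
        # Undo the sRGB gamma encoding before weighting the channels.
        if channel <= 0.04045:
            return channel / 12.92
        return ((channel + 0.055) / 1.055) ** 2.4

    # Rec. 709 weights: green contributes most to perceived brightness.
    return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b)


# Six equally spaced hues at full saturation, similar in spirit to a rainbow palette.
names = ['red', 'yellow', 'green', 'cyan', 'blue', 'magenta']
luminances = []
for i, name in enumerate(names):
    r, g, b = colorsys.hsv_to_rgb(i / 6, 1.0, 1.0)
    luminances.append(relative_luminance(r, g, b))
    print(f"{name:8s} luminance = {luminances[-1]:.2f}")
```

Equal steps in hue give a luminance that jumps up and down instead of changing steadily, which is the kind of behavior that perceptually uniform palettes such as `batlow` are designed to avoid.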
+ +```python +colormaps['hsv'] +``` + +```python +cm.batlow +``` + +```python +import requests +from PIL import Image +from io import BytesIO +``` + +```python +response = requests.get('https://raw.githubusercontent.com/matplotlib/matplotlib/main/doc/_static/stinkbug.png') +img = np.asarray(Image.open(BytesIO(response.content))) +``` + +```python +def plot_img(cmap): + lum_img = img[:, :, 0] + if cmap == 'black and white': + plt.imshow(img) + elif cmap == 'hsv': + plt.imshow(lum_img, cmap="hsv") + elif cmap == 'batlow': + plt.imshow(lum_img, cmap=cm.batlow) + +widgets.interact( + plot_img, + cmap=widgets.Dropdown( + options=['black and white', 'hsv', 'batlow'], + value='black and white', + description='Palette:' + ) +); +``` + +
    +What do you notice? +
    Compare the colored image using the `hsv` and `batlow` color palettes with the black and white (grayscale) version. +
    + + +
    +You don't have to throw away the rainbow +
    Google developed `turbo`, an improved rainbow colormap with smoother perceptual transitions than classic rainbow palettes, which is available in `matplotlib`. For more details about this colormap, please refer to this Google Research blog. +
    + + +### Discrete vs. Continuous Color Palettes + +Color palettes can be either continuous (like those we saw above) or discrete. Discrete color palettes are used to visualize categorical variables—where categories have no inherent order (e.g., Children with COVID vs. children without COVID vs. adults with COVID vs. adults without COVID). + + +#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html) + +```python +colormaps['Set2'] +``` + +```python +colormaps['Set3'] +``` + +#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) + +```python +sns.color_palette('rocket', 10) +``` + +#### [cmcrameri](https://s-ink.org/scientific-colour-maps) + +```python +cm.batlow.resampled(10) +``` + +```python +cm.lipari.resampled(10) +``` + +
    +Discretizing Continuous Palettes +
    In the cells above, we saw that it is possible to discretize a continuous palette by specifying the number of colors we want. However, as mentioned, discrete palettes are typically used to visualize categories that have no inherent order, so keeping the colors in their original order is not always necessary. +
    + +```python +nb_colors = 10 +#np.random.seed(10) + +original_cmap = cm.lipari.resampled(nb_colors) + +original_colors = original_cmap(np.arange(nb_colors)) + +shuffled_colors = original_colors.copy() +np.random.shuffle(shuffled_colors) + +colors.ListedColormap(shuffled_colors) +``` + +### Diverging color palettes + +Diverging color palettes are useful when we have an interpretable central value. Diverging palettes can be applied to both discrete and continuous scales. For example: +- **Discrete diverging palettes**: If we have data collected using a Likert scale. Our values might range from 'Strongly Disagree' to 'Strongly Agree,' with 'Neutral' as the central value. In this case, we could visualize the values on the left (from 'Strongly Disagree' to 'Neutral') in shades of blue and the values on the right (from 'Neutral' to 'Strongly Agree') in shades of orange/red. +- **Continuous diverging palettes**: If we have negative and positive values (e.g., correlation coefficients), we could use zero as the central value. Negative values could be represented in shades of blue and positive values in shades of orange/red. + + +#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html) + +```python +# Palette discrète divergente +colormaps['bwr'].resampled(9) +``` + +```python +# Palette continue divergente +colormaps['bwr'] +``` + +#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) + +```python +sns.color_palette("coolwarm", 9) +``` + +```python +sns.color_palette("coolwarm", as_cmap=True) +``` + +#### [cmcrameri](https://s-ink.org/scientific-colour-maps) + +```python +cm.vik.resampled(9) +``` + +```python +cm.vik +``` + +### Universally interpretable color palettes + +We talked about the perceptual uniformity of color palettes, but it is also important to consider the use of colors that are perceived universally. 
In fact, using certain color combinations can make it difficult for people with color vision deficiency to distinguish between data points. A few tips: +- Avoid red-green combinations, since red-green deficiency is the most common form of color blindness +- Vary lightness and saturation to make sure the data remains readable even in greyscale +- Use the colorblind-friendly palettes provided by `matplotlib`, `seaborn` and `cmcrameri` + +```python +sns.color_palette("colorblind") +``` + +```python +converters = { + 'deuter50_space': { + "name": "sRGB1+CVD", + "cvd_type": "deuteranomaly", + "severity": 50 + }, + 'deuter100_space': { + "name": "sRGB1+CVD", + "cvd_type": "deuteranomaly", + "severity": 100 + }, + 'prot50_space': { + "name": "sRGB1+CVD", + "cvd_type": "protanomaly", + "severity": 50 + }, + 'prot100_space': { + "name": "sRGB1+CVD", + "cvd_type": "protanomaly", + "severity": 100 + }, + 'trit50_space': { + "name": "sRGB1+CVD", + "cvd_type": "tritanomaly", + "severity": 50 + }, + 'trit100_space': { + "name": "sRGB1+CVD", + "cvd_type": "tritanomaly", + "severity": 100 + } +} + +def show_palettes(cmap, n_colors=10): + f, ax = plt.subplots(7, 1, figsize=(10, 6)) #, layout="constrained") + plt.subplots_adjust(hspace=0.6) + + gradient = np.arange(n_colors).reshape(1, -1) + + if cmap in ['viridis', 'coolwarm', 'colorblind' ,'pastel', 'muted']: + cmap_sns = sns.color_palette(cmap, n_colors) + cmap_m = colors.ListedColormap(sns.color_palette(cmap, n_colors=n_colors)) + if cmap in ['batlow', 'vik']: + cmap_m = getattr(cm, cmap).resampled(n_colors) + cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors))) + if cmap in ['bwr', 'hsv', 'PuOr', 'summer']: + cmap_m = colormaps[cmap].resampled(n_colors) + cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors))) + + # Original palette + ax[0].imshow(gradient, aspect='auto', cmap=cmap_m) + ax[0].set_title('Original') + for idx, converter in enumerate(converters.keys()): + # Transformed palette + converted_palette = 
cspace_convert(cmap_sns, converters[converter], "sRGB1") + converted_palette = np.clip(converted_palette, 0, 1) + + ax[idx+1].imshow(gradient, aspect='auto', cmap=colors.ListedColormap(converted_palette)) + ax[idx+1].set_title(f"{converters[converter]['cvd_type']}-{converters[converter]['severity']}") + + for a in ax.flatten(): + a.set_yticks([]) + a.set_xticks([]) + + plt.show() + +widgets.interact( + show_palettes, + cmap=widgets.Dropdown( + options=['pastel', 'viridis', 'coolwarm', 'batlow', 'bwr', 'colorblind', 'hsv', 'PuOr', 'summer', 'vik', 'muted'], + value='viridis', + description='Palette:' + ), +); + +``` + +
    +More than colors +
    Choosing the right color palette is important, but there are also other strategies we can use to make our figures more accessible and easily interpretable. Instead of using only different colors to distinguish between categories, we could use different marker shapes (e.g., circle vs. triangle). If our figure includes lines, for example to illustrate the trajectory of a variable over time, we could use different line styles (e.g., solid vs. dashed). For more examples, see "The best charts for color blind viewers". +
    + + +## Anatomy of a figure + +We have already seen how to add or modify certain elements of our figures, such as the title and axis labels. In this section, we will discuss the art of figure-making in more detail. We will cover: +- Subplots +- Spines +- Ticks +- Grid +- Legend + + +### Subplots + +Up until now, we have primarily created our figures by directly calling certain functions from matplotlib and seaborn. However, if we want to create a figure with multiple panels (i.e., columns and/or rows), we will need to use subplots. + + +#### matplotlib + +```python +# Let's look at what we've used before. This code generates two separated figures. +plt.hist(participants['Age'], bins='fd') +plt.title("Age Distribution") +plt.xlabel('Age') +plt.ylabel('Frequency') +plt.show() + +plt.scatter(participants['Age'], participants['ToM Booklet-Matched']) +plt.xlabel('Age') +plt.ylabel('ToM Booklet-Matched') +plt.show() +``` + +```python +help(plt.subplots) +``` + +```python +# If we want to produce a single figure with these two graphs, we will have to use subplots +# Function syntax: plt.subplots(n_rows, n_cols, *) +fig, axes = plt.subplots(1, 2, figsize=(12, 4)) +axes[0].hist(participants['Age'], bins='fd') +axes[0].set_title("Age Distribution") +axes[0].set_xlabel('Age') +axes[0].set_ylabel('Frequency') + +axes[1].scatter(participants['Age'], participants['ToM Booklet-Matched']) +axes[1].set_title("Relation between Age and ToM scores") +axes[1].set_xlabel('Age') +axes[1].set_ylabel('ToM Booklet-Matched') +``` + +```python +# The subplots can even be used if we only have one graph +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Age Distribution") +axes.set_xlabel('Age') +axes.set_ylabel('Frequency') +``` + +#### seaborn + +```python +# With searbon +fig, axes = plt.subplots(1, 2, figsize=(12, 4)) + +order = sorted(participants[participants.AgeGroup!='Adult']['AgeGroup'].unique()) +#order.sort() + +sns.boxplot( + 
 x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + ax=axes[0] +) + +sns.violinplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + ax=axes[1] +) +``` + +### Spines +The spines (or borders) are the lines around the plotting area. + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Age Distribution") +ax.set_xlabel('Age') +ax.set_ylabel('Frequency') + +for spine in ax.spines.values(): + spine.set_color("red") + spine.set_linewidth(3) +``` + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Age Distribution") +ax.set_xlabel('Age') +ax.set_ylabel('Frequency') + +ax.spines[['right', 'top', 'left', 'bottom']].set_visible(False) # ax.spines.top.set_visible(False) +``` + +### Ticks + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Age Distribution") +ax.set_xlabel('Age') +ax.set_ylabel('Frequency') + +ax.tick_params( + axis='both', # Modification applied to both axes (x and y) + which='major', # Modification on 'major', 'minor' or 'both' ticks + length=10, # Ticks length + width=2, # Ticks width + color='purple', # Ticks color + labelsize=12, # Label size + labelcolor='darkorange', # Label color + labelrotation=45 +) + +#from matplotlib.ticker import MultipleLocator +#ax.xaxis.set_minor_locator(MultipleLocator(2)) +``` + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Age Distribution") +ax.set_xlabel('Age') +ax.set_ylabel('Frequency') + +plt.tick_params( + axis='x', + which='both', + bottom=False, + top=False, + labelbottom=False) +``` + +
    +Modify the code +
    Based on the code provided in the previous cells, modify the cell below to generate a figure with the following characteristics: +
  • + No top or right spines +
  • +
  • + No tick marks on the y-axis +
  • +
  • + Minor ticks on the x-axis at intervals of 1 +
  • +
  • + X-axis labels rotated to 90 degrees +
  • +
  • + Titles for all axes as well as for the figure +
  • + +```python +# Add your code here +# ... +``` + +### Grid + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Age Distribution") +ax.set_xlabel('Age') +ax.set_ylabel('Frequency') + + +ax.grid( + True, + color='gray', + linestyle='--', + linewidth=1 +) + +ax.set_axisbelow(True) # Draw the grid behind the bars (similar to lowering its `zorder`) +``` + +### Legend + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +sns.scatterplot(data=participants, x='Age', y='ToM Booklet-Matched', hue='Gender', ax=ax) +ax.set_xlabel('Age') +ax.set_ylabel('ToM Booklet-Matched') + +legend = ax.legend(fontsize=12, loc='lower right') + +legend.get_frame().set_edgecolor("black") +legend.get_frame().set_linewidth(2) +legend.get_frame().set_facecolor("white") +``` + +### Default parameters + +The default parameters for all elements of our figures (e.g., font, text size, line width, etc.) are defined in the `rcParams` object. These values can be modified, allowing us to apply a consistent style from one figure to another. + +```python +plt.rcParams +``` + +```python +STYLE = { + "font.cursive": "Comic Sans MS", + "font.size": 8, + "axes.linewidth": 1.2, + "axes.grid": True, + "axes.grid.axis": 'both', + 'axes.axisbelow': True, + "figure.dpi": 150 +} + +plt.rcParams.update(STYLE) +``` + +```python +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Age Distribution") +axes.set_xlabel('Age') +axes.set_ylabel('Frequency') +``` + +```python +# To reset the default values +plt.rcdefaults() +``` + +
    +Choose your style +
    It is also possible to use predefined styles using the `style.use` function in `matplotlib`. You can consult the list of available styles here. See also the matplotlib documentation to learn more about style sheets and `rcParams`. +
    + +```python +print(plt.style.available) +``` + +
    +For example, 'ggplot' is a style that we can use to obtain figures similar to those generated by the `ggplot` library in R. +
    + +```python +plt.style.use('ggplot') +``` + +```python +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Age Distribution") +axes.set_xlabel('Age') +axes.set_ylabel('Frequency') +``` + +## Interactive graphs + +Static figures are good, but interactive figures are better! In fact, interactive graphics offer us the opportunity to explore our data in ways that wouldn't be possible with static figures and to create dashboards (see [documentation](https://dash.plotly.com/?_gl=1*j1jto0*_gcl_au*MTY3MTkxNTc4MS4xNzczOTMwNTAw*_ga*MTM4NzMyMTAxMy4xNzczOTMwNTAy*_ga_6G7EE0JNSC*czE3NzM5MzA1MDIkbzEkZzAkdDE3NzM5MzA1MTAkajUyJGwwJGgw)). We can add information to our figures without cluttering them. + +The two main Python libraries that allow us to create interactive figures are `plotly` and `bokeh`. The `plotly` library offers a high-level interface (`plotly.express`) that allows us to create figures in a single line of code, while `bokeh` offers more customization options. In this tutorial, we will only discuss `plotly.express`, but if you want to try `bokeh`, you can consult [the documentation](https://docs.bokeh.org/en/latest/) + +```python +help(px.scatter) +``` + +```python +fig = px.scatter( + data_frame=participants, + x='Age', + y='ToM Booklet-Matched', + hover_data=['participant_id', 'Gender'], + color='Handedness', + symbol='Handedness' +) + +fig.show() +``` + +```python +participants.columns +``` + +```python +fig = px.scatter( + data_frame=participants, + x='Age', + y='ToM Booklet-Matched', + hover_data=['participant_id', 'Gender'], + color='Handedness', + symbol='Handedness' +) +fig.update_xaxes(range=[int(participants['Age'].min()-1), int(participants[participants['Child_Adult']=='child']['Age'].max()+1)]) +fig.update_traces(marker_size=10) +fig.show() +``` + +## Visualizing brain images! 
+ +The `nilearn` library provides multiple functions to plot brain images (see [the list of functions](https://nilearn.github.io/stable/plotting/index.html)). In this section, we will look at some of these functions. + +--- + +Based on the [MAIN tutorial](https://main-educational.github.io/intro_nilearn/machine-learning-with-nilearn.html) + +```python +# Let's get our data +data = development_dataset.func +confounds = development_dataset.confounds +pheno = pd.DataFrame(development_dataset.phenotypic) +``` + +```python +data +``` + +```python +# Let's try visualizing our first file +plotting.view_img(data[0]) +``` + +```python +img = nib.load(data[0]) +img.shape +``` + +
    +Oops! +
    If we try to visualize our BOLD images, we get an error! This is perfectly normal, as our file contains 4D data—meaning that for every voxel (3D), we have a value for several points in time (repetition time; +1D). If we absolutely want to visualize these files, two options are available to us: +
      + 1. We can look at the activity across the whole brain for a given time point +
    +
      + 2. We can look at the time series for a given voxel/parcel +
    +
    + +```python +# Let's retrieve our first volume for our first participant +first_volume = image.index_img(data[0], 0) + +plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +```python +help(plotting.plot_stat_map) +``` + +```python +plotting.plot_stat_map( + first_volume, + draw_cross=False, + #cut_coords=(0, 4, 22), + display_mode='tiled' +) +``` + +```python +multiscale = datasets.fetch_atlas_basc_multiscale_2015(resolution=64, data_dir='../data') +atlas_filename = multiscale.maps + +# initialize masker (change verbosity) +masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True, + memory='nilearn_cache', resampling_target="data", + detrend=True, verbose=0) +# Extract the time series for our first participant +time_series = masker.fit_transform(data[0], confounds=confounds[0]) +``` + +```python +time_series.shape +``` + +```python +parcel = 0 +plt.figure(figsize=(12,4)) +plt.plot(time_series.T[parcel]) +plt.title(f'Time series for parcel {parcel}') +plt.xlabel('Volumes') +plt.ylabel('Amplitude') +``` + +
    +Visually compare the time series +
    Based on the code provided in the previous cell, modify the cell below to compare the time series of parcel 0 and parcel 1. +
    + +```python +help(plt.legend) +``` + +```python +plt.figure(figsize=(12,4)) +plt.plot(time_series.T[0], label='Parcel 0', ls='--', lw=2, alpha=0.4) +plt.plot(time_series.T[1], label='Parcel 1', lw=2, zorder=1) +plt.title(f'Timeseries for parcels 0 and 1') +plt.xlabel('Volumes') +plt.ylabel('Amplitude') +plt.legend(loc='lower right') +``` + +## Machine learning models interpretation via visualization + + +### Load the data + +```python +from nilearn.connectome import ConnectivityMeasure + +correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True, + discard_diagonal=True) + + +all_features = [] # here is where we will put the data (a container) + +for i,sub in enumerate(data[:66]): + # extract the timeseries from the ROIs in the atlas + time_series = masker.fit_transform(sub, confounds=confounds[i]) + # create a region x region correlation matrix + correlation_matrix = correlation_measure.fit_transform([time_series])[0] + # add to our container + all_features.append(correlation_matrix) + # keep track of status + print('finished %s of %s'%(i+1,len(data[:66]))) + +np.savez_compressed('data/MAIN_BASC064_subsamp_features', a=all_features) +``` + +
    +If your data are not in the `data/` folder, change the path in the cell below. +
    + +```python +y_ageclass = pheno.head(66)['Child_Adult'] + +feat_file = 'data/MAIN_BASC064_subsamp_features.npz' +X_features = np.load(feat_file)['a'] +``` + +```python +print(f"Shape X: {X_features.shape}") +print(f"Shape y: {y_ageclass.shape}") +``` + +```python +sns.countplot(x = y_ageclass) +``` + +### Train the model + +```python +from sklearn.model_selection import train_test_split + +# Split the sample to training/test and +# stratify by age class, and also shuffle the data. + +X_train, X_test, y_train, y_test = train_test_split(X_features, # x + y_ageclass, # y + test_size = 0.2, # 80%/20% split + shuffle = True, # shuffle dataset + # before splitting + stratify = y_ageclass, # keep + # distribution + # of ageclass + # consistent + # betw. train + # & test sets. + random_state = 123 # same shuffle each + # time + ) + +from sklearn.svm import SVC +from sklearn.preprocessing import StandardScaler +from sklearn.metrics import classification_report, confusion_matrix + +scaler = StandardScaler().fit(X_train) +X_train_scl = scaler.transform(X_train) +X_test_scl = scaler.transform(X_test) + +l_svc = SVC(kernel='linear', class_weight='balanced') + +l_svc.fit(X_train_scl, y_train) # fit to training data +y_pred = l_svc.predict(X_test_scl) # classify age class using testing data + +acc = l_svc.score(X_test_scl, y_test) # get accuracy +cr = classification_report(y_pred=y_pred, y_true=y_test) # get prec., recall & f1 +cm = confusion_matrix(y_pred=y_pred, y_true=y_test) # get confusion matrix +``` + +```python +print(cr) +``` + +### Visualizing the coefficients + +We have trained our model and obtained a fairly high prediction score, which indicates that there is likely something in our data that is systematically linked to age. + +```python +print(l_svc.coef_.shape) +print(l_svc.coef_) +``` + +#### Correlation matrix +The features of our model correspond to the correlation between each pair of regions that we extracted. 
The coefficients of our model therefore represent the weight of each of these pairs of regions in predicting the age group. We can thus use a correlation matrix to visualize these weights. + +```python +feat_exp_matrix = correlation_measure.inverse_transform(l_svc.coef_)[0] + +plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), + labels=range(feat_exp_matrix.shape[0]), + reorder='average', + tri='lower', vmax=0.01, vmin=-0.01) +``` + +#### Connectome + +We can also directly visualize the weight of our features on a brain! + +```python +# Regions coordinates +coords = plotting.find_parcellation_cut_coords(atlas_filename) +print(coords.shape) +``` + +```python +plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True) +``` + +```python +plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.006) +``` + +
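As a reminder of how a feature vector maps back to a region-by-region matrix, here is a small sketch in plain Python. It assumes the vector stores the strict lower triangle row by row and leaves the diagonal at zero; the ordering and the diagonal handling inside `nilearn` may differ, and the helper names are hypothetical.

```python
def n_pairs(n_regions):
    """Number of unique region pairs, diagonal excluded."""
    return n_regions * (n_regions - 1) // 2


def vector_to_symmetric(vector, n_regions):
    """Rebuild a symmetric matrix from its strict lower triangle (row by row)."""
    matrix = [[0.0] * n_regions for _ in range(n_regions)]
    values = iter(vector)
    for i in range(n_regions):
        for j in range(i):
            # Each weight is written to both (i, j) and (j, i).
            matrix[i][j] = matrix[j][i] = next(values)
    return matrix


print(n_pairs(64))
small_matrix = vector_to_symmetric([1.0, 2.0, 3.0], n_regions=3)
for row in small_matrix:
    print(row)
```

With the 64-parcel atlas used here, this gives 64 x 63 / 2 = 2016 region pairs, which should match the second dimension printed for `X_features` and `l_svc.coef_`.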
    +Let's add some motion! +
    Nilearn has a function that allows us to visualize our connectome interactively! This makes it much easier for us to examine the weights of our features. +
    + +```python +plotting.view_connectome(feat_exp_matrix, coords, edge_threshold='90%') +``` + +
    +Gray features... +
    You may have noticed that our feature weights are plotted in gray. Why do you think that is? +
    + +```python +feat_exp_matrix_rm_diag = feat_exp_matrix +feat_exp_matrix_rm_diag[feat_exp_matrix==1] = 0 +plotting.view_connectome(feat_exp_matrix_rm_diag, coords, edge_threshold='90%') +``` + +We have a model that predicts the age group with very high predictive performance. We can see that the features allowing us to make this prediction are distributed throughout the brain. Can we publish our results now? +
    +
    No! We need to explore further to see if our model is biologically plausible... To do this, we are going to visualize our brain images for each of our groups. + +```python +children, adults = data[33:66], data[0:33] +avg_children, avg_adults = [], [] + +# For each participant, we are going to average the brain activity across all our measurement points to obtain a 3D image. +for child, adult in zip(children, adults): + avg_adults.append(image.mean_img(adult)) + avg_children.append(image.mean_img(child)) + +# We are going to average our individual 3D images for each participant, doing so for each of our groups separately. +avg_children = image.mean_img(avg_children) +avg_adults = image.mean_img(avg_adults) +``` + +```python +plotting.view_img(avg_children, black_bg=False, cut_coords=(0,-16,16), cmap='turbo', symmetric_cmap=False) +``` + +```python +plotting.view_img(avg_adults, black_bg=False, cut_coords=(0,-16,16), cmap='turbo', symmetric_cmap=False) +``` + +
    +What do you notice? +
    Look at the averaged images for each of the groups. Can you observe any differences? +
    + +```python +# Let's retrieve our first volume for our first participant +first_volume = image.index_img(data[0], 0) + +plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +```python +# Let's retrieve our first volume for our 41st participant (index 40) +first_volume = image.index_img(data[40], 0) + +plotting.view_img(first_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +## Supplementary resources + +- [`seaborn` tutorial on visualizing distributions of data](https://seaborn.pydata.org/tutorial/distributions.html) +- [Python Graph Gallery](https://python-graph-gallery.com/) +- [Mastering data charts: A comprehensive guide to visualization](https://www.atlassian.com/data/charts) +- [Common caveats to avoid](https://www.data-to-viz.com/caveats.html) +- [Nilearn visualization functions - (f)MRI data](https://nilearn.github.io/dev/plotting/index.html) +- [MNE python visualization tutorials - EEG/MEG data](https://mne.tools/stable/auto_tutorials/visualization/index.html) +- [DIPY visualization tutorials - Diffusion MRI data](https://docs.dipy.org/stable/examples_built/index#visualization) diff --git a/notebooks/visualization_fr.ipynb b/notebooks/visualization_fr.ipynb new file mode 100644 index 0000000..912a8e5 --- /dev/null +++ b/notebooks/visualization_fr.ipynb @@ -0,0 +1,2739 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "7fe3c968-1410-49ba-ac5c-3c3789dc1e1f", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "# Visualization and model interpretation\n", + "\n", + "\n", + "## Learning objectives \n", + "👀 Understand why visualizations are useful\n", + "
    📈 Comprendre quel type de graphique utiliser selon le type de données\n", + "
    🎨 Choisir adéquatement sa palette de couleurs\n", + "
    🔃 Apprendre comment modifier différents éléments de nos figures\n", + "
    🧠 Utiliser `nilearn` pour la visualisation de données de neuroimagerie\n", + "
    🤖 Comprendre comment la visualisation peut nous aider à interpréter nos modèles d'apprentissage machine\n", + "\n", + "## Organisation de la séance\n", + "\n", + "La séance comportera des sections théoriques et des sections pratiques:\n", + "\n", + "**Théorique**\n", + "\n", + "Nous allons discuter des principes théoriques de base en visualisation. Les principes suivants seront couverts:\n", + "- Types de graphiques (données tabulaires): univarié vs bivarié; catégoriel vs continu\n", + "- Palette de couleurs: perceptuellement uniformes vs non-uniformes; discrètes vs continues\n", + "- Types de graphiques (données de neuroimagerie): cartes statistiques, connectome, etc.\n", + "\n", + "Cette partie inclut plusieurs éléments interactifs pour amener les étudiant.e.s à former leur propre compréhension de la matière. Le code est déjà fourni, mais nous ne nous y attarderons pas. \n", + "\n", + "**Pratique**\n", + "\n", + "Nous allons mettre en pratique, au fur et à mesure, les principes théoriques. Nous allons utiliser les librairies suivantes:\n", + "- matplotlib\n", + "- seaborn\n", + "- ptitprince\n", + "- plotly\n", + "- nilearn\n", + "\n", + "Les étudiant.e.s devront modifier le code donné pour comprendre le rôle de différents paramètres de visualisation et répondre à certaines questions à partir de ce que nous allons voir en classe ou à partir de références fournies (p. ex. documentation matplotlib)." + ] + }, + { + "cell_type": "markdown", + "id": "70ffeea7-819f-4607-9e78-b43cc095fe0c", + "metadata": {}, + "source": [ + "
    \n", + "Les encadrés jaunes contiennent des questions/exercices auxquels les étudiant.e.s doivent répondre.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "d19ef6c6-9843-4a8b-bf8d-0af3049447a7", + "metadata": {}, + "source": [ + "
    \n", + "Les encadrés bleus contiennent des informations complémentaires sur les jeux de données et les fonctions utilisés.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7237bc73-6409-4a39-8149-c4987f29e5c1", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import nibabel as nib\n", + "import seaborn as sns\n", + "import ptitprince as pt\n", + "import ipywidgets as widgets\n", + "import matplotlib.pyplot as plt\n", + "import plotly.express as px\n", + "\n", + "from cmcrameri import cm\n", + "from nilearn import datasets, plotting, image\n", + "from nilearn.input_data import NiftiLabelsMasker\n", + "from matplotlib import colormaps, colors\n", + "from colorspacious import cspace_converter, cspace_convert" + ] + }, + { + "cell_type": "markdown", + "id": "d702f598", + "metadata": {}, + "source": [ + "## Télécharger les données \n", + "\n", + "### Jeu de données Brain development fMRI\n", + "\n", + "Nous allons utiliser les données provenant du jeu de données *brain development fMRI* qui inclu les données phénotypiques, ainsi que les données IRMf collectées auprès d'enfants et d'adultes lors d'une tâche de visionnement de film ([Richardson et al., 2018](https://doi.org/10.1038/s41467-018-03399-2)).\n", + "\n", + "**Note:** si vous avez déjà téléchargé vos données et qu'elles ne se trouvent pas dans le dossier data/, modifier le chemin dans la cellule ci-dessous." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d8f0b0b9", + "metadata": {}, + "outputs": [], + "source": [ + "development_dataset = datasets.fetch_development_fmri(data_dir='data/')" + ] + }, + { + "cell_type": "markdown", + "id": "51da4225-32dc-4721-ab14-abed0a4df7a7", + "metadata": {}, + "source": [ + "### Datasaurus\n", + "Pour la première partie de ce tutoriel, nous allons très brièvement utiliser le jeu de données Datasaurus. 
Si vous voulez être en mesure d'exécuter les cellules utilisant ce jeu de données, vous devrez le télécharger à partir de [kaggle](https://www.kaggle.com/datasets/tombutton/datasaurusdozen).\n", + "\n", + "**Note:** modifier le chemin dans la cellule ci-dessous au besoin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0dbbbcf1-e9a0-41bc-8b84-3e36cbbd8e60", + "metadata": {}, + "outputs": [], + "source": [ + "# Credit: Alberto Cairo (original datasaurus), and Justin Matejka and George Fitzmaurice (datasaurus dozen)\n", + "data = pd.read_csv(\"data/datasaurus.csv\") " + ] + }, + { + "cell_type": "markdown", + "id": "cba69723-44c3-4508-a5ee-a515193d8c9d", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "## Une image vaut mille mots \n", + "- Les visualisations nous aident à mieux comprendre la complexité de nos données\n", + "- Elles nous permettent d'appuyer nos résultats\n", + "- Et de partager un message" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc862f23-1dfb-47b1-918c-147e6f2569a0", + "metadata": {}, + "outputs": [], + "source": [ + "# rcParams permet de modifier les paramètres globaux de nos figures\n", + "# Ici, on s'assure simplement que les valeurs entre (-8000000, 8000000) ne seront pas montrées en notation scientifique\n", + "plt.rcParams['axes.formatter.useoffset'] = False\n", + "plt.rcParams['axes.formatter.limits'] = (-8000000, 8000000) " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f1479b22-5c32-4de9-a894-35160a480ea5", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_category(category):\n", + " plt.scatter(data[data['dataset'] == category]['x'], data[data['dataset'] == category]['y'])\n", + " plt.xlabel('x')\n", + " plt.ylabel('y')\n", + " max_x = data[data['dataset'] == category]['x'].max()\n", + " max_y = data[data['dataset'] == category]['y'].max()\n", + " plt.text(max_x-0.22*max_x, 
max_y+0.10*max_y, f\"Mean: ({data[data['dataset'] == category]['x'].mean().round(2)}, {data[data['dataset'] == category]['y'].mean().round(2)})\")\n", + " plt.text(max_x-0.22*max_x, max_y+0.06*max_y, f\"Std: ({data[data['dataset'] == category]['x'].std().round(2)}, {data[data['dataset'] == category]['y'].std().round(2)})\")\n", + "\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_category,\n", + " category=widgets.Dropdown(\n", + " options=sorted(data['dataset'].unique()),\n", + " description='Dataset:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "5d19d31b-3a97-474c-bebb-0f13ac04c85b", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "### \"Ne faites jamais confiance aux statistiques descriptives, visualisez toujours vos données !\"\n", + "

    Alberto Cairo, créateur du jeu de données Datasaurus

    \n", + "\n", + "Python nous offre une panoplie de librairies pour visualiser nos données:\n", + "- Haut niveau vs bas niveau\n", + "- Images statiques vs graphiques interactifs\n", + "- Plusieurs librairies générales (p.ex., matplotlib, seaborn, bokeh et plotly)\n", + "- Et spécifiques à un domaine (p. ex., nilearn)\n", + "\n", + "Mais avec un grand pouvoir viennent de grandes responsabilités..." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc2c873e-d608-4022-a380-b2cd0cd6c04f", + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.DataFrame(\n", + " {\n", + " 'Categories': ['A', 'B'],\n", + " 'Values': [6000000, 7066000]\n", + " },\n", + " \n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9ee868f9-7c04-4ddc-a068-46b7289077bf", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "plt.ylim([5500000, 8000000])\n", + "plt.yticks([])\n", + "plt.title(\"Que pouvez-vous dire sur 'A' et 'B' si vous vous fiez uniquement à ce que vous voyez dans la figure ?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e9c4c673-1a48-4f9f-97b6-d29bd084c39e", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "plt.ylim([5500000, 8000000])\n", + " \n", + "plt.title(\"Que remarquez-vous ?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82aaaa58-6c8f-4f2f-9634-f58ade71b8fd", + "metadata": {}, + "outputs": [], + "source": [ + "plt.bar(data['Categories'], data['Values'])\n", + "\n", + "# Ajoute les valeurs au-dessus des barres\n", + "for i, v in enumerate(data['Values']):\n", + " plt.text(i, v + 1, str(v), ha='center', va='bottom')\n", + " \n", + "plt.title(\"Mieux ?\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "6e55598d-529e-41d9-bde9-72f3dfae8f89", + "metadata": { + "editable": true, + "slideshow": { + 
"slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "## À chaque type de données son type de graphique !\n", + "\n", + "Il existe plusieurs types de graphiques. Le type de graphique choisi dépend des variables que l'on veut visualiser:\n", + "- Visualisation univariée: **variable continue** vs **variable catégorielle**\n", + "- Visualisation bivariée: **catégorielle x catégorielle** vs **catégorielle x continue** vs **continue x continue**" + ] + }, + { + "cell_type": "markdown", + "id": "829dcb49-628f-4b61-81e7-de20c0f29ec7", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "### Visualisations univariées\n", + "\n", + "Pour une **variable continue**, nous pouvons visualiser sa distribution:\n", + "- Histogramme\n", + "- Estimation de la densité par noyau (*kde plot*)\n", + "- Nuage de points univarié (*strip plot*)\n", + " \n", + "Pour une **variable catégorielle**, nous pouvons visualiser la quantité d'observations pour chaque catégorie:\n", + "- Diagramme à barres (*bar plot*)" + ] + }, + { + "cell_type": "markdown", + "id": "019ba4e7-660d-4dd9-bf09-f54bf67fd8e4", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "" + }, + "tags": [] + }, + "source": [ + "
    \n", + "Si vos données ne se trouvent pas dans le dossier data/, modifier le chemin dans la cellule ci-dessous.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "473b14ef-f075-40d5-a5ae-7183f8c50e1e", + "metadata": {}, + "outputs": [], + "source": [ + "# Allons chercher les données\n", + "participants = pd.read_csv('data/development_fmri/development_fmri/participants.tsv', sep='\\t')\n", + "\n", + "# Regardons ce que nous avons dans nos données\n", + "participants.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b261b1ed-b7ae-4d5c-9511-9de670e7d3d9", + "metadata": {}, + "outputs": [], + "source": [ + "# Regardons le type de nos données\n", + "participants.dtypes" + ] + }, + { + "cell_type": "markdown", + "id": "fb599c4d-23a7-4368-a7b3-f234b42597c1", + "metadata": {}, + "source": [ + "
    \n", + "Le jeu de données contient des variables continues (p.ex. `Age`, `ToM Booklet-Matched`, `FB_Composite`) et des variables catégorielles (p.ex. `AgeGroup`, `Child_Adult`, `Gender`). `ToM Booklet-Matched` représente un score à une tâche visant à évaluer la théorie de l'esprit, c'est-à-dire la capacité à attribuer des états mentaux (croyances, désirs, émotions, intentions) à soi-même ou aux autres.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20965f72-f1d2-4ecb-94b4-0cfcfc8455c1", + "metadata": {}, + "outputs": [], + "source": [ + "# Regardons d'abord les statistiques descriptives associées à la variable continue `Age`\n", + "print(participants['Age'].describe())" + ] + }, + { + "cell_type": "markdown", + "id": "60726fed-f93c-4639-ac98-85ffb6ec8e80", + "metadata": { + "editable": true, + "slideshow": { + "slide_type": "slide" + }, + "tags": [] + }, + "source": [ + "#### Histogramme\n", + "\n", + "Un histogramme nous permet de visualiser la distribution d'une variable donnée de manière discrète en groupant ses valeurs dans des intervalles consécutifs (bins). On obtient donc la fréquence (c.-à-d. le nombre d'observations) dans chacun de ces intervalles.\n", + "\n", + "Il est possible de générer un histogramme dans matplotlib grâce à [la fonction `hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c332d133-f1a8-42b9-a29f-b20f8e73f9c7", + "metadata": {}, + "outputs": [], + "source": [ + "# Regardons maintenant sa distribution\n", + "plt.hist(participants['Age'])\n", + "# Ajoutons un titre\n", + "plt.title(\"Distribution de l'age\")\n", + "# Ajoutons un titre à l'axe des x et des y\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Compte')" + ] + }, + { + "cell_type": "markdown", + "id": "eca45e76-c80e-45b7-a19d-71914416945d", + "metadata": {}, + "source": [ + "
    \n", + "La science derrière le nombre de bins à visualiser - partie 1\n", + "
    La fonction hist dans matplotlib groupe les données en 10 bins par défaut. Par contre, un nombre de bins inadéquat (soit trop petit, soit trop grand) ne montre pas la distribution de manière représentative.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "2de74802-65da-4fa8-8267-2c36f94555d0", + "metadata": {}, + "source": [ + "
    \n", + "Le nombre de bins en pratique !\n", + "
    Modifier la valeur du paramètre `bins` dans la cellule ci-dessous pour déterminer le nombre de bins qui représenterait le mieux notre variable `Age`\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dcd2ae46-2746-4a0a-b3b8-e19c811ac281", + "metadata": {}, + "outputs": [], + "source": [ + "# Modifier la valeur du paramètre `bins`\n", + "plt.hist(participants['Age'], bins=10)\n", + "# Ajoutons un titre\n", + "plt.title(\"Distribution de l'age\")\n", + "# Ajoutons un titre à l'axe des x et des y\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Compte')" + ] + }, + { + "cell_type": "markdown", + "id": "8c227775-9b11-4d3f-8576-78c45d00c174", + "metadata": {}, + "source": [ + "
    \n", + "La science derrière le nombre de bins à visualiser - partie 2\n", + "
    Il existe différentes règles permettant de calculer le nombre de bins optimal, ainsi que l'intervalle à utiliser pour chacun de ces bins. Il est possible d'utiliser ces règles à même la fonction hist de matplotlib ! Il suffit de passer le nom d'une règle valide (p. ex. `'fd'` pour la règle de Freedman-Diaconis) au paramètre `bins`.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bd1a025-ef3f-4433-b781-624db50519fb", + "metadata": {}, + "outputs": [], + "source": [ + "help(plt.hist)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "55c9b964-315f-4b07-b3e4-5ebb0d6a75db", + "metadata": {}, + "outputs": [], + "source": [ + "# Regardons maintenant sa distribution\n", + "plt.hist(participants['Age'], bins='fd')\n", + "# Ajoutons un titre\n", + "plt.title(\"Distribution de l'age\")\n", + "# Ajoutons un titre à l'axe des x et des y\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Compte')" + ] + }, + { + "cell_type": "markdown", + "id": "465b2338-d6fe-49dd-9021-493fa165ac75", + "metadata": {}, + "source": [ + "#### Estimation de la densité par noyau (*kde plot*)\n", + "\n", + "L'**estimation de la densité par noyau** (ou *kernel density estimation*) nous permet de visualiser la distribution de notre variable de manière continue (plutôt que discrète) en estimant une fonction de densité. Par contre, similairement à la taille des bins pour l'histogramme, l'estimation de la densité par noyau est sensible à la largeur du noyau !" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bdc96ada-1839-4426-b625-ddbca5fe21b2", + "metadata": {}, + "outputs": [], + "source": [ + "# Pour visualiser l'estimation de la densité par noyau, nous allons utiliser la fonction\n", + "# `kdeplot` dans la librairie `seaborn`\n", + "\n", + "sns.kdeplot(participants['Age'])" + ] + }, + { + "cell_type": "markdown", + "id": "85ca2aac-7306-472d-a7c8-54fd3d1babd7", + "metadata": {}, + "source": [ + "
    \n", + "L'axe des y sur un kde plot - partie 1\n", + "
    Vous avez peut-être remarqué que les valeurs sur l'axe des y vont maintenant de 0 à 0.08 (vs 0 à 50 pour notre histogramme). Dans un kde plot, l'axe des y représente la densité, c'est-à-dire la probabilité par unité de la variable sur l'axe des x. En d'autres mots, à quel point les données sont denses pour une valeur de x donnée. Donc, les pics dans ce type de figure représentent une plus grande densité de points pour une étendue de valeurs donnée (c.-à-d. une plus grande probabilité qu'une valeur soit observée dans cette étendue), alors que les creux représentent une plus faible densité de points.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "d45d0798-be51-4176-9e3d-85b49577f8db", + "metadata": {}, + "source": [ + "
    \n", + "À noter !\n", + "
    Puisque ce type de graphique fournit une estimation continue, nous pourrions penser que certaines données existent alors qu'en réalité ce n'est pas le cas.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92db2e5e-a084-441a-85ac-5a837018d35b", + "metadata": {}, + "outputs": [], + "source": [ + "# On peut également supperposer l'histogramme et le kde plot avec seaborn\n", + "sns.histplot(\n", + " participants['Age'], \n", + " kde=True, \n", + " bins='fd', \n", + " edgecolor=None\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "5c5d021b-4948-48ea-b62b-78632220a5f4", + "metadata": {}, + "source": [ + "
    \n", + "L'axe des y sur un kde plot - partie 2\n", + "
    Lorsque l'on superpose un histogramme et un kde avec `seaborn`, vous remarquerez que l'axe des y montre la fréquence. Cependant, la courbe kde reste tout de même une courbe de densité. Cela se produit car seaborn met la courbe de densité à l'échelle de l'histogramme en la multipliant par le nombre d'observations et la largeur des bins. \n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "8e8d64cd-6efe-4e03-ae67-73c03a2fbc3c", + "metadata": {}, + "source": [ + "#### Nuage de points univarié\n", + "\n", + "L'**histogramme** et l'**estimation de la densité par noyau** nous permettent de visualiser la distribution d'une variable de manière discrète ou continue, respectivement. Cependant, ces types de visualisations ne nous permettent pas de visualiser les données brutes.\n", + "\n", + "Le **nuage de points univarié** nous permet de visualiser chaque point de données individuellement. Cela peut nous permettre d'identifier plus facilement la présence de valeurs abbérantes dans nos données. Cependant, le nuage de points n'est pas adapté si nous avons trop de points de données." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "16d41fcd-dc9b-45e5-bf9f-a76964422182", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "69d2b0df-522a-495f-9eb1-aebe6711f8b1", + "metadata": {}, + "source": [ + "
    \n", + "Réplication du nuage de points\n", + "
    Essayez de répliquer le nuage de points de la variable `Age` dans deux cellules différentes ci-dessous. Que remarquez-vous ?\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e05fc9b4-700b-4720-84c0-d4db11e062a3", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "689c73ec-a827-4d98-a225-8fe5c0842bd2", + "metadata": {}, + "outputs": [], + "source": [ + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "fac872ac-025d-44b7-adec-9d77afe6c2f2", + "metadata": {}, + "source": [ + "
    \n", + "Le paramètre `jitter`\n", + "
    Vous avez peut-être remarqué que les deux nuages de points que vous avez générés à partir de la même variable ne sont pas tout à fait identiques. Cela se produit car `seaborn` utilise `numpy.random` pour le calcul du jitter. Pour rendre le calcul du jitter reproductible, il est possible de fixer une seed a priori.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84c59cd1-d49e-422a-9155-a314b3bd7453", + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(12)\n", + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68f63b4d-6662-457e-96c8-ed8054169d79", + "metadata": {}, + "outputs": [], + "source": [ + "np.random.seed(12)\n", + "sns.stripplot(\n", + " x=participants['Age']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a9df3eac-a369-49e7-8595-53b53f5cf41d", + "metadata": {}, + "source": [ + "#### Diagramme à barres\n", + "\n", + "Le **diagramme à barres** (*bar plot*) permet de visualiser/comparer des variables catégorielles en montrant les fréquences des différentes valeurs ou simplement les différentes valeurs. Ce type de graphique est utile si l'on veut, par exemple, comparer le nombre de personnes par groupe." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "71c0dd71-b9e3-48a4-a072-9b17eac911c7", + "metadata": {}, + "outputs": [], + "source": [ + "participants['AgeGroup'].value_counts()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "97616227-7a10-4f63-b1f3-2a35feb4eb3a", + "metadata": {}, + "outputs": [], + "source": [ + "participants['AgeGroup'].value_counts().plot(kind='bar')\n", + "plt.ylabel('Counts')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e958cc7-8381-470f-a570-72335dc5fef3", + "metadata": {}, + "outputs": [], + "source": [ + "order = sorted(participants.AgeGroup.unique())\n", + "\n", + "sns.countplot(\n", + " data=participants,\n", + " x='AgeGroup',\n", + " order=order,\n", + " hue='Gender',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "56d7fc7c-704c-4079-98f6-ac5ef455610f", + "metadata": {}, + "source": [ + "### Résumé\n", + "\n", + "| Type de graphique | Type de données | Résumé |\n", + "| --- | --- | --- |\n", + "| Histogramme | Continue | 
Pour visualiser la distribution de nos données de manière discrète |\n", + "| Estimation de la densité par noyau | Continue | Pour visualiser la distribution de nos données de manière continue |\n", + "| Nuage de points | Continue | Pour visualiser chaque point de données individuellement |\n", + "| Diagramme à barres | Catégorielle | Pour comparer les fréquences entre différents groupes/catégories |\n" + ] + }, + { + "cell_type": "markdown", + "id": "88f04f1d-183d-40e7-ba61-95f2433e3d42", + "metadata": {}, + "source": [ + "### Visualisations bivariées\n", + "\n", + "Pour une **variable continue** x **variable continue**, nous pouvons utiliser:\n", + "- Nuage de points\n", + "- Estimation de la densité par noyau bivariée\n", + "- *Hexplot*\n", + "- *Joint plot*\n", + "- *Heat map*\n", + "\n", + "Pour une **variable catégorielle** x **variable continue**, nous pouvons utiliser:\n", + "- Boîte à moustaches (*box plot*)\n", + "- Diagramme en violon (*violin plot*)\n", + "- Nuage de points (*strip plot*)\n", + "- Tracé de points (*point plot*)\n", + "- *Rain cloud plot*\n", + "- Diagramme à barres (*bar plot*)... (?)" + ] + }, + { + "cell_type": "markdown", + "id": "034b2e27-ccc2-40bb-9412-f26d5689d692", + "metadata": {}, + "source": [ + "#### Nuage de points - Variable continue x variable continue" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eae06367-0a2a-460a-96d9-cb773e0fa2ec", + "metadata": {}, + "outputs": [], + "source": [ + "plt.scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "plt.xlabel('Age')\n", + "plt.ylabel('ToM Booklet-Matched')" + ] + }, + { + "cell_type": "markdown", + "id": "8e4fa34f-9852-432b-829b-8885396a175c", + "metadata": {}, + "source": [ + "
    \n", + "Que remarquez-vous ?\n", + "
    Prenez le temps d'observer la figure que nous venons de générer.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "45922286-dd55-4a5a-b9eb-3b39a2fbe615", + "metadata": {}, + "outputs": [], + "source": [ + "participants.groupby(['Child_Adult'])['ToM Booklet-Matched'].mean()" + ] + }, + { + "cell_type": "markdown", + "id": "7824b776-175f-4c46-b14f-fa56025616e3", + "metadata": {}, + "source": [ + "
    \n", + "Valeurs manquantes\n", + "
    La fonction `scatter` dans `matplotlib`, ainsi que la fonction `regplot` dans `seaborn` (voir les cellules suivantes), permettent de supprimer automatiquement les valeurs manquantes lors de la visualisation.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "03df6418-ec9a-41be-9d5f-2eec7cafb9ee", + "metadata": {}, + "source": [ + "
    \n", + "La fonction `regplot`\n", + "
    La fonction `regplot` dans seaborn permet de visualiser le nuage de points tout en ajustant un modèle de régression linéaire aux données !\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4685cfb0-afe3-45a3-91ed-0e6417f5ceb3", + "metadata": {}, + "outputs": [], + "source": [ + "sns.regplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched']\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "4a1c8517-5a9f-4e36-9a8c-d253602e9e3b", + "metadata": {}, + "source": [ + "
    \n", + "Ajustement du modèle de régression\n", + "
    Une régression linéaire ne semble pas être la meilleure façon de modéliser la relation entre notre variable `Age` et notre variable `ToM Booklet-Matched`. Modifier la valeur du paramètre `order` de la fonction `regplot` à 2 pour vérifier l'ajustement de ce modèle. \n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f791b6f2-eafb-44c7-beec-0c4fa479777e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.regplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " order=2\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "80cf7823-9d62-4c19-ba13-0211f9d71885", + "metadata": {}, + "source": [ + "
    \n", + "Et si nous ajoutions une variable de plus !\n", + "
    Nous pouvons visualiser les interactions multivariées grâce au paramètre `hue` de la fonction `lmplot` de `seaborn`. Dans la cellule ci-dessous, nous allons explorer la relation entre l'âge (x) et le score au ToM Booklet-Matched (y) en fonction du genre (hue).\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07ddd9fb-1611-46cc-b21c-5584d4406346", + "metadata": {}, + "outputs": [], + "source": [ + "sns.lmplot(\n", + " x='Age', \n", + " y='ToM Booklet-Matched',\n", + " data=participants,\n", + " hue='Gender'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "511d7c87-1965-441b-a97a-8dce2cac0255", + "metadata": {}, + "source": [ + "#### Estimation de la densité par noyau bivariée et hex plot - Variable continue x variable continue\n", + "\n", + "Lorsque nous avons beaucoup de points de données et que nous voulons visualiser la distribution de points en prenant en compte deux variables, nous risquons d'avoir beaucoup de chevauchement de points (*overplotting*). Dans ce cas, il peut être difficile de bien visualiser la distribution de nos données avec un nuage de points. Il est donc possible d'utiliser d'autres types de graphiques:\n", + "\n", + "L'**estimation de la densité par noyau bivariée** nous permet de visualiser la manière dont deux variables se distribuent dans un espace à deux dimensions. Chaque contour représente une zone densité. Plus les contours sont proches, plus la densité est élevée, c'est-à-dire là où les données sont le plus concentrées. \n", + "\n", + "Le **hex plot** permet de visualiser la densité des points de données de manière discrète. C'est donc l'équivalent d'un histogramme mais pour visualiser deux variables plutôt qu'une. Les zones plus foncées représentent des zones de densité élévée.\n", + "\n", + "⚠ L'estimation de la densité par noyau bivariée est plus demandant en termes de computation comparativement au *hex plot*." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "257d7f46-88f9-45ea-9ae3-9363f17b5b73", + "metadata": {}, + "outputs": [], + "source": [ + "sns.kdeplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " fill=True\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "99ff567f-0637-4164-8925-852c797cb4c1", + "metadata": {}, + "outputs": [], + "source": [ + "sns.jointplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " kind='hex'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77fff24f-fe04-4fa1-9cdf-88355c5a8ee8", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_jointplot(kind):\n", + " sns.jointplot(\n", + " x=participants['Age'], \n", + " y=participants['ToM Booklet-Matched'],\n", + " kind=kind\n", + " )\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_jointplot,\n", + " kind=widgets.Dropdown(\n", + " options=['hex', 'reg', 'kde', 'scatter'],\n", + " description='Type:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57b35514-a4c9-4c35-9dd6-e6445026e1ac", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_jointplot(kind):\n", + " mean, cov = [0, 1], [(1, .5), (.5, 1)]\n", + " x, y = np.random.multivariate_normal(mean, cov, 1000).T\n", + " \n", + " sns.jointplot(\n", + " x=x, \n", + " y=y,\n", + " kind=kind\n", + " )\n", + " plt.show()\n", + "\n", + "widgets.interact(\n", + " plot_jointplot,\n", + " kind=widgets.Dropdown(\n", + " options=['scatter', 'hex', 'reg', 'kde'],\n", + " description='Type:'\n", + " )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "ab3e0b7c-df14-4aa0-b259-794e29af2687", + "metadata": {}, + "source": [ + "#### Nuage de points (stripplot) - Variable continue x variable catégorielle\n", + "\n", + "Nous avons déjà parlé du nuage de points (*stripplot*) au niveau univarié, mais nous pouvons 
également visualiser le nuage de points de plusieurs catégories simultanément." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df0a447c-3f56-4681-9d60-7f2527cbc39a", + "metadata": {}, + "outputs": [], + "source": [ + "order = sorted(participants.AgeGroup.unique())[:-1]\n", + "\n", + "sns.stripplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "88888956-20fc-428a-be59-58a29bc50cef", + "metadata": {}, + "source": [ + "#### Boîtes à moustaches - Variable continue x variable catégorielle\n", + "\n", + "Les **boîtes à moustaches** nous permettent de visualiser les distributions d'un ou plusieurs groupes de variables continues (p.ex. distributions d'âge pour différents groupes expérimentaux). Les différentes composantes de la boîte à moustaches représentent différentes statistiques descriptives:\n", + "- La ligne dans la boîte représente la médiane.\n", + "- Les limites de la boîte (quartile inférieur Q1 et quartile supérieur Q3) délimitent l'étendue où se situent 50 % des données.\n", + "- Les moustaches (c.-à-d. les lignes à l'extérieur de la boîte) s'étendent jusqu'aux valeurs les plus extrêmes situées à moins de 1.5 fois l'intervalle interquartile (IQR = Q3 - Q1) de la boîte.\n", + "- Les points représentent les valeurs aberrantes, c'est-à-dire les valeurs supérieures à Q3 + 1.5 x IQR ou inférieures à Q1 - 1.5 x IQR."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67371457-ec27-4c1a-9481-e1f35b42d31e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.boxplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'],\n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "8b37171f-00e5-488c-a326-2289e724b14f", + "metadata": {}, + "source": [ + "#### Diagrammes en violon - Variable continue x variable catégorielle\n", + "\n", + "Les **diagrammes en violon** nous permettent de visualiser la distribution des données en utilisant des courbes de densité (aka courbes d'**estimation de la densité par noyau**). La largeur de chaque courbe correspond à la fréquence approximative des points pour chaque région (valeurs sur l'axe des y)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9a38695-6210-448a-8ca4-f9940388af28", + "metadata": {}, + "outputs": [], + "source": [ + "sns.violinplot(\n", + " x='AgeGroup', \n", + " y = 'ToM Booklet-Matched', \n", + " data = participants[participants.AgeGroup!='Adult'], \n", + " order=order\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "15efa9b0-e35c-4997-a20d-330ab3ad487a", + "metadata": {}, + "source": [ + "#### Diagrammes raincloud - Variable continue x variable catégorielle\n", + "\n", + "Pourquoi choisir entre un nuage de points, une boîte à moustaches et un diagramme en violon quand nous pouvons faire les trois en même temps ! C'est ce que nous permettent de faire les **diagrammes raincloud** (*raincloud plot*).\n", + "\n", + "*Note :* Ce type de graphique n'est pas nativement intégré par `seaborn` ou `matplotlib`. 
We will need the `ptitprince` library, which we imported at the beginning of the tutorial (`import ptitprince as pt`).\n", + "\n", + "*Resource:* For more raincloud plot examples, see the [ptitprince](https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb) tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c04c3c78-75cd-47dd-9708-93a2344fb54d", + "metadata": {}, + "outputs": [], + "source": [ + "pt.RainCloud(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    order=order,\n", + "    bw=0.6\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d6128c77-7e2c-41ad-aa46-6df0019cfd90", + "metadata": {}, + "source": [ + "#### Point plots - Continuous variable x categorical variable\n", + "\n", + "**Point plots** let us compare means (or another descriptive statistic) across groups while showing the uncertainty (e.g., 95% confidence interval, standard deviation, etc.). This type of visualization is handy for showing trends across categories or across measurement time points (e.g., with longitudinal data)."
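
To make the idea of an uncertainty interval around a group mean more concrete, here is a minimal sketch of a percentile bootstrap confidence interval. It is an assumption-laden illustration, not the plotting library's actual implementation: the helper `bootstrap_ci`, its defaults (`n_boot=1000`, `seed=0`), and the toy `scores` list are all hypothetical.

```python
import numpy as np


def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Return (mean, low, high) using a percentile bootstrap of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing scores
    # Draw n_boot resamples (with replacement) and take the mean of each one
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high


# Toy scores for a single group (hypothetical values, one missing)
scores = [0.2, 0.4, 0.5, 0.55, 0.6, 0.7, np.nan, 0.9]
mean, low, high = bootstrap_ci(scores)
print(f"mean={mean:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

Running this once per group would give the point (mean) and the error bar (interval) that a point plot displays for each category.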
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b3b7f45b-6493-404b-bf56-1f5721d354e9", + "metadata": {}, + "outputs": [], + "source": [ + "sns.pointplot(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    estimator='mean',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    order=order,\n", + "    errorbar=('ci', 95)\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "6d6cae11-c355-4da1-8b13-f036366719bd", + "metadata": {}, + "source": [ + "#### Bar plots to visualize a continuous variable x a categorical variable?\n", + "\n", + "You may have already seen bar plots used to represent a continuous variable.\n", + "However, this type of visualization is not recommended for this type of variable:\n", + "- It only shows certain descriptive statistics (e.g., the mean) without providing any information about the distribution of the variable\n", + "- If you absolutely want to use a bar plot, overlay it with a strip plot!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13184fef-746a-44cf-a3d0-4eebfc25458e", + "metadata": {}, + "outputs": [], + "source": [ + "sns.barplot(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    order=order, palette='Blues'\n", + ")\n", + "\n", + "\n", + "sns.stripplot(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    jitter=True,\n", + "    order=order, color='black')\n" + ] + }, + { + "cell_type": "markdown", + "id": "8a976f2a-4944-4b7b-a4dd-71f3a852e3ca", + "metadata": {}, + "source": [ + "## What if we added some color to our figures?"
 + ] + }, + { + "cell_type": "markdown", + "id": "68e1f171-876e-4c86-8b24-33917a0b1f37", + "metadata": {}, + "source": [ + "### Perceptually uniform vs. non-uniform color palettes\n", + "- Colors are perceived in terms of their hue (orange, red, green, etc.) and their lightness (how light or dark a hue is)\n", + "- Because of the characteristics of our photoreceptors, we do not process the light spectrum uniformly\n", + "- Most of the photoreceptors that let us see color (cones) respond to long wavelengths (i.e., red, orange, yellow)\n", + "- As a result, we do not perceive variations in green-blue hues as well as variations in yellow-red hues" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b21cf7da-33cb-4334-93f4-9f1a9a4fa476", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['hsv']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eafe14c7-4b28-4ea4-be59-2aec8530738f", + "metadata": {}, + "outputs": [], + "source": [ + "cm.batlow" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eef2848b-7792-4c1f-aaf3-ccc339a129eb", + "metadata": {}, + "outputs": [], + "source": [ + "import requests\n", + "from PIL import Image\n", + "from io import BytesIO" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7611dee9-0528-4899-ab4c-1bda9a658206", + "metadata": {}, + "outputs": [], + "source": [ + "response = requests.get('https://raw.githubusercontent.com/matplotlib/matplotlib/main/doc/_static/stinkbug.png')\n", + "img = np.asarray(Image.open(BytesIO(response.content)))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8735a4ea-0afe-46f6-8889-70d24a0dca48", + "metadata": {}, + "outputs": [], + "source": [ + "def plot_img(cmap):\n", + "    lum_img = img[:, :, 0]\n", + "    if cmap == 'noir et blanc':\n", + "        plt.imshow(img)\n", + "    elif cmap == 
'hsv':\n", + "        plt.imshow(lum_img, cmap=\"hsv\")\n", + "    elif cmap == 'batlow':\n", + "        plt.imshow(lum_img, cmap=cm.batlow)\n", + "\n", + "widgets.interact(\n", + "    plot_img,\n", + "    cmap=widgets.Dropdown(\n", + "        options=['noir et blanc', 'hsv', 'batlow'],\n", + "        value='noir et blanc',\n", + "        description='Palette:'\n", + "    )\n", + ");" + ] + }, + { + "cell_type": "markdown", + "id": "f77abfa1-7763-457f-a22a-9c6a75ac7d1f", + "metadata": {}, + "source": [ + "
    \n", + "What do you notice?\n", + "
    Compare the image rendered with the `hsv` and `batlow` color palettes against the black-and-white image (the 'noir et blanc' option).\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "f7788b67-677d-4885-a63b-2c7e20964cb2", + "metadata": {}, + "source": [ + "
    \n", + "Don't throw the rainbow in the trash\n", + "
    Google developed the `turbo` color palette, an improved rainbow palette designed to be smoother than traditional rainbow palettes such as `jet` (although it is not strictly perceptually uniform). For more details, see the Google Research blog.\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "012e371d-c345-4963-9aae-ad511922c297", + "metadata": {}, + "source": [ + "### Discrete vs. continuous color palettes\n", + "\n", + "Color palettes can be either continuous (like those we saw above) or discrete. Discrete color palettes are used to visualize categorical variables, where the categories have no inherent order (e.g., children with COVID vs. children without COVID vs. adults with COVID vs. adults without COVID)." + ] + }, + { + "cell_type": "markdown", + "id": "84fb045d-b59c-427f-9f9f-56fe762a5a57", + "metadata": {}, + "source": [ + "#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b726eb33-b475-4aff-bfab-c1773e1683c2", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['Set2']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a094205a-c558-49d5-9625-86a7df7357a7", + "metadata": {}, + "outputs": [], + "source": [ + "colormaps['Set3']" + ] + }, + { + "cell_type": "markdown", + "id": "09d84ac8-02cf-463c-8c72-32d45b3bac17", + "metadata": {}, + "source": [ + "#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad365aaf-eb1f-4874-8958-7cb5ec561ea5", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette('rocket', 10)" + ] + }, + { + "cell_type": "markdown", + "id": "4aa57803-e66a-45eb-872c-261623c47429", + "metadata": {}, + "source": [ + "#### [cmcrameri](https://s-ink.org/scientific-colour-maps)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "df1db6fb-6ad3-4f4e-9375-b3dc915eff63", + "metadata": {}, + "outputs": [], + "source": [ + "cm.batlow.resampled(10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9dd84b40-8d10-43d1-8a5f-71a8e2a03adf", + "metadata": 
{}, + "outputs": [], + "source": [ + "cm.lipari.resampled(10)" + ] + }, + { + "cell_type": "markdown", + "id": "27cc49be-02ac-4ace-a104-752878b6139e", + "metadata": {}, + "source": [ + "
    \n", + "Discretizing continuous palettes\n", + "
    In the cells above, we saw that a continuous palette can be discretized by specifying the number of colors we want. However, as mentioned earlier, discrete palettes are used to visualize categories that have no inherent order. So there is no need for the discrete palette to be ordered.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "feb640e9-10aa-4d28-9605-16e18b484b0d", + "metadata": {}, + "outputs": [], + "source": [ + "nb_colors = 10\n", + "#np.random.seed(10)\n", + "\n", + "original_cmap = cm.lipari.resampled(nb_colors)\n", + "\n", + "original_colors = original_cmap(np.arange(nb_colors))\n", + "\n", + "shuffled_colors = original_colors.copy()\n", + "np.random.shuffle(shuffled_colors)\n", + "\n", + "colors.ListedColormap(shuffled_colors)" + ] + }, + { + "cell_type": "markdown", + "id": "f49a1daf-c542-45b1-8ffc-2a43e50da03a", + "metadata": {}, + "source": [ + "### Diverging color palettes\n", + "\n", + "Diverging color palettes are useful when the data have an interpretable central value. They can be either discrete or continuous. For example:\n", + "- **Discrete diverging palettes**: Suppose we collected data with a Likert scale. The values could range from 'Strongly disagree' to 'Strongly agree', with 'Neutral' as the central value. In that case, we could show the values on the left (from 'Strongly disagree' to 'Neutral') in shades of blue and the values on the right (from 'Neutral' to 'Strongly agree') in shades of orange/red.\n", + "- **Continuous diverging palettes**: If we have negative and positive values (e.g., correlation values), we could use zero as the central value, with negative values shown in shades of blue and positive values in shades of orange/red."
 + ] + }, + { + "cell_type": "markdown", + "id": "cca903e7-4cf5-4499-9e35-02f5dcd8600d", + "metadata": {}, + "source": [ + "#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7bf82533-bdd6-4361-9feb-44519d62d569", + "metadata": {}, + "outputs": [], + "source": [ + "# Discrete diverging palette\n", + "colormaps['bwr'].resampled(9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0e54cd44-d098-4487-9182-b9653620db9a", + "metadata": {}, + "outputs": [], + "source": [ + "# Continuous diverging palette\n", + "colormaps['bwr']" + ] + }, + { + "cell_type": "markdown", + "id": "5e03d59c-ba94-4b20-9779-d1d9fc710b4c", + "metadata": {}, + "source": [ + "#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "46ba2353-d279-4de2-8726-7ab94132f221", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"coolwarm\", 9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ceb3a71d-46eb-4777-99ae-6c00159236f7", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"coolwarm\", as_cmap=True)" + ] + }, + { + "cell_type": "markdown", + "id": "3d4b7e41-7ffa-4aa9-a7ad-db81756df281", + "metadata": {}, + "source": [ + "#### [cmcrameri](https://s-ink.org/scientific-colour-maps)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2228a30e-51ed-4a89-89f7-b96345237d9b", + "metadata": {}, + "outputs": [], + "source": [ + "cm.vik.resampled(9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d4de0331-45db-4944-9b74-b3cfe56db752", + "metadata": {}, + "outputs": [], + "source": [ + "cm.vik" + ] + }, + { + "cell_type": "markdown", + "id": "dbb42716-b3b4-47c3-85e0-453b4b27e53c", + "metadata": {}, + "source": [ + "### Universally readable color palettes\n", + "\n", + "We 
have discussed the perceptual uniformity of color palettes, but it is also important to use colors that everyone can perceive. Indeed, some color combinations can be hard to distinguish for people with a color vision deficiency. A few tips:\n", + "- Avoid red-green combinations\n", + "- Vary the lightness and saturation of colors so that your figures remain readable in black and white\n", + "- Use the colorblind-friendly palettes from `matplotlib`, `seaborn`, or `cmcrameri`\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a7b4ac11-efa3-432b-b7f8-7b2f84de212d", + "metadata": {}, + "outputs": [], + "source": [ + "sns.color_palette(\"colorblind\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3fc57c58-68e1-4b30-adcb-9632db371c29", + "metadata": {}, + "outputs": [], + "source": [ + "converters = {\n", + "    'deuter50_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"deuteranomaly\",\n", + "        \"severity\": 50\n", + "    },\n", + "    'deuter100_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"deuteranomaly\",\n", + "        \"severity\": 100\n", + "    },\n", + "    'prot50_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"protanomaly\",\n", + "        \"severity\": 50\n", + "    },\n", + "    'prot100_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"protanomaly\",\n", + "        \"severity\": 100\n", + "    },\n", + "    'trit50_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"tritanomaly\",\n", + "        \"severity\": 50\n", + "    },\n", + "    'trit100_space': {\n", + "        \"name\": \"sRGB1+CVD\",\n", + "        \"cvd_type\": \"tritanomaly\",\n", + "        \"severity\": 100\n", + "    }\n", + "}\n", + "\n", + "def show_palettes(cmap, n_colors=10):\n", + "    f, ax = plt.subplots(7, 1, figsize=(10, 6)) #, layout=\"constrained\")\n", + "    plt.subplots_adjust(hspace=0.6)\n", + " 
 \n", + "    gradient = np.arange(n_colors).reshape(1, -1)\n", + "\n", + "    if cmap in ['viridis', 'coolwarm', 'colorblind' ,'pastel', 'muted']:\n", + "        cmap_sns = sns.color_palette(cmap, n_colors)\n", + "        cmap_m = colors.ListedColormap(sns.color_palette(cmap, n_colors=n_colors))\n", + "    if cmap in ['batlow', 'vik']:\n", + "        cmap_m = getattr(cm, cmap).resampled(n_colors)\n", + "        cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors)))\n", + "    if cmap in ['bwr', 'hsv', 'PuOr', 'summer']:\n", + "        cmap_m = colormaps[cmap].resampled(n_colors)\n", + "        cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors)))\n", + "\n", + "    # Original palette\n", + "    ax[0].imshow(gradient, aspect='auto', cmap=cmap_m)\n", + "    ax[0].set_title('Original')\n", + "    for idx, converter in enumerate(converters.keys()):\n", + "        # Palette as seen with the simulated color vision deficiency\n", + "        converted_palette = cspace_convert(cmap_sns, converters[converter], \"sRGB1\")\n", + "        converted_palette = np.clip(converted_palette, 0, 1)\n", + "\n", + "        ax[idx+1].imshow(gradient, aspect='auto', cmap=colors.ListedColormap(converted_palette))\n", + "        ax[idx+1].set_title(f\"{converters[converter]['cvd_type']}-{converters[converter]['severity']}\")\n", + "\n", + "    for a in ax.flatten():\n", + "        a.set_yticks([])\n", + "        a.set_xticks([])\n", + "    \n", + "    plt.show()\n", + "\n", + "widgets.interact(\n", + "    show_palettes,\n", + "    cmap=widgets.Dropdown(\n", + "        options=['pastel', 'viridis', 'coolwarm', 'batlow', 'bwr', 'colorblind', 'hsv', 'PuOr', 'summer', 'vik', 'muted'],\n", + "        value='viridis',\n", + "        description='Palette:'\n", + "    ),\n", + ");\n" + ] + }, + { + "cell_type": "markdown", + "id": "12287644-403b-4d4f-af4c-24df7a1aa031", + "metadata": {}, + "source": [ + "
    \n", + "Beyond color\n", + "
    Choosing the right color palette is important, but there are other strategies we can use to make our figures more accessible and easier to interpret. Instead of relying only on color to distinguish categories, we could use different marker shapes. If the figure contains lines, for example to show the trajectory of a variable over time, we could use different line styles (e.g., dashed or dotted). For more examples, see \"The best charts for color blind viewers\".\n", + "
    " + ] + }, + { + "cell_type": "markdown", + "id": "3dc5003b-553d-40e7-b882-c41ae62bff96", + "metadata": {}, + "source": [ + "## Anatomy of a figure\n", + "\n", + "We have already seen how to add or modify a few elements of our figures, such as the title and the axis labels. In this section, we will look at the craft of figure-making in more detail. We will cover:\n", + "- Subplots\n", + "- Spines (borders)\n", + "- Ticks\n", + "- Grid\n", + "- Legend" + ] + }, + { + "cell_type": "markdown", + "id": "4012f001-b49c-4352-8bad-87b4844f7db8", + "metadata": {}, + "source": [ + "### Subplots\n", + "\n", + "So far we have mostly created our figures by directly calling functions from `matplotlib` and `seaborn`. However, if we want to create a figure with several panels/axes (i.e., columns and/or rows), we need to use **subplots**." + ] + }, + { + "cell_type": "markdown", + "id": "41de742c-fbff-49e7-9669-f778e75f185a", + "metadata": {}, + "source": [ + "#### matplotlib" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "41725d76-fb65-4505-812d-721c117c126c", + "metadata": {}, + "outputs": [], + "source": [ + "# Let's reuse the code we used earlier. 
This code produces two separate figures\n", + "plt.hist(participants['Age'], bins='fd')\n", + "plt.title(\"Age distribution\")\n", + "plt.xlabel('Age')\n", + "plt.ylabel('Count')\n", + "plt.show()\n", + "\n", + "plt.scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "plt.xlabel('Age')\n", + "plt.ylabel('ToM Booklet-Matched')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e58ebbf4-8f10-4858-9de8-972b35295f64", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(plt.subplots)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a9f9845-1b82-4723-bdfd-75b14590237a", + "metadata": {}, + "outputs": [], + "source": [ + "# To produce a single figure containing both plots, we need to use subplots\n", + "# Function syntax: plt.subplots(nrows, ncols, ...)\n", + "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", + "axes[0].hist(participants['Age'], bins='fd')\n", + "axes[0].set_title(\"Age distribution\")\n", + "axes[0].set_xlabel('Age')\n", + "axes[0].set_ylabel('Count')\n", + "\n", + "axes[1].scatter(participants['Age'], participants['ToM Booklet-Matched'])\n", + "axes[1].set_title(\"Relationship between age and ToM scores\")\n", + "axes[1].set_xlabel('Age')\n", + "axes[1].set_ylabel('ToM Booklet-Matched')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab909a51-6c79-414b-86cf-f14ad0898ba5", + "metadata": {}, + "outputs": [], + "source": [ + "# The subplots function can also be used when there is only one plot\n", + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Count')" + ] + }, + { + "cell_type": "markdown", + "id": "1f84023b-a40a-4107-96a4-9170f4ac64f2", + "metadata": {}, + "source": [ + "#### seaborn" + ] + }, + { + "cell_type": "code", 
 + "execution_count": null, + "id": "97aa3481-127b-49dc-ba4b-3869ee7bccbc", + "metadata": {}, + "outputs": [], + "source": [ + "# With seaborn\n", + "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", + "\n", + "order = sorted(participants[participants.AgeGroup!='Adult']['AgeGroup'].unique())\n", + "#order.sort()\n", + "\n", + "sns.boxplot(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    order=order,\n", + "    ax=axes[0]\n", + ")\n", + "\n", + "sns.violinplot(\n", + "    x='AgeGroup',\n", + "    y='ToM Booklet-Matched',\n", + "    data=participants[participants.AgeGroup != 'Adult'],\n", + "    order=order,\n", + "    ax=axes[1]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "b8684899-02a2-47f3-ba98-dd524fd0211f", + "metadata": {}, + "source": [ + "### Spines\n", + "The spines are the lines around the plotting area." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c1f875ee-d881-469c-98ad-6971db15c32d", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Count')\n", + "\n", + "for spine in ax.spines.values():\n", + "    spine.set_color(\"red\")\n", + "    spine.set_linewidth(3)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f51d8b30-3181-4812-ae57-2bac77001efd", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Count')\n", + "\n", + "ax.spines[['right', 'top', 'left', 'bottom']].set_visible(False) # ax.spines.top.set_visible(False)" + ] + }, + { + "cell_type": "markdown", + "id": "e8d250f6-6e14-46ba-8559-0827304b01d4", + "metadata": {}, + "source": [ + "### 
Ticks" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b4ab9580-9f60-45e4-9915-733d806b14f7", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Count')\n", + "\n", + "ax.tick_params(\n", + "    axis='both', # Apply the changes to both axes (x and y)\n", + "    which='major', # Which ticks to modify: 'major', 'minor' or 'both'\n", + "    length=10, # Tick length\n", + "    width=2, # Tick width\n", + "    color='purple', # Tick color\n", + "    labelsize=12, # Label font size\n", + "    labelcolor='darkorange', # Label color\n", + "    labelrotation=45 # Label rotation (degrees)\n", + ")\n", + "\n", + "#from matplotlib.ticker import MultipleLocator\n", + "#ax.xaxis.set_minor_locator(MultipleLocator(2))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd1952eb-ad95-4d44-97f5-8bd486549769", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Count')\n", + "\n", + "ax.tick_params(\n", + "    axis='x',\n", + "    which='both',\n", + "    bottom=False,\n", + "    top=False,\n", + "    labelbottom=False)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "id": "eee08e4e-2473-4075-baf3-efb25dbe6899", + "metadata": {}, + "source": [ + "
    \n", + "Modify the code\n", + "
    Starting from the code provided in the cells above, edit the cell below to generate a figure with the following characteristics:\n", + "
  • \n", + " No spine at the top and on the left\n", + "
  • \n", + "
  • \n", + " No ticks on the y-axis\n", + "
  • \n", + "
  • \n", + " Minor ticks on the x-axis at intervals of 1\n", + "
  • \n", + "
  • \n", + " x-axis tick labels rotated by 90 degrees\n", + "
  • \n", + "
  • \n", + " Titles for all axes as well as for the figure\n", + "
  • " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b755dea-75e4-41c3-82e5-57d0ce1571a4", + "metadata": {}, + "outputs": [], + "source": [ + "# Add your code in this cell\n", + "# ..." + ] + }, + { + "cell_type": "markdown", + "id": "c9f7193a-d10f-491d-81d5-7b0226810bb7", + "metadata": {}, + "source": [ + "### Grid" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "58c7a538-2139-483b-8f77-d3aa7d016b25", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "ax.hist(participants['Age'], bins='fd')\n", + "ax.set_title(\"Age distribution\")\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('Count')\n", + "\n", + "\n", + "ax.grid(\n", + "    True,\n", + "    color='gray',\n", + "    linestyle='--',\n", + "    linewidth=1\n", + ")\n", + "\n", + "ax.set_axisbelow(True) # Draws the grid below the plot elements (similar to using the `zorder` parameter)" + ] + }, + { + "cell_type": "markdown", + "id": "9750cf87-7b19-40e5-9e94-1a6c8fd58f15", + "metadata": {}, + "source": [ + "### Legend" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19b8c1c5-70eb-4f1d-abce-f208a97e2b42", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(6, 4))\n", + "\n", + "sns.scatterplot(data=participants, x='Age', y='ToM Booklet-Matched', hue='Gender', ax=ax)\n", + "ax.set_xlabel('Age')\n", + "ax.set_ylabel('ToM Booklet-Matched')\n", + "\n", + "legend = ax.legend(fontsize=12, loc='lower right')\n", + "\n", + "legend.get_frame().set_edgecolor(\"black\")\n", + "legend.get_frame().set_linewidth(2)\n", + "legend.get_frame().set_facecolor(\"white\")" + ] + }, + { + "cell_type": "markdown", + "id": "15fb179d-a48e-4267-9654-59e3543f1528", + "metadata": {}, + "source": [ + "### Default parameters\n", + "\n", + "The default parameters for all elements of our figures (e.g., font, text size, line width, etc.) are defined in the `rcParams` object. 
These values can be modified, which lets us apply a consistent style from one figure to the next." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ec3305f-d81d-4c18-8086-6a7573aace18", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "plt.rcParams" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e8647130-227d-406f-9100-0fab187465c6", + "metadata": {}, + "outputs": [], + "source": [ + "STYLE = {\n", + "    \"font.cursive\": \"Comic Sans MS\",\n", + "    \"font.size\": 8,\n", + "    \"axes.linewidth\": 1.2,\n", + "    \"axes.grid\": True,\n", + "    \"axes.grid.axis\": 'both',\n", + "    'axes.axisbelow': True,\n", + "    \"figure.dpi\": 150\n", + "}\n", + "\n", + "plt.rcParams.update(STYLE)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d965ff6-cad7-4d68-8353-77d87ebe8bfc", + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Count')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e84e6d32-3d96-4782-ba09-67a04a692464", + "metadata": {}, + "outputs": [], + "source": [ + "# Restore the default values\n", + "plt.rcdefaults()" + ] + }, + { + "cell_type": "markdown", + "id": "c272b2dd-e0d0-42e6-b0a3-f2ef53852154", + "metadata": {}, + "source": [ + "
    \n", + "Choose your style\n", + "
    It is also possible to use predefined styles through matplotlib's `style.use` function. The matplotlib documentation lists the available styles; see also its pages on style sheets and `rcParams` to learn more.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3e7bef12-e3d7-4fee-9ab2-74a1911c6fa7", + "metadata": {}, + "outputs": [], + "source": [ + "print(plt.style.available)" + ] + }, + { + "cell_type": "markdown", + "id": "00dbfb21-c6b7-4dee-ad67-aa6db3feca0d", + "metadata": {}, + "source": [ + "
    \n", + "For example, 'ggplot' is a style we can use to obtain figures similar to those produced by the `ggplot2` library in R.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "498fa45c-2693-4364-aa26-1b88a58b745c", + "metadata": {}, + "outputs": [], + "source": [ + "plt.style.use('ggplot')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80353500-b6d6-4362-a63a-4a2d70164616", + "metadata": {}, + "outputs": [], + "source": [ + "fig, axes = plt.subplots(figsize=(6, 4))\n", + "axes.hist(participants['Age'], bins='fd')\n", + "axes.set_title(\"Age distribution\")\n", + "axes.set_xlabel('Age')\n", + "axes.set_ylabel('Count')" + ] + }, + { + "cell_type": "markdown", + "id": "f21c21df-af8d-4735-83d2-885fe5f8c46b", + "metadata": {}, + "source": [ + "## Interactive plots\n", + "\n", + "Static figures are nice, but interactive figures are even better! Interactive plots let us explore our data in ways that would not be possible with static figures, and build dashboards (see the [documentation](https://dash.plotly.com/)). We can add information to our figures without cluttering them.\n", + "\n", + "The two main Python libraries for interactive figures are `plotly` and `bokeh`. The `plotly` library offers a high-level interface (`plotly.express`) for creating figures in a single line of code, whereas `bokeh` offers more customization options. In this tutorial, we will only cover `plotly.express`, but if you want to try `bokeh`, see [the documentation](https://docs.bokeh.org/en/latest/)."
 + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18a2b0a9-8299-4887-9207-255f679babb0", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(px.scatter)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfab865d-3208-40a3-84f7-ea2c52f97d90", + "metadata": {}, + "outputs": [], + "source": [ + "fig = px.scatter(\n", + "    data_frame=participants,\n", + "    x='Age',\n", + "    y='ToM Booklet-Matched',\n", + "    hover_data=['participant_id', 'Gender'],\n", + "    color='Handedness',\n", + "    symbol='Handedness'\n", + ")\n", + "\n", + "fig.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fcf186a9-7b00-44ae-8843-5cfa0f2a7520", + "metadata": {}, + "outputs": [], + "source": [ + "participants.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2f9bc8e-0847-4926-b058-02ec9d192877", + "metadata": {}, + "outputs": [], + "source": [ + "fig = px.scatter(\n", + "    data_frame=participants,\n", + "    x='Age',\n", + "    y='ToM Booklet-Matched',\n", + "    hover_data=['participant_id', 'Gender'],\n", + "    color='Handedness',\n", + "    symbol='Handedness'\n", + ")\n", + "fig.update_xaxes(range=[int(participants['Age'].min()-1), int(participants[participants['Child_Adult']=='child']['Age'].max()+1)])\n", + "fig.update_traces(marker_size=10)\n", + "fig.show()" + ] + }, + { + "cell_type": "markdown", + "id": "7828124c-a55c-4b6f-9a50-b52cd6b2c5eb", + "metadata": {}, + "source": [ + "## Visualizing brain images!\n", + "\n", + "The `nilearn` library provides several functions for visualizing brain images (see the [list of all functions](https://nilearn.github.io/stable/plotting/index.html)). 
Dans cette section, nous allons voir quelques-unes de ces fonctions.\n", + "\n", + "---\n", + "\n", + "Basé sur le [tutoriel MAIN](https://main-educational.github.io/intro_nilearn/machine-learning-with-nilearn.html)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "163b13a4-3e50-4557-91d6-696d58977595", + "metadata": {}, + "outputs": [], + "source": [ + "# Allons tout d'abord chercher nos données\n", + "data = development_dataset.func\n", + "confounds = development_dataset.confounds\n", + "pheno = pd.DataFrame(development_dataset.phenotypic)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8de6b379-c131-4eec-8af4-2afbcf9487c0", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "54ece402-fcd2-4131-8968-abd847afc8f0", + "metadata": {}, + "outputs": [], + "source": [ + "# Essayons de visualiser notre premier fichier\n", + "plotting.view_img(data[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d722995-2f57-4725-9230-5aeeb4068fed", + "metadata": {}, + "outputs": [], + "source": [ + "img = nib.load(data[0])\n", + "img.shape" + ] + }, + { + "cell_type": "markdown", + "id": "ec879242-9800-4e81-afbe-142ce494cf97", + "metadata": {}, + "source": [ + "
    \n", + "Oups !\n", + "
    Si nous essayons de visualiser nos images BOLD, nous obtenons une erreur ! C'est tout à fait normal, puisque notre fichier contient des données 4D, c'est-à-dire que pour chaque voxel (3D), nous avons une valeur pour plusieurs points dans le temps (temps de répétition; +1D). Si nous voulons absolument visualiser ces fichiers, deux options s'offrent à nous:\n", + "
      \n", + " 1. Soit nous nous intéressons à l'activité sur l'ensemble du cerveau à un temps donné\n", + "
    \n", + "
      \n", + " 2. Soit nous nous intéressons au décours temporel pour un voxel/parcelle donné\n", + "
    \n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fc331ce-94b0-4de9-a2c7-e7f0cccbe8ef", + "metadata": {}, + "outputs": [], + "source": [ + "# Allons chercher notre premier volume (i.e., notre première image de cerveau)\n", + "premier_volume = image.index_img(data[0], 0)\n", + "\n", + "plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d59f34df-4e5a-4bab-a7f3-76de70365e74", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "help(plotting.plot_stat_map)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "74a4affe-7a9b-4d73-b5f8-7e9136c3e2ce", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_stat_map(\n", + " premier_volume, \n", + " draw_cross=False,\n", + " #cut_coords=(0, 4, 22),\n", + " display_mode='tiled'\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db301379-df28-49e0-96f8-299391dd1b24", + "metadata": {}, + "outputs": [], + "source": [ + "multiscale = datasets.fetch_atlas_basc_multiscale_2015(resolution=64, data_dir='../data')\n", + "atlas_filename = multiscale.maps\n", + "\n", + "# initialize masker (change verbosity)\n", + "masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True,\n", + " memory='nilearn_cache', resampling_target=\"data\",\n", + " detrend=True, verbose=0)\n", + "# Extraction des séries temporelles pour notre premier sujet\n", + "time_series = masker.fit_transform(data[0], confounds=confounds[0])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c7ef848-e75a-4769-a960-83a973d14a69", + "metadata": {}, + "outputs": [], + "source": [ + "time_series.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d348d1fc-99ff-427b-8251-06eb459780fb", + "metadata": {}, + "outputs": [], + "source": [ + "parcelle = 0\n", + "plt.figure(figsize=(12,4))\n", + 
"plt.plot(time_series.T[parcelle])\n", + "plt.title(f'Décours temporel de la parcelle {parcelle}')\n", + "plt.xlabel('Volumes')\n", + "plt.ylabel('Amplitude')" + ] + }, + { + "cell_type": "markdown", + "id": "7da0ce19-dbaa-4976-ad03-e5b1aa88c76d", + "metadata": {}, + "source": [ + "
    \n", + "Comparer visuellement les décours temporels\n", + "
    À partir du code fourni dans la cellule d'avant, modifier la cellule ci-dessous pour être en mesure de comparer le décours temporel de la parcelle 0 et de la parcelle 1.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d12883c4-398b-4c4d-afdd-5be5a78c7532", + "metadata": {}, + "outputs": [], + "source": [ + "help(plt.legend)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7a91dc2-7ad5-4feb-9383-5bd1cd1d529b", + "metadata": {}, + "outputs": [], + "source": [ + "plt.figure(figsize=(12,4))\n", + "plt.plot(time_series.T[0], label='Parcelle 0', ls='--', lw=2, alpha=0.4)\n", + "plt.plot(time_series.T[1], label='Parcelle 1', lw=2, zorder=1)\n", + "plt.title(f'Décours temporel des parcelles 0 et 1')\n", + "plt.xlabel('Volumes')\n", + "plt.ylabel('Amplitude')\n", + "plt.legend(loc='lower right')" + ] + }, + { + "cell_type": "markdown", + "id": "fe979fb9-12db-417e-9c22-89b8107fd2c2", + "metadata": {}, + "source": [ + "## Interpréter les modèles d'apprentissage machine via la visualisation" + ] + }, + { + "cell_type": "markdown", + "id": "19d43830-3293-4014-9cf5-1102f2762320", + "metadata": {}, + "source": [ + "### Charger les données" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ab274378-13fe-49cf-b98b-8f6f25af5837", + "metadata": {}, + "outputs": [], + "source": [ + "from nilearn.connectome import ConnectivityMeasure\n", + "\n", + "correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True,\n", + " discard_diagonal=True)\n", + "\n", + "\n", + "all_features = [] # here is where we will put the data (a container)\n", + "\n", + "for i,sub in enumerate(data[:66]):\n", + " # extract the timeseries from the ROIs in the atlas\n", + " time_series = masker.fit_transform(sub, confounds=confounds[i])\n", + " # create a region x region correlation matrix\n", + " correlation_matrix = correlation_measure.fit_transform([time_series])[0]\n", + " # add to our container\n", + " all_features.append(correlation_matrix)\n", + " # keep track of status\n", + " print('finished %s of %s'%(i+1,len(data[:66])))\n", + "\n", + 
"np.savez_compressed('data/MAIN_BASC064_subsamp_features', a=all_features)" + ] + }, + { + "cell_type": "markdown", + "id": "25228970-3a18-44cd-a019-27cc2263627d", + "metadata": {}, + "source": [ + "
    \n", + "Si vos données ne se trouvent pas dans le dossier data/, modifier le chemin dans la cellule ci-dessous.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "88c238d8-cb8c-4f8c-8a74-0a0bfb8b8dd0", + "metadata": {}, + "outputs": [], + "source": [ + "y_ageclass = pheno.head(66)['Child_Adult']\n", + "\n", + "feat_file = 'data/MAIN_BASC064_subsamp_features.npz'\n", + "X_features = np.load(feat_file)['a']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ee2373d-20a1-4a5f-b4e2-b283c344fafd", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"Shape X: {X_features.shape}\")\n", + "print(f\"Shape y: {y_ageclass.shape}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "80129398-0ddd-4153-afb9-9c66b142dff9", + "metadata": {}, + "outputs": [], + "source": [ + "sns.countplot(x = y_ageclass)" + ] + }, + { + "cell_type": "markdown", + "id": "9a9702ec-bdfb-4e2a-93db-7a2fca786ac3", + "metadata": {}, + "source": [ + "### Entraîner le modèle" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b0735997-5c51-4af3-a8eb-5da9a7560227", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import train_test_split\n", + "\n", + "# Split the sample to training/test and\n", + "# stratify by age class, and also shuffle the data.\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(X_features, # x\n", + " y_ageclass, # y\n", + " test_size = 0.2, # 80%/20% split \n", + " shuffle = True, # shuffle dataset\n", + " # before splitting\n", + " stratify = y_ageclass, # keep\n", + " # distribution\n", + " # of ageclass\n", + " # consistent\n", + " # betw. 
train\n", + " # & test sets.\n", + " random_state = 123 # same shuffle each\n", + " # time\n", + " )\n", + "\n", + "from sklearn.svm import SVC\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.metrics import classification_report, confusion_matrix\n", + "\n", + "scaler = StandardScaler().fit(X_train)\n", + "X_train_scl = scaler.transform(X_train)\n", + "X_test_scl = scaler.transform(X_test)\n", + "\n", + "l_svc = SVC(kernel='linear', class_weight='balanced')\n", + "\n", + "l_svc.fit(X_train_scl, y_train) # fit to training data\n", + "y_pred = l_svc.predict(X_test_scl) # classify age class using testing data\n", + "\n", + "acc = l_svc.score(X_test_scl, y_test) # get accuracy\n", + "cr = classification_report(y_pred=y_pred, y_true=y_test) # get prec., recall & f1\n", + "cm = confusion_matrix(y_pred=y_pred, y_true=y_test) # get confusion matrix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2f4345ab-fee5-4db0-86b3-bcd6ebbdf050", + "metadata": {}, + "outputs": [], + "source": [ + "print(cr)" + ] + }, + { + "cell_type": "markdown", + "id": "ea103a73-8c20-4287-99e5-df10d6cc5147", + "metadata": {}, + "source": [ + "### Visualisation des coefficients\n", + "\n", + "Nous avons entraîné notre modèle et obtenu un score de prédiction assez élevé, ce qui nous indique qu'il y a possiblement quelque chose dans nos données qui est systématiquement lié à l'âge. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e847f9d9-6c23-4ad6-835e-e9af08324a4f", + "metadata": {}, + "outputs": [], + "source": [ + "print(l_svc.coef_.shape)\n", + "print(l_svc.coef_)" + ] + }, + { + "cell_type": "markdown", + "id": "5fb261c4-e6cd-43cf-b119-1888d6a7beeb", + "metadata": {}, + "source": [ + "#### Matrice de corrélation\n", + "Les attributs de notre modèle correspond à la corrélation entre chaque paire de régions que nous avons extraites. 
Les coefficients de notre modèle représentent donc le poids de chacune de ces paires de régions dans la prédiction du groupe d'âge. Nous pouvons donc utiliser une matrice de corrélation pour visualiser ces poids." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd4ab8b7-2ddd-4a63-9c5e-127ce5b4f4bb", + "metadata": {}, + "outputs": [], + "source": [ + "feat_exp_matrix = correlation_measure.inverse_transform(l_svc.coef_)[0]\n", + "\n", + "plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), \n", + " labels=range(feat_exp_matrix.shape[0]),\n", + " reorder='average',\n", + " tri='lower', vmax=0.01, vmin=-0.01)" + ] + }, + { + "cell_type": "markdown", + "id": "0fbc6f06-6625-4806-81fa-6d0dd48be515", + "metadata": {}, + "source": [ + "#### Connectome\n", + "\n", + "Nous pouvons aussi visualiser directement le poids de nos attributs sur un cerveau !" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02ca50c0-40d4-4c70-b262-287268035303", + "metadata": {}, + "outputs": [], + "source": [ + "# Coordonnées de nos régions\n", + "coords = plotting.find_parcellation_cut_coords(atlas_filename)\n", + "print(coords.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec4555da-ab29-4e8e-bcb8-5dd8a617cf17", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e02adcdf-3e75-42fb-8b03-a23dd990d097", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.006)" + ] + }, + { + "cell_type": "markdown", + "id": "6491c5fb-f6d7-46b0-9b13-45a2ebfc4309", + "metadata": {}, + "source": [ + "
    \n", + "Ajoutons du mouvement !\n", + "
    Nilearn possède une fonction nous permettant de visualiser notre connectome de manière interactive ! Cela nous permet de plus facilement examiner le poids de nos attributs.\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65e8661f-619e-48ba-ada4-4892a3405d40", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_connectome(feat_exp_matrix, coords, edge_threshold='90%')" + ] + }, + { + "cell_type": "markdown", + "id": "c9ab0638-3fdd-4e05-82c3-ece605ea2827", + "metadata": {}, + "source": [ + "
    \n", + "Attributs tous gris...\n", + "
    Vous avez peut-être remarqué que les poids de nos attributs sont tracés en gris. Pourquoi pensez-vous que c'est le cas ?\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bf0eb66-a0e0-4887-883a-00be14565747", + "metadata": {}, + "outputs": [], + "source": [ + "feat_exp_matrix_rm_diag = feat_exp_matrix\n", + "feat_exp_matrix_rm_diag[feat_exp_matrix==1] = 0\n", + "plotting.view_connectome(feat_exp_matrix_rm_diag, coords, edge_threshold='90%')" + ] + }, + { + "cell_type": "markdown", + "id": "d5c60913-4db1-49f6-8abf-780dc8c5db24", + "metadata": {}, + "source": [ + "Nous avons un modèle permettant de prédire le groupe d'âge avec une très bonne performance prédictive. Nous pouvons voir que les attributs nous permettant de faire cette prédiction sont distribués dans le cerveau. Est-ce que nous pouvons maintenant publier nos résultats ?...\n", + "
    \n", + "
    Non ! Il nous faut explorer davantage pour voir si notre modèle est biologiquement plausible... Pour cela, nous allons visualiser nos images cérébrales pour chacun de nos groupes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6e25d6ed-1e05-4293-bd67-260475a570b2", + "metadata": {}, + "outputs": [], + "source": [ + "children, adults = data[33:66], data[0:33]\n", + "avg_children, avg_adults = [], []\n", + "\n", + "# Pour chaque participant.e, nous allons moyenner l'activité cérébrale à travers tous nos points de mesure pour obtenir une image 3D\n", + "for child, adult in zip(children, adults):\n", + " avg_adults.append(image.mean_img(adult))\n", + " avg_children.append(image.mean_img(child))\n", + "\n", + "# Nous allons moyenner nos images 3D individuelles pour chaque participant.e et ce pour chacun de nos groupes séparément\n", + "avg_children = image.mean_img(avg_children)\n", + "avg_adults = image.mean_img(avg_adults)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "288b7f4d-ee28-4fb6-9172-7327b1a8269e", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_img(avg_children, black_bg=False, cut_coords=(0,-16,16))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa9254cc-ad38-4c84-8ab1-6cce2b446810", + "metadata": {}, + "outputs": [], + "source": [ + "plotting.view_img(avg_adults, black_bg=False)" + ] + }, + { + "cell_type": "markdown", + "id": "0eebd450-fa9a-43da-bcb6-86602dbb8605", + "metadata": {}, + "source": [ + "
    \n", + "Que remarquez-vous ?\n", + "
    Regardez les images moyennées pour chacun des groupes. Pouvez-vous observer certaines différences ?\n", + "
    " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "427649c0-55bf-43d1-9299-c73d70e782a7", + "metadata": {}, + "outputs": [], + "source": [ + "# Allons chercher notre premier volume pour notre premier participant\n", + "premier_volume = image.index_img(data[0], 0)\n", + "\n", + "plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "24ef603c-ac11-4731-8adc-cb98aa92fb15", + "metadata": {}, + "outputs": [], + "source": [ + "# Allons chercher notre premier volume pour notre 39e participant\n", + "premier_volume = image.index_img(data[40], 0)\n", + "\n", + "plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False)" + ] + }, + { + "cell_type": "markdown", + "id": "c71007ad-6b51-4798-beab-abdab0d6f442", + "metadata": {}, + "source": [ + "## Ressources supplémentaires\n", + "\n", + "- [Tutoriel de `seaborn` sur la visualisation des distributions de données](https://seaborn.pydata.org/tutorial/distributions.html)\n", + "- [Python Graph Gallery](https://python-graph-gallery.com/)\n", + "- [Guide compréhensif de la visualisation](https://www.atlassian.com/data/charts)\n", + "- [Guide des pratiques de visualisation à éviter](https://www.data-to-viz.com/caveats.html)\n", + "- [Fonctions de visualisation dans nilearn - Données IRM(f)](https://nilearn.github.io/dev/plotting/index.html)\n", + "- [Tutoriels de visualisation dans MNE python - Données EEG/MEG](https://mne.tools/stable/auto_tutorials/visualization/index.html)\n", + "- [Tutoriels de visualisation dans DIPY - Données IRM de diffusion](https://docs.dipy.org/stable/examples_built/index#visualization)" + ] + } + ], + "metadata": { + "@deathbeds/jupyterlab-fonts": { + "fontLicenses": {}, + "fonts": {}, + "styles": { + ":root": {} + } + }, + "jupytext": { + "formats": "ipynb,md" + }, + "kernelspec": { + "display_name": "Python cours visu", + "language": "python", + 
"name": "visu-env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/visualization_fr.md b/notebooks/visualization_fr.md new file mode 100644 index 0000000..ac699d7 --- /dev/null +++ b/notebooks/visualization_fr.md @@ -0,0 +1,1529 @@ +--- +jupyter: + jupytext: + formats: ipynb,md + text_representation: + extension: .md + format_name: markdown + format_version: '1.3' + jupytext_version: 1.19.1 + kernelspec: + display_name: Python cours visu + language: python + name: visu-env +--- + + +# Visualisation et interprétation de modèles + + +## Objectifs d'apprentissage +👀 Comprendre l'utilité des visualisations +
    📈 Comprendre quel type de graphique utiliser selon le type de données +
    🎨 Choisir adéquatement sa palette de couleurs +
    🔃 Apprendre comment modifier différents éléments de nos figures +
    🧠 Utiliser `nilearn` pour la visualisation de données de neuroimagerie +
    🤖 Comprendre comment la visualisation peut nous aider à interpréter nos modèles d'apprentissage machine + +## Organisation de la séance + +La séance comportera des sections théoriques et des sections pratiques: + +**Théorique** + +Nous allons discuter des principes théoriques de base en visualisation. Les principes suivants seront couverts: +- Types de graphiques (données tabulaires): univarié vs bivarié; catégoriel vs continu +- Palette de couleurs: perceptuellement uniformes vs non-uniformes; discrètes vs continues +- Types de graphiques (données de neuroimagerie): cartes statistiques, connectome, etc. + +Cette partie inclut plusieurs éléments interactifs pour amener les étudiant.e.s à former leur propre compréhension de la matière. Le code est déjà fourni, mais nous ne nous y attarderons pas. + +**Pratique** + +Nous allons mettre en pratique, au fur et à mesure, les principes théoriques. Nous allons utiliser les librairies suivantes: +- matplotlib +- seaborn +- ptitprince +- plotly +- nilearn + +Les étudiant.e.s devront modifier le code donné pour comprendre le rôle de différents paramètres de visualisation et répondre à certaines questions à partir de ce que nous allons voir en classe ou à partir de références fournies (p. ex. documentation matplotlib). + + +
    +Les encadrés jaunes contiennent des questions/exercices auxquels les étudiant.e.s doivent répondre. +
    + + +
    +Les encadrés bleus contiennent des informations complémentaires sur les jeux de données et les fonctions utilisés. +
    + +```python editable=true slideshow={"slide_type": ""} +import numpy as np +import pandas as pd +import nibabel as nib +import seaborn as sns +import ptitprince as pt +import ipywidgets as widgets +import matplotlib.pyplot as plt +import plotly.express as px + +from cmcrameri import cm +from nilearn import datasets, plotting, image +from nilearn.maskers import NiftiLabelsMasker +from matplotlib import colormaps, colors +from colorspacious import cspace_converter, cspace_convert +``` + +## Télécharger les données + +### Jeu de données Brain development fMRI + +Nous allons utiliser les données provenant du jeu de données *brain development fMRI* qui inclut les données phénotypiques, ainsi que les données IRMf collectées auprès d'enfants et d'adultes lors d'une tâche de visionnement de film ([Richardson et al., 2018](https://doi.org/10.1038/s41467-018-03399-2)). + +**Note:** si vous avez déjà téléchargé vos données et qu'elles ne se trouvent pas dans le dossier data/, modifier le chemin dans la cellule ci-dessous. + +```python +development_dataset = datasets.fetch_development_fmri(data_dir='data/') +``` + +### Datasaurus +Pour la première partie de ce tutoriel, nous allons très brièvement utiliser le jeu de données Datasaurus. Si vous voulez être en mesure d'exécuter les cellules utilisant ce jeu de données, vous devrez le télécharger à partir de [kaggle](https://www.kaggle.com/datasets/tombutton/datasaurusdozen). + +**Note:** modifier le chemin dans la cellule ci-dessous au besoin. 
+ +```python +# Credit: Alberto Cairo (original datasaurus), and Justin Matejka and George Fitzmaurice (datasaurus dozen) +data = pd.read_csv("data/datasaurus.csv") +``` + + +## Une image vaut mille mots +- Les visualisations nous aident à mieux comprendre la complexité de nos données +- Elles nous permettent d'appuyer nos résultats +- Et de partager un message + + +```python +# rcParams permet de modifier les paramètres globaux de nos figures +# Ici, on s'assure simplement que les étiquettes des axes ne seront pas montrées en notation scientifique +# (`axes.formatter.limits` attend des exposants de 10; des bornes aussi larges désactivent donc la notation scientifique) +plt.rcParams['axes.formatter.useoffset'] = False +plt.rcParams['axes.formatter.limits'] = (-8000000, 8000000) +``` + +```python +def plot_category(category): + plt.scatter(data[data['dataset'] == category]['x'], data[data['dataset'] == category]['y']) + plt.xlabel('x') + plt.ylabel('y') + max_x = data[data['dataset'] == category]['x'].max() + max_y = data[data['dataset'] == category]['y'].max() + plt.text(max_x-0.22*max_x, max_y+0.10*max_y, f"Mean: ({data[data['dataset'] == category]['x'].mean().round(2)}, {data[data['dataset'] == category]['y'].mean().round(2)})") + plt.text(max_x-0.22*max_x, max_y+0.06*max_y, f"Std: ({data[data['dataset'] == category]['x'].std().round(2)}, {data[data['dataset'] == category]['y'].std().round(2)})") + + plt.show() + +widgets.interact( + plot_category, + category=widgets.Dropdown( + options=sorted(data['dataset'].unique()), + description='Dataset:' + ) +); +``` + + +### "Ne faites jamais confiance aux statistiques descriptives, visualisez toujours vos données !" +

    Alberto Cairo, créateur du jeu de données Datasaurus

    + +Python nous offre une panoplie de librairies pour visualiser nos données: +- Haut niveau vs bas niveau +- Images statiques vs graphiques interactifs +- Plusieurs librairies générales (p.ex., matplotlib, seaborn, bokeh et plotly) +- Et spécifiques à un domaine (p. ex., nilearn) + +Mais avec un grand pouvoir viennent de grandes responsabilités... + + +```python +data = pd.DataFrame( + { + 'Categories': ['A', 'B'], + 'Values': [6000000, 7066000] + }, + +) +``` + +```python +plt.bar(data['Categories'], data['Values']) +plt.ylim([5500000, 8000000]) +plt.yticks([]) +plt.title("Que pouvez-vous dire sur 'A' et 'B' si vous vous fiez uniquement à ce que vous voyez dans la figure ?") +plt.show() +``` + +```python +plt.bar(data['Categories'], data['Values']) +plt.ylim([5500000, 8000000]) + +plt.title("Que remarquez-vous ?") +plt.show() +``` + +```python +plt.bar(data['Categories'], data['Values']) + +# Ajoute les valeurs au-dessus des barres +for i, v in enumerate(data['Values']): + plt.text(i, v + 1, str(v), ha='center', va='bottom') + +plt.title("Mieux ?") +plt.show() +``` + + +## À chaque type de données son type de graphique ! + +Il existe plusieurs types de graphiques. Le type de graphique choisi dépend des variables que l'on veut visualiser: +- Visualisation univariée: **variable continue** vs **variable catégorielle** +- Visualisation bivariée: **catégorielle x catégorielle** vs **catégorielle x continue** vs **continue x continue** + + + +### Visualisations univariées + +Pour une **variable continue**, nous pouvons visualiser sa distribution: +- Histogramme +- Estimation de la densité par noyau (*kde plot*) +- Nuage de points univarié (*strip plot*) + +Pour une **variable catégorielle**, nous pouvons visualiser la quantité d'observations pour chaque catégorie: +- Diagramme à barres (*bar plot*) + + + +
    +Si vos données ne se trouvent pas dans le dossier data/, modifier le chemin dans la cellule ci-dessous. +
    + + +```python +# Allons chercher les données +participants = pd.read_csv('data/development_fmri/development_fmri/participants.tsv', sep='\t') + +# Regardons ce que nous avons dans nos données +participants.head() +``` + +```python +# Regardons le type de nos données +participants.dtypes +``` + +
    +Le jeu de données contient des variables continues (p.ex. `Age`, `ToM Booklet-Matched`, `FB_Composite`) et des variables catégorielles (p.ex. `AgeGroup`, `Child_Adult`, `Gender`). `ToM Booklet-Matched` représente un score à une tâche visant à évaluer la théorie de l'esprit, c'est-à-dire la capacité à attribuer des états mentaux (croyances, désirs, émotions, intentions) à soi-même ou aux autres. +
    + +```python +# Regardons d'abord les statistiques descriptives associées à la variable continue `Age` +print(participants['Age'].describe()) +``` + + +#### Histogramme + +Un histogramme nous permet de visualiser la distribution d'une variable donnée de manière discrète en groupant ses valeurs dans des intervalles consécutifs (bins). On obtient donc la fréquence (c.-à-d. le nombre d'observations) dans chacun de ces intervalles. + +Il est possible de générer un histogramme dans matplotlib grâce à [la fonction `hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) + + +```python +# Regardons maintenant sa distribution +plt.hist(participants['Age']) +# Ajoutons un titre +plt.title("Distribution de l'age") +# Ajoutons un titre à l'axe des x et des y +plt.xlabel('Age') +plt.ylabel('Compte') +``` + +
    +La science derrière le nombre de bins à visualiser - partie 1 +
    La fonction hist dans matplotlib groupe les données en 10 bins par défaut. Par contre, un nombre de bins inadéquat (soit trop petit, soit trop grand) ne montre pas la distribution de manière représentative. +
    + + +
    +Le nombre de bins en pratique ! +
    Modifier la valeur du paramètre `bins` dans la cellule ci-dessous pour déterminer le nombre de bins qui représenterait le mieux notre variable `Age` +
    + +```python +# Modifier la valeur du paramètre `bins` +plt.hist(participants['Age'], bins=10) +# Ajoutons un titre +plt.title("Distribution de l'age") +# Ajoutons un titre à l'axe des x et des y +plt.xlabel('Age') +plt.ylabel('Compte') +``` + +
    +La science derrière le nombre de bins à visualiser - partie 2 +
    Il existe différentes règles permettant de calculer le nombre de bins optimal, ainsi que l'intervalle à utiliser pour chacune de ces bins. Il est possible d'utiliser ces règles à même la fonction hist de matplotlib ! Il suffit de préciser une des règles valides au paramètre `bins`. +
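À titre d'illustration, l'esquisse ci-dessous compare le nombre de bins obtenu avec quelques règles, en passant par `numpy.histogram_bin_edges`, qui accepte les mêmes chaînes de caractères que le paramètre `bins` de `plt.hist`. Les âges utilisés sont simulés (hypothèse) et ne proviennent pas du jeu de données; les noms `ages`, `n_bins` et `bins_per_rule` sont introduits uniquement pour l'exemple.

```python
import numpy as np

# Données simulées (hypothèse): 155 âges, surtout des enfants et quelques adultes
rng = np.random.default_rng(0)
ages = np.concatenate([rng.uniform(3, 12, 120), rng.uniform(18, 40, 35)])


def n_bins(values, rule):
    """Retourne le nombre de bins calculé par NumPy pour une règle donnée."""
    edges = np.histogram_bin_edges(values, bins=rule)
    return len(edges) - 1


# Quelques règles valides pour le paramètre `bins`
bins_per_rule = {rule: n_bins(ages, rule) for rule in ("sturges", "scott", "fd", "auto")}

for rule, n in bins_per_rule.items():
    print(f"{rule:>8}: {n} bins")
```

D'après la documentation de NumPy, la règle `'auto'` retient la plus petite largeur de bin entre `'fd'` et `'sturges'`, donc le plus grand nombre de bins des deux.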
    + +```python +help(plt.hist) +``` + +```python +# Regardons maintenant sa distribution +plt.hist(participants['Age'], bins='fd') +# Ajoutons un titre +plt.title("Distribution de l'age") +# Ajoutons un titre à l'axe des x et des y +plt.xlabel('Age') +plt.ylabel('Compte') +``` + +#### Estimation de la densité par noyau (*kde plot*) + +L'**estimation de la densité par noyau** (ou *kernel density estimation*) nous permet de visualiser la distribution de notre variable de manière continue (plutôt que discrète) en estimant une fonction de densité. Par contre, similairement à la taille des bins pour l'histogramme, l'estimation de la densité par noyau est sensible à la largeur du noyau ! + +```python +# Pour visualiser l'estimation de la densité par noyau, nous allons utiliser la fonction +# `kdeplot` dans la librairie `seaborn` + +sns.kdeplot(participants['Age']) +``` + +
    +L'axe des y sur un kde plot - partie 1 +
    Vous avez peut-être remarqué que les valeurs sur l'axe des y vont maintenant de 0 à 0.08 (vs 0 à 50 pour notre histogramme). Dans un kde plot, l'axe des y représente la densité, c'est-à-dire la probabilité par unité de la variable sur l'axe des x. En d'autres mots, à quel point les données sont denses pour une valeur de x donnée. Donc, les pics dans ce type de figure représentent une plus grande densité de points pour une étendue de valeurs donnée (c.-à-d. une plus grande probabilité qu'une valeur soit observée), alors que les creux représentent une plus faible densité de points. +
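Pour rendre cette idée de densité plus concrète, voici une esquisse minimale d'une estimation par noyau gaussien écrite directement avec NumPy. Ce n'est pas le code de `seaborn`: la fonction `gaussian_kde_1d`, les données simulées et le choix de la règle de Scott pour la largeur du noyau sont des hypothèses faites pour l'exemple. On y vérifie que l'aire sous la courbe vaut environ 1, ce qui explique pourquoi les valeurs sur l'axe des y sont petites.

```python
import numpy as np

# Données simulées (hypothèse): 200 âges autour de 10 ans
rng = np.random.default_rng(1)
ages = rng.normal(loc=10, scale=3, size=200)


def gaussian_kde_1d(samples, grid, bandwidth):
    """Moyenne de noyaux gaussiens centrés sur chaque observation."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (samples.size * bandwidth)


# Largeur du noyau selon la règle de Scott: h = sigma * n^(-1/5)
bandwidth = ages.std(ddof=1) * ages.size ** (-1 / 5)

# Grille d'évaluation qui dépasse un peu les données pour couvrir les queues
grid = np.linspace(ages.min() - 5 * bandwidth, ages.max() + 5 * bandwidth, 2000)
density = gaussian_kde_1d(ages, grid, bandwidth)

# Aire sous la courbe (méthode des trapèzes)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(f"Aire sous la courbe: {area:.4f}")
print(f"Densité maximale: {density.max():.4f}")
```

Une largeur de noyau plus petite donne une courbe plus irrégulière, une largeur plus grande une courbe plus lisse; dans les deux cas, l'aire sous la courbe reste proche de 1.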
    + + +
    +À noter ! +
    Puisque ce type de graphique fournit une estimation continue, nous pourrions penser que certaines données existent alors qu'en réalité ce n'est pas le cas. +
    + +```python +# On peut également superposer l'histogramme et le kde plot avec seaborn +sns.histplot( + participants['Age'], + kde=True, + bins='fd', + edgecolor=None +) +``` + +
    +L'axe des y sur un kde plot - partie 2 +
    Lorsque l'on superpose un histogramme et un kde avec `seaborn`, vous remarquerez que l'axe des y montre la fréquence. Cependant, la courbe kde reste tout de même une courbe de densité. Cela se produit car seaborn met la courbe de densité à l'échelle de l'histogramme en multipliant la courbe par le nombre d'observations et la taille des bins. +
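Le facteur d'échelle mentionné ci-dessus peut se vérifier sur un histogramme: une densité multipliée par le nombre d'observations et par la largeur des bins redonne des fréquences. L'esquisse suivante utilise uniquement NumPy et des données simulées (hypothèse); elle illustre le principe et ne reproduit pas le code interne de `seaborn`.

```python
import numpy as np

# Données simulées (hypothèse): 200 âges autour de 10 ans
rng = np.random.default_rng(2)
ages = rng.normal(loc=10, scale=3, size=200)

# Histogramme en fréquences (nombre d'observations par bin)
counts, edges = np.histogram(ages, bins="fd")
bin_width = edges[1] - edges[0]

# Même histogramme, mais en densité (l'aire totale vaut 1)
density, _ = np.histogram(ages, bins=edges, density=True)

# Densité x nombre d'observations x largeur des bins = fréquences
rescaled = density * ages.size * bin_width
print(np.allclose(rescaled, counts))
```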
    + + +#### Nuage de points univarié + +L'**histogramme** et l'**estimation de la densité par noyau** nous permettent de visualiser la distribution d'une variable de manière discrète ou continue, respectivement. Cependant, ces types de visualisations ne nous permettent pas de visualiser les données brutes. + +Le **nuage de points univarié** nous permet de visualiser chaque point de données individuellement. Cela peut nous permettre d'identifier plus facilement la présence de valeurs aberrantes dans nos données. Cependant, le nuage de points n'est pas adapté si nous avons trop de points de données. + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +
    +Réplication du nuage de points +
    Essayez de répliquer le nuage de points de la variable `Age` dans deux cellules différentes ci-dessous. Que remarquez-vous ? +
    + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +```python +sns.stripplot( + x=participants['Age'] +) +``` + +
    +Le paramètre `jitter` +
    Vous avez peut-être remarqué que les deux nuages de points que vous avez générés à partir de la même variable ne sont pas tout à fait identiques. Cela se produit car `seaborn` utilise `numpy.random` pour le calcul du jitter. Pour rendre le calcul du jitter reproductible, il est possible de fixer une seed a priori. +
    + +```python +np.random.seed(12) +sns.stripplot( + x=participants['Age'] +) +``` + +```python +np.random.seed(12) +sns.stripplot( + x=participants['Age'] +) +``` + +#### Diagramme à barres + +Le **diagramme à barres** (*bar plot*) permet de visualiser/comparer des variables catégorielles en montrant les fréquences des différentes valeurs ou simplement les différentes valeurs. Ce type de graphique est utile si l'on veut, par exemple, comparer le nombre de personnes par groupe. + +```python +participants['AgeGroup'].value_counts() +``` + +```python +participants['AgeGroup'].value_counts().plot(kind='bar') +plt.ylabel('Counts') +``` + +```python +order = sorted(participants.AgeGroup.unique()) + +sns.countplot( + data=participants, + x='AgeGroup', + order=order, + hue='Gender', +) +``` + +### Résumé + +| Type de graphique | Type de données | Résumé | +| --- | --- | --- | +| Histogramme | Continue | Pour visualiser la distribution de nos données de manière discrète | +| Estimation de la densité par noyau | Continue | Pour visualiser la distribution de nos données de manière continue | +| Nuage de points univarié | Continue | Pour visualiser chaque point de données individuellement | +| Diagramme à barres | Catégorielle | Pour comparer les fréquences entre différents groupes/catégories | + + + +### Visualisations bivariées + +Pour une **variable continue** x **variable continue**, nous pouvons utiliser: +- Nuage de points +- Estimation de la densité par noyau bivariée +- *Hex plot* +- *Joint plot* +- *Heat map* + +Pour une **variable catégorielle** x **variable continue**, nous pouvons utiliser: +- Boîte à moustaches (*box plot*) +- Diagramme en violon (*violin plot*) +- Nuage de points (*scatter plot*) +- Tracé de points (*point plot*) +- *Rain cloud plot* +- Diagramme à barres (*bar plot*)... (?) 
+ + +#### Scatter plot - Continuous variable x continuous variable + +```python +plt.scatter(participants['Age'], participants['ToM Booklet-Matched']) +plt.xlabel('Age') +plt.ylabel('ToM Booklet-Matched') +``` + +
    +What do you notice? +
    Take a moment to look at the figure we just generated. +
    + +```python +participants.groupby(['Child_Adult'])['ToM Booklet-Matched'].mean() +``` + +
    +Missing values +
    The `scatter` function in `matplotlib` and the `regplot` function in `seaborn` (see the following cells) automatically leave out missing values when plotting. +
    + + +
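To know how many points actually end up in such a figure, it can help to count the complete pairs yourself. This is a small sketch on made-up values (the arrays `age` and `score` are illustrative, not the tutorial's data); it only mimics the effect of dropping missing values and does not reproduce the libraries' internals.

```python
import numpy as np

# Illustrative values only; NaN marks a missing measurement
age = np.array([4.0, 5.5, np.nan, 8.0, 30.0])
score = np.array([0.3, np.nan, 0.6, 0.8, 1.0])

# A point can only be drawn if both coordinates are present
complete = ~np.isnan(age) & ~np.isnan(score)
n_plotted = int(complete.sum())

print(f"{n_plotted} of {age.size} observations have both values")
```

Reporting this count next to the figure avoids giving the impression that every participant is shown.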
    +The `regplot` function +
    The `regplot` function in `seaborn` draws the scatter plot and fits a linear regression model to the data at the same time! +
    + +```python +sns.regplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'] +) +``` + +
    +Fitting the regression model +
    A linear regression does not seem to be the best way to model the relationship between our `Age` variable and our `ToM Booklet-Matched` variable. Set the `order` parameter of `regplot` to 2 to check how well a second-order polynomial fits. +
    + +```python +sns.regplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + order=2 +) +``` + +
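The visual comparison between a straight line and a second-order polynomial can be backed by a number. The sketch below uses synthetic data (a saturating curve plus noise, chosen only for illustration) and `numpy.polyfit`; it is not what `regplot` runs internally, and `residual_sum_of_squares` is a hypothetical helper.

```python
import numpy as np

# Synthetic data: scores rise quickly with x, then level off
rng = np.random.default_rng(0)
x = np.linspace(3, 12, 50)
y = -0.01 * (x - 12) ** 2 + 1.0 + rng.normal(0, 0.02, size=x.size)

def residual_sum_of_squares(x, y, degree):
    """Fit a polynomial of the given degree and return the sum of squared residuals."""
    coefs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefs, x)
    return float(np.sum((y - fitted) ** 2))

rss_linear = residual_sum_of_squares(x, y, 1)
rss_quadratic = residual_sum_of_squares(x, y, 2)

print(f"RSS order 1: {rss_linear:.4f}")
print(f"RSS order 2: {rss_quadratic:.4f}")
```

A higher order never increases the residuals on the data used for fitting, so a lower value alone does not prove that the more complex model is the better description.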
    +What if we added one more variable? +
    We can visualize multivariate interactions with the `hue` parameter of the `lmplot` function in `seaborn`. In the cell below, we explore the relationship between age (x) and the ToM Booklet-Matched score (y) as a function of gender (hue). +
    + +```python +sns.lmplot( + x='Age', + y='ToM Booklet-Matched', + data=participants, + hue='Gender' +) +``` + +#### Bivariate kernel density estimation and hex plot - Continuous variable x continuous variable + +When we have many data points and want to visualize how they are distributed across two variables, many points may overlap (*overplotting*). In that case, a scatter plot can make it hard to see the distribution of the data, so other types of plots are useful: + +**Bivariate kernel density estimation** shows how two variables are jointly distributed in a two-dimensional space. Each contour marks a level of equal density, and the innermost contours enclose the regions where the data are most concentrated. + +The **hex plot** shows the density of data points in a discrete way. It is the equivalent of a histogram, but for two variables rather than one. Darker areas are areas of high density. + +⚠ Bivariate kernel density estimation is more computationally demanding than the *hex plot*. 
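The idea of a two-variable histogram can be sketched with `numpy.histogram2d`. Note the simplification: this uses square bins rather than the hexagonal bins of a hex plot, and the simulated data are only illustrative.

```python
import numpy as np

# Simulated correlated variables (illustrative, not the tutorial's data)
rng = np.random.default_rng(1)
mean, cov = [0, 1], [[1, 0.5], [0.5, 1]]
x, y = rng.multivariate_normal(mean, cov, size=1000).T

# Count how many points fall in each cell of a 10 x 10 grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)       # one count per cell
print(int(counts.max()))  # number of points in the densest cell
```

Each cell's count plays the role of the color intensity in the plot: the darkest cell is simply the one holding the most points.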
+ +```python +sns.kdeplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + fill=True +) +``` + +```python +sns.jointplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + kind='hex' +) +``` + +```python +def plot_jointplot(kind): + sns.jointplot( + x=participants['Age'], + y=participants['ToM Booklet-Matched'], + kind=kind + ) + plt.show() + +widgets.interact( + plot_jointplot, + kind=widgets.Dropdown( + options=['hex', 'reg', 'kde', 'scatter'], + description='Type:' + ) +); +``` + +```python +def plot_jointplot(kind): + mean, cov = [0, 1], [(1, .5), (.5, 1)] + x, y = np.random.multivariate_normal(mean, cov, 1000).T + + sns.jointplot( + x=x, + y=y, + kind=kind + ) + plt.show() + +widgets.interact( + plot_jointplot, + kind=widgets.Dropdown( + options=['scatter', 'hex', 'reg', 'kde'], + description='Type:' + ) +); +``` + +#### Strip plot (stripplot) - Continuous variable x categorical variable + +We already covered the strip plot (*stripplot*) in the univariate case, but we can also draw the strip plots of several categories at once. + +```python +order = sorted(participants.AgeGroup.unique())[:-1] + +sns.stripplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + +#### Box plots - Continuous variable x categorical variable + +**Box plots** show the distribution of a continuous variable for one or more groups (e.g. age distributions for different experimental groups). The components of a box plot represent different descriptive statistics: +- The line inside the box is the median. +- The edges of the box (lower quartile Q1 and upper quartile Q3) delimit the range that contains the middle 50% of the data. +- The whiskers (i.e. the lines outside the box) capture the range of the remaining data. 
+- The points are outliers: values greater than Q3 + 1.5 times the interquartile range (IQR = Q3 - Q1) or smaller than Q1 - 1.5 times the IQR. + +```python +sns.boxplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + +#### Violin plots - Continuous variable x categorical variable + +**Violin plots** show the distribution of the data using density curves (i.e. **kernel density estimation** curves). The width of each curve corresponds to the approximate frequency of points in each region (values on the y axis). + +```python +sns.violinplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order +) +``` + +#### Raincloud plots - Continuous variable x categorical variable + +Why choose between a strip plot, a box plot and a violin plot when we can have all three at once! That is what **raincloud plots** let us do. + +*Note:* This type of plot is not built into `seaborn` or `matplotlib`. We need the `ptitprince` library, which we imported at the beginning of the tutorial (`import ptitprince as pt`). + +*Resource:* For more raincloud plot examples, see the [ptitprince](https://github.com/pog87/PtitPrince/blob/master/tutorial_python/raincloud_tutorial_python.ipynb) tutorial. + +```python +pt.RainCloud( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + bw=0.6 +) +``` + +#### Point plot - Continuous variable x categorical variable + +The **point plot** lets us compare means (or another descriptive statistic) between groups while showing the uncertainty (e.g. 95% confidence interval, standard deviation, etc.). This type of visualization is handy for showing trends across categories or across measurement times (e.g. with longitudinal measures). + +```python +sns.pointplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + estimator='mean', + data = participants[participants.AgeGroup!='Adult'], + order=order, + errorbar=('ci', 95) +) +``` + +#### A bar plot to visualize a continuous variable x a categorical variable? + +You may have already seen bar plots used to represent a continuous variable. +However, this type of visualization is not recommended for this type of variable: +- It only shows certain descriptive statistics (e.g. the mean) and gives no information about the distribution of the variable +- If you really want to use a bar plot, overlay it with a strip plot! + +```python +sns.barplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order = order, palette='Blues' +) + + +sns.stripplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + jitter=True, + order = order, color = 'black') + +``` + +## What about adding some color to our figures? + + +### Perceptually uniform vs non-uniform color palettes +- Colors are perceived according to their hue (orange, red, green, etc.) and their lightness (how light or dark a hue is) +- Because of the characteristics of our photoreceptors, we do not process the light spectrum uniformly +- Most of the photoreceptors that let us see colors (cones) respond to long wavelengths (i.e. red, orange, yellow) +- As a result, we do not perceive variations in green-blue hues as well as variations in yellow-red hues + +```python +colormaps['hsv'] +``` + +```python +cm.batlow +``` + +```python +import requests +from PIL import Image +from io import BytesIO +``` + +```python +response = requests.get('https://raw.githubusercontent.com/matplotlib/matplotlib/main/doc/_static/stinkbug.png') +img = np.asarray(Image.open(BytesIO(response.content))) +``` + +```python +def plot_img(cmap): + lum_img = img[:, :, 0] + if cmap == 'noir et blanc': + plt.imshow(img) + elif cmap == 'hsv': + plt.imshow(lum_img, cmap="hsv") + elif cmap == 'batlow': + plt.imshow(lum_img, cmap=cm.batlow) + +widgets.interact( + plot_img, + cmap=widgets.Dropdown( + options=['noir et blanc', 'hsv', 'batlow'], + value='noir et blanc', + description='Palette:' + ) +); +``` + +
    +What do you notice? +
    Compare the image colored with the `hsv` and `batlow` palettes with the black-and-white image. +
    + + +
    +Don't throw the rainbow away +
    Google developed the `turbo` color palette, an improved rainbow palette that reduces the banding and false detail of classic rainbow palettes (it is smoother, though not strictly perceptually uniform). For more details, see the Google Research blog. +
    + + +### Discrete vs continuous color palettes + +Color palettes can be either continuous (like the ones we saw above) or discrete. Discrete palettes are used to visualize categorical variables whose categories have no inherent order (e.g. children with covid vs children without covid vs adults with covid vs adults without covid). + + +#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html) + +```python +colormaps['Set2'] +``` + +```python +colormaps['Set3'] +``` + +#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) + +```python +sns.color_palette('rocket', 10) +``` + +#### [cmcrameri](https://s-ink.org/scientific-colour-maps) + +```python +cm.batlow.resampled(10) +``` + +```python +cm.lipari.resampled(10) +``` + +
    +Discretizing continuous palettes +
    In the cells above, we saw that a continuous palette can be discretized by specifying the number of colors we want. However, as mentioned, discrete palettes are used to visualize categories that have no inherent order, so the discrete palette does not have to be ordered. +
    + +```python +nb_colors = 10 +#np.random.seed(10) + +original_cmap = cm.lipari.resampled(nb_colors) + +original_colors = original_cmap(np.arange(nb_colors)) + +shuffled_colors = original_colors.copy() +np.random.shuffle(shuffled_colors) + +colors.ListedColormap(shuffled_colors) +``` + +### Diverging color palettes + +Diverging color palettes are useful when the data have an interpretable central value. They can be discrete or continuous. For example: +- **Discrete diverging palettes**: Suppose we collected data with a Likert scale. Our values could range from 'Strongly disagree' to 'Strongly agree', with 'Neutral' as the central value. In this case, we could show the values on the left (from 'Strongly disagree' to 'Neutral') in shades of blue and the values on the right (from 'Neutral' to 'Strongly agree') in shades of orange/red. +- **Continuous diverging palettes**: If we have negative and positive values (e.g. correlation values), we could use zero as the central value, with negative values shown in shades of blue and positive values in shades of orange/red. 
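For a continuous diverging palette to be read correctly, the central value has to land on the middle of the color scale. One common way to do this, sketched here with made-up correlation values, is to use color limits that are symmetric around zero.

```python
import numpy as np

# Illustrative correlation values (not from the tutorial's data)
correlations = np.array([-0.2, 0.1, 0.65, -0.4, 0.0])

# Symmetric limits: the largest absolute value sets both ends of the scale
limit = float(np.max(np.abs(correlations)))
vmin, vmax = -limit, limit

# Position of each value on a 0-1 color scale; zero falls on the midpoint
scaled = (correlations - vmin) / (vmax - vmin)
print(vmin, vmax)
print(scaled)
```

These limits could then be passed as `vmin` and `vmax` to a plotting call that uses a diverging palette such as `bwr`, so that the neutral color corresponds to a correlation of zero.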
+ + +#### [matplotlib](https://matplotlib.org/stable/gallery/color/colormap_reference.html) + +```python +# Discrete diverging palette +colormaps['bwr'].resampled(9) +``` + +```python +# Continuous diverging palette +colormaps['bwr'] +``` + +#### [seaborn](https://seaborn.pydata.org/tutorial/color_palettes.html) + +```python +sns.color_palette("coolwarm", 9) +``` + +```python +sns.color_palette("coolwarm", as_cmap=True) +``` + +#### [cmcrameri](https://s-ink.org/scientific-colour-maps) + +```python +cm.vik.resampled(9) +``` + +```python +cm.vik +``` + +### Universally readable color palettes + +We have discussed the perceptual uniformity of color palettes, but it is also important to choose colors that everyone can perceive. Some color combinations can be hard to distinguish for people with a color vision deficiency (dyschromatopsia). A few tips: +- Avoid combinations of green and red +- Vary the lightness and saturation of the colors so that your figures remain readable in black and white +- Use the universally readable color palettes from `matplotlib`, `seaborn` or `cmcrameri` + + +```python +sns.color_palette("colorblind") +``` + +```python +converters = { + 'deuter50_space': { + "name": "sRGB1+CVD", + "cvd_type": "deuteranomaly", + "severity": 50 + }, + 'deuter100_space': { + "name": "sRGB1+CVD", + "cvd_type": "deuteranomaly", + "severity": 100 + }, + 'prot50_space': { + "name": "sRGB1+CVD", + "cvd_type": "protanomaly", + "severity": 50 + }, + 'prot100_space': { + "name": "sRGB1+CVD", + "cvd_type": "protanomaly", + "severity": 100 + }, + 'trit50_space': { + "name": "sRGB1+CVD", + "cvd_type": "tritanomaly", + "severity": 50 + }, + 'trit100_space': { + "name": "sRGB1+CVD", + "cvd_type": "tritanomaly", + "severity": 100 + } +} + +def show_palettes(cmap, n_colors=10): + f, ax = plt.subplots(7, 1, figsize=(10, 6)) #, layout="constrained") + 
plt.subplots_adjust(hspace=0.6) + + gradient = np.arange(n_colors).reshape(1, -1) + + if cmap in ['viridis', 'coolwarm', 'colorblind' ,'pastel', 'muted']: + cmap_sns = sns.color_palette(cmap, n_colors) + cmap_m = colors.ListedColormap(sns.color_palette(cmap, n_colors=n_colors)) + if cmap in ['batlow', 'vik']: + cmap_m = getattr(cm, cmap).resampled(n_colors) + cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors))) + if cmap in ['bwr', 'hsv', 'PuOr', 'summer']: + cmap_m = colormaps[cmap].resampled(n_colors) + cmap_sns = sns.color_palette(cmap_m(np.linspace(0, 1, n_colors))) + + # Original palette + ax[0].imshow(gradient, aspect='auto', cmap=cmap_m) + ax[0].set_title('Original') + for idx, converter in enumerate(converters.keys()): + # Palette transformée + converted_palette = cspace_convert(cmap_sns, converters[converter], "sRGB1") + converted_palette = np.clip(converted_palette, 0, 1) + + ax[idx+1].imshow(gradient, aspect='auto', cmap=colors.ListedColormap(converted_palette)) + ax[idx+1].set_title(f"{converters[converter]['cvd_type']}-{converters[converter]['severity']}") + + for a in ax.flatten(): + a.set_yticks([]) + a.set_xticks([]) + + plt.show() + +widgets.interact( + show_palettes, + cmap=widgets.Dropdown( + options=['pastel', 'viridis', 'coolwarm', 'batlow', 'bwr', 'colorblind', 'hsv', 'PuOr', 'summer', 'vik', 'muted'], + value='viridis', + description='Palette:' + ), +); + +``` + +
    +Beyond colors +
    Choosing the right color palette is important, but there are other strategies we can use to make our figures more accessible and easier to interpret. Instead of relying only on different colors to distinguish categories, we could use different marker shapes. If our figure contains lines, for example to show the trajectory of a variable over time, we could use different dash patterns. For more examples, see "The best charts for color blind viewers". +
    + + +## The anatomy of a figure + +We have already seen how to add or modify a few elements of our figures, such as the title and the axis labels. In this section, we look more closely at the art of making figures. We will cover: +- Subplots +- Borders (spines) +- Ticks +- Grid +- Legend + + +### Subplots + +So far, we have mostly created our figures by calling `matplotlib` and `seaborn` functions directly. However, if we want a figure with several panels/axes (i.e. columns and/or rows), we need to use **subplots**. + + +#### matplotlib + +```python +# Let's reuse the code from earlier. This code produces two separate figures +plt.hist(participants['Age'], bins='fd') +plt.title("Distribution de l'age") +plt.xlabel('Age') +plt.ylabel('Compte') +plt.show() + +plt.scatter(participants['Age'], participants['ToM Booklet-Matched']) +plt.xlabel('Age') +plt.ylabel('ToM Booklet-Matched') +plt.show() +``` + +```python +help(plt.subplots) +``` + +```python +# To produce a single figure containing both plots, we need subplots +# Function syntax: plt.subplots(nrows, ncols, ...) +fig, axes = plt.subplots(1, 2, figsize=(12, 4)) +axes[0].hist(participants['Age'], bins='fd') +axes[0].set_title("Distribution de l'age") +axes[0].set_xlabel('Age') +axes[0].set_ylabel('Compte') + +axes[1].scatter(participants['Age'], participants['ToM Booklet-Matched']) +axes[1].set_title("Relation entre Age et scores ToM") +axes[1].set_xlabel('Age') +axes[1].set_ylabel('ToM Booklet-Matched') +``` + +```python +# The subplots function can also be used when there is only one plot +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Distribution de l'age") +axes.set_xlabel('Age') +axes.set_ylabel('Compte') +``` + +#### 
seaborn + +```python +# With seaborn +fig, axes = plt.subplots(1, 2, figsize=(12, 4)) + +order = sorted(participants[participants.AgeGroup!='Adult']['AgeGroup'].unique()) +#order.sort() + +sns.boxplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + ax=axes[0] +) + +sns.violinplot( + x='AgeGroup', + y = 'ToM Booklet-Matched', + data = participants[participants.AgeGroup!='Adult'], + order=order, + ax=axes[1] +) +``` + +### Border +The border (spines) consists of the lines around the plotting area. + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Distribution de l'age") +ax.set_xlabel('Age') +ax.set_ylabel('Compte') + +for spine in ax.spines.values(): + spine.set_color("red") + spine.set_linewidth(3) +``` + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Distribution de l'age") +ax.set_xlabel('Age') +ax.set_ylabel('Compte') + +ax.spines[['right', 'top', 'left', 'bottom']].set_visible(False) # ax.spines.top.set_visible(False) +``` + +### Ticks + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Distribution de l'age") +ax.set_xlabel('Age') +ax.set_ylabel('Compte') + +ax.tick_params( + axis='both', # Apply the change to both axes (x and y) + which='major', # Change the 'major', 'minor' or 'both' ticks + length=10, # Tick length + width=2, # Tick width + color='purple', # Tick color + labelsize=12, # Label size + labelcolor='darkorange', # Label color + labelrotation=45 +) + +#from matplotlib.ticker import MultipleLocator +#ax.xaxis.set_minor_locator(MultipleLocator(2)) +``` + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Distribution de l'age") +ax.set_xlabel('Age') 
+ax.set_ylabel('Compte') + +plt.tick_params( + axis='x', + which='both', + bottom=False, + top=False, + labelbottom=False) +``` + +
    +Modify the code +
    Starting from the code in the cells above, edit the cell below to produce a figure with the following characteristics: +
  • + No border at the top or on the left +
  • +
  • + No ticks on the y axis +
  • +
  • + Minor ticks on the x axis at intervals of 1 +
  • +
  • + Tick labels on the x axis rotated by 90 degrees +
  • +
  • + Titles for all axes and for the figure +
  • + +```python +# Add your code in this cell +# ... +``` + +### Grid + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +ax.hist(participants['Age'], bins='fd') +ax.set_title("Distribution de l'age") +ax.set_xlabel('Age') +ax.set_ylabel('Compte') + + +ax.grid( + True, + color='gray', + linestyle='--', + linewidth=1 +) + +ax.set_axisbelow(True) # Equivalent to using the `zorder` parameter +``` + +### Legend + +```python +fig, ax = plt.subplots(figsize=(6, 4)) + +sns.scatterplot(data=participants, x='Age', y='ToM Booklet-Matched', hue='Gender', ax=ax) +ax.set_xlabel('Age') +ax.set_ylabel('ToM Booklet-Matched') + +legend = ax.legend(fontsize=12, loc='lower right') + +legend.get_frame().set_edgecolor("black") +legend.get_frame().set_linewidth(2) +legend.get_frame().set_facecolor("white") +``` + +### Default parameters + +The default values for all the elements of our figures (e.g. font, text size, line width, etc.) are defined in the `rcParams` object. These values can be changed, which lets us apply a consistent style from one figure to the next. + +```python +plt.rcParams +``` + +```python +STYLE = { + "font.cursive": "Comic Sans MS", + "font.size": 8, + "axes.linewidth": 1.2, + "axes.grid": True, + "axes.grid.axis": 'both', + 'axes.axisbelow': True, + "figure.dpi": 150 +} + +plt.rcParams.update(STYLE) +``` + +```python +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Distribution de l'age") +axes.set_xlabel('Age') +axes.set_ylabel('Compte') +``` + +```python +# To restore the default values +plt.rcdefaults() +``` + +
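When a style should apply to a single figure only, the settings can be scoped with matplotlib's `rc_context` context manager rather than changed globally. This is a minimal sketch; the two settings used are just examples.

```python
import matplotlib

default_size = matplotlib.rcParams["font.size"]

# The settings only apply inside the `with` block
with matplotlib.rc_context({"font.size": 8, "axes.grid": True}):
    size_inside = matplotlib.rcParams["font.size"]
    # ... create the figure that needs this style here ...

# Outside the block, the previous values are back
size_after = matplotlib.rcParams["font.size"]
print(default_size, size_inside, size_after)
```

This avoids having to remember to restore the previous settings after the figure is drawn.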
    +Choose your style +
    It is also possible to use predefined styles with matplotlib's `style.use` function. You can browse the list of available styles here. See also the matplotlib documentation to learn more about style sheets and `rcParams`. +
    + +```python +print(plt.style.available) +``` + +
    +For example, 'ggplot' is a style that produces figures similar to those generated by the `ggplot2` library in R. +
    + +```python +plt.style.use('ggplot') +``` + +```python +fig, axes = plt.subplots(figsize=(6, 4)) +axes.hist(participants['Age'], bins='fd') +axes.set_title("Distribution de l'age") +axes.set_xlabel('Age') +axes.set_ylabel('Compte') +``` + +## Interactive plots + +Static figures are good, but interactive figures are better! Interactive plots let us explore our data in ways that static figures do not allow, and they can be used to build dashboards (see the [documentation](https://dash.plotly.com/)). We can add information to our figures without cluttering them. + +The two main Python libraries for interactive figures are `plotly` and `bokeh`. The `plotly` library offers a high-level interface (`plotly.express`) for creating figures in a single line of code, while `bokeh` offers more customization options. In this tutorial we only cover `plotly.express`, but if you want to try `bokeh`, see [the documentation](https://docs.bokeh.org/en/latest/). 
+ +```python +help(px.scatter) +``` + +```python +fig = px.scatter( + data_frame=participants, + x='Age', + y='ToM Booklet-Matched', + hover_data=['participant_id', 'Gender'], + color='Handedness', + symbol='Handedness' +) + +fig.show() +``` + +```python +participants.columns +``` + +```python +fig = px.scatter( + data_frame=participants, + x='Age', + y='ToM Booklet-Matched', + hover_data=['participant_id', 'Gender'], + color='Handedness', + symbol='Handedness' +) +fig.update_xaxes(range=[int(participants['Age'].min()-1), int(participants[participants['Child_Adult']=='child']['Age'].max()+1)]) +fig.update_traces(marker_size=10) +fig.show() +``` + +## Visualizing brain images! + +The `nilearn` library provides several functions for visualizing brain images (see the [list of all functions](https://nilearn.github.io/stable/plotting/index.html)). In this section, we look at a few of them. + +--- + +Based on the [MAIN tutorial](https://main-educational.github.io/intro_nilearn/machine-learning-with-nilearn.html) + +```python +# First, let's get our data +data = development_dataset.func +confounds = development_dataset.confounds +pheno = pd.DataFrame(development_dataset.phenotypic) +``` + +```python +data +``` + +```python +# Let's try to visualize our first file +plotting.view_img(data[0]) +``` + +```python +img = nib.load(data[0]) +img.shape +``` + +
    +Oops! +
    If we try to visualize our BOLD images, we get an error! This is expected, because our file contains 4D data: for each voxel (3D), we have one value at several points in time (repetition times; +1D). If we really want to visualize these files, we have two options: +
      + 1. Either we look at the activity across the whole brain at a given time point +
    +
      + 2. Or we look at the time course of a given voxel/parcel +
    +
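The two options correspond to two ways of slicing a 4D array. The sketch below uses a small random array in place of a real BOLD file; the dimensions and the chosen voxel are arbitrary.

```python
import numpy as np

# Stand-in for a 4D BOLD image: x, y, z, time (arbitrary small dimensions)
rng = np.random.default_rng(0)
bold = rng.normal(size=(4, 5, 6, 20))

# Option 1: the whole brain at one time point (a 3D volume)
first_volume = bold[..., 0]

# Option 2: the time course of one voxel (a 1D series)
voxel_time_course = bold[2, 3, 1, :]

print(first_volume.shape)
print(voxel_time_course.shape)
```

A 3D volume can be shown with brain-plotting functions, while a 1D series is an ordinary line plot.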
    + +```python +# Get our first volume (i.e., our first brain image) +premier_volume = image.index_img(data[0], 0) + +plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +```python +help(plotting.plot_stat_map) +``` + +```python +plotting.plot_stat_map( + premier_volume, + draw_cross=False, + #cut_coords=(0, 4, 22), + display_mode='tiled' +) +``` + +```python +multiscale = datasets.fetch_atlas_basc_multiscale_2015(resolution=64, data_dir='../data') +atlas_filename = multiscale.maps + +# initialize masker (change verbosity) +masker = NiftiLabelsMasker(labels_img=atlas_filename, standardize=True, + memory='nilearn_cache', resampling_target="data", + detrend=True, verbose=0) +# Extract the time series for our first subject +time_series = masker.fit_transform(data[0], confounds=confounds[0]) +``` + +```python +time_series.shape +``` + +```python +parcelle = 0 +plt.figure(figsize=(12,4)) +plt.plot(time_series.T[parcelle]) +plt.title(f'Décours temporel de la parcelle {parcelle}') +plt.xlabel('Volumes') +plt.ylabel('Amplitude') +``` + +
    +Visually compare the time courses +
    Starting from the code in the previous cell, edit the cell below so that you can compare the time courses of parcel 0 and parcel 1. +
    + +```python +help(plt.legend) +``` + +```python +plt.figure(figsize=(12,4)) +plt.plot(time_series.T[0], label='Parcelle 0', ls='--', lw=2, alpha=0.4) +plt.plot(time_series.T[1], label='Parcelle 1', lw=2, zorder=1) +plt.title(f'Décours temporel des parcelles 0 et 1') +plt.xlabel('Volumes') +plt.ylabel('Amplitude') +plt.legend(loc='lower right') +``` + +## Interpreting machine learning models through visualization + + +### Loading the data + +```python +from nilearn.connectome import ConnectivityMeasure + +correlation_measure = ConnectivityMeasure(kind='correlation', vectorize=True, + discard_diagonal=True) + + +all_features = [] # here is where we will put the data (a container) + +for i,sub in enumerate(data[:66]): + # extract the timeseries from the ROIs in the atlas + time_series = masker.fit_transform(sub, confounds=confounds[i]) + # create a region x region correlation matrix + correlation_matrix = correlation_measure.fit_transform([time_series])[0] + # add to our container + all_features.append(correlation_matrix) + # keep track of status + print('finished %s of %s'%(i+1,len(data[:66]))) + +np.savez_compressed('data/MAIN_BASC064_subsamp_features', a=all_features) +``` + +
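To see what one row of features contains, here is a sketch of the general idea with `numpy` only: compute a region-by-region correlation matrix, then keep each pair of regions once. The time series are random and the number of volumes is arbitrary; the ordering of the values in nilearn's own vectorization may differ.

```python
import numpy as np

# Random stand-in for one subject's time series: volumes x regions
rng = np.random.default_rng(0)
n_volumes, n_regions = 168, 64  # n_volumes is arbitrary here
time_series = rng.normal(size=(n_volumes, n_regions))

# Region x region correlation matrix (columns are the variables)
corr = np.corrcoef(time_series, rowvar=False)

# Keep the lower triangle without the diagonal: one value per region pair
rows, cols = np.tril_indices(n_regions, k=-1)
features = corr[rows, cols]

print(corr.shape)
print(features.shape)  # n_regions * (n_regions - 1) / 2 values
```

With 64 regions this gives 2016 features per participant, which is why the feature matrix is much wider than the number of participants.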
    +If your data are not in the data/ folder, change the path in the cell below. +
    + +```python +y_ageclass = pheno.head(66)['Child_Adult'] + +feat_file = 'data/MAIN_BASC064_subsamp_features.npz' +X_features = np.load(feat_file)['a'] +``` + +```python +print(f"Shape X: {X_features.shape}") +print(f"Shape y: {y_ageclass.shape}") +``` + +```python +sns.countplot(x = y_ageclass) +``` + +### Training the model + +```python +from sklearn.model_selection import train_test_split + +# Split the sample to training/test and +# stratify by age class, and also shuffle the data. + +X_train, X_test, y_train, y_test = train_test_split(X_features, # x + y_ageclass, # y + test_size = 0.2, # 80%/20% split + shuffle = True, # shuffle dataset + # before splitting + stratify = y_ageclass, # keep + # distribution + # of ageclass + # consistent + # betw. train + # & test sets. + random_state = 123 # same shuffle each + # time + ) + +from sklearn.svm import SVC +from sklearn.preprocessing import StandardScaler +from sklearn.metrics import classification_report, confusion_matrix + +scaler = StandardScaler().fit(X_train) +X_train_scl = scaler.transform(X_train) +X_test_scl = scaler.transform(X_test) + +l_svc = SVC(kernel='linear', class_weight='balanced') + +l_svc.fit(X_train_scl, y_train) # fit to training data +y_pred = l_svc.predict(X_test_scl) # classify age class using testing data + +acc = l_svc.score(X_test_scl, y_test) # get accuracy +cr = classification_report(y_pred=y_pred, y_true=y_test) # get prec., recall & f1 +cm = confusion_matrix(y_pred=y_pred, y_true=y_test) # get confusion matrix +``` + +```python +print(cr) +``` + +### Visualizing the coefficients + +We trained our model and obtained a fairly high prediction score, which suggests that something in our data may be systematically related to age. + +```python +print(l_svc.coef_.shape) +print(l_svc.coef_) +``` + +#### Correlation matrix +The features of our model correspond to the correlation between each pair of regions that we extracted. 
The coefficients of our model therefore represent the weight of each of these region pairs in predicting the age group. We can thus use a correlation-style matrix to visualize these weights. + +```python +feat_exp_matrix = correlation_measure.inverse_transform(l_svc.coef_)[0] + +plotting.plot_matrix(feat_exp_matrix, figure=(10, 8), + labels=range(feat_exp_matrix.shape[0]), + reorder='average', + tri='lower', vmax=0.01, vmin=-0.01) +``` + +#### Connectome + +We can also visualize the weights of our features directly on a brain! + +```python +# Coordinates of our regions +coords = plotting.find_parcellation_cut_coords(atlas_filename) +print(coords.shape) +``` + +```python +plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True) +``` + +```python +plotting.plot_connectome(feat_exp_matrix, coords, colorbar=True, edge_threshold=0.006) +``` + +
    +Let's add some motion! +
    Nilearn has a function that lets us view our connectome interactively! This makes it easier to examine the weights of our features. +
    + +```python +plotting.view_connectome(feat_exp_matrix, coords, edge_threshold='90%') +``` + +
    +All-grey features... +
    You may have noticed that the weights of our features are drawn in grey. Why do you think that is? +
    + +```python +# Work on a copy so that feat_exp_matrix itself is left untouched +feat_exp_matrix_rm_diag = feat_exp_matrix.copy() +# Zero out the entries equal to 1 (the diagonal) +feat_exp_matrix_rm_diag[feat_exp_matrix_rm_diag==1] = 0 +plotting.view_connectome(feat_exp_matrix_rm_diag, coords, edge_threshold='90%') +``` + +We have a model that predicts the age group with very good predictive performance. We can see that the features driving this prediction are distributed across the brain. Can we publish our results now?... +
    +
    No! We need to explore further to see whether our model is biologically plausible... To do so, we will visualize the brain images for each of our groups. + +```python +children, adults = data[33:66], data[0:33] +avg_children, avg_adults = [], [] + +# For each participant, we average brain activity across all time points to obtain a 3D image +for child, adult in zip(children, adults): + avg_adults.append(image.mean_img(adult)) + avg_children.append(image.mean_img(child)) + +# We then average the individual 3D images across participants, separately for each group +avg_children = image.mean_img(avg_children) +avg_adults = image.mean_img(avg_adults) +``` + +```python +plotting.view_img(avg_children, black_bg=False, cut_coords=(0,-16,16)) +``` + +```python +plotting.view_img(avg_adults, black_bg=False) +``` + +
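The two-stage averaging above can be sketched with plain NumPy arrays standing in for the 4D fMRI images. This is only an illustration under stated assumptions: `toy_group` is hypothetical random data, and real images would be handled by `image.mean_img` rather than by array means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 participants, each a 4D array (x, y, z, time)
toy_group = [rng.normal(size=(4, 5, 6, 10)) for _ in range(3)]

# Step 1: average over time (last axis) to get one 3D volume per participant
per_subject_mean = [scan.mean(axis=-1) for scan in toy_group]

# Step 2: average across participants to get one 3D volume for the group
group_mean = np.mean(per_subject_mean, axis=0)

print(per_subject_mean[0].shape)
print(group_mean.shape)
```

Averaging per participant first gives each participant the same weight in the group image, whatever the number of time points in their scan.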
    +What do you notice? +
    Look at the averaged images for each group. Can you see any differences? +
    + +```python +# Grab the first volume of our first participant +premier_volume = image.index_img(data[0], 0) + +plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +```python +# Grab the first volume of our 41st participant (index 40) +premier_volume = image.index_img(data[40], 0) + +plotting.view_img(premier_volume, black_bg=False, cmap='turbo', symmetric_cmap=False) +``` + +## Additional resources + +- [`seaborn` tutorial on visualizing data distributions](https://seaborn.pydata.org/tutorial/distributions.html) +- [Python Graph Gallery](https://python-graph-gallery.com/) +- [Comprehensive guide to visualization](https://www.atlassian.com/data/charts) +- [Guide to visualization practices to avoid](https://www.data-to-viz.com/caveats.html) +- [Plotting functions in nilearn - (f)MRI data](https://nilearn.github.io/dev/plotting/index.html) +- [Visualization tutorials in MNE-Python - EEG/MEG data](https://mne.tools/stable/auto_tutorials/visualization/index.html) +- [Visualization tutorials in DIPY - Diffusion MRI data](https://docs.dipy.org/stable/examples_built/index#visualization) diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..0000825 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,15 @@ +nilearn +cmcrameri +numpy +pandas +seaborn +ipywidgets +matplotlib +notebook +ipykernel +ptitprince +colorspacious +nibabel +pillow +scikit-learn +plotly \ No newline at end of file