{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Subject 5 - Analysis of the dialogues in Molière's L'Avare" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Retrieving the data" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Observatoire de la vie littéraire (OBVIL) promotes a digital approach to the analysis of literary texts. As part of the Projet Molière, plays by this author have been digitized and are freely available in several machine-readable formats. Here we use the text in markdown format, available [here](http://dramacode.github.io/markdown/moliere_avare.txt)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "data_url = 'http://dramacode.github.io/markdown/moliere_avare.txt'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For better reproducibility, we first download the data into this GitLab repository and then read the local file rather than the URL directly." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "data_file = \"moliere_avare.txt\"\n", "\n", "import os\n", "import urllib.request\n", "\n", "# Download only once; later runs reuse the local copy.\n", "if not os.path.exists(data_file):\n", "    urllib.request.urlretrieve(data_url, data_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at the first lines of this file:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "%load_ext rpy2.ipython" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "---\n", "identifier: moliere_avare \n", "creator: Molière. 
\n", "date: 1668 \n", "title: L'Avare. Comédie \n", "---\n", "\n", "\n", "L'AVARE,\n", "\n", "COMÉDIE.\n", "\n", "Par J.B.P. MOLIÈRE.\n", "\n", "À PARIS, Chez JEAN RIBOU, au Palais, vis à vis la Porte de l'Église de la Sainte Chapelle, à l'Image Saint-Louis. M. DC. LXIX. *AVEC PRIVILÈGE DU ROI*\n", "\n", "\n", "\n", "# ACTEURS.\n", " – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n", " – Cléante, Fils d'Harpagon, Amant de Mariane.\n", " – Élise, Fille d'Harpagon, Amante de Valère.\n", " – Valère, Fils d'Anselme, et Amant d'Élise.\n", " – Mariane, Amante de Cléante, et aimée d'Harpagon.\n", " – Anselme, Père de Valère et de Mariane.\n", " – Frosine, Femme d'Intrigue.\n", " – Maitre Simon, Courtier.\n", " – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n", " – La Flèche, Valet de Cléante.\n", " – Dame Claude, Servante d'Harpagon.\n", " – Brindavoine, laquais d'Harpagon.\n", " – La Merluche, laquais d'Harpagon.\n", " – Le commissaire, et son clerc.\n", "La Scène est à Paris.\n", "\n", "\n", "\n", "# L'Avare, *Comédie.*.\n", "\n", "\n", "## Acte Premier.\n", "\n", "\n", "### Scène Première.\n", "Valère, Élise\n", "\n", "\n", " VALÈRE.\n", "Hé quoi, charmante Élise, vous devenez mélancolique, après les obligeantes assurances que vous avez eu la bonté de me donner de votre foi ?Je vous vois soupirer, hélas, au milieu de ma joie !Est-ce du regret, dites-moi, de m'avoir fait heureux ? et vous repentez-vous de cet engagement où mes feux ont pu vous contraindre ?\n", "\n", " ÉLISE.\n", "Non, Valère, je ne puis pas me repentir de tout ce que je fais pour vous. Je m'y sens entraîner par une trop douce puissance, et je n'ai pas même la force de souhaiter que les choses ne fussent pas. 
Mais, à vous dire vrai, le succès me donne de l'inquiétude ; et je crains fort de vous aimer un peu plus que je ne devrais.\n", "\n", " VALÈRE.\n", "Hé que pouvez-vous craindre, Élise, dans les bontés que vous avez pour moi ?\n" ] } ], "source": [ "%%sh\n", "head -n 55 moliere_avare.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, acts are marked by a double hash (`##`) at the start of a line and scenes by a triple hash (`###`). Each piece of dialogue appears to span two lines: the first holds the character's name in capitals, the second holds the character's speech.\n", "\n", "We will try to rearrange the data into a table like this:\n", "\n", "| Character (`personnage`) | Act (`acte`) | Scene (`scene`) | Number of words (`nombre_de_mots`) | Number of speeches (`nombre_de_repliques`) |\n", "|:-------|:------|:-------|:------------|:----------------|\n", "| character's name | act in which the character appears | scene in which the character appears | number of words the character speaks | number of speeches the character delivers |\n", "\n", "We define a function that replaces accented characters with their unaccented equivalents:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [], "source": [ "import unicodedata\n", "\n", "def remove_accents(input_str):\n", "    # NFKD splits each accented character into a base character plus\n", "    # combining marks; dropping the marks leaves the unaccented text.\n", "    nfkd_form = unicodedata.normalize('NFKD', input_str)\n", "    return \"\".join([c for c in nfkd_form if not unicodedata.combining(c)])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the messages printed by the cell below show, there are a few discrepancies between the list of characters given under each scene heading and the characters who actually speak in the dialogue." 
] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "VALERE is not listed under scene 19 acte 3 but is speaking\n", "No characters listed for scene 26 acte 4\n", "HARPAGON is not listed under scene 26 acte 4 but is speaking\n" ] }, { "data": { "text/html": [ "
[HTML rendering of the DataFrame omitted; it duplicates the text/plain output: 112 rows × 5 columns (acte, nombre_de_mots, nombre_de_repliques, personnage, scene)]
" ], "text/plain": [ " acte nombre_de_mots nombre_de_repliques personnage scene\n", "0 1 596 8 VALERE 1\n", "1 1 473 8 ELISE 1\n", "2 1 725 10 CLEANTE 2\n", "3 1 150 9 ELISE 2\n", "4 1 396 34 HARPAGON 3\n", "5 1 242 32 LA FLECHE 3\n", "6 1 147 23 ELISE 4\n", "7 1 211 29 CLEANTE 4\n", "8 1 1044 53 HARPAGON 4\n", "9 1 621 22 VALERE 5\n", "10 1 238 20 HARPAGON 5\n", "11 1 36 4 ELISE 5\n", "12 2 350 21 CLEANTE 6\n", "13 2 853 20 LA FLECHE 6\n", "14 2 0 0 MAITRE SIMON 7\n", "15 2 159 9 HARPAGON 7\n", "16 2 126 6 CLEANTE 7\n", "17 2 13 1 LA FLECHE 7\n", "18 2 1 1 FROSINE 8\n", "19 2 21 1 HARPAGON 8\n", "20 2 281 6 LA FLECHE 9\n", "21 2 116 5 FROSINE 9\n", "22 2 520 35 HARPAGON 10\n", "23 2 1234 35 FROSINE 10\n", "24 3 557 34 HARPAGON 11\n", "25 3 74 3 CLEANTE 11\n", "26 3 3 1 ELISE 11\n", "27 3 247 11 VALERE 11\n", "28 3 0 0 DAME CLAUDE 11\n", "29 3 0 0 MAITRE JACQUES 11\n", ".. ... ... ... ... ...\n", "82 4 11 1 HARPAGON 26\n", "83 5 87 6 HARPAGON 27\n", "84 5 104 7 LE COMMISSAIRE 27\n", "85 5 0 0 SON CLERC 27\n", "86 5 0 0 MAITRE JACQUES 28\n", "87 5 172 19 HARPAGON 28\n", "88 5 154 8 LE COMMISSAIRE 28\n", "89 5 0 0 SON CLERC 28\n", "90 5 611 30 VALERE 29\n", "91 5 434 30 HARPAGON 29\n", "92 5 0 0 LE COMMISSAIRE 29\n", "93 5 0 0 SON CLERC 29\n", "94 5 0 0 MAITRE JACQUES 29\n", "95 5 10 1 ELISE 30\n", "96 5 0 0 MARIANE 30\n", "97 5 4 1 FROSINE 30\n", "98 5 124 4 HARPAGON 30\n", "99 5 20 1 VALERE 30\n", "100 5 0 0 MAITRE JACQUES 30\n", "101 5 0 0 LE COMMISSAIRE 30\n", "102 5 0 0 SON CLERC 30\n", "103 5 383 14 ANSELME 31\n", "104 5 245 11 HARPAGON 31\n", "105 5 0 0 ELISE 31\n", "106 5 190 3 MARIANE 31\n", "107 5 0 0 FROSINE 31\n", "108 5 346 14 VALERE 31\n", "109 5 0 0 MAITRE JACQUES 31\n", "110 5 0 0 LE COMMISSAIRE 31\n", "111 5 0 0 SON CLERC 31\n", "\n", "[112 rows x 5 columns]" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "acte = 0\n", "scene = 0\n", "infos_scene = {}\n", "\n", "data = []\n", "with open(data_file) as 
f:\n", "    lines = f.readlines()\n", "nbline = 0\n", "while nbline < len(lines):\n", "    line = lines[nbline]\n", "    if line.startswith(\"## Acte\"):\n", "        acte += 1\n", "    elif line.startswith(\"### Sc\"):\n", "        # A new scene starts: store the records of the previous one.\n", "        if infos_scene:\n", "            data += infos_scene.values()\n", "        scene += 1\n", "        # The line after the scene heading lists the characters present.\n", "        nbline += 1\n", "        line = lines[nbline]\n", "        if line.strip():\n", "            infos_scene = {l.strip().upper():{'acte':acte,'scene':scene,'personnage':l.strip().upper(),'nombre_de_mots':0,'nombre_de_repliques':0} for l in remove_accents(line.strip()).split(\",\")}\n", "        else:\n", "            infos_scene = {}\n", "            print(\"No characters listed for scene\",scene,\"acte\",acte)\n", "    elif re.search(r\"^ [A-ZÈÉ ]+.$\",line):\n", "        # Speaker line (name in capitals); the speech is on the next line.\n", "        assert acte and scene\n", "        personnage = remove_accents(re.search(r\"^ ([A-ZÈÉ ]+).$\",line).groups()[0])\n", "        nbline += 1\n", "        line = lines[nbline]\n", "        assert line.strip() # check line is not empty\n", "        nombre_de_mots = len(line.split()) # punctuation is assumed to have a negligible effect on the count\n", "        if personnage not in infos_scene:\n", "            print(personnage,\"is not listed under scene\",scene,\"acte\",acte,\"but is speaking\")\n", "            infos_scene[personnage] = {'acte':acte,'scene':scene,'personnage':personnage,'nombre_de_mots':0,'nombre_de_repliques':0}\n", "        infos_scene[personnage]['nombre_de_repliques'] += 1\n", "        infos_scene[personnage]['nombre_de_mots'] += nombre_de_mots\n", "    nbline += 1\n", "# Note: a scene is only stored when the next scene heading is met,\n", "# so the records of the final scene are not appended to data here.\n", "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us check that we have the right number of acts and scenes:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n", "32\n" ] } ], "source": [ "%%sh\n", "grep -c \"## Acte\" moliere_avare.txt\n", "grep -c \"### Scène\" moliere_avare.txt" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5,)" ] }, "execution_count": 86, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "df.acte.value_counts().shape" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(31,)" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.scene.value_counts().shape" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(15,)" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.personnage.value_counts().shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of acts (5) matches the count obtained with `grep`. The number of scenes does not: `grep` finds 32 scene headings, whereas the DataFrame contains only 31 scenes. The parsing loop stores a scene's records only when it reaches the next scene heading, so the final scene is never appended. We also find 15 distinct characters, which matches the cast list at the start of the document (14 entries, the last of which names two characters, the commissaire and his clerk)." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n", " – Cléante, Fils d'Harpagon, Amant de Mariane.\n", " – Élise, Fille d'Harpagon, Amante de Valère.\n", " – Valère, Fils d'Anselme, et Amant d'Élise.\n", " – Mariane, Amante de Cléante, et aimée d'Harpagon.\n", " – Anselme, Père de Valère et de Mariane.\n", " – Frosine, Femme d'Intrigue.\n", " – Maitre Simon, Courtier.\n", " – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n", " – La Flèche, Valet de Cléante.\n", " – Dame Claude, Servante d'Harpagon.\n", " – Brindavoine, laquais d'Harpagon.\n", " – La Merluche, laquais d'Harpagon.\n", " – Le commissaire, et son clerc.\n", "\n" ] } ], "source": [ "print(\"\".join(lines[19:33]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysing the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": 
"python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }