{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Topic 5 - Analysis of the dialogues in Molière's L'Avare"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the data"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Observatoire de la vie littéraire (OBVIL) promotes a digital approach to the analysis of literary texts. As part of the Projet Molière, plays by this author have been digitized and are freely available in several machine-readable formats. Here we use the Markdown version of the text, available [here](http://dramacode.github.io/markdown/moliere_avare.txt)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_url = 'http://dramacode.github.io/markdown/moliere_avare.txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For better reproducibility, we first download the data into this GitLab repository and then read the local file rather than the URL directly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import urllib.request\n",
"\n",
"data_file = \"moliere_avare.txt\"\n",
"\n",
"# Download the file only once; later runs read the local copy.\n",
"if not os.path.exists(data_file):\n",
"    urllib.request.urlretrieve(data_url, data_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can look at the first lines of this file:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%load_ext rpy2.ipython"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---\n",
"identifier: moliere_avare \n",
"creator: Molière. \n",
"date: 1668 \n",
"title: L'Avare. Comédie \n",
"---\n",
"\n",
"\n",
"L'AVARE,\n",
"\n",
"COMÉDIE.\n",
"\n",
"Par J.B.P. MOLIÈRE.\n",
"\n",
"À PARIS, Chez JEAN RIBOU, au Palais, vis à vis la Porte de l'Église de la Sainte Chapelle, à l'Image Saint-Louis. M. DC. LXIX. *AVEC PRIVILÈGE DU ROI*\n",
"\n",
"\n",
"\n",
"# ACTEURS.\n",
" – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n",
" – Cléante, Fils d'Harpagon, Amant de Mariane.\n",
" – Élise, Fille d'Harpagon, Amante de Valère.\n",
" – Valère, Fils d'Anselme, et Amant d'Élise.\n",
" – Mariane, Amante de Cléante, et aimée d'Harpagon.\n",
" – Anselme, Père de Valère et de Mariane.\n",
" – Frosine, Femme d'Intrigue.\n",
" – Maitre Simon, Courtier.\n",
" – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n",
" – La Flèche, Valet de Cléante.\n",
" – Dame Claude, Servante d'Harpagon.\n",
" – Brindavoine, laquais d'Harpagon.\n",
" – La Merluche, laquais d'Harpagon.\n",
" – Le commissaire, et son clerc.\n",
"La Scène est à Paris.\n",
"\n",
"\n",
"\n",
"# L'Avare, *Comédie.*.\n",
"\n",
"\n",
"## Acte Premier.\n",
"\n",
"\n",
"### Scène Première.\n",
"Valère, Élise\n",
"\n",
"\n",
" VALÈRE.\n",
"Hé quoi, charmante Élise, vous devenez mélancolique, après les obligeantes assurances que vous avez eu la bonté de me donner de votre foi ?Je vous vois soupirer, hélas, au milieu de ma joie !Est-ce du regret, dites-moi, de m'avoir fait heureux ? et vous repentez-vous de cet engagement où mes feux ont pu vous contraindre ?\n",
"\n",
" ÉLISE.\n",
"Non, Valère, je ne puis pas me repentir de tout ce que je fais pour vous. Je m'y sens entraîner par une trop douce puissance, et je n'ai pas même la force de souhaiter que les choses ne fussent pas. Mais, à vous dire vrai, le succès me donne de l'inquiétude ; et je crains fort de vous aimer un peu plus que je ne devrais.\n",
"\n",
" VALÈRE.\n",
"Hé que pouvez-vous craindre, Élise, dans les bontés que vous avez pour moi ?\n"
]
}
],
"source": [
"%%sh\n",
"head -n 55 moliere_avare.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, acts are marked by a double hash (`##`) at the start of a line and scenes by a triple hash (`###`). Each reply appears to span two lines: the first holds the character's name in capitals, the second holds the text of the reply.\n",
"\n",
"We will try to rearrange the data into a table like this one:\n",
"\n",
"| Character | Act | Scene | Word count | Reply count |\n",
"|:-------|:------|:-------|:------------|:----------------|\n",
"| character name | act in which the character appears | scene in which the character appears | number of words the character speaks | number of replies |\n",
"\n",
"We first define a function that replaces accented characters with their unaccented equivalents:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import unicodedata\n",
"\n",
"def remove_accents(input_str):\n",
"    # NFKD splits each accented character into a base character plus combining marks,\n",
"    # which are then dropped.\n",
"    nfkd_form = unicodedata.normalize('NFKD', input_str)\n",
"    return \"\".join(c for c in nfkd_form if not unicodedata.combining(c))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the parsing output below shows, there are a few discrepancies between the list of characters given under each scene heading and the characters who actually speak in the dialogue."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VALERE is not listed under scene 19 acte 3 but is speaking\n",
"No characters listed for scene 26 acte 4\n",
"HARPAGON is not listed under scene 26 acte 4 but is speaking\n"
]
},
{
"data": {
"text/plain": [
" acte nombre_de_mots nombre_de_repliques personnage scene\n",
"0 1 596 8 VALERE 1\n",
"1 1 473 8 ELISE 1\n",
"2 1 725 10 CLEANTE 2\n",
"3 1 150 9 ELISE 2\n",
"4 1 396 34 HARPAGON 3"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Speaker line: leading spaces, a name in capitals, then one final character (the period).\n",
"speaker_re = re.compile(r\"^ +([A-ZÈÉ][A-ZÈÉ ]*).$\")\n",
"\n",
"def new_entry(personnage):\n",
"    # Fresh counters for one character in the current act and scene.\n",
"    return {'acte': acte, 'scene': scene, 'personnage': personnage,\n",
"            'nombre_de_mots': 0, 'nombre_de_repliques': 0}\n",
"\n",
"acte = 0\n",
"scene = 0  # running scene index over the whole play (not reset at each act)\n",
"infos_scene = {}  # counters of the scene being parsed, keyed by character name\n",
"\n",
"data = []\n",
"with open(data_file, encoding='utf-8') as f:\n",
"    lines = f.readlines()\n",
"nbline = 0\n",
"while nbline < len(lines):\n",
"    line = lines[nbline]\n",
"    if line.startswith(\"## Acte\"):\n",
"        acte += 1\n",
"    elif line.startswith(\"### Sc\"):\n",
"        # A new scene starts: store the counters of the previous one.\n",
"        # The final scene of the file is therefore never stored.\n",
"        if infos_scene:\n",
"            data += infos_scene.values()\n",
"        scene += 1\n",
"        # The line after the heading lists the characters present.\n",
"        nbline += 1\n",
"        line = lines[nbline]\n",
"        if line.strip():\n",
"            noms = [l.strip().upper() for l in remove_accents(line.strip()).split(\",\")]\n",
"            infos_scene = {nom: new_entry(nom) for nom in noms}\n",
"        else:\n",
"            infos_scene = {}\n",
"            print(\"No characters listed for scene\",scene,\"acte\",acte)\n",
"    elif speaker_re.search(line):\n",
"        assert acte and scene\n",
"        personnage = remove_accents(speaker_re.search(line).group(1))\n",
"        # The reply is assumed to sit on the single line after the name.\n",
"        nbline += 1\n",
"        line = lines[nbline]\n",
"        assert line.strip() # check line is not empty\n",
"        nombre_de_mots = len(line.split()) # punctuation is assumed to be negligible in the count\n",
"        if personnage not in infos_scene:\n",
"            print(personnage,\"is not listed under scene\",scene,\"acte\",acte,\"but is speaking\")\n",
"            infos_scene[personnage] = new_entry(personnage)\n",
"        infos_scene[personnage]['nombre_de_repliques'] += 1\n",
"        infos_scene[personnage]['nombre_de_mots'] += nombre_de_mots\n",
"    nbline += 1\n",
"df = pd.DataFrame(data)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check that we get the right number of acts and scenes:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"32\n"
]
}
],
"source": [
"%%sh\n",
"grep -c \"## Acte\" moliere_avare.txt\n",
"grep -c \"### Scène\" moliere_avare.txt"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5,)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.acte.value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(31,)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.scene.value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(15,)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.personnage.value_counts().shape"
]
},
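{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DataFrame contains the same 5 acts as the `grep` count, but only 31 scenes where `grep` finds 32 scene headings: the parsing loop stores a scene's entries only when it reaches the next scene heading, so the final scene is never added to `data`. We also find 15 characters, which matches the cast list at the top of the document (14 entries, the last of which names two characters, the commissaire and his clerk)."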
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n",
" – Cléante, Fils d'Harpagon, Amant de Mariane.\n",
" – Élise, Fille d'Harpagon, Amante de Valère.\n",
" – Valère, Fils d'Anselme, et Amant d'Élise.\n",
" – Mariane, Amante de Cléante, et aimée d'Harpagon.\n",
" – Anselme, Père de Valère et de Mariane.\n",
" – Frosine, Femme d'Intrigue.\n",
" – Maitre Simon, Courtier.\n",
" – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n",
" – La Flèche, Valet de Cléante.\n",
" – Dame Claude, Servante d'Harpagon.\n",
" – Brindavoine, laquais d'Harpagon.\n",
" – La Merluche, laquais d'Harpagon.\n",
" – Le commissaire, et son clerc.\n",
"\n"
]
}
],
"source": [
"print(\"\".join(lines[19:33]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzing the data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's rank the characters by how much they speak, based on the parsed text (scenes / replies / words). In particular, which character speaks the most? Which characters do not speak at all?"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" personnage\n",
"HARPAGON 22\n",
"FROSINE 14\n",
"CLEANTE 14\n",
"ELISE 13\n",
"MARIANE 11\n",
"VALERE 8\n",
"MAITRE JACQUES 8\n",
"LA FLECHE 5\n",
"LE COMMISSAIRE 5\n",
"SON CLERC 5\n",
"BRINDAVOINE 2\n",
"LA MERLUCHE 2\n",
"DAME CLAUDE 1\n",
"ANSELME 1\n",
"MAITRE SIMON 1"
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame(df.personnage.value_counts())"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" nombre_de_mots\n",
"personnage \n",
"DAME CLAUDE 0\n",
"MAITRE JACQUES 0\n",
"MAITRE SIMON 0\n",
"SON CLERC 0\n",
"BRINDAVOINE 38\n",
"LA MERLUCHE 49\n",
"LE COMMISSAIRE 258\n",
"ANSELME 383\n",
"MARIANE 819\n",
"ELISE 893\n",
"LA FLECHE 1419\n",
"FROSINE 2033\n",
"VALERE 2532\n",
"CLEANTE 3046\n",
"HARPAGON 5092"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['nombre_de_mots','personnage']].groupby('personnage').sum().sort_values(by=['nombre_de_mots'])"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" nombre_de_repliques\n",
"personnage \n",
"DAME CLAUDE 0\n",
"MAITRE JACQUES 0\n",
"MAITRE SIMON 0\n",
"SON CLERC 0\n",
"BRINDAVOINE 3\n",
"LA MERLUCHE 5\n",
"ANSELME 14\n",
"LE COMMISSAIRE 15\n",
"MARIANE 26\n",
"ELISE 50\n",
"FROSINE 59\n",
"LA FLECHE 64\n",
"VALERE 99\n",
"CLEANTE 156\n",
"HARPAGON 334"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['nombre_de_repliques','personnage']].groupby('personnage').sum().sort_values(by=['nombre_de_repliques'])"
]
},
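{
"cell_type": "markdown",
"metadata": {},
"source": [
"These tables show that Harpagon appears in the largest number of scenes (22 of the 31 parsed). He also speaks the most words (5092) and has the most replies (334). Dame Claude, Maitre Jacques, Maitre Simon and the clerk are counted with zero words and zero replies. For Maitre Jacques and Maitre Simon, who do have spoken lines in the play, this is most likely a parsing artifact: the speaker pattern only accepts the capitals `A-Z`, `È` and `É`, so a name line containing any other accented capital (for instance `Î`) is skipped."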
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now plot the distribution of the number of words spoken by each character in each scene. To do so, we first extract the columns of interest and reshape them from \"long\" to \"wide\" (2D) format, which is the layout `DataFrame.plot` expects."
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" scene personnage nombre_de_mots\n",
"11 5 VALERE 621\n",
"2 2 CLEANTE 725\n",
"13 6 LA FLECHE 853\n",
"8 4 HARPAGON 1044\n",
"22 10 FROSINE 1234"
]
},
"execution_count": 149,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_wc = df[['scene','personnage','nombre_de_mots']].groupby(['scene','personnage']).sum().reset_index()\n",
"df_wc.sort_values(by='nombre_de_mots').tail()"
]
},
{
"cell_type": "code",
"execution_count": 175,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"personnage scene ANSELME BRINDAVOINE CLEANTE DAME CLAUDE ELISE FROSINE \\\n",
"0 1 0.0 0.0 0.0 0.0 473.0 0.0 \n",
"1 2 0.0 0.0 725.0 0.0 150.0 0.0 \n",
"2 3 0.0 0.0 0.0 0.0 0.0 0.0 \n",
"3 4 0.0 0.0 211.0 0.0 147.0 0.0 \n",
"4 5 0.0 0.0 0.0 0.0 36.0 0.0 \n",
"\n",
"personnage HARPAGON LA FLECHE LA MERLUCHE LE COMMISSAIRE MAITRE JACQUES \\\n",
"0 0.0 0.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 0.0 0.0 0.0 \n",
"2 396.0 242.0 0.0 0.0 0.0 \n",
"3 1044.0 0.0 0.0 0.0 0.0 \n",
"4 238.0 0.0 0.0 0.0 0.0 \n",
"\n",
"personnage MAITRE SIMON MARIANE SON CLERC VALERE \n",
"0 0.0 0.0 0.0 596.0 \n",
"1 0.0 0.0 0.0 0.0 \n",
"2 0.0 0.0 0.0 0.0 \n",
"3 0.0 0.0 0.0 0.0 \n",
"4 0.0 0.0 0.0 621.0 "
]
},
"execution_count": 175,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reshape from long to wide: one row per scene, one column per character.\n",
"df2d_wc = df_wc.pivot(index='scene', columns='personnage', values='nombre_de_mots').fillna(0).reset_index()\n",
"df2d_wc.head()"
]
},
{
"cell_type": "code",
"execution_count": 174,
"metadata": {},
"outputs": [
],
"source": [
"df2d_wc.plot(x = 'scene',\n",
" kind = 'barh',\n",
" stacked = True,\n",
" title = 'Number of words per scene and per character',\n",
" mark_right = True)\n",
"plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now plot the interactions between characters as a network in which nodes are characters and an edge links two characters who appear together in at least one scene. Each edge is weighted by the number of scenes the two characters share."
]
},
{
"cell_type": "code",
"execution_count": 215,
"metadata": {},
"outputs": [],
"source": [
"# edges[p1][p2] = number of scenes in which p1 and p2 both appear.\n",
"# Names are sorted, so each pair is stored once, in alphabetical order.\n",
"edges = {}\n",
"for scene in df.scene.unique():\n",
"    personnages = sorted(list(df[df.scene==scene].personnage))\n",
"    for p1 in range(0,len(personnages)-1):\n",
"        for p2 in range(p1+1,len(personnages)):\n",
"            if personnages[p1] not in edges:\n",
"                edges[personnages[p1]] = {}\n",
"            if personnages[p2] not in edges[personnages[p1]]:\n",
"                edges[personnages[p1]][personnages[p2]] = 0\n",
"            edges[personnages[p1]][personnages[p2]] += 1\n",
"\n",
"# Flatten the nested dictionary into (source, target, weight) rows.\n",
"edges_tuples = []\n",
"for p1,d in edges.items():\n",
"    for p2,count in d.items():\n",
"        edges_tuples.append([p1,p2,count])\n",
"df_edges = pd.DataFrame(edges_tuples, columns=['source','target','weight'])"
]
},
{
"cell_type": "code",
"execution_count": 220,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" source target weight\n",
"8 CLEANTE ELISE 8\n",
"27 FROSINE HARPAGON 10\n",
"1 ELISE HARPAGON 10\n",
"9 CLEANTE HARPAGON 10\n",
"30 FROSINE MARIANE 11"
]
},
"execution_count": 220,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_edges.sort_values(by='weight').tail()"
]
},
{
"cell_type": "code",
"execution_count": 223,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"\n",
"# Build an undirected graph; the 'weight' column becomes an edge attribute.\n",
"g = nx.from_pandas_edgelist(df_edges, edge_attr=True)"
]
},
{
"cell_type": "code",
"execution_count": 233,
"metadata": {},
"outputs": [
],
"source": [
"nx.draw(g, with_labels = True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}