First part: retrieving the data and arranging it into a table

parent 588dc437
{
"cells": [],
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Subject 5 - Analysis of the dialogues in Molière's L'Avare"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Retrieving the data"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Observatoire de la vie littéraire (OBVIL) promotes a digital approach to the analysis of literary texts. As part of the Projet Molière, plays by this author have been digitized and are freely available in several machine-readable formats. Here we use the text in Markdown format, available [here](http://dramacode.github.io/markdown/moliere_avare.txt)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data_url = 'http://dramacode.github.io/markdown/moliere_avare.txt'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For better reproducibility, we first download the data into this GitLab repository and then read the local file rather than the URL directly."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data_file = \"moliere_avare.txt\"\n",
"\n",
"import os\n",
"import urllib.request\n",
"if not os.path.exists(data_file):\n",
"    urllib.request.urlretrieve(data_url, data_file)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us look at the first lines of this file:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"%load_ext rpy2.ipython"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---\n",
"identifier: moliere_avare \n",
"creator: Molière. \n",
"date: 1668 \n",
"title: L'Avare. Comédie \n",
"---\n",
"\n",
"\n",
"L'AVARE,\n",
"\n",
"COMÉDIE.\n",
"\n",
"Par J.B.P. MOLIÈRE.\n",
"\n",
"À PARIS, Chez JEAN RIBOU, au Palais, vis à vis la Porte de l'Église de la Sainte Chapelle, à l'Image Saint-Louis. M. DC. LXIX. *AVEC PRIVILÈGE DU ROI*\n",
"\n",
"\n",
"\n",
"# ACTEURS.\n",
" – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n",
" – Cléante, Fils d'Harpagon, Amant de Mariane.\n",
" – Élise, Fille d'Harpagon, Amante de Valère.\n",
" – Valère, Fils d'Anselme, et Amant d'Élise.\n",
" – Mariane, Amante de Cléante, et aimée d'Harpagon.\n",
" – Anselme, Père de Valère et de Mariane.\n",
" – Frosine, Femme d'Intrigue.\n",
" – Maitre Simon, Courtier.\n",
" – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n",
" – La Flèche, Valet de Cléante.\n",
" – Dame Claude, Servante d'Harpagon.\n",
" – Brindavoine, laquais d'Harpagon.\n",
" – La Merluche, laquais d'Harpagon.\n",
" – Le commissaire, et son clerc.\n",
"La Scène est à Paris.\n",
"\n",
"\n",
"\n",
"# L'Avare, *Comédie.*.\n",
"\n",
"\n",
"## Acte Premier.\n",
"\n",
"\n",
"### Scène Première.\n",
"Valère, Élise\n",
"\n",
"\n",
" VALÈRE.\n",
"Hé quoi, charmante Élise, vous devenez mélancolique, après les obligeantes assurances que vous avez eu la bonté de me donner de votre foi ?Je vous vois soupirer, hélas, au milieu de ma joie !Est-ce du regret, dites-moi, de m'avoir fait heureux ? et vous repentez-vous de cet engagement où mes feux ont pu vous contraindre ?\n",
"\n",
" ÉLISE.\n",
"Non, Valère, je ne puis pas me repentir de tout ce que je fais pour vous. Je m'y sens entraîner par une trop douce puissance, et je n'ai pas même la force de souhaiter que les choses ne fussent pas. Mais, à vous dire vrai, le succès me donne de l'inquiétude ; et je crains fort de vous aimer un peu plus que je ne devrais.\n",
"\n",
" VALÈRE.\n",
"Hé que pouvez-vous craindre, Élise, dans les bontés que vous avez pour moi ?\n"
]
}
],
"source": [
"%%sh\n",
"head -n 55 moliere_avare.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we can see, acts are marked by a double hash (`##`) at the start of a line and scenes by a triple hash (`###`). Each speech appears to span two lines: the first gives the character's name in capitals, the second gives what the character says.\n",
"\n",
"We will try to rearrange the data into a table like this one:\n",
"\n",
"| Character | Act | Scene | Word count | Number of speeches |\n",
"|:-------|:------|:-------|:------------|:----------------|\n",
"| name of the character | act in which the character appears | scene in which the character appears | number of words the character speaks | number of speeches the character delivers |\n",
"\n",
"We will write a function that replaces accented characters with their unaccented equivalents:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"import unicodedata\n",
"\n",
"def remove_accents(input_str):\n",
"    nfkd_form = unicodedata.normalize('NFKD', input_str)\n",
"    return u\"\".join([c for c in nfkd_form if not unicodedata.combining(c)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the messages printed by the next cell show, there are a few discrepancies between the list of characters given below each scene heading and the characters who actually speak in the dialogue."
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"VALERE is not listed under scene 19 acte 3 but is speaking\n",
"No characters listed for scene 26 acte 4\n",
"HARPAGON is not listed under scene 26 acte 4 but is speaking\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>acte</th>\n",
" <th>nombre_de_mots</th>\n",
" <th>nombre_de_repliques</th>\n",
" <th>personnage</th>\n",
" <th>scene</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>596</td>\n",
" <td>8</td>\n",
" <td>VALERE</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>473</td>\n",
" <td>8</td>\n",
" <td>ELISE</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>725</td>\n",
" <td>10</td>\n",
" <td>CLEANTE</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>150</td>\n",
" <td>9</td>\n",
" <td>ELISE</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1</td>\n",
" <td>396</td>\n",
" <td>34</td>\n",
" <td>HARPAGON</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>1</td>\n",
" <td>242</td>\n",
" <td>32</td>\n",
" <td>LA FLECHE</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" <td>147</td>\n",
" <td>23</td>\n",
" <td>ELISE</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>1</td>\n",
" <td>211</td>\n",
" <td>29</td>\n",
" <td>CLEANTE</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>1</td>\n",
" <td>1044</td>\n",
" <td>53</td>\n",
" <td>HARPAGON</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>1</td>\n",
" <td>621</td>\n",
" <td>22</td>\n",
" <td>VALERE</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>1</td>\n",
" <td>238</td>\n",
" <td>20</td>\n",
" <td>HARPAGON</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>1</td>\n",
" <td>36</td>\n",
" <td>4</td>\n",
" <td>ELISE</td>\n",
" <td>5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>2</td>\n",
" <td>350</td>\n",
" <td>21</td>\n",
" <td>CLEANTE</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>2</td>\n",
" <td>853</td>\n",
" <td>20</td>\n",
" <td>LA FLECHE</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE SIMON</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>2</td>\n",
" <td>159</td>\n",
" <td>9</td>\n",
" <td>HARPAGON</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>2</td>\n",
" <td>126</td>\n",
" <td>6</td>\n",
" <td>CLEANTE</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>2</td>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>LA FLECHE</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>18</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>FROSINE</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>19</th>\n",
" <td>2</td>\n",
" <td>21</td>\n",
" <td>1</td>\n",
" <td>HARPAGON</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20</th>\n",
" <td>2</td>\n",
" <td>281</td>\n",
" <td>6</td>\n",
" <td>LA FLECHE</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>21</th>\n",
" <td>2</td>\n",
" <td>116</td>\n",
" <td>5</td>\n",
" <td>FROSINE</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22</th>\n",
" <td>2</td>\n",
" <td>520</td>\n",
" <td>35</td>\n",
" <td>HARPAGON</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23</th>\n",
" <td>2</td>\n",
" <td>1234</td>\n",
" <td>35</td>\n",
" <td>FROSINE</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>24</th>\n",
" <td>3</td>\n",
" <td>557</td>\n",
" <td>34</td>\n",
" <td>HARPAGON</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25</th>\n",
" <td>3</td>\n",
" <td>74</td>\n",
" <td>3</td>\n",
" <td>CLEANTE</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>26</th>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>ELISE</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>27</th>\n",
" <td>3</td>\n",
" <td>247</td>\n",
" <td>11</td>\n",
" <td>VALERE</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>28</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>DAME CLAUDE</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>29</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE JACQUES</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>82</th>\n",
" <td>4</td>\n",
" <td>11</td>\n",
" <td>1</td>\n",
" <td>HARPAGON</td>\n",
" <td>26</td>\n",
" </tr>\n",
" <tr>\n",
" <th>83</th>\n",
" <td>5</td>\n",
" <td>87</td>\n",
" <td>6</td>\n",
" <td>HARPAGON</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>84</th>\n",
" <td>5</td>\n",
" <td>104</td>\n",
" <td>7</td>\n",
" <td>LE COMMISSAIRE</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>85</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SON CLERC</td>\n",
" <td>27</td>\n",
" </tr>\n",
" <tr>\n",
" <th>86</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE JACQUES</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>87</th>\n",
" <td>5</td>\n",
" <td>172</td>\n",
" <td>19</td>\n",
" <td>HARPAGON</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>88</th>\n",
" <td>5</td>\n",
" <td>154</td>\n",
" <td>8</td>\n",
" <td>LE COMMISSAIRE</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>89</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SON CLERC</td>\n",
" <td>28</td>\n",
" </tr>\n",
" <tr>\n",
" <th>90</th>\n",
" <td>5</td>\n",
" <td>611</td>\n",
" <td>30</td>\n",
" <td>VALERE</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91</th>\n",
" <td>5</td>\n",
" <td>434</td>\n",
" <td>30</td>\n",
" <td>HARPAGON</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>92</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LE COMMISSAIRE</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>93</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SON CLERC</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE JACQUES</td>\n",
" <td>29</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95</th>\n",
" <td>5</td>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>ELISE</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>96</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MARIANE</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>97</th>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>FROSINE</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98</th>\n",
" <td>5</td>\n",
" <td>124</td>\n",
" <td>4</td>\n",
" <td>HARPAGON</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>99</th>\n",
" <td>5</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>VALERE</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>100</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE JACQUES</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LE COMMISSAIRE</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>102</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SON CLERC</td>\n",
" <td>30</td>\n",
" </tr>\n",
" <tr>\n",
" <th>103</th>\n",
" <td>5</td>\n",
" <td>383</td>\n",
" <td>14</td>\n",
" <td>ANSELME</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>5</td>\n",
" <td>245</td>\n",
" <td>11</td>\n",
" <td>HARPAGON</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>105</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>ELISE</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106</th>\n",
" <td>5</td>\n",
" <td>190</td>\n",
" <td>3</td>\n",
" <td>MARIANE</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>107</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>FROSINE</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>108</th>\n",
" <td>5</td>\n",
" <td>346</td>\n",
" <td>14</td>\n",
" <td>VALERE</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>109</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>MAITRE JACQUES</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>LE COMMISSAIRE</td>\n",
" <td>31</td>\n",
" </tr>\n",
" <tr>\n",
" <th>111</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SON CLERC</td>\n",
" <td>31</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>112 rows × 5 columns</p>\n",
"</div>"
],
"text/plain": [
" acte nombre_de_mots nombre_de_repliques personnage scene\n",
"0 1 596 8 VALERE 1\n",
"1 1 473 8 ELISE 1\n",
"2 1 725 10 CLEANTE 2\n",
"3 1 150 9 ELISE 2\n",
"4 1 396 34 HARPAGON 3\n",
"5 1 242 32 LA FLECHE 3\n",
"6 1 147 23 ELISE 4\n",
"7 1 211 29 CLEANTE 4\n",
"8 1 1044 53 HARPAGON 4\n",
"9 1 621 22 VALERE 5\n",
"10 1 238 20 HARPAGON 5\n",
"11 1 36 4 ELISE 5\n",
"12 2 350 21 CLEANTE 6\n",
"13 2 853 20 LA FLECHE 6\n",
"14 2 0 0 MAITRE SIMON 7\n",
"15 2 159 9 HARPAGON 7\n",
"16 2 126 6 CLEANTE 7\n",
"17 2 13 1 LA FLECHE 7\n",
"18 2 1 1 FROSINE 8\n",
"19 2 21 1 HARPAGON 8\n",
"20 2 281 6 LA FLECHE 9\n",
"21 2 116 5 FROSINE 9\n",
"22 2 520 35 HARPAGON 10\n",
"23 2 1234 35 FROSINE 10\n",
"24 3 557 34 HARPAGON 11\n",
"25 3 74 3 CLEANTE 11\n",
"26 3 3 1 ELISE 11\n",
"27 3 247 11 VALERE 11\n",
"28 3 0 0 DAME CLAUDE 11\n",
"29 3 0 0 MAITRE JACQUES 11\n",
".. ... ... ... ... ...\n",
"82 4 11 1 HARPAGON 26\n",
"83 5 87 6 HARPAGON 27\n",
"84 5 104 7 LE COMMISSAIRE 27\n",
"85 5 0 0 SON CLERC 27\n",
"86 5 0 0 MAITRE JACQUES 28\n",
"87 5 172 19 HARPAGON 28\n",
"88 5 154 8 LE COMMISSAIRE 28\n",
"89 5 0 0 SON CLERC 28\n",
"90 5 611 30 VALERE 29\n",
"91 5 434 30 HARPAGON 29\n",
"92 5 0 0 LE COMMISSAIRE 29\n",
"93 5 0 0 SON CLERC 29\n",
"94 5 0 0 MAITRE JACQUES 29\n",
"95 5 10 1 ELISE 30\n",
"96 5 0 0 MARIANE 30\n",
"97 5 4 1 FROSINE 30\n",
"98 5 124 4 HARPAGON 30\n",
"99 5 20 1 VALERE 30\n",
"100 5 0 0 MAITRE JACQUES 30\n",
"101 5 0 0 LE COMMISSAIRE 30\n",
"102 5 0 0 SON CLERC 30\n",
"103 5 383 14 ANSELME 31\n",
"104 5 245 11 HARPAGON 31\n",
"105 5 0 0 ELISE 31\n",
"106 5 190 3 MARIANE 31\n",
"107 5 0 0 FROSINE 31\n",
"108 5 346 14 VALERE 31\n",
"109 5 0 0 MAITRE JACQUES 31\n",
"110 5 0 0 LE COMMISSAIRE 31\n",
"111 5 0 0 SON CLERC 31\n",
"\n",
"[112 rows x 5 columns]"
]
},
"execution_count": 84,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"acte = 0\n",
"scene = 0\n",
"infos_scene = {}\n",
"\n",
"data = []\n",
"with open(data_file) as f:\n",
"    lines = f.readlines()\n",
"nbline = 0\n",
"while nbline < len(lines):\n",
"    line = lines[nbline]\n",
"    if line.startswith(\"## Acte\"):\n",
"        acte += 1\n",
"    elif line.startswith(\"### Sc\"):\n",
"        if infos_scene:\n",
"            data += infos_scene.values()\n",
"        scene += 1\n",
"        nbline += 1\n",
"        line = lines[nbline]\n",
"        if line.strip():\n",
"            infos_scene = {l.strip().upper():{'acte':acte,'scene':scene,'personnage':l.strip().upper(),'nombre_de_mots':0,'nombre_de_repliques':0} for l in remove_accents(line.strip()).split(\",\")}\n",
"        else:\n",
"            infos_scene = {}\n",
"            print(\"No characters listed for scene\",scene,\"acte\",acte)\n",
"    elif re.search(r\"^ [A-ZÈÉ ]+.$\",line):\n",
"        assert acte and scene\n",
"        personnage = remove_accents(re.search(r\"^ ([A-ZÈÉ ]+).$\",line).groups()[0])\n",
"        nbline += 1\n",
"        line = lines[nbline]\n",
"        assert line.strip() # check line is not empty\n",
"        nombre_de_mots = len(line.split()) # we assume punctuation has a negligible effect on the count\n",
"        if personnage not in infos_scene:\n",
"            print(personnage,\"is not listed under scene\",scene,\"acte\",acte,\"but is speaking\")\n",
"            infos_scene[personnage] = {'acte':acte,'scene':scene,'personnage':personnage,'nombre_de_mots':0,'nombre_de_repliques':0}\n",
"        infos_scene[personnage]['nombre_de_repliques'] += 1\n",
"        infos_scene[personnage]['nombre_de_mots'] += nombre_de_mots\n",
"    nbline += 1\n",
"df = pd.DataFrame(data)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us check that we have the right number of acts and scenes:"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"32\n"
]
}
],
"source": [
"%%sh\n",
"grep -c \"## Acte\" moliere_avare.txt\n",
"grep -c \"### Scène\" moliere_avare.txt"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(5,)"
]
},
"execution_count": 86,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.acte.value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(31,)"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.scene.value_counts().shape"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(15,)"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.personnage.value_counts().shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
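"The number of acts (5) matches the count obtained with `grep`. The scene counts differ, however: `grep` finds 32 scene headings, while the table contains only 31 scenes. The parsing loop appends a scene's entries to `data` only when it reaches the next scene heading, so the entries of the final scene are never added. The table also contains 15 distinct character names, the same number as in the cast list at the top of the document (14 entries, the last of which names two characters)."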
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" – Harpagon, Père de Cléante et d'Élise, et Amoureux de Mariane.\n",
" – Cléante, Fils d'Harpagon, Amant de Mariane.\n",
" – Élise, Fille d'Harpagon, Amante de Valère.\n",
" – Valère, Fils d'Anselme, et Amant d'Élise.\n",
" – Mariane, Amante de Cléante, et aimée d'Harpagon.\n",
" – Anselme, Père de Valère et de Mariane.\n",
" – Frosine, Femme d'Intrigue.\n",
" – Maitre Simon, Courtier.\n",
" – Maitre Jacques, Cuisinier et Cocher d'Harpagon.\n",
" – La Flèche, Valet de Cléante.\n",
" – Dame Claude, Servante d'Harpagon.\n",
" – Brindavoine, laquais d'Harpagon.\n",
" – La Merluche, laquais d'Harpagon.\n",
" – Le commissaire, et son clerc.\n",
"\n"
]
}
],
"source": [
"print(\"\".join(lines[19:33]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analyzing the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
......@@ -16,10 +998,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
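
The notebook's parsing cell collects a scene's entries only when the next scene heading is reached, which is why `grep` reports 32 scenes and the table only 31. The sketch below shows one possible way to restructure that loop so the last scene is flushed as well. It is not the author's code: `parse_play`, `new_record`, and `SPEAKER` are hypothetical names, and the speaker pattern (leading whitespace, upper-case name, final period) is an assumption based on the excerpt printed by `head`.

```python
import re
import unicodedata


def remove_accents(text):
    """Drop combining marks after NFKD decomposition (same idea as the notebook)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Assumption: a speaker line is indented, upper-case, and ends with a period.
SPEAKER = re.compile(r"^\s+([A-ZÈÉ ]+)\.$")


def new_record(acte, scene, personnage):
    """Build an empty counter record for one character in one scene."""
    return {"acte": acte, "scene": scene, "personnage": personnage,
            "nombre_de_mots": 0, "nombre_de_repliques": 0}


def parse_play(lines):
    """Return one record per (scene, character). Hypothetical helper."""
    data, infos_scene = [], {}
    acte = scene = 0
    i = 0
    while i < len(lines):
        line = lines[i].rstrip("\n")
        if line.startswith("## Acte"):
            acte += 1
        elif line.startswith("### Sc"):
            data.extend(infos_scene.values())  # flush the previous scene
            scene += 1
            i += 1  # the line after the heading lists the characters
            listed = lines[i].strip() if i < len(lines) else ""
            names = [remove_accents(n).strip().upper()
                     for n in listed.split(",") if n.strip()]
            infos_scene = {n: new_record(acte, scene, n) for n in names}
        else:
            match = SPEAKER.match(line)
            if match and i + 1 < len(lines):
                personnage = remove_accents(match.group(1)).strip()
                i += 1  # the speech is on the line after the speaker's name
                record = infos_scene.setdefault(
                    personnage, new_record(acte, scene, personnage))
                record["nombre_de_repliques"] += 1
                record["nombre_de_mots"] += len(lines[i].split())
        i += 1
    data.extend(infos_scene.values())  # flush the last scene too
    return data
```

With the notebook's variables, `pd.DataFrame(parse_play(lines))` would build the same kind of table. The scene count should then be compared again with the `grep` result rather than assumed to match.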