Upload New File

aaa2485e · NourElh · f543716b · aaa2485e
Commit aaa2485e authored Nov 05, 2022 by NourElh

Showing with 89 additions and 0 deletions

03_Names-Methodo2022-exercise.Rmd ...Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd +89 -0

--- a/data-analysis-visualization/Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd
+++ b/data-analysis-visualization/Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd
+---
+title: "French given names per year per department"
+author: "Lucas Mello Schnorr, Jean-Marc Vincent"
+date: "October, 2022"
+output:
+  pdf_document: default
+  html_document:
+    df_print: paged
+---
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+# The problem context
+The aim of this activity is to develop a methodology for answering a specific question about a given dataset.
+The dataset is the set of first names given in France over a long period of time.
+It is available from [https://www.insee.fr/fr/statistiques/2540004](https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip). We chose this dataset because it is large enough that the analysis cannot be done by hand, while its structure remains simple.
+You need to use the _tidyverse_ for this analysis. Unzip the file _dpt2021_csv.zip_ (to get **dpt2021.csv**), then read it into R with the code below. Note that you might need to install the `readr` package with the appropriate command.
+## Download Raw Data from the website
+```{r}
+file <- "dpt2021_csv.zip"
+if (!file.exists(file)) {
+  download.file("https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip",
+                destfile = file)
+}
+unzip(file)
+```
+Check that your file is identical to the one used in the first analysis (reproducibility). The `md5` command is available on macOS; on Linux, use `md5sum` instead.
+```{bash}
+md5 dpt2021.csv
+```
+Expected output:
+`MD5 (dpt2021.csv) = f18a7d627883a0b248a0d59374f3bab7`
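+
+The same check can be done from R itself, which avoids depending on a platform-specific shell command. The following is a minimal sketch: `check_md5` is a hypothetical helper (not part of the original exercise) built on `tools::md5sum()`, demonstrated here on a temporary file with arbitrary content.
+```{r}
+# Hypothetical helper: TRUE if the MD5 checksum of `path` equals `expected`
+check_md5 <- function(path, expected) {
+  actual <- unname(tools::md5sum(path))
+  identical(actual, tolower(expected))
+}
+# Demonstration on a temporary file, so the chunk runs without the dataset
+tmp <- tempfile(fileext = ".csv")
+writeLines("some;arbitrary;content", tmp)
+reference <- unname(tools::md5sum(tmp))
+check_md5(tmp, reference)                           # the file matches its own checksum
+check_md5(tmp, "00000000000000000000000000000000")  # a wrong checksum is rejected
+```
+For the dataset, the call would be `check_md5("dpt2021.csv", "f18a7d627883a0b248a0d59374f3bab7")`.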
+## Build the Dataframe from file
+```{r}
+# The tidyverse could not be installed here, so base R's read.csv() is used instead of readr.
+# read.csv() already returns a data frame, so no extra conversion is needed.
+FirstNames <- read.csv("dpt2021.csv", sep = ";")
+head(FirstNames)
+```
+All of the following questions may require a preliminary analysis of the data. Feel free to present answers and justifications in your own order, and structure your report as a scientific report should be.
+### 1. Choose a first name and analyse its frequency over time.
+The chosen first name is NOUR, my own first name!
+```{r}
+library(dplyr)
+# Total number of births named NOUR over the whole period and all departments
+freq <- FirstNames %>%
+  filter(preusuel == "NOUR") %>%
+  summarise(Firstname = "NOUR", frequency = sum(nombre))
+print(freq)
+# Compare with the totals of the first 20 distinct names
+others <- FirstNames %>%
+  group_by(preusuel) %>%
+  summarise(frequency = sum(nombre)) %>%
+  rename(Firstname = preusuel) %>%
+  head(20)
+freq <- bind_rows(freq, others)
+freq
+``` 
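+
+Because the question concerns frequency over time, per-year counts are also useful alongside the overall total. The sketch below uses a small made-up data frame (`toy`, with the column names used in this report; the values are invented for illustration) and sums `nombre` over departments for each year.
+```{r}
+library(dplyr)
+# Toy data, invented for illustration only
+toy <- data.frame(
+  sexe     = c(2, 2, 2, 2),
+  preusuel = c("NOUR", "NOUR", "NOUR", "ANNA"),
+  annais   = c(2000, 2000, 2001, 2000),
+  dpt      = c("38", "75", "38", "38"),
+  nombre   = c(3, 5, 4, 10)
+)
+# One row per year: total count of NOUR across all departments
+per_year <- toy %>%
+  filter(preusuel == "NOUR") %>%
+  group_by(annais) %>%
+  summarise(frequency = sum(nombre))
+per_year
+```
+On the real data, replacing `toy` with `FirstNames` gives a yearly series that can be plotted against `annais`.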
+### 2. Establish, by gender, the most given first name for each year. Analyse the evolution of the most frequent first name.
+```{r}
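+# Each row is a per-department count: sum over departments first, then keep the top name(s) per sex and year
+totals <- FirstNames %>% group_by(sexe, annais, preusuel) %>% summarise(nb = sum(nombre), .groups = "drop")
+grouped_data <- totals %>% group_by(sexe, annais) %>% filter(nb == max(nb)) %>% ungroup() %>% rename(fname = preusuel)
+print(grouped_data, n = 1000)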
+```
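+
+A point worth checking here is that each row of the dataset is a count for one department, so the most given name of a year is the one with the largest total over all departments, not the one with the largest single row. A minimal sketch on invented data (`toy` and its values are assumptions for illustration):
+```{r}
+library(dplyr)
+# Toy data: LEA has the largest single row (6), but EMMA has the largest total (4 + 4 = 8)
+toy <- data.frame(
+  sexe     = c(2, 2, 2),
+  preusuel = c("LEA", "EMMA", "EMMA"),
+  annais   = c(2000, 2000, 2000),
+  dpt      = c("38", "38", "75"),
+  nombre   = c(6, 4, 4)
+)
+# Row-level maximum: picks the largest single department count
+by_row <- toy %>% group_by(sexe, annais) %>% filter(nombre == max(nombre))
+# Total over departments first, then the maximum per sex and year
+by_total <- toy %>%
+  group_by(sexe, annais, preusuel) %>%
+  summarise(nb = sum(nombre), .groups = "drop") %>%
+  group_by(sexe, annais) %>%
+  filter(nb == max(nb))
+by_row$preusuel    # "LEA"
+by_total$preusuel  # "EMMA"
+```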
+### 3. Optional: Which department has the largest variety of names over time? Is there some sort of geographical correlation in the data?
+```{r}
+# Number of distinct names recorded in each department over the whole period
+department <- FirstNames %>% group_by(dpt) %>% summarise(nb = n_distinct(preusuel))
+print(department, n = 101)
+```
+The department with the largest variety of names over time is:
+```{r}
+dep <- filter(department, nb==max(nb))
+dep
+```
+```{r fig.width = 11}
+library(ggplot2)
+ggplot(data = department, aes(x=dpt, y=nb)) + geom_point() + theme_bw() + geom_smooth(method="lm")
+```
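+According to the graph, there is no apparent geographical correlation in the data: the linear regression line (`method = "lm"`) is almost flat. Note that the department code is only a rough proxy for geographical position, so this conclusion remains tentative.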
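+
+To put a number on how flat the line is, the slope of the fitted model can be inspected directly. This is a sketch on invented values (`toy`, `code`, and the numbers are assumptions, with `code` standing in for a numeric department identifier), using base R's `lm()`, the same kind of linear model that `geom_smooth(method = "lm")` fits.
+```{r}
+# Toy data, invented for illustration only
+toy <- data.frame(code = 1:5, nb = c(10, 12, 9, 11, 10))
+# Linear model of the number of distinct names against the department code
+fit <- lm(nb ~ code, data = toy)
+slope <- unname(coef(fit)["code"])
+slope  # change in the number of distinct names per unit of `code`
+# The coefficient table also gives the standard error and p-value of the slope
+summary(fit)$coefficients
+```
+A slope that is small compared with the spread of `nb`, together with a large p-value, would support the reading that there is no linear trend.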
\ No newline at end of file