diff --git a/data-analysis-visualization/Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd b/data-analysis-visualization/Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd
new file mode 100644
index 0000000000000000000000000000000000000000..3f0d1750ac79df2521faa9cec3bf1a744aaecaa0
--- /dev/null
+++ b/data-analysis-visualization/Firstname-data-analysis/03_Names-Methodo2022-exercise.Rmd
@@ -0,0 +1,89 @@
+---
+title: "French given names per year per department"
+author: "Lucas Mello Schnorr, Jean-Marc Vincent"
+date: "October, 2022"
+output:
+  pdf_document: default
+  html_document:
+    df_print: paged
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+# The problem context
+The aim of this activity is to develop a methodology for answering a specific question about a given dataset.
+
+The dataset contains the first names given in France over a long period of time:
+[https://www.insee.fr/fr/statistiques/2540004](https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip). We chose this dataset because it is large enough that the analysis cannot be done by hand, yet its structure is simple.
+
+
+You need to use the _tidyverse_ for this analysis. Unzip the file _dpt2021_csv.zip_ (to get **dpt2021.csv**) and read it into R with the code below. Note that you might need to install the `readr` package with the appropriate command.
+
+## Download Raw Data from the website
+```{r}
+file <- "dpt2021_csv.zip"
+if (!file.exists(file)) {
+  download.file("https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip",
+                destfile = file)
+}
+unzip(file)
+```
+Check that your file is identical to the one used in the first analysis (reproducibility):
+```{bash}
+md5 dpt2021.csv  # macOS command; on Linux, use: md5sum dpt2021.csv
+```
+Expected output:
+MD5 (dpt2021.csv) = f18a7d627883a0b248a0d59374f3bab7
+
+## Build the Dataframe from file
+
+```{r}
+# The tidyverse package didn't want to install, so base R's read.csv is used here
+
+mydata <- read.csv("dpt2021.csv", sep = ";")
+FirstNames <- data.frame(mydata)
+head(FirstNames)
+```
+
+All of the following questions may require a preliminary analysis of the data. Feel free to present answers and justifications in your own order, and structure your report as a scientific report should be.
+
+### 1. Choose a first name and analyse its frequency over time.
+The chosen first name is NOUR, my own first name!
+```{r}
+library(dplyr)
+freq <- FirstNames %>% filter(preusuel == "NOUR") %>% summarise(Firstname = "NOUR", frequency = sum(nombre))  # total count over all years and departments
+print(freq)
+
+# Compare with the total counts of several other first names (the first 20 distinct names)
+unique_names <- FirstNames %>% group_by(preusuel) %>% summarise()
+for (fname in unique_names$preusuel[1:20]) {
+  df <- FirstNames %>% filter(preusuel == fname) %>% summarise(Firstname = fname, frequency = sum(nombre))
+  freq <- rbind(freq, df)
+}
+freq
+```
+
+
+### 2. Establish by gender the most given first name by year. Analyse the evolution of the most frequent first name.
+```{r}
+grouped_data <- FirstNames %>% filter(preusuel != "_PRENOMS_RARES", annais != "XXXX") %>% group_by(sexe, annais, preusuel) %>% summarise(nb = sum(nombre), .groups = "drop") %>% group_by(sexe, annais) %>% filter(nb == max(nb)) %>% rename(fname = preusuel)  # sum over departments first, then keep the top name per sex and year
+print(grouped_data, n = 1000)  # assumes "_PRENOMS_RARES" and "XXXX" are the file's placeholders for rare names and unknown years
+```
+
+### 3. Optional: Which department has the largest variety of names over time? Is there some sort of geographical correlation in the data?
+```{r}
+# Variety of names over time: number of distinct names per department
+department <- FirstNames %>% group_by(dpt) %>% summarise(nb = n_distinct(preusuel))
+print(department, n = 101)
+```
+The department with the largest variety of names over time is:
+```{r}
+dep <- filter(department, nb == max(nb))
+dep
+```
+```{r fig.width = 11}
+library(ggplot2)
+ggplot(data = department, aes(x = dpt, y = nb)) + geom_point() + theme_bw() + geom_smooth(method = "lm")
+```
+According to the graph, there is no apparent geographical correlation in the data, because the linear regression line is almost flat.
\ No newline at end of file