---
title: "French given names per year per department"
author: "Lucas Mello Schnorr, Jean-Marc Vincent"
date: "October, 2022"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# The problem context

The aim of this activity is to develop a methodology to answer a specific question on a given dataset.

The dataset is the set of first names given in France over a long period of time: [https://www.insee.fr/fr/statistiques/2540004](https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip). We chose this dataset because it is large enough that the analysis cannot be done by hand, and because its structure is simple.

You need to use the _tidyverse_ for this analysis. Unzip the file _dpt2021_csv.zip_ (to get **dpt2021.csv**) and read it in R with the code below. Note that you might need to install the `readr` package with the appropriate command.

## Download Raw Data from the website

```{r}
file <- "dpt2021_csv.zip"
# Download the archive only if it is not already present locally
if (!file.exists(file)) {
  download.file("https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip",
                destfile = file)
}
unzip(file)
```

Check that your file is identical to the one used in the first analysis (reproducibility):

```{bash}
# macOS/BSD command; on Linux use: md5sum dpt2021.csv
md5 dpt2021.csv
```

Expected: `MD5 (dpt2021.csv) = f18a7d627883a0b248a0d59374f3bab7`

## Build the Dataframe from file

```{r}
# The tidyverse package failed to install here, so base R read.csv() is used
mydata <- read.csv("dpt2021.csv", sep = ";")
FirstNames <- data.frame(mydata)
head(FirstNames)
```

All of the following questions may need a preliminary analysis of the data. Feel free to present answers and justifications in your own order, and structure your report as a scientific report should be.

### 1. Choose a firstname and analyse its frequency along time.

The chosen first name is NOUR, my own first name!

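
Since the question asks about frequency along time, a per-year view can help alongside the overall totals. The sketch below is only an illustration on made-up toy data that mimics the columns of the file (`sexe`, `preusuel`, `annais`, `dpt`, `nombre`); the `toy` and `yearly` names are hypothetical, and it assumes that unknown birth years are encoded as `"XXXX"` in the real file.

```r
library(dplyr)

# Toy data with the same columns as dpt2021.csv (values are made up)
toy <- data.frame(
  sexe     = c(2, 2, 2, 2, 1, 2),
  preusuel = c("NOUR", "NOUR", "NOUR", "NOUR", "LUCAS", "NOUR"),
  annais   = c("2019", "2019", "2020", "XXXX", "2020", "2020"),
  dpt      = c("75", "69", "75", "XX", "75", "13"),
  nombre   = c(10, 5, 12, 3, 20, 4)
)

# Yearly total for one name: drop rows with an unknown year (assumed "XXXX"),
# then sum the counts over all departments for each year
yearly <- toy %>%
  filter(preusuel == "NOUR", annais != "XXXX") %>%
  mutate(annais = as.integer(annais)) %>%
  group_by(annais) %>%
  summarise(frequency = sum(nombre), .groups = "drop")

print(yearly)
```

Replacing `toy` with `FirstNames` should give one row per year for the chosen name, which could then be plotted against `annais`.
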
```{r}
library(dplyr)

# Total number of births registered as NOUR (all years, all departments)
freq <- FirstNames %>%
  filter(preusuel == "NOUR") %>%
  summarise(Firstname = "NOUR", frequency = sum(nombre))
print(freq)

# Compare with the total frequency of several other first names
unique_names <- FirstNames %>%
  group_by(preusuel) %>%
  summarise()
for (fname in unique_names$preusuel[1:20]) {
  df <- FirstNames %>%
    filter(preusuel == fname) %>%
    summarise(Firstname = fname, frequency = sum(nombre))
  freq <- rbind(freq, df)
}
freq
```

### 2. Establish by gender the most given firstname by year. Analyse the evolution of the most frequent firstname.

```{r}
# Counts are per department: sum them to national totals first, then keep
# the top name per (gender, year). "_PRENOMS_RARES" and "XXXX" are assumed
# to be the placeholders for rare names and unknown years.
grouped_data <- FirstNames %>%
  filter(preusuel != "_PRENOMS_RARES", annais != "XXXX") %>%
  group_by(sexe, annais, preusuel) %>%
  summarise(nb = sum(nombre), .groups = "drop") %>%
  group_by(sexe, annais) %>%
  slice_max(nb, n = 1) %>%
  rename(fname = preusuel)
print(grouped_data, n = 1000)
```

### 3. Optional: Which department has a larger variety of names along time? Is there some sort of geographical correlation with the data?

```{r}
# Variety of names over time: number of distinct names per department
department <- FirstNames %>%
  group_by(dpt) %>%
  summarise(nb = n_distinct(preusuel))
print(department, n = 101)
```

The department with the largest variety of names over time is:

```{r}
dep <- filter(department, nb == max(nb))
dep
```

```{r fig.width=11}
library(ggplot2)
ggplot(data = department, aes(x = dpt, y = nb)) +
  geom_point() +
  theme_bw() +
  geom_smooth(method = "lm")
```

According to the graph, there is no apparent geographical correlation in the data, because the linear regression line is almost flat. Note that department codes are administrative labels rather than geographic coordinates, so this is only a rough indication.
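
To back the visual impression of an "almost constant" line with numbers, the slope and R-squared of the linear fit can be inspected directly. The sketch below is a minimal illustration on made-up values shaped like the `department` table; `toy_department`, `idx`, `fit`, `slope` and `r2` are hypothetical names, and ranking the department codes to obtain a numeric axis is an assumption of this sketch, not something the analysis above does.

```r
# Toy table shaped like `department` (values are made up)
toy_department <- data.frame(
  dpt = c("01", "02", "2A", "75", "971"),
  nb  = c(1200, 1150, 900, 3000, 1100)
)

# Department codes are labels, so use their rank (sorted order) as a numeric x axis
toy_department <- toy_department[order(toy_department$dpt), ]
toy_department$idx <- seq_len(nrow(toy_department))

# Fit a simple linear regression of name variety on the rank of the code
fit   <- lm(nb ~ idx, data = toy_department)
slope <- coef(fit)[["idx"]]
r2    <- summary(fit)$r.squared

cat("slope:", slope, " R-squared:", r2, "\n")
```

A slope close to zero together with a small R-squared would be consistent with the absence of a trend along the department codes, although it would not rule out a geographical pattern that the codes do not capture.
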