---
title: "French given names per year per department"
author: "Lucas Mello Schnorr, Jean-Marc Vincent"
date: "October, 2022"
output:
  pdf_document: default
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# The problem context
The aim of the activity is to develop a methodology to answer a specific question on a given dataset. 

The dataset contains the first names given in France over a long period of time. It is published by INSEE at
[https://www.insee.fr/fr/statistiques/2540004](https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip). We chose this dataset because it is too large to analyse by hand, yet its structure is simple.


You need to use the _tidyverse_ for this analysis. Unzip the file _dpt2021_csv.zip_ (to get **dpt2021.csv**) and read it into R with the code below. Note that you might need to install the `readr` package with the appropriate command.

## Download Raw Data from the website
```{r}
file = "dpt2021_csv.zip"
if(!file.exists(file)){
  download.file("https://www.insee.fr/fr/statistiques/fichier/2540004/dpt2021_csv.zip",
	destfile=file)
}
unzip(file)
```
Check that your file is identical to the one used in the original analysis (for reproducibility). The `md5` command is the macOS one; on Linux, use `md5sum` instead.
```{bash}
md5 dpt2021.csv
```
Expected output:
MD5 (dpt2021.csv) = f18a7d627883a0b248a0d59374f3bab7
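
For a check that does not depend on the operating system, the same comparison can be sketched in R with `tools::md5sum`. The helper name `check_md5` is hypothetical; the expected value is the one quoted above.

```{r}
# Hypothetical helper: TRUE when the MD5 checksum of `path` equals `expected`.
check_md5 <- function(path, expected) {
  identical(unname(tools::md5sum(path)), tolower(expected))
}

# Only run the check when the file has already been downloaded and unzipped.
if (file.exists("dpt2021.csv")) {
  check_md5("dpt2021.csv", "f18a7d627883a0b248a0d59374f3bab7")
}
```

A `FALSE` result would suggest that the downloaded file differs from the one used in the first analysis, for example because the publisher updated it.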

## Build the Dataframe from file

```{r}
# Base R's read.csv is used here because the tidyverse package could not be installed.
# The file is semicolon-separated; read.csv already returns a data frame.
FirstNames <- read.csv("dpt2021.csv", sep = ";")
head(FirstNames)
```
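
As a side note on the reading step, the sketch below shows how `read.csv` can be told to keep the department and year codes as text, so that a code such as `01` keeps its leading zero. The three-row extract is made up for illustration, and the `XXXX` / `XX` placeholders are an assumption about how unknown values may appear in the file.

```{r}
# Hypothetical three-row extract in the same semicolon-separated layout.
tmp <- tempfile(fileext = ".csv")
writeLines(c("sexe;preusuel;annais;dpt;nombre",
             "1;JEAN;1900;01;12",
             "2;MARIE;1900;2A;30",
             "2;MARIE;XXXX;XX;4"), tmp)

# Read both code columns as text so that "01" is not turned into the number 1.
toy_names <- read.csv(tmp, sep = ";",
                      colClasses = c(dpt = "character", annais = "character"))

# Numeric year for plotting; non-numeric placeholders (assumed) become NA.
toy_names$year <- suppressWarnings(as.integer(toy_names$annais))
str(toy_names)
```

Keeping the original `annais` column next to the numeric `year` makes it easy to see which rows were affected by the conversion.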

Each of the following questions may require a preliminary analysis of the data. Feel free to present answers and justifications in your own order, and structure your report as a scientific report should be.

### 1. Choose a first name and analyse its frequency over time. 
The chosen first name is NOUR, my own first name!
```{r}
library(dplyr)

# Total number of births recorded for the chosen name (all years, all departments)
freq <- FirstNames %>%
  filter(preusuel == "NOUR") %>%
  summarise(Firstname = "NOUR", frequency = sum(nombre))
print(freq)

# Compare with the totals of the first 20 names of the sorted name list
unique_names <- FirstNames %>% group_by(preusuel) %>% summarise()
for (fname in unique_names$preusuel[1:20]) {
  df <- FirstNames %>%
    filter(preusuel == fname) %>%
    summarise(Firstname = fname, frequency = sum(nombre))
  freq <- rbind(freq, df)
}
freq
``` 
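
Since the question asks about frequency over time, the totals can also be broken down by year. The sketch below is a minimal illustration on a made-up data frame with the same column names as `FirstNames`; the helper `name_by_year` is hypothetical, and the share is computed relative to all births recorded for that year in the table.

```{r}
library(dplyr)

# Hypothetical toy rows with the same columns as FirstNames.
toy <- data.frame(
  sexe     = c(2, 2, 2, 2, 1),
  preusuel = c("NOUR", "NOUR", "NOUR", "MARIE", "JEAN"),
  annais   = c("2000", "2000", "2001", "2000", "2000"),
  dpt      = c("38", "75", "38", "38", "38"),
  nombre   = c(3, 5, 4, 10, 7)
)

# Births per year for one name, and its share of all births of that year.
name_by_year <- function(data, name) {
  data %>%
    group_by(annais) %>%
    summarise(count = sum(nombre[preusuel == name]),
              share = count / sum(nombre),
              .groups = "drop")
}

name_by_year(toy, "NOUR")
```

Applied to the real data, `name_by_year(FirstNames, "NOUR")` would give one row per year, which can then be plotted as a time series.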


### 2. Establish, by gender, the most given first name for each year. Analyse the evolution of the most frequent first name.
```{r}
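# Sum the counts over all departments first, then keep the top name per sex and year.
grouped_data <- FirstNames %>%
  filter(preusuel != "_PRENOMS_RARES") %>%  # drop the rare-names placeholder, if present
  group_by(sexe, annais, fname = preusuel) %>%
  summarise(nb = sum(nombre), .groups = "drop") %>%
  group_by(sexe, annais) %>%
  slice_max(nb, n = 1)
print(grouped_data, n = 1000)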
```
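
To describe how the most frequent name evolves, one option is to count how many years each name stayed on top and over which period. The sketch below assumes a table shaped like `grouped_data` (columns `sexe`, `annais`, `fname`, `nb`); the rows of `toy_top` are invented for illustration only.

```{r}
library(dplyr)

# Hypothetical yearly winners, shaped like grouped_data.
toy_top <- data.frame(
  sexe   = c(1, 1, 1, 2, 2, 2),
  annais = c("1900", "1901", "1902", "1900", "1901", "1902"),
  fname  = c("JEAN", "JEAN", "PIERRE", "MARIE", "MARIE", "MARIE"),
  nb     = c(100, 110, 90, 200, 210, 190)
)

# Number of years on top and first/last year on top, per sex and name.
reign <- toy_top %>%
  group_by(sexe, fname) %>%
  summarise(years_on_top = n(),
            first_year   = min(annais),
            last_year    = max(annais),
            .groups = "drop") %>%
  arrange(sexe, first_year)

reign
```

A short table of this kind is easier to comment on in a report than a printout with one row per year.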

### 3. Optional: Which department has the largest variety of names over time? Is there some sort of geographical correlation in the data?
```{r}
#Variety of names along time for each department
department <- FirstNames %>% group_by(dpt) %>% select(c(dpt,preusuel)) %>% summarise(nb=length(unique(preusuel)))
print(department,n=101)
```
The department with the largest variety of names over the whole period is:
```{r}
dep <- filter(department, nb==max(nb))
dep
```
```{r fig.width = 11}
library(ggplot2)
ggplot(data = department, aes(x=dpt, y=nb)) + geom_point() + theme_bw() + geom_smooth(method="lm")
```
According to the graph, there is no apparent geographical correlation in the data: the linear regression line (`method="lm"`) fitted against the department code is almost flat.
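
The flatness of the line can also be quantified by looking at the fitted slope. The sketch below is a minimal illustration on invented values shaped like `department` (columns `dpt`, `nb`). It assumes that some department codes are not numeric (such as `2A`) and treats them as missing, and it uses the department code only as a rough proxy for location.

```{r}
# Hypothetical per-department name counts, shaped like `department`.
toy_dep <- data.frame(
  dpt = c("01", "02", "2A", "03", "04", "XX"),
  nb  = c(500, 520, 300, 510, 490, 50)
)

# Numeric department code; non-numeric codes become NA and are dropped by lm().
toy_dep$code <- suppressWarnings(as.integer(toy_dep$dpt))

fit <- lm(nb ~ code, data = toy_dep)
slope <- coef(fit)[["code"]]
slope
```

A slope that is small compared with the spread of `nb` supports the reading of the graph; `summary(fit)` additionally reports a standard error for the slope.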