---
title: "Exercise 4"
author: "Waad ALMASRI"
date: "21/08/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Exploring the repository
First, we run `git pull` to fetch the data that was uploaded to the Git repository.
**Warning!** If you have already started writing in the notebook, save your work first so that you do not lose it.
Next, we check that the data is in the directory with `list.files()`.
```{r }
list.files(".")
```
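Rather than checking the listing by eye, we can test for the file directly. This is a minimal sketch, assuming the file is named `data.csv` as in the next section; `repo_dir` is a hypothetical temporary directory that stands in for the repository so that the example is self-contained.

```r
# Hypothetical stand-in for the repository: a temporary directory with one file.
repo_dir <- tempfile("repo_")
dir.create(repo_dir)
file.create(file.path(repo_dir, "data.csv"))

# List only the CSV files, then test for the one we expect.
csv_files <- list.files(repo_dir, pattern = "\\.csv$")
has_data <- file.exists(file.path(repo_dir, "data.csv"))

print(csv_files)
if (!has_data) stop("data.csv not found: run git pull first")
```

In the notebook itself, `repo_dir` would simply be `"."`.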
## Exploring the dataset
Now that we have the data, we can start exploring it.
**NB:** the data below is already formatted as .csv.
```{r, echo=FALSE}
# The file has a .csv extension but is read here as tab-separated.
df <- read.csv(file = "data.csv", sep = "\t")
print(nrow(df))
head(df)
```
Let us add a `year` column to the data frame, taken from the first four characters of `date`.
```{r, echo=FALSE}
df$year <- substring(df$date,1,4)
```
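The `substring()` call above relies on `date` starting with a four-digit year. As a minimal sketch on invented dates (the `YYYY-MM-DD` format is an assumption, since the actual format of `date` is not shown here), we can compare it with parsing the dates first:

```r
# Toy dates, assumed to be ISO-formatted strings (YYYY-MM-DD).
dates <- c("2008-06-11", "2016-03-18", "2012-10-01")

# Approach used above: keep the first four characters.
year_substring <- substring(dates, 1, 4)

# Alternative: parse as Date first, so malformed entries become NA.
year_parsed <- format(as.Date(dates, format = "%Y-%m-%d"), "%Y")

print(year_substring)
print(identical(year_substring, year_parsed))
```

Both approaches return character vectors; the parsed version is slower but flags entries that do not match the expected format.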
Now let us check what's in the dataframe:
```{r, echo=FALSE}
summary(df)
```
Let us remove the rows where `job` or `state` is missing:
```{r, echo=FALSE}
library(tidyr)
df <- df %>% drop_na(job, state)
```
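For reference, the same filtering can be sketched in base R with `complete.cases()`. The data frame `toy` below is invented for illustration. It also includes an empty string, because `read.csv()` reads blank text fields as `""` rather than `NA` by default, and neither `drop_na()` nor `complete.cases()` treats `""` as missing.

```r
# Invented data: two NA values and one empty string.
toy <- data.frame(
  job   = c("Democrat", NA, "Republican", "Senator", "Governor"),
  state = c("Texas", "Ohio", NA, "Florida", ""),
  stringsAsFactors = FALSE
)

# Turn empty strings into NA so that they count as missing.
toy$state[toy$state %in% ""] <- NA

# Keep only the rows where both job and state are present.
toy_clean <- toy[complete.cases(toy[, c("job", "state")]), ]

print(toy_clean)
```

Only the "Democrat / Texas" and "Senator / Florida" rows survive.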
## Basic statistics
```{r, echo=FALSE}
print(paste0("There are ", length(unique(df$job)), " unique jobs."))
print(paste0("There are ", length(unique(df$edited_by)), " unique editors."))
print(paste0("There are ", length(unique(df$state)), " unique states."))
```
Number of jobs per state and per year:
```{r, echo=FALSE}
library(plyr)
# Count the rows for each (state, year) pair, then sort by decreasing count.
jobs_per_state_year <- ddply(df, .(state, year), summarise, number_of_jobs = length(job))
jobs_per_state_year <- jobs_per_state_year[order(jobs_per_state_year$number_of_jobs, decreasing = TRUE), ]
```
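The grouping step above counts rows per (state, year) pair and sorts the result. A minimal base R sketch of the same idea, on an invented data frame `toy`, uses `aggregate()` in place of `ddply()`:

```r
# Invented data standing in for df.
toy <- data.frame(
  state = c("Texas", "Texas", "Texas", "New York", "New York"),
  year  = c("2015", "2015", "2016", "2015", "2016"),
  job   = c("Democrat", "Republican", "Democrat", "Democrat", "Republican"),
  stringsAsFactors = FALSE
)

# Count rows for every (state, year) pair; length() acts as a row counter.
counts <- aggregate(job ~ state + year, data = toy, FUN = length)
names(counts)[names(counts) == "job"] <- "number_of_jobs"

# Sort from the most to the least frequent pair.
counts <- counts[order(counts$number_of_jobs, decreasing = TRUE), ]
print(counts)
```

Here the pair (Texas, 2015) comes first with two rows; every other pair has one.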
## Plots
We will start by plotting the number of jobs per year in New York versus Texas.
```{r, echo=FALSE}
library(tidyverse)
library(ggplot2)
df %>%
  # Refer to the column by name so the filter applies to the piped data.
  filter(state %in% c("Texas", "New York")) %>%
  group_by(state, year) %>%
  summarise(Nbr_of_jobs = n()) %>%
  ggplot(aes(x = year, y = Nbr_of_jobs)) +
  geom_bar(aes(fill = state), stat = "identity") +
  theme_bw()
```
Let us check which jobs are the most frequent in the US:
```{r, echo=FALSE}
top_jobs <- ddply(df, .(job), summarise, number_of_jobs = length(state))
top_jobs <- top_jobs[order(top_jobs$number_of_jobs, decreasing = TRUE), ]
ggplot(data = top_jobs, aes(x = reorder(job, -number_of_jobs), y = number_of_jobs)) +
  geom_bar(stat = "identity", color = "blue", fill = "white") +
  theme(axis.text.x = element_text(angle = 90))
```
#### Discussion
This dataset appears to be mostly about politics, since the two most frequent jobs are Republican and Democrat.
Let us check the rate of Republicans versus Democrats in the top states of the US.
But first, let us identify the top states of the US.
```{r, echo=FALSE}
jobs_per_state <- ddply(df, .(state), summarise, number_of_jobs = length(job))
jobs_per_state <- jobs_per_state[order(jobs_per_state$number_of_jobs, decreasing = TRUE), ]
ggplot(data = jobs_per_state, aes(x = reorder(state, -number_of_jobs), y = number_of_jobs)) +
  geom_bar(stat = "identity", color = "white", fill = "red") +
  theme(axis.text.x = element_text(angle = 90))
```
Thus, we can conclude that the 7 US states with the most jobs are: Texas, Florida, Illinois, Ohio, Wisconsin, Georgia and Rhode Island.
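The list of top states can also be extracted in code instead of being read off the chart. This is a minimal sketch on an invented table `toy_counts`, which stands in for `jobs_per_state`; the counts are made up and `n` is set to 2 only to keep the example short.

```r
# Toy counts standing in for jobs_per_state (values are invented).
toy_counts <- data.frame(
  state = c("Ohio", "Texas", "Florida", "Georgia"),
  number_of_jobs = c(40, 100, 90, 25),
  stringsAsFactors = FALSE
)

# Sort by decreasing count and keep the first n states.
n <- 2
sorted <- toy_counts[order(toy_counts$number_of_jobs, decreasing = TRUE), ]
top_states <- head(sorted$state, n)
print(top_states)
```

With the real data, `head(jobs_per_state$state, 7)` would play the same role, since `jobs_per_state` is already sorted.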
Now let us compare the distribution of Republicans versus Democrats in the top 7 US states:
```{r, echo=FALSE}
library(tidyverse)
library(ggplot2)
df %>%
  # Refer to the columns by name so the filter applies to the piped data.
  filter(
    state %in% c("Texas", "Florida", "Illinois", "Ohio", "Wisconsin", "Georgia", "Rhode Island"),
    job %in% c("Republican", "Democrat")
  ) %>%
  group_by(state, job) %>%
  summarise(Nbr_of_jobs = n()) %>%
  ggplot(aes(x = state, y = Nbr_of_jobs)) +
  geom_bar(aes(fill = job), stat = "identity") +
  theme_bw()
```
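The stacked bars show raw counts, so states with more rows dominate the picture. To compare rates rather than counts, the counts can be turned into within-state shares. This is a minimal base R sketch on an invented data frame `toy`, using `table()` and `prop.table()`:

```r
# Invented data standing in for the filtered df.
toy <- data.frame(
  state = c("Texas", "Texas", "Texas", "Texas", "Ohio", "Ohio"),
  job   = c("Republican", "Republican", "Republican", "Democrat",
            "Democrat", "Republican"),
  stringsAsFactors = FALSE
)

# Cross-tabulate, then divide each row by its total (margin = 1).
counts <- table(toy$state, toy$job)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` sums to 1, so the values can be compared across states of different sizes.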
## Word Cloud
We could also have found the top states and top jobs using a word cloud.
Top jobs:
```{r, echo=FALSE}
library(wordcloud)
library(RColorBrewer)
pal2 <- brewer.pal(8, "Set2")  # 8 is the maximum number of colors in Set2
wordcloud(top_jobs$job, top_jobs$number_of_jobs,
          random.order = TRUE, rot.per = .10, colors = pal2, vfont = c("sans serif", "plain"))
```
Top US states:
```{r, echo=FALSE}
pal2 <- brewer.pal(8,"Accent")
wordcloud(jobs_per_state$state, jobs_per_state$number_of_jobs,
random.order=FALSE, rot.per=.15, colors=pal2, vfont=c("sans serif","plain"))
```