--- title: "SMPE-HM1" output: pdf_document: default html_document: default --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Getting the wrong picture from the data ## Read the data A good way to store the data is to organize it in a csv file. ```{r} group1=read.csv("activite-histo-group1.csv",header = T) group2=read.csv("activite-histo-group2.csv",header = T) group1 group2 ``` ## Data Visualization ### Petite section Plot each section scores of both groups in the same graph to see the difference. We need to make a plot that represents the scores of each group. The We use for example a solid line for the first group, and then draw a dashed line for group 2. A legend is added to the graph in order to make it clear to the reader what the two lines represent. ```{r} # plot solid line, set plot size, but omit axes plot(x=seq(1,33), y=group1$score.in.petite.section, type="l", lty=1, main="Petite section scores for group 1 and 2", xlab="students" , ylab="scores") # plot dashed line lines(x=seq(1,30), y=group2$score.in.petite.section, lty=2) # add legend par(xpd=TRUE) legend(x=5, y=-1, legend=c("group1", "group2"), lty=1:2, box.lty=0, ncol=2) ``` It's difficult to see whether the classical pedagogy is better than the alternative one, so an idea that could make it more clear is to visualize the grouped scores, in order to see their frequency for each group and thus we can make a good conclusion. ```{r} library(plyr) freq_1 <- count(group1, 'score.in.petite.section') freq_2 <- count(group2, 'score.in.petite.section') hg1 <- hist(freq_1$score.in.petite.section, plot = FALSE) # Save first histogram data hg2 <- hist(freq_2$score.in.petite.section, plot = FALSE) # Save 2nd histogram data plot(hg1, col = alpha("blue",0.2),xlab = "Petite section scores",main = ("Petite section scores frequency for group 1 and 2")) # Plot 1st histogram using a transparent color plot(hg2, col = alpha("red",0.4), add = TRUE) # Add 2nd histogram using different color legend("topright", legend = c("group 1", "group 2"), fill = c(5,2), # Color of the squares border = "black") ``` According to the graph, the classical pedagogy has a higher frequency of low scores than the alternative pedagogy. However, the classical pedagogy got the best scores [3, 3.5]. ### Moyenne section Let's see for the moyenne section. ```{r} # plot solid line, set plot size, but omit axes plot(x=seq(1,33), y=group1$score.in.moyenne.section, type="l", lty=1, main="Moyenne section scores for group 1 and 2", xlab="students", ylab="scores") # plot dashed line lines(x=seq(1,30), y=group2$score.in.moyenne.section, lty=2) # add legend par(xpd=TRUE) legend(x=5, y=-1, legend=c("group1", "group2"), lty=1:2, box.lty=0, ncol=2) ``` ```{r} library(ggplot2) library(plyr) freq_1 <- count(group1, 'score.in.moyenne.section') freq_2 <- count(group2, 'score.in.moyenne.section') hg1 <- hist(freq_1$score.in.moyenne.section, plot = FALSE) # Save first histogram data hg2 <- hist(freq_2$score.in.moyenne.section, plot = FALSE) # Save 2nd histogram data plot(hg1, col = alpha("blue",0.2),xlab = "Moyenne section scores",main = ("Moyenne section scores frequency for group 1 and 2")) # Plot 1st histogram using a transparent color plot(hg2, col = alpha("red",0.4), add = TRUE) # Add 2nd histogram using different color legend("topright", legend = c("group 1", "group 2"), fill = c(5,2), # Color of the squares border = "black") ``` Again, the classical pedagogy has a higher frequency of low scores than the alternative pedagogy, but it has the best scores [7,8]. ### Grande section Let's see now for the grande section. ```{r} # plot solid line, set plot size, but omit axes plot(x=seq(1,33), y=group1$score.in.grande.section, type="l", lty=1, main="Grande section scores for group 1 and 2", xlab="students", ylab="scores") # plot dashed line lines(x=seq(1,30), y=group2$score.in.grande.section, lty=2) # add legend par(xpd=TRUE) legend(x=5, y=-1, legend=c("group1", "group2"), lty=1:2, box.lty=0, ncol=2) ``` ```{r} library(ggplot2) library(plyr) freq_1 <- count(group1, 'score.in.grande.section') freq_2 <- count(group2, 'score.in.grande.section') hg1 <- hist(freq_1$score.in.grande.section, plot = FALSE) # Save first histogram data hg2 <- hist(freq_2$score.in.grande.section, plot = FALSE) # Save 2nd histogram data plot(hg1, col = alpha("blue",0.2),xlab = "Grande section scores",main = ("Grande section scores frequency for group 1 and 2")) # Plot 1st histogram using a transparent color plot(hg2, col = alpha("red",0.4), add = TRUE) # Add 2nd histogram using different color legend("topright", legend = c("group 1", "group 2"), fill = c(5,2), # Color of the squares border = "black") ``` However, the last graph shows that the alternative pedagogy is better than the classical one in the grande section. We can also visualize the mean of each section for both groups in the same plot. ```{r} library(ggplot2) df <- data.frame(Petite=c(mean(group1$score.in.petite.section),mean(group2$score.in.petite.section)), Moyenne=c(mean(group1$score.in.moyenne.section),mean(group2$score.in.moyenne.section)), Grande=c(mean(group1$score.in.grande.section),mean(group2$score.in.grande.section))) print(df) plot(x=1:3, y=c(df$Petite[1],df$Moyenne[1],df$Grande[1]), xaxt="n",type = "o", col = 1,xlab = "Average scores",ylab = "Sections", main = "Average scores of sections for group 1 & 2") lines(x=1:3, y=c(df$Petite[2],df$Moyenne[2],df$Grande[2]),xaxt="n", type = "o", col = 2) axis(1, at = seq(1, 3, by = 1), las=2) legend("bottomright", legend = c("1: Petite section", "2: Moyenne section","3: Grande section"), border = "black") par(xpd=TRUE) legend("topleft", legend = c("group 1", "group 2"), fill = c(1,2), # Color of the squares border = "black") ``` # Getting the wrong picture from the data - Correlation, causality ## Read the data ```{r} data=read.csv("foot_size_data.csv",header = T) data ``` ## Data Visualization - The graph I propose to represent the data is a *box plot*, that shows the the range and median of total mistakes made by each feet size category. - To build this graph, we can use the *ggplot* library. Our data is given to the ggplot, the x-axis is the feet size, the y-axis is the number of mistakes, and we want to group the latter by the feet size, so we use the group argument. Then, we tell ggplot to plot the boxes, and finally to set a black & white theme for a good visualization. ```{r} library(ggplot2) ggplot(data = data, aes(x=feet_size,y=nb_mistakes,group=factor(feet_size))) + geom_boxplot() + theme_bw() ``` - I made this graph because it summarizes the data well, and serves as a statistic for the reader to make a conclusion just by seeing the graph. - We can use a linear regression to determine the relationship between these two variables to make a summary. ```{r} reg<-lm(data$nb_mistakes ~.,data) summary(reg) ``` - From the graph, we can say that for small sizes of students' feet, the number of mistakes made in dictation are big, and with the growth of the feet size, the number of mistakes decreases. - We deduce from the graph that in the small ones (i.e. students with small feet sizes), they have a high probability of making a mistake, however with time (when they grow up and therefore their feet sizes grow), they start to master the language and to make less mistakes. This is logicial and corresponds to my initial intuition. - Yes there is a negative correlation between the two quantities. I looked for causality and found that correlation doesn't imply causality. This means that two variables could be linked together without one of them being the reason for the other's observed behavior.