version

parent d1b2f2c2
---
title: "in-situ CO2 Data"
author: " "
date: "`r format(Sys.time(),'%d %B %Y')`"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
In this document we perform an analysis about carbon dioxide in the atmosphere.
The goal is to detect a periodic oscillation and a slow but continuous increase in the concentration of carbon dioxide.
# Technical information on the computer on which the analysis is run
We will be using the R language using the following libraries :
```{r warning=FALSE}
library(tidyverse)
library(parsedate)
library(anytime)
library(lubridate)
library(forecast)
library(fpp2)
#sessionInfo()
```
Here are the available libraries
```{r}
devtools::session_info()
```
# Atmospheric CO2 data
Data are available at (Scrips)[https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record.html]
*C. D. Keeling, S. C. Piper, R. B. Bacastow, M. Wahlen, T. P. Whorf, M. Heimann, and H. A. Meijer, Exchanges of atmospheric CO2 and 13CO2 with the terrestrial biosphere and oceans from 1978 to 2000. I. Global aspects, SIO Reference Series, No. 01-06, Scripps Institution of Oceanography, San Diego, 88 pages, 2001.*
## Description of the dataset
The data file below contains **10 columns**.
- Columns 1-4 give the dates in several redundant formats.
- Column 5 below gives monthly Mauna Loa CO2 concentrations in micro-mol CO2 per
mole (ppm), reported on the 2012 SIO manometric mole fraction scale. This is the standard version of the data most often sought. *The monthly values have been adjusted to 24:00 hours on the 15th of each month*.
- Column 6 gives the same data after a seasonal adjustment to remove the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor.
- Column 7 is a smoothed version of the data generated from a stiff cubic spline function plus 4-harmonic functions with linear gain.
- Column 8 is the same smoothed version with the seasonal cycle removed.
- Column 9 is identical to Column 5 except that the missing values from
- Column 5 have been filled with values from Column 7.
- Column 10 is identical to Column 6 except missing values have been filled with values from Column 8.
- Missing values are denoted by -99.99
CO2 concentrations are measured on the '12' calibration scale
## Loading data
The data start row 54 with the header. For this reason I skipped the 53 first rows when reading data.
To proceed to analysis on the same dataset. I downloaded the file only if it not already on my computer.
```{r}
data_url="https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv"
data_file="monthly_in_situ_co2_mlo.csv"
if (!file.exists(data_file)){download.file(data_url,data_file,method="auto")}
data=read.csv("monthly_in_situ_co2_mlo.csv", header= T,skip=54)
head(data)
tail(data)
```
## Insecpecting data
The 2 first lines of the dataset are containing information about the data. I removed them from the dataset.
```{r}
data2=data[-c(1,2),]
```
Is there some missing value in the data set:
```{r}
na_records= apply(data2,1, function (x) any(is.na(x)))
data2[na_records,]
```
Every rows are filled with values.
Nevertheless some missing value are denoted by **-99,99**. See paragraph above describing data.
Before dealing with the missing value, the type "Character" must be converted into numeric in order to look for 99.99 value.
```{r}
i<-c(5:10)
data2[,i] =apply(data2[,i],2,function(x) as.numeric(as.character(x)))
```
As I need to proceed to time series analysis which involve lag shifting I need to verify that every year will contain 12 months. So instead of deleting data which may lead to incoherent analysis I decided to replace the CO2 concentration by the data generated from a *stiff cubic spline function plus 4-harmonics ( the column 7 of the original dataset)*
```{r}
data2<-data2%>%mutate(CO2_clean=case_when(CO2==-99.99 ~ CO2.1, TRUE ~CO2))
data2<-data2%>%
mutate(seasonally_clean=case_when(seasonally==-99.99 ~ seasonally.1, TRUE ~seasonally))
```
Now, any data still containing missing value **-99.99**, are removed to allow time series analysis.
```{r}
mv_records= sapply(data2$CO2_clean, function (x) ((x==-99.99)))
data2[mv_records,]
data4=data2[!mv_records,]
data4%>%group_by(Yr)%>%summarise(nb=n())%>%filter(nb<12)
```
The first year *1958* and the last year *2021* don't show data for the 12 months. From my understanding so far this will not impair the further analysis so I decided to maintain those years in the dataset.
**I will perform the analysis from March 1958 to April 2021.**
## Date management :
Creating a column date inheriting from year and month as it seems impossible to convert correctly other format. It would have been beneficial to know which format has been used in column Date. I used the *lubridate package* and assume that day 15 is a good candidate for the conversion.
```{r}
data4$builtdate<-ymd(paste0(data4$Yr," ",data4$Mn," ","15"))
```
# Data Visualisation
## Raw data viz
I used the package ggplot2 to proceed to the first visualisation of the data set.
```{r}
ggplot(data4,aes(builtdate,CO2_clean))+
geom_line(color='blue')+
xlab("Year,Month")+
ylab("Concentration in CO2 (ppm)")+
scale_x_date(date_labels = "%Y-%m",date_breaks = "5 year")+
theme(axis.text.x =element_text(angle=45) )
```
This figure shows both the intra-annual variation of co2 concentration and the inter-annual increase.
In order to focus on the the intra-annual variation, a zoom is provided on the 40 last entries of the dataset.
```{r}
data5=tail(data4,40)
ggplot(data5,aes(builtdate,CO2_clean))+
geom_line(color='blue')+
xlab("Year,Month")+
ylab("Concentration in CO2 (ppm)")+
scale_x_date(date_labels = "%Y-%m")+
theme(axis.text.x =element_text(angle=45))
```
Considering that the data are measured in the North Hemisphere, we understand that the maximum concentration is reached during the summer and the minimum during the winter.
## Regression model
The dataset provides in column 6 and 10 the same data after a seasonal adjustment to remove the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor.
The following plot shows the inter-annual variation with a model of linear regression.
```{r}
ggplot(data4,aes(x=builtdate,y=seasonally_clean))+
geom_line(color='red')+
xlab("Year,Month")+
ylab("Concentration in CO2 (ppm)")+
scale_x_date(date_labels = "%Y-%m",date_breaks = "5 year")+
theme(axis.text.x =element_text(angle=45) )+
geom_smooth(method ="lm")
```
The linear model gives a good match with data. Nevertheless, for the tail and head of the data, the linear model seems to be less accurate.
For this reason I decided to dig into time series analysis with R in order to find a suitable answer to the questions raised.
## Differeciating the trend and the cycle
Using the time series tools of R *ts*
```{r}
CO2_main<-ts(data4$CO2_clean,start=c(1958,3),frequency = 12)
decomposesignal<-decompose(CO2_main,type="additive")
plot(decomposesignal)
```
We have solved the first question graphically. I used the *additive type* to decompose the signal. The **trend** plot depicts the dataset when the annual variation is removed.
The **seasonal** plot depicts the seasonal cycle observed (max in summer, min in winter ref. Northern hemisphere).
## Modeling the data
The data is not stationary because the mean of the data is time depending.
```{r}
ggtsdisplay(CO2_main)
```
The data shows a seasonality. The ACF is not going down to zero.
```{r}
CO2_main%>%diff(lag=12)%>%diff()%>%ggtsdisplay()
```
Adding a lag of 12 shows an ACF decreasing to 0.
Considering the seasonality, it is worth using seasonal differencing to model the data
## Forecasting
I split the dataset in *training data as Keepling_ts* and *test data as data_test_ts*.
```{r}
data7<-data4%>%select(builtdate,CO2_clean,Yr)%>%
filter(Yr<2017)%>%select(-Yr)
keepling_ts<-ts(data7%>%
dplyr::select(-builtdate),start=c(data7$builtdate[1] %>% year(),1),frequency=12)
data_test<-data4%>%select(builtdate,CO2_clean,Yr)%>%
filter(Yr>2016)%>%select(-Yr)
data_test_ts<-ts(data_test%>%
dplyr::select(-builtdate),start=c(data4$builtdate[711]%>% year(),1),frequency=12)
```
## ARIMA model
The parameter are those defined by R in the automatic mode
```{r}
co2_arima<-keepling_ts%>%auto.arima()
co2_arima%>%forecast(h=12*10)%>%autoplot()+
autolayer(data_test_ts)
summary(co2_arima)
Prediction1<-forecast(co2_arima,h=12*10)
```
Giving prediction for 2025 from January to December answered the second question.
```{r}
window(Prediction1$mean, start = 2025, end=2025.999 )
```
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment