This document analyses the concentration of carbon dioxide in the atmosphere.
The goal is to detect a periodic oscillation and a slow but continuous increase in that concentration.
# Technical information on the computer on which the analysis is run
The analysis is written in R and uses the following libraries:
```{r warning=FALSE}
library(tidyverse)
library(parsedate)
library(anytime)
library(lubridate)
library(forecast)
library(fpp2)
#sessionInfo()
```
The session information below lists the R version and the packages available in this session:
```{r}
devtools::session_info()
```
# Atmospheric CO2 data
Data are available from [Scripps](https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record.html).
*C. D. Keeling, S. C. Piper, R. B. Bacastow, M. Wahlen, T. P. Whorf, M. Heimann, and H. A. Meijer, Exchanges of atmospheric CO2 and 13CO2 with the terrestrial biosphere and oceans from 1978 to 2000. I. Global aspects, SIO Reference Series, No. 01-06, Scripps Institution of Oceanography, San Diego, 88 pages, 2001.*
## Description of the dataset
The data file below contains **10 columns**.
- Columns 1-4 give the dates in several redundant formats.
- Column 5 gives monthly Mauna Loa CO2 concentrations in micro-mol CO2 per mole (ppm), reported on the 2012 SIO manometric mole fraction scale. This is the standard version of the data most often sought. *The monthly values have been adjusted to 24:00 hours on the 15th of each month*.
- Column 6 gives the same data after a seasonal adjustment to remove the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor.
- Column 7 is a smoothed version of the data generated from a stiff cubic spline function plus 4-harmonic functions with linear gain.
- Column 8 is the same smoothed version with the seasonal cycle removed.
- Column 9 is identical to Column 5 except that the missing values from Column 5 have been filled with values from Column 7.
- Column 10 is identical to Column 6 except missing values have been filled with values from Column 8.
- Missing values are denoted by -99.99.

CO2 concentrations are measured on the '12' calibration scale.
## Loading data
The header is on row 54 of the file, so I skipped the first 53 rows when reading the data.
To make sure the analysis always runs on the same dataset, I download the file only if it is not already on my computer.
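The two steps above (download once, then skip the preamble) can be sketched as follows. This is a minimal illustration, not the code used in this analysis: `fetch_if_missing()` is a hypothetical helper, the URL is a placeholder that is never contacted, and the toy file has only 3 preamble lines where the real file has 53.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Toy stand-in for the real file: 3 preamble lines, then the header, then data.
# The values are made up for illustration.
toy_file <- tempfile(fileext = ".csv")
writeLines(c("preamble line 1", "preamble line 2", "preamble line 3",
             "Yr,Mn,CO2", "1958,3,315.7", "1958,4,317.4"), toy_file)

# The file already exists, so nothing is downloaded.
fetch_if_missing("https://example.invalid/placeholder.csv", toy_file)

# Skip the preamble so that the first line read is the header.
toy <- read.csv(toy_file, skip = 3)
print(toy)
```

For the real file, the same pattern would use `skip = 53`, since the header sits on row 54.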
The time series analysis involves lag shifting, so every year needs to contain 12 months. Deleting rows with missing values could make the analysis incoherent, so I instead replaced each missing CO2 concentration with the value generated from a *stiff cubic spline function plus 4-harmonic functions (column 7 of the original dataset)*.
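A minimal sketch of this replacement, assuming the two columns are named `CO2` (column 5) and `CO2_fit` (column 7); both names and all values are made up for the illustration, and only `CO2_clean` matches a name used later in this document.

```r
library(dplyr)

# Toy data (made-up values); -99.99 marks a missing measurement.
toy <- data.frame(
  CO2     = c(315.7, -99.99, 317.5),  # stands in for column 5
  CO2_fit = c(315.6, 316.1, 317.4)    # stands in for column 7
)

toy <- toy %>%
  mutate(
    CO2       = na_if(CO2, -99.99),    # turn the sentinel value into NA
    CO2_clean = coalesce(CO2, CO2_fit) # fall back to the smoothed value when missing
  )
print(toy)
```

Converting the sentinel to `NA` first keeps the fill step explicit and avoids treating -99.99 as a real concentration anywhere else.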
The first year *1958* and the last year *2021* do not contain data for all 12 months. As far as I understand, this does not impair the rest of the analysis, so I kept those years in the dataset.
**I will perform the analysis from March 1958 to April 2021.**
## Date management
I created a date column from the year and month columns, because the other date formats did not seem to convert correctly. It would have been helpful to know which format is used in the Date column. I used the *lubridate package* and assumed day 15 of each month, which is consistent with the monthly values being adjusted to the 15th.
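One way to build such a column is sketched below with `lubridate::make_date()`. The column names `Yr` and `Mn` are assumptions for the illustration; `builtdate` is the name used in the plotting code of this document.

```r
library(lubridate)

# Toy year and month columns (hypothetical names `Yr` and `Mn`).
toy <- data.frame(Yr = c(1958, 1958, 2021), Mn = c(3, 4, 4))

# make_date() assembles a Date from numeric parts; day 15 is fixed by assumption.
toy$builtdate <- make_date(year = toy$Yr, month = toy$Mn, day = 15)
print(toy)
```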
This figure shows both the intra-annual variation of the CO2 concentration and the inter-annual increase.
To focus on the intra-annual variation, the next plot zooms in on the last 40 entries of the dataset.
```{r}
data5=tail(data4,40)
ggplot(data5,aes(builtdate,CO2_clean))+
geom_line(color='blue')+
xlab("Year,Month")+
ylab("Concentration in CO2 (ppm)")+
scale_x_date(date_labels = "%Y-%m")+
theme(axis.text.x =element_text(angle=45))
```
The data are measured in the Northern Hemisphere, so the cycle follows Northern Hemisphere vegetation: the concentration peaks in late spring (around May), before summer plant growth draws CO2 down, and reaches its minimum in early autumn (around September to October).
## Regression model
Columns 6 and 10 of the dataset provide the same data after a seasonal adjustment that removes the quasi-regular seasonal cycle. The adjustment involves subtracting from the data a 4-harmonic fit with a linear gain factor.
The following plot shows the inter-annual variation together with a linear regression model.
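As a rough sketch of what such a fit looks like, the example below runs `lm()` on a synthetic series. The data are generated, not taken from the dataset, and the slope of 1.5 ppm per year is an assumption chosen only so that the result can be checked.

```r
set.seed(1)

# Synthetic stand-in for a seasonally adjusted series: 10 years of monthly values
# with an assumed slope of 1.5 ppm per year plus noise (not the real data).
t_year <- 1958 + (0:119) / 12
co2    <- 315 + 1.5 * (t_year - 1958) + rnorm(120, sd = 0.5)

fit   <- lm(co2 ~ t_year)
slope <- unname(coef(fit)["t_year"])  # estimated increase in ppm per year
print(slope)
```

The coefficient on the time variable is the quantity of interest here: it estimates the average yearly increase under the assumption that the trend is linear.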
This answers the first question graphically. I used the *additive type* to decompose the signal. The **trend** plot shows the series once the annual variation is removed.
The **seasonal** plot shows the observed seasonal cycle (for this Northern Hemisphere site, a maximum in late spring and a minimum in early autumn).
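An additive decomposition can be sketched with `stats::decompose()`, which may or may not be the function used for the plots above. The series below is synthetic (a linear trend plus a 12-month sine cycle), so the numbers do not describe the real data.

```r
# Synthetic monthly series (not the real data): linear trend + 12-month cycle.
m <- 1:120
y <- ts(315 + 0.1 * m + 3 * sin(2 * pi * m / 12),
        start = c(1958, 3), frequency = 12)

# Additive model: y = trend + seasonal + random.
dec <- decompose(y, type = "additive")

# The seasonal component repeats every 12 months and is centred on zero.
seasonal_year <- as.numeric(dec$seasonal)[1:12]
print(round(seasonal_year, 2))
```

Calling `plot(dec)` would draw the observed, trend, seasonal and random panels in one figure.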
## Modeling the data
The data are not stationary because their mean depends on time.
```{r}
ggtsdisplay(CO2_main)
```
The data show seasonality, and the ACF does not decay to zero.
```{r}
CO2_main%>%diff(lag=12)%>%diff()%>%ggtsdisplay()
```
After seasonal differencing at lag 12 followed by a first difference, the ACF decays towards 0.
Given the seasonality, it is worth using seasonal differencing to model the data.
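To see why the two differencing steps help, consider the noise-free synthetic series below (an assumption for illustration, not the real data). The lag-12 difference cancels an exact 12-month cycle and turns a linear trend into a constant; the first difference then removes that constant.

```r
# Noise-free synthetic monthly series (not the real data):
# linear trend of 0.1 per month + exact 12-month cycle.
m <- 1:120
y <- ts(315 + 0.1 * m + 3 * sin(2 * pi * m / 12), frequency = 12)

d12  <- diff(y, lag = 12)  # cycle cancels; the trend becomes the constant 0.1 * 12
d12d <- diff(d12)          # the remaining constant is removed

print(range(d12))
print(range(d12d))
```

On real data the result is not exactly zero, but what remains should fluctuate around a constant level, which is what the ACF plot above checks.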
## Forecasting
I split the dataset into *training data, `Keepling_ts`,* and *test data, `data_test_ts`*.
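The split itself is not shown here; one common way to do it for a `ts` object is `stats::window()`, sketched below on a toy series. The cut-off date and the names `train` and `test` are assumptions for the illustration and do not reproduce the split used for `Keepling_ts` and `data_test_ts`.

```r
# Toy monthly series covering 2010-01 to 2019-12 (made-up values).
y <- ts(1:120, start = c(2010, 1), frequency = 12)

train <- window(y, end = c(2018, 12))   # everything up to December 2018
test  <- window(y, start = c(2019, 1))  # the final 12 months, held out

print(c(train = length(train), test = length(test)))
```

Splitting by time rather than at random keeps the test period strictly after the training period, which is what a forecast evaluation needs.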