#+TITLE:  Reading and Analyzing some experimental hdf5 data
#+AUTHOR: Victor Martins Gomes
#+DATE:   2020-04-28 Tuesday
#+LANGUAGE: en
#+PROPERTY: header-args :session :exports both :eval never-export

I will read and get the statistics of some data I usually acquire in the lab. I
will use this log as a future reference to when I need to read hdf5 files
contained my data.

* Data reading using h5py package
To read the data we make use of the h5py package which contains most of what
might be necessary to read hdf5 files.

#+begin_src python :results output :session *python* :exports both
import h5py as h5
f = h5.File('testdata.h5','r')
print(f.keys())
#+end_src

#+RESULTS:
: <KeysViewHDF5 ['FileType', 'Frame', 'Waveforms']>

Now that we know what is inside, let's try to see which waveforms we have.

#+begin_src python :results output :session *python* :exports both
print(f['Waveforms'].keys())
#+end_src

#+RESULTS:
: <KeysViewHDF5 ['Channel 1', 'Channel 2']>

We have two channels, thus another subgroup. Gotta investigate deeper:
#+begin_src python :results output :session *python* :exports both
print(f['Waveforms/Channel 1'].keys())
print('---')
print(f['Waveforms/Channel 2'].keys())
#+end_src

#+RESULTS:
: <KeysViewHDF5 ['Channel 1Data']>
: ---
: <KeysViewHDF5 ['Channel 2Data']>

Now we can get the attributes of the recorded channels. They are going to be
necessary to properly map the data to a =numpy= array.
#+begin_src python :results output :session *python* :exports both
print(f['Waveforms/Channel 1'].attrs.keys())
#+end_src

#+RESULTS:
: <KeysViewHDF5 ['Count', 'DispInterpFactor', 'InterpSetting', 'MaxBandwidth', 'MinBandwidth', 'NumPoints', 'NumSegments', 'SavedInterpFactor', 'Start', 'WavAttr', 'WaveformType', 'XDispOrigin', 'XDispRange', 'XInc', 'XOrg', 'XUnits', 'YDispOrigin', 'YDispRange', 'YInc', 'YOrg', 'YUnits']>

Besides the channels we've read, a time series should be created. Each of
its elements should correspond to a data sample. To do that we will use the
information stored in one of the channels (supposing they were acquired using
the same oscilloscope config).
#+begin_src python :results output file :var filename="time_plot.png" :exports both :session *python*
import numpy as np
import matplotlib.pyplot as plt

idata1 = f['Waveforms/Channel 1'].attrs.get('XDispRange')
# XDispRange and YDispRange are the total range covered in the oscilloscope display.
idata2 = f['Waveforms/Channel 1'].attrs.get('XOrg')
# XOrg is the x-origin of the data and will be the same as XDispOrigin if the saved
# data is constrained to the displayed area.
idata3 = f['Waveforms/Channel 1'].attrs.get('XInc')
# Xinc is the incremental value of X, or the sampling period (i.e. XDispRange/NumPoints).
idata4 = f['Waveforms/Channel 1'].attrs.get('NumPoints')
temp1 = idata3*(np.arange(1,idata4+1))+idata2

plt.figure()
plt.plot(temp1,'b')
plt.grid(linestyle='dotted')
plt.axis('tight')
plt.title('Times series')

plt.savefig(filename)
print(filename)
#+end_src

#+RESULTS:
[[file:time_plot.png]]

Once the time array has been created we can move to reading the data.
#+begin_src python :results output :session *python* :exports both
yinc = f['Waveforms/Channel 1'].attrs.get('YInc') # Vertical sampling period
yorg = f['Waveforms/Channel 1'].attrs.get('YOrg') # Closest value to origin (y=0)
data1 = yorg + yinc*f['Waveforms/Channel 1/Channel 1Data']
yinc = f['Waveforms/Channel 2'].attrs.get('YInc')
yorg = f['Waveforms/Channel 2'].attrs.get('YOrg')
data2 = yorg + yinc*f['Waveforms/Channel 2/Channel 2Data']
print('Data type data:', data1.dtype)
#+end_src

#+RESULTS:
: Data type data: float64

Now we can plot the data to see it.

#+begin_src python :results output file :var filename="data_plot.png" :exports both :session *python*
plt.close('all')
plt.figure()
plt.plot(temp1*1e6,data1,'b')
plt.plot(temp1*1e6,data2,'r')
plt.grid(linestyle='dotted')
plt.axis('tight')
plt.xlabel('Time ($\mu$s)')
plt.ylabel('Amplitude (Volts)')
plt.title('Trigger (blue) and Data (red)')

plt.savefig(filename)
print(filename)
#+end_src

#+RESULTS:
[[file:data_plot.png]]

Data looks nice and amplitudes agree with what was observed in the oscilloscope
display. Therefore out reading method works.

Some statistics from the data can now be computed, once we can trust the data
was correctly read.
#+begin_src python :results output :session *python* :exports both
avg_data = np.mean(data2)
med_data = np.median(data2)
std_data = np.std(data2,ddof=1)
min_data = np.amin(data2)
max_data = np.amax(data2)
#------------------------
print('Data average (Volts): ',avg_data)
print('Data median (Volts): ',med_data)
print('Data standard deviation (Volts): ',std_data)
print('Data minimum (Volts): ',min_data)
print('Data maximum (Volts): ',max_data)
#+end_src

#+RESULTS:
: Data average (Volts):  0.0019089614972975556
: Data median (Volts):  0.004743201433087485
: Data standard deviation (Volts):  0.039513536115368106
: Data minimum (Volts):  -0.1612688487249741
: Data maximum (Volts):  0.15652564729188667

The data was recorded when the input channel was turned off, thus from that we
can estimated the signature "noise" of the system when the input will be on.
For that we will look at the histogram and see if looks like a Normal/Gaussian
continuous probability distribution.
#+begin_src python :results output file :var filename="data_hist.png" :exports both :session *python*
from scipy.stats import norm

plt.figure()
plt.hist(data2,color='b',histtype='bar',ec='black',density=True)

x = np.linspace(min_data,max_data,idata4)
p = norm.pdf(x,avg_data,std_data)

plt.plot(x,p,'k',linewidth=2)

plt.grid(linestyle='dotted')
plt.axis('tight')

title = "Fitting Gaussian to histogram: $\mu$ = %.5e, $\sigma$ = %.5e" % (avg_data,std_data)
plt.title(title)

plt.savefig(filename)
print(filename)
#+end_src

#+RESULTS:
[[file:data_hist.png]]

At last we can use the D'Agostino's K$^2$ test to infer goodness-of-fit or if
the assumption of a Normal distribution really holds.
#+begin_src python :results output file :var filename="data_plot.png" :exports both :session *python*
from scipy.stats import normaltest
alpha = 1e-3
k2, p = normaltest(data2)
print("p = {:.10e}".format(p))
#+end_src

#+RESULTS:
[[file:p = 0.0000000000e+00]]

Since the p-value is (way) smaller than the alpha value we can conclude the null
hypothesis can be rejected, that is, a Gaussian distribution does not represent
the data.