Reading and Analyzing some experimental hdf5 data

Table of Contents

I will read and get the statistics of some data I usually acquire in the lab. I will use this log as a future reference to when I need to read hdf5 files contained my data.

1 Data reading using h5py package

To read the data we make use of the h5py package which contains most of what might be necessary to read hdf5 files.

import h5py as h5
f = h5.File('testdata.h5','r')
print(f.keys())
<KeysViewHDF5 ['FileType', 'Frame', 'Waveforms']>

Now that we know what is inside, let’s try to see which waveforms we have.

print(f['Waveforms'].keys())
<KeysViewHDF5 ['Channel 1', 'Channel 2']>

We have two channels, thus another subgroup. Gotta investigate deeper:

print(f['Waveforms/Channel 1'].keys())
print('---')
print(f['Waveforms/Channel 2'].keys())
<KeysViewHDF5 ['Channel 1Data']>
---
<KeysViewHDF5 ['Channel 2Data']>

Now we can get the attributes of the recorded channels. They are going to be necessary to properly map the data to a numpy array.

print(f['Waveforms/Channel 1'].attrs.keys())
<KeysViewHDF5 ['Count', 'DispInterpFactor', 'InterpSetting', 'MaxBandwidth', 'MinBandwidth', 'NumPoints', 'NumSegments', 'SavedInterpFactor', 'Start', 'WavAttr', 'WaveformType', 'XDispOrigin', 'XDispRange', 'XInc', 'XOrg', 'XUnits', 'YDispOrigin', 'YDispRange', 'YInc', 'YOrg', 'YUnits']>

Besides the channels we’ve read, a time series should be created. Each of its elements should correspond to a data sample. To do that we will use the information stored in one of the channels (supposing they were acquired using the same oscilloscope config).

import numpy as np
import matplotlib.pyplot as plt

idata1 = f['Waveforms/Channel 1'].attrs.get('XDispRange')
# XDispRange and YDispRange are the total range covered in the oscilloscope display.
idata2 = f['Waveforms/Channel 1'].attrs.get('XOrg')
# XOrg is the x-origin of the data and will be the same as XDispOrigin if the saved
# data is constrained to the displayed area.
idata3 = f['Waveforms/Channel 1'].attrs.get('XInc')
# Xinc is the incremental value of X, or the sampling period (i.e. XDispRange/NumPoints).
idata4 = f['Waveforms/Channel 1'].attrs.get('NumPoints')
temp1 = idata3*(np.arange(1,idata4+1))+idata2

plt.figure()
plt.plot(temp1,'b')
plt.grid(linestyle='dotted')
plt.axis('tight')
plt.title('Times series')

plt.savefig(filename)
print(filename)

time_plot.png

Once the time array has been created we can move to reading the data.

yinc = f['Waveforms/Channel 1'].attrs.get('YInc') # Vertical sampling period
yorg = f['Waveforms/Channel 1'].attrs.get('YOrg') # Closest value to origin (y=0)
data1 = yorg + yinc*f['Waveforms/Channel 1/Channel 1Data']
yinc = f['Waveforms/Channel 2'].attrs.get('YInc')
yorg = f['Waveforms/Channel 2'].attrs.get('YOrg')
data2 = yorg + yinc*f['Waveforms/Channel 2/Channel 2Data']
print('Data type data:', data1.dtype)
Data type data: float64

Now we can plot the data to see it.

plt.close('all')
plt.figure()
plt.plot(temp1*1e6,data1,'b')
plt.plot(temp1*1e6,data2,'r')
plt.grid(linestyle='dotted')
plt.axis('tight')
plt.xlabel('Time ($\mu$s)')
plt.ylabel('Amplitude (Volts)')
plt.title('Trigger (blue) and Data (red)')

plt.savefig(filename)
print(filename)

data_plot.png

Data looks nice and amplitudes agree with what was observed in the oscilloscope display. Therefore out reading method works.

Some statistics from the data can now be computed, once we can trust the data was correctly read.

avg_data = np.mean(data2)
med_data = np.median(data2)
std_data = np.std(data2,ddof=1)
min_data = np.amin(data2)
max_data = np.amax(data2)
#------------------------
print('Data average (Volts): ',avg_data)
print('Data median (Volts): ',med_data)
print('Data standard deviation (Volts): ',std_data)
print('Data minimum (Volts): ',min_data)
print('Data maximum (Volts): ',max_data)
Data average (Volts):  0.0019089614972975556
Data median (Volts):  0.004743201433087485
Data standard deviation (Volts):  0.039513536115368106
Data minimum (Volts):  -0.1612688487249741
Data maximum (Volts):  0.15652564729188667

The data was recorded when the input channel was turned off, thus from that we can estimated the signature “noise” of the system when the input will be on. For that we will look at the histogram and see if looks like a Normal/Gaussian continuous probability distribution.

from scipy.stats import norm

plt.figure()
plt.hist(data2,color='b',histtype='bar',ec='black',density=True)

x = np.linspace(min_data,max_data,idata4)
p = norm.pdf(x,avg_data,std_data)

plt.plot(x,p,'k',linewidth=2)

plt.grid(linestyle='dotted')
plt.axis('tight')

title = "Fitting Gaussian to histogram: $\mu$ = %.5e, $\sigma$ = %.5e" % (avg_data,std_data)
plt.title(title)

plt.savefig(filename)
print(filename)

data_hist.png

At last we can use the D’Agostino’s K\(^2\) test to infer goodness-of-fit or if the assumption of a Normal distribution really holds.

from scipy.stats import normaltest
alpha = 1e-3
k2, p = normaltest(data2)
print("p = {:.10e}".format(p))

p = 0.0000000000e+00

Since the p-value is (way) smaller than the alpha value we can conclude the null hypothesis can be rejected, that is, a Gaussian distribution does not represent the data.

Date: 2020-04-28 Tuesday

Author: Victor Martins Gomes

Created: 2020-04-29 mer. 15:24