Make Data for Testing Statistical Filters

This page discusses the data to be used in subsequent work to develop means of reducing random noise in time series data.

The data are described first, followed by an explanation of how this data set came about.

Data Description

The data represents a number of noisy sine waves, as shown in the plot to the right. It has 300 rows and 3 columns.

The rows

  • The rows represent a sequence of values sampled at regular intervals from a continuous sine wave.
  • There are 300 rows for 5 sine waves. Each wave contains 60 data points at 6-degree intervals (0 to 360 degrees), 30 for the peak and 30 for the trough.
  • The value for each row is generated as follows
    1. The sine value, between -1 and 1, for that degree is obtained
    2. The value is transformed to a mean of μ=100 and SD σ=10
    3. The noise is added. The noise is a normally distributed random number with a mean of 0 and an assigned SD. A coefficient, which is a multiple of the SD of the data set, controls the level of noise. The coefficient for the current data is 3, so the noise is a normally distributed number with a mean of 0 and an SD of 3 x 10 = 30
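
The three steps above can be sketched in a few lines. This is only an illustration, not the program used for this page: it assumes the standard library's random.Random in place of numpy.random, and the names make_noisy_values, noise_coef and seed are hypothetical.

```python
import math
import random
import statistics


def make_noisy_values(n_cycles=5, n_points=60, to_mean=100, to_sd=10,
                      noise_coef=3, seed=None):
    """Return (clean, noisy) lists following steps 1 to 3 above."""
    rng = random.Random(seed)          # seed is optional, for repeatable runs
    step = 360 / n_points              # 6 degrees when n_points is 60

    # Step 1: sine value (-1 to 1) for each sampled degree
    sines = [math.sin(math.radians((i + 1) * step))
             for i in range(n_cycles * n_points)]

    # Step 2: rescale to the requested mean and SD
    m = statistics.mean(sines)
    s = statistics.stdev(sines)
    clean = [(v - m) / s * to_sd + to_mean for v in sines]

    # Step 3: add normal noise, mean 0, SD = noise_coef x SD of the clean data
    noise_sd = noise_coef * statistics.stdev(clean)
    noisy = [v + rng.gauss(0, noise_sd) for v in clean]
    return clean, noisy


clean, noisy = make_noisy_values(seed=1)
print(len(noisy), round(statistics.mean(clean), 4), round(statistics.stdev(clean), 4))
```

With the default arguments the noiseless values have a mean of 100 and an SD of 10 by construction, so the noise SD works out to 3 x 10 = 30.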

The columns

There are 3 columns
  1. The sequence value, from 1 to 300
  2. The group designation, false (F) if the data point is from the trough of the sine wave, and true (T) if the data is from the peak of the sine wave.
  3. The value: the sine value, transformed to a mean of μ=100 and standard deviation σ=10, with randomly generated noise added

Reasoning and Computer Program

# -*- coding: utf-8 -*-
"""
Make Data.py    To create test data for Simon's project

2024/01/20
"""

import math
import numpy.random
import statistics

def RandomNormal(mean, sd):
    return numpy.random.normal(mean,sd)


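def AddNoise(sourceAr, numSD):
    # the noise SD is numSD times the SD of the source data
    sd = statistics.stdev(sourceAr)
    noiseLevel = numSD * sd
    print(sd,numSD,noiseLevel)      # report the noise level actually used
    resAr = []
    for v in sourceAr:
        # each value gets its own normally distributed noise, mean 0
        resAr.append(v + RandomNormal(0, noiseLevel))
    return resAr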
        

"""
nCycles, number of cycles, each cycle is 360 degrees, so a pos and a neg wave
nPoints, number of data points for each cycle
toMean and toSD are the mean and SD that the sine values (-1 to 1) translate to

The length of the data is nCycles x nPoints
"""
def MakeSineWaves(nCycles, nPoints, toMean, toSD):
    order = []
    groupNames = []
    groupNumbers = []
    sines = []
    intv = 360 / nPoints
    k = intv
    n = 1
    for i in range(nCycles):
        for j in range(nPoints):
            degree = k % 360
            x = math.sin(math.radians(degree))
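            grpName = "F"
            grpNum = 0
            # note: at 180 degrees math.sin returns a tiny positive number
            # (about 1.2e-16) rather than 0, so that point is grouped as T
            if x>0:
                grpName = "T"
                grpNum = 1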
            order.append(n)
            groupNames.append(grpName)
            groupNumbers.append(grpNum)
            sines.append(x)
            k += intv
            n += 1
    mean = statistics.mean(sines)
    sd = statistics.stdev(sines)
    newVals = []
    for v in sines:
        newVals.append(((v - mean) / sd) * toSD + toMean)
    return order, groupNames, groupNumbers, sines, newVals
    
        
    
if __name__ == "__main__":
    nCycles = 5             # number of cycles
    nPoints = 60          # each cycle divided into 60 data points
    toMean = 100          # mean and SD
    toSD = 10
    
    order, groupNames, groupNumbers, sines, newVals = \
                         MakeSineWaves(nCycles, nPoints, toMean, toSD)
    
    noise_20 = AddNoise(newVals, 2)
    noise_30 = AddNoise(newVals, 3)
    noise_40 = AddNoise(newVals, 4)
    noise_50 = AddNoise(newVals, 5)
    
    # one tab-separated row per data point
    for i in range(len(order)):
        print(order[i], "\t", groupNames[i], "\t", groupNumbers[i], "\t",
              "%.4f" % sines[i], "\t",  "%.4f" % newVals[i], "\t",
              "%.4f" % noise_20[i], "\t",  "%.4f" % noise_30[i], "\t",
              "%.4f" % noise_40[i], "\t",  "%.4f" % noise_50[i])

The Python program that produced the data demonstrated above is shown in the panel to the right. The remainder of this page describes the thinking behind and leading to this program.

It began with the idea of exploring how to clean up and interpret a sequence of numbers sampled from an analog signal. The model is to sample a continuous electrical signal and convert this into digital bipolar values of 0/1.

I have in mind two conceptual processes

It is also envisaged that large quantities of data will be repeatedly required for this exercise, firstly to find a suitable set of data to act as the model while exploring alternative strategies. More importantly, if what appears to be a successful strategy emerges, there is a need to test it repeatedly for robustness (not making wrong interpretations) and sensitivity (being able to detect the underlying signal).

I therefore decided to produce a short program with changeable parameters. As the data is generated by random numbers but controlled by its parameters, similar but different sets of data can thus be generated quickly. The program developed should be able to produce data with the following characteristics

The Python program listed to the right was therefore produced to fulfil these capabilities.
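
The program as listed does not fix a random seed, so every run gives a different set. If a batch of similar but different sets needs to be reproduced later, one option is a seeded generator. The sketch below is an assumption on my part rather than part of the program: it uses the standard library's random.Random instead of numpy.random, and noisy_copies, n_sets and seed are hypothetical names.

```python
import random
import statistics


def noisy_copies(clean, noise_sd, n_sets, seed=0):
    """Return n_sets noisy versions of the same clean series."""
    rng = random.Random(seed)   # one seeded generator makes the whole batch repeatable
    return [[v + rng.gauss(0, noise_sd) for v in clean] for _ in range(n_sets)]


clean = [100.0] * 300           # placeholder signal, for illustration only
batch = noisy_copies(clean, noise_sd=30, n_sets=4, seed=42)
for k, data in enumerate(batch):
    print(k, round(statistics.mean(data), 2), round(statistics.stdev(data), 2))
```

Calling noisy_copies again with the same seed returns the same batch, while a different seed gives a fresh batch with the same parameters.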

Choosing a Modelling Set of Data

The parameters looked for are as follows:

At this stage, I had no idea of the relative frequency of the waves to detect and the sampling rate, so I had to do some trial and error to see what seemed to work.

Description of the Modelling Data

                F          T          All
n               150        150        300
Set_0   mean    91.0201    108.9799   100
        SD      4.3767     4.3767     10
Set_3   mean    92.0824    112.5713   102.3268
        SD      28.9289    26.8896    29.7096

The data containing the sine wave without noise is named Set_0, and the modelling data, with the SD of the noise 3 times the SD of the noiseless data, is named Set_3. The sequences of values in the two sets, yellow for Set_0 and black for Set_3, are shown in the first of the plots to the left, and the basic statistical description of the data is shown in the table. The plot shows that the incorporation of noise increases the variability of the signal, so the range of the values has increased.
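
A summary of this kind (n, mean and SD for F, T and All) can be computed with a few lines. The sketch below assumes the sample SD (statistics.stdev), as used in the program, and the function name describe and the six example values are made up for illustration; they are not the data of this page.

```python
import statistics


def describe(groups, values):
    """Return {label: (n, mean, sd)} for each group label and for 'All'."""
    table = {}
    for label in sorted(set(groups)):
        subset = [v for g, v in zip(groups, values) if g == label]
        table[label] = (len(subset), statistics.mean(subset), statistics.stdev(subset))
    table["All"] = (len(values), statistics.mean(values), statistics.stdev(values))
    return table


# Tiny made-up example, not the data of this page
groups = ["T", "T", "T", "F", "F", "F"]
values = [108.0, 110.0, 112.0, 88.0, 90.0, 92.0]
for label, (n, mean, sd) in describe(groups, values).items():
    print(label, n, "%.4f" % mean, "%.4f" % sd)
```

Passing the group column and one of the value columns printed by the program should give one line each for F, T and All.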

Only the data of Set_3 are shown in the second plot to the left. The blue circles are the values that were above the midline (v=100) of the noiseless sine wave, and are designated group true (T). The red circles are the values at or below the midline of the original sine wave, and are designated group false (F). This plot shows how the incorporation of noise increases the overlap of the data in the two groups.

The difference between the F (red) and T (blue) groups is also shown in the normal distribution plot to the right. It can be seen that the spread of the data in the two groups is similar and the modes are different, but there are large overlaps because of the high noise level.

The distributions of data from the two groups (F and T) in the two sets (Set_0 to the left and Set_3 to the right) are further demonstrated in the plot below and to the left, and in the table above.

It can be seen that the sine wave data has a bounded, non-normal distribution, and the two groups are defined by their values, so there is no overlap.

With the addition of noise, and particularly if the noise is random but normally distributed, the range of measurements is much increased, and the data in the 2 groups are now more normally distributed and overlap considerably.
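
The size of this overlap can be roughly estimated from the Set_3 means and SDs in the table. The sketch below treats each group as exactly normal, which is only an approximation, and uses statistics.NormalDist from the Python standard library (version 3.8 or later); the variable names are mine.

```python
from statistics import NormalDist

# Group means and SDs of Set_3, copied from the table on this page
f_group = NormalDist(mu=92.0824, sigma=28.9289)
t_group = NormalDist(mu=112.5713, sigma=26.8896)

midline = 100
f_above = 1 - f_group.cdf(midline)   # share of F expected above the midline
t_below = t_group.cdf(midline)       # share of T expected at or below the midline
overlap = f_group.overlap(t_group)   # shared area of the two fitted curves (0 to 1)

print("F above midline: %.3f" % f_above)
print("T below midline: %.3f" % t_below)
print("Overlap of fitted normals: %.3f" % overlap)
```

Under this approximation, well over a third of the F values would be expected to fall above the midline, which is why a simple cut at v=100 would misclassify many individual points.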

Thinking backwards, the real signal inside the noise is likely to be of smaller amplitude than the raw signal, which includes the noise, and the difference between the two poles is very much smaller than what appears in the signal.
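
One way to see this is that, when the noise is independent of the signal, the variances add. The short calculation below is a sketch using the parameters of Set_3 (signal SD 10, noise SD 30); the variable names are mine.

```python
import math

signal_sd = 10    # SD of the noiseless data (Set_0)
noise_sd = 30     # SD of the added noise (coefficient 3)

# Independent signal and noise: variances add, SDs do not
expected_total_sd = math.sqrt(signal_sd ** 2 + noise_sd ** 2)
signal_share = signal_sd ** 2 / expected_total_sd ** 2

print("Expected SD of noisy data: %.2f" % expected_total_sd)
print("Share of variance due to signal: %.2f" % signal_share)
```

The expected SD is about 31.62, in the same region as the 29.7096 observed for Set_3 in the table, and the signal accounts for only about a tenth of the total variance.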

Comments

I suspect that the limiting factors in translating analog signals to bipolar values are:

As I have no information on what these parameters might be in a set of real data, the modelling set (Set_3) is meant only for initial development, until better parameters or data produced by machinery or electronics are available.

I shall therefore proceed with what I have got (Set_3), and would be grateful for your comments and suggestions for change.