Binaural sound source localization - Correlation in the time domain algorithm

Note: sample code for this algorithm can be found here.

The simplest implementation of the Jeffress model is a cross-correlation of the input signals recorded from the two microphones (\mathrm{L}(k) and \mathrm{R}(k) being the k-th sample from the left and right microphone, respectively):

C(n,\tau)=\sum_{k=0}^{n}\mathrm{L}(k)\,\mathrm{R}(k-\tau)

The cross-correlation is a measure of the similarity between two signals (in our case \mathrm{L} and \mathrm{R}). It works by shifting the signals against each other by the time lag \tau, multiplying the samples from \mathrm{L} and \mathrm{R} and finally summing up the results. Here, n represents the current time, i.e. n samples have been recorded from \mathrm{L} and \mathrm{R}.
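The shift, multiply and sum steps can be sketched in a few lines of Python. This is a minimal illustration, not the referenced sample code: the function name `cross_correlation`, the toy signals, and the choice to pair \mathrm{L}(k) with \mathrm{R}(k-\tau) (so that a positive \tau means the right signal leads) are assumptions, as is treating samples outside the recording as zero.

```python
def cross_correlation(left, right, tau):
    """Sum of left[k] * right[k - tau] over all recorded samples.

    Samples whose shifted index falls outside the recording are treated
    as zero (an assumption of this sketch).
    """
    total = 0.0
    for k in range(len(left)):
        j = k - tau
        if 0 <= j < len(right):
            total += left[k] * right[j]
    return total


# Toy signals: the left channel is the right channel delayed by 2 samples,
# i.e. the right signal leads.
right = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0]
left = [0.0, 0.0] + right[:-2]

# Evaluate the correlation for a small range of lags and pick the largest.
scores = {tau: cross_correlation(left, right, tau) for tau in range(-3, 4)}
best_tau = max(scores, key=scores.get)
print(best_tau, scores[best_tau])  # → 2 14.25
```

The correlation peaks at the lag that realigns the two channels, here the 2-sample delay that was built into the toy signals.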

As it is, the above equation weights every recorded sample equally. In order for the sound localization to prefer more recent information, a "forgetting average" is added via exponential decay:

C(n,\tau)=\sum_{k=0}^{n}\mathrm{L}(k)\,\mathrm{R}(k-\tau)\,e^{-(n-k)/T_{int}}

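The expression e^{-(n-k)/T_{int}} has its maximal value of 1 for k=n, i.e. the current sample is weighted maximally. Samples further in the past (k<n) are weighted less. The rate of decay is controlled by the parameter T_{int}, which is measured in samples because n-k counts samples: the larger T_{int}, the more slowly old samples are forgotten.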
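To get a feeling for the decay, the weight can be evaluated for a few sample ages. The value T_{int}=100 samples below is purely hypothetical and chosen for readability; the helper name `forgetting_weight` is likewise an assumption.

```python
import math


def forgetting_weight(n, k, t_int):
    """Weight e^{-(n-k)/T_int} applied to sample k at current time n."""
    return math.exp(-(n - k) / t_int)


T_INT = 100.0  # decay constant in samples (hypothetical value)
n = 1000       # current time in samples

for age in (0, 100, 300):
    w = forgetting_weight(n, n - age, T_INT)
    print(f"sample {age:3d} steps in the past: weight {w:.4f}")
```

A sample that is T_{int} steps old is weighted by e^{-1}, roughly 0.37, and one that is 3·T_{int} steps old by roughly 0.05.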

One could implement the above equation directly in a computer program, but that would be rather awkward. The whole sum has to be recomputed from k=0 for every new n, and n grows quickly: at a sampling rate of 44.1 kHz, n equals 44100 after one second. Additionally, every recorded sample would have to be stored. Assuming 16 bit (2 bytes) precision, the two audio channels produce 2*2*44100=176400 bytes, i.e. approximately 172 kB, per second. After 1 minute, about 10 MB have been recorded, and after 1 hour about 605 MB of audio data would have accumulated. Clearly, this is impractical.
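The storage figures can be checked with a short calculation. The helper name `storage_bytes` is an assumption of this sketch; kB and MB are taken as 1024 and 1024² bytes, which is what reproduces the numbers in the text.

```python
def storage_bytes(seconds, fs_hz=44100, bytes_per_sample=2, channels=2):
    """Bytes needed to keep every sample of every channel."""
    return seconds * fs_hz * bytes_per_sample * channels


print(storage_bytes(1))                    # → 176400
print(int(storage_bytes(1) / 1024))        # → 172 (kB after one second)
print(int(storage_bytes(60) / 1024**2))    # → 10  (MB after one minute)
print(int(storage_bytes(3600) / 1024**2))  # → 605 (MB after one hour)
```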

Fortunately, the cross-correlation can quite easily be expressed as a recursive function:

C(n,\tau)=C(n-1,\tau)\,e^{-1/T_{int}}+\mathrm{L}(n)\,\mathrm{R}(n-\tau)

with the initial value:

C(0,\tau)=\mathrm{L}(0)\,\mathrm{R}(-\tau)

As can be seen, for a given lag \tau, the cross-correlation now depends only on its value from the previous time step and on the current samples from the left and right microphones (shifted against each other by \tau). There is no longer any need to recompute the whole sum at every time step or to store every recorded sample; only the most recent samples spanning the lag range have to be kept.
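That the recursive form yields the same value as the full weighted sum can be checked numerically. The sketch below works on short pre-recorded lists; the names `weighted_sum` and `recursive_step`, the toy signals, the value of T_{int}, the pairing of \mathrm{L}(n) with \mathrm{R}(n-\tau), and the zero treatment of out-of-range samples are all assumptions made for illustration.

```python
import math


def weighted_sum(left, right, tau, n, t_int):
    """Direct evaluation: every term up to n is recomputed."""
    total = 0.0
    for k in range(n + 1):
        j = k - tau
        if 0 <= j < len(right):  # out-of-range samples count as zero
            total += left[k] * right[j] * math.exp(-(n - k) / t_int)
    return total


def recursive_step(previous, left_n, right_shifted, t_int):
    """One update: decay the old value, then add the newest product."""
    return previous * math.exp(-1.0 / t_int) + left_n * right_shifted


left = [0.0, 0.0, 0.0, 1.0, 3.0, -2.0, 0.5, 0.0]
right = [0.0, 1.0, 3.0, -2.0, 0.5, 0.0, 0.0, 0.0]
T_INT = 4.0  # decay constant in samples (hypothetical value)
tau = 2

c = 0.0  # nothing has been accumulated before the first sample
for n in range(len(left)):
    j = n - tau
    shifted = right[j] if 0 <= j < len(right) else 0.0
    c = recursive_step(c, left[n], shifted, T_INT)

direct = weighted_sum(left, right, tau, len(left) - 1, T_INT)
print(abs(c - direct) < 1e-12)
```

The recursion works because multiplying the previous value by e^{-1/T_{int}} ages every term already in the sum by exactly one sample.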

For sound source localization purposes, we want to find out what time difference was present in the input signals. In order to do that, the cross correlation has to be computed for every possible value of \tau. The \tau for which the correlation function reaches a maximum corresponds to the sought-after time difference.

The range of possible values for \tau corresponds to the physiological range of interaural time differences. For a simple setup of two omnidirectional microphones, it is the time a sound wave needs to travel from one microphone to the other (this corresponds to a sound source positioned at ±90°):

\mathrm{ITD}_{max}=\frac{b}{c}

where \mathrm{ITD}_{max} is the maximal interaural time difference (in seconds), b is the microphone baseline (the distance between the microphones) in metres and c is the speed of sound (in m/s). For a microphone distance of 21.5 cm and a speed of sound of 344 m/s, the maximal ITD is 625 µs.
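As a quick check of the numbers, the maximal ITD can be computed and converted to a lag range in samples. The speed of sound of 344 m/s is the value implied by the 21.5 cm baseline and the 625 µs result; the 44.1 kHz sampling rate is borrowed from the earlier storage example, and the function name is an assumption.

```python
import math


def max_itd_seconds(baseline_m, speed_of_sound_mps):
    """Travel time of a sound wave from one microphone to the other."""
    return baseline_m / speed_of_sound_mps


itd_max = max_itd_seconds(0.215, 344.0)
print(f"{itd_max * 1e6:.0f} microseconds")  # → 625 microseconds

# Largest whole-sample lag that still lies inside the ITD range.
fs_hz = 44100
max_lag_samples = math.floor(itd_max * fs_hz)
print(max_lag_samples)  # → 27
```

At this sampling rate the correlation therefore only needs to be evaluated for 55 integer lags, from -27 to +27.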

The correlation will actually have to be computed for values of \tau ranging from -\mathrm{ITD}_{max} to +\mathrm{ITD}_{max}, because if a sound source is positioned to the right, the sound will first arrive at the right microphone. Conversely, if a source is positioned to the left, its sound wave will first reach the left microphone.

By convention, the right signal leading (source to the right) will produce positive ITDs and the left signal leading (source to the left) will produce negative ITDs.
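Putting the pieces together, a bank of recursively updated correlators (one per lag) can be sketched as follows. This is an illustrative assumption rather than the referenced sample code: the class `LagBank`, its parameters, and the test signal are hypothetical. To stay causal, a negative \tau is realised by delaying the left channel instead of looking ahead in the right one, which pairs the same samples as \mathrm{L}(k)\,\mathrm{R}(k-\tau), only a few steps later.

```python
import math
import random
from collections import deque


class LagBank:
    """One exponentially forgetting correlator per integer lag."""

    def __init__(self, max_lag, t_int):
        self.decay = math.exp(-1.0 / t_int)
        self.corr = {tau: 0.0 for tau in range(-max_lag, max_lag + 1)}
        # Short delay lines holding the last max_lag + 1 samples per channel.
        self.left_hist = deque([0.0] * (max_lag + 1), maxlen=max_lag + 1)
        self.right_hist = deque([0.0] * (max_lag + 1), maxlen=max_lag + 1)

    def update(self, left_sample, right_sample):
        self.left_hist.append(left_sample)
        self.right_hist.append(right_sample)
        for tau in self.corr:
            if tau >= 0:   # right leads: newest left, delayed right
                product = self.left_hist[-1] * self.right_hist[-1 - tau]
            else:          # left leads: delayed left, newest right
                product = self.left_hist[-1 + tau] * self.right_hist[-1]
            self.corr[tau] = self.corr[tau] * self.decay + product

    def best_lag(self):
        """Lag (in samples) at which the correlation is currently maximal."""
        return max(self.corr, key=self.corr.get)


# Test signal: seeded white noise, with one channel delayed by 3 samples.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]
delayed = [0.0] * 3 + source[:-3]

right_leads = LagBank(max_lag=5, t_int=500.0)
for l, r in zip(delayed, source):   # left is delayed, so right leads
    right_leads.update(l, r)

left_leads = LagBank(max_lag=5, t_int=500.0)
for l, r in zip(source, delayed):   # right is delayed, so left leads
    left_leads.update(l, r)

print(right_leads.best_lag(), left_leads.best_lag())
```

With the right channel leading, the maximum appears at a positive lag, and with the left channel leading at a negative one, matching the sign convention above. Memory use is bounded by the lag range rather than by the recording length.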

Once the value \tau which maximizes the correlation has been found, it can be plugged into the equation relating ITD to azimuth to compute the angle to the sound source.

As the value \tau will usually represent the time difference in samples (and not in seconds), it first has to be divided by the sampling frequency:

\alpha=\arcsin{\frac{\Delta k/f_s\cdot c}{b}}

where \Delta k is the time difference in samples (the value \tau for which the correlation is maximal), f_s is the sampling frequency in hertz and b and c are the microphone distance (in metres) and speed of sound (in metres per second), respectively.
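The conversion from a lag in samples to an azimuth can be sketched as below. The function name and the example values (44.1 kHz, 21.5 cm, 344 m/s) are assumptions carried over from the earlier examples. Clamping the argument of the arcsine to [-1, 1] is an addition of this sketch, guarding against lags that slightly exceed the physical range because of noise or rounding.

```python
import math


def azimuth_degrees(lag_samples, fs_hz, baseline_m, speed_of_sound_mps):
    """Azimuth from a lag in samples: arcsin((dk / fs) * c / b)."""
    x = (lag_samples / fs_hz) * speed_of_sound_mps / baseline_m
    x = max(-1.0, min(1.0, x))  # clamp (assumption, not in the equation)
    return math.degrees(math.asin(x))


for lag in (-27, -10, 0, 10, 27):
    angle = azimuth_degrees(lag, 44100, 0.215, 344.0)
    print(f"lag {lag:+3d} samples: azimuth {angle:+.1f} degrees")
```

Because the lag is an integer number of samples, the angular resolution is limited by the sampling rate: the steps are fine near 0° and become coarse towards ±90°, where the arcsine is steep.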