Augmented reality in sound is used in hearing aids, surveillance, and other applications that harness the power of sound. Audio has not received as much attention as other forms of information, such as images and text. However, there is still much value in audio - whether for journalists in the field recording interviews, in walkie-talkies, or in improving speech-to-text. The main reason audio is not used as much for information retrieval is that speech-to-text systems often fail to convert audio files to text, much of which is due to background noise.

Intuitively, the blind source separation (BSS) problem is: given a mixed signal, how can you extract the original signals? Technically, the BSS problem can be written as $$X = As$$ where $s$ is the original signals, $A$ is the mixing matrix, and $X$ is the mixed signals. The aim is to find a matrix $W$ that inverts $A$, so that $\hat{s} = WX$ recovers the sources. The product $$P = WA$$ is the gauge of how good $W$ is. Ideally, $P$ is $I$. However, as the separated sources may not come out in the same order and at the same scale as the original sources, $P$ should ideally be an identity matrix up to a permutation and scaling. See the repository with all the code here
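To make the permutation-and-scale ambiguity concrete, here is a small numpy sketch (my own toy example, not from the repository): it mixes two synthetic sources and inspects the product $P = WA$, which is what the estimated sources $\hat{s} = WX$ effectively see.

```python
import numpy as np

# Toy instance of the BSS model X = A s with two sources, two mixtures.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))          # two super-Gaussian "sources"
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])               # mixing matrix
X = A @ s                                # observed mixtures

# A perfect unmixing matrix gives P = W A = I.
W = np.linalg.inv(A)
P = W @ A
print(np.allclose(P, np.eye(2)))         # True

# BSS can only recover sources up to permutation and scale: a W that
# swaps and rescales the outputs is just as valid, and P reflects that.
W_perm = np.array([[0.0, 2.0],
                   [1.0, 0.0]]) @ W      # swap rows, scale one by 2
P_perm = W_perm @ A
print(P_perm)                            # a scaled permutation, not I
```

Both `W` and `W_perm` separate the sources perfectly; only the ordering and amplitudes differ, which is why evaluation must compare $P$ to a permutation matrix rather than strictly to $I$.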
I had never dealt with digital signal processing before, so the below is a tour through the domain knowledge.
First, let us look at the Fourier transform.
The Fourier transform reveals the structure behind signals, decomposing them into their constituent sinusoids.
Here is the Fourier transform (both the DFT and the FFT that I implemented) running on a signal as a sanity check. In the post-FFT plot, the x-axis is frequency and the y-axis is amplitude.
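The full code is linked further down; as a minimal sketch, a naive $O(N^2)$ DFT and a radix-2 Cooley-Tukey FFT can be written and sanity-checked against numpy like this (my own stand-in versions, on a two-tone test signal):

```python
import numpy as np

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

def fft(x):
    """Radix-2 Cooley-Tukey FFT; length must be a power of two."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N <= 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    twiddle = np.exp(-2j * np.pi * np.arange(N // 2) / N)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

# Sanity check on a two-tone signal: both agree with numpy's FFT.
t = np.arange(256) / 256.0
sig = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
print(np.allclose(dft(sig), np.fft.fft(sig)))  # True
print(np.allclose(fft(sig), np.fft.fft(sig)))  # True
```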
The Fourier transform by itself does not do blind source separation, but it is a crucial building block.
Here is the spectrum of the crowded-bar audio file above.
Find the code for FFTs here
Let us start with some definitions of the types of priors that are examined when evaluating separability:
- Kurtosis
- Entropy
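As a concrete illustration of both priors (my own sketch, with a Laplacian signal as a stand-in for heavy-tailed speech), they can be estimated directly from samples:

```python
import numpy as np

def kurtosis(x):
    """Excess kurtosis: ~0 for a Gaussian, > 0 for peaked
    (super-Gaussian) signals like speech, < 0 for flat ones."""
    x = (x - x.mean()) / x.std()
    return np.mean(x**4) - 3.0

def entropy(x, bins=64):
    """Shannon entropy (bits) of a histogram estimate of the amplitude
    distribution of the signal."""
    p, _ = np.histogram(x, bins=bins)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
gaussian = rng.normal(size=100_000)
speechy = rng.laplace(size=100_000)      # heavy-tailed, speech-like
print(kurtosis(gaussian))                # close to 0
print(kurtosis(speechy))                 # close to 3: super-Gaussian
```

The separation algorithms below exploit exactly this contrast: a mixture of independent sources looks more Gaussian than the sources themselves, so maximizing non-Gaussianity (high kurtosis, low entropy) pulls the sources back apart.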
To test, we will artificially mix various sound samples. Insert interactive audio mixer here --> {TO DO} Here, we have two sounds - let us mix them together with this code
Man with threatening voice (Voice 2)
Mixed Audio (Microphone 1)
Mixed Audio (Microphone 2)
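The mixing step can be sketched as follows. Synthetic signals stand in for the actual recordings (in the real experiment the voices would be loaded from WAV files, e.g. with `scipy.io.wavfile.read`), and the mixing matrix `A` here is an assumed example, not the one used above:

```python
import numpy as np

# Stand-ins for the two voice recordings, one second at 16 kHz.
rng = np.random.default_rng(0)
n = 16000
voice1 = np.sin(2 * np.pi * 220 * np.arange(n) / 16000)  # stand-in tone
voice2 = rng.laplace(size=n) * 0.1                       # noisy stand-in

# Each "microphone" hears a different weighted sum of the two voices
# (instantaneous mixing, no delays or echoes).
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])               # assumed example mixing matrix
mic1, mic2 = A @ np.vstack([voice1, voice2])
```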
I will draw guidelines from this paper, looking at distortion. By distortion we mean how much the original signals are distorted in the mixed signals, measured in the absence of the other source signals. The equations are below.
And here is the code
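As a rough stand-in, here is one plausible scale-invariant distortion measure: project the estimate onto the original and report the residual-to-signal power ratio. This is an assumption on my part, not necessarily the paper's exact definition (its equations are above):

```python
import numpy as np

def distortion(original, estimate):
    """Sketch of a scale-invariant distortion measure (my assumption,
    not the paper's exact equations): best least-squares scaling of the
    original to the estimate, then residual power over signal power."""
    alpha = np.dot(estimate, original) / np.dot(original, original)
    residual = estimate - alpha * original
    return np.sum(residual**2) / np.sum(original**2)

rng = np.random.default_rng(0)
s = rng.normal(size=1000)
print(distortion(s, 3.0 * s))            # ~0: perfect up to scale
```

A rescaled copy scores ~0, which matters because BSS only recovers sources up to scale; any measure that penalized amplitude alone would be misleading here.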
Voice 1: before, mixed, and estimated
Voice 2: before, mixed, and estimated
Estimated Voice 1
Estimated Voice 2 (which it failed to extract)
Notes: Simulated annealing could be used instead of backpropagation.
Evaluation: the deep learning method produced an average distortion of 72.9566240393 for voice 1, and 40.26824952131 for voice 2. Here is all of my work for the code.

Compressed sensing (CS) remarkably reduces the amount of sampling needed to restore a signal exactly - instead of sampling at at least twice the highest frequency of a signal, CS needs a number of samples that depends on the number of non-zero frequencies. It is based on the assumption that audio signals are sparse. Here, the basis used is the Discrete Cosine Transform (DCT), and using L1 norms, we can reconstruct the original signals.

The literature review for this: other than a few papers by Michael Z , who explores Bayesian priors in BSS to tackle the case where we do not know A, there is not much research on BSS using compressed sensing - most CS papers are on the reconstruction of a single signal. A whole other question is how to find the basis functions for each audio stream, especially for human voices. There is one demo online of CS for BSS, which fails pretty badly for voices.
Here are some cool papers:
- CS applied with ICA (there isn't that much)

Here are the results of compressed sensing, after modifying it to use the L1 norm to retrieve more than one signal at a time.
Here is the Matlab code. I also started translating it to Python.
For compressed sensing, we must represent the audio sources as a combination of basis functions, such that $x = \Phi c$, where $\Phi$ holds the basis functions (here, DCT atoms) and the coefficient vector $c$ is sparse.
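Here is a minimal, self-contained sketch of the idea (my own toy example, not the Matlab code above): a signal that is sparse in an orthonormal DCT basis is recovered from far fewer random time-domain samples by L1 minimization, solved with iterative soft-thresholding (ISTA).

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
N, m = 256, 100                              # 256 samples, only 100 kept

# Ground truth: a signal that is 2-sparse in the orthonormal DCT basis.
c_true = np.zeros(N)
c_true[10], c_true[50] = 1.0, 0.5
x = idct(c_true, norm='ortho')

idx = rng.choice(N, size=m, replace=False)   # random time-domain samples
y = x[idx]                                   # far fewer than N samples

def A(c):                                    # coefficients -> samples
    return idct(c, norm='ortho')[idx]

def At(r):                                   # adjoint: samples -> coeffs
    full = np.zeros(N)
    full[idx] = r
    return dct(full, norm='ortho')

# ISTA: gradient step on the data fit, then soft-threshold (L1 shrink).
lam, c = 0.01, np.zeros(N)
for _ in range(500):
    c = c + At(y - A(c))
    c = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

print(np.argsort(-np.abs(c))[:2])            # the two true DCT bins
```

The two largest recovered coefficients land on the true DCT bins (10 and 50), even though fewer than half the samples were observed; this is the mechanism the multi-signal L1 modification builds on.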
In the case of overcomplete ICA, it is still possible to identify the mixing matrix from knowledge of x alone, although it is not possible to uniquely recover the sources s. An area to delve deeper into is how to best reconstruct the unique sources. Here, we have only considered instantaneous mixtures. There is also BSS in the presence of noise. There are two major approaches: blind source separation and spatial filtering. The first relies on the statistical independence and super-Gaussian distribution of the speech signals. Spatial filtering uses the fact that speech sources are separated in space, which is an active field of research at Microsoft Research.
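For contrast with the overcomplete case, the square, noise-free case can be handled by a FastICA-style fixed-point iteration. Here is a minimal toy sketch (my own, not a library implementation): whiten the mixtures, then find orthogonal directions that maximize non-Gaussianity with the tanh nonlinearity.

```python
import numpy as np

# Square, noise-free BSS: x = A s, with super-Gaussian sources.
rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
x = A @ s

# Whitening: center, decorrelate, and rescale so cov(z) = I.
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = (E / np.sqrt(d)) @ E.T @ x

# Symmetric FastICA fixed-point updates with g = tanh.
W = np.linalg.qr(rng.normal(size=(2, 2)))[0]     # random orthogonal start
for _ in range(200):
    g = np.tanh(W @ z)
    W = (g @ z.T) / z.shape[1] - np.diag((1 - g**2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W)                  # symmetric decorrelation
    W = U @ Vt

s_hat = W @ z
# Each recovered row should match one source up to sign and scale,
# so the |correlation| matrix is approximately a permutation matrix.
corr = np.abs(np.corrcoef(np.vstack([s, s_hat]))[:2, 2:])
print(corr.round(2))
```

The correlation check makes the permutation-and-scale ambiguity visible again: each true source is strongly correlated with exactly one estimate, in no guaranteed order.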
I want to continue with this - perhaps explore it further my senior year, looking specifically at how to find basis-function representations for compressed sensing, since that seems to be the key to making blind source separation real-time.