Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates.
ABSTRACT: This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source from the raw audio signals, avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose a two-step training strategy. We first train our network on semi-synthetic data generated from close-talk speech recordings, simulating the time delays and distortion that the signal suffers as it propagates from the source to the microphone array. We then fine-tune this network on a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach produces networks that significantly outperform existing localization methods based on SRP-PHAT strategies, as well as very recent proposals based on Convolutional Recurrent Neural Networks (CRNNs). In addition, our experiments show that the performance of our CNN method does not depend noticeably on the speaker's gender or on the size of the signal window used.
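The semi-synthetic data generation described above can be sketched in a few lines of numpy. This is a minimal free-field model (integer-sample delays and 1/r attenuation only; the function name is illustrative, and the paper's pipeline additionally models distortion, which is omitted here):

```python
import numpy as np

def simulate_array_signals(src, src_pos, mic_positions, fs, c=343.0):
    """Delay-and-attenuate a close-talk recording to each microphone.
    Free-field simplification: integer-sample propagation delay and
    1/r amplitude decay, no reverberation or distortion."""
    src_pos = np.asarray(src_pos, float)
    out = []
    for m in np.asarray(mic_positions, float):
        r = np.linalg.norm(src_pos - m)           # source-mic distance (m)
        d = int(round(fs * r / c))                # propagation delay (samples)
        sig = np.zeros_like(src)
        sig[d:] = src[:len(src) - d] / max(r, 1e-3)  # delayed, attenuated copy
        out.append(sig)
    return np.stack(out)                          # (n_mics, n_samples)
```

Feeding such simulated multichannel snippets to the CNN, then fine-tuning on real recordings, is the two-step strategy the abstract describes.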
Project description:The gear fault signal under different working conditions is non-linear and non-stationary, which makes it difficult to distinguish faulty signals from normal ones. Currently, gear fault diagnosis under different working conditions is mainly based on vibration signals. However, vibration signal acquisition is limited by its requirement for contact measurement, and vibration signal analysis methods rely heavily on diagnostic expertise and prior knowledge of signal processing. To solve this problem, this paper proposes a novel acoustic-based diagnosis (ABD) method for gear fault diagnosis under different working conditions, built on a multi-scale convolutional learning structure and an attention mechanism. The multi-scale convolutional learning structure is designed to automatically mine features at multiple scales from raw acoustic signals using different filter banks. A novel attention mechanism, built on this multi-scale structure, is then established to adaptively focus the multi-scale network on the fault pattern information relevant under different working conditions. Finally, a stacked convolutional neural network (CNN) model is proposed to detect the gear fault mode. The experimental results show that our method achieves much better performance in acoustic-based gear fault diagnosis under different working conditions than a standard CNN model (without an attention mechanism), an end-to-end CNN model based on time- and frequency-domain signals, and other traditional fault diagnosis methods involving feature engineering.
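The two building blocks above, multi-scale filter banks over the raw signal and channel attention, can be illustrated with a toy numpy sketch (random filters stand in for the learned banks; the function names and the 4-filters-per-scale choice are assumptions, not the paper's architecture):

```python
import numpy as np

def multiscale_features(x, kernel_sizes=(8, 32, 128)):
    """Convolve the raw acoustic signal with filter banks of different
    lengths (one bank per scale) and rectify; random weights stand in
    for trained ones."""
    rng = np.random.default_rng(0)
    feats = []
    for k in kernel_sizes:
        bank = rng.standard_normal((4, k)) / np.sqrt(k)   # 4 filters per scale
        out = np.stack([np.convolve(x, f, mode="same") for f in bank])
        feats.append(np.maximum(out, 0.0))                # ReLU
    return np.concatenate(feats)                          # (12, len(x))

def channel_attention(feats):
    """Simplest channel attention: a softmax over per-channel mean
    activation re-weights the multi-scale feature maps, so the network
    can emphasize the scales that carry fault information."""
    score = feats.mean(axis=1)                # global average per channel
    w = np.exp(score - score.max())
    w /= w.sum()                              # softmax weights in (0, 1)
    return feats * w[:, None]
```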
Project description:Animal activity acoustic monitoring is becoming one of the necessary tools in agriculture, including beekeeping, where it can assist in the control of beehives at remote locations. Bee swarm activity can be classified from audio signals using such approaches. This paper proposes an IoT-based acoustic bee swarm classification system using deep neural networks (DNNs). Audio recordings were obtained from the Open Source Beehive project, and Mel-frequency cepstral coefficient (MFCC) features were extracted from the audio signal. The lossless WAV and lossy MP3 audio formats were compared for IoT-based solutions, and the impact of the deep neural network parameters on the classification results was analyzed. The best overall classification accuracy with uncompressed audio was 94.09%, while MP3 compression degraded the DNN accuracy by over 10%. The evaluation of the proposed IoT-based bee activity acoustic classification showed improved results compared with a previous hidden Markov model system.
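MFCC extraction is the feature step used above. A compact, dependency-free sketch follows (the frame size, filterbank size, and coefficient count are hypothetical defaults; a real pipeline would typically use a library such as librosa):

```python
import numpy as np

def mfcc(x, fs, n_fft=512, n_mels=26, n_coef=13):
    """Minimal MFCC: Hann-windowed power spectrum -> triangular mel
    filterbank -> log -> DCT-II. Parameter choices are illustrative."""
    hop = n_fft // 2
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([x[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # triangular mel filterbank between 0 Hz and fs/2
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(hz2mel(0.0), hz2mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i+1], bins[i+2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    logmel = np.log(power @ fb.T + 1e-10)
    # DCT-II decorrelates the log-mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_coef), (2 * n + 1) / (2 * n_mels)))
    return logmel @ dct.T                     # (n_frames, n_coef)
```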
Project description:Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNNs) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Second, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer, i.e., an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone.
We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
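PCEN, the short-term adaptation step above, is simple enough to write out directly. This sketch follows the standard formulation (also available as `librosa.pcen`); the default parameter values here are common choices, not necessarily those used by BirdVoxDetect:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization on a spectrogram E of shape
    (n_bands, n_frames): a first-order IIR smoother per subband acts as
    short-term automatic gain control, followed by root compression."""
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]    # smoothed energy
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r
```

On a constant-energy input the gain control normalizes each band to roughly unit level before compression, which is why PCEN suppresses slowly varying background noise while preserving transient events such as flight calls.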
Project description:MOTIVATION:Neural networks have been widely used to analyze high-throughput microscopy images. However, the performance of neural networks can be significantly improved by encoding known invariance for particular tasks. Highly relevant to the goal of automated cell phenotyping from microscopy image data is rotation invariance. Here we consider the application of two schemes for encoding rotation equivariance and invariance in a convolutional neural network, namely, the group-equivariant CNN (G-CNN), and a new architecture with simple, efficient conic convolution, for classifying microscopy images. We additionally integrate the 2D-discrete-Fourier transform (2D-DFT) as an effective means for encoding global rotational invariance. We call our new method the Conic Convolution and DFT Network (CFNet). RESULTS:We evaluated the efficacy of CFNet and G-CNN as compared to a standard CNN for several different image classification tasks, including simulated and real microscopy images of subcellular protein localization, and demonstrated improved performance. We believe CFNet has the potential to improve many high-throughput microscopy image analysis applications. AVAILABILITY AND IMPLEMENTATION:Source code of CFNet is available at: https://github.com/bchidest/CFNet. SUPPLEMENTARY INFORMATION:Supplementary data are available at Bioinformatics online.
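Two properties of the 2D-DFT magnitude motivate its use for global invariance, and both can be checked numerically (this is an illustration of the underlying principle, not CFNet's actual integration of the transform into the network):

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.standard_normal((8, 8))

# 1) |2D-DFT| is exactly invariant to circular shifts of the input,
#    since translation only changes the phase of the spectrum.
mag = np.abs(np.fft.fft2(img))
mag_shift = np.abs(np.fft.fft2(np.roll(img, (3, 5), axis=(0, 1))))
assert np.allclose(mag, mag_shift)

# 2) A 90-degree rotation of the input only permutes the entries of the
#    magnitude spectrum, so any permutation-invariant pooling of |DFT|
#    (here: sorting) yields a rotation-invariant descriptor.
mag_rot = np.abs(np.fft.fft2(np.rot90(img)))
assert np.allclose(np.sort(mag_rot, axis=None), np.sort(mag, axis=None))
```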
Project description:The automatic detection of atrial fibrillation (AF) is crucial because of its association with the risk of embolic stroke. Most existing AF detection methods convert the 1D time-series electrocardiogram (ECG) signal into a 2D spectrogram to train a complex AF detection system, which results in heavy training computation and high implementation cost. This paper proposes an AF detection method based on an end-to-end 1D convolutional neural network (CNN) architecture to raise detection accuracy and reduce network complexity. By investigating the impact of the major components of a convolutional block on detection accuracy and using grid search to obtain optimal hyperparameters of the CNN, we develop a simple, yet effective 1D CNN. Since the dataset provided by the PhysioNet Challenge 2017 contains ECG recordings of different lengths, we also propose a length normalization algorithm to generate equal-length records to meet the input requirement of the CNN. Experimental results and analysis indicate that our 1D CNN achieves an average F1 score of 78.2%, offering better detection accuracy with lower network complexity than existing deep learning-based methods.
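The paper's length normalization algorithm is not spelled out here; one plausible tile-and-crop scheme that produces the equal-length records a fixed-input CNN requires would look like this (an assumption for illustration, not the authors' algorithm):

```python
import numpy as np

def normalize_length(sig, target):
    """Make every ECG record exactly `target` samples long:
    tile short recordings end-to-end, then center-crop."""
    if len(sig) < target:
        reps = int(np.ceil(target / len(sig)))
        sig = np.tile(sig, reps)                 # repeat the short record
    start = (len(sig) - target) // 2
    return sig[start:start + target]             # center crop
```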
Project description:Resting-state functional magnetic resonance imaging (rs-fMRI) based on the blood-oxygen-level-dependent (BOLD) signal has been widely used in healthy individuals and patients to investigate brain function when the subjects are in a resting or task-negative state. Head motion considerably confounds the interpretation of rs-fMRI data. Nuisance regression is commonly used to reduce motion-related artifacts, with six motion parameters estimated from rigid-body realignment as regressors. To further compensate for the effect of head movement, the first-order temporal derivatives of the motion parameters and the squared motion parameters have previously been proposed as additional motion regressors. However, these additional regressors may not be sufficient to model the impact of head motion because of the complexity of motion artifacts. In addition, while using more motion-related regressors can explain more variance in the data, neural signal may also be removed as the number of motion regressors grows. To better model how in-scanner motion affects rs-fMRI data, a robust and automated convolutional neural network (CNN) model is developed in this study to obtain optimal motion regressors. The CNN consists of two temporal convolutional layers, and the outputs of the network are the derived motion regressors used in the subsequent nuisance regression. The temporal convolutional layers can non-parametrically model the prolonged effect of head motion. The set of regressors derived from the neural network is compared with the same number of regressors used in a traditional nuisance regression approach. It is demonstrated that the CNN-derived regressors can more effectively reduce motion-related artifacts.
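The key idea, that a temporal convolution lets a single motion event influence several subsequent time points, can be shown with a plain causal FIR convolution (a hand-written kernel stands in for the learned layers; the function name is illustrative):

```python
import numpy as np

def temporal_conv_regressors(motion, kernels):
    """Convolve each motion parameter time series with a causal FIR
    kernel so the regressor models the prolonged effect of a movement.
    motion: (T, P) array of P motion parameter time series."""
    T, P = motion.shape
    out = np.zeros((T, P))
    for p in range(P):
        full = np.convolve(motion[:, p], kernels[p])  # causal convolution
        out[:, p] = full[:T]                          # keep first T samples
    return out
```

A unit spike in one motion parameter is spread over the following time points according to the kernel, which is exactly the "prolonged effect" a single instantaneous regressor cannot capture.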
Project description:Edges tend to be over-smoothed in total variation (TV) regularized under-sampled images. In this paper, a symmetric residual convolutional neural network (SR-CNN), a deep learning based model, was proposed to enhance the sharpness of edges and detailed anatomical structures in under-sampled cone-beam computed tomography (CBCT). For training, CBCT images were reconstructed using a TV-based method from limited projections simulated from the ground truth CT, and were fed into SR-CNN, which was trained to learn a restoring pattern from under-sampled images to the ground truth. For testing, under-sampled CBCT was reconstructed using TV regularization and was then augmented by SR-CNN. Performance of SR-CNN was evaluated using phantom and patient images of various disease sites acquired at different institutions, both qualitatively and quantitatively using the structural similarity index (SSIM) and peak signal-to-noise ratio (PSNR). SR-CNN substantially enhanced image details in the TV-based CBCT across all experiments. In the patient study using real projections, SR-CNN augmented CBCT images reconstructed from as few as 120 half-fan projections to image quality comparable to the reference fully-sampled FDK reconstruction using 900 projections. In the tumor localization study, the SR-CNN augmented images improved tumor localization accuracy compared with the conventional FDK and TV-based images. SR-CNN demonstrated robustness against noise levels and projection number reductions, and generalization across disease sites and datasets from different institutions. Overall, the SR-CNN-based image augmentation technique was efficient and effective in considerably enhancing edges and anatomical structures in under-sampled 3D/4D-CBCT, which can be very valuable for image-guided radiotherapy.
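Of the two quantitative metrics above, PSNR is the simpler to define explicitly (SSIM would typically come from a library such as scikit-image's `structural_similarity`):

```python
import numpy as np

def psnr(ref, rec, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction, for pixel values spanning `data_range`."""
    mse = np.mean((ref.astype(float) - rec.astype(float)) ** 2)
    if mse == 0:
        return np.inf                      # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)
```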
Project description:Although there have been impressive strides in detector development for time-of-flight positron emission tomography, most detectors still make use of simple signal processing methods to extract the time-of-flight information from the detector signals. In most cases, the timing pick-off for each waveform is computed using leading edge discrimination or constant fraction discrimination, as these were historically easily implemented with analog pulse processing electronics. However, now with the availability of fast waveform digitizers, there is opportunity to make use of more of the timing information contained in the coincident detector waveforms with advanced signal processing techniques. Here we describe the application of deep convolutional neural networks (CNNs), a type of machine learning, to estimate time-of-flight directly from the pair of digitized detector waveforms for a coincident event. One of the key features of this approach is the simplicity in obtaining ground-truth-labeled data needed to train the CNN: the true time-of-flight is determined from the difference in path length between the positron emission and each of the coincident detectors, which can be easily controlled experimentally. The experimental setup used here made use of two photomultiplier tube-based scintillation detectors, and a point source, stepped in 5 mm increments over a 15 cm range between the two detectors. The detector waveforms were digitized at 10 GS/s using a bench-top oscilloscope. The results shown here demonstrate that CNN-based time-of-flight estimation improves timing resolution by 20% compared to leading edge discrimination (231 ps versus 185 ps), and 23% compared to constant fraction discrimination (242 ps versus 185 ps).
By comparing several different CNN architectures, we also showed that CNN depth (number of convolutional and fully connected layers) had the largest impact on timing resolution, while the exact network parameters, such as convolutional filter size and number of feature maps, had only a minor influence.
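The two classical baselines the CNN is compared against are easy to state precisely. A minimal sketch of both pick-off methods on a digitized waveform (linear interpolation between samples; real implementations add baseline correction and noise handling):

```python
import numpy as np

def leading_edge_time(wave, fs, threshold):
    """Leading edge discrimination: time of the first linearly
    interpolated crossing of a fixed threshold (seconds)."""
    i = int(np.argmax(wave >= threshold))      # first sample at/above threshold
    if i == 0:
        return 0.0                             # crossing at or before sample 0
    frac = (threshold - wave[i - 1]) / (wave[i] - wave[i - 1])
    return (i - 1 + frac) / fs

def cfd_time(wave, fs, fraction=0.3):
    """Constant fraction discrimination: same pick-off, but the threshold
    is a fixed fraction of the pulse amplitude, which removes the
    amplitude-dependent 'walk' of a fixed-threshold leading edge."""
    return leading_edge_time(wave, fs, fraction * wave.max())
```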
Project description:Medical image fusion techniques can fuse medical images from different modalities to make medical diagnosis more reliable and accurate, and they play an increasingly important role in many clinical applications. To obtain a fused image with high visual quality and clear structural details, this paper proposes a convolutional neural network (CNN) based medical image fusion algorithm. The proposed algorithm uses a trained Siamese convolutional network to fuse the pixel activity information of the source images and generate a weight map. Meanwhile, a contrast pyramid is used to decompose the source images, which are then integrated across the different spatial frequency bands with a weighted fusion operator. The results of comparative experiments show that the proposed fusion algorithm effectively preserves the detailed structural information of the source images and achieves good human visual effects.
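The weight-map idea can be demonstrated without the trained Siamese network: here local variance stands in for the learned pixel-activity measure, and the pyramid decomposition is omitted. A toy stand-in, not the paper's algorithm:

```python
import numpy as np

def fuse(a, b, win=3):
    """Fuse two registered images by weighting each pixel with its local
    activity (variance in a win x win window); the pixel with more local
    detail dominates the fused result."""
    def activity(img):
        pad = win // 2
        p = np.pad(img, pad, mode="reflect")
        out = np.empty(img.shape, dtype=float)
        for i in range(img.shape[0]):
            for j in range(img.shape[1]):
                out[i, j] = p[i:i+win, j:j+win].var()   # local variance
        return out
    wa, wb = activity(a), activity(b)
    w = wa / (wa + wb + 1e-12)            # per-pixel weight map in [0, 1)
    return w * a + (1 - w) * b            # convex combination per pixel
```

Because the fusion is a per-pixel convex combination, every fused pixel stays between the two source values, which helps avoid introducing artifacts.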
Project description:This research focuses on the signal processing required for a sensory system that can simultaneously localize multiple moving underwater objects in a three-dimensional (3D) volume by simulating the hydrodynamic flow caused by these objects. We propose a method for localization in a simulated setting based on an established hydrodynamic theory founded in fish lateral line organ research. Fish neurally concatenate the information of multiple sensors to localize sources. Similarly, we use the sampled fluid velocity via two parallel lateral lines to perform source localization in three dimensions in two steps. Using a convolutional neural network, we first estimate a two-dimensional image of the probability of a present source. Then we determine the position of each source, via an automated iterative 3D-aware algorithm. We study various neural network architectural designs and different ways of presenting the input to the neural network; multi-level amplified inputs and merged convolutional streams are shown to improve the imaging performance. Results show that the combined system can exhibit adequate 3D localization of multiple sources.
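The second stage, turning the network's probability image into discrete source positions, can be caricatured as greedy peak picking with non-maximum suppression (a much simpler 2D stand-in for the paper's iterative 3D-aware algorithm; the threshold and suppression window are arbitrary):

```python
import numpy as np

def pick_sources(prob, threshold=0.5):
    """Greedy peak picking on a 2D source-probability image: repeatedly
    take the strongest peak, then suppress its 5x5 neighbourhood so the
    same source is not detected twice."""
    peaks = []
    p = prob.copy()
    while p.max() >= threshold:
        i, j = np.unravel_index(np.argmax(p), p.shape)
        peaks.append((int(i), int(j)))
        i0, i1 = max(i - 2, 0), i + 3          # 5x5 suppression window
        j0, j1 = max(j - 2, 0), j + 3
        p[i0:i1, j0:j1] = 0.0
    return peaks
```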