Sorry, you need to enable JavaScript to visit this website.

IEEE ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The IEEE ICASSP 2024 conference will feature world-class presentations by internationally renowned speakers, cutting-edge session topics and provide a fantastic opportunity to network with like-minded professionals from around the world. Visit the website.

Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained increased popularity and achieved promising performance. However, the limited size of public vein datasets has caused overfitting issues and greatly limits the recognition performance.

Categories:
7 Views

In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.3% on YouTube captioning.

Categories:
10 Views

Integrating coded caching (CC) into multi-input multi-output (MIMO) setups significantly enhances the achievable degrees of freedom (DoF). We consider a cache-aided MIMO configuration with a CC gain t, where a server with L Tx-antennas communicates with K users, each equipped with G Rx-antennas. Similar to existing works, we also extend a core CC approach, designed initially for multi-input single-output (MISO) scenarios, to the MIMO setup.

Categories:
8 Views

For the few-shot segmentation (FSS) task, existing methods
attempt to capture the diversity of new classes by fully uti-
lizing the limited support images, such as cross-attention and
prototype matching. However, they often overlook the fact
that there is variability in different regions of the same ob-
ject, and intra-image similarity is higher than inter-image sim-
ilarity.To address these limitations, a Self-Guided Trans-
former (SGT) is proposed by leveraging intra-image similar-

Categories:
31 Views

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions.

Categories:
50 Views

Graphs are widely used to represent complex information and signal domains with irregular support. Typically, the underlying graph topology is unknown and must be estimated from the available data. Common approaches assume pairwise node interactions and infer the graph topology based on this premise. In contrast, our novel method not only unveils the graph topology but also identifies three-node interactions, referred to in the literature as second-order simplicial complexes (SCs).

Categories:
15 Views

Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer).

Categories:
18 Views

In this paper, we explain the model that was developed by the NLP\_POSTECH team for the LIMMITS 2024 Grand Challenge. Among the three tracks, we focus on Track 1, which necessitates the creation of a few-shot text-to-speech (TTS) system that generates natural speech across diverse languages. Towards this end, to realize multi-lingual capability, we incorporate a learnable language embedding. In addition, for precise imitation of target speaker voices, we leverage an inductive speaker bias conditioning methodology.

Categories:
12 Views

Empathetic response generation requires perceiving and un- derstanding the user’s emotion to deliver a suitable response. However, existing models generally remain oblivious of an interlocutor’s persona, which has been shown to play a vital role in expressing appropriate empathy to different users. To address this problem, we propose a novel Transformer-based architecture that incorporates retrieval-augmented prompt learning to generate persona-aware empathetic responses.

Categories:
18 Views

As deep neural networks and the datasets used to train them get larger, the default approach to integrating them into re-
search and commercial projects is to download a pre-trained model and fine tune it. But these models can have uncertain
provenance, opening up the possibility that they embed hidden malicious behavior such as trojans or backdoors, where
small changes to an input (triggers) can cause the model toproduce incorrect outputs (e.g., to misclassify). This paper

Categories:
9 Views

Pages