IEEE ICASSP 2024

IEEE ICASSP 2024 - The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The IEEE ICASSP 2024 conference will feature world-class presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.

MOTION TRANSFER-DRIVEN INTRA-CLASS DATA AUGMENTATION FOR FINGER VEIN RECOGNITION


Finger vein recognition (FVR) has emerged as a secure biometric technique because of the confidentiality of vascular bio-information. Recently, deep learning-based FVR has gained popularity and achieved promising performance. However, the limited size of public vein datasets causes overfitting and greatly limits recognition performance.

icassp_v5_20240417_02.pptx

icassp_v5_20240417_02.pptx (16)

Categories: Biometrics

7 Views

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.3% on YouTube captioning.

ICASSP2024_slides.pdf

Slides (11)

Categories: Language Modeling, for Speech and SLP (SLP-LANG)

10 Views

Multicast Transmission Design With Enhanced DOF For Mimo Coded Caching Systems


Integrating coded caching (CC) into multi-input multi-output (MIMO) setups significantly enhances the achievable degrees of freedom (DoF). We consider a cache-aided MIMO configuration with a CC gain t, where a server with L Tx-antennas communicates with K users, each equipped with G Rx-antennas. Similar to existing works, we also extend a core CC approach, designed initially for multi-input single-output (MISO) scenarios, to the MIMO setup.

Transmission_Schemes_for_Enhanced_DoF_in_Cache_Aided_MIMO_Communication_Systems.pdf

MIMO CC (20)

Categories: MIMO Communications and Signal Processing

8 Views

SGT: SELF-GUIDED TRANSFORMER FOR FEW-SHOT SEMANTIC SEGMENTATION


For the few-shot segmentation (FSS) task, existing methods attempt to capture the diversity of new classes by fully utilizing the limited support images, such as cross-attention and prototype matching. However, they often overlook the fact that there is variability in different regions of the same object, and intra-image similarity is higher than inter-image similarity. To address these limitations, a Self-Guided Transformer (SGT) is proposed by leveraging intra-image similar…

poster-icassp.pdf

poster-icassp.pdf (19)

Categories: Other applications of machine learning (MLR-APPL)

31 Views

Scaling NVIDIA’s Multi-Speaker Multi-Lingual TTS Systems with Zero-Shot TTS to Indic Languages

In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets. We use HiFi-GAN vocoders for all submissions.

icassp_2024_04172024.pdf

icassp_2024_04172024.pdf (38)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

50 Views

Learning Graphs and Simplicial Complexes from Data


Graphs are widely used to represent complex information and signal domains with irregular support. Typically, the underlying graph topology is unknown and must be estimated from the available data. Common approaches assume pairwise node interactions and infer the graph topology based on this premise. In contrast, our novel method not only unveils the graph topology but also identifies three-node interactions, referred to in the literature as second-order simplicial complexes (SCs).

Learning_Graphs_and_Simplicial_Complexes_from_Data.pdf

Learning_Graphs_and_Simplicial_Complexes_from_Data.pdf (31)

Categories: Signal Processing Theory and Methods

15 Views

Music Source Separation with Band-Split RoPE Transformer


Music source separation (MSS) aims to separate a music recording into multiple musically distinct stems, such as vocals, bass, drums, and more. Recently, deep learning approaches such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been used, but the improvement is still limited. In this paper, we propose a novel frequency-domain approach based on a Band-Split RoPE Transformer (called BS-RoFormer).

BS-Roformer-present-ICASSP2024.pptx

Presentation slides (19)

Categories: Music Signal Processing

18 Views

LEVERAGING EFFECTIVE LANGUAGE AND SPEAKER CONDITIONING IN INDIC TTS FOR LIMMITS 2024 CHALLENGE

In this paper, we describe the model developed by the NLP_POSTECH team for the LIMMITS 2024 Grand Challenge. Among the three tracks, we focus on Track 1, which requires a few-shot text-to-speech (TTS) system that generates natural speech across diverse languages. To this end, we incorporate a learnable language embedding to realize multilingual capability. In addition, for precise imitation of target speaker voices, we leverage an inductive speaker bias conditioning methodology.

ICASSP 2024.pptx.pdf

ICASSP 2024.pptx.pdf (29)

Categories: Speech Processing

12 Views

GENERATING PERSONA-AWARE EMPATHETIC RESPONSES WITH RETRIEVAL-AUGMENTED PROMPT LEARNING


Empathetic response generation requires perceiving and understanding the user’s emotion to deliver a suitable response. However, existing models generally remain oblivious of an interlocutor’s persona, which has been shown to play a vital role in expressing appropriate empathy to different users. To address this problem, we propose a novel Transformer-based architecture that incorporates retrieval-augmented prompt learning to generate persona-aware empathetic responses.

ICASSPslide_llwang.pptx

ICASSPslide_llwang.pptx (33)

Categories: Other

18 Views

TEN-GUARD: TENSOR DECOMPOSITION FOR BACKDOOR ATTACK DETECTION IN DEEP NEURAL NETWORKS


As deep neural networks and the datasets used to train them get larger, the default approach to integrating them into research and commercial projects is to download a pre-trained model and fine-tune it. But these models can have uncertain provenance, opening up the possibility that they embed hidden malicious behavior such as trojans or backdoors, where small changes to an input (triggers) can cause the model to produce incorrect outputs (e.g., to misclassify). This paper…

ICASSP_Poster_Khondoker.pdf

ICASSP_Poster_Khondoker.pdf (37)

Categories: Machine Learning for Signal Processing

9 Views
