Abstract
To address the difficulty of preserving data for dynamic visualization in fetal ultrasound screening, a novel framework is proposed for generating fetal four-chamber echocardiogram videos through multi-source visual fusion and understanding. The framework uses a spectrogram–ultrasound synchronizer to align ultrasound images in time, ensuring that the generated video matches the actual heartbeat rhythm. It further employs frame interpolation with nonlinear bidirectional motion prediction to synthesize video, and integrates a Transformer model for autoregressive generation of the visual semantic sequence, enabling high-resolution frame generation. Experimental results show a CLIP similarity of 96.23% and a DINOv2 similarity of 99.77%. In addition, a multimodal dataset of fetal echocardiogram examinations has been constructed.
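The reported CLIP and DINOv2 similarity scores are, in the usual setup, cosine similarities between embeddings of generated and reference frames produced by the respective image encoders. A minimal sketch of that metric, assuming precomputed embedding arrays (the encoders themselves, and the function name `embedding_similarity`, are illustrative placeholders, not the paper's implementation):

```python
import numpy as np

def embedding_similarity(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired frame embeddings.

    gen_emb, ref_emb: (num_frames, dim) arrays of image embeddings,
    e.g. outputs of a CLIP or DINOv2 image encoder (not included here).
    """
    # L2-normalize each embedding, then average the per-frame dot products.
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(gen * ref, axis=1)))

# Toy data standing in for encoder outputs of 8 frames, 512-dim each.
rng = np.random.default_rng(0)
gen_frames = rng.normal(size=(8, 512))
score_self = embedding_similarity(gen_frames, gen_frames)  # identical frames give 1.0
```

A similarity near 100%, as reported in the abstract, indicates that generated frames are nearly indistinguishable from real ones in the encoder's feature space.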
Original language | English
---|---
Article number | 102510
Journal | Information Fusion
Volume | 111
DOIs | 
Publication status | Published - Nov 2024
Keywords
- Cross-modal synchronization
- Fetal echocardiogram scenario
- Multi-source visual fusion and understanding
- Transformer model
- Visual data generation
ASJC Scopus subject areas
- Software
- Signal Processing
- Information Systems
- Hardware and Architecture