Combining Multi-Modal Data with Deep Generative Models

Rebecca John
Ladoke Akintola University of Technology

Abstract

Modern AI systems increasingly operate on heterogeneous data, including text, audio, images, and sensor outputs. Multi-modal data fusion enhances insight by combining information from these different modalities. This work presents a unified framework for multi-modal data fusion using deep generative models, focusing on Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models. A novel architecture is proposed that learns a joint latent representation capturing relationships among multiple modalities even when some modalities are noisy or missing. Extensive evaluations on benchmark datasets demonstrate that the proposed approach outperforms state-of-the-art fusion methods on classification, generation, and cross-modal retrieval tasks. Applications in diverse domains, including healthcare analytics and multimedia content creation, demonstrate the framework's relevance to advanced AI systems.
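
The architecture itself is described only at a high level above. As an illustration of how a joint latent representation can tolerate missing modalities, the sketch below shows precision-weighted product-of-experts (PoE) fusion of modality-specific Gaussian posteriors, a common construction in multimodal VAE work. It is a minimal sketch in PyTorch; the class names, dimensions, and fusion rule are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class ModalityEncoder(nn.Module):
        # Maps one modality to the mean and log-variance of a Gaussian posterior.
        def __init__(self, input_dim, latent_dim, hidden_dim=256):
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            self.mu = nn.Linear(hidden_dim, latent_dim)
            self.logvar = nn.Linear(hidden_dim, latent_dim)

        def forward(self, x):
            h = self.backbone(x)
            return self.mu(h), self.logvar(h)

    def product_of_experts(mus, logvars):
        # Fuse Gaussian experts by summing precisions (1/sigma^2). A standard-normal
        # prior expert keeps the product well defined even if every modality is
        # absent; missing modalities are simply left out of the input lists.
        prior_mu = torch.zeros_like(mus[0])
        prior_logvar = torch.zeros_like(logvars[0])
        mus = [prior_mu] + list(mus)
        logvars = [prior_logvar] + list(logvars)
        precisions = [torch.exp(-lv) for lv in logvars]
        fused_var = 1.0 / torch.stack(precisions).sum(dim=0)
        fused_mu = fused_var * torch.stack(
            [m * p for m, p in zip(mus, precisions)]).sum(dim=0)
        return fused_mu, torch.log(fused_var)

    def reparameterize(mu, logvar):
        # Standard VAE reparameterization trick: z = mu + sigma * eps.
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Example: fuse image and text posteriors while a third modality is missing.
    img_enc = ModalityEncoder(input_dim=2048, latent_dim=64)
    txt_enc = ModalityEncoder(input_dim=768, latent_dim=64)
    x_img, x_txt = torch.randn(8, 2048), torch.randn(8, 768)
    mu, logvar = product_of_experts(*zip(img_enc(x_img), txt_enc(x_txt)))
    z = reparameterize(mu, logvar)  # joint latent code of shape (8, 64)

Because each expert contributes additively in precision space, an absent modality simply drops out of the product rather than corrupting the fused posterior, which is one standard route to the missing-modality robustness the abstract describes.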

Keywords

Multi-modal learning, Data fusion, Deep generative models, Variational autoencoders (VAEs), Generative adversarial networks (GANs), Cross-modal generation, Representation learning, Diffusion models, Missing modality handling, Joint latent space.
