The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Concept figure illustrating why higher-order alignment is needed beyond pairwise objectives.

Conceptual overview. Pairwise alignment alone cannot recover interactions that only emerge jointly across modalities. ConFu augments standard contrastive alignment with fused higher-order supervision, enabling the shared embedding space to preserve multimodal synergy.

Abstract

Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term that aligns the joint embedding of each modality pair with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence.

Why higher-order alignment?

Pairwise alignment is incomplete

CLIP-style objectives align two modalities at a time. They are effective for standard cross-modal retrieval, but they do not explicitly model information that only becomes predictive when several modalities are considered jointly.

ConFu models multimodal synergy

ConFu builds fused representations from subsets of modalities and aligns them with the remaining modality, injecting direct supervision on higher-order structure into the shared embedding space.

Strong in both fused and standard settings

ConFu is useful even when not all modalities are present: it retains strong single-modality performance while also supporting fused retrieval and classification, which makes it practical beyond synthetic higher-order toy tasks.

Method

1. Pairwise contrastive alignment

Each modality is encoded and projected into a shared space. Conventional contrastive terms preserve direct correspondence across ordered modality pairs.
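To make the pairwise term concrete, here is a minimal, dependency-free sketch of a CLIP-style InfoNCE loss. It is an illustration under assumptions rather than the ConFu implementation: the temperature value, the averaging over the two directions, and the names `info_nce` and `pairwise_loss` are all hypothetical, and the embeddings are toy values.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(anchors, targets, temperature=0.1):
    """Directional InfoNCE: anchors[i] should score highest against targets[i]."""
    a = [l2_normalize(v) for v in anchors]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, tj)) / temperature for tj in t]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_partition - logits[i]
    return total / len(a)

def pairwise_loss(z_a, z_b, temperature=0.1):
    """Average of the two ordered-pair terms (a -> b and b -> a)."""
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))

# Toy batch of three paired embeddings (hypothetical values).
z_image = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z_audio = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
shuffled_audio = [z_audio[1], z_audio[2], z_audio[0]]  # pairing deliberately broken

aligned = pairwise_loss(z_image, z_audio)
misaligned = pairwise_loss(z_image, shuffled_audio)
print(aligned < misaligned)
```

The loss is small when each embedding is closest to its own partner and grows when the pairing is broken, which is the behavior the pairwise terms rely on.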

2. Fusion branch for higher-order structure

A lightweight fusion module combines two modalities into a joint embedding and aligns that fused representation with the held-out modality. This directly supervises interactions that cannot be recovered from pairwise terms alone.
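The data flow of the fusion branch can be sketched as follows. The page does not specify the fusion module, so everything here is an assumption for illustration: `fuse` is a hypothetical stand-in that concatenates two embeddings and applies a single fixed linear map, whereas a real module would be learned (and would need to be nonlinear to express XOR-like interactions). The contrastive term is the same generic InfoNCE form used in CLIP-style training.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fuse(z_a, z_b, weights):
    """Hypothetical fusion: concatenate, apply one linear map, L2-normalize."""
    joint = z_a + z_b  # list concatenation, length 2 * d
    projected = [sum(w * x for w, x in zip(row, joint)) for row in weights]
    return l2_normalize(projected)

def info_nce(anchors, targets, temperature=0.1):
    """Directional InfoNCE: anchors[i] should score highest against targets[i]."""
    targets = [l2_normalize(t) for t in targets]
    total = 0.0
    for i, anchor in enumerate(anchors):
        anchor = l2_normalize(anchor)
        logits = [sum(x * y for x, y in zip(anchor, t)) / temperature for t in targets]
        peak = max(logits)
        log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_partition - logits[i]
    return total / len(anchors)

# Fixed 2x4 map that averages the two inputs (a placeholder for learned weights).
W = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]

# Toy batch: image and audio are fused, text is the held-out modality.
z_image = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z_audio = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
z_text = [[1.0, 0.2], [0.2, 1.0], [-1.0, -0.2]]

fused = [fuse(a, b, W) for a, b in zip(z_image, z_audio)]
fused_loss = info_nce(fused, z_text)

# Breaking the pairing with the held-out modality should raise the loss.
shuffled_text = [z_text[1], z_text[2], z_text[0]]
shuffled_loss = info_nce(fused, shuffled_text)
print(fused_loss < shuffled_loss)
```

The point of the sketch is only the shape of the computation: two embeddings go in, one fused embedding in the shared space comes out, and that fused embedding is contrasted against the third modality.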

3. Unified training objective

The final objective combines pairwise and fused higher-order terms. The resulting representation space remains usable for standard one-to-one retrieval with a single query modality, while also supporting fused two-to-one retrieval and classification.
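One plausible way to write such a combined objective for three modalities a, b, c is sketched below. The notation is assumed rather than taken from the paper: z^m denotes the batch of projected embeddings of modality m, f is the fusion module, \tau is a temperature, and \lambda is a hypothetical weight balancing the two parts. The exact set of terms and their weighting in ConFu may differ.

```latex
% Directional contrastive term: each u_i should retrieve its partner v_i in a batch of size N.
\ell(u \to v) = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp(\langle u_i, v_i \rangle / \tau)}
            {\sum_{j=1}^{N} \exp(\langle u_i, v_j \rangle / \tau)}

% Pairwise part: one term per ordered pair of distinct modalities.
\mathcal{L}_{\text{pair}} = \sum_{m \neq n} \ell(z^{m} \to z^{n}),
  \qquad m, n \in \{a, b, c\}

% Fused part: each fused pair is aligned with the held-out modality k, in both directions.
\mathcal{L}_{\text{fused}} = \sum_{\{m, n\},\; k \notin \{m, n\}}
  \Big[ \ell\big(f(z^{m}, z^{n}) \to z^{k}\big) + \ell\big(z^{k} \to f(z^{m}, z^{n})\big) \Big]

% Combined objective (\lambda is an assumed hyperparameter).
\mathcal{L} = \mathcal{L}_{\text{pair}} + \lambda \, \mathcal{L}_{\text{fused}}
```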

Jointly embeds individual modalities and fused modality pairs
Captures XOR-like dependencies missed by purely pairwise objectives
Supports both 1-to-1 and 2-to-1 downstream evaluation
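The difference between 1-to-1 and 2-to-1 evaluation can be illustrated with a small recall@1 sketch. The embeddings are hand-picked toy values, and `sum_fusion` is a hypothetical stand-in for a learned fusion module; neither is taken from the paper or its code.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def recall_at_1(queries, gallery):
    """Fraction of queries whose top-scoring gallery item has the same index."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, item) for item in gallery]
        if scores.index(max(scores)) == i:
            hits += 1
    return hits / len(queries)

def sum_fusion(z_a, z_b):
    """Placeholder fusion: element-wise sum of two embeddings."""
    return [x + y for x, y in zip(z_a, z_b)]

# Toy embeddings for three samples (hypothetical values).
z_text = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # retrieval gallery
z_image = [[1.0, 0.2, 0.3], [1.0, 0.1, 0.8], [0.1, 1.0, 0.9]]
z_audio = [[0.1, 1.0, 0.0], [0.2, 0.0, 1.0], [0.3, 0.8, 1.0]]

# 1-to-1: a single modality queries the gallery.
r_image = recall_at_1(z_image, z_text)

# 2-to-1: a fused image+audio query retrieves the held-out text.
z_fused = [sum_fusion(i, a) for i, a in zip(z_image, z_audio)]
r_fused = recall_at_1(z_fused, z_text)

print(r_image, r_fused)
```

In this toy batch the first image is ambiguous on its own and retrieves the wrong text, while the fused query resolves it. Both settings share the same gallery and the same scoring function; only the query construction changes.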

Architecture of ConFu showing modality encoders, projection heads, fusion branch, and the combined training objective.

ConFu architecture. Each modality is projected into a common embedding space while an additional fusion branch combines modality subsets. Training jointly optimizes pairwise contrastive alignment and fused higher-order alignment, so the model retains strong unimodal utility while learning multimodal interactions.

Higher-order behavior in a controlled setting

XOR experiment showing the benefit of higher-order alignment as geometric interaction strength increases.

XOR with geometric structure. The controlled XOR setting illustrates the core motivation behind ConFu. When the target depends on a genuinely joint interaction, pairwise alignment is insufficient, while ConFu can exploit the higher-order signal.
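The reason pairwise signals fail on XOR can be checked directly with a few lines of Python. This is a plain discrete XOR over two fair bits, not the geometric variant used in the experiment; the helper `mutual_information` is written here for illustration.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    """Mutual information in bits between the two coordinates of equally likely pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2((count / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), count in joint.items()
    )

# All four equally likely input combinations and their XOR target.
samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

mi_x1 = mutual_information([(x1, y) for x1, _, y in samples])
mi_x2 = mutual_information([(x2, y) for _, x2, y in samples])
mi_joint = mutual_information([((x1, x2), y) for x1, x2, y in samples])

print(mi_x1, mi_x2, mi_joint)  # 0.0 0.0 1.0
```

Each input alone carries zero bits about the target, yet the two inputs together determine it completely. An objective that only compares one modality with another at a time has nothing to latch onto in this regime, which is the gap the fused term is meant to close.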

Bird-MML Dataset

Bird-MML is a multimodal dataset consisting of aligned bird images, audio recordings, and textual descriptions. Each sample includes a bird photograph, a mel-spectrogram derived from the corresponding audio signal, and a short textual description capturing visual and acoustic attributes. It is designed to test both ordinary cross-modal alignment and settings where several modalities together provide stronger evidence than any one alone.

Sample 1 (image and mel-spectrogram not shown). Description: This red male house finch, with a distinctive black and white back stripe, enjoys a meal from a wooden bird feeder.

Sample 2 (image and mel-spectrogram not shown). Description: This brown-bodied great horned owl sits perched on a branch, its distinctive white throat patch and dark wings blending with the forest.

Sample 3 (image and mel-spectrogram not shown). Description: This brown song sparrow, with black and white stripes, is a small common bird known for its distinctive song.

Licensing and Attribution

The Bird-MML examples shown on this page retain their original source licenses. Full licensing and attribution metadata for the displayed samples is available in the accompanying CSV file.

Results snapshot

Method	SSW60 A	SSW60 V	SSW60 A+V	VB100 A	VB100 V	VB100 A+V
CLIP	29.9	70.1	–	4.2	20.6	–
Tri-CLIP	31.1	69.0	–	3.9	20.7	–
Symile	–	–	60.2	–	–	13.4
TRIANGLE	–	–	64.1	–	–	12.1
GRAM	0.7	66.6	56.9	1.3	13.7	8.0
ConFu	30.3	69.4	71.44	3.4	19.3	18.1

ConFu remains competitive on single-modality evaluation while substantially improving the fused A+V setting (71.44 vs. 64.1 for the next-best method on SSW60, and 18.1 vs. 13.4 on VB100), which is precisely where higher-order multimodal structure becomes useful.

BibTeX

@inproceedings{koutoupis2026more,
  title     = {The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment},
  author    = {Koutoupis, Stefanos and Zervou, Michaela Areti and Kontras, Konstantinos and
               De Vos, Maarten and Tsakalides, Panagiotis and Tsagkatakis, Grigorios},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}