ML/NLP September 30, 2024

Making Sense of Multimodality: How Models See and Read

For the longest time in my NLP studies, the world was made of text. Then came the rise of multimodal AI, and suddenly models could see, hear, and read all at once. For my seminar on advanced models, I had to do a deep dive into how exactly you get a model to understand both an image and a sentence at the same time. It turns out there are a few competing philosophies, each with its own flavor.

The First Big Question: When Do We Mix the Ingredients?

The first major architectural choice I encountered was the “fusion” question: do you mix the modalities early or late?

  • Early Fusion: This approach feels like throwing all the ingredients into a blender at the start. You take the image patches and the text tokens, project them into a shared space, and let a single model figure out the connections from layer one (single-stream models like ViLT work this way). It learns rich, fine-grained interactions, which is powerful but can be computationally expensive.
  • Late Fusion: This is more like preparing two separate dishes and combining them on the plate. One encoder processes the image, another processes the text, and their high-level outputs only meet at the end. OpenAI’s CLIP is the classic example: its two encoders never interact until their final embeddings are compared. It’s more modular and often simpler to implement, but you risk losing out on the subtle, low-level interactions between the image and text. A minimal sketch of both styles follows this list.
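To make the contrast concrete, here is a minimal PyTorch sketch of both styles. The module names, dimensions, and pooling choices are placeholders I picked for illustration, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Project both modalities into one space, then process them jointly."""
    def __init__(self, img_dim=512, txt_dim=300, d=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d)
        self.txt_proj = nn.Linear(txt_dim, d)
        # A single encoder sees the concatenated sequence from layer one.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_img, img_dim), txt_tokens: (B, N_txt, txt_dim)
        x = torch.cat([self.img_proj(img_tokens), self.txt_proj(txt_tokens)], dim=1)
        return self.joint_encoder(x)            # (B, N_img + N_txt, d)

class LateFusion(nn.Module):
    """Encode each modality separately; merge only the pooled outputs (CLIP-style)."""
    def __init__(self, img_dim=512, txt_dim=300, d=256):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, d), nn.ReLU(), nn.Linear(d, d))
        self.txt_encoder = nn.Sequential(nn.Linear(txt_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, img_tokens, txt_tokens):
        img_vec = self.img_encoder(img_tokens).mean(dim=1)   # pool over patches
        txt_vec = self.txt_encoder(txt_tokens).mean(dim=1)   # pool over tokens
        # The two streams only "meet" here, at the very end.
        return nn.functional.cosine_similarity(img_vec, txt_vec, dim=-1)

img, txt = torch.randn(2, 49, 512), torch.randn(2, 10, 300)
joint = EarlyFusion()(img, txt)   # (2, 59, 256) fused sequence
score = LateFusion()(img, txt)    # (2,) image-text similarity
```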

The Modern Approach: Let’s Have a Conversation

The architectures that are really pushing the field forward, like Flamingo or BLIP-2, use a more sophisticated approach: cross-modal attention.

The way I visualize this is that you have two parallel streams of information, one for the image and one for the text. At various points in the model, you build “bridges” of attention between them. The text stream can “look over” at the image to ground its understanding, and the image stream can be influenced by the textual context. It’s less about fusion and more about a continuous dialogue between the modalities. This seems to be where the most exciting results are coming from, especially for complex reasoning tasks.
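Here is a rough PyTorch sketch of one such bridge: a block where the text stream attends to image features, loosely in the spirit of Flamingo’s gated cross-attention layers. The sizes and the simple tanh gate are my own simplifications, not the actual published architecture.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text hidden states query image features through a gated attention bridge."""
    def __init__(self, d=256, nhead=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.gate = nn.Parameter(torch.zeros(1))  # the bridge starts "closed"

    def forward(self, txt, img):
        # txt: (B, N_txt, d) acts as queries; img: (B, N_img, d) as keys/values.
        attended, _ = self.cross_attn(query=self.norm(txt), key=img, value=img)
        # A tanh gate lets the model learn how much to look over at the image.
        return txt + torch.tanh(self.gate) * attended

txt = torch.randn(2, 10, 256)      # text stream
img = torch.randn(2, 49, 256)      # image stream (e.g., a 7x7 grid of patch features)
out = CrossModalBlock()(txt, img)  # same shape as txt: the text, now image-aware
```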

The Ultimate Goal: A Unified Brain?

Some of the most ambitious projects, like PaLM-E, are aiming for a truly unified architecture. They treat everything—text, images, even robotic actions—as a single, universal language. The model is a single, massive transformer that processes a sequence of “tokens,” where some tokens represent words and others represent patches of an image.
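A toy sketch of this “everything is a token” idea, assuming a generic transformer backbone (the real PaLM-E is a far larger decoder-only language model; all names and sizes here are placeholders):

```python
import torch
import torch.nn as nn

d_model = 256
txt_embed = nn.Embedding(32000, d_model)     # ordinary word-piece embeddings
img_proj = nn.Linear(768, d_model)           # maps vision-encoder patch features to d_model
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)   # one shared "brain"

text_ids = torch.randint(0, 32000, (1, 12))  # e.g., a tokenized question about the image
patch_feats = torch.randn(1, 16, 768)        # 16 patch features from a vision encoder

# Splice the modalities into a single sequence; the backbone never "knows"
# which positions came from pixels and which came from words.
seq = torch.cat([img_proj(patch_feats), txt_embed(text_ids)], dim=1)
out = backbone(seq)                          # (1, 16 + 12, d_model)
```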

This is an incredibly elegant idea. It suggests a path toward more general AI that can seamlessly switch between different types of information. The challenge, of course, is that training such a model requires an immense and incredibly diverse dataset.

The Practical Path for Students: Adapters

For a student like me, who doesn’t have a Google-sized budget, the most practical and exciting development has been the rise of adapter-based methods. Instead of training a massive multimodal model from scratch, you can take a powerful, pre-trained language model and insert small, lightweight “adapter” modules. These adapters are trained to translate image features into a format the LLM can understand. This lets you give “sight” to an existing LLM without the astronomical cost of full retraining.
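Here is a minimal sketch of that recipe, loosely in the spirit of LLaVA-style projection adapters: a frozen vision encoder and a frozen LLM are assumed to exist elsewhere, and only the tiny projection below would be trained. All dimensions are placeholders.

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """A small trainable MLP that maps frozen vision features into the LLM's token space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats):
        # vision_feats: (B, N_patches, vision_dim) from a frozen image encoder.
        return self.proj(vision_feats)        # (B, N_patches, llm_dim) "visual tokens"

adapter = VisionAdapter()
visual_tokens = adapter(torch.randn(1, 256, 1024))
# These visual tokens would be prepended to the LLM's text embeddings.
# Crucially, only the adapter's parameters ever see the optimizer; the LLM
# and the vision encoder stay frozen.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```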

Exploring these different approaches has been one of the most fascinating parts of my studies. It’s a field that’s still figuring out its core principles, and watching these different architectural philosophies compete and evolve is like having a front-row seat to the future of AI.
