On Monday, researchers from Microsoft introduced Kosmos-1, a multimodal AI model that can reportedly analyze images for content, solve visual puzzles, perform visual text recognition, pass visual IQ tests, and understand natural language instructions. The researchers believe multimodal AI—which integrates different modes of input such as text, audio, images, and video—is a key step to building artificial general intelligence (AGI) that can perform general tasks at the level of a human.
“Being a basic part of intelligence, multimodal perception is a necessity to achieve artificial general intelligence, in terms of knowledge acquisition and grounding to the real world,” the researchers write in their academic paper, “Language Is Not All You Need: Aligning Perception with Language Models.”
Visual examples from the Kosmos-1 paper show the model analyzing images and answering questions about them, reading text from an image, writing captions for images, and taking a visual IQ test with 22–26 percent accuracy (more on that below).
While the media buzzes with news about large language models (LLMs), some AI experts point to multimodal AI as a potential path toward artificial general intelligence, a hypothetical technology that will ostensibly be able to replace humans at any intellectual task (and any intellectual job). AGI is the stated goal of OpenAI, a key business partner of Microsoft in the AI space.
In this case, Kosmos-1 appears to be a pure Microsoft project without OpenAI’s involvement. The researchers call their creation a “multimodal large language model” (MLLM) because its roots lie in natural language processing, like a text-only LLM such as ChatGPT. And it shows: for Kosmos-1 to accept image input, the researchers must first translate the image into a special series of tokens (basically text) that the LLM can understand.
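To make that idea concrete, here is a minimal, hypothetical sketch of how an image can be chopped into patch embeddings that act like "tokens" and fed to a language model alongside text. This is not Kosmos-1's actual architecture (the paper describes a vision encoder paired with a Magneto-style decoder); every module, size, and name below is an illustrative assumption.

```python
# Illustrative sketch only: image patches are embedded and prepended to text token
# embeddings, so a single Transformer can attend over both modalities at once.
# All dimensions and module choices are assumptions, not Kosmos-1's real design.

import torch
import torch.nn as nn

EMBED_DIM = 512      # hypothetical model embedding size
VOCAB_SIZE = 32000   # hypothetical text vocabulary size
PATCH = 16           # assumed image patch size

class ImageTokenizer(nn.Module):
    """Splits an image into patches and projects each patch into the LM's embedding space."""
    def __init__(self):
        super().__init__()
        # A conv with stride equal to its kernel size acts as a ViT-style patch embedder.
        self.patch_embed = nn.Conv2d(3, EMBED_DIM, kernel_size=PATCH, stride=PATCH)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (batch, 3, H, W) -> (batch, num_patches, EMBED_DIM)
        x = self.patch_embed(image)              # (batch, EMBED_DIM, H/16, W/16)
        return x.flatten(2).transpose(1, 2)      # one embedding per patch, like a token

class TinyMultimodalLM(nn.Module):
    """Concatenates image 'tokens' with text token embeddings and runs a small Transformer."""
    def __init__(self):
        super().__init__()
        self.image_tokenizer = ImageTokenizer()
        self.text_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(EMBED_DIM, nhead=8, batch_first=True)
        # Bidirectional attention here for simplicity; a real MLLM uses a causal decoder.
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, image: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        img_tokens = self.image_tokenizer(image)          # (batch, P, EMBED_DIM)
        txt_tokens = self.text_embed(text_ids)            # (batch, T, EMBED_DIM)
        seq = torch.cat([img_tokens, txt_tokens], dim=1)  # image tokens prepended to text
        return self.lm_head(self.transformer(seq))        # logits over the text vocabulary

# Usage: one 224x224 image plus a short text prompt of 8 token IDs.
model = TinyMultimodalLM()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, VOCAB_SIZE, (1, 8)))
print(logits.shape)  # (1, 196 + 8, 32000)
```

The point of the sketch is simply that once an image has been mapped into the same embedding space as text, the language model can process the combined sequence without knowing which positions came from pixels and which came from words.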