AI learns to generate images from text and begins to better understand our world


OpenAI, co-founded by Elon Musk, has created one of the most striking AI models to date. GPT-3 (Generative Pre-trained Transformer 3) can, without any special prompts, compose poems, short stories and songs that read as if a real person had written them. But this eloquence is a gimmick and should not be confused with a human-like understanding of the world. What if the same technology were trained on text and images simultaneously?

Researchers from the Allen Institute for Artificial Intelligence (AI2) have created a visual-language model. It works with both text and images and can generate pictures from text. The pictures look disturbing and strange, not at all like the hyperrealistic “deepfakes” created by generative adversarial networks (GANs). Even so, this capability has long been an important missing piece.

The aim of the study was to find out whether neural networks can understand the visual world the way humans do. For example, a child who has learned the word for an object can not only name it but also draw it when prompted, even if the object itself is not in view. So the AI2 project team asked its models to do the same: generate images from captions.

The final images created by the model are not entirely realistic on close inspection, but that is not the point. They contain the correct high-level visual concepts. The AI simply draws the way a person who cannot draw would.

This makes sense: converting text to an image is more difficult than doing the opposite.

“A caption doesn’t specify everything contained in an image,” says Ani Kembhavi, AI2’s computer vision team leader.

Creating an image from text means expanding a small amount of information into a much larger one. That is hard enough for the human mind, let alone for a program. If a model is asked to draw a “giraffe walking along a road,” it needs to infer that the road will be gray rather than bright pink and will pass next to a field rather than the sea. None of this is obvious to an AI.

Sample images generated by the AI2 model from captions. Source: AI2

This stage of the research shows that neural networks are capable of creating abstractions – a fundamental skill for understanding our world.

In the future, this technology could allow robots to see our world as well as humans do, which would open up a huge range of possibilities. The better a robot understands its environment and uses language to communicate, the more complex the tasks it will be able to perform. In the nearer term, it may help programmers better understand what machine learning models are actually learning.

“Image generation has really been a missing puzzle piece. By enabling this, we can make the model learn better representations to represent the world.”

