A Blog by Jonathan Low

 

May 9, 2023

The Reason Generative AI Has Spread So Fast And Excited So Many

AI is exploding because so many can use it easily in so many different ways. 

It is changing how people work and create, though the implications for the economy and society are not yet clear - and may not be for quite a while. But that has been true of breakthrough technologies since the wheel. JL 

Will Heaven reports in MIT Technology Review:

While creative industries - from entertainment media to fashion, architecture, marketing, and more - will feel the impact first, this tech will give creative superpowers to everybody. In the longer term, it could be used to generate designs for anything from new drugs to clothes and buildings. “Almost always, we build something and then we all have to use it for a while to figure out what it’s going to be used for.” Not this time. “This is the first AI technology that caught fire with regular people.” OpenAI signed up a million users in just 2.5 months. More than a million people started using Stable Diffusion via its paid-for service in less than half that time. The generative revolution has begun.

It was clear that OpenAI was on to something. In late 2021, a small team of researchers was playing around with an idea at the company’s San Francisco office. They’d built a new version of OpenAI’s text-to-image model, DALL-E, an AI that converts short written descriptions into pictures: a fox painted by Van Gogh, perhaps, or a corgi made of pizza. Now they just had to figure out what to do with it.

“Almost always, we build something and then we all have to use it for a while,” Sam Altman, OpenAI’s cofounder and CEO, tells MIT Technology Review. “We try to figure out what it’s going to be, what it’s going to be used for.”

Not this time. As they tinkered with the model, everyone involved realized this was something special. “It was very clear that this was it—this was the product,” says Altman. “There was no debate. We never even had a meeting about it.”

But nobody—not Altman, not the DALL-E team—could have predicted just how big a splash this product was going to make. “This is the first AI technology that has caught fire with regular people,” says Altman.  

DALL-E 2 dropped in April 2022. In May, Google announced (but did not release) two text-to-image models of its own, Imagen and Parti. Then came Midjourney, a text-to-image model made for artists. And August brought Stable Diffusion, an open-source model that the UK-based startup Stability AI has released to the public for free.   

The doors were off their hinges. OpenAI signed up a million users in just 2.5 months. More than a million people started using Stable Diffusion via its paid-for service DreamStudio in less than half that time; many more used Stable Diffusion through third-party apps or installed the free version on their own computers. (Emad Mostaque, Stability AI’s founder, says he’s aiming for a billion users.)

And then in October we had Round Two: a spate of text-to-video models from Google, Meta, and others. Instead of just generating still images, these can create short video clips, animations, and 3D pictures.  

The pace of development has been breathtaking. In just a few months, the technology has inspired hundreds of newspaper headlines and magazine covers, filled social media with memes, kicked a hype machine into overdrive—and set off an intense backlash.

“The shock and awe of this technology is amazing—and it’s fun, it’s what new technology should be,” says Mike Cook, an AI researcher at King’s College London who studies computational creativity. “But it’s moved so fast that your initial impressions are being updated before you even get used to the idea. I think we’re going to spend a while digesting it as a society.”

Artists are caught in the middle of one of the biggest upheavals in a generation. Some will lose work; some will find new opportunities. A few are headed to the courts to fight legal battles over what they view as the misappropriation of images to train models that could replace them.

Creators were caught off guard, says Don Allen Stevenson III, a digital artist based in California who has worked at visual-effects studios such as DreamWorks. “For technically trained folks like myself, it’s very scary. You’re like, ‘Oh my god—that’s my whole job,’” he says. “I went into an existential crisis for the first month of using DALL-E.”

The image above is based on a variation of the prompt "an artist making art with an AI art tool in Alien (1979)." Artist Erik Carter went through a series of iterations to produce the final image (at top). "After landing on an image I was happy with, I went in and made adjustments to clean up any AI artifacts and make it look more 'real.' I'm a big fan of sci-fi from that era," explains Carter.
ERIK CARTER VIA DALL-E 2

But while some are still reeling from the shock, many—including Stevenson—are finding ways to work with these tools and anticipate what comes next.

The exciting truth is, we don’t really know. For while creative industries—from entertainment media to fashion, architecture, marketing, and more—will feel the impact first, this tech will give creative superpowers to everybody. In the longer term, it could be used to generate designs for almost anything, from new types of drugs to clothes and buildings. The generative revolution has begun.

 

A magical revolution 

For Chad Nelson, a digital creator who has worked on video games and TV shows, text-to-image models are a once-in-a-lifetime breakthrough. “This tech takes you from that lightbulb in your head to a first sketch in seconds,” he says. “The speed at which you can create and explore is revolutionary—beyond anything I’ve experienced in 30 years.”


Generative AI is changing the way work gets done in many industries.

Within weeks of their debut, people were using these tools to prototype and brainstorm everything from magazine illustrations and marketing layouts to video-game environments and movie concepts. People generated fan art, even whole comic books, and shared them online in the thousands. Altman even used DALL-E to generate designs for sneakers that someone then made for him after he tweeted the image.

Amy Smith, a computer scientist at Queen Mary University of London and a tattoo artist, has been using DALL-E to design tattoos. “You can sit down with the client and generate designs together,” she says. “We’re in a revolution of media generation.”

Paul Trillo, a digital and video artist based in California, thinks the technology will make it easier and faster to brainstorm ideas for visual effects. “People are saying this is the death of effects artists, or the death of fashion designers,” he says. “I don’t think it’s the death of anything. I think it means we don’t have to work nights and weekends.”

Stock image companies are taking different positions. Getty has banned AI-generated images. Shutterstock has signed a deal with OpenAI to embed DALL-E in its website and says it will start a fund to reimburse artists whose work has been used to train the models.

Stevenson says he has tried out DALL-E at every step of the process that an animation studio uses to produce a film, including designing characters and environments. With DALL-E, he was able to do the work of multiple departments in a few minutes. “It’s uplifting for all the folks who’ve never been able to create because it was too expensive or too technical,” he says. “But it’s terrifying if you’re not open to change.”

Nelson thinks there’s still more to come. Eventually, he sees this technology being embraced not only by media giants but also by architecture and design firms. It’s not ready yet, though, he says.

“Right now it’s like you have a little magic box, a little wizard,” he says. That’s great if you just want to keep generating images, but not if you need a creative partner. “If I want it to create stories and build worlds, it needs far more awareness of what I’m creating,” he says. 

That’s the problem: these models still have no idea what they’re doing.

Inside the black box

To see why, let’s look at how these programs work. From the outside, the software is a black box. You type in a short description—a prompt—and then wait a few seconds. What you get back is a handful of images that fit that prompt (more or less). You may have to tweak your text to coax the model to produce something closer to what you had in mind, or to hone a serendipitous result. This has become known as prompt engineering.

Prompts for the most detailed, stylized images can run to several hundred words, and wrangling the right words has become a valuable skill. Online marketplaces have sprung up where prompts known to produce desirable results are bought and sold. 

Prompts can contain phrases that instruct the model to go for a particular style: “trending on ArtStation” tells the AI to mimic the (typically very detailed) style of images popular on ArtStation, a website where thousands of artists showcase their work; “Unreal engine” invokes the familiar graphic style of certain video games; and so on. Users can even enter the names of specific artists and have the AI produce pastiches of their work, which has made some artists very unhappy.  
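As a rough illustration of how style phrases get attached to a prompt, here is a minimal Python sketch. The helper name `build_prompt` and the comma-separated layout are assumptions made for this example, not part of any model's interface; the subject and style phrases are the ones mentioned in the article.

```python
def build_prompt(subject, style_modifiers):
    """Join a subject and optional style phrases into one prompt string.

    Hypothetical helper: comma-separated modifiers are a common user
    convention, not a requirement of any particular model.
    """
    parts = [subject.strip()]
    # Skip empty modifiers so the prompt has no stray commas.
    parts += [m.strip() for m in style_modifiers if m.strip()]
    return ", ".join(parts)


prompt = build_prompt(
    "a corgi made of pizza",
    ["trending on ArtStation", "Unreal engine"],
)
print(prompt)
```

Swapping or reordering the modifiers is the kind of tweaking that the article describes as prompt engineering.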

ERIK CARTER VIA DALL-E 2

"I tried to metaphorically represent AI with the prompt 'the Big Bang' and ended up with these abstract bubble-like forms (right). It wasn't exactly what I wanted, so then I went more literal with 'explosion in outer space 1980s photograph' (left), which seemed too aggressive. I also tried growing some digital plants by putting in 'plant 8-bit pixel art' (center)."

Under the hood, text-to-image models have two key components: one neural network trained to pair an image with text that describes that image, and another trained to generate images from scratch. The basic idea is to get the second neural network to generate an image that the first neural network accepts as a match for the prompt.

The big breakthrough behind the new models is in the way images get generated. The first version of DALL-E used an extension of the technology behind OpenAI’s language model GPT-3, producing images by predicting the next pixel in an image as if pixels were words in a sentence. This worked, but not well. “It was not a magical experience,” says Altman. “It’s amazing that it worked at all.”

Instead, DALL-E 2 uses something called a diffusion model. Diffusion models are neural networks trained to clean images up by removing pixelated noise that the training process adds. The process involves taking images and changing a few pixels in them at a time, over many steps, until the original images are erased and you’re left with nothing but random pixels. “If you do this a thousand times, eventually the image looks like you have plucked the antenna cable from your TV set—it’s just snow,” says Björn Ommer, who works on generative AI at the University of Munich in Germany and who helped build the diffusion model that now powers Stable Diffusion. 
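The noising process described above can be sketched in a few lines of Python. This is a toy that follows the article's description literally (overwrite a few pixels per step); production diffusion models typically add a small amount of Gaussian noise to every pixel instead. The function names, the 5% fraction, and the 200 steps are all assumptions chosen for illustration.

```python
import random


def add_noise_step(pixels, fraction, rng):
    """Overwrite a random `fraction` of the pixels with random gray levels."""
    noisy = list(pixels)
    n_changes = max(1, int(len(noisy) * fraction))
    for i in rng.sample(range(len(noisy)), n_changes):
        noisy[i] = rng.random()
    return noisy


def forward_process(pixels, steps, fraction=0.05, seed=0):
    """Return every intermediate image, from the original to near-pure noise."""
    rng = random.Random(seed)
    trajectory = [list(pixels)]
    for _ in range(steps):
        trajectory.append(add_noise_step(trajectory[-1], fraction, rng))
    return trajectory


def fraction_unchanged(original, noisy):
    """Share of pixels that still hold their original value."""
    return sum(a == b for a, b in zip(original, noisy)) / len(original)


image = [0.5] * 1024  # a flat gray 32x32 image, flattened to a list
traj = forward_process(image, steps=200)
print(fraction_unchanged(image, traj[10]), fraction_unchanged(image, traj[-1]))
```

Each intermediate image in `traj` is slightly noisier than the one before it, and by the last step almost nothing of the original survives. Pairs of neighboring images like these are what the network is trained on.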

The neural network is then trained to reverse that process and predict what the less pixelated version of a given image would look like. The upshot is that if you give a diffusion model a mess of pixels, it will try to generate something a little cleaner. Plug the cleaned-up image back in, and the model will produce something cleaner still. Do this enough times and the model can take you all the way from TV snow to a high-resolution picture.
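The loop of "clean up a little, then plug the result back in" can be sketched as follows. Note the big assumption: `toy_denoiser` is a stand-in that cheats by looking at a known clean image, whereas a real diffusion model predicts the cleaner image from what it learned in training. Only the shape of the loop is meant to carry over; the names and numbers are invented for this example.

```python
import random


def toy_denoiser(noisy, target, strength=0.1):
    """Stand-in for a trained network: nudge each pixel a little toward a
    clean image. A real model never sees `target`; this is an assumption
    that keeps the sketch self-contained."""
    return [x + strength * (t - x) for x, t in zip(noisy, target)]


def generate(target, steps=50, seed=0):
    """Start from random pixels and repeatedly feed the output back in."""
    rng = random.Random(seed)
    image = [rng.random() for _ in target]  # the "TV snow" starting point
    for _ in range(steps):
        image = toy_denoiser(image, target)
    return image


def mean_abs_error(a, b):
    """Average per-pixel distance between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


target = [i / 63 for i in range(64)]  # a simple dark-to-light gradient
result = generate(target)
print(mean_abs_error(result, target))
```

No single step does much, but the error shrinks at every pass, which is why many small cleanup steps can get from noise to a finished picture.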

AI art generators never work exactly how you want them to. They often produce hideous results that can resemble distorted stock art, at best. In my experience, the only way to really make the work look good is to add descriptor at the end with a style that looks aesthetically pleasing.

~Erik Carter

The trick with text-to-image models is that this process is guided by the language model that’s trying to match a prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match. 
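To make the idea of guidance concrete, here is a toy sketch in which a scoring function steers generation. It uses a simple propose-and-keep-the-best search, which is not how real systems do it (they typically steer the denoiser with gradients or with the prompt's embedding). The `match_score` function is a hypothetical stand-in for the text-image matching network and understands only two made-up prompts, "bright" and "dark".

```python
import random


def match_score(image, prompt):
    """Hypothetical matcher: rates how well an image fits the prompt.
    Here 'bright' rewards high average brightness and 'dark' rewards low."""
    brightness = sum(image) / len(image)
    return brightness if prompt == "bright" else 1.0 - brightness


def guided_step(image, prompt, rng, n_candidates=8, step_size=0.1):
    """Propose several slightly altered images; keep the best-scoring one."""
    best, best_score = image, match_score(image, prompt)
    for _ in range(n_candidates):
        candidate = [
            min(1.0, max(0.0, x + rng.uniform(-step_size, step_size)))
            for x in image
        ]
        score = match_score(candidate, prompt)
        if score > best_score:
            best, best_score = candidate, score
    return best


def generate(prompt, size=64, steps=100, seed=0):
    """Start from random pixels and let the matcher steer every step."""
    rng = random.Random(seed)
    image = [rng.random() for _ in range(size)]
    for _ in range(steps):
        image = guided_step(image, prompt, rng)
    return image


bright = generate("bright")
dark = generate("dark")
print(sum(bright) / len(bright), sum(dark) / len(dark))
```

Both runs start from the same random pixels; only the prompt differs, and the matcher's preferences pull the two results in opposite directions.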

But the models aren’t pulling the links between text and images out of thin air. Most text-to-image models today are trained on a large data set called LAION, which contains billions of pairings of text and images scraped from the internet. This means that the images you get from a text-to-image model are a distillation of the world as it’s represented online, distorted by prejudice (and pornography).

One last thing: there’s a small but crucial difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, on the other hand, uses a technique called latent diffusion, invented by Ommer and his colleagues. It works on compressed versions of images encoded within the neural network in what’s known as a latent space, where only the essential features of an image are retained.

This means Stable Diffusion requires less computing muscle to work. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new apps is due to the fact that Stable Diffusion is both open source—programmers are free to change it, build on it, and make money from it—and lightweight enough for people to run at home.
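A quick back-of-the-envelope calculation shows why working in a latent space is cheaper. The sizes below are assumptions for illustration and are not given in the article: a 512x512 color image, and a latent that is 8 times smaller along each side with 4 channels, which matches the commonly cited configuration of the first public Stable Diffusion releases.

```python
def num_values(height, width, channels):
    """Count the numbers the model has to process for one image."""
    return height * width * channels


# Assumed sizes (not from the article): full-size RGB image vs. latent.
pixel_values = num_values(512, 512, 3)
latent_values = num_values(512 // 8, 512 // 8, 4)
ratio = pixel_values / latent_values

print(pixel_values, latent_values, ratio)
```

Under these assumptions the diffusion model handles 48 times fewer numbers per image at every denoising step, which goes a long way toward explaining why it fits on a home computer.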
