Will AI-generated art become a nightmare for artists?
I still remember the good old days when I was an avid reader of conspiracy theories, often hunting for exciting new stories on Reddit and various other forums and blogs. Bermuda Triangle stories were popular back then; I knew they were all conspiracies, but I loved reading them and watching conspiracy-nutter videos and interviews. Around 2012–13, quantum computing theories were also a hot topic, and a few Redditors and 4chan board users cooked up yet another set of theories claiming that future computers and robots would be capable of creating or doing almost anything we commanded. Those people were really conflating 3D printers, robots, and AI. Even five years ago, if you had asked me the same question, “Will computers be capable of creating new things and outsmarting humans?”, I probably would have responded, “Not in this century.” The buzzword “AI” put a full stop to that doubt around 2016, when tech giants started rolling out AI-aided products.
Now, looking at AI-generated art and AI art generators like DALL-E and MidJourney, it is clear that machines creating things is no longer fiction. Human creativity appears to be artificial intelligence’s final frontier.
Many argue that AI is excellent at many things, often exceeding human expectations, but that it will struggle to be creative. Much about the creative process is still unknown. Although there are methods we can employ to foster creative thinking, we are ultimately unsure of how it actually occurs, and this lack of knowledge about our own biological processes makes it challenging for scientists to simulate them on a machine. Despite these limitations, we are not far from producing artistic masterpieces with AI. Before jumping to conclusions, let’s see how these AI art generators make art.
This is the basic process happening behind the scenes of DALL-E 2’s image generation.
As the diagram shows, a text prompt first goes into a text encoder; the text encoding is then mapped to a corresponding image encoding, and finally a decoder produces the output image. But this wasn’t built all at once; rather, DALL-E 2 is the result of several methods that were developed and refined gradually over the previous few years. These techniques are stacked on top of one another like Lego blocks to form the whole model. To understand DALL-E 2, we first need to know what these components are and what they do; only then can we explore their inner workings and training processes. One of the most important building blocks in DALL-E 2 is the CLIP model.
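To make that flow a little more concrete, here is a minimal sketch of the pipeline in Python. Every function here is a hypothetical stand-in (not DALL-E 2’s real code or API); the stubs just return dummy arrays so the hand-off from text encoding to image encoding to pixels can actually run.

```python
import numpy as np

# Hypothetical stand-ins for the real components; in DALL-E 2 each of these is a
# large neural network. Here they return dummy arrays so the control flow runs.

def clip_text_encoder(prompt: str) -> np.ndarray:
    # Real system: CLIP's text tower produces a text embedding for the prompt.
    return np.random.randn(512)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    # Real system: maps the text embedding to a corresponding image embedding.
    return np.random.randn(512)

def diffusion_decoder(image_embedding: np.ndarray) -> np.ndarray:
    # Real system: a GLIDE-style diffusion model decodes the embedding into pixels.
    return np.random.rand(256, 256, 3)

def generate_image(prompt: str) -> np.ndarray:
    text_embedding = clip_text_encoder(prompt)   # text -> text embedding
    image_embedding = prior(text_embedding)      # text embedding -> image embedding
    return diffusion_decoder(image_embedding)    # image embedding -> output image

image = generate_image("a corgi playing a flame-throwing trumpet")
print(image.shape)  # (256, 256, 3)
```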
CLIP
How does DALL-E 2 understand the visual manifestation of a verbal concept like a casual text prompt? In DALL-E 2, this is handled by another OpenAI model named CLIP, which learns the relationship between textual semantics and their visual representations. CLIP, short for Contrastive Language-Image Pre-training, is crucial to DALL-E 2 because it serves as the primary link between text and visuals. More broadly, CLIP embodies the idea that language can be used to teach computers how images relate to one another, and it offers a straightforward way of applying that teaching. To determine how well a particular text passage corresponds to an image, CLIP is trained on millions of images and the captions that go with them. In other words, rather than trying to predict a caption for a given image, CLIP simply learns how closely connected any given caption is to a particular image. This contrastive, rather than predictive, objective is what lets CLIP understand the relationship between textual and visual representations of the same abstract concept. Let’s look at how CLIP is trained:
https://www.assemblyai.com/blog/content/media/2022/04/CLIP_training-1.mp4
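The heart of that training loop can be sketched in a few lines of PyTorch. This follows the well-known pseudocode from the CLIP paper: encode a batch of matched image–caption pairs, compute all pairwise similarities, and use a symmetric cross-entropy loss to pull each image toward its own caption and push it away from everyone else’s. The random tensors at the end stand in for real encoder outputs; this is a sketch of the objective, not OpenAI’s training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs."""
    # Normalize so the dot product becomes cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_features @ text_features.t() / temperature

    # The "correct" caption for image i is caption i, so the targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image), averaged.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random embeddings standing in for real encoder outputs.
image_batch = torch.randn(8, 512)
caption_batch = torch.randn(8, 512)
print(clip_contrastive_loss(image_batch, caption_batch).item())
```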
To fully appreciate CLIP, we first have to recognize the weaknesses of earlier computer vision systems. Before CLIP, neural approaches to computer vision required compiling enormous datasets of images that were then manually labeled with categories. Even though modern models are very good at this kind of classification, the need for pre-selected categories places inherent limits on them. Imagine taking a photograph of a street and asking such a system to describe it: it might be able to tell you how many cars and signs are present, but it couldn’t convey the overall atmosphere of the scene. Nor could it classify anything for which there weren’t enough photos to build a category in the first place.
The insight that makes CLIP so strong is to train the model not just to identify which category (from a pre-defined list of alternatives) an image belongs to, but to pick out each image’s caption from a collection of random captions. Instead of relying on a human labeler to decide in advance whether two things belong to the same category, this lets the model use language itself to distinguish between “a cheems” and “a cheems doing bonk.”
By completing this pre-training task, CLIP learns a shared vector space whose dimensions capture both linguistic and visual characteristics. This common vector space serves as a kind of image-text dictionary, letting models translate between the two (at least conceptually).
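You can see that shared space in action with OpenAI’s open-source CLIP package (installed from the openai/CLIP GitHub repository; the model names and calls below follow its published examples). The snippet embeds one image and a few candidate captions into the same space and scores how well each caption matches; the image path is a placeholder for any local file.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# "street.jpg" is a placeholder path; substitute any local image.
image = preprocess(Image.open("street.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a busy city street at night",
                      "a plate of pasta",
                      "a dog catching a frisbee"]).to(device)

with torch.no_grad():
    # Both encoders map into the same embedding space...
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# ...so matching a caption to the image is just cosine similarity plus a softmax.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(similarity)  # highest score should land on the caption that fits the image
```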
This understanding does not, however, cover all of DALL-E 2. Even if you had a mechanism for translating English words into Spanish, you would still need to learn Spanish pronunciation and grammar before you could speak it. For our model, CLIP lets us take a textual phrase and determine how it corresponds to visual features, but we still need a way to actually generate images that stay faithful to what the text says. Let’s move on to our second Lego piece: the diffusion model.
GLIDE
Once training is complete, the CLIP model is frozen, and DALL-E 2 moves on to learning how to reverse the image encoding mapping that CLIP has just discovered. CLIP learns a representation space in which it is easy to determine how textual and visual encodings relate, but our goal is image generation, so we need a way to exploit that representation space for it.
Specifically, OpenAI performs this image generation with a modified version of GLIDE, one of its earlier models. The GLIDE model learns to invert the image encoding process, stochastically decoding CLIP image embeddings back into images.
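Very roughly, this decoder behaves like any diffusion model: it starts from pure noise and repeatedly denoises it, with the CLIP image embedding steering every step. The sketch below is a heavily simplified illustration of that loop under those assumptions; denoise_step is a hypothetical stand-in for GLIDE’s conditioned U-Net, not a real API.

```python
import numpy as np

def denoise_step(noisy_image: np.ndarray,
                 image_embedding: np.ndarray,
                 t: int) -> np.ndarray:
    # Hypothetical stand-in for GLIDE's conditioned U-Net: given the current noisy
    # image, the CLIP image embedding, and the timestep, return a slightly cleaner image.
    return noisy_image * 0.98  # placeholder arithmetic so the loop runs

def glide_style_decode(image_embedding: np.ndarray,
                       steps: int = 1000,
                       size: int = 64) -> np.ndarray:
    # Start from pure Gaussian noise...
    image = np.random.randn(size, size, 3)
    # ...and iteratively reverse the noising process, conditioned on the embedding.
    for t in reversed(range(steps)):
        image = denoise_step(image, image_embedding, t)
    return image

sample = glide_style_decode(np.random.randn(512))
print(sample.shape)  # (64, 64, 3)
```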
(more in the next issue)