Text to Art using AI

The idea of using deep learning to create art or intelligently modify existing photos is not a new concept. Famous initiatives like Pix2Pix, which fills in and colors the outlines of black and white drawings, have been around for five years. The DeepDream architecture, developed by Alexander Mordvintsev, uses CNNs to morph input images through algorithmic pareidolia, finding and intensifying patterns in the visual data. These technologies have immense potential, with use cases in security, marketing, shopping, and everyday life emerging from breakthroughs such as those listed above.

While the business world outside the machine learning community is starting to feel the effects of breakthroughs in deep-learning-based computer vision, DL approaches are also becoming more and more common in the art world. The emergence of NFTs, virtual influencers, AI-generated music and artwork, and many other developments demonstrates the profound cultural impact that deep learning technologies are having.

In this post, we’ll take a closer look at VQGAN-CLIP, one of the most promising image generators currently available, and see how it can be used in conjunction with NLP and Gradient to produce creative clip art from a single text prompt.

CLIP (Contrastive Language-Image Pre-training) and VQGAN (Vector Quantized Generative Adversarial Network) are two separate technologies, and we refer to the interaction between these two networks as VQGAN-CLIP. They are independent models that cooperate.

How it works…

So how does VQGAN+CLIP work? To put it simply, the generator produces a picture, and CLIP measures how well that image matches the text prompt. The generator then uses the feedback from the CLIP model to create more “accurate” pictures. This iteration is repeated many times, until the CLIP score is high enough and the generated picture matches the text.
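To make the loop concrete, here is a minimal sketch of the iteration described above. The helpers random_initial_state(), generate_image(), clip_score(), and refine() are hypothetical placeholders standing in for the actual VQGAN and CLIP calls; the threshold and iteration count are arbitrary example values.

```python
# High-level sketch of the VQGAN + CLIP feedback loop (placeholder helpers).
def vqgan_clip_loop(prompt, threshold=0.35, max_iters=500):
    state = random_initial_state()          # hypothetical: random starting latents
    image = None
    for _ in range(max_iters):
        image = generate_image(state)       # VQGAN produces a candidate picture
        score = clip_score(image, prompt)   # CLIP: how well does it match the text?
        if score >= threshold:              # good enough -- stop iterating
            break
        state = refine(state, score)        # use CLIP's feedback to improve the image
    return image
```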

VQGAN+CLIP is only one illustration of what can be accomplished by combining CLIP with an image generator. You can use a different generator in place of VQGAN and the approach still works quite well. There are several variations of X + CLIP, including StyleCLIP (StyleGAN + CLIP), CLIPDraw (which makes use of a vector art generator), BigGAN + CLIP, and many more. AudioCLIP even uses audio instead of images.

VQGAN: VQGAN is a GAN architecture that can be used to learn from previous data and generate new images. It was first introduced in the paper “Taming Transformers” (2021) by Esser, Rombach, and Ommer. The image data is first passed to a GAN encoder, which produces a feature map capturing the visual parts of the image. This feature map is then vector quantized: a form of signal processing that groups vectors into clusters, each represented by a centroid vector referred to as a “codeword.” The vector-quantized data is encoded and stored as a codebook, a dictionary of codewords. The image data, now represented by codebook entries, is fed as a sequence to a transformer, which is trained to model the generation of high-resolution images from these encoded sequences.
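The following is an illustrative sketch of the vector-quantization step in isolation: each encoder output vector is replaced by its nearest codeword in a learned codebook. The codebook size and feature dimension are arbitrary choices for the example, not the paper’s exact configuration.

```python
# Sketch of vector quantization against a learned codebook (PyTorch).
import torch

codebook = torch.nn.Embedding(1024, 256)          # 1024 codewords, 256-dim each

def quantize(z):                                  # z: (num_patches, 256) encoder features
    distances = torch.cdist(z, codebook.weight)   # (num_patches, 1024) distance to every codeword
    indices = distances.argmin(dim=-1)            # index of the closest codeword per patch
    z_q = codebook(indices)                       # replace each vector with its codeword
    return z_q, indices                           # quantized features + codebook index sequence
```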

The key novelty of VQGAN, and what makes it so fascinating, is the ability to feed the picture data into the transformer autoregressively via the codebook encoding sequence. In practice, the transformer is trained on a succession of quantized tokens drawn from the codebook in an autoregressive manner: it learns to predict the distribution of the next token based on the sequence of previous tokens. This approach greatly reduces the cost of generating such images and allows for quicker processing of the image data.
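A rough sketch of that autoregressive step, assuming a generic transformer module that maps a sequence of codebook indices to logits over the next token (the start token and vocabulary size here are assumptions for illustration):

```python
# Sketch of autoregressive sampling over codebook indices (placeholder transformer).
import torch
import torch.nn.functional as F

def sample_tokens(transformer, num_tokens, vocab_size=1024):
    tokens = torch.zeros(1, 1, dtype=torch.long)        # assumed start token
    for _ in range(num_tokens):
        logits = transformer(tokens)                    # (1, seq_len, vocab_size)
        probs = F.softmax(logits[:, -1, :], dim=-1)     # distribution over the next codeword
        next_token = torch.multinomial(probs, 1)        # sample the next codebook index
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, 1:]                                # codebook sequence to decode into an image
```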

Through a sliding-window mechanism, the transformer also limits the context of the image generation to “patches” of pixels. This enhances resource efficiency by allowing the transformer to consider only the local context of a patch, “looking” only at neighboring patches for information while creating the image.
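One simple way to picture this restriction is as an attention mask that only allows each patch token to attend to tokens within a fixed window around it. The sketch below is a generic illustration of such a mask, not the exact mechanism from the paper; the sequence length and window size are example values.

```python
# Sketch of a sliding-window attention mask over a sequence of patch tokens.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # True where attention is allowed: positions within `window` of each other
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(seq_len=256, window=16)   # (256, 256) boolean mask
```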

The transformer ultimately produces a high-resolution image based on the context of the generation event, either unconditionally or conditionally, as further training rounds of a system like VQGAN-CLIP are completed.

CLIP (Contrastive Language-Image Pre-training) is a model developed to evaluate how well a caption fits an image compared to the other captions in a collection. Since CLIP is capable of zero-shot learning, it can function successfully even on unseen data. When used in VQGAN-CLIP, CLIP evaluates the quality of generated images in relation to a user-provided caption. The resulting scores can then be used as weights to “guide” the VQGAN’s learning process so that the generated image more closely matches the subject matter.
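As a concrete illustration, here is a minimal sketch of scoring an image against a caption, assuming the open-source clip package from OpenAI (https://github.com/openai/CLIP) is installed; the image filename is an assumed example.

```python
# Sketch of CLIP scoring: cosine similarity between image and text embeddings.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("generated.png")).unsqueeze(0).to(device)  # assumed example file
text = clip.tokenize(["A monk in the forest"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity = how well the caption fits the image
    score = torch.cosine_similarity(image_features, text_features)
print(score.item())
```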

Figure: Progression from the initial image to the final output as the number of iterations increases (artwork generated from the prompt “A monk in the forest”)

Combined, VQGAN-CLIP forms a pipeline that can be used to produce pictures from text. To generate an image, the VQGAN first creates a random-noise image that is vector quantized and encoded in a codebook. The codebook sequence is then used as input to a transformer, which produces the new image from the encoded signals. CLIP then assesses how accurately the output matches the input prompt, and that score is sent back to the VQGAN to update the image generation so that it more closely reflects the prompt.
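To make that feedback path concrete, here is a sketch of a single guided update step, where the CLIP similarity is turned into a loss whose gradient flows back into the VQGAN latents. The vqgan and clip_model objects and the preprocess_for_clip() helper are placeholders for the loaded models and CLIP’s resizing/normalization, not a specific implementation.

```python
# Sketch of one CLIP-guided update of the VQGAN latents (placeholder models).
import torch

def guided_step(vqgan, clip_model, latents, text_features, optimizer):
    image = vqgan.decode(latents)                              # latents -> RGB image
    image_features = clip_model.encode_image(preprocess_for_clip(image))
    loss = -torch.cosine_similarity(image_features, text_features).mean()
    optimizer.zero_grad()
    loss.backward()                                            # CLIP's score guides the latents
    optimizer.step()
    return loss.item()
```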

Source Code

A Google Colab notebook for experimenting with VQGAN and CLIP is available here.
