Generative AI for Branding

Inspiration

A couple of months ago, I saw this incredible video from the Corridor Crew. It was my first introduction to training Stable Diffusion to recognize an object or style. I sat on the idea for a couple of months until I came up with a use case I liked.

As I went down the rabbit hole of learning the difference between textual inversion, LoRA, and Dreambooth, I found this video very informative. Check out his other videos too. He jumps between tutorials and videos explaining the math.

The Use Case

Branding and marketing departments use plenty of images to convey messages. I’m not too well versed in the marketing industry, but I imagine that companies tailor their portfolio with images that follow certain color schemes, show a diverse population, and fit an overall aesthetic. Instead of teaching Stable Diffusion an art style like the Corridor Crew, could I teach it a marketing aesthetic?

The Dataset

I work at Accenture, a company that is known globally for its marketing. I looked through some resources online and found a set of ~80 images that sum up Accenture’s visual brand pretty well. I appreciate that there is an effort to use images that show normal people rather than the corny corporate pictures you would see in a school textbook. There’s an emphasis on images that foster creativity, show people smiling, and put people in nature.

Picking a Model

For the Corridor Crew video, they used Dreambooth to train models that recognized specific people and an anime style. I could’ve used Dreambooth too, but I only have 50 GB of space free on my C drive, so I went the textual inversion embedding route. The Dreambooth route works like training most other CV models: you take a backbone that performs well on COCO or some other big dataset, then resume training from that checkpoint with use-case-specific images. Because of that, the output of Dreambooth is a full ~865 million parameter model, just like the Stable Diffusion checkpoint you started from. Textual inversion, on the other hand, has absolutely zero effect on the Stable Diffusion checkpoint. Training a textual inversion embedding means finding the combination of floating-point values, one or more vectors in the text encoder’s token embedding space, that steers the model toward images similar to the ones you trained on. Depending on how many vectors you choose, the embedding is only kilobytes in size. Mine was only 50 KB.
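
To make the “just a small vector” point concrete, here is a minimal sketch (using Hugging Face transformers rather than the Automatic1111 internals, and assuming the CLIP ViT-L/14 text encoder that Stable Diffusion v1.x uses). The thing textual inversion actually optimizes is the embedding row that the new placeholder token points at:

```python
from transformers import CLIPTokenizer, CLIPTextModel

# Assumption: Stable Diffusion v1.x pairs with this CLIP text encoder (768-dim token embeddings).
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder token; textual inversion only ever trains its embedding row(s).
placeholder = "accenture-style"
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))

token_id = tokenizer.convert_tokens_to_ids(placeholder)
vector = text_encoder.get_input_embeddings().weight[token_id]
print(vector.shape)                 # torch.Size([768])
print(vector.numel() * 4, "bytes")  # ~3 KB per vector at fp32; a multi-vector embedding plus file metadata lands in the tens of KB
```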

Training the Model

The first step was to resize all the images to 512x512. Through the training tab in the Automatic1111 web UI, you can set the path to your image folder, and a BLIP model then downloads in the background and captions each image for you. For every image, a text file is written that acts like a prompt would for text-to-image generation. The BLIP captions give you a head start, and I’ve found they do a good job of describing the image most of the time. To read more about BLIP, check out the paper here.
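
For reference, this is roughly what that preprocessing step boils down to. A minimal sketch, assuming a folder of JPEGs and the Salesforce/blip-image-captioning-base checkpoint; the web UI’s own cropping and captioning options differ in the details:

```python
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumption: raw images live in ./raw; 512x512 copies and caption files go to ./train.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

src, dst = Path("raw"), Path("train")
dst.mkdir(exist_ok=True)

for path in sorted(src.glob("*.jpg")):
    # Resize to the 512x512 resolution Stable Diffusion v1.x was trained at.
    image = Image.open(path).convert("RGB").resize((512, 512))
    image.save(dst / path.name)

    # The BLIP caption becomes the starting prompt for this training image.
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(out[0], skip_special_tokens=True)
    (dst / path.with_suffix(".txt").name).write_text(caption)
```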

After resizing and making a prompt for each training image, all the training parameters can be set. Batch size and the like can be chosen based on how much GPU VRAM you have. It is also important to take advantage of the new PyTorch 2.0 and the xformers package from Facebook Research, both of which make the attention layers far more memory efficient. Lastly, you have to select a specific keyword that will be used when you start writing your prompts. I chose accenture-style as the keyword.
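
If you ever script this outside the web UI, the equivalent memory-saving switches in the diffusers library look something like the sketch below; the SD 1.5 checkpoint name is just a stand-in assumption for whatever base model you train against:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumption: any Stable Diffusion v1.x checkpoint behaves the same way here; SD 1.5 is a stand-in.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Option 1: xformers memory-efficient attention (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()

# Option 2: on PyTorch 2.0+, recent diffusers versions default to
# torch.nn.functional.scaled_dot_product_attention, which gives a similar
# memory/speed benefit without xformers.
```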

Results

I had no idea what to expect when I finished training. Art styles are easy to decode because of line work, brush strokes, colors, etc., but photographs seem way tougher to me. In the end, I think the model did pick up on an aesthetic. For all the images below, the prompts were super simple, literally just ‘head of woman looking away from camera’ and ‘person standing on a cliff looking over a body of water and mountains’. The images created with the embedding all look like they were shot by somebody who has some experience taking pictures. The composition of all the ‘before’ pictures (generated without the embedding) is not that appealing. Almost every image has a short focal plane. I see a trend in the color palette, but I don’t think it is accurate to the dataset. There are a lot of cool colors in the results, whereas I think the dataset was pretty vibrant.

More than anything, I think this textual inversion embedding provides a shortcut to portrait photography. Instead of typing out ‘35mm, 50mm, fujifilm, kodak, photorealistic, portrait’ and all those keywords in the prompt, maybe a prompter could have a bigger word budget for describing the scene or other characteristics of the focal point.
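
As a rough sketch of that shortcut in practice (via diffusers’ load_textual_inversion; the embedding file name, checkpoint, and sampler settings below are assumptions), the single keyword stands in for the pile of photography terms:

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumptions: SD 1.5 base weights and an embedding exported from the web UI as accenture-style.pt.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("accenture-style.pt", token="accenture-style")

# The keyword does the work of '35mm, 50mm, fujifilm, kodak, photorealistic, portrait'.
prompt = "accenture-style, head of woman looking away from camera"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("sample.png")
```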