17 January 2026

The Power of Images

In our previous blog post, we explored the fundamentals of artificial intelligence and, building on that foundation, ventured into the fascinating world of large language models. Now it is time to turn our attention to several other exciting subfields of generative AI. This time, we will focus specifically on image, audio, and video generation using artificial intelligence.

Be Creative, Be Yourself

Let us begin by clarifying what is actually meant by “generative” artificial intelligence. In simple terms, generative AI (often referred to as Gen AI) describes a form of artificial intelligence that is capable of creating new content that resembles content produced by humans. This content can include text, images, music, videos, or even computer code.

And here comes a realization that many in the art world find unsettling: these generative AI models are not only able to analyze data, they can also produce creative output of their own. The far-reaching consequences of this development are something we will explore further, but first, let us take a closer look at the most well-known types of generative AI models.

GAN, Baby, GAN

Generative Adversarial Networks, or GANs, represent an innovative approach within generative AI. They consist of two neural networks that compete against each other in a kind of game-like setup. The first network, called the generator, creates artificial data such as images. The second network, known as the discriminator, attempts to determine whether a given example is real (taken from the training data) or artificially generated by the generator.

The generator’s goal is to fool the discriminator, while the discriminator continuously improves its ability to detect fake data. Through this ongoing back-and-forth, both networks become increasingly sophisticated, until the generator produces data that is so realistic it is nearly indistinguishable from real examples. GANs have enabled groundbreaking advances in image, video, and speech synthesis, and are therefore considered a major milestone in the development of creative AI systems.
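The adversarial game described above can be sketched in a few dozen lines. The following toy example is a minimal, illustrative 1-D GAN, not a production architecture: the "images" are just numbers drawn from a normal distribution with mean 4, the generator is a single affine layer, the discriminator is logistic regression, and the gradients are written out by hand. All parameter names and hyperparameters are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: "real data" is 1-D samples from N(4, 1); the generator maps
# noise z ~ N(0, 1) through a single affine layer g(z) = w*z + b.
w, b = 0.1, 0.0          # generator parameters
a, c = 0.1, 0.0          # discriminator parameters: D(x) = sigmoid(a*x + c)
lr = 0.01                # learning rate for both players

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(2000):
    real = rng.normal(4.0, 1.0, size=64)
    z = rng.normal(0.0, 1.0, size=64)
    fake = w * z + b

    # Discriminator step: ascend on log D(real) + log(1 - D(fake)),
    # i.e. get better at telling real samples from generated ones.
    d_real = sigmoid(a * real + c)
    d_fake = sigmoid(a * fake + c)
    a += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: ascend on log D(fake), i.e. fool the discriminator.
    d_fake = sigmoid(a * fake + c)
    w += lr * np.mean((1 - d_fake) * a * z)   # chain rule through the sigmoid
    b += lr * np.mean((1 - d_fake) * a)

samples = w * rng.normal(size=1000) + b
print(f"generated mean ~ {samples.mean():.2f} (real mean is 4.0)")
```

After a couple of thousand of these alternating updates, the generator's output distribution drifts toward the real one: exactly the back-and-forth dynamic the paragraph above describes, just in one dimension instead of pixel space.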

A Pleasant High?

Diffusion models are another form of generative AI capable of producing realistic images from random noise. This process can be guided using precise text descriptions. Well-known applications based on this approach include DALL·E, Midjourney, and Stable Diffusion.

The underlying process involves gradually adding noise to an image and then learning how to remove that noise step by step. Neural networks are trained to understand what real images look like and to approximate the function that reverses the noise. Creativity emerges from the combination of random noise and textual input, resulting in images that most likely have never existed before.
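The forward half of this process, gradually drowning an image in noise, has a convenient closed form that can be shown in a few lines. The sketch below uses an illustrative linear noise schedule (real models tune this carefully) and a tiny random array as a stand-in for an image; a trained network would then learn to predict the added noise from the noisy sample and the timestep, which is what makes the reverse, denoising direction possible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule (illustrative values, not tuned).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # cumulative signal fraction at step t

def add_noise(x0, t):
    """Closed-form forward step: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

x0 = rng.normal(size=(4, 8, 8))     # stand-in for a tiny grayscale "image"
x_early, _ = add_noise(x0, t=10)    # still mostly signal
x_late, _ = add_noise(x0, t=999)    # almost pure noise

print(f"signal fraction at t=10:  {alpha_bar[10]:.4f}")
print(f"signal fraction at t=999: {alpha_bar[999]:.6f}")
```

Note that the last step is essentially pure noise: the image has been fully destroyed, and generating a new image means running this process backwards from fresh random noise, with a text prompt steering each denoising step.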

However, the capabilities of these models depend heavily on their training data—millions of images sourced from the internet, including works by artists who often did not consent to their art being used. This raises ethical and legal questions: Who owns the generated images? And is AI-generated art truly art? Anyone paying close attention will realize that such a system could be trained on a large collection of Picasso paintings to generate a new “Picasso” image. This creation would be comparable to a new work by the long-deceased master and would likely never be exposed as a forgery, because technically it is not a forgery at all—it is simply the result of many original works combined with a bit of random noise. Fascinating on the one hand, and unsettling on the other, wouldn’t you agree?

Hello, Mr. President

Generative AI can also produce highly realistic-sounding voices of real people by converting text into speech using a technique known as voice cloning. First, the input text is analyzed and transformed into an acoustic representation, such as a spectrogram. A neural network then uses previously learned voice recordings of a specific person to replicate that individual’s vocal characteristics, including pitch, rhythm, tone, and speaking style. Finally, another model—such as WaveNet—generates a natural-sounding audio waveform that makes it seem as though the real person actually spoke the text.
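The first stage of that pipeline, turning audio into an acoustic representation, can be illustrated with a toy magnitude spectrogram. The sketch below analyzes a synthetic 440 Hz tone with a hand-rolled short-time Fourier transform; real voice-cloning stacks use mel-scaled spectrograms and a neural vocoder such as WaveNet to go back to a waveform, so this only shows the analysis step in miniature, with made-up frame sizes.

```python
import numpy as np

sr = 16000                          # sample rate in Hz
t = np.arange(sr) / sr              # one second of audio
wave = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz tone

def spectrogram(x, n_fft=512, hop=128):
    """Magnitude STFT: slide a Hann window over x and FFT each frame."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq_bins, frames)

spec = spectrogram(wave)
peak_bin = spec.mean(axis=1).argmax()
freq_resolution = sr / 512          # Hz per FFT bin
print(f"energy peaks near {peak_bin * freq_resolution:.0f} Hz")
```

A cloning system trains on many such spectrograms of one speaker until it can predict the spectrogram a new sentence *would* have in that voice; the vocoder then converts that predicted spectrogram back into an audible waveform.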

Modern systems like VALL-E can even imitate emotions and speaking styles, often using only a few seconds of original audio. While this technology is impressive, it also carries significant risks. Imagine receiving a personal phone call from the president, sounding completely authentic and exactly like the voice you know from television. In a moment of fascination or shock, some people might reveal passwords or confidential information they would otherwise never disclose. Unfortunately, certain malicious actors already exploit these technologies for social engineering. This makes it more important than ever to remain alert and skeptical when confronted with unexpected situations.

AI Killed the Movie Star

The slightly exaggerated headline captures the potential direction of this development, though it does not necessarily have to turn out that way. By combining diffusion models with voice cloning, generative AI could theoretically produce entire feature films featuring realistic-looking actors—without those actors ever having stood in front of a camera. As mentioned earlier, diffusion models enable image and video generation based on text descriptions or storyboards, creating photorealistic scenes, movements, and even camera motions.

At the same time, voice cloning allows the imitation of real actors’ voices, making dialogue sound authentic, complete with tone, emotion, and speech rhythm. When these technologies are combined with AI-assisted scriptwriting, music composition, and editing, the result is a fully synthetic film that looks and sounds as though it were produced with real people. This opens up new creative possibilities but also raises profound ethical and legal questions, particularly regarding the consent of the actors involved. In the future, it may become possible to purchase licenses from real actors, allowing them to “appear” in films without ever physically being on set. It will certainly be fascinating to see how filmmakers and the creative industry as a whole evolve in light of the possibilities offered by generative AI.

A Brave New World

By now, it should be clear just how extraordinary the capabilities of this type of AI are. In everyday work, generative AI technologies for audio, image, and video can be used for tasks such as automatically creating presentation graphics, producing realistic product visualizations for marketing, featuring virtual presenters in videos, and much more. These applications can make workflows more efficient, creative, and personalized.

However, caution is warranted. Generative AI can also be misused, for example in the creation of deepfakes: media content generated using artificial intelligence, typically videos or audio recordings in which the appearance or voice of a real person is replicated so convincingly that it seems as though they are saying or doing things they never actually said or did. The growing spread of deepfakes makes critical thinking and reliance on trustworthy information sources more important than ever. In a world where digitally manipulated content can appear completely real, the ability to question and verify information becomes the most important defense against disinformation. Stay vigilant, but do not close yourself off from the brave new world of AI and all its fascinating possibilities.
