The Rise of Generative AI
From GANs and diffusion models to DALL·E, Stable Diffusion, and the automation of creativity.
Introduction: The Machine Learns to Make
For most of the history of artificial intelligence, the field’s relationship with human creativity was observational. AI systems looked at things --- images, texts, audio recordings --- and learned to classify them, describe them, or predict what would come next. They were, fundamentally, systems of recognition: pattern matchers that became extraordinarily sophisticated but that always stood in the position of a reader or viewer rather than an author or artist. The shift that occurred in the 2010s and accelerated dramatically in the early 2020s was, in its deepest sense, a shift in that posture: AI systems stopped being primarily systems of recognition and became systems of creation.
This shift had been building for years in the technical literature. Generative Adversarial Networks, introduced by Ian Goodfellow and colleagues in 2014, provided the first architecture capable of generating images of sufficient quality to provoke genuine surprise --- and, in some cases, genuine unease --- in viewers who could not always tell whether a face was real or invented. Variational Autoencoders offered a mathematically principled framework for learning the statistical structure of a domain and sampling coherent new examples from it. Diffusion models, developed through the late 2010s and reaching practical effectiveness around 2020 and 2021, provided a new and in some respects superior approach to the same problem, producing images of a quality that previous methods had not approached. And the large Transformer-based language models traced in Episode 15 demonstrated that the same scaling dynamics that made text generation impressive could, when combined with image generation models, produce systems capable of creating images from text descriptions of extraordinary fluency and diversity.
“Generative AI did not just give machines new capabilities. It gave them a new relationship to human culture --- not as archivists or analysts of what humans had made, but as participants in the act of making.”
The consequences of this shift have been felt most immediately by those whose work involves creation: visual artists and illustrators, musicians and sound designers, writers and journalists, graphic designers and advertising professionals. For these communities, the arrival of generative AI in the early 2020s was experienced as a disruption of unusual speed and unusual ambiguity --- a technology that could produce, in seconds, outputs that would have taken hours of skilled human work, whose aesthetic qualities were sometimes impressive and sometimes uncanny, and whose relationship to the human creative work on which it had been trained raised profound questions about authorship, originality, and fair use that legal and ethical frameworks had not been designed to answer.
This episode traces the technical development of generative AI --- the progression from GANs and VAEs through diffusion models to multimodal systems --- and examines the practical reality of these systems: what they can do, what they cannot do, the specific risks they introduce for individuals and organizations, and the governance frameworks that are developing to address those risks. It is organized both as a history and as a practical guide, because the history is still happening and its implications are still being worked out in real time.
Section 1: The Milestones --- How Generative AI Developed
The history of generative AI is not a single linear progression but several parallel research traditions that developed largely independently for years before converging in the early 2020s around the common goal of generating high-quality, controllable content. Understanding the distinct origins of GANs, VAEs, autoregressive language models, and diffusion models is essential for understanding why current generative systems work the way they do, and why they have the specific strengths and weaknesses they exhibit.
Generative Adversarial Networks: The Adversarial Imagination
The paper that most dramatically announced the generative AI era was Ian Goodfellow’s 2014 introduction of Generative Adversarial Networks, conceived --- according to a well-worn story in the AI community --- during a late-night argument at a Montreal bar and written in a single day. The GAN framework was strikingly elegant: two neural networks, a generator and a discriminator, trained against each other in a minimax game. The generator’s job was to produce outputs --- initially, images --- that the discriminator could not distinguish from real examples drawn from the training distribution. The discriminator’s job was to distinguish real from generated samples. Each network improved by trying to defeat the other, and the hope was that this adversarial dynamic would drive the generator toward producing samples indistinguishable from real data.
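The adversarial objective can be made concrete with a minimal numpy sketch. The two loss terms below operate on hypothetical discriminator probabilities rather than a trained network, and use the non-saturating generator loss that Goodfellow's paper recommends in practice:

```python
import numpy as np

def gan_losses(d_real, d_fake):
    """Losses from discriminator probabilities.

    d_real: discriminator outputs D(x) on real samples, in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, in (0, 1)
    """
    eps = 1e-12  # numerical safety for log(0)
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # written here as a loss to minimize.
    d_loss = -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))
    # Non-saturating generator loss: minimize -log D(G(z)), which gives
    # stronger gradients early on than the raw minimax form log(1 - D(G(z))).
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss

# If the generator fools the discriminator (D(G(z)) near 1), the generator
# loss shrinks toward zero while the discriminator loss grows.
d_loss, g_loss = gan_losses(np.array([0.9, 0.8]), np.array([0.1, 0.2]))
```

In a real GAN, each network's parameters are updated by gradient descent on its own loss in alternating steps; keeping those two updates in balance is exactly the instability discussed below.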
The initial results were modest by current standards: blurry, low-resolution images of faces and simple objects that were clearly artificial on close inspection but showed that the basic approach worked. What followed over the next several years was one of the most rapid improvement trajectories in the history of machine learning. The Progressive GAN architecture of 2018, from researchers at NVIDIA, introduced training that built up resolution progressively rather than all at once, producing photorealistic 1024x1024 pixel faces of people who did not exist --- images that fooled many viewers who did not know what they were looking at. StyleGAN (2019) and StyleGAN2 (2020), also from NVIDIA, added explicit control over style at different scales, allowing artists and researchers to interpolate between faces, transfer styles, and manipulate specific attributes while holding others constant.
The cultural impact of GAN-based face generation was significant and partly troubling. The website “This Person Does Not Exist,” launched in February 2019 and generating a new photorealistic fake face on each page load, became a widely shared demonstration of how completely GANs had mastered the visual statistics of the human face. The same technology powered deepfake videos --- realistic-seeming video content placing real people’s faces on different bodies or in different contexts --- that raised immediate concerns about political manipulation, non-consensual pornography, and the general reliability of video as evidence. The first major public deepfake scandal, involving manipulated video of politicians and celebrities, arrived within months of the most capable face-swapping tools becoming widely available, and established a pattern --- capability preceding safeguard --- that would characterize generative AI’s broader social impact.
GANs also found enormously productive applications in domains less laden with deception risk. In drug discovery, GANs were used to generate candidate molecular structures with desired chemical properties. In astronomy, they generated synthetic training data for classifying galaxies. In medical imaging, they generated synthetic patient data that could be used to train diagnostic systems without compromising real patient privacy. In fashion and retail, they generated photorealistic product images without physical prototypes. In architecture and design, they generated spatial layouts and material combinations at a speed no human designer could approach. The technology was the same in all these cases; the context determined whether it was a tool for creativity or a vector for harm.
Variational Autoencoders: Learning the Shape of Possibility
Variational Autoencoders, introduced by Diederik Kingma and Max Welling in 2013 --- a year before GANs --- approached the generative problem from a different direction, rooted in Bayesian statistics rather than game theory. The VAE framework trained an encoder network to compress input data into a compact latent representation --- a vector of numbers in a low-dimensional “latent space” --- and a decoder network to reconstruct the original input from that latent representation. The key innovation was that the encoder did not produce a single point in latent space for each input but rather a probability distribution, typically a Gaussian centered at a learned mean with a learned variance. During training, the model sampled from this distribution rather than using the mean directly, forcing the latent space to be smooth and continuous: nearby points in latent space decoded to similar outputs, and interpolating between two points produced coherent intermediate outputs rather than incoherent noise.
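The two ingredients described above, sampling through a learned Gaussian (the "reparameterization trick") and a regularizer pulling that Gaussian toward a standard normal, can be sketched in a few lines of numpy. The encoder and decoder networks are omitted; `mu` and `log_var` stand in for an encoder's outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I),
    so the randomness is isolated in eps and gradients can flow through
    mu and log_var during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, I), summed over latent
    dimensions. This is the term that keeps the latent space smooth and
    centered rather than letting encodings scatter arbitrarily."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.array([0.5, -0.3])
log_var = np.array([0.0, 0.0])        # unit variance in both dimensions
z = reparameterize(mu, log_var)       # a sampled latent vector
kl = kl_to_standard_normal(mu, log_var)  # zero only when mu=0, log_var=0
```

The full VAE training objective adds this KL term to a reconstruction loss between the decoder's output and the original input.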
This smoothness of the latent space was VAEs’ most practically significant property. Because the latent space was organized such that similar inputs mapped to nearby points and interpolation was semantically meaningful, users could navigate it intuitively: moving a latent vector in a particular direction would change a specific attribute of the decoded output in a predictable way. If you identified the direction in latent space corresponding to “smiling” by comparing the average latent vectors of smiling and non-smiling faces, you could add or subtract that direction from any face’s latent vector to make it more or less smiling, with everything else held approximately constant. The latent space was, in this sense, a structured representation of the space of possible outputs that could be navigated with a kind of creative intentionality.
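The "smile direction" trick described above is plain vector arithmetic. In this illustrative sketch the labeled latent vectors are synthetic stand-ins (two shifted Gaussian clusters) rather than the outputs of a trained encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 64

# Stand-ins for encoded latent vectors of labeled faces; in a real system
# these would come from running the encoder on labeled images.
smiling = rng.standard_normal((100, dim)) + 0.8
neutral = rng.standard_normal((100, dim)) - 0.8

# The attribute direction is the difference of the two class means.
smile_dir = smiling.mean(axis=0) - neutral.mean(axis=0)

def edit(z, direction, strength):
    """Move a latent vector along an attribute direction; decoding the
    result changes that attribute while leaving others roughly fixed."""
    return z + strength * direction

z = neutral[0]
z_smiling = edit(z, smile_dir, 1.0)    # more smiling
z_frowning = edit(z, smile_dir, -1.0)  # less smiling
```

The same mean-difference construction works for any attribute with labeled examples, which is why slider-style editing interfaces were practical to build on top of well-organized latent spaces.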
VAEs produced outputs that were somewhat blurrier than GAN-generated images, because the probabilistic training objective encouraged the decoder to hedge its bets by producing the average of multiple possible reconstructions rather than committing to a single sharp output. This blurriness limited their appeal for applications requiring photorealistic output. But their mathematical tractability --- the fact that their training objective was a well-defined probabilistic loss function rather than the delicate adversarial balance of GANs --- made them more stable to train and easier to understand theoretically. VAEs became foundational for applications that valued control and interpretability over raw image quality, and their latent-space framework directly influenced the design of the diffusion models that would ultimately supersede both GANs and VAEs for image synthesis.
Diffusion Models: Learning to Reverse Chaos
The dominant technology for image generation by the early 2020s was neither GANs nor VAEs but diffusion models, a class of generative models whose development can be traced from Sohl-Dickstein and colleagues’ 2015 paper on non-equilibrium thermodynamics to the denoising diffusion probabilistic models of Ho, Jain, and Abbeel in 2020, and the score-based diffusion models of Song and Ermon. The core idea was simple and counterintuitive: rather than training a model to directly generate images, train a model to reverse a process of progressive image destruction. Start with a real image, add Gaussian noise gradually over hundreds of steps until the image is indistinguishable from pure random noise, and train a neural network to predict, given a noisy image at any step of this process, what the original image looked like. A trained diffusion model could then generate new images by starting from pure noise and repeatedly applying the denoising network, each step moving slightly from noise toward coherent image structure.
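The forward (noising) process has a convenient closed form: the noisy image at any step can be sampled directly from the original, without simulating every intermediate step. A minimal numpy sketch, using the linear noise schedule from the DDPM paper and a random vector standing in for an image:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (DDPM)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal retention at each step

def q_sample(x0, t):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(16)   # stand-in for a flattened image
x_early = q_sample(x0, 10)     # mostly signal
x_late = q_sample(x0, T - 1)   # essentially pure noise: alpha_bar is near 0
```

Training then amounts to showing the network pairs of (noisy image, step index) and asking it to predict the noise that was added, a simple regression objective with none of the adversarial balancing GANs require.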
The practical advantages of diffusion models over GANs were substantial and quickly decisive. GANs were notoriously difficult to train: the adversarial game between generator and discriminator was unstable, prone to mode collapse --- a failure mode in which the generator learned to produce a small number of high-quality samples rather than the full diversity of the training distribution --- and highly sensitive to the relative training progress of the two networks. Diffusion models, trained on a straightforward denoising objective, were substantially more stable and produced images with greater diversity and fewer artifacts. The 2021 paper “Diffusion Models Beat GANs on Image Synthesis”, by Dhariwal and Nichol at OpenAI, was a turning point: on standard image quality benchmarks, diffusion models had overtaken the best GAN architectures, and the field’s attention shifted accordingly.
Latent diffusion models, introduced by Rombach and colleagues at LMU Munich and Runway in 2022 and forming the technical foundation of Stable Diffusion, addressed the computational cost of diffusion models by performing the diffusion process in a compressed latent space rather than in full pixel space. By first encoding images into a low-dimensional latent representation using a VAE-like encoder, running the diffusion process in that latent space, and then decoding the result back to pixel space, latent diffusion models achieved the image quality of pixel-space diffusion models at a fraction of the computational cost. This efficiency made it practical to release Stable Diffusion as an open-source model that could be run on consumer hardware --- a decision with enormous consequences for the accessibility and subsequent trajectory of image generation AI.
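The scale of the saving is easy to quantify with back-of-the-envelope arithmetic, using the shapes typical of Stable Diffusion (512x512 RGB images compressed by an 8x-downsampling VAE into 64x64x4 latents):

```python
# Element counts touched by each denoising step.
pixel_elems = 512 * 512 * 3    # pixel-space diffusion: 786,432 values
latent_elems = 64 * 64 * 4     # latent-space diffusion: 16,384 values
ratio = pixel_elems / latent_elems
# Each of the hundreds of denoising steps processes ~48x fewer values in
# latent space, which is the core of latent diffusion's efficiency win.
```

The one-time cost of encoding and decoding through the VAE is small next to the repeated denoising steps, so nearly all of that factor is realized in practice.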
Text-to-Image: When Language Meets Vision
The systems that brought generative AI to broad public attention were not pure image generation models but text-to-image systems: models that accepted natural language descriptions as input and produced images matching those descriptions. The enabling technology was the combination of large vision-language models --- trained to align text and image representations --- with powerful image generation models. OpenAI’s CLIP (Contrastive Language-Image Pre-training), published in January 2021, trained a vision encoder and a text encoder jointly to produce aligned representations: the CLIP embedding of an image and the CLIP embedding of its caption were trained to be close in representation space, while embeddings of mismatched image-caption pairs were pushed apart. CLIP learned, from 400 million image-text pairs scraped from the internet, to represent the semantic content of images in a way that aligned with how humans described them in text.
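CLIP's contrastive objective can be sketched compactly: within a batch of paired embeddings, the matched image-caption pairs sit on the diagonal of a similarity matrix, and a symmetric cross-entropy pushes each diagonal entry to dominate its row and column. A minimal numpy version, with random vectors standing in for encoder outputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    image/text embeddings."""
    # L2-normalize, as CLIP does before computing cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    n = logits.shape[0]
    diag = np.arange(n)
    # Cross-entropy with the diagonal as the target, in both directions:
    # each image should pick out its own caption, and vice versa.
    loss_i = -np.log(softmax(logits, axis=1)[diag, diag])
    loss_t = -np.log(softmax(logits, axis=0)[diag, diag])
    return (loss_i.mean() + loss_t.mean()) / 2

rng = np.random.default_rng(0)
e = rng.standard_normal((4, 32))
aligned = clip_loss(e, e)                   # perfectly matched pairs: low loss
shuffled = clip_loss(e, np.roll(e, 1, 0))   # every pair mismatched: high loss
```

Minimizing this loss over hundreds of millions of pairs is what produces the shared text-image representation space that text-to-image systems condition on.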
DALL·E, released by OpenAI in January 2021, paired a discrete VAE that compressed images into grids of tokens with a GPT-style Transformer that generated those image tokens autoregressively from text tokens, using CLIP to rerank candidate outputs by how well they matched the prompt. The results were striking: given prompts like “an armchair in the shape of an avocado” or “a painting of a capybara sitting in a field at sunrise in the style of Monet,” DALL·E produced multiple plausible images that captured both the content and the stylistic instruction of the prompt. DALL·E 2, released in April 2022, used a diffusion model conditioned on CLIP embeddings to produce significantly higher-quality images with better coherence between prompt and output. Midjourney, released in open beta in July 2022, and Stable Diffusion, released in August 2022, provided alternative text-to-image systems with their own aesthetic characteristics and, in Stable Diffusion’s case, open-source weights that could be freely downloaded, fine-tuned, and deployed.
The public launch of these systems in 2022 was a cultural moment of unusual intensity. Within weeks of their availability, social media was flooded with AI-generated images in styles ranging from photorealistic to impressionistic to fantastical; artists experimented with prompting as a new creative medium; illustrators and graphic designers confronted the possibility that skills they had spent years developing could be partially replicated in seconds by anyone with a good prompt. The creative community’s response was divided and passionate: some embraced the tools as powerful aids to their practice, using AI-generated images as starting points, references, or components in larger works; others experienced the systems’ capabilities as a direct threat to their livelihoods and a violation of their intellectual property, given that the models had been trained on billions of images scraped from the internet, including many by artists who had neither consented to nor been compensated for their work’s use as training data.
Reflection: The progression from GANs in 2014 to diffusion-based text-to-image systems in 2022 compressed roughly eight years of research into a capability trajectory that no one had fully predicted. Each major architectural shift --- from GANs to VAEs to diffusion models, from pixel space to latent space, from unconditional generation to text conditioning --- unlocked new capabilities and new applications while introducing new risks and new questions. The speed of this progression left social, legal, and ethical frameworks substantially behind the technical reality, a gap that the following years would struggle to close.
Section 2: How It Works --- Technical Intuition for the Curious
Generative AI systems are often described in terms that oscillate between mystification --- treating them as oracles or black boxes that somehow “understand” creativity --- and dismissal --- characterizing them as “stochastic parrots” that merely recombine training data without genuine generalization. Both descriptions miss what is actually happening in these systems, which is technically specific and intellectually interesting. This section offers a conceptual map for readers who want to understand the mechanisms behind generative AI without navigating the full mathematical formalism.
Latent Spaces: The Geometry of Meaning
The central concept that unifies most generative AI approaches is the latent space: a high-dimensional mathematical space in which the model represents the underlying structure of its training data in a compressed form. Imagine the space of all possible face images: it is an enormously high-dimensional space, with each dimension corresponding to a pixel value in a high-resolution image. Most of that high-dimensional space contains images that look nothing like faces --- random noise, or coherent objects that are not faces. The actual faces form a much lower-dimensional manifold embedded within this high-dimensional space, with the dimensions of that manifold corresponding to meaningful attributes of faces: age, gender expression, skin tone, facial structure, lighting, expression, and thousands of subtler variations.
A generative model’s latent space is, ideally, a compact and organized representation of this face manifold. Each point in the latent space decodes to a specific face; nearby points decode to similar faces; moving in a particular direction in latent space changes specific facial attributes in predictable ways. The “knobs” that creators manipulate when they adjust prompts, change style weights, or use slider interfaces in image generation tools are, at a mathematical level, operations in latent space: adding or subtracting vectors that correspond to specific attributes, interpolating between two reference points, or sampling randomly from a region of latent space defined by a text description.
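Interpolation between two reference points, one of the latent-space operations described above, is worth seeing concretely. For models whose latents are roughly Gaussian, spherical interpolation (slerp) is often preferred over a straight line, because it stays on the shell where samples actually concentrate; this sketch operates on random stand-in vectors:

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latent vectors. At t=0 it
    returns z0, at t=1 it returns z1, and intermediate t values trace an
    arc rather than a straight chord through the origin's low-density
    interior."""
    z0n = z0 / np.linalg.norm(z0)
    z1n = z1 / np.linalg.norm(z1)
    omega = np.arccos(np.clip(np.dot(z0n, z1n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * z0 + t * z1   # vectors nearly parallel: fall back to lerp
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

rng = np.random.default_rng(2)
a, b = rng.standard_normal(128), rng.standard_normal(128)
midpoint = slerp(a, b, 0.5)   # decoding this yields a blend of the two outputs
```

Sweeping t from 0 to 1 and decoding each point is exactly the "morphing" demo familiar from GAN and diffusion showcases.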
The quality of a generative model can largely be understood as a question of how well its latent space is organized. A well-organized latent space has smooth transitions between similar concepts, clear separation between dissimilar ones, and consistent mapping between directions in latent space and meaningful semantic attributes. A poorly organized latent space produces discontinuous outputs, where small changes in the latent vector cause large, unpredictable changes in the decoded output, or covers only a narrow slice of the training distribution, as in the mode collapse failures familiar from GAN training, or yields the incoherent interpolations that plagued early VAEs. Much of the technical progress in generative AI over the past decade can be understood as progress in learning better-organized latent spaces from larger and more diverse training datasets.
Three Approaches to Generation: Games, Probability, and Noise
The three major generative model paradigms --- GANs, VAEs and autoregressive models, and diffusion models --- each approach the problem of learning to generate from a training distribution in a fundamentally different way, and each approach has characteristic strengths, weaknesses, and failure modes that matter for practical applications.
GANs train by adversarial competition. The generator network is never directly told what a good image looks like; it is only told, through the discriminator’s gradient signal, whether the discriminator was fooled or not. This adversarial training signal is powerful --- it pushes the generator toward outputs at the edge of what the discriminator can detect as fake --- but unstable. The generator and discriminator must remain in rough parity for training to proceed productively; if either gets too far ahead of the other, training diverges or collapses. GAN training requires careful tuning of learning rates, batch sizes, and architectural choices, and even well-tuned GANs exhibit mode collapse --- generating a limited subset of the training distribution’s diversity rather than its full range. GANs’ strength is speed: once trained, generating a sample requires a single forward pass through the generator, making GAN-based generation extremely fast at inference time.
Autoregressive models --- including the GPT family for text and image generation models like the original DALL·E that used discrete image tokens --- generate outputs one element at a time, each conditioned on all previous elements. The training objective is maximizing the likelihood of the training data under the model’s predicted distributions, which is a stable and well-understood probabilistic objective with no adversarial dynamics. Autoregressive models can be highly expressive, but their sequential generation process is slow at inference time for long sequences and makes them somewhat inflexible for applications that require editing existing outputs rather than generating from scratch. They also suffer from “exposure bias”: the model is trained on perfect context sequences but deployed with its own potentially imperfect generated context, creating a discrepancy that can cause quality to degrade for long generations.
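The one-element-at-a-time generation loop is simple enough to sketch end to end. Here a fixed bigram logit table stands in for a Transformer's forward pass; the structure of the loop, sample a token, append it, condition on it, is the same in real systems:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 5

# Toy "model": next-token logits conditioned only on the previous token.
# A real autoregressive model conditions on the entire context.
logits_table = rng.standard_normal((VOCAB, VOCAB))

def sample_next(context):
    """One autoregressive step: compute the predicted distribution over
    the next token and sample from it."""
    logits = logits_table[context[-1]]
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(VOCAB, p=p)

def generate(start_token, length):
    seq = [start_token]
    for _ in range(length):
        seq.append(sample_next(seq))   # each sampled token feeds back as context
    return seq

seq = generate(0, 10)   # 11 tokens total, produced strictly one at a time
```

The loop also makes the exposure-bias point above visible: after the first step, every prediction is conditioned on the model's own samples rather than on ground-truth context.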
Diffusion models train by learning to reverse a noise-addition process, with a stable denoising objective that avoids the instabilities of adversarial training. Their outputs are generally of higher quality and greater diversity than GANs, and they do not suffer from mode collapse. Their primary limitation is inference speed: generating a sample requires running the denoising network for hundreds or thousands of sequential steps, making raw diffusion-based generation significantly slower than GAN-based generation. Latent diffusion models address this by running the process in a compressed latent space, and distillation techniques --- training smaller, faster models to approximate the outputs of larger diffusion models in far fewer steps --- have substantially reduced the gap. By 2023, high-quality image generation with diffusion models was achievable in single-digit seconds on consumer hardware, making the inference speed disadvantage manageable for most applications.
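The sequential reverse process responsible for that inference cost looks like this in skeleton form. The trained denoising network is replaced here by a dummy that always predicts zero noise, so the sketch shows the sampling loop's structure (a DDPM-style ancestral sampler) rather than producing meaningful images:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x_t, t):
    """Placeholder for the trained denoising network eps(x_t, t);
    in a real model this is a U-Net or Transformer forward pass."""
    return np.zeros_like(x_t)

def ddpm_sample(shape):
    """Start from pure noise and apply the learned denoising step T times,
    each step moving slightly from noise toward image structure."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        # Posterior mean of the previous step given the predicted noise.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            # All but the final step add back a small amount of fresh noise.
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

sample = ddpm_sample((8,))   # one network call per step: T calls per sample
```

Because every iteration requires a full network forward pass, the loop's length is exactly the inference-speed penalty that distillation and fast samplers attack by cutting T to a handful of steps.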
Conditioning and Control: Telling the Model What You Want
Raw generative models produce samples from the distribution they were trained on, without any mechanism for specifying what kind of sample you want. Useful generative AI systems add conditioning: a mechanism for providing information that guides the generation toward outputs with desired properties. The conditioning signal can take many forms: class labels (generate a sample from class X), text descriptions (generate an image matching this description), reference images (generate a variation of this image), or combinations of all three.
The critical technical advance that made text-to-image generation practical was the development of conditioning mechanisms that could effectively use rich natural language descriptions to guide image generation. CLIP’s aligned text-image embeddings provided the bridge: by conditioning the image generation process on the CLIP embedding of a text prompt rather than the text itself, the model could exploit the rich semantic structure that CLIP had learned to associate text with visual content. Classifier-free guidance, introduced in 2022 by Jonathan Ho and Tim Salimans, provided a technique for amplifying the effect of the conditioning signal by training the model both with and without conditioning and using the difference between the conditional and unconditional generation directions to steer more strongly toward the prompted output. The “guidance scale” parameter that users of diffusion models can adjust is a direct expression of this technique: higher guidance scale means stronger conditioning and images that more closely match the prompt, at some cost to diversity and sometimes to image naturalness.
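Classifier-free guidance itself is a one-line formula: at each denoising step the model is run twice, once with and once without the prompt, and the two noise predictions are extrapolated. A minimal sketch with stand-in prediction vectors:

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward (and past) the conditional one. A scale of 1 recovers
    the plain conditional prediction; larger scales push the sample harder
    toward the prompt at some cost to diversity and naturalness."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Stand-ins for the two noise predictions at one denoising step.
eps_u = np.array([0.0, 0.0])
eps_c = np.array([1.0, -1.0])
plain = cfg(eps_u, eps_c, 1.0)    # identical to the conditional prediction
strong = cfg(eps_u, eps_c, 7.5)   # a commonly used Stable Diffusion default
```

The guided prediction then replaces the raw one inside the sampling loop, which is why guidance roughly doubles the compute per step: two forward passes instead of one.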
Fine-Tuning and Personalization: Making the Model Yours
Pre-trained generative models are general-purpose: they have learned the statistical structure of their training distribution and can generate diverse samples from it, but they do not inherently know about specific people, styles, objects, or concepts that were not well-represented in their training data. Fine-tuning techniques allow users to adapt pre-trained models to their specific needs with relatively small amounts of additional data and compute. LoRA (Low-Rank Adaptation), DreamBooth, and Textual Inversion are among the most widely used fine-tuning approaches for image generation models; each provides a different tradeoff between the amount of data required, the degree of model modification, and the specificity of the resulting adaptation.
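LoRA's tradeoff is easy to see in code: the pre-trained weight matrix stays frozen, and only a low-rank correction is trained. A minimal numpy sketch of one adapted linear layer, with random matrices standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in))   # frozen pre-trained weight

# LoRA trains only the low-rank factors B (d_out x r) and A (r x d_in).
# B starts at zero, so the adapted model initially matches the base model.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))
alpha = 8.0                              # LoRA scaling hyperparameter

def lora_forward(x, B, A):
    """y = W x + (alpha / rank) * B A x: the frozen path plus the trainable
    low-rank correction."""
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y0 = lora_forward(x, B, A)   # identical to the base model, since B == 0
B_trained = rng.standard_normal((d_out, rank)) * 0.01   # stand-in for training
y1 = lora_forward(x, B_trained, A)   # base behavior plus a learned delta

# Trainable parameters: rank * (d_in + d_out) = 512, versus 4,096 for
# fine-tuning W itself; the ratio grows with layer size.
```

This is why LoRA adapters can be trained on modest hardware and shared as small files: only the factors B and A need to be stored and updated.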
DreamBooth, published by Google researchers in 2022, could teach a pre-trained text-to-image model to generate images of a specific person, object, or style by fine-tuning on as few as three to five reference images. Given a handful of photographs of a person, DreamBooth could generate that person in arbitrary scenarios, styles, and contexts with a fidelity that previous portrait generation approaches had not achieved. The technique found immediate practical applications in personalized marketing, content creation, and artistic exploration, and equally immediate misuse in the generation of non-consensual intimate imagery and the fabrication of public figures in compromising situations.
Reflection: The technical mechanisms of generative AI --- latent spaces, adversarial training, diffusion processes, conditioning --- are not magic and are not fully opaque. They are specific mathematical constructs with specific properties, specific strengths, and specific failure modes. Understanding these mechanisms does not demystify generative AI in a way that diminishes it; the fact that a neural network learns to organize a latent space such that interpolating between two faces produces coherent intermediate faces is genuinely remarkable. It does demystify it in the sense that matters most for practical and policy purposes: it clarifies what these systems are actually doing, and therefore what they can and cannot do, where they are likely to fail, and what governance approaches are actually addressing the right problems.
Section 3: Applications --- What Generative AI Can Actually Do
The applications of generative AI span an enormous range of domains and use cases, from the deeply practical to the aesthetically experimental. This section surveys the most significant current applications across text, images, audio, and multimodal systems, with attention to both the genuine capabilities these systems offer and the specific limitations that affect their practical utility.
Text Generation: From Drafts to Dialogue
Large language models of the GPT family are the dominant technology for text generation across virtually every application category. Long-form generation --- producing articles, reports, stories, and other extended texts --- reached a level of quality with GPT-3 and its successors that made AI-generated text difficult to reliably distinguish from human-written text for many readers, particularly when the topic was generic and the standard was average rather than expert human writing. The practical uses are extensive: drafting marketing copy, generating product descriptions at scale, producing first drafts of reports that human writers then edit and improve, creating educational content across a range of levels and styles.
Summarization --- condensing a long document to its essential points --- was among the earliest practical applications of large language models and remains one of their most reliable. Legal document review, research paper summarization, meeting transcript condensation, and news article briefing are all applications where AI summarization adds genuine value by reducing the time required to extract relevant information from large volumes of text. The caveat is that summarization models can miss important nuances, incorrectly weight information, or confidently summarize content they have misunderstood --- making human review essential for high-stakes applications.
Code generation, powered by models fine-tuned on large codebases including GitHub Copilot (based on OpenAI’s Codex, a GPT-3 variant trained on code) and its successors, became one of the highest-value practical applications of generative AI for professional users. By 2022, GitHub Copilot was being used by over a million developers and was estimated to generate roughly 40 percent of the code written by its users in the languages and contexts where it was most effective. The productivity gains were real: studies found that developers using AI code completion tools completed tasks significantly faster than those working without them, with the benefit most pronounced for boilerplate code, documentation generation, and translation between programming languages.
Image Generation: Visual Creativity at Scale
The practical applications of AI image generation divide roughly into two categories: applications where the AI-generated image is the final product, and applications where it is a starting point or component in a larger creative process. In advertising, marketing, and content production, AI-generated images are increasingly being used as final assets for social media, online advertising, and internal communications --- applications where the speed and cost advantages over commissioned photography or illustration are decisive and the quality requirements are sufficiently generic that current AI systems can meet them. A marketing team that previously commissioned multiple weeks of illustration work to produce a product catalog can now generate candidate images for each product in hours, reviewing and selecting the best outputs rather than briefing and managing illustrators.
In the design and concept art pipeline, AI image generation functions most productively as a rapid ideation tool: a way to explore a much larger space of visual possibilities in the early stages of a project than would be practical with traditional tools. An architect can generate dozens of renderings of a building in different styles and contexts in the time that would previously have been required to produce a single rendering. A game designer can explore a hundred character designs before selecting one for the detailed work of production. A film production designer can generate mood boards from text descriptions of scenes before commissioning concept art. In each case, the AI generation is the beginning of a creative process, not its entirety, and the human judgment that selects, refines, and directs is essential to the quality of the final result.
The limitations of current image generation systems are equally important to understand for practical deployment. Text rendering within images remains unreliable: most diffusion models produce plausible-looking but frequently misspelled or garbled text when asked to include words or signs in their outputs. Consistent characters and objects across multiple generations are difficult to achieve without fine-tuning: generating ten images of “the same character” without DreamBooth-style personalization will produce ten visually inconsistent characters that share only their described attributes. Complex compositional prompts --- scenes with multiple interacting objects, precise spatial relationships, or specific numerical quantities --- are handled unreliably. And the aesthetic tendencies of pre-trained models are not neutral: they reflect the distribution of their training data, which over-represents certain visual styles, certain demographic groups, and certain cultural aesthetics in ways that require conscious effort to counteract.
Audio, Music, and Voice: The Sonic Frontier
Generative AI for audio and music developed somewhat later than its image and text counterparts but followed a similar trajectory: from narrow, low-quality outputs to broad, high-quality generation within a few years of focused research and scaling. Text-to-speech systems powered by neural networks --- including Google’s WaveNet (2016), which modeled audio waveforms directly with an autoregressive neural network --- produced speech of dramatically higher naturalness than the concatenative and parametric synthesis systems that had preceded them. By the early 2020s, commercial text-to-speech systems could produce speech virtually indistinguishable from human recordings for most listeners in standard conditions, with control over voice characteristics, emotion, and prosody that earlier systems had not offered.
Music generation AI developed through several distinct approaches. Symbolic music generation systems, including OpenAI’s MuseNet (2019), modeled music as sequences of notes in a symbolic representation and generated new compositions by predicting subsequent notes given previous ones --- essentially applying the autoregressive language model approach to musical sequences. These systems could generate music in a wide range of styles and could continue or harmonize with human-composed melodies. Audio generation systems, including Google’s AudioLM (2022) and Meta’s MusicGen (2023), generated audio waveforms directly rather than symbolic representations, producing music that captured the full texture and timbre of real recordings rather than the more austere quality of synthesized notation playback. Google’s MusicFX and similar products allowed users to generate short musical pieces from text descriptions in any style from jazz to classical to electronic, with quality sufficient for background music and content creation applications.
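The autoregressive approach described above can be sketched in a few lines: a model assigns probabilities to the next note given the notes so far, and sampling one step at a time yields a new sequence. This is a minimal sketch; the `next_note_probs` function below is a toy stand-in of my own, not any real system's API --- systems like MuseNet use a Transformer in its place.

```python
import random

# Toy stand-in for a trained model: given the melody so far (MIDI pitches),
# return a probability for each candidate next pitch. A real system uses a
# Transformer here; this placeholder simply favors small melodic steps.
def next_note_probs(history, candidates=range(60, 72)):
    last = history[-1] if history else 65
    weights = [1.0 / (1 + abs(p - last)) for p in candidates]
    total = sum(weights)
    return {p: w / total for p, w in zip(candidates, weights)}

def generate_melody(seed, length, rng=random.Random(0)):
    """Sample notes one at a time, each conditioned on everything before it."""
    melody = list(seed)
    for _ in range(length):
        probs = next_note_probs(melody)
        pitches, weights = zip(*probs.items())
        melody.append(rng.choices(pitches, weights=weights, k=1)[0])
    return melody

print(generate_melody([60, 62, 64], length=8))
```

The same loop structure underlies both symbolic systems and, at a much finer granularity, waveform-level models such as WaveNet: only the vocabulary (notes versus audio samples) and the model behind the probability function change.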
Voice cloning --- the ability to generate speech in the voice of a specific individual from a small number of reference audio samples --- became one of the most practically consequential and most ethically fraught audio AI capabilities of the early 2020s. Legitimate applications were significant: voice cloning allowed audiobook narration in the voice of an author who had died, enabled people with ALS or other conditions affecting speech to communicate in their own voices after losing the ability to speak, and provided cost-effective dubbing for video content into multiple languages. The misuse potential was equally significant: voice cloning was the technology behind a wave of “vishing” attacks in which scammers called elderly people’s relatives claiming to be a family member in distress, using cloned voice samples scraped from social media to make the deception convincing.
Multimodal Systems: Reasoning Across Senses
The systems that represented the most dramatic expansion of generative AI’s capabilities in the early 2020s were multimodal models: systems that could process and generate across multiple data types simultaneously. GPT-4, released by OpenAI in March 2023 with image input capabilities, could analyze photographs, diagrams, charts, and screenshots, answering questions about their content and reasoning across visual and textual information simultaneously. Google’s Gemini models, released in December 2023, extended this to video understanding. The combination of language understanding with visual perception and generation created systems that could describe images, generate images from descriptions, answer questions about visual content, and reason about the relationships between text and images in ways that neither pure language models nor pure image models could approach.
The practical implications of multimodal AI were most immediate for accessibility: systems that could describe images for visually impaired users, generate captions for videos in multiple languages, or convert spoken language to text and vice versa with high accuracy addressed genuine needs that previous technology had addressed only partially. For creative professionals, multimodal systems offered workflows that integrated language and visual reasoning in ways that separate tools had not: a graphic designer could describe a layout in natural language and receive a generated image, then describe what needed to change and receive an updated version, without leaving the language interface. For educators, multimodal systems could generate illustrated explanations, interactive visual representations of abstract concepts, and accessible versions of complex visual information.
Reflection: The applications surveyed in this section share a common structural feature: they are most productive when used as components in human-directed workflows rather than as autonomous replacements for human creative judgment. The advertising team that uses AI image generation to explore the space of possible campaign visuals and then exercises creative judgment to select, refine, and develop the best candidates is using generative AI productively. The team that deploys AI-generated content without review or curation, at scale, without the contextual judgment that human creators bring to audience and purpose, is likely to produce output that is generically competent and specifically disappointing. This distinction --- between AI as amplifier of human creativity and AI as replacement for it --- is not merely aesthetic. It has practical consequences for output quality and ethical consequences for accountability.
Section 4: Risks, Limitations, and the Governance Gap
Generative AI’s capabilities arrived faster than the frameworks for managing their consequences, and the gap between what the technology can do and what responsible deployment looks like has been wide, visible, and consequential. This section examines the principal risks and limitations of generative AI systems with specificity, because generic risk discussion --- “AI can produce biased or harmful content” --- is less useful than understanding the specific mechanisms by which these risks manifest and the specific interventions that address them.
Bias and Harmful Content: Training Data as Cultural Mirror
Generative AI systems learn their outputs from training data, and training data reflects the full range of human expression --- including its prejudices, stereotypes, and historical injustices. Image generation models trained on internet images inherit the biases of those images: the over-representation of certain demographics in certain roles, the aesthetic standards of particular cultural contexts, the visual tropes that recur in stock photography and art collections. A text-to-image model asked to generate “a doctor” will, if its training data over-represents male doctors, generate predominantly male doctors unless explicitly prompted otherwise. Asked to generate “a criminal,” it may generate members of demographic groups that were over-represented in the crime-related training images it encountered, regardless of any relationship between those images and actual crime rates.
These biases are not incidental and cannot be fully eliminated by better algorithms alone; they are reflections of the statistical structure of training data that itself reflects historical and ongoing discrimination. Mitigation requires a combination of approaches: curating training data to better represent the range of human diversity and to avoid or down-weight explicitly harmful content; applying post-training alignment techniques that use human feedback to shape model behavior away from harmful outputs; and deploying content filtering systems that catch harmful outputs before they reach users. None of these approaches is fully effective, and each introduces its own distortions: over-curated training data can produce models that refuse reasonable requests or produce sanitized outputs that fail to reflect the full complexity of human experience, while insufficient curation produces models that perpetuate harm at scale.
Copyright, Provenance, and the Training Data Controversy
The legal and ethical questions surrounding the training data of generative AI models were among the most actively contested in the early 2020s. Image generation models were trained on billions of images scraped from the internet, including vast quantities of copyrighted artwork, photography, and illustration, without the consent of the creators and without compensation. The artists whose work was most influential in shaping the aesthetic capabilities of these models --- whose distinctive styles could be reliably reproduced by including their names in prompts --- had not chosen to contribute their work as training material and received nothing in return for their contribution to the systems’ commercial value.
The legal status of this training data use was genuinely uncertain, with reasonable arguments on both sides. Proponents of the “fair use” position argued that training on copyrighted works was transformative, analogous to a human artist learning from the works they studied, and that the AI model’s outputs were not copies of the training data but new works generated from learned statistical patterns. Critics argued that the scale of copying required for training --- literal storage and processing of billions of copyrighted images --- could not be characterized as fair use regardless of the transformative quality of the outputs, and that the systematic reproduction of artistic styles without compensation was economically harmful to the artists whose work had been used. Class action lawsuits against image generation companies were filed in 2023, and the courts’ eventual resolution of these questions will shape the legal framework for AI training data for years to come.
Provenance and watermarking --- the ability to identify whether a given image or text was AI-generated and, if so, by which system --- became active areas of both technical research and policy development. The two mechanisms are complementary: provenance standards such as C2PA (Coalition for Content Provenance and Authenticity), which major technology companies began adopting in 2023 and 2024, attach cryptographically signed metadata describing a piece of content's origin and editing history, while watermarking embeds an identifying signal in the content itself. The limitation of both approaches was that they required the cooperation of the systems producing the content --- open-source models made available without watermarking or provenance signing could produce unmarked AI content with no record of origin, and both watermarks and attached metadata could be stripped by sufficiently sophisticated post-processing.
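A minimal sketch can illustrate the provenance-signing idea: bind a claim about a piece of content to the content's hash, sign the claim, and verification then detects both tampered claims and altered content. The sketch below is not C2PA --- real manifests use X.509 certificate chains and a standardized manifest format --- and it substitutes an HMAC with a shared demo key purely to show the verification logic.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # assumption: real systems use asymmetric signatures

def make_manifest(content: bytes, generator: str, key: bytes = SIGNING_KEY) -> dict:
    """Attach provenance metadata, bound to the content by its hash, then sign it."""
    claim = {"generator": generator, "sha256": hashlib.sha256(content).hexdigest()}
    sig = hmac.new(key, json.dumps(claim, sort_keys=True).encode(), "sha256").hexdigest()
    return {"claim": claim, "signature": sig}

def verify_manifest(content: bytes, manifest: dict, key: bytes = SIGNING_KEY) -> bool:
    """Check both the claim's signature and that the content hash still matches."""
    claim = manifest["claim"]
    expected = hmac.new(key, json.dumps(claim, sort_keys=True).encode(), "sha256").hexdigest()
    return (hmac.compare_digest(expected, manifest["signature"])
            and claim["sha256"] == hashlib.sha256(content).hexdigest())

image = b"...pixel data..."
manifest = make_manifest(image, generator="example-diffusion-v1")
print(verify_manifest(image, manifest))         # True for the original bytes
print(verify_manifest(image + b"x", manifest))  # False once the content is altered
```

The structural weakness described above is visible even in this sketch: nothing prevents a generator from simply not calling `make_manifest`, which is why provenance works only where producers cooperate.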
Deepfakes, Misinformation, and the Epistemic Environment
The concern that motivated OpenAI’s staged release of GPT-2 in 2019 --- that AI-generated text could be used to produce disinformation at scale --- became substantially more concrete by 2023 and 2024, as image and video generation capabilities reached the level at which fabricated visual content was difficult or impossible for ordinary viewers to identify as fake. “Deepfake” technology --- the use of AI to place real people’s likenesses in video content they did not participate in --- moved from a curiosity requiring expensive hardware and significant expertise to a consumer-grade capability available through smartphone applications. The political implications were serious: fabricated video of political candidates saying things they had never said, fabricated audio of public officials announcing events that had not occurred, and fabricated photographic “evidence” of events that had not happened all became practical threats to the information environment in democratic societies.
The technical responses to deepfakes --- detection models trained to identify artifacts and statistical signatures of AI generation --- faced the same adversarial dynamic that had characterized spam filtering: as detection improved, generation systems were refined to avoid the detectable artifacts, and the arms race between generation and detection made reliable detection increasingly difficult. The more durable responses were social and institutional: media literacy education that trained consumers to approach all visual evidence with appropriate skepticism; provenance standards that allowed trustworthy media organizations to certify the authenticity of their content; and platform policies that required labeling of AI-generated or AI-modified content. These approaches were implemented imperfectly and partially, but they represented a more sustainable strategy than purely technical detection.
Energy, Cost, and Concentration of Power
The computational resources required to train state-of-the-art generative AI models continued to grow through the early 2020s, with training runs for the largest models requiring thousands of the most advanced GPU chips operating for weeks or months. The energy consumption of these training runs --- measured in gigawatt-hours for the largest models --- and the carbon emissions associated with them became subjects of serious concern and active study. Inference costs, while lower per query than training costs, accumulated to significant scale as deployment of large generative models became widespread: millions of users generating images, text, and audio daily through AI services represented an aggregate energy footprint that had to be accounted for in any honest assessment of the technology’s environmental impact.
The concentration of capability in a small number of organizations --- primarily large technology companies with the resources to train frontier models --- raised governance concerns that went beyond environmental impact. The ability to deploy generative AI systems capable of shaping public information, automating creative work, and generating convincing synthetic media was concentrated in organizations that were privately owned, insufficiently regulated, and operating under competitive dynamics that created incentives for rapid deployment ahead of thorough safety evaluation. The open-source response --- releasing capable models publicly to prevent monopolization of AI capability --- addressed some concerns while creating others: open-source models democratized access to AI capabilities including the harmful ones, making misuse available to actors without the resources or inclination to develop models from scratch.
“The governance gap in generative AI is not primarily a technical problem. It is a problem of institutional design: building the organizations, regulations, norms, and accountability structures adequate to a technology developing faster than any that has preceded it.”
Section 5: Recommendations for Creators and Organizations
The practical question for creators, businesses, and institutions confronting generative AI is not whether to engage with it but how to engage with it responsibly, productively, and in ways that preserve rather than erode the values and capabilities that make human creative and intellectual work meaningful. The following recommendations are organized by operational context and are intended to be specific enough to be actionable rather than aspirational enough to be useless.
Start with Prototypes, Not Deployments
The temptation, when a new capability appears impressive in demonstration, is to move quickly to deployment at scale before fully understanding the system’s failure modes. This temptation is reinforced by competitive pressure and by the genuine costs of delayed adoption. But generative AI systems fail in ways that are characteristically different from conventional software failures: they do not crash or return error codes; they produce plausible-seeming outputs that may be subtly or egregiously wrong, biased, or harmful in ways that are not visible without domain expertise and careful evaluation. A prototype phase that specifically tests failure modes --- deliberately prompting the system with the inputs most likely to produce problematic outputs, evaluating outputs across demographic groups, and stress-testing the system against adversarial inputs --- is essential for understanding what deployment will actually look like.
For organizations without in-house AI expertise, starting with cloud-hosted API access to commercially available models rather than building or fine-tuning custom models is generally the right first step. Commercial API providers have implemented safety filtering, usage monitoring, and content policies that provide baseline protections. The limitations of these protections --- and the cases where they will be insufficient for your specific use case --- are best discovered in a controlled prototype environment rather than in production. Open-source models offer greater customization and lower marginal cost but require substantially more operational expertise and place greater responsibility for safety on the deployer rather than the model provider.
Human-in-the-Loop Workflows Are Not Optional
The productivity gains from generative AI in creative and knowledge work are most reliably realized when AI-generated outputs are treated as drafts, proposals, or starting points for human judgment rather than final products. This is true not just for quality reasons --- though current generative models are unreliable enough that unreviewed outputs frequently contain errors, inconsistencies, or hallucinations that require correction --- but for accountability reasons. The organization that publishes AI-generated content without review has no defense against the claim that it published errors or harmful content without adequate care; the organization that reviews, selects, and edits AI-generated outputs can point to the human judgment exercised at each decision point.
Human-in-the-loop workflows are most effective when they are designed around the specific failure modes of the AI systems being used rather than as generic quality checks. If a code generation tool characteristically produces syntactically correct but logically flawed implementations of complex algorithms, the review workflow should specifically test generated code against edge cases and adversarial inputs, not merely check whether it compiles. If an image generation tool characteristically produces images that are demographically homogeneous in ways that do not reflect your intended audience, the review workflow should specifically check for and correct this tendency rather than relying on reviewers to notice it incidentally. Effective human-in-the-loop design requires understanding what the AI does poorly, not just what it does well.
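The code-review example above can be made concrete with a small harness that encodes known failure modes as explicit checks. Both functions below are illustrative assumptions: `generated_clamp` stands in for a hypothetical AI-generated snippet with a typical boundary bug, and the case list is the kind of edge-case battery a reviewer would assemble for this specific tool.

```python
# Hypothetical AI-generated function under review: clamp x into [lo, hi].
# It looks plausible and passes typical inputs, but the fall-through branch
# mishandles the case where x is exactly equal to lo.
def generated_clamp(x, lo, hi):
    if lo < x < hi:
        return x
    return lo if x < lo else hi

def review_clamp(fn):
    """Edge-case checks encoding the known failure modes of generated code."""
    cases = [
        ((5, 0, 10), 5),     # interior value
        ((-3, 0, 10), 0),    # below the range
        ((42, 0, 10), 10),   # above the range
        ((0, 0, 10), 0),     # exactly at the lower bound
        ((10, 0, 10), 10),   # exactly at the upper bound
    ]
    # Return (inputs, expected, actual) for every case the function gets wrong.
    return [(args, want, fn(*args)) for args, want in cases if fn(*args) != want]

print(review_clamp(generated_clamp))  # flags the x == lo boundary case
```

The point of the sketch is that a generic "does it compile?" check would pass this function; only a harness built around the tool's characteristic failure mode (boundary handling, here) catches the error.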
Document Provenance and Maintain Accountability Records
As generative AI outputs become integrated into organizational workflows, maintaining records of what was generated, how, and with what human oversight becomes essential for accountability, quality management, and legal defensibility. At minimum, organizations should record the model or API version used to produce each generated output, the prompt or inputs provided, the date of generation, and the human review steps applied before the output was used or published. This documentation supports several important functions: it enables investigation and correction when problems are discovered; it demonstrates due diligence in the event of legal or regulatory scrutiny; it enables continuous improvement of workflows by allowing analysis of patterns in what required human correction; and it provides an audit trail for compliance with emerging disclosure requirements.
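The minimum record described above is small enough to capture in a few fields. The sketch below shows one way to log each generation as a JSON line; the field names are illustrative choices, not any standard's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class GenerationRecord:
    """One audit-trail entry per generated output (field names illustrative)."""
    model_version: str     # model or API version used
    prompt: str            # the inputs provided
    generated_at: str      # date of generation (UTC, ISO 8601)
    output_sha256: str     # hash binding the record to the exact output
    review_steps: list = field(default_factory=list)  # human review applied

def log_generation(model_version, prompt, output: bytes, review_steps) -> str:
    """Build one record and serialize it for an append-only audit log."""
    record = GenerationRecord(
        model_version=model_version,
        prompt=prompt,
        generated_at=datetime.now(timezone.utc).isoformat(),
        output_sha256=hashlib.sha256(output).hexdigest(),
        review_steps=list(review_steps),
    )
    return json.dumps(asdict(record))

line = log_generation("image-model-2024-05", "product photo, white background",
                      b"<image bytes>", ["legal review", "brand review"])
print(line)
```

Hashing the output rather than storing it keeps the log small while still allowing any later investigation to confirm which exact artifact a record refers to.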
For organizations in regulated industries --- healthcare, finance, legal services, journalism --- provenance documentation is not merely good practice but likely to be required by emerging regulations. The European Union’s AI Act, which entered into force in 2024, includes disclosure requirements for certain categories of AI-generated content and transparency requirements for high-risk AI applications that have direct implications for provenance tracking. Even in less regulated contexts, voluntary adoption of provenance practices positions organizations favorably relative to the regulatory requirements that are predictably coming as governments worldwide grapple with generative AI’s implications for public information and accountability.
Plan Realistically for Costs and Limitations
Generative AI’s productivity benefits are real, but their realization requires investment in infrastructure, workflow redesign, quality assurance, and ongoing maintenance that is often underestimated in initial assessments. The cost of API access to commercial models scales with usage and can grow rapidly as adoption expands across an organization; the infrastructure cost of deploying open-source models on-premises or in cloud environments includes not just compute but engineering time for deployment, monitoring, updating, and safety evaluation. Moderation costs --- the human review of AI-generated content for harmful or policy-violating outputs --- are a frequently overlooked category that can be substantial for consumer-facing applications at scale.
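A back-of-envelope model makes the scaling concrete. All figures below are placeholder assumptions for illustration, not any provider's actual rates; the point is the structure of the estimate, in which human review often dominates the raw API spend.

```python
def monthly_cost(requests_per_day, tokens_per_request, price_per_1k_tokens,
                 review_minutes_per_output=0.0, reviewer_hourly_rate=0.0,
                 days=30):
    """Estimate monthly spend: API usage plus the often-overlooked human review."""
    outputs = requests_per_day * days
    api = outputs * (tokens_per_request / 1000) * price_per_1k_tokens
    review = outputs * (review_minutes_per_output / 60) * reviewer_hourly_rate
    return {"api": round(api, 2), "review": round(review, 2),
            "total": round(api + review, 2)}

# Assumed figures: 2,000 requests/day, 1,500 tokens each, $0.002 per 1k tokens,
# and 30 seconds of human review per output at a $40/hour reviewer rate.
print(monthly_cost(2000, 1500, 0.002,
                   review_minutes_per_output=0.5, reviewer_hourly_rate=40))
```

Under these assumed numbers the review line item is two orders of magnitude larger than the API line item, which is exactly the pattern behind the warning above about moderation costs for consumer-facing applications at scale.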
Realistic planning also requires honest assessment of the current limitations of generative AI for your specific use case. The models that perform impressively in demonstrations often perform less impressively on the specific inputs and formats of real organizational workflows. Domain-specific terminology, proprietary formatting requirements, highly specialized expertise, and outputs requiring guaranteed factual accuracy are all areas where current generative AI underperforms relative to general-purpose demonstrations. Fine-tuning on domain-specific data can address some of these gaps but requires labeled examples, engineering expertise, and ongoing maintenance as models and requirements evolve. The organization that enters generative AI deployment with realistic expectations about where current systems will require significant human supplementation is better positioned to realize genuine productivity gains than the one forced to scale back its expectations after a premature deployment disappoints.
Reflection: The recommendations in this section are, at their core, applications of principles that have governed responsible technology adoption long before generative AI: test before deploying at scale; maintain human judgment at critical decision points; document processes for accountability; and be honest about costs and limitations. What makes these principles newly urgent in the generative AI context is the technology’s speed of deployment, its pervasive applicability across creative and knowledge work domains, and its characteristic failure modes --- plausible-seeming wrong outputs, systematic biases, and sensitivity to adversarial inputs --- that differ qualitatively from those of conventional software. The principles are not new; their application to this technology requires new attention and new specificity.
Conclusion: Creation, Amplified and Complicated
The rise of generative AI between 2014 and 2024 represents one of the most compressed technological transformations in the history of creative tools. The path from Goodfellow's 2014 GAN paper to the public availability of systems capable of generating photorealistic images, professional-quality text, and convincing synthetic audio from natural language descriptions took less than a decade --- a span that encompasses the entire working career of a junior professional who started work in 2015 and found their domain transformed by the time they had accumulated a few years of experience. The speed of this transformation is, in itself, one of its most significant features: it outpaced the legal frameworks designed to protect creative workers, the ethical frameworks designed to govern AI deployment, and the social frameworks designed to maintain shared epistemic standards in public discourse.
The capabilities that generative AI provides are genuinely significant. The ability to produce visual content, text, and audio at scales and speeds that human creators alone could not approach opens real possibilities for education, communication, accessibility, scientific research, and creative exploration. These are not trivial benefits, and dismissing them to focus exclusively on risks misses the genuine transformation in what is possible for people who engage thoughtfully with these tools. The illustrator who can explore ten times as many initial concepts before committing to one, the researcher who can generate synthetic training data for rare conditions, the educator who can produce accessible versions of complex visual material --- each is benefiting from capabilities that the underlying technology genuinely provides.
The complications are equally genuine. Training on human creative work without consent or compensation raises ethical questions about fairness that market dynamics alone will not resolve. The environmental cost of the largest generative AI systems is real and demands honest accounting. The misuse potential for deepfakes, disinformation, and content spam requires active technical and regulatory countermeasures. The concentration of the most capable generative AI systems in the hands of a small number of large organizations raises governance questions that neither the organizations nor the regulators have fully answered. These complications do not cancel the benefits; they define the terms on which the benefits can be realized equitably and sustainably.
“Generative AI did not change what creativity is. It changed who can do it, how quickly, at what cost, and at what scale --- and in doing so, it raised every question about authorship, originality, and value that the creative professions had previously been able to take for granted.”
The future of generative AI --- its technical trajectory, its social consequences, and the governance frameworks that will shape its deployment --- will be determined in the coming years by decisions being made now by researchers, companies, policymakers, creators, and users. Those decisions will be better if they are made with clear understanding of what the technology actually does, what its genuine capabilities and genuine limitations are, and what the stakes of different choices look like for different communities. This episode has aimed to contribute to that clarity. The complexity of what remains is the subject of the episodes that follow.
───
Next in the Series: Episode 17
Ethics, Bias, and Regulation --- How Societies Are Responding to Generative AI’s Opportunities and Harms
Generative AI’s capabilities arrived faster than the institutions designed to govern them. In Episode 17, we trace the regulatory landscape that is taking shape in response: the European Union’s AI Act, the first comprehensive legislative framework for AI, and its risk-tiered approach to oversight; the United States’ more fragmented regulatory posture, relying on executive action and sector-specific guidance rather than comprehensive legislation; China’s distinctive approach to AI governance, combining aggressive development with specific controls on generative AI outputs; and the emerging international frameworks --- from the G7’s Hiroshima AI Process to the UK’s AI Safety Institute --- that seek coordination across jurisdictions. We also examine the bias and fairness challenges in generative AI with greater technical depth, and ask what accountability structures are needed to ensure that the benefits of generative AI are distributed equitably rather than concentrated among those already advantaged by existing structures of power and access.
--- End of Episode 16 ---