AI in Healthcare & Science
How machine learning entered the clinic, the laboratory, and the foundations of science itself.
Introduction: When the Stakes Became Life and Death
Every technology eventually encounters the domains where its consequences are most serious --- where errors are not merely costly or embarrassing but harmful, where trust must be earned through demonstrated reliability rather than assumed, and where the gap between impressive performance on benchmarks and genuine real-world utility is most consequential. For artificial intelligence, those domains arrived in the 2010s in the form of medicine and science: the places where AI’s capabilities, if realized, could save lives, accelerate discovery, and address problems that had resisted every previous approach. And the places where its failures, if not carefully managed, could cause harm at a scale commensurate with the hope.
The preceding episodes traced how deep learning transformed the processing of images, speech, and text --- domains that had been central to AI research since its early years and that produced improvements measured in benchmark percentages and product quality ratings. The application of those same capabilities to healthcare and science involved all the same technical advances, but in contexts where the stakes were orders of magnitude higher and where the challenges of deployment --- regulatory approval, clinical workflow integration, physician trust, patient consent, demographic fairness --- were more complex than anything the technology industry had previously navigated.
“AI entering medicine was not like AI entering search or spam filtering. The failures were not merely inconvenient. They could kill. That changed everything about how the technology had to be built, evaluated, and deployed.”
This episode traces AI’s entry into healthcare and the life sciences across several dimensions: the deep learning-based diagnostic imaging systems that approached specialist physician performance on specific tasks and raised immediate questions about what “approaching human performance” actually meant for clinical decision-making; the drug discovery and genomics applications that promised to compress years of laboratory work into days of computation; the personalized medicine systems that used patient data to predict individual risk and tailor treatment; and the scientific applications that addressed fundamental questions in biology, physics, chemistry, and astronomy that had been beyond the reach of previous computational approaches. It also traces, with equal attention, the ethical and practical challenges that accompanied these applications: the bias problems that emerged when models trained on non-representative data were deployed on diverse patient populations, the regulatory uncertainty that slowed clinical translation, and the deep questions about the appropriate relationship between algorithmic recommendation and clinical judgment that the field is still working to answer.
Section 1: Diagnostics and Medical Imaging --- The Machine Reads the Scan
The medical imaging applications of deep learning were, from a technical standpoint, among the most natural extensions of the computer vision breakthroughs traced in Episode 11. A chest X-ray is an image; an MRI scan is a stack of images; a histopathology slide is an image at very high magnification. The convolutional neural network architectures that had learned, from ImageNet, to classify photographs of dogs and cats and keyboards could, in principle, be trained on labeled medical images to classify pathological findings. The question was whether the performance on clinical tasks would be good enough to matter, and whether “good enough to matter” in the clinical context was the same as “good enough to win a benchmark.”
Diabetic Retinopathy: The First Major Clinical Demonstration
The deep learning system whose performance first compelled serious attention from the medical community was not a general diagnostic AI but a system for a specific, well-defined task: detecting diabetic retinopathy from photographs of the retina. Diabetic retinopathy is the leading cause of preventable blindness in working-age adults in the developed world, caused by damage to the blood vessels of the retina as a complication of poorly controlled diabetes. It is treatable if detected early; it causes irreversible vision loss if detected late. Regular screening of diabetic patients with retinal photography can identify the condition before symptoms appear, but the global shortage of trained ophthalmologists means that many patients, particularly in low- and middle-income countries, do not receive timely screening.
The Google Brain team’s 2016 paper in JAMA, “Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs,” trained a deep convolutional neural network on 128,175 retinal images that had been graded by a panel of ophthalmologists, and evaluated it against the performance of a separate panel of ophthalmologists on a held-out test set. The algorithm achieved sensitivity and specificity for detecting referable diabetic retinopathy that exceeded the performance of the median ophthalmologist in the evaluation panel. This was not a result on an academic benchmark; it was a result on the specific clinical task of identifying which patients needed referral for further evaluation and treatment, evaluated against the clinical standard being proposed for replacement.
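The two headline metrics in that evaluation have precise definitions worth keeping in view. A minimal sketch, using invented confusion counts rather than the paper’s data:

```python
# Minimal sketch: sensitivity and specificity from confusion counts,
# the two metrics reported for detecting referable diabetic retinopathy.
# The counts below are illustrative, not from the JAMA study.

def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)   # fraction of diseased eyes flagged
    specificity = tn / (tn + fp)   # fraction of healthy eyes cleared
    return sensitivity, specificity

# Illustrative numbers: 90 true positives, 10 false negatives,
# 950 true negatives, 50 false positives.
sens, spec = sensitivity_specificity(tp=90, fp=50, tn=950, fn=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.90, 0.95
```

The two metrics trade off against each other through the choice of decision threshold, which is why the paper reported operating points tuned for both screening (high sensitivity) and referral-sparing (high specificity) use.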
The result received substantial press coverage and was widely interpreted as evidence that AI could perform at or above specialist physician level in medical imaging. The interpretation was accurate but required careful qualification. The algorithm had been evaluated on a specific, well-defined task --- binary classification of retinal images as requiring or not requiring referral --- under conditions that differed from clinical deployment in important ways. The images in the evaluation set had been captured using equipment from a small number of clinical sites in the United States; the performance of the algorithm on images from different equipment, different patient populations, and different clinical contexts was not established. The algorithm classified images but did not explain its classifications in terms that would allow a clinician to verify its reasoning. And it had been evaluated on a static dataset rather than in the kind of dynamic, longitudinal clinical workflow where the consequences of specific errors --- false negatives that led to missed diagnoses, false positives that led to unnecessary referrals --- could be fully assessed.
Radiology: The Promise and the Reality Gap
The success of the diabetic retinopathy result inspired a wave of deep learning applications in radiology, where the combination of large existing image archives and the critical importance of accurate interpretation made the potential for AI assistance substantial and the interest from researchers, companies, and hospital systems intense. Studies demonstrating that deep learning systems could detect pneumonia from chest X-rays with accuracy comparable to radiologists, identify pulmonary nodules on CT scans with sensitivity matching expert readers, classify skin lesions from dermoscopic images with performance exceeding dermatologists, and detect bone fractures from X-rays with accuracy competitive with emergency physicians appeared in rapid succession between 2017 and 2020 in high-profile journals.
The headline results were genuine: on the specific tasks evaluated, with the specific datasets used, the deep learning systems achieved the claimed performance levels. But the translation from these research results to clinical deployment revealed a consistent and important pattern: systems that performed impressively in the research context performed less impressively --- sometimes substantially less impressively --- in real clinical environments. The phenomenon, called distribution shift, reflected the fact that the statistical properties of images in deployed clinical systems differed from the statistical properties of the datasets on which the models had been trained and evaluated. Different CT scanners from different manufacturers produced images with different noise characteristics, different contrast levels, and different reconstruction artifacts. Different hospitals had different patient populations with different disease prevalences and different comorbidities. Different radiologists annotated similar cases differently, and the labels in the training data reflected the annotating radiologist’s judgments rather than ground truth.
The distribution shift problem was not merely a technical inconvenience; it was a fundamental challenge for the deployment of medical imaging AI that required both technical solutions --- domain adaptation methods, prospective validation on diverse datasets, continuous monitoring of performance in deployment --- and regulatory frameworks that went beyond the approval processes designed for conventional medical devices with fixed, well-characterized performance characteristics. The FDA’s 510(k) clearance pathway, through which many of the first AI-based medical imaging systems were cleared in the late 2010s and early 2020s, was designed for devices with static performance rather than machine learning systems whose behavior could change as their inputs drifted from the training distribution. Adapting regulatory frameworks to the specific characteristics of deployed machine learning systems became one of the central policy challenges of AI in healthcare.
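One piece of the continuous monitoring described above can be sketched simply: comparing the distribution of some scalar image statistic between the training data and incoming scans. The statistic, the data, and the use of a raw two-sample Kolmogorov-Smirnov distance are illustrative assumptions, not the method of any particular deployed system:

```python
# A minimal drift-monitoring sketch: a two-sample Kolmogorov-Smirnov
# statistic comparing a scalar image statistic (here, mean pixel
# intensity) between the training set and newly acquired scans.
# All values and names are illustrative.

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

train_intensities = [0.41, 0.44, 0.46, 0.47, 0.50, 0.52]
deployed_intensities = [0.58, 0.60, 0.63, 0.65, 0.66, 0.70]  # new scanner
drift = ks_statistic(train_intensities, deployed_intensities)
print(f"KS statistic: {drift:.2f}")  # 1.00 here: the samples do not overlap
```

A large statistic flags that incoming scans no longer resemble the training distribution, prompting review before the model’s predictions continue to be trusted.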
Pathology: Teaching Machines to See Cancer
Digital pathology --- the scanning and analysis of tissue samples at high resolution --- provided a domain where deep learning’s advantages were particularly pronounced and where the limitations of human performance were well documented. Pathological diagnosis from tissue samples is the gold standard for cancer diagnosis, but it requires highly trained specialists and is subject to inter-observer variability: multiple pathologists examining the same slide often disagree, particularly for cases at the boundary between malignant and benign, or for cancers that are morphologically heterogeneous. The stakes of pathological diagnosis are high --- an incorrect diagnosis of cancer leads to unnecessary treatment with serious side effects; a missed cancer diagnosis leads to delayed treatment and potentially preventable mortality --- and the specialist workforce is insufficient to meet demand in many healthcare systems.
Deep learning systems for computational pathology trained on whole-slide images --- digitized glass slides at resolutions of 40,000 by 40,000 pixels or more --- demonstrated the ability to detect cancer with accuracy competitive with pathologists on several tumor types. A 2017 study from Google and collaborators demonstrated that a deep learning system for detecting metastatic breast cancer in lymph node biopsies achieved an area under the receiver operating characteristic curve of 0.994 on a held-out test set, compared to 0.884 for a pathologist operating under time constraints similar to those of routine clinical practice. When the pathologist was given unlimited time to review each slide, performance improved substantially --- demonstrating that the AI’s advantage was partly a function of time constraints rather than purely of diagnostic capability.
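The area-under-the-curve figures quoted above have a concrete probabilistic reading: the chance that a randomly chosen positive case receives a higher model score than a randomly chosen negative one. A minimal sketch with invented scores, not the study’s data:

```python
# Pairwise-comparison form of ROC AUC: the fraction of
# (positive, negative) pairs ranked correctly, ties counting one half.
# Scores below are illustrative.

def roc_auc(scores_pos, scores_neg):
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

tumor_scores = [0.95, 0.90, 0.85, 0.60]    # model scores on cancer slides
normal_scores = [0.40, 0.30, 0.20, 0.65]   # model scores on benign slides
print(f"AUC = {roc_auc(tumor_scores, normal_scores):.3f}")
# 15 of 16 pairs ranked correctly -> 0.938
```

An AUC of 0.994 therefore means the model almost never ranks a benign slide above a cancerous one, regardless of where the final decision threshold is placed.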
Beyond cancer detection, computational pathology AI demonstrated the ability to predict clinical outcomes --- such as prognosis and response to specific treatments --- from tissue morphology in ways that exceeded what pathologists could determine by visual inspection alone. These prognostic AI systems were not replacing a clinical judgment that pathologists were already making; they were making predictions that no human could make from the same data, identifying patterns in cellular morphology associated with outcomes that were below the threshold of conscious human perception. This capability --- discovering genuinely new clinical information from existing data --- was qualitatively different from automating an existing clinical task, and its potential to change how cancer treatment decisions were made was substantial.
Reflection: The medical imaging breakthroughs of the 2010s demonstrated something important about AI’s role in medicine: the most significant contributions were not always the ones that replicated what physicians already did, but the ones that surfaced information that physicians could not access through existing methods. An AI that detects diabetic retinopathy as well as an ophthalmologist is useful where ophthalmologists are unavailable; an AI that predicts treatment response from tissue morphology that no physician could interpret is useful regardless of how many specialists are available. The distinction between AI as automation and AI as augmentation mattered not just for how the technology was deployed but for how it was regulated, validated, and received by the clinical community.
Section 2: Drug Discovery and Genomics --- Accelerating the Search for Cures
Drug discovery is, by any measure, one of the most expensive and most difficult scientific endeavors that human civilization undertakes. The average time from initial target identification to regulatory approval for a new drug is approximately twelve years. The average cost, accounting for the many candidates that fail at each stage of development, exceeds two billion dollars per approved drug. The failure rate is staggering: of compounds that enter clinical trials, fewer than 12 percent ultimately receive regulatory approval, and many promising candidates fail not because they are ineffective against their intended target but because of toxicity, poor pharmacokinetics, or the discovery that the target they address is not, in fact, the cause of the disease being treated. Any technology that could meaningfully improve the success rate or reduce the time and cost of drug discovery had the potential to save both enormous sums of money and enormous numbers of lives.
Molecular Property Prediction and Virtual Screening
The most immediate application of machine learning to drug discovery was the prediction of molecular properties from molecular structure. Traditional drug discovery relied on high-throughput screening: physically synthesizing hundreds of thousands of candidate compounds and testing each one experimentally against a target of interest --- an enzyme, a receptor, a protein --- to identify the subset with the desired activity. The process was expensive, slow, and wasteful: the vast majority of synthesized compounds showed no activity, and the cost of synthesis and assay was borne for each candidate regardless of its eventual utility.
Machine learning models trained on large databases of molecular structure-activity relationships --- pairs of molecular structures and their experimentally measured activities against specific targets --- could predict the activity of new candidate compounds before they were synthesized, allowing computational virtual screening to prioritize the candidates most likely to be active for experimental testing. The predictive models ranged from relatively simple random forests and gradient boosting methods, which were effective when training data was abundant, to graph neural networks that represented molecular structures as graphs --- with atoms as nodes and bonds as edges --- and learned molecular representations directly from the graph structure. Graph neural networks proved particularly effective for property prediction because they could capture the geometric and topological properties of molecular structure that simpler representations missed.
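The graph representation described above can be made concrete with a toy example. The sketch below performs one simplified round of message passing --- a mean over bonded neighbors, standing in for the learned update functions of a real graph neural network --- on an invented molecule with invented features:

```python
# Toy message passing on a molecular graph: atoms as nodes, bonds as
# edges. Ethanol (CH3-CH2-OH) with hydrogens omitted; the single
# scalar feature per atom (rough electronegativity) and the
# mean-aggregation update are illustrative simplifications.

atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
features = {0: 2.55, 1: 2.55, 2: 3.44}

def message_passing_step(features, bonds):
    """Each atom's new feature = mean over itself and bonded neighbors."""
    neighbors = {i: [i] for i in features}   # include self
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {i: sum(features[j] for j in nbrs) / len(nbrs)
            for i, nbrs in neighbors.items()}

updated = message_passing_step(features, bonds)
# After one step, the middle carbon has mixed in the oxygen's feature.
print(updated)
```

A real graph neural network repeats such steps with learned, vector-valued updates, then pools the atom representations into a molecule-level vector fed to a property-prediction head.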
The drug discovery company Atomwise, founded in 2012, was an early and influential example of an AI-first drug discovery organization, using deep learning on molecular graphs to screen large virtual libraries of compounds against protein targets and identify candidate binders for synthesis and experimental testing. In 2015, Atomwise published results demonstrating that its AI system had retrospectively identified two compounds with activity against the Ebola virus protein VP24, from a library of more than eight thousand candidate compounds, in less than a day of computation --- a screen that would have required months of laboratory work using conventional methods. While the retrospective nature of the analysis limited the strength of the conclusion, the result illustrated the potential of AI virtual screening to change the economics of early-stage drug discovery.
AlphaFold and the Protein Folding Revolution
The most scientifically consequential AI contribution to biology in the 2010s and early 2020s was DeepMind’s AlphaFold system for protein structure prediction, which we mentioned in Episode 9 and Episode 15 and which deserves the detailed treatment its significance warrants. The protein folding problem --- predicting the three-dimensional structure that a protein adopts from its amino acid sequence --- had been recognized as one of the central unsolved problems in molecular biology since Christian Anfinsen’s demonstration, recognized with the 1972 Nobel Prize in Chemistry, that a protein’s sequence determines its structure. Knowing a protein’s structure is essential for understanding its biological function, because proteins perform their functions through specific three-dimensional shapes that determine what other molecules they can interact with. It is also essential for structure-based drug design, which aims to design drug molecules that fit precisely into specific binding sites on protein targets.
Experimental determination of protein structures through X-ray crystallography, nuclear magnetic resonance spectroscopy, or cryo-electron microscopy was possible but slow and expensive, requiring months of work per structure for straightforward cases and years for difficult ones. By 2020, after decades of experimental effort, the Protein Data Bank contained structural data for approximately 170,000 proteins --- impressive in absolute terms, but a tiny fraction of the approximately 200 million distinct protein sequences that had been identified through genome sequencing. The gap between the number of proteins whose sequences were known and the number whose structures were known represented an enormous obstacle to understanding the biological functions of the proteins that evolution had produced.
DeepMind’s first AlphaFold system, presented at the CASP13 protein structure prediction competition in December 2018, achieved state-of-the-art results on the competition benchmarks, substantially outperforming all other methods but falling short of the accuracy that would make predictions practically useful as substitutes for experimental determination. AlphaFold 2, presented at CASP14 in December 2020, was a different order of achievement entirely. The system’s predictions achieved median GDT_TS scores above 92 across the competition targets --- a level of accuracy that was, for many proteins, essentially indistinguishable from experimental determination given the resolution limits of the experimental methods themselves. The CASP organizers described the result as a solution to the protein folding problem as it had been posed by the competition, and the broader scientific community received it as one of the most significant computational achievements in the history of biology.
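The GDT_TS metric behind those CASP scores has a simple definition: the average, over distance cutoffs of 1, 2, 4, and 8 angstroms, of the percentage of residues whose predicted alpha-carbon position lies within that cutoff of the experimental position. A minimal sketch, assuming the optimal superposition of the two structures has already been computed and using invented per-residue deviations:

```python
# GDT_TS from per-residue deviations (in angstroms) between predicted
# and experimental alpha-carbon positions, after superposition.
# The deviations below are illustrative, for a ten-residue toy protein.

def gdt_ts(distances_angstrom):
    cutoffs = [1.0, 2.0, 4.0, 8.0]
    n = len(distances_angstrom)
    fractions = [sum(d <= c for d in distances_angstrom) / n
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

deviations = [0.3, 0.5, 0.8, 1.2, 1.5, 2.5, 3.0, 5.0, 0.9, 0.4]
print(f"GDT_TS = {gdt_ts(deviations):.1f}")  # 77.5 for this toy case
```

A score above 92 thus means that, averaged across the four cutoffs, the overwhelming majority of residues sit within a few angstroms of their experimentally determined positions.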
AlphaFold 2’s architecture, described in a July 2021 Nature paper, used a Transformer-based component called Evoformer that processed both the amino acid sequence and multiple sequence alignments --- collections of evolutionarily related sequences from other species, providing indirect evidence about which amino acid positions co-evolved and were therefore likely to be physically close in the folded structure --- through multiple rounds of attention that passed information back and forth between the sequence-alignment representation and a pairwise representation of the spatial relationships between residues, from which a structure module produced the final three-dimensional coordinates. The architecture was innovative in its specific design for the protein structure prediction problem, but it drew directly on the Transformer building blocks that had been developed for natural language processing and reflected the broader principle that Transformer-based architectures could extract useful structure from any kind of sequential or relational data.
DeepMind’s decision to release AlphaFold 2’s predictions for more than 200 million protein sequences --- essentially the entire known protein universe --- through a freely accessible database in July 2022 was an act of scientific generosity with immediate and substantial practical impact. Researchers around the world who had been waiting months or years for experimental structure determination could now access predicted structures in seconds, accelerating the pace of work across structural biology, drug discovery, and fundamental research into the mechanisms of biological processes. The database became one of the most accessed scientific resources in the world within months of its release, and its impact on the pace and direction of biological research is likely to be felt for decades.
Genomics and Precision Medicine
The genomics revolution of the early twenty-first century had produced vast quantities of DNA sequence data, but translating that data into biological and clinical understanding remained a major challenge. The human genome contains approximately three billion base pairs, but only a small fraction of those base pairs directly encode proteins; the function of much of the rest --- the regulatory regions, the non-coding RNA genes, the repetitive sequences --- remained poorly understood. Genome-wide association studies (GWAS) had identified thousands of common genetic variants associated with specific diseases, but the mechanisms connecting those variants to disease and the clinical implications for individual patients were often unclear.
Machine learning approaches to genomics addressed several layers of this challenge. Deep learning models trained on genome sequence data could predict the functional effects of non-coding variants --- changes in the genome sequence outside protein-coding genes that affected how those genes were regulated --- with greater accuracy than previous computational methods. DeepMind’s Enformer model, published in 2021, predicted gene expression levels in different cell types from DNA sequence with accuracy substantially exceeding previous state-of-the-art methods, providing a computationally efficient way to assess the likely regulatory consequences of genetic variants across the genome.
Single-cell RNA sequencing, which measured the gene expression profile of individual cells rather than averaging across millions of cells in a tissue sample, generated datasets of extraordinary dimensionality --- thousands of genes measured in thousands of individual cells --- for which machine learning methods were essential for extracting biological meaning. Dimensionality reduction methods including UMAP and t-SNE, combined with clustering algorithms and trajectory inference methods, allowed researchers to identify distinct cell types and states within complex tissues, to reconstruct developmental trajectories showing how cells change over time, and to understand the cellular heterogeneity within tumors that contributed to treatment resistance. The single-cell revolution in biology was not possible without machine learning, and the machine learning methods it required drove developments in dimensionality reduction and clustering that influenced other domains.
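The clustering step in that workflow can be illustrated with a toy version. The sketch below runs plain k-means on cells described by two expression values, standing in for the reduced representations produced by UMAP or t-SNE in practice; the data, cluster count, and initial centers are invented:

```python
# Toy k-means on single-cell expression data: each cell is a point in
# (gene_A, gene_B) expression space. Real pipelines cluster thousands
# of cells in a reduced-dimensional embedding; this is a minimal sketch.

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each cell to its nearest center, recenter."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                     for c in centers]
            clusters[dists.index(min(dists))].append(p)
        centers = [tuple(sum(col) / len(col) for col in zip(*cl))
                   if cl else c
                   for cl, c in zip(clusters, centers)]
    return centers, clusters

# Two apparent cell populations.
cells = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
         (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
centers, clusters = kmeans(cells, centers=[(0.0, 0.0), (1.0, 1.0)])
print(centers)  # one center per recovered cell population
```

The recovered centers correspond to putative cell types or states, which researchers then annotate using known marker genes.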
COVID-19: AI Under Pressure
The COVID-19 pandemic, which began in early 2020, provided an unplanned large-scale test of AI’s ability to contribute to an acute global health crisis on a compressed timeline. The contributions were real but uneven, reflecting both the genuine potential of AI for pandemic response and the specific limitations of AI systems that had been developed for normal-time research conditions and were rapidly redeployed for an emergency that differed from those conditions in important ways.
In structural biology, the speed and accuracy of AlphaFold’s protein structure predictions proved immediately valuable. Within weeks of the SARS-CoV-2 genome sequence being published in January 2020, researchers used AlphaFold and other computational methods to predict the structures of viral proteins including the spike protein, which mediated viral entry into cells and was the primary target for vaccine development. These predicted structures provided starting points for the structure-based design of antibodies and small molecule inhibitors, accelerating the early stages of therapeutic development. The high-resolution experimental structures that were subsequently determined confirmed the predicted structures with high accuracy, validating the computational approach.
In epidemiology, machine learning models for predicting the trajectory of epidemic spread, identifying high-risk populations, and optimizing intervention timing were developed and deployed at multiple scales. The results were more mixed: epidemic prediction models that performed well in retrospect on historical data from earlier outbreaks performed less well in prospective prediction during COVID-19, because the pandemic’s dynamics were shaped by policy interventions, behavioral responses, and variant emergence that the models were not designed to capture. Several widely publicized AI-based diagnostic systems for COVID-19 from chest X-ray and CT images were subsequently found, in systematic reviews, to have been trained and evaluated on datasets with significant methodological flaws --- including training on datasets from patients already known to have COVID-19 and evaluating on patients who were healthy rather than presenting with COVID-like symptoms, which produced impressive performance metrics that did not reflect clinical utility.
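The difficulty of prospective epidemic prediction has a simple mechanistic illustration. In the discrete-time SIR sketch below, a mid-course change in the transmission rate --- standing in for a policy intervention --- reshapes the trajectory in a way that no model fit only to pre-intervention data could anticipate. All parameters are invented:

```python
# Discrete-time SIR model with a time-varying transmission rate.
# Population size, rates, and the intervention day are illustrative.

def sir_peak(population, infected0, beta_schedule, gamma=0.1, days=120):
    """Run the model and return the peak number of simultaneous infections."""
    s, i, r = population - infected0, float(infected0), 0.0
    peak = i
    for day in range(days):
        beta = beta_schedule(day)
        new_infections = beta * s * i / population
        recoveries = gamma * i
        s -= new_infections
        i += new_infections - recoveries
        r += recoveries
        peak = max(peak, i)
    return peak

no_intervention = sir_peak(1_000_000, 100, beta_schedule=lambda day: 0.3)
with_intervention = sir_peak(
    1_000_000, 100,
    beta_schedule=lambda day: 0.3 if day < 30 else 0.12)  # day-30 lockdown
print(f"peak infections: {no_intervention:,.0f} without vs "
      f"{with_intervention:,.0f} with intervention")
```

The two runs differ only in what happens after day 30, which is exactly the kind of exogenous change --- along with behavioral responses and variant emergence --- that made prospective COVID-19 forecasting so much harder than retrospective fitting.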
The COVID-19 AI experience was instructive precisely because of its mixed outcomes. The applications where AI contributed most clearly --- structural prediction, virtual screening of drug candidates, analysis of genomic variant data --- were ones where the methods were well-established, the data was of high quality, and the specific task was well-defined. The applications where AI contributed less than expected --- epidemic prediction, chest imaging diagnosis --- were ones where the data was limited or biased, the task was poorly defined, or the deployment conditions differed substantially from the development conditions. The pandemic did not reveal that AI was either as powerful as its most optimistic advocates claimed or as useless as its harshest critics suggested; it revealed, with unusual clarity, the specific conditions under which AI contributed genuine value and the specific conditions under which it did not.
Reflection: Drug discovery and genomics AI illustrated a pattern that recurred across every domain where AI entered science: the most immediately impactful contributions were to well-defined computational sub-problems within larger research workflows, rather than to the full complexity of the scientific question. AlphaFold solved the protein folding problem as defined by the CASP competition; it did not solve the problem of understanding protein function, which depends on protein structure but is not determined by it alone. Virtual screening AI identified candidates for synthesis and testing; it did not solve the problem of which targets were worth pursuing or which patient populations would benefit from which treatments. This pattern is not a criticism of what AI contributed; it is a clarification of where in the research process AI was transforming practice and where human scientific judgment remained essential.
Section 3: Personalized Medicine --- From Population Statistics to Individual Prediction
Medicine has always aspired to treat individual patients rather than average patients, but the practical constraints of clinical practice --- the time available for each patient encounter, the information available at the point of care, the complexity of integrating multiple sources of evidence --- have meant that most clinical decisions are based on population-level evidence applied to individual cases with significant uncertainty. The promise of personalized or precision medicine was to reduce that uncertainty by integrating the specific characteristics of individual patients --- their genetic makeup, their molecular disease profile, their treatment history, their lifestyle and environment --- into clinical decisions that were genuinely tailored rather than merely generalized from populations.
Predictive Analytics: Anticipating Disease Before It Arrives
Predictive analytics systems in healthcare used machine learning models trained on electronic health record data, lab values, imaging results, and demographic information to identify patients at elevated risk of specific adverse outcomes: hospital readmission within thirty days of discharge, progression to sepsis in the ICU, deterioration of kidney function in hospitalized patients, onset of atrial fibrillation in outpatients, or development of type 2 diabetes in pre-diabetic patients. The goal was to enable earlier intervention: identifying high-risk patients before they experienced adverse outcomes, directing clinical attention and resources toward them, and initiating preventive measures that could improve outcomes or reduce costs.
Google’s 2018 paper in npj Digital Medicine, “Scalable and accurate deep learning with electronic health records,” trained deep learning models on de-identified electronic health records from two large academic medical centers and evaluated their performance on predicting a range of clinical outcomes including in-hospital mortality, 30-day readmission, prolonged length of stay, and discharge diagnoses. The models substantially outperformed the clinical prediction scores then in routine use across most tasks, and their advantage was particularly pronounced for outcomes requiring integration of information across long clinical histories --- the kind of longitudinal integration that is practically difficult for clinicians working with fragmented and voluminous medical records. The paper was influential both for its results and for its transparent reporting of methodology and limitations, providing a template for rigorous evaluation of clinical prediction AI that subsequent work could build on.
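The general shape of such a clinical risk model --- though not the deep architecture the paper used --- can be sketched as a logistic function over a few EHR-derived features. The features and weights below are invented for illustration; deployed models learn far larger feature sets from data:

```python
# A toy logistic risk score for an outcome such as 30-day readmission.
# Feature names, weights, and bias are invented for illustration.
import math

def readmission_risk(features, weights, bias):
    """Weighted sum of features squashed to a probability in [0, 1]."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))   # logistic link

# Features: [prior admissions in past year, abnormal lab flags, age / 100]
weights = [0.8, 0.5, 1.2]
bias = -3.0

low_risk = readmission_risk([0, 0, 0.40], weights, bias)
high_risk = readmission_risk([3, 4, 0.78], weights, bias)
print(f"low-risk patient: {low_risk:.2f}, high-risk patient: {high_risk:.2f}")
```

The output is a probability that can be thresholded to flag patients for extra attention at discharge; the deployment difficulties discussed next arise because both the features and the learned weights encode the institution and era the training data came from.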
The deployment of predictive analytics in clinical practice proved more complex than the research results suggested. Models trained at one institution often did not perform as well when deployed at another, because patient populations, documentation practices, and local clinical culture differed in ways that affected the statistical properties of the data. Models trained on historical data reflected historical treatment patterns and could encode the biases of past clinical decision-making: if patients from certain demographic groups had historically been undertreated for pain, a model trained on pain management data would encode that undertreatment as appropriate care for those groups. And the behavioral response to predictions --- the fact that clinicians who received predictions changed their behavior, potentially invalidating the assumptions on which the predictions were based --- created feedback loops that needed to be anticipated and managed.
Oncology: Matching Patients to Treatments
Cancer treatment offers the clearest and most urgent case for personalized medicine: different cancers in different patients, even cancers of the same histological type in the same organ, can respond very differently to the same treatments, and the side effects of treatments like chemotherapy and radiation are severe enough that exposing patients to treatments unlikely to benefit them is itself a significant harm. The accumulation of genomic sequencing data from tumor samples, combined with treatment outcome data from large clinical registries, created datasets that machine learning models could use to predict which treatments were most likely to benefit which patients based on their tumor’s specific molecular profile.
The development of foundation models for oncology --- large pre-trained models adapted to cancer-specific tasks --- accelerated in the early 2020s, with systems that could integrate genomic data, pathology images, radiology images, and clinical notes from electronic health records to produce comprehensive patient representations useful for treatment decision support. An earlier effort, IBM Watson for Oncology, launched with considerable fanfare in the mid-2010s, had already become a cautionary example: several evaluations found that its treatment recommendations differed substantially from those of the expert oncologists at the institutions using it, and a 2017 investigation by STAT News revealed that the system had been trained partly on hypothetical cases rather than real patient data, raising questions about the validity of its recommendations. The Watson for Oncology episode illustrated the risks of deploying complex AI systems in high-stakes clinical contexts without rigorous prospective validation against real clinical outcomes.
Wearables and Continuous Monitoring: Medicine Beyond the Clinic
Consumer wearable devices --- initially fitness trackers and later smartwatches --- accumulated health data continuously from the populations wearing them at a scale that was unprecedented in the history of medicine. A smartwatch recording heart rate, activity level, sleep patterns, and blood oxygen saturation every few seconds for millions of users generated datasets orders of magnitude larger and more longitudinally continuous than anything available from clinical records or research studies. Machine learning methods applied to this data could detect patterns predictive of clinically important events and conditions that would not have been visible in the episodic snapshots available from conventional clinical encounters.
Apple’s Apple Heart Study, published in 2019 in the New England Journal of Medicine, enrolled 419,297 participants through the Apple Watch app and used a machine learning classifier to identify irregular pulse notifications suggesting atrial fibrillation. Participants who received an irregular pulse notification were connected with telehealth physicians and, if indicated, provided with an ECG patch for further monitoring. Of participants who received a notification and returned a patch, 34 percent were subsequently confirmed to have atrial fibrillation. The study demonstrated that consumer wearables could detect clinically important arrhythmias at population scale --- a capability with potential implications for stroke prevention, since atrial fibrillation is a major stroke risk factor and early detection enables anticoagulation that substantially reduces stroke risk.
The Apple Heart Study also illustrated the complexities of population-scale health monitoring with consumer devices. The 34 percent confirmation rate meant that roughly two-thirds of participants who received a notification and returned a patch did not have confirmed atrial fibrillation --- a positive predictive value that, at the scale of millions of Apple Watch users, would translate into very large numbers of people receiving unconfirmed notifications, experiencing anxiety, and potentially undertaking unnecessary clinical evaluation. The balance between the benefits of early detection and the harms of false positives --- a balance that clinical screening programs have always had to manage --- became newly complex given the scale and continuous nature of consumer health monitoring, and required regulatory and clinical frameworks that had not previously existed.
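The arithmetic behind this concern is worth making concrete. The sketch below uses the study’s 34 percent confirmation rate; the notified population size is a hypothetical round number, not a figure from the study:

```python
# The 34% figure is the study's confirmation rate among notified participants
# who returned an ECG patch; the notified population below is hypothetical.
ppv = 0.34
notified = 100_000

confirmed = int(ppv * notified)
unconfirmed = notified - confirmed
print(confirmed, unconfirmed)   # prints: 34000 66000
```

For every confirmed case, roughly two people receive a notification that further monitoring does not confirm --- a ratio that any population-scale screening program has to weigh against the benefit of the cases it catches.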
Reflection: Personalized medicine AI illustrated the fundamental tension between the statistical and the individual that runs through the application of AI to healthcare. Machine learning models are trained on populations and produce predictions for populations; clinical medicine is concerned with individual patients, each with a unique combination of characteristics that may or may not be well-represented in any training dataset. The gap between population-level accuracy and individual-level reliability is real and consequential, and the tools for understanding and communicating that gap --- confidence intervals, uncertainty quantification, explanations of model reasoning --- were less developed than the tools for training and evaluating the models themselves. Bridging that gap was one of the central technical and clinical challenges of the field.
Section 4: AI in Scientific Research --- The Machine as Scientific Partner
Beyond medicine and the life sciences, AI began contributing to scientific research across a range of fields in the 2010s, in applications that ranged from pattern recognition in large datasets to the discovery of genuinely new physical phenomena. The common thread across these applications was the ability of machine learning systems to identify structure in high-dimensional data at scales and with a comprehensiveness that exceeded what human researchers could achieve through manual analysis. Science has always required both the collection of data and the extraction of meaning from data; AI was transforming the second of these in ways that were changing the pace and character of discovery.
Physics and Materials Science: Simulating the Unsimulable
Computational physics and materials science had long used simulation --- numerical solution of physical equations governing the behavior of atoms, electrons, and materials --- to predict the properties of matter without requiring laboratory synthesis and measurement. The fundamental equations were often well-established; the challenge was computational: solving them with sufficient accuracy to make useful predictions required computational resources that scaled steeply with the complexity of the system being simulated. Density functional theory (DFT), the workhorse method of computational materials science, could predict the electronic structure and related properties of materials with reasonable accuracy, but its computational cost scaled as the cube of the number of electrons in the system, making simulations of complex materials or large systems prohibitively expensive.
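The practical force of cubic scaling is easy to make concrete: doubling the number of electrons multiplies the cost by eight, and a hundredfold increase in system size multiplies it by a million. A quick illustration:

```python
# Relative cost of a DFT calculation under cubic scaling, normalized to a
# 100-electron system.
base = 100
for n_electrons in (100, 200, 1000, 10_000):
    relative_cost = (n_electrons / base) ** 3
    print(f"{n_electrons:>6} electrons: {relative_cost:>12,.0f}x baseline cost")
```

This is why a method that is merely orders of magnitude cheaper per atom, even at somewhat lower accuracy, opens up system sizes and timescales that DFT cannot reach.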
Neural network interatomic potentials addressed this challenge by training neural networks to approximate the potential energy surface of atomic systems --- the relationship between atomic positions and total energy --- from DFT calculations on smaller, simpler systems, and then using those trained networks to perform molecular dynamics simulations orders of magnitude faster than DFT at comparable accuracy. Behler and Parrinello’s 2007 paper introduced the concept of neural network potentials, and the family of machine-learning interatomic potential frameworks that followed from groups across computational chemistry and materials science enabled molecular dynamics simulations of systems with millions of atoms at timescales of microseconds to milliseconds --- scales that had been completely inaccessible to DFT-based simulation. Applications ranged from understanding the mechanism of enzyme catalysis to predicting the properties of new materials for batteries, semiconductors, and structural applications.
The Materials Project, an initiative led by Kristin Persson at Lawrence Berkeley National Laboratory, growing out of work with Gerbrand Ceder’s group at MIT, used high-throughput DFT calculations combined with machine learning to compute and organize the properties of more than 140,000 inorganic compounds in a publicly accessible database, enabling researchers to search computationally for materials with specific desired properties before undertaking experimental synthesis. Combined with active learning approaches --- using machine learning models to identify which unexplored materials were most worth calculating next, based on the uncertainty of the model’s predictions --- the Materials Project demonstrated how AI and high-throughput computation could accelerate the exploration of chemical space in ways that were not feasible through experiment alone.
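The active learning loop described above can be sketched simply: fit an ensemble of surrogate models, and choose the candidate on which the ensemble disagrees most as the next expensive calculation to run. The sketch below is a toy illustration of that selection rule, not the Materials Project’s actual pipeline --- the "true property" function stands in for a DFT calculation:

```python
import numpy as np

# Toy active learning: pick the unexplored candidate where an ensemble of
# surrogate models disagrees most. Illustrative only.
rng = np.random.default_rng(1)

def true_property(x):            # stand-in for an expensive DFT calculation
    return np.sin(3 * x) + 0.1 * x

X_known = rng.uniform(0, 5, size=20)      # candidates already computed
y_known = true_property(X_known)
X_pool = np.linspace(0, 5, 200)           # unexplored candidates

# Ensemble of polynomial surrogates fit to bootstrap resamples of the data
preds = []
for _ in range(10):
    idx = rng.integers(0, len(X_known), len(X_known))
    coef = np.polyfit(X_known[idx], y_known[idx], deg=5)
    preds.append(np.polyval(coef, X_pool))
preds = np.array(preds)

uncertainty = preds.std(axis=0)           # ensemble disagreement per candidate
next_candidate = X_pool[np.argmax(uncertainty)]
print(f"Next candidate to compute: x = {next_candidate:.2f}")
```

Each new calculation then joins the known set and the loop repeats, concentrating expensive computation where the surrogate is least certain.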
Astronomy: Finding Needles in Cosmic Haystacks
Astronomy had been confronting the challenge of large-scale data analysis since the Sloan Digital Sky Survey began producing its first images in 2000, generating catalogs of hundreds of millions of celestial objects and spectra of more than three million galaxies and quasars. The Large Synoptic Survey Telescope (LSST, later renamed the Vera C. Rubin Observatory), designed to survey the entire visible sky every few nights and expected to generate approximately fifteen terabytes of imaging data per night when it became operational, made the challenge of astronomical data analysis acute: no human workforce could process data at this rate, and the science enabled by the survey depended on the ability to identify, classify, and characterize the objects and events in the data quickly enough for timely follow-up observations.
Machine learning methods were applied across virtually every domain of observational astronomy. Galaxy morphology classification --- assigning galaxies to categories based on their visual appearance, a task that the Galaxy Zoo citizen science project had used human volunteers to perform for hundreds of thousands of galaxies --- was automated using convolutional neural networks that could process millions of galaxies in the time that human classifiers would require for thousands. Transient detection --- identifying objects that had changed in brightness between observations, including supernovae, variable stars, and potentially hazardous asteroids --- was performed by neural networks trained on labeled examples of real transients and “bogus” detections caused by imaging artifacts. Gravitational wave detection from LIGO data used deep learning classifiers to distinguish genuine gravitational wave signals from the numerous noise sources that the detector was sensitive to, enabling the identification of mergers and other events that might otherwise be missed in the high-volume data stream.
Exoplanet detection --- the identification of planets orbiting other stars from the periodic dimming of stellar light caused by planetary transits --- was a task that NASA’s Kepler space telescope had been performing since 2009, generating light curves for hundreds of thousands of stars that needed to be analyzed for the subtle, periodic signals of planetary transits. A 2018 Google Brain paper demonstrated that a neural network trained on labeled Kepler light curves could identify planet candidates with high precision and recall, and applied the trained network to the full Kepler dataset to identify two previously overlooked exoplanets --- Kepler-90i and Kepler-80g --- the first of which made Kepler-90 the first known star other than the Sun with eight identified planets. The result received considerable public attention and illustrated the potential for AI to discover phenomena in existing datasets that human analysts had missed.
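The signal such a network learns to recognize can be illustrated with the classical technique that preceded it: phase-folding a light curve at trial periods and measuring the depth of the resulting dip. The sketch below plants a synthetic transit in a noisy light curve and recovers its period --- illustrative only, not the Kepler pipeline or the neural approach itself:

```python
import numpy as np

# Synthetic light curve with a planted periodic transit, recovered by
# phase-folding at trial periods and measuring the dip depth.
rng = np.random.default_rng(2)
period, depth, duration = 7.0, 0.01, 0.3     # days, relative flux, days

t = np.arange(0, 90, 0.02)                   # ~90 days of observations
flux = 1.0 + rng.normal(0, 0.002, t.size)    # noisy baseline brightness
in_transit = (t % period) < duration
flux[in_transit] -= depth                    # periodic dimming

def folded_depth(trial_period):
    phase = t % trial_period
    in_win = phase < duration
    return flux[~in_win].mean() - flux[in_win].mean()

trial_periods = np.arange(1.0, 10.0, 0.01)
depths = np.array([folded_depth(p) for p in trial_periods])
best = trial_periods[np.argmax(depths)]
print(f"Recovered period: {best:.2f} days (true: {period} days)")
```

A transit one percent deep is invisible in any single noisy measurement; it only emerges when thousands of observations are folded at the right period --- which is also why the search space of periods, depths, and durations is large enough to motivate learned classifiers.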
Climate and Earth Science: Understanding a Changing Planet
Climate science and Earth observation represented applications of AI where the stakes extended beyond scientific curiosity to the most consequential challenge facing human civilization. The challenge of climate modeling --- simulating the coupled physical, chemical, and biological processes that determined the state and evolution of Earth’s climate system --- was among the most computationally demanding in science, with the resolution and complexity of state-of-the-art climate models pushing against the limits of the world’s most powerful supercomputers. Machine learning approaches offered the possibility of emulating specific components of climate models at a fraction of the computational cost, potentially enabling higher-resolution simulations or longer simulation periods within the same computational budget.
Weather prediction, a closely related domain, saw machine learning contribute substantial improvements in medium-range forecast accuracy. Google DeepMind’s GraphCast system, published in 2023, used a graph neural network trained on decades of historical weather data to produce 10-day weather forecasts that matched or exceeded the accuracy of the European Centre for Medium-Range Weather Forecasts’ deterministic operational model, at a fraction of the computational cost. The result was significant not because it replaced the physical simulation models that remained essential for process understanding and ensemble prediction, but because it demonstrated that data-driven approaches could achieve competitive forecast accuracy while enabling much faster generation of predictions --- potentially valuable for real-time emergency applications.
Reflection: AI’s role in scientific research was most powerful when it combined three properties: access to large, high-quality datasets that previous research had accumulated; clear, well-defined tasks with established evaluation criteria; and the ability to integrate multiple sources of evidence in ways that exceeded human analytical capacity. When these conditions were met --- as in protein structure prediction, materials property prediction, and astronomical survey analysis --- the contributions were transformative. When they were not --- when datasets were limited, tasks were poorly defined, or the crucial knowledge was tacit and qualitative rather than computable from data --- AI’s contribution was more modest. The pattern reinforced a general lesson: AI amplifies the quality of data and the clarity of the scientific question; it does not substitute for them.
Section 5: Challenges and Ethical Concerns --- The Human Stakes of Medical AI
The applications of AI to healthcare and science described in the preceding sections were accompanied by a set of ethical and practical challenges whose severity was commensurate with the domain’s stakes. The same features that made medical AI potentially valuable --- its ability to detect patterns in large datasets, to make predictions about individual patients, to operate at scale --- also made it potentially dangerous when those patterns were biased, those predictions were wrong, or that scale amplified errors across large populations. Understanding these challenges specifically, rather than generically, is essential for understanding both what responsible medical AI development requires and why the field’s progress has been slower than its most optimistic advocates predicted.
Bias: When Training Data Encodes Inequity
The fundamental problem of bias in medical AI is not primarily a technical problem; it is a reflection of the biases embedded in the healthcare system and the research enterprise from which medical AI’s training data is drawn. If certain demographic groups receive lower-quality care, their health outcomes are worse; if medical AI is trained on outcome data, it will learn to predict worse outcomes for those groups and may encode lower quality of care as an appropriate expectation. If certain demographic groups are underrepresented in clinical trials and medical research, the training data for medical AI will be less informative about those groups; models trained primarily on data from one population will be less accurate for others.
The pulse oximetry case illustrated this with particular clarity. Pulse oximeters --- the devices clipped to a finger to measure blood oxygen saturation --- had been known since at least 2005 to be less accurate for patients with darker skin tones, because the optical sensors used to measure light absorption through the finger were calibrated primarily on lighter-skinned populations and their calibration algorithms did not account for the effect of skin pigmentation on light absorption. During the COVID-19 pandemic, this known inaccuracy became clinically critical: pulse oximetry was used to identify patients with dangerous oxygen desaturation who needed hospitalization or supplemental oxygen, and the less accurate readings for darker-skinned patients meant they were more likely to appear adequately oxygenated when they were not --- leading to delayed treatment for conditions that were genuinely life-threatening. A 2020 New England Journal of Medicine paper documented occult hypoxemia --- dangerous oxygen desaturation not detected by pulse oximetry --- occurring nearly three times as frequently in Black patients as in white patients in a large retrospective analysis. AI systems trained on pulse oximetry data to predict patient deterioration would inherit this bias directly.
The skin lesion classification AI systems that had demonstrated impressive performance in research evaluations had been evaluated predominantly on datasets with limited representation of darker skin tones, because the clinical photography archives on which they were trained were drawn largely from dermatology clinics in higher-income countries with majority lighter-skinned patient populations. Studies evaluating these systems on images from diverse patient populations showed substantially worse performance on skin conditions in darker skin tones, a finding with direct implications for the use of AI-assisted dermatology in global health contexts, where the disease burden was highest among exactly the populations for which the AI worked least well.
Data Privacy: The Cost of Learning from Patients
Training medical AI systems requires access to large quantities of patient data --- medical records, imaging studies, genomic sequences, wearable device data --- that are among the most sensitive categories of personal information that exist. Medical data can reveal conditions that affect employment, insurance, relationships, and self-conception; its unauthorized disclosure can cause harms ranging from discrimination to personal distress to concrete material loss. The legal frameworks governing medical data privacy --- HIPAA in the United States, GDPR in the European Union, and various national frameworks elsewhere --- were designed for a context in which patient data was used primarily for direct clinical care and medical research under institutional review board oversight, not for the large-scale commercial AI development that the 2010s and 2020s introduced.
The tension between the large datasets needed for medical AI development and the privacy interests of the patients represented in those datasets was real and not fully resolvable by any single technical or regulatory approach. De-identification --- removing obvious identifying information from medical records before using them for AI training --- provided a partial protection but was known to be imperfect: re-identification of de-identified medical records using auxiliary information was demonstrated repeatedly in academic research, and the large-scale linkage of medical records with other data sources that commercial AI development created multiplied the re-identification risk. Federated learning --- training AI models across multiple institutions without centralizing patient data, by sending model updates rather than raw data to a central server --- addressed the data centralization risk but required substantial coordination across institutions and did not eliminate privacy risks associated with the model updates themselves.
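The core federated averaging idea can be sketched in a few lines: each institution runs gradient steps on its own data, and only weight vectors travel to the server, which averages them. The sketch below uses synthetic data and a linear model --- a toy version of the FedAvg scheme, not a production federated learning framework:

```python
import numpy as np

# Toy federated averaging: each "hospital" trains locally on its own synthetic
# data and shares only model weights, never the underlying records.
rng = np.random.default_rng(3)
true_w = np.array([0.5, -1.2, 2.0])

def make_site_data(n):           # stand-in for one institution's records
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(0, 0.1, n)
    return X, y

sites = [make_site_data(n) for n in (200, 500, 300)]
global_w = np.zeros(3)

for round_ in range(20):
    local_weights, sizes = [], []
    for X, y in sites:
        w = global_w.copy()
        for _ in range(10):      # local gradient steps on this site's data only
            w -= 0.05 * (X.T @ (X @ w - y)) / len(y)
        local_weights.append(w)
        sizes.append(len(y))
    # Server aggregates the local models, weighted by site size
    global_w = np.average(local_weights, axis=0, weights=sizes)

print("Recovered weights:", np.round(global_w, 2))
```

The raw patient data never leaves each site --- but the shared weight vectors are themselves functions of that data, which is why federated learning reduces rather than eliminates privacy risk.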
The Trust Problem: When to Listen to the Algorithm
Perhaps the most practically complex challenge of medical AI was the question of how clinicians should respond to algorithmic recommendations --- how to calibrate trust in AI systems whose accuracy varied across patient populations, whose failures were often not interpretable, and whose performance in the specific clinical context of deployment might differ from the benchmark performance on which their credibility was based. Existing clinical decision support systems had established a problematic precedent: alert fatigue, the phenomenon in which clinicians habituated to high volumes of algorithmic alerts by ignoring them, affected clinical information systems in virtually every hospital, and was a predictable consequence of systems that generated too many alerts of insufficient specificity.
The interpretability of medical AI was a specific and significant concern: a system that predicted a patient’s risk of readmission or deterioration but could not explain which features of the patient’s record contributed to the prediction gave clinicians little basis for evaluating whether to trust the prediction in a specific case, or for identifying patients who did not resemble the population on which the model had been trained. The rapidly developing field of explainable AI (XAI) sought to address this by developing methods for attributing model predictions to specific input features, but the explanations produced by methods like LIME and SHAP were often post-hoc approximations rather than genuine accounts of the model’s decision process, and their reliability for guiding clinical trust had not been established.
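One simple, model-agnostic member of this family is permutation importance: shuffle a single input feature and measure how much the model’s accuracy drops. The sketch below illustrates the idea on a synthetic predictor whose true logic is known --- it shares the post-hoc limitations discussed above and is not a clinical-grade explanation method:

```python
import numpy as np

# Permutation importance: shuffle one feature at a time and measure the drop
# in accuracy. Synthetic data and a stand-in "model" whose logic we know.
rng = np.random.default_rng(4)
n = 3000
X = rng.normal(size=(n, 4))                      # four synthetic features
y = (X[:, 0] + 2 * X[:, 2] > 0).astype(int)      # outcome uses features 0 and 2

def model(X):                                    # stand-in for a trained predictor
    return (X[:, 0] + 2 * X[:, 2] > 0).astype(int)

baseline_acc = (model(X) == y).mean()
importance = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])         # break feature j's relationship
    importance.append(baseline_acc - (model(Xp) == y).mean())

print(np.round(importance, 3))   # large drops for features 0 and 2, ~0 for 1 and 3
```

The method correctly identifies which features the model relies on, but it says nothing about whether that reliance is clinically sensible --- the gap between attribution and explanation that the passage above describes.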
The regulatory framework for medical AI was another dimension of the trust challenge. The FDA’s approach of treating AI-based medical devices under the existing 510(k) or de novo pathways, designed for conventional devices with static performance characteristics, struggled to accommodate the dynamic nature of machine learning systems that could be updated, retrained, or fine-tuned after initial approval. The FDA’s 2021 action plan for AI and machine learning-based software as a medical device acknowledged these challenges and proposed a framework for predetermined change control plans that would allow approved systems to be updated within pre-specified parameters, but the framework’s practical implementation remained a work in progress. The regulatory uncertainty slowed clinical adoption by creating ambiguity about what level of evidence and what approval process was required for different types of medical AI systems.
“The challenges of medical AI --- bias, privacy, interpretability, regulatory uncertainty --- were not reasons to slow the development of AI in medicine. They were specifications for what responsible development required.”
Reflection: The ethical challenges of medical AI were not unique to AI; they were reflections of longstanding challenges in medicine and medical research --- the underrepresentation of certain populations in research, the tension between individual privacy and collective benefit from data sharing, the difficulty of communicating probabilistic information to patients and clinicians --- that AI’s scale and speed of deployment made newly urgent. Addressing them required the collaboration of technical researchers, clinicians, ethicists, patient advocates, and regulators in ways that the AI field was not accustomed to and did not always embrace willingly. The alternative --- deploying AI systems in healthcare without addressing these challenges --- risked amplifying existing health inequities at scale, which was precisely the opposite of what the most optimistic advocates of medical AI hoped to achieve.
Conclusion: The Machine Enters the Clinic, Cautiously and Consequentially
The application of AI to healthcare and science in the 2010s and early 2020s represented the most consequential and most complex deployment of the technology in its history. The achievements were genuine and in some cases extraordinary: AlphaFold 2’s solution to the protein folding problem was a scientific breakthrough of the first order, whose implications for biology, medicine, and drug discovery are likely to be felt for decades. The deep learning diagnostic imaging systems that matched specialist physicians on specific tasks created real opportunities to extend specialist-level care to populations that could not access human specialists. The drug discovery AI platforms that could screen millions of candidate compounds computationally changed the economics of early-stage pharmaceutical research in ways that are still playing out.
The complications were equally real. The bias problems that emerged when systems trained on non-representative data were deployed on diverse patient populations demonstrated that AI could amplify existing health inequities as easily as it could address them. The distribution shift problems that caused research-validated systems to underperform in deployment demonstrated that impressive benchmark results were necessary but not sufficient evidence of clinical utility. The privacy challenges of training on large quantities of sensitive patient data created tensions that neither pure technical solutions nor existing regulatory frameworks fully resolved. And the trust challenges of deploying AI recommendations in clinical contexts where human judgment remained essential, and where the consequences of errors were harm to real patients, required frameworks for human-AI collaboration that the field was still developing.
The history of AI in healthcare illustrates, with unusual clarity, a general principle about the relationship between technical capability and social impact: the impact of a technology depends not just on what it can do but on how it is deployed, by whom, for whose benefit, with what safeguards, and in what institutional context. An AI capability that could save lives when carefully validated and deployed with appropriate human oversight in a well-resourced clinical environment could cause harm when deployed carelessly, trained on biased data, in a context without clinical expertise or regulatory oversight. The technical breakthrough is necessary but not sufficient; the social and institutional infrastructure for responsible deployment is equally essential and considerably harder to build.
───
Next in the Series: Episode 14
AI in Gaming --- From Chess and Go to Reinforcement Learning and the Mastery of Complex Strategy
Games have been a proving ground for artificial intelligence since the field’s founding. Chess, checkers, Go, Atari, StarCraft, Dota 2, and Minecraft --- each has served as a benchmark, a motivation, and a demonstration of AI capability at a specific moment in the field’s history. In Episode 14, we trace the arc from the rule-based and search-based approaches of early game AI through Deep Blue’s defeat of Kasparov in chess to the reinforcement learning revolution that produced AlphaGo, AlphaZero, and OpenAI Five: systems that not only mastered complex games but, in doing so, demonstrated principles about learning, planning, and decision-making under uncertainty that extend far beyond the games themselves. We also examine what game mastery does and does not tell us about general intelligence, and why the games we choose to solve reveal something about the problems we think intelligence is for.
--- End of Episode 13 ---