Thirty million. That’s the number of biomedical research articles indexed in PubMed. If you read one paper per hour, eight hours a day, five days a week, it would take you over 14,000 years to read them all. And while you’re reading, another 3,000 papers are published every single day. No human can keep up. The knowledge exists — it’s just distributed across a library so vast that no individual scientist can ever walk its full length.
But what if a computer could read all 30 million papers and actually understand them — not just search for keywords, but comprehend that “APOE ε4,” “rs429358,” and “the Cys130Arg variant” all refer to the same thing? That “upregulated” and “overexpressed” mean the same phenomenon? That a sentence about TP53 in a breast cancer paper is relevant to a lung cancer study? This is precisely the gap between keyword search engines and language models. One matches strings. The other understands meaning.
The challenge isn’t unique to literature. UniProt holds functional annotations for over 200 million protein sequences written in natural language. Clinical databases store millions of patient records with textual descriptions of symptoms, treatments, and outcomes. Connecting a GWAS hit to its biological mechanism means linking variant databases to clinical literature to mechanistic papers — all written in the messy, synonym-rich, context-dependent language that humans use. Traditional keyword search fails at every step of that chain.
This is what language models do: they learn the statistical structure of language well enough to understand it — synonyms, entity types, relational context, and all. Before we can apply these ideas to DNA sequences in the chapters ahead, we need to understand how they work on human language first. That understanding will make the leap to genomics feel not like a leap at all, but like an obvious next step.
Biological research generates not just experimental data, but enormous amounts of text. The PubMed database contains over 30 million biomedical abstracts, with roughly 3,000 new papers added every day. The UniProt protein database has detailed functional annotations for over 200 million protein sequences. Clinical databases store millions of patient records with textual descriptions of symptoms, treatments, and outcomes.
A single researcher studying a gene family might need to review thousands of papers. Annotating the function of newly discovered genes requires reading hundreds of studies. Understanding drug-gene interactions means parsing clinical trial reports, case studies, and mechanism descriptions. Connecting genotype to phenotype requires linking variant databases (with text descriptions) to clinical literature.
Traditional keyword search isn’t enough. Searching for “BRCA1” won’t find papers that only mention “breast cancer 1” or “RING finger protein 53.” Simple text matching can’t understand that “increased expression” and “upregulated” mean the same thing, or that “T cell activation” and “lymphocyte stimulation” might be related concepts.
We need computers that can read biological text the way biologists do—understanding synonyms, recognizing entities (genes, proteins, disorders), extracting relationships, and making inferences. The volume of text is too large for humans to process manually, but the information within is too valuable to ignore.
This is where natural language processing (NLP) and language models become essential tools. Before we can understand how language models work on DNA sequences (Chapters 12-13), we first need to understand how they work on human language.
After completing this chapter, you will be able to:
- Explain why biological text requires computational methods and why keyword search falls short
- Describe how word embeddings represent meaning as vectors, and why context-dependent representations are needed
- Explain attention and the transformer architecture at a conceptual level
- Describe pre-training and fine-tuning, and why biomedical pre-training (as in BioBERT) matters
- Identify key applications: named entity recognition, relationship extraction, and literature-based discovery
- Recognize the parallels between text and DNA sequences that let transformers transfer to genomics
- Discuss limitations of language models, including bias, hallucination, cost, and interpretability
Computers work with numbers, not words. When you type “gene expression,” your computer stores this as a sequence of numbers (character codes), but these numbers don’t capture any meaning. The number for ‘g’ has no relationship to the number for ‘e’, and the word “gene” has no mathematical relationship to related words like “genetic” or “genomic.”
To apply machine learning to text, we need to convert words into numerical representations that preserve meaning. This is harder than it sounds.
The simplest approach is to assign each word a unique number: “gene” = 1, “protein” = 2, “expression” = 3, and so on.
But this creates a problem. The numbers are arbitrary labels: “gene” (1) happens to sit next to “protein” (2) and far from “umbrella” (let’s say, 5000), yet those numeric distances mean nothing. The encoding doesn’t capture that “gene” and “protein” are related biological concepts, while “gene” and “umbrella” are unrelated.
Worse, simple encoding can’t handle new words. If your encoding has 10,000 words and you encounter “CRISPR” (word #10,001), your model doesn’t know what to do with it.
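Here is a minimal sketch of the unique-number idea and where it breaks down (the vocabulary and the ID values are made up for illustration):

```python
# Assign each word an arbitrary ID, as in a simple lookup table.
vocab = {"gene": 1, "protein": 2, "expression": 3, "umbrella": 5000}

# The numeric distance between IDs says nothing about meaning:
print(abs(vocab["gene"] - vocab["protein"]))   # 1    - related concepts
print(abs(vocab["gene"] - vocab["umbrella"]))  # 4999 - unrelated concepts
# Neither number reflects how similar the words actually are.

# A word outside the vocabulary has no representation at all:
print(vocab.get("CRISPR", "<unknown>"))        # '<unknown>'
```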
The breakthrough came from a simple idea: words that appear in similar contexts probably have similar meanings. This is called the distributional hypothesis.
Consider sentences like these from biology papers:
- “Insulin levels were measured in blood samples after an overnight fast.”
- “Glucose levels were measured in blood samples after an overnight fast.”
- “Glucagon levels were measured in blood samples after an overnight fast.”
The words “insulin,” “glucose,” and “glucagon” appear in similar contexts. They’re all things you measure in blood. This suggests they’re related concepts—and indeed they are, all being molecules involved in glucose metabolism.
A word embedding is a vector (a list of numbers) that represents a word. Instead of representing “insulin” as a single number, we represent it as a vector of perhaps 300 numbers:
insulin → [0.2, -0.5, 0.8, 0.1, ..., 0.3] (300 numbers total)
glucose → [0.3, -0.4, 0.7, 0.2, ..., 0.4]
umbrella → [-0.8, 0.9, -0.2, 0.6, ..., -0.5]
These embeddings are learned from large text corpora so that words appearing in similar contexts get similar vectors. The mathematical distance between the “insulin” vector and “glucose” vector is small (they’re similar), while the distance between “insulin” and “umbrella” is large.
A loose biological analogy: just as codons (the 3-letter “words” of DNA) are grouped by the amino acid they encode, embedding models group words by the contexts in which they appear, so elements with similar roles end up with similar representations.
Word embeddings capture more than just similarity—they capture relationships. One famous example is the vector arithmetic:
king - man + woman ≈ queen
Embeddings trained on biomedical text can capture analogous relationships; here are illustrative examples:
BRCA1 - breast + ovary ≈ BRCA2
insulin - pancreas + liver ≈ glucose
This isn’t magic—it’s a consequence of how these words are used in text. Papers discussing BRCA1 and breast tissue often have parallel structures to papers discussing BRCA2 and ovarian tissue.
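You can try this kind of vector arithmetic yourself with pre-trained general-English word vectors. A minimal sketch using the gensim library (which downloads the GloVe vectors on first use; the specific vector set is an arbitrary choice):

```python
# Word-vector arithmetic with pre-trained GloVe embeddings via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # typically [('queen', ~0.77)]
```

Biomedical analogies like the BRCA1/BRCA2 example would require vectors trained on PubMed text rather than general English.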
[Optional: The Math]
Vector Similarity
How do we measure if two word vectors are similar? The most common metric is cosine similarity.
Given two vectors a and b, cosine similarity is:
cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
Where:
- a · b is the dot product (multiply corresponding elements and sum)
- ||a|| is the magnitude (length) of vector a
- Result ranges from -1 (opposite) to 1 (identical)
Example with simple 3D vectors:
insulin = [0.8, 0.3, 0.1]
glucose = [0.7, 0.4, 0.2]
Dot product = 0.8×0.7 + 0.3×0.4 + 0.1×0.2 = 0.56 + 0.12 + 0.02 = 0.70
Magnitude of insulin = √(0.8² + 0.3² + 0.1²) = √0.74 ≈ 0.860
Magnitude of glucose = √(0.7² + 0.4² + 0.2²) = √0.69 ≈ 0.831
Cosine similarity = 0.70 / (0.860 × 0.831) ≈ 0.98
A value near 1.0 indicates these vectors are very similar—exactly what we expect for related metabolic molecules.
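The same calculation takes a few lines with numpy (the 3-dimensional vectors are toy values; real embeddings have hundreds of dimensions):

```python
import numpy as np

insulin = np.array([0.8, 0.3, 0.1])
glucose = np.array([0.7, 0.4, 0.2])

# Cosine similarity = dot product divided by the product of the magnitudes.
cosine = insulin @ glucose / (np.linalg.norm(insulin) * np.linalg.norm(glucose))
print(round(cosine, 2))   # 0.98 - the vectors point in nearly the same direction
```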
Word embeddings like Word2Vec have a limitation: each word gets exactly one vector, regardless of context. But words often have multiple meanings.
Consider the word “expression” in these two sentences:
- “Gene expression was measured in tumor samples.” (the abundance of mRNA transcripts)
- “The patient’s facial expression suggested discomfort.” (the everyday meaning)
Static embeddings give “expression” the same vector in both sentences, even though the meanings are completely different.
Modern language models solve this by creating context-dependent embeddings. The same word gets different vectors depending on surrounding words.
"gene expression" → expression_vector_1
"facial expression" → expression_vector_2
How does the model know which vector to use? By looking at the entire sentence and understanding context. This requires a mechanism for words to “communicate” with each other—to share information about what they mean in this particular sentence.
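Before looking at that mechanism, you can verify that modern models really do produce different vectors for the two uses of “expression.” A minimal sketch, assuming the Hugging Face transformers library and the general-purpose bert-base-uncased checkpoint (downloaded on first use):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the context-dependent vector for `word` within `sentence`.
    Assumes `word` is a single token in the tokenizer's vocabulary."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                          # vector for that token

v1 = contextual_vector("the gene expression was measured in liver tissue", "expression")
v2 = contextual_vector("her facial expression showed surprise", "expression")

# Same word, different contexts, different vectors: similarity is well below 1.0.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0).item())
```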
The breakthrough that enabled context-dependent understanding is called attention. When processing a word, attention lets the model look at all other words in the sentence and decide which ones are most relevant.
Let’s walk through a biological example:
"The TP53 gene encodes a protein that regulates cell cycle progression"
When processing the word “protein,” the model should pay attention to:
- “TP53” (which protein is being discussed)
- “regulates” (what the protein does)
- “cell cycle” (the process it acts on)
It should pay less attention to:
- function words like “The,” “a,” and “that,” which carry little biological meaning
Here’s what attention scores might look like for “protein” in our sentence:
protein pays attention to:
TP53: 0.45 (high - tells us which protein)
regulates: 0.30 (high - tells us the function)
cell cycle: 0.15 (medium - tells us the process)
gene: 0.05 (low - already implied)
encodes: 0.03 (low - grammatical connection)
The, a, that: 0.02 total (very low - not meaningful)
[Optional: The Math]
Computing Attention
Attention uses three vectors for each word: Query (Q), Key (K), and Value (V).
- For each word, create Q, K, V vectors from the word embedding
- To find how much word i should attend to word j:
- Compute score = Q_i · K_j (dot product)
- Divide by √d (d = dimension of vectors, for scaling)
- Apply softmax to get probabilities
- Weighted sum: attention_output_i = Σ(attention_score_ij × V_j)
For our example with “protein” (word i) and “TP53” (word j):
score = Q_protein · K_TP53 / √64 (high overlap between the vectors, so a high score)
After softmax: attention_weight = 0.45
Output includes 0.45 × V_TP53 (incorporating TP53 information)
The beauty is that this is computed in parallel for all word pairs simultaneously, making it efficient.
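Here is a minimal numpy sketch of this scaled dot-product attention. The sentence length and the vector dimension d = 64 are illustrative choices, and the random Q, K, V stand in for the vectors a trained model would produce:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (num_words, d). Returns mixed values and weights."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # similarity of every word pair
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V, weights                       # weighted sum of value vectors

rng = np.random.default_rng(0)
num_words, d = 10, 64                                  # 10 tokens in a short sentence
Q, K, V = (rng.normal(size=(num_words, d)) for _ in range(3))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)        # (10, 64): one context-mixed vector per word
print(weights[0].sum())    # 1.0: each word's attention weights form a distribution
```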
In 2017, researchers at Google introduced the transformer architecture in a paper titled “Attention Is All You Need.” This architecture has become the foundation for virtually all modern language models—and, as we’ll see in later chapters, for DNA sequence models too.
The transformer transforms input text into rich, context-aware representations. Unlike earlier models that processed text sequentially (one word at a time), transformers process all words simultaneously using attention mechanisms.
Key components:
- Token embeddings plus positional encodings, so the model knows both what each word is and where it sits in the sequence
- Multi-head self-attention, which lets every word gather information from every other word
- Feed-forward layers that transform each position's representation
- Residual connections and layer normalization, which keep training stable as many layers are stacked
Instead of using just one attention mechanism, transformers use multiple attention “heads” running in parallel. Each head can focus on different aspects of the relationship between words.
For the sentence “TP53 mutations increase cancer risk”:
- one head might link “mutations” to “TP53” (which gene is mutated)
- another might link “increase” to “risk” (the direction of the effect)
- another might link “cancer” to “risk” (the outcome being affected)
Each head learns to capture different types of information. The outputs are combined to form a rich, multi-faceted representation.
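PyTorch ships a ready-made multi-head attention layer; this short sketch (with illustrative dimensions: 64-dimensional embeddings, 4 heads, a 6-token sentence) shows the shapes involved:

```python
import torch
from torch import nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 6, 64)   # 1 sentence, 6 tokens, 64-dimensional embeddings

# Self-attention: the sentence attends to itself. Each of the 4 heads mixes
# information differently; their outputs are combined into one vector per token.
output, weights = attn(x, x, x, average_attn_weights=False)
print(output.shape)    # torch.Size([1, 6, 64])  - one enriched vector per token
print(weights.shape)   # torch.Size([1, 4, 6, 6]) - a 6x6 attention map per head
```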
Modern language models stack many transformer layers (12, 24, or even 96 layers). Information flows through these layers:
Layer 1: Basic word meanings and relationships
↓
Layer 2-4: Syntactic structure (subject, verb, object)
↓
Layer 5-8: Semantic relationships (what relates to what)
↓
Layer 9-12: Abstract concepts and implications
Early layers capture simple patterns. Later layers capture complex, abstract relationships. This hierarchical processing resembles how CNNs process DNA sequences (Chapter 9), but adapted for text.
Transformers come in three flavors:
Encoder-only (e.g., BERT - Bidirectional Encoder Representations from Transformers): reads the whole input at once, attending both left and right of each word. Best suited to understanding tasks such as classification, entity recognition, and relationship extraction.
Decoder-only (e.g., GPT - Generative Pre-trained Transformer): generates text one token at a time, attending only to what came before. Best suited to text generation.
Encoder-decoder (e.g., T5 - Text-to-Text Transfer Transformer): an encoder reads the input and a decoder generates the output. Best suited to sequence-to-sequence tasks such as translation and summarization.
For biological text mining, encoder-only models are most common because we usually want to extract information (entities, relationships) rather than generate new text.
Here’s a remarkable fact: you don’t need to train a language model from scratch for your specific biological task. Instead, you can use pre-training and transfer learning.
Think of BERT pre-training like a medical student who reads broadly across textbooks and journals before specializing: they build general knowledge first, then fine-tune that knowledge for a specific clinical task.
Stage 1: Pre-training (expensive, done once). The model learns from a huge unlabeled text corpus using a self-supervised objective; BERT-style models, for example, learn to predict words that have been masked out. No task-specific labels are needed.
Stage 2: Fine-tuning (cheaper, done per task). The pre-trained model is trained further on a small labeled dataset for the specific task, such as entity recognition or relationship extraction, adjusting its weights only slightly.
This is similar to transfer learning we saw in Chapter 3, but now applied to text instead of images.
During pre-training on general text, the model learns:
- vocabulary, word meanings, and synonyms
- grammar and sentence structure
- which words tend to occur together, and in what contexts
- broad factual associations present in the corpus
For biology, we can use models pre-trained on biomedical text; common examples include BioBERT (PubMed abstracts and PMC articles), SciBERT (scientific papers), and ClinicalBERT (clinical notes).
Models trained on biomedical text learn biology-specific knowledge: gene and protein nomenclature, disease terminology, the way experimental findings are phrased, and the fact that in a biology paper “expression” almost always refers to gene expression.
BioBERT starts from general BERT and continues pre-training on PubMed abstracts and PMC full-text articles. Compared to general BERT:
General BERT (trained on Wikipedia and books): rarely encounters gene symbols, splits terms like “BRCA1” or “rs429358” into uninformative word pieces, and has no sense that “p53” and “TP53” refer to the same gene.
BioBERT (trained on PubMed): has seen gene symbols, variant identifiers, and disease names in enormous numbers of contexts, so its representations already encode much of the domain vocabulary and its synonyms.
For tasks like extracting gene-disorder relationships from papers, BioBERT dramatically outperforms general BERT—even with the same architecture and same amount of fine-tuning data. The difference is the pre-training corpus.
Let’s say you want to build a model to extract drug-gene interactions from papers. You would:
1. Start from pre-trained BioBERT rather than training from scratch.
2. Collect a few hundred sentences labeled for whether they describe a drug-gene interaction.
3. Fine-tune BioBERT, with a small classification head, on those labeled examples.
4. Run the fine-tuned model across the full literature.
With just a few hundred labeled examples, you can build a highly accurate extractor—because BioBERT already understands 90% of what it needs to know. You’re just teaching it the specific format of drug-gene interactions.
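A hedged sketch of the fine-tuning step, assuming the Hugging Face transformers library; the checkpoint name refers to a published BioBERT release on the Hugging Face Hub (substitute whichever biomedical checkpoint you use), and the two-sentence “dataset” and its labels are purely illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dmis-lab/biobert-base-cased-v1.1"          # illustrative BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sentences = [
    "Imatinib inhibits the BCR-ABL kinase.",             # label 1: drug-gene interaction
    "The patient reported a mild headache.",             # label 0: no interaction
]
labels = torch.tensor([1, 0])
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                        # a few passes over the toy batch
    loss = model(**batch, labels=labels).loss            # classification head supplies the loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(f"training loss: {loss.item():.3f}")
```

In practice you would use hundreds of labeled sentences, a held-out validation set, and a few epochs of training rather than this toy loop.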
Language models aren’t just academic curiosities. They solve real problems in biological research.
Identifying biological entities in text, known as named entity recognition (NER): genes (TP53, BRCA1), proteins, diseases, drugs, variants, and cell types.
Challenge: Many entities have multiple names. The gene “TP53” is also called “p53”, “tumor protein p53”, and “cellular tumor antigen p53.” Language models learn these synonyms from context.
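A minimal sketch of NER with a Hugging Face pipeline; the model identifier below is a placeholder for whichever biomedical NER checkpoint you have available, and the entity labels in the comment are illustrative:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-biomedical-ner-model",      # placeholder: any BioBERT-style NER checkpoint
    aggregation_strategy="simple",          # merge word pieces back into whole-entity spans
)

text = "Mutations in CFTR cause cystic fibrosis."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
# Illustrative output: GENE CFTR 0.99 / DISEASE cystic fibrosis 0.98 (labels depend on the model)
```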
Case Study: Extracting Gene-Disease Associations
Researchers trained BioBERT to extract gene-disease relationships from PubMed abstracts. Given the sentence:
“Mutations in CFTR cause cystic fibrosis, a genetic disorder affecting the lungs and digestive system”
The model extracts:
- Gene: CFTR
- Disease: cystic fibrosis
- Relationship: causal (mutations in the gene cause the disease)
Across 1 million abstracts, the model extracted 150,000 gene-disease associations in 6 hours. Manual extraction would have taken years. Importantly, the model found associations not in existing databases, including rare variant-disease connections mentioned only in case reports.
Beyond identifying entities, language models can extract relationships between them: protein-protein interactions, drug-gene interactions, gene-disease associations, and pathway memberships.
This builds structured knowledge graphs from unstructured text.
Sometimes important connections hide across different papers. Consider an illustrative pair:
- One group of papers reports that fish oil reduces inflammatory markers.
- A separate group of papers reports that chronic inflammation accelerates Alzheimer’s progression.
- No single paper connects the two observations.
A language model can connect these: fish oil → reduced inflammation → slower Alzheimer’s progression. This generates hypotheses for experimental testing.
Recall the problem posed at the start of this chapter: connecting a specific variant to everything the literature says about it. Language models help with exactly this problem.
Given a specific variant (e.g., APOE ε4 / rs429358), a language model can:
- find every paper mentioning the variant under any of its names (rsID, protein change, haplotype)
- extract the phenotypes and diseases it has been associated with
- summarize the mechanisms proposed for its effects
- link it to related genes, pathways, and drugs discussed in the same literature
This transforms a 2-month manual literature review into a 2-hour computational analysis.
Clinical notes contain valuable information but are unstructured text:
"62 yo F presents with progressive memory difficulties over 18 months.
Family hx significant for mother with early-onset dementia.
Genetic testing: APOE ε4/ε4 homozygous.
MRI shows bilateral hippocampal atrophy.
Impression: consistent with Alzheimer's disease."
Language models can extract:
- Demographics: 62-year-old female
- Presenting symptom: progressive memory difficulties over 18 months
- Family history: early-onset dementia (mother)
- Genotype: APOE ε4/ε4 homozygous
- Imaging: bilateral hippocampal atrophy
- Clinical impression: Alzheimer's disease
This structured data enables large-scale analyses: How do APOE genotypes correlate with symptom progression? What imaging findings predict clinical outcomes?
You might wonder: why are we learning about processing human language in a book about AI for biology?
Here’s the key insight: DNA sequences and text sequences have surprising similarities.
Both text and DNA are sequences of discrete symbols:
Text: "The cat sat on the mat"
DNA: "ATGCGATCGATCGATCGAT"
In text: the alphabet is letters, the meaningful units are words drawn from a vocabulary of tens of thousands, words combine into phrases and sentences, and meaning depends on context.
In DNA: the alphabet is the four bases (A, C, G, T), the meaningful units are k-mers, codons, and motifs, these combine into regulatory elements and genes, and meaning depends on genomic context.
The transformer architecture designed for text has been adapted for DNA:
| Text Processing | DNA Processing |
|---|---|
| Tokenize words | Tokenize k-mers (e.g., 6-bp sequences) |
| Word embeddings | K-mer embeddings |
| Sentence context | Genomic context |
| Multi-head attention | Multi-head attention (same!) |
| Pre-train on books | Pre-train on genomes |
| Fine-tune for tasks | Fine-tune for tasks |
Just as biologists break sequences into codons (3-letter words) to find meaning in a reading frame, DNA language models break sequences into k-mers—short overlapping fragments—to build representations of genomic “vocabulary.” The model learns that some k-mers appear together in regulatory elements, much like certain words co-occur in scientific sentences.
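A short sketch of k-mer tokenization; the choice of k = 6 and the example sequence are arbitrary:

```python
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Slide a window of length k along the sequence and collect the DNA 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

dna = "ATGCGATCGATCGATCGAT"
print(kmer_tokenize(dna, k=6)[:4])
# ['ATGCGA', 'TGCGAT', 'GCGATC', 'CGATCG'] - overlapping 6-mers, like words in a sentence
```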
The same attention mechanism that helps understand “The protein encoded by TP53” can help understand regulatory relationships in “…TATA…ATG…AATAAA…” sequences.
This connection means we can apply decades of NLP research directly to genomics: the same architectures, pre-training strategies, tokenization schemes, and interpretation tools carry over.
Models like BERT revolutionized NLP in 2018. By 2020, researchers were applying BERT-style models to DNA (DNABERT). The transformer architecture introduced in 2017 for language now powers genomic models analyzing sequences hundreds of thousands of bases long.
Understanding language models isn’t a detour—it’s the foundation for understanding modern genomic AI.
Language models are powerful but not perfect. Understanding their limitations helps us use them appropriately.
Language models learn from their training data, including its biases and errors:
Publication bias: Positive results are published more often than negative results. A model trained on published papers might overestimate associations.
Terminology drift: Medical terms evolve. “Mental retardation” → “intellectual disability.” Older papers use outdated terminology, but the underlying biology remains the same.
Ambiguity: Gene symbols are reused. “CAT” could mean the CAT gene (catalase), the animal (Felis catus), CAT scan (computed axial tomography), or chloramphenicol acetyltransferase. Context usually resolves this, but not always.
Language models can generate plausible-sounding but incorrect statements:
Model output: ”BRCA3 mutations are associated with pancreatic cancer risk”
Problem: BRCA3 doesn’t exist (there are BRCA1 and BRCA2, but no BRCA3)
This is particularly dangerous in biology, where incorrect information could mislead research. Always verify model outputs against primary sources.
Training large language models is expensive. For most biological applications, you’ll use pre-trained models (zero training cost) or fine-tune them (modest cost). But understanding these costs helps explain why relatively few organizations can train foundation models.
Why did the model predict a specific gene-disorder association? With neural networks, it’s hard to say. Attention weights provide some insight—we can see which words the model focused on—but this doesn’t fully explain the decision.
For clinical applications, interpretability is crucial. Regulatory agencies require explanations for AI-assisted diagnoses. “The model said so” isn’t sufficient.
Drawing on lessons from the many studies that have applied these models, here are practical guidelines for using language models in biology:
Don’t just report overall accuracy. Report:
- precision and recall for each entity or relationship type
- performance on data from sources the model was not trained on
- how performance degrades on rare entities and edge cases
Test on data from different sources than training.
For critical applications (clinical decisions, drug development): keep a domain expert in the loop, verify extracted facts against primary sources, and document the model version, training data, and known limitations.
Biological research generates enormous amounts of text that contains valuable information but requires computational methods to process at scale.
Word embeddings convert text to numerical vectors that capture semantic meaning, with similar words having similar vectors.
Context matters in language: The same word can mean different things in different contexts, requiring context-dependent representations.
Attention mechanisms allow models to focus on relevant parts of the input when processing each word, enabling sophisticated context understanding.
Transformers use multi-head attention and stacked layers to build rich, hierarchical representations of text.
Pre-training on large text corpora creates foundation models that understand general language, which can then be fine-tuned for specific biological tasks.
BioBERT and similar models pre-trained on biomedical text dramatically outperform general language models for biological applications.
Language models enable entity recognition, relationship extraction, and literature-based discovery at scales impossible for manual analysis.
DNA sequences can be processed like text sequences, using the same transformer architectures—this connection enables applying NLP advances to genomics.
Language models have limitations including data biases, hallucination, computational costs, and interpretability challenges that require careful consideration.
| Term | Definition |
|---|---|
| Attention mechanism | A method that allows a model to focus on relevant parts of the input when processing each element, computing weighted combinations based on learned importance scores. |
| BioBERT | A version of BERT pre-trained on biomedical text (PubMed abstracts and PMC articles), optimized for biological and medical language understanding. |
| Contextualized embeddings | Vector representations of words that change based on surrounding context, allowing the same word to have different representations in different sentences. |
| Decoder | A transformer component that generates output sequences one element at a time, used in text generation tasks. |
| Distributional hypothesis | The principle that words appearing in similar contexts tend to have similar meanings, forming the basis for word embeddings. |
| Encoder | A transformer component that processes input sequences to create rich representations, used in understanding and classification tasks. |
| Fine-tuning | The process of taking a pre-trained model and training it further on a specific task with a smaller, task-specific dataset. |
| Foundation model | A large model pre-trained on broad data that can be adapted to many downstream tasks through fine-tuning. |
| Multi-head attention | Using multiple attention mechanisms in parallel, each focusing on different aspects of relationships between sequence elements. |
| Named entity recognition (NER) | The task of identifying and classifying specific entities (genes, proteins, disorders, etc.) in text. |
| Pre-training | Initial training of a model on a large, general dataset to learn broad patterns before fine-tuning on specific tasks. |
| Relationship extraction | Identifying and classifying relationships between entities in text, such as protein-protein interactions or gene-disorder associations. |
| Transfer learning | Using knowledge learned from one task (pre-training) to improve performance on a different but related task (fine-tuning). |
| Transformer | A neural network architecture based on self-attention mechanisms that processes sequences in parallel rather than sequentially. |
| Word embedding | A dense vector representation of a word that captures its meaning, with similar words having similar vectors. |
Explain why simply assigning each word a unique integer ID is insufficient for natural language processing. What information does it fail to capture?
A researcher trains a language model on PubMed abstracts to extract gene names. The model performs well on papers from the journals it was trained on but poorly on clinical notes from patient records. What might explain this discrepancy, and how could the researcher address it?
Consider the biological term “expression” which has different meanings in different contexts (gene expression vs. facial expression). How do contextualized embeddings solve this problem differently than static word embeddings?
The attention mechanism is described as allowing words to “communicate” with each other. For the phrase “BRCA1 mutations increase breast cancer risk,” which words should “mutations” pay most attention to, and why?
Transformers can process all words in a sentence simultaneously, unlike earlier models that processed one word at a time. What computational advantage does this provide? Can you think of any disadvantages?
Why is pre-training on biomedical text (like in BioBERT) more effective for biological tasks than pre-training on general text (like in BERT), even when using the same architecture and the same amount of fine-tuning data?
A language model extracts 1,000 gene-disorder associations from literature, 800 of which are correct. What is the precision? If there are actually 1,200 true associations in the literature, what is the recall? Which metric matters more depends on the application—explain when you’d prioritize precision vs recall.
Language models trained on published papers might learn biases from the literature (e.g., over-representing well-studied genes). How could these biases affect scientific research if researchers rely heavily on model predictions? What safeguards would you recommend?