When the Human Genome Project (HGP) announced its completion in 2003, it was celebrated as a historic milestone. But here’s something surprising: the “complete” human genome wasn’t actually complete. About 8% of our genome remained unsequenced—not because scientists didn’t want to sequence it, but because the technology at the time simply couldn’t handle certain regions.
Think of it like trying to assemble a jigsaw puzzle where some pieces are nearly identical. Sequencing technologies of the time read relatively short fragments of DNA (from about 100 to under 1,000 base pairs, depending on the platform), which worked well for most of the genome. But in highly repetitive regions, where the same sequence appears over and over, these short reads couldn’t tell one repeat from another. It’s like having dozens of identical puzzle pieces and not knowing where each one goes.
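The puzzle analogy can be made concrete with a toy example. The sketch below is only an illustration: the sequences are invented, and exact string matching stands in for real read alignment. It shows a read that sits entirely inside a repeat matching several places, while a read long enough to reach the unique flanks matches only one.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions

# Toy genome (invented): unique left flank, five copies of a repeat unit, unique right flank.
left_flank = "ACGTTGCA"
repeat_unit = "GATTACA"
right_flank = "TTGGCCAA"
genome = left_flank + repeat_unit * 5 + right_flank

# A "short read" that lies entirely inside the repeat.
short_read = repeat_unit
# A "long read" that spans the whole repeat and reaches into both flanks.
long_read = left_flank[-4:] + repeat_unit * 5 + right_flank[:4]

short_hits = find_all(genome, short_read)
long_hits = find_all(genome, long_read)

print(short_hits)  # several candidate placements
print(long_hits)   # a single placement
```

The short read has five equally good placements, one per repeat copy, whereas the long read can only go in one place because it carries unique sequence at both ends.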
The missing regions weren’t random leftovers. They included some of the most functionally important parts of our chromosomes: centromeres (the central structures that help separate chromosomes during cell division), telomeres (the protective caps at chromosome ends), and ribosomal DNA arrays (genes that make the machinery for protein production). These gaps limited our understanding of chromosome stability, cell division, and even how genetic diseases develop.
In 2022, the Telomere-to-Telomere (T2T) Consortium changed this. They produced the first truly complete human genome sequence, called T2T-CHM13, covering all 3.055 billion base pairs from one end (telomere) to the other end (telomere) of every chromosome—hence the name. This achievement added about 200 million base pairs of new sequence and revealed 1,956 previously unknown gene predictions, 99 of which are predicted to be protein coding (Nurk et al. 2022, Science).
Two major advances enabled the T2T project to succeed where earlier efforts had failed: new sequencing technologies and a special cell line.
The T2T Consortium combined several cutting-edge technologies:
PacBio HiFi sequencing: This technology reads much longer DNA fragments—around 20,000 base pairs (20 kb)—with high accuracy. These longer reads can span multiple repeats, making it possible to distinguish one repeat from another.
Oxford Nanopore ultra-long-read sequencing: These reads can exceed 100,000 base pairs, allowing scientists to read through even longer repetitive regions in a single pass.
Supporting methods: Additional techniques like Illumina short-read sequencing (for error correction), Hi-C (which maps how DNA folds in 3D space), Bionano optical mapping (which creates a physical map of the genome), and Strand-seq (which sequences individual DNA template strands, preserving their orientation, so the order and direction of assembled sequence can be checked) all helped ensure accuracy.
The assembly process used a high-resolution string graph—a powerful computational approach that helped resolve the most complex repetitive regions (see Figure 2 in Nurk et al., 2022).
Figure: CHM13 String Graph. This diagram shows how the T2T-CHM13 genome was assembled using a string graph approach. Each line represents a DNA sequence, and connections show where sequences overlap. The tangled regions reveal highly repetitive areas like the ribosomal DNA arrays and centromeric satellites. Source: Nurk, S. et al. (2021). The complete sequence of a human genome. bioRxiv. https://doi.org/10.1101/2021.05.26.445798. License: CC-BY 4.0.
Unlike traditional linear reference genomes that represent a single sequence, a string graph represents DNA as a network where nodes are DNA sequences and edges show how they connect. Think of it like a subway map showing multiple routes between stations—the graph displays all possible connections, then long reads help scientists determine the correct path through repetitive regions.
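A minimal sketch of the subway-map idea follows. It is not the assembler used by the T2T Consortium; the node names, the `edges` dictionary, and the `long_reads` list are all hypothetical. Two unique segments enter a shared repeat and two leave it, so the graph alone allows four routes, and the long reads indicate which routes are real.

```python
from itertools import product

# Hypothetical graph: unique segments A and C enter a shared repeat R;
# unique segments B and D leave it.
edges = {"A": ["R"], "C": ["R"], "R": ["B", "D"]}

def paths_through(repeat: str) -> list[tuple[str, str, str]]:
    """Enumerate every entry -> repeat -> exit path that the graph allows."""
    entries = [node for node, targets in edges.items() if repeat in targets]
    exits = edges[repeat]
    return [(entry, repeat, exit_) for entry, exit_ in product(entries, exits)]

# Each long read is recorded as the ordered list of segments it spans (invented data).
long_reads = [["A", "R", "B"], ["C", "R", "D"], ["A", "R", "B"]]

candidate_paths = paths_through("R")
supported_paths = [path for path in candidate_paths if list(path) in long_reads]

print(candidate_paths)  # all four routes the graph permits
print(supported_paths)  # only the routes confirmed by a spanning read
```

Only the routes A-R-B and C-R-D are backed by reads that span the repeat, so those are the ones a walk through the graph would keep.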
This approach was critical for handling repeats: when you have many identical DNA sequences (like in centromeres), a linear approach gets confused about which piece connects where. The graph shows all possibilities, and ultra-long reads spanning the repeats reveal the correct connections. The final T2T-CHM13 assembly achieved an exceptionally high accuracy with an error rate of just 1 mistake per 10 million bases.
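To put the quoted error rate in perspective, it can be converted into a Phred-scaled quality value and into an expected number of errors across the genome. The Phred scale is a common convention and is not taken from the text above. The genome size is the 3.055 billion base pairs quoted earlier, and the result is a back-of-the-envelope estimate rather than a reported figure.

```python
import math

def phred_quality(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)

errors_per_base = 1 / 10_000_000   # "1 mistake per 10 million bases"
genome_bp = 3.055e9                # assembly size quoted in the text

quality = phred_quality(errors_per_base)
expected_errors = genome_bp * errors_per_base

print(round(quality))          # Phred-scaled quality
print(round(expected_errors))  # rough number of errors genome-wide
```

Under these assumptions the error rate corresponds to roughly Q70, or on the order of a few hundred erroneous bases across the whole assembly.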
While T2T-CHM13 is presented as a linear sequence (because it represents one individual), the future of genomics is moving toward pangenome references that better capture human genetic diversity across populations.
The T2T project used DNA from a unique cell line called CHM13, which came from a complete hydatidiform mole—a rare type of tissue that forms when an egg without genetic material is fertilized by a sperm that duplicates its own genome. This means CHM13 has two identical copies of each chromosome from the father and none from the mother.
Why does this matter? In a typical human genome, you inherit one set of chromosomes from each parent, making assembly more complex because you have to distinguish between similar but not identical sequences. CHM13’s duplicated genome simplified assembly—it’s like solving a puzzle where you know matching pieces should be identical. However, CHM13 lacked a Y chromosome, which remained the last piece of the puzzle.
While the full T2T-CHM13 genome was announced in 2022, the project achieved its first major milestone in 2020 by completing the human X chromosome—the first chromosome to be assembled telomere-to-telomere with no gaps (Miga et al. 2020, Nature).
The research team chose the X chromosome as their first target for several strategic reasons:
Figure: Initial CHM13 X Chromosome Assembly. (See Figure 1b) The X chromosome was initially broken at three locations: the centromere (artificially collapsed in the assembly), a 120-kb segmental duplication (DMRTC1B), and a 134-kb segmental duplication with a paralogue on chromosome 2. Black bars indicate gaps in the GRCh38 reference, and red bars show known segmental duplications. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The two segmental duplications were resolved by finding ultra-long reads that completely spanned the repeats and were uniquely anchored on both sides. This allowed confident placement in the assembly.
The centromere presented a greater challenge—it’s a 3.1 Mb array of highly repetitive alpha satellite DNA where standard polishing methods failed. Initial attempts to polish the centromeric assembly actually decreased quality because reads were incorrectly placed due to the extreme repetitiveness.
The team developed a novel marker-assisted polishing strategy to finish the large repetitive regions:
1. Catalogue unique markers: They identified short (21 bp), unique sequences present only once in the genome. Remarkably, even within the DXZ1 centromeric array, there was enough variation between repeat copies to create unique markers at semi-regular intervals.
Genome-wide: average spacing of 66 bp between markers
DXZ1 array: average spacing of 2.3 kb between markers
Longest gap in DXZ1: 42 kb
2. Anchor reads precisely: These markers guided the correct placement of long reads during polishing, preventing the quality degradation that occurred with standard methods.
3. Iterative polishing: Multiple rounds of polishing were performed with Nanopore, then PacBio, then Illumina data, each improving accuracy.
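The first two steps above can be sketched on a toy repeat array. This is a simplified illustration, not the consortium's pipeline: it uses 5-mers instead of 21-mers, exact matching instead of alignment, and invented sequences. The helper names `unique_marker_positions` and `anchor_read` are hypothetical.

```python
from collections import Counter
from typing import Optional

def unique_marker_positions(sequence: str, k: int) -> list[int]:
    """Start positions of k-mers that occur exactly once in `sequence`."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if counts[kmer] == 1]

def anchor_read(read: str, reference: str, markers: list[int], k: int) -> Optional[int]:
    """Place a read using the first unique marker it contains; None if it has none."""
    marker_at = {reference[p:p + k]: p for p in markers}
    for offset in range(len(read) - k + 1):
        kmer = read[offset:offset + k]
        if kmer in marker_at:
            return marker_at[kmer] - offset
    return None

# Toy array (invented): four repeat copies, one carrying a single-base variant.
unit = "ACGTACCA"
variant = "ACGTTCCA"   # differs from `unit` at one base
array = unit + unit + variant + unit

K = 5  # the text describes 21-bp markers; 5 keeps the toy example readable
markers = unique_marker_positions(array, K)

read_with_marker = array[14:26]    # overlaps the variant copy
read_without_marker = array[0:8]   # an unmodified repeat copy

placed = anchor_read(read_with_marker, array, markers, K)
unplaced = anchor_read(read_without_marker, array, markers, K)

print(markers)   # markers exist only around the variant base
print(placed)    # start position recovered from the marker
print(unplaced)  # no marker, so no confident placement
```

The only unique k-mers are the ones that overlap the variant base, which mirrors the observation that small differences between repeat copies are what make markers possible inside the array. A read without any marker is left unplaced rather than guessed.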
Figure: Marker-Assisted Polishing Improves Assembly Quality. (See Figure 1d) Example from the GAGE locus showing coverage depth before (top) and after (bottom) marker-assisted polishing. Black dots indicate primary allele coverage, red dots show secondary alleles. The uniform coverage and elimination of secondary alleles after polishing demonstrate the dramatic quality improvement achieved by this method. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The 3.1 Mb DXZ1 centromeric array consists of approximately 1,408 copies of a ~2,057-bp “higher-order repeat” (HOR) unit. This canonical repeat is made of 12 divergent alpha satellite monomers (each ~171 bp) arranged in a specific order. The team validated this structure through multiple independent methods:
Experimental validation:
Sequence-based validation:
Figure: Complete Structure of the 3.1 Mb X Centromere. The DXZ1 array consists of approximately 1,408 copies of a ~2-kb higher-order repeat unit. (a) Predicted restriction map showing the array structure. (b) Experimental PFGE Southern blots matching the in silico prediction. (c) ddPCR copy number estimates across multiple cell lines. (d) Catalogue of 33 structural variants identified within the array. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The assembly catalogued 33 different structural variants within the DXZ1 array, including:
This level of detail was impossible with previous technologies and provides insights into centromere evolution and function.
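The DXZ1 figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the approximate numbers from the text (171-bp monomers, 12 monomers per unit, a ~2,057-bp unit, and about 1,408 copies), so the results are rough consistency checks and not measurements.

```python
# Figures quoted in the text above (all approximate).
MONOMER_BP = 171
MONOMERS_PER_HOR = 12
HOR_BP = 2_057
HOR_COPIES = 1_408

hor_from_monomers_bp = MONOMER_BP * MONOMERS_PER_HOR  # unit length implied by the monomers
array_bp = HOR_BP * HOR_COPIES                        # array length implied by the copy number
array_mb = array_bp / 1_000_000

print(hor_from_monomers_bp)
print(array_bp)
print(round(array_mb, 2))
```

Twelve 171-bp monomers give 2,052 bp, close to the ~2,057-bp unit, and 1,408 units give roughly 2.9 Mb, in the same range as the quoted 3.1 Mb. The sketch does not try to account for the remaining difference between these approximate figures.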
The complete X chromosome assembly:
The precisely anchored ultra-long reads enabled chromosome-wide methylation mapping at single-base resolution—even across complex repeats that were previously invisible to methylation studies. This revealed several surprising patterns:
Figure: Methylation Patterns Across the Complete X Chromosome. Nanopore sequencing captured methylation patterns across the entire chromosome. (a) Hypomethylation in PAR1, with detailed view showing unmethylated bases (blue) and methylated bases (red). (b) A 93-kb hypomethylated region within the DXZ1 centromere. (c) The DXZ4 macrosatellite array showing a transition from methylated to unmethylated regions. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The hypomethylated centromeric region was particularly intriguing. To test whether this was unique to the X or a general centromeric feature, the team manually assembled the centromere of chromosome 8 (D8Z2, ~2.02 Mb) and found a similar hypomethylated region, suggesting this may be a conserved feature marking functional centromeres.
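As a rough illustration of what finding a hypomethylated region involves, the sketch below scans a methylation profile with a sliding window and reports stretches where the average drops below a threshold. This is not the method used in the paper: the profile values, the window size, the threshold, and the function name `low_methylation_regions` are all hypothetical.

```python
def low_methylation_regions(
    fractions: list[float], window: int, threshold: float
) -> list[tuple[int, int]]:
    """Merge consecutive windows whose mean methylation is below `threshold`.

    Returns half-open (start, end) index ranges over `fractions`.
    """
    regions: list[tuple[int, int]] = []
    for start in range(len(fractions) - window + 1):
        mean = sum(fractions[start:start + window]) / window
        if mean < threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # This window overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions

# Invented per-site methylation fractions: high, then a dip, then high again.
profile = [0.9, 0.8, 0.9, 0.2, 0.1, 0.1, 0.2, 0.9, 0.8, 0.9]
dips = low_methylation_regions(profile, window=3, threshold=0.3)

print(dips)
```

In this toy profile the scan reports a single dip covering sites 3 through 6, the analogue of a hypomethylated pocket inside an otherwise methylated array.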
This 2020 achievement demonstrated that completing an entire human chromosome was possible and established the methods that would be used to complete all other chromosomes:
The complete X chromosome became the foundation for finishing the rest of the human genome.
Building on the success of the X chromosome, the T2T Consortium completed the remaining chromosomes to produce the full T2T-CHM13 genome. This assembly includes several types of regions that were missing or incomplete in the previous reference genome (GRCh38):
Figure: T2T-CHM13 Assembly Ideogram. (See Figure 1a) This ideogram shows what’s new in T2T-CHM13 compared to GRCh38. Red areas highlight newly added regions, including complete centromeres and the short arms of five acrocentric chromosomes (13, 14, 15, 21, and 22). The track at the top shows that CHM13 has primarily European genetic ancestry. Source: Nurk, S. et al. (2021). The complete sequence of a human genome. bioRxiv. https://doi.org/10.1101/2021.05.26.445798. License: CC-BY 4.0.
Let’s explore the main types of regions that T2T-CHM13 finally completed. Each was challenging to sequence for different reasons, and each plays an important role in how our cells work.
Imagine each chromosome as having a “waist”—a pinched middle section. This is the centromere, and it’s made of highly repetitive DNA called satellite arrays. In humans, the most common type is alpha satellite DNA, where a short sequence (about 171 base pairs) repeats hundreds or thousands of times in a row.
During cell division, chromosomes need to be pulled apart and distributed equally to daughter cells. The centromere acts as a handle where a protein complex called the kinetochore assembles. The kinetochore connects to spindle fibers, the molecular ropes that pull chromosomes to opposite ends of the dividing cell.
If centromeres don’t work properly, chromosomes can be distributed incorrectly, leading to cells with too many or too few chromosomes (called aneuploidy). This can cause serious problems: most aneuploid embryos don’t survive, and in cells that do survive, aneuploidy is linked to cancer and genetic disorders like Down syndrome.
T2T-CHM13 fully sequenced all centromeric regions, showing their complete structure for the first time. Each centromere has unique features:
Segmental duplications are large chunks of DNA—often thousands to millions of base pairs long—that appear in multiple places in the genome. These copies are usually 90–99% identical to each other. Think of them as long paragraphs that have been copied and pasted elsewhere in a document, with only minor edits.
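The "90–99% identical" figure refers to how many aligned positions match between two copies. A minimal sketch of that calculation is shown below, assuming the two copies are already aligned to the same length with no gaps. Real comparisons use full alignments, and the sequences here are invented.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100 * matches / len(a)

# Two invented 20-base copies that differ at a single position.
copy_1 = "ACGTACGTACGTACGTACGT"
copy_2 = "ACGTACGTACCTACGTACGT"

identity = percent_identity(copy_1, copy_2)
print(identity)
```

One mismatch in 20 bases gives 95% identity, which would fall inside the range typically used to describe segmental duplications.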
Segmental duplications are major drivers of genetic diversity and evolution. Because they’re nearly identical, they can accidentally pair up during DNA replication or recombination, creating structural variants—duplications, deletions, or rearrangements of DNA.
Some structural variants are beneficial. For example, having extra copies of the AMY1 gene (which makes an enzyme that digests starch) helps people digest starchy foods better—populations that historically ate more starch tend to have more AMY1 copies.
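But structural variants can also cause disease. For example, facioscapulohumeral muscular dystrophy (FSHD), a muscle-weakening disorder, is linked to a change in the copy number of a repeat array (a contraction of the D4Z4 array on chromosome 4), and many other genetic conditions trace back to duplications or deletions in repetitive regions.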
T2T-CHM13 added substantial amounts of segmental duplication sequence and corrected many errors in GRCh38. The complete assembly revealed that about 6.61% of the genome consists of segmental duplications (201.93 Mb), compared to only 5.00% that could be properly identified in GRCh38. This includes many medically relevant genes that were previously incomplete or incorrectly assembled.
Five of our chromosomes—numbers 13, 14, 15, 21, and 22—are called acrocentric because their centromeres are located very close to one end. This creates a short arm (called the “p” arm) and a much longer arm (the “q” arm).
These short arms contain clusters of ribosomal DNA (rDNA)—genes that encode ribosomal RNA (rRNA), a key component of ribosomes. Ribosomes are the molecular machines that read messenger RNA (mRNA) and build proteins. Each rDNA unit is about 45,000 base pairs long and is repeated dozens to hundreds of times.
Without ribosomes, cells can’t make proteins, and without proteins, life stops. The rDNA arrays on acrocentric chromosomes cluster together to form the nucleolus—a structure inside the nucleus where ribosomes are assembled.
Interestingly, the number of rDNA copies varies between individuals. CHM13 has about 400 copies, but other people might have more or fewer. Scientists are still investigating whether this variation affects how efficiently cells make proteins, and whether it influences traits like growth, aging, or susceptibility to diseases like cancer.
T2T-CHM13 provided the first complete view of the short arms of all five acrocentric chromosomes, totaling 66.1 Mb of new sequence. These regions follow a similar structure: from telomere to centromere, they contain distal repeat arrays, the rDNA array, and proximal repeat arrays including various satellite sequences.
Remarkably, these short arms show about 98.7% identity to each other, suggesting frequent exchange of DNA between them. This high similarity is probably because these chromosomes cluster together in the nucleolus during interphase.
Although T2T-CHM13 was a major achievement, it had one limitation: CHM13 is a 46,XX cell line, meaning it has no Y chromosome. The Y chromosome had been notoriously difficult to sequence because of its complex repeat structure, including long palindromes (sequences whose two arms are reverse complements of each other, so the region reads the same on both DNA strands), tandem repeats, and segmental duplications. In fact, more than half of the Y chromosome was missing from GRCh38.
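In DNA, a palindrome is usually defined on the double helix: the sequence equals its own reverse complement, so reading either strand in its natural direction gives the same text. The sketch below checks that property on an invented toy sequence. The palindromes on the Y chromosome are far longer, and this toy example has no spacer between its two arms.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def is_dna_palindrome(seq: str) -> bool:
    """True if the sequence reads the same on both strands."""
    return seq == reverse_complement(seq)

# Toy palindrome: one arm followed directly by its reverse complement.
left_arm = "GATTACA"
palindrome = left_arm + reverse_complement(left_arm)

print(palindrome)
print(is_dna_palindrome(palindrome))  # the two arms mirror each other across strands
print(is_dna_palindrome(left_arm))    # a single arm on its own does not
```

Because the two arms are near-identical copies in opposite orientation, reads from one arm can be mistaken for reads from the other, which is part of why such regions were hard to assemble.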
In 2023, the T2T Consortium completed this final piece of the puzzle by sequencing the Y chromosome from a different genome, HG002, which is commonly used for benchmarking (Rhie et al. 2023, Nature). The resulting assembly, called T2T-Y, is 62,460,029 base pairs long with no gaps—adding over 30 million base pairs of sequence compared to GRCh38-Y.
Figure: Complete Structure of the Human Y Chromosome. (See Figure 1a) This comprehensive view shows alignment of T2T-Y to GRCh38-Y, locations of protein-coding genes with clusters of ampliconic genes highlighted, organization of palindromic sequences and inverted repeats, and detailed structure of centromeric and satellite DNA regions. The bottom panel shows a dotplot revealing the highly repetitive nature of this chromosome. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
The Y chromosome carries genes critical for male development and fertility:
1. SRY: The master gene that determines male sex
2. Ampliconic genes: These genes exist in multiple copies and are important for sperm production:
3. AZF regions (Azoospermia Factors): Three regions (AZFa, AZFb, AZFc) where deletions can cause male infertility
Like other chromosomes, the Y has a centromere, but T2T-Y revealed unique features:
Figure: Structure of the T2T-Y Centromere. (See Figure 1b) The Y centromere spans 366 kb and consists of the DYZ3 alpha satellite array with three different variants, no transposable elements within the main array, two regions of hypomethylation where CENP-A proteins bind, and a dotplot showing that the repeat units are 99.5-100% identical, demonstrating extreme homogeneity. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
The centromere spans 366 kb and consists of highly similar alpha satellite repeats. Interestingly, the T2T-Y centromere shows two distinct hypomethylated regions where kinetochore proteins bind—a pattern also seen in some other chromosomes.
Perhaps the most mysterious part of the Y chromosome is Yq12, the large heterochromatic region on the long arm that was almost entirely missing from GRCh38 (represented as a single 30+ megabase gap). T2T-Y finally revealed what’s inside: over 30 million base pairs of alternating blocks of two satellite families:
Figure: The Mysterious Yq12 Region Revealed. (See Figure 1d) Fluorescence microscopy images show the Y chromosome labeled with different colored probes. A map shows 86 large blocks alternating between DYZ2 and DYZ1, with nearly all repeat units over 98% identical to the consensus. The sequence composition shows that DYZ2 contains an ancient AluY fragment. A phylogenetic tree shows that AluY fragments in HSat1B cluster together, suggesting this satellite family originated on the Y. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
These satellite blocks show evidence of recent duplication events—some duplications span up to 5 megabases and include multiple DYZ1 and DYZ2 blocks. Interestingly, HSat1B is almost unique to the Y chromosome and the short arms of acrocentric chromosomes, while HSat3 is found on many chromosomes.
Region | What It Is | Why It Matters | Why It Was Hard to Sequence | Size in T2T |
---|---|---|---|---|
Centromeric Satellite Arrays | Repetitive DNA (171-bp alpha satellite repeats) at the chromosome’s central “waist” | Acts as a handle for proteins to pull chromosomes apart during cell division; errors can cause aneuploidy, cancer, or developmental disorders | Short reads couldn’t distinguish between identical repeats; long reads (20–100+ kb) finally resolved them | Varies by chromosome; Y centromere is 366 kb; X centromere is 3.1 Mb |
Segmental Duplications | Large DNA chunks (thousands to millions of bp) copied across the genome with 90–99% similarity | Drive genetic diversity through structural variants; can affect traits (e.g., starch digestion) or cause disease (e.g., muscular dystrophy) | Near-identical sequences confused short-read technology; long reads distinguished them and revealed true variation | 6.61% of genome (201.93 Mb) vs 5.00% in GRCh38 |
Short Arms of Acrocentric Chromosomes | Repetitive rDNA arrays (45,000-bp units repeated dozens to hundreds of times) on chromosomes 13, 14, 15, 21, 22 | Encode ribosomal RNA for ribosomes; essential for protein synthesis; copy number varies between people (~400 in CHM13) | Repetitive arrays were indistinguishable with short reads; long reads and specialized assembly algorithms mapped complete copies | Total 66.1 Mb across all five chromosomes |
Y Chromosome Heterochromatin (Yq12) | Alternating blocks of HSat1B/DYZ2 and HSat3/DYZ1 satellite repeats | Unknown function, but shows recent structural rearrangements; highly variable between individuals | Extremely long tandem repeats spanning 30+ Mb were impossible to sequence with short reads | Over 30 Mb (was a single gap in GRCh38) |
Before the T2T project, these regions—centromeric satellite arrays, segmental duplications, acrocentric short arms, and the Y chromosome—were often called genomic “dark matter” because they remained largely invisible to sequencing technologies. They left massive gaps in earlier reference genomes like GRCh37 (hg19) and GRCh38 (hg38).
By using long-read sequencing technologies and innovative computational methods like marker-assisted polishing, the T2T Consortium finally illuminated these regions:
Together, these assemblies were combined to create T2T-CHM13v2.0 (also called T2T-CHM13+Y), providing the first truly complete sequence of a human genome—all 24 chromosomes from telomere to telomere with no gaps.
These advances improve our understanding of:
These insights are already improving variant calling (identifying genetic differences) in large-scale studies, supporting precision medicine by identifying disease-causing mutations, and helping us understand male infertility and other Y-linked conditions.
The T2T assemblies represent major advances, but the work continues. Because these genomes represent specific individuals (CHM13 and HG002), they don’t capture the full spectrum of human genetic diversity. That’s where projects like the Human Pangenome Reference Consortium (HPRC) come in—by assembling genomes from hundreds of individuals from different populations, researchers are building references that reflect the true diversity of humanity.
The complete human genome isn’t an ending—it’s a new beginning. With these complete references, we can now:
As sequencing technologies continue to improve and become more affordable, we’ll be able to study these complex regions in even more detail, uncovering new insights into human health, evolution, and what makes us human.