When the Human Genome Project (HGP) announced its completion in 2003, it was celebrated as a historic milestone. Yet the “complete” human genome wasn’t actually complete. About 8% of our genome remained unsequenced — not because scientists didn’t want to sequence it, but because the technology at the time simply couldn’t handle certain regions.
Early sequencing technologies read short fragments of DNA (around 100-200 base pairs), which worked well for most of the genome. But in highly repetitive regions — where the same sequence appears over and over — these short reads couldn’t distinguish one repeat from another, like trying to place dozens of identical puzzle pieces without knowing where each belongs.
The missing regions weren’t random leftovers. They included some of the most functionally important parts of our chromosomes: centromeres (the central structures that help separate chromosomes during cell division), telomeres (the protective caps at chromosome ends), and ribosomal DNA arrays (genes that make the machinery for protein production). These gaps limited our understanding of chromosome stability, cell division, and even how genetic diseases develop.

Figure 2-1: Why Short-Read Sequencing Failed in Repetitive Regions. Short DNA reads (100-200 bp) cannot resolve highly repetitive genomic regions because they cannot distinguish between nearly identical repeat units. Long reads (20 kb to >100 kb) can span multiple repeats in a single read, enabling unambiguous placement and assembly of these challenging regions.
In 2022, the Telomere-to-Telomere (T2T) Consortium changed this. They produced the first truly complete human genome sequence, called T2T-CHM13, covering all 3.055 billion base pairs from one end (telomere) to the other end (telomere) of every chromosome — hence the name. This achievement added about 200 million base pairs of new sequence and revealed 1,956 previously unknown gene predictions, 99 of which are predicted to be protein coding (Nurk et al. 2022, Science).

Figure 2-2: The Telomere-to-Telomere Assembly Concept. The name “telomere-to-telomere” refers to complete, continuous DNA sequences spanning from one chromosome end (telomere) through the centromere to the other telomere, with no gaps in repetitive regions. Previous reference genomes contained numerous gaps, particularly in centromeres and other repetitive regions, while T2T assemblies provide uninterrupted sequences.
Two major advances enabled the T2T project to succeed where earlier efforts had failed: new sequencing technologies and a special cell line.
The T2T Consortium combined several cutting-edge technologies:
PacBio HiFi sequencing: This technology reads much longer DNA fragments — around 20,000 base pairs (20 kb) — with high accuracy. These longer reads can span multiple repeats, making it possible to distinguish one repeat from another.
Oxford Nanopore ultra-long-read sequencing: These reads can exceed 100,000 base pairs, allowing scientists to read through even longer repetitive regions in a single pass.
Supporting methods: Additional techniques like Illumina short-read sequencing (for error correction), Hi-C (which maps how DNA folds in 3D space), Bionano optical mapping (which creates a physical map of the genome), and Strand-seq (which tracks which DNA strands came from each parent) all helped ensure accuracy.
The assembly process used a high-resolution string graph — a powerful computational approach that helped resolve the most complex repetitive regions (see Figure 2-4).

Figure 2-4: CHM13 String Graph. This diagram shows how the T2T-CHM13 genome was assembled using a string graph approach. Each line represents a DNA sequence, and connections show where sequences overlap. The tangled regions reveal highly repetitive areas like the ribosomal DNA arrays and centromeric satellites. Source: Nurk, S. et al. (2021). The complete sequence of a human genome. bioRxiv. https://doi.org/10.1101/2021.05.26.445798. License: CC-BY 4.0.
Unlike traditional linear reference genomes that represent a single sequence, a string graph represents DNA as a network where nodes are DNA sequences and edges show how they connect. Like a subway map showing multiple routes between stations, the graph displays all possible connections, and long reads help determine the correct path through repetitive regions.
This approach was critical for handling repeats: when you have many identical DNA sequences (like in centromeres), a linear approach gets confused about which piece connects where. The graph shows all possibilities, and ultra-long reads spanning the repeats reveal the correct connections. The final T2T-CHM13 assembly achieved an exceptionally high accuracy with an error rate of just 1 mistake per 10 million bases.
While T2T-CHM13 is presented as a linear sequence (because it represents one individual), the future of genomics is moving toward pangenome references that better capture human genetic diversity across populations.
The T2T project used DNA from a unique cell line called CHM13, which came from a complete hydatidiform mole — a rare type of tissue that forms when an egg without genetic material is fertilized by a sperm that duplicates its own genome. This means CHM13 has two identical copies of each chromosome from the father and none from the mother.
In a typical diploid genome, the two sets of chromosomes inherited from each parent make assembly more complex because similar but not identical sequences must be distinguished. CHM13’s duplicated genome simplified this problem: matching sequences were known to be identical. However, CHM13 lacked a Y chromosome, which remained the final missing piece.

Figure 2-3: The CHM13 Advantage for Genome Assembly. In typical diploid genomes, assembly must distinguish between two similar but not identical chromosome copies (haplotypes) from each parent, creating complexity in repetitive regions. CHM13, derived from a complete hydatidiform mole, contains two identical copies of each chromosome, effectively creating a haploid-like assembly problem that greatly simplifies the resolution of complex repetitive sequences.
While the full T2T-CHM13 genome was announced in 2022, the project achieved its first major milestone in 2020 by completing the human X chromosome — the first chromosome to be assembled telomere-to-telomere with no gaps (Miga et al. 2020, Nature).
The research team chose the X chromosome as their first target for several strategic reasons:

Figure 2-5: Initial CHM13 X Chromosome Assembly. (See Figure 1b) The X chromosome was initially broken at three locations: the centromere (artificially collapsed in the assembly), a 120-kb segmental duplication (DMRTC1B), and a 134-kb segmental duplication with a paralogue on chromosome 2. Black bars indicate gaps in the GRCh38 reference, and red bars show known segmental duplications. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The two segmental duplications were resolved by finding ultra-long reads that completely spanned the repeats and were uniquely anchored on both sides. This allowed confident placement in the assembly.
The centromere presented a greater challenge — it’s a 3.1 Mb array of highly repetitive alpha satellite DNA where standard polishing methods failed. Initial attempts to polish the centromeric assembly actually decreased quality because reads were incorrectly placed due to the extreme repetitiveness.
The team developed a novel marker-assisted polishing strategy to finish the large repetitive regions:
1. Catalogue unique markers: They identified short (21 bp), unique sequences present only once in the genome. Remarkably, even within the DXZ1 centromeric array, there was enough variation between repeat copies to create unique markers at semi-regular intervals.
Genome-wide: average spacing = 66 bp between markers
DXZ1 array: average spacing = 2.3 kb between markers
Longest gap in DXZ1: 42 kb
2. Anchor reads precisely: These markers guided the correct placement of long reads during polishing, preventing the quality degradation that occurred with standard methods.
3. Iterative polishing: Multiple rounds of polishing were performed with Nanopore, then PacBio, then Illumina data, each improving accuracy.

Figure 2-6: Marker-Assisted Polishing Improves Assembly Quality. (See Figure 1d) Example from the GAGE locus showing coverage depth before (top) and after (bottom) marker-assisted polishing. Black dots indicate primary allele coverage, red dots show secondary alleles. The uniform coverage and elimination of secondary alleles after polishing demonstrates the dramatic quality improvement achieved by this method. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The 3.1 Mb DXZ1 centromeric array consists of approximately 1,408 copies of a ~2,057-bp “higher-order repeat” (HOR) unit. This canonical repeat is made of 12 divergent alpha satellite monomers (each ~171 bp) arranged in a specific order. The team validated this structure through multiple independent methods:
Experimental validation:
Sequence-based validation:

Figure 2-7: Complete Structure of the 3.1 Mb X Centromere. The DXZ1 array consists of approximately 1,408 copies of a ~2-kb higher-order repeat unit. (a) Predicted restriction map showing the array structure. (b) Experimental PFGE Southern blots matching the in silico prediction. (c) ddPCR copy number estimates across multiple cell lines. (d) Catalogue of 33 structural variants identified within the array. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The assembly catalogued 33 different structural variants within the DXZ1 array, including:
This level of detail was impossible with previous technologies and provides insights into centromere evolution and function.
The complete X chromosome assembly:
The precisely anchored ultra-long reads enabled chromosome-wide methylation mapping at single-base resolution — even across complex repeats that were previously invisible to methylation studies. This revealed several surprising patterns:

Figure 2-8: Methylation Patterns Across the Complete X Chromosome. Nanopore sequencing captured methylation patterns across the entire chromosome. (a) Hypomethylation in PAR1, with detailed view showing unmethylated bases (blue) and methylated bases (red). (b) A 93-kb hypomethylated region within the DXZ1 centromere. (c) The DXZ4 macrosatellite array showing a transition from methylated to unmethylated regions. Source: Miga, K.H. et al. (2020). Telomere-to-telomere assembly of a complete human X chromosome. Nature, 585, 79-84. https://doi.org/10.1038/s41586-020-2547-7. License: CC-BY 4.0.
The hypomethylated centromeric region was particularly intriguing. To test whether this was unique to the X or a general centromeric feature, the team manually assembled the centromere of chromosome 8 (D8Z2, ~2.02 Mb) and found a similar hypomethylated region, suggesting this may be a conserved feature marking functional centromeres.
This 2020 achievement demonstrated that completing an entire human chromosome was possible and established the methods that would be used to complete all other chromosomes:
The complete X chromosome became the foundation for finishing the rest of the human genome.
Building on the success of the X chromosome, the T2T Consortium completed the remaining chromosomes to produce the full T2T-CHM13 genome. This assembly includes several types of regions that were missing or incomplete in the previous reference genome (GRCh38):

Figure 2-9: Genomic Regions Revealed by T2T Assemblies. The T2T-CHM13 genome revealed four major types of previously missing or incomplete regions: (1) Complete centromeric satellite arrays essential for chromosome segregation, (2) Segmental duplications driving structural variation and evolution, (3) Full ribosomal DNA arrays on acrocentric chromosome short arms required for ribosome biogenesis, and (4) The previously unsequenced Y chromosome heterochromatic region (Yq12) containing over 30 Mb of satellite DNA.

Figure 2-10: T2T-CHM13 Assembly Ideogram. (See Figure 1a) This ideogram shows what’s new in T2T-CHM13 compared to GRCh38. Red areas highlight newly added regions, including complete centromeres and the short arms of five acrocentric chromosomes (13, 14, 15, 21, and 22). The track at the top shows that CHM13 has primarily European genetic ancestry. Source: Nurk, S. et al. (2021). The complete sequence of a human genome. bioRxiv. https://doi.org/10.1101/2021.05.26.445798. License: CC-BY 4.0.
T2T-CHM13 completed several major types of previously missing regions. Each presented unique sequencing challenges and plays distinct roles in cellular function.
The centromere is the pinched middle section of each chromosome, made of highly repetitive DNA called satellite arrays. In humans, the most common type is alpha satellite DNA, where a short sequence (about 171 base pairs) repeats hundreds or thousands of times in tandem.
During cell division, chromosomes must be pulled apart and distributed equally to daughter cells. The centromere serves as the attachment site for kinetochores, protein complexes that connect to spindle fibers and enable chromosome segregation.
Centromere dysfunction leads to chromosome missegregation, resulting in aneuploidy — cells with abnormal chromosome numbers. Most aneuploid embryos are not viable, and aneuploidy in surviving cells is associated with cancer and genetic disorders including Down syndrome.
T2T-CHM13 fully sequenced all centromeric regions, showing their complete structure for the first time. Each centromere has unique features:
Segmental duplications are large DNA sequences — typically thousands to millions of base pairs long — that appear in multiple genomic locations with 90-99% sequence identity.
Segmental duplications drive genetic diversity and evolution through non-allelic homologous recombination. Their high sequence similarity enables mispairing during DNA replication or recombination, generating structural variants — duplications, deletions, or rearrangements.
Some structural variants confer adaptive advantages. Copy number variation in the AMY1 gene, which encodes salivary amylase, correlates with dietary starch intake across human populations.
However, structural variants also cause disease. Certain duplications are linked to facioscapulohumeral muscular dystrophy (FSHD), a muscle-weakening disorder, and many other genetic conditions.
T2T-CHM13 added substantial amounts of segmental duplication sequence and corrected many errors in GRCh38. The complete assembly revealed that about 6.61% of the genome consists of segmental duplications (201.93 Mb), compared to only 5.00% that could be properly identified in GRCh38. This includes many medically relevant genes that were previously incomplete or incorrectly assembled.
Chromosomes 13, 14, 15, 21, and 22 are acrocentric, with centromeres positioned near one end, creating a short p arm and a much longer q arm.
The short arms contain tandem arrays of ribosomal DNA (rDNA) — genes encoding ribosomal RNA (rRNA), an essential structural and catalytic component of ribosomes. Each rDNA unit spans approximately 45,000 base pairs and is repeated dozens to hundreds of times.
Ribosomes are essential for protein synthesis. The rDNA arrays on acrocentric chromosomes cluster to form the nucleolus, the subnuclear compartment where ribosome biogenesis occurs.
rDNA copy number varies substantially between individuals (CHM13 contains approximately 400 copies). Whether this variation influences protein synthesis efficiency or phenotypes such as growth, aging, or cancer susceptibility remains under investigation.
T2T-CHM13 provided the first complete view of the short arms of all five acrocentric chromosomes, totaling 66.1 Mb of new sequence. These regions follow a similar structure: from telomere to centromere, they contain distal repeat arrays, the rDNA array, and proximal repeat arrays including various satellite sequences.
These short arms share approximately 98.7% sequence identity, consistent with frequent intergenic exchange. This homogenization likely reflects the physical clustering of acrocentric chromosomes in the nucleolus during interphase.
Although T2T-CHM13 was a major achievement, it had one limitation: CHM13 is a 46,XX cell line, meaning it has no Y chromosome. The Y chromosome had been notoriously difficult to sequence because of its complex repeat structure, including long palindromes (sequences that read the same forwards and backwards), tandem repeats, and segmental duplications. In fact, more than half of the Y chromosome was missing from GRCh38.
In 2023, the T2T Consortium completed this final piece of the puzzle by sequencing the Y chromosome from a different genome, HG002, which is commonly used for benchmarking (Rhie et al. 2023, Nature). The resulting assembly, called T2T-Y, is 62,460,029 base pairs long with no gaps — adding over 30 million base pairs of sequence compared to GRCh38-Y.

Figure 2-11: Complete Structure of the Human Y Chromosome. (See Figure 1a) This comprehensive view shows alignment of T2T-Y to GRCh38-Y, locations of protein-coding genes with clusters of ampliconic genes highlighted, organization of palindromic sequences and inverted repeats, and detailed structure of centromeric and satellite DNA regions. The bottom panel shows a dotplot revealing the highly repetitive nature of this chromosome. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
The Y chromosome carries genes critical for male development and fertility:
1. SRY: The master gene that determines male sex
2. Ampliconic genes: These genes exist in multiple copies and are important for sperm production:
3. AZF regions (Azoospermia Factors): Three regions (AZFa, AZFb, AZFc) where deletions can cause male infertility
Like other chromosomes, the Y has a centromere, but T2T-Y revealed unique features:

Figure 2-12: Structure of the T2T-Y Centromere. (See Figure 1b) The Y centromere spans 366 kb and consists of the DYZ3 alpha satellite array with three different variants, no transposable elements within the main array, two regions of hypomethylation where CENP-A proteins bind, and a dotplot showing that the repeat units are 99.5-100% identical, demonstrating extreme homogeneity. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
The centromere spans 366 kb and consists of highly similar alpha satellite repeats. Interestingly, the T2T-Y centromere shows two distinct hypomethylated regions where kinetochore proteins bind — a pattern also seen in some other chromosomes.
Perhaps the most mysterious part of the Y chromosome is Yq12, the large heterochromatic region on the long arm that was almost entirely missing from GRCh38 (represented as a single 30+ megabase gap). T2T-Y finally revealed what’s inside: over 30 million base pairs of alternating blocks of two satellite families:

Figure 2-13: The Mysterious Yq12 Region Revealed. (See Figure 1d) Fluorescence microscopy images show the Y chromosome with different colored probes. A map showing 86 large blocks alternating between DYZ2 and DYZ1, with nearly all repeat units being over 98% identical to consensus. Sequence composition shows DYZ2 contains an ancient AluY fragment. A phylogenetic tree shows that AluY fragments in HSat1B cluster together, suggesting this satellite family originated on the Y. Source: Rhie, A. et al. (2022). The complete sequence of a human Y chromosome. bioRxiv. https://doi.org/10.1101/2022.12.01.518724. License: CC0 (US Government work).
These satellite blocks show evidence of recent duplication events — some duplications span up to 5 megabases and include multiple DYZ1 and DYZ2 blocks. Interestingly, HSat1B is almost unique to the Y chromosome and the short arms of acrocentric chromosomes, while HSat3 is found on many chromosomes.
| Region | What It Is | Why It Matters | Why It Was Hard to Sequence | Size in T2T |
|---|---|---|---|---|
| Centromeric Satellite Arrays | Repetitive DNA (171-bp alpha satellite repeats) at the chromosome’s central “waist” | Acts as a handle for proteins to pull chromosomes apart during cell division; errors can cause aneuploidy, cancer, or developmental disorders | Short reads couldn’t distinguish between identical repeats; long reads (20-100+ kb) finally resolved them | Varies by chromosome; Y centromere is 366 kb; X centromere is 3.1 Mb |
| Segmental Duplications | Large DNA chunks (thousands to millions of bp) copied across the genome with 90-99% similarity | Drive genetic diversity through structural variants; can affect traits (e.g., starch digestion) or cause disease (e.g., muscular dystrophy) | Near-identical sequences confused short-read technology; long reads distinguished them and revealed true variation | 6.61% of genome (201.93 Mb) vs 5.00% in GRCh38 |
| Short Arms of Acrocentric Chromosomes | Repetitive rDNA arrays (45,000-bp units repeated dozens to hundreds of times) on chromosomes 13, 14, 15, 21, 22 | Encode ribosomal RNA for ribosomes; essential for protein synthesis; copy number varies between people (~400 in CHM13) | Repetitive arrays were indistinguishable with short reads; long reads and specialized assembly algorithms mapped complete copies | Total 66.1 Mb across all five chromosomes |
| Y Chromosome Heterochromatin (Yq12) | Alternating blocks of HSat1B/DYZ2 and HSat3/DYZ1 satellite repeats | Unknown function, but shows recent structural rearrangements; highly variable between individuals | Extremely long tandem repeats spanning 30+ Mb were impossible to sequence with short reads | Over 30 Mb (was a single gap in GRCh38) |
Before the T2T project, these regions — centromeric satellite arrays, segmental duplications, acrocentric short arms, and the Y chromosome — were often called genomic “dark matter” because they remained largely invisible to sequencing technologies. They left massive gaps in earlier reference genomes like GRCh37 (hg19) and GRCh38 (hg38).
By using long-read sequencing technologies and innovative computational methods like marker-assisted polishing, the T2T Consortium finally illuminated these regions:
Together, these assemblies were combined to create T2T-CHM13v2.0 (also called T2T-CHM13+Y), providing the first truly complete sequence of a human genome — all 24 chromosomes from telomere to telomere with no gaps.
These advances improve our understanding of:
These insights are already improving variant calling (identifying genetic differences) in large-scale studies, supporting precision medicine by identifying disease-causing mutations, and helping us understand male infertility and other Y-linked conditions.
The T2T assemblies represent major advances, but the work continues. Because these genomes represent specific individuals (CHM13 and HG002), they don’t capture the full spectrum of human genetic diversity. That’s where projects like the Human Pangenome Reference Consortium (HPRC) come in — by assembling genomes from hundreds of individuals from different populations, researchers are building references that reflect the true diversity of humanity.
The complete human genome isn’t an ending — it’s a new beginning. With these complete references, we can now:
As sequencing technologies continue to improve and become more affordable, we’ll be able to study these complex regions in even more detail, uncovering new insights into human health, evolution, and what makes us human.