When the Human Genome Project announced in 2003 that it had completed the first human genome, it was a momentous accomplishment – for the first time, the DNA blueprint of human life was unlocked. But it came with a catch – researchers weren’t actually able to put together all the genetic information in the genome. There were gaps: unfilled, often repetitive regions that were too confusing to piece together.
With advancements in technology that could handle these repetitive sequences, scientists finally filled those gaps in May 2021, and the first end-to-end human genome was officially published on Mar. 31, 2022.
I am a genome biologist who studies repetitive DNA sequences and how they shape genomes throughout evolutionary history. I was part of the team that helped characterize the repeat sequences missing from the genome. And now, with a truly complete human genome, these uncovered repetitive regions are finally being explored in full for the first time.
The missing puzzle pieces
German botanist Hans Winkler coined the word “genome” in 1920, combining the word “gene” with the suffix “-ome,” meaning “complete set,” to describe the full DNA sequence contained within each cell. Researchers still use this word a century later to refer to the genetic material that makes up an organism.
One way to describe what a genome looks like is to compare it to a reference book. In this analogy, a genome is an anthology containing the DNA instructions for life. It’s composed of a vast array of nucleotides (letters) that are packaged into chromosomes (chapters). Each chromosome contains genes (paragraphs), regions of DNA that code for the specific proteins that allow an organism to function.
While every living organism has a genome, the size of that genome varies from species to species. An elephant uses the same form of genetic information as the grass it eats and the bacteria in its gut. But no two genomes look exactly alike. Some are short, like the genome of the insect-dwelling bacterium Nasuia deltocephalinicola, with just 137 genes across 112,000 nucleotides. Some, like the 149 billion nucleotides of the flowering plant Paris japonica, are so long that it’s difficult to get a sense of how many genes are contained within.
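To get a feel for that range, the figures above can be turned into a quick back-of-the-envelope calculation. The short Python sketch below is only an illustration: the function names are made up for this example, and the per-gene average is a crude measure that lumps in any DNA lying between genes.

```python
# Figures quoted in the text above.
NASUIA_NUCLEOTIDES = 112_000
NASUIA_GENES = 137
PARIS_JAPONICA_NUCLEOTIDES = 149_000_000_000


def nucleotides_per_gene(total_nucleotides: int, gene_count: int) -> float:
    """Average stretch of genome per gene (hypothetical helper).

    This is a rough average, not a gene length: it counts every
    nucleotide in the genome, including any that sit between genes.
    """
    return total_nucleotides / gene_count


def size_ratio(larger_genome: int, smaller_genome: int) -> float:
    """How many times longer one genome is than another (hypothetical helper)."""
    return larger_genome / smaller_genome


if __name__ == "__main__":
    per_gene = nucleotides_per_gene(NASUIA_NUCLEOTIDES, NASUIA_GENES)
    ratio = size_ratio(PARIS_JAPONICA_NUCLEOTIDES, NASUIA_NUCLEOTIDES)
    print(f"Nasuia: about {per_gene:.0f} nucleotides per gene")
    print(f"Paris japonica is about {ratio:,.0f} times longer than Nasuia")
```

By this rough measure, Nasuia averages a little over 800 nucleotides per gene, and the Paris japonica genome is more than a million times longer.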
But genes as they’ve traditionally been understood – as stretches of DNA that code for proteins – are just a small part of an organism’s genome. In fact, they make up less than 2% of human DNA.
The human genome contains roughly 3 billion nucleotides and just under 20,000 protein-coding genes – an estimated 1% of the genome’s total length. The remaining 99% is non-coding DNA sequences that don’t produce proteins. Some are regulatory components that work as a…