Reference genome

A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. Because they are assembled from the sequenced DNA of a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. For example, one of the most recent human reference genomes, assembly GRCh38/hg38, is derived from more than 60 genomic clone libraries.^[1] There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than in the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or the UCSC Genome Browser.^[2]

Properties of reference genomes

Measures of length

The length of a genome can be measured in several ways.

A simple way to measure genome length is to count the number of base pairs in the assembly.^[3]
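As a rough illustration of this base-pair count, the sketch below sums the sequence lengths in a FASTA-formatted assembly. It assumes the assembly is distributed as FASTA text, which is common but not stated above; the function name `fasta_lengths` and the toy sequences are hypothetical and do not come from any real genome or library.

```python
from io import StringIO


def fasta_lengths(handle):
    """Return {sequence_name: length} for a FASTA-formatted text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            # Sequence line: every character is counted as one base.
            lengths[name] += len(line)
    return lengths


# Toy two-sequence "assembly" (hypothetical data, not a real genome).
toy_fasta = StringIO(">chrA demo\nACGTACGT\nNNNNACGT\n>chrB demo\nGGCC\n")
lengths = fasta_lengths(toy_fasta)
total_bp = sum(lengths.values())
print(lengths, total_bp)
```

For a real assembly, the same function could be given an open file handle instead of the in-memory `StringIO` object. Note that this simple count treats every character, including placeholder `N` characters, as a base.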

The golden path is an alternative measure of length that omits redundant regions such as haplotypes and pseudoautosomal regions.^[4]^[5] It is usually constructed by layering sequencing information over a physical map to combine scaffold information. It is a "best estimate" of what the genome will look like and typically includes gaps, which makes it longer than the typical base pair assembly.^[6]
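The effect of gaps on a length figure can be shown with a small sketch. It is not a construction of the golden path itself; it only compares a length that includes gap placeholders with one that excludes them. It assumes gaps are written as runs of `N`, a common FASTA convention that the text above does not specify, and the function name and example sequence are hypothetical.

```python
def gapped_and_ungapped_length(sequence):
    """Return (length including gaps, length excluding gaps).

    Assumption: gaps are encoded as 'N' characters (case-insensitive).
    """
    seq = sequence.upper()
    total = len(seq)
    gaps = seq.count("N")
    return total, total - gaps


# Hypothetical scaffold: two sequenced stretches joined by a six-base gap.
scaffold = "ACGT" + "N" * 6 + "ACGTAC"
with_gaps, without_gaps = gapped_and_ungapped_length(scaffold)
print(with_gaps, without_gaps)
```

The two numbers differ by exactly the number of gap characters, which is why a measure that keeps gaps reports a longer genome than one that counts only sequenced bases.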

Other genomes

Since the Human Genome Project was finished, multiple international projects have started, focused on assembling reference genomes for many organisms. Model organisms (e.g., zebrafish (Danio rerio), chicken (Gallus gallus) and Escherichia coli) are of special interest to the scientific community, as are endangered species (e.g., the Asian arowana (Scleropages formosus) and the American bison (Bison bison)). As of August 2022, the NCBI database contains 71 886 partially or completely sequenced and assembled genomes from different species, including 676 mammals, 590 birds and 865 fishes. Also noteworthy are the genomes of 1796 insects, 3747 fungi, 1025 plants, 33 724 bacteria, 26 004 viruses and 2040 archaea.^[49] Many of these species have annotation data associated with their reference genomes that can be publicly accessed and visualized in genome browsers such as Ensembl and the UCSC Genome Browser.^[50]^[51]

Examples of these international projects include the Chimpanzee Genome Project, carried out between 2005 and 2013 jointly by the Broad Institute and the McDonnell Genome Institute of Washington University in St. Louis, which generated the first reference genomes for four subspecies of Pan troglodytes;^[52]^[53] the 100K Pathogen Genome Project, which started in 2012 with the main goal of creating a database of reference genomes for 100 000 pathogenic microorganisms for use in public health, outbreak detection, agriculture and the environment;^[54] and the Earth BioGenome Project, which started in 2018 and aims to sequence and catalog the genomes of all eukaryotic organisms on Earth to promote biodiversity conservation projects. Within this big-science project there are up to 50 smaller-scale affiliated projects, such as the Africa BioGenome Project and the 1000 Fungal Genomes Project.^[55]^[56]^[57]
