The power of single molecule real-time sequencing technology in the de novo assembly of a eukaryotic genome

◇◆丶佛笑我妖孽 submitted on 2019-12-13 02:04:42


Abstract

Second-generation sequencers (SGS) have been game-changing, achieving cost-effective whole genome sequencing in many non-model organisms. However, a large portion of these genomes still remains unassembled. We reconstructed the azuki bean (Vigna angularis) genome using single molecule real-time (SMRT) sequencing technology and achieved the best contiguity and coverage among currently assembled legume crops. Compared to the SGS-based assemblies, the SMRT-based assembly produced contigs 100 times longer and a 100-fold smaller amount of gaps. A detailed comparison between the assemblies revealed that the SMRT-based assembly enabled a more comprehensive gene annotation than the SGS-based assemblies, in which thousands of genes were missing or fragmented. A chromosome-scale assembly was generated based on a high-density genetic map, covering 86% of the azuki bean genome. We demonstrated that SMRT technology, though it still needed the support of SGS data, achieved a near-complete assembly of a eukaryotic genome.

Introduction

Genome projects used to consume large amounts of funding and labor. For example, the rice genome project [1] took 14 years and cost several hundred million dollars. The paradigm was changed by the advent of pyrosequencing [2] and Solexa sequencing [3] technologies. The high-throughput sequencing capacity of these second-generation sequencers (SGS) enabled the assembly of diploid plant genomes in much less time and at much lower cost [4].

However, the read length of SGS is not long enough to span repetitive sequences, which often comprise 50–80% of non-model plant genomes [5]. Although paired reads with long inserts can help resolve such repeats, missing and fragmented gene coding sequences have been reported [6,7]. Evolutionary studies based on such incorrect assemblies could therefore reach incorrect conclusions. Moreover, simple misassemblies or mis-scaffolding could be deleterious in map-based cloning. Read length is thus one of the most important factors in determining complete genome sequences.

The third-generation single molecule real-time (SMRT) sequencing platform [8] now successfully generates reads of 10 kb on average [9] and recently achieved an N50 of 4.3 Mb in assembling the haploid human genome [10].
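For readers unfamiliar with the N50 statistic quoted above: it is the length L such that contigs of length L or longer together make up at least half of the total assembly length. The following is a minimal sketch of that calculation, not code from the paper; the function name `n50` and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (illustrative helper)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating length until
    # the running sum covers at least half of the assembly.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), not data from the study.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

Because the statistic depends only on the sorted lengths, a few very long contigs raise N50 sharply, which is why longer reads that span repeats tend to improve it.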

In de novo assembly, where a reference genome is not available, having a high-density genetic linkage map is also important. To reconstruct pseudomolecules of chromosomes, the assembled contigs/scaffolds have to be placed according to the order of the marker loci. However, if the markers are not dense enough or not evenly distributed, a large portion of the assembly can remain unanchored. In many cases, only 30–60% of the genomes have been assigned to pseudomolecules [11–19].
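The anchoring idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the study: each marker is assumed to be a hypothetical `(contig, linkage_group, position_cM)` tuple, each contig is assigned to the linkage group holding most of its markers, and contigs are ordered by the median map position of those markers. All names and values are invented.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_contigs(markers, contig_lengths):
    """Order contigs along linkage groups and report the anchored fraction.

    markers: iterable of (contig, linkage_group, position_cM) tuples (assumed format).
    contig_lengths: dict mapping every contig name to its length in bp.
    """
    hits = defaultdict(list)
    for contig, group, cm in markers:
        hits[contig].append((group, cm))

    placed = defaultdict(list)
    for contig, contig_hits in hits.items():
        # Majority vote decides which linkage group the contig belongs to.
        group = Counter(g for g, _ in contig_hits).most_common(1)[0][0]
        cms = [cm for g, cm in contig_hits if g == group]
        placed[group].append((median(cms), contig))

    # Sort contigs within each group by their median map position.
    order = {g: [c for _, c in sorted(items)] for g, items in placed.items()}
    anchored_bp = sum(contig_lengths[c] for c in hits)
    fraction = anchored_bp / sum(contig_lengths.values())
    return order, fraction


# Hypothetical toy data: ctg3 carries no marker and stays unanchored.
contig_lengths = {"ctg1": 500, "ctg2": 300, "ctg3": 200}
markers = [
    ("ctg2", "LG1", 12.0),
    ("ctg1", "LG1", 3.5),
    ("ctg1", "LG1", 5.0),
    ("ctg1", "LG2", 40.0),
]
order, fraction = anchor_contigs(markers, contig_lengths)
print(order, fraction)
```

In this toy case a contig without any marker contributes to the unanchored portion, which illustrates why sparse or unevenly distributed markers leave much of an assembly outside the pseudomolecules.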

Here, we present a near-complete genome sequence of the azuki bean (Vigna angularis), the second-most important grain legume in East Asia [20]. Azuki bean is now bred extensively, with seed quality, cold tolerance and disease resistance as the main targets. However, the narrow genetic diversity of this domesticated species and the lack of high-quality genome sequences have limited this process. Although the species was recently sequenced, the draft assembly covered ~70% of the genome and only half of it was anchored onto pseudomolecules [14]. We therefore sequenced the azuki bean genome using SMRT sequencing technology, in addition to SGS.

We tested several assembly approaches and found that SMRT sequencing provided, by far, the best assembly. We also developed a high-density genetic map with evenly distributed markers, which we used not only for anchoring but also for evaluating the accuracy of the assemblies. In addition, we evaluated the genome assemblies of legume crops based on some of the criteria used in Assemblathon 2 [21].

