东北大学学报(自然科学版) ›› 2023, Vol. 44 ›› Issue (6): 770-776.DOI: 10.12068/j.issn.1005-3026.2023.06.002

• 信息与控制 • 上一篇    下一篇

一种针对大规模Read Mapping的高效DBG索引方法

于长永, 李俊杰, 马海涛, 赵宇海   

  1. (东北大学 计算机科学与工程学院, 辽宁 沈阳110169)
  • 发布日期:2023-06-20
  • 通讯作者: 于长永
  • 作者简介:于长永(1981-),男,辽宁海城人,东北大学副教授.
  • 基金资助:
    国家自然科学基金资助项目(61772124).

An Efficient DBG Indexing Method for Large-Scale Read Mapping

YU Chang-yong, LI Jun-jie, MA Hai-tao, ZHAO Yu-hai   

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110169, China.
  • Published:2023-06-20
  • Contact: LI Jun-jie
  • About author:-
  • Supported by:
    -

摘要: 为了能够回答生物信息学中关于de Bruijn graph(DBG)的两个问题——①对于任意的k-mer,回答其是否为DBG的顶点,②对于DBG的任意顶点,回答其邻接信息(入边和出边),提出了一种针对大规模read mapping的高效DBG索引方法.本文将以上两个问题转化为非重复多路径上的k-mer和(k+1)-mer的确切查找问题,并利用FM-index进行解决.首先,对给定的参考序列进行压缩,即非重复多路径的发现,从而压缩了序列中大量存在的重复(k+1)-mer.其次,基于非重复多路径FM-index对DBG进行索引.查找k-mer是否出现在DBG上,若找到,给出该k-mer的直接前驱和直接后继结点,从而提高时空效率.最后,在62种大肠杆菌菌株的基因组上进行实验.实验结果表明,所提出的方法可以高效地对多参考序列的DBG进行索引.

关键词: de Bruijn graph;索引;read mapping(序列映射);FM-index;参考序列

Abstract: In order to answer two questions about de Bruijn graph(DBG) in bioinformatics: ① For any k-mer, answer whether it is the vertex of DBG; ② For any vertex of DBG, answer its adjacency information(incoming edge and outgoing edge), an efficient DBG indexing method for large-scale read mapping is proposed. In this paper, the above two questions are transformed into the exact finding problem of k-mer and (k+1)-mer on non-repetitive multipath, and are solved by using FM-index. Firstly, a given reference sequence is compressed, which is the discovery of non-repetitive multipaths, thereby compressing the presence of large-scale repetitive (k+1)-mers in the sequence. Secondly, the DBG is indexed based on the non-repetitive multipath FM-index. The k-mer is searched whether it appears on the DBG, and if so, the direct predecessor and direct successor nodes of the k-mer are provided to improve the spatial-temporal efficiency. Finally, experiments are performed on the genomes of 62 E. coli strains. The experimental results show that the method proposed can efficiently index the DBG of multi-reference sequences.

Key words: de Bruijn graph; index; read mapping; FM-index; reference sequence

中图分类号: