东北大学学报:自然科学版 ›› 2018, Vol. 39 ›› Issue (7): 959-963.DOI: 10.12068/j.issn.1005-3026.2018.07.010

• 信息与控制 • 上一篇    下一篇

一种带有长度和位置约束的字符串索引方法

于长永1, 高明1, 柏禄一1, 赵宇海2   

  1. (1. 东北大学秦皇岛分校 计算机与通信工程学院, 河北 秦皇岛066004; 2. 东北大学 计算机科学与工程学院, 辽宁 沈阳110169)
  • 收稿日期:2017-05-24 修回日期:2017-05-24 出版日期:2018-07-15 发布日期:2018-07-11
  • 通讯作者: 于长永
  • 作者简介:于长永(1981-),男,辽宁海城人,东北大学副教授.冯明杰(1971-), 男, 河南禹州人, 东北大学副教授; 王恩刚(1962-), 男, 辽宁沈阳人, 东北大学教授,博士生导师.
  • 基金资助:
    国家自然科学基金资助项目(61772124,61332014,61401080,61402087); 河北省自然科学基金资助项目(F2015501049); 河北省教育厅项目(QN2014339); 中央高校基本科研业务费专项资金资助项目(N150402002).国家自然科学基金资助项目(51171041).

A String Collection Indexing Method with String Length and Position Constraint

YU Chang-yong1, GAO Ming1, BAI Lu-yi1, ZHAO Yu-hai2   

  1. 1. School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, China; 2. School of Computer Science & Engineering, Northeastern University, Shenyang 110169, China.
  • Received:2017-05-24 Revised:2017-05-24 Online:2018-07-15 Published:2018-07-11
  • Contact: YU Chang-yong
  • About author:-
  • Supported by:
    -

摘要: 提出了一种基于BWT(Burrows-wheeler-transform)的字符串集合的索引方法,以解决带有匹配字符串长度和匹配子串位置约束的子串确切匹配查找问题.讨论了BWT和基于BWT索引进行确切子串查找的基本原理.分析了字符串集合、匹配字符串长度和匹配子串位置约束对原BWT索引的影响.重点解决了快速地从匹配后缀位置到字符串ID和匹配子串位置的计算问题.在3个真实的数据集上进行了比对实验,结果表明:所提出的基于BWT索引方法在没有增加原索引大小的情况下,大大提升了带有匹配字符串长度和匹配位置约束的确切子串的查找的性能,因此该算法更加适用于大规模的字符串集合的索引进行近似字符串匹配和连接.

关键词: BWT, 字符串索引, 倒排链表, 字符串近似匹配, 序列比对

Abstract: An index method of string collection was proposed based on BWT (Burrows-wheeler-transform) for solving the exact substring queries with string length and matching position constraints. Firstly, the BWT and exact string query based on it were discussed. Then the impact of string collection, string length and substring position upon the original BWT index was analyzed. Finally, the fast calculation problem was discussed and solved from the position of the matching suffix to the string ID and position on the string of the matching substring. The approximate string matching was conducted on three real string collections and compared the results of index method proposed and the original one. The experimental results showed that the method proposed based on BWT speeded up the process of exact substring queries with string length and matching position constraints considerably in the case of keeping the index size. Therefore, the proposed method was suitable for indexing large-scale string collection for string similarity match and joint.

Key words: BWT (Burrows-Wheeler-transform), string index, inverted list, string similarity match, sequence alignment

中图分类号: