东北大学学报(自然科学版) ›› 2022, Vol. 43 ›› Issue (3): 321-327.DOI: 10.12068/j.issn.1005-3026.2022.03.003

• 信息与控制 • 上一篇    下一篇

基于动态双重前缀的模糊相似性连接算法

于长永, 王雯函, 温秀静, 赵宇海   

  1. (东北大学 计算机科学与工程学院, 辽宁 沈阳110169)
  • 修回日期:2021-09-07 接受日期:2021-09-07 发布日期:2022-05-18
  • 通讯作者: 于长永
  • 作者简介:于长永(1981-),男,辽宁海城人,东北大学副教授.
  • 基金资助:
    国家自然科学基金资助项目(61772124).

Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes

YU Chang-yong, WANG Wen-han, WEN Xiu-jing, ZHAO Yu-hai   

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110169, China.
  • Revised:2021-09-07 Accepted:2021-09-07 Published:2022-05-18
  • Contact: WANG Wen-han
  • About author:-
  • Supported by:
    -

摘要: 针对相似性连接问题, 提出了动态双重前缀的模糊相似性连接算法.与之前的算法不同的是,本文采用双重前缀,即在查找候选以及构建索引时使用不同的前缀来提高过滤效率,并在此基础上进行了优化.首先通过取各个前缀生成的候选集合的交集来缩小候选集合;其次提出最大区分任选前缀,利用此前缀进行预验证来减少最终进入到验证过程的候选对,以此来减少连接时间.并且在三个真实数据集上进行实验,将本文算法与Silkmoth算法以及MF-Join算法进行比较,结果表明所提算法可以生成更小的候选集集合并且需要更少的连接时间.

关键词: 相似性连接;任选前缀;候选;前缀过滤;验证过程

Abstract: Focusing on the similarity join problem, a fuzzy similarity join algorithm was proposed based on dynamic double. The difference from the previous algorithms is that double prefixes are introduced, which improves the filtering efficiency when searching for candidates and building indexes due to the differences of prefixes. On this basis, optimization is realized. First, the candidate set is narrowed by taking the intersection of the candidate sets generated by each prefix. Afterwards, the maximum distinguishing arbitrary-selected prefix is proposed, and this prefix is used for pre-verification to reduce the final candidate pairs that enter the verification process, thereby reducing the join time. Experiments are conducted on three real datasets, and the proposed algorithm is compared with the Silkmoth and MF-Join. The results show that the proposed algorithm can generate a smaller set of candidate set and requires less join time.

Key words: similarity join; arbitrary-selected prefix; candidate; prefix filter; verification process

中图分类号: