Journal of Northeastern University (Natural Science), 2022, Vol. 43, Issue 3: 321-327. DOI: 10.12068/j.issn.1005-3026.2022.03.003

• Information & Control •

Fuzzy Similarity Join Algorithm Based on Dynamic Double Prefixes

YU Chang-yong, WANG Wen-han, WEN Xiu-jing, ZHAO Yu-hai   

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110169, China.
  • Revised: 2021-09-07 Accepted: 2021-09-07 Published: 2022-05-18
  • Contact: WANG Wen-han

Abstract: To address the similarity join problem, a fuzzy similarity join algorithm based on dynamic double prefixes is proposed. Unlike previous algorithms, it introduces double prefixes; because the two prefixes differ, filtering becomes more effective both when building indexes and when searching for candidates. Two optimizations build on this. First, the candidate set is narrowed by intersecting the candidate sets generated by each prefix. Second, a maximum-distinguishing arbitrary-selected prefix is proposed and used for pre-verification, which reduces the number of candidate pairs that enter the verification process and thereby the join time. Experiments on three real datasets compare the proposed algorithm with Silkmoth and MF-Join. The results show that the proposed algorithm generates a smaller candidate set and requires less join time.
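
The idea of intersecting the candidate sets produced by two prefixes can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes plain Jaccard similarity over token sets rather than fuzzy matching, and it assumes the two prefixes come from two different global token orderings (rarest-first and alphabetical). All function names and the choice of orderings are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)


def prefix(tokens, order, threshold):
    """Classic prefix filter: keep the first |x| - ceil(t*|x|) + 1 tokens
    of the record under a fixed global ordering."""
    ranked = sorted(tokens, key=order)
    # The small epsilon guards against float round-up shortening the prefix.
    keep = len(ranked) - ceil(threshold * len(ranked) - 1e-9) + 1
    return set(ranked[:keep])


def candidates(records, order, threshold):
    """Build an inverted index on prefix tokens and return every pair of
    record ids that shares at least one prefix token."""
    index = defaultdict(set)
    for rid, tokens in enumerate(records):
        for tok in prefix(tokens, order, threshold):
            index[tok].add(rid)
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


def double_prefix_join(records, threshold):
    """Assumed 'double prefix' scheme: a pair must survive the prefix
    filter under BOTH orderings, so the two candidate sets are intersected
    before exact verification."""
    freq = Counter(tok for rec in records for tok in rec)

    def rare_first(tok):          # hypothetical ordering 1
        return (freq[tok], tok)

    def alphabetical(tok):        # hypothetical ordering 2
        return tok

    cand = (candidates(records, rare_first, threshold)
            & candidates(records, alphabetical, threshold))
    return sorted(p for p in cand
                  if jaccard(records[p[0]], records[p[1]]) >= threshold)


records = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y", "z"},
    {"a", "b", "c", "d", "e"},
]
result = double_prefix_join(records, 0.6)
```

Each prefix filter on its own is a necessary condition for a pair to reach the threshold, so the intersection can only discard pairs that would fail verification anyway; under these assumptions it never drops a true result.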

Key words: similarity join; arbitrary-selected prefix; candidate; prefix filter; verification process
