Journal of Northeastern University Natural Science ›› 2020, Vol. 41 ›› Issue (8): 1091-1096. DOI: 10.12068/j.issn.1005-3026.2020.08.005

• Information & Control •

A Web Spam Link Detection Method Based on Web Page Structure and Text Features

YANG Wang, JIANG Yong-han, ZHANG San-feng   

  1. School of Cyber Science, Southeast University, Nanjing 211189, China.
  • Received: 2019-09-28 Revised: 2019-09-28 Online: 2020-08-15 Published: 2020-08-28
  • Contact: YANG Wang

Abstract: Existing spam website detection methods mainly target self-built spam websites and are poorly suited to injected spam websites because their link detection is inefficient. This paper proposes a detection framework based on multi-dimensional features of web page structure and text. The framework first divides a web page into blocks. Content features are then extracted by calculating the odds ratio, and structural features based on tags, attribute keys, and attribute values are extracted using the one-hot rate. A detection model is trained with a suitable machine learning algorithm and used to detect spam links. Compared with algorithms based on content detection and on blacklist matching, the framework improves detection accuracy by up to 13%.
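
The content-feature step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the code assumes a standard smoothed log odds ratio computed over tokenized page blocks, and the function name `term_odds_ratios`, the `smoothing` parameter, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def term_odds_ratios(spam_blocks, normal_blocks, smoothing=0.5):
    """Log odds ratio of each term, spam blocks vs. normal blocks.

    Assumption: each block is a list of tokens, and a term is counted
    once per block (document frequency), with additive smoothing so
    that no cell of the 2x2 table is zero.
    """
    spam_df = Counter(t for block in spam_blocks for t in set(block))
    normal_df = Counter(t for block in normal_blocks for t in set(block))
    n_spam, n_normal = len(spam_blocks), len(normal_blocks)

    scores = {}
    for term in set(spam_df) | set(normal_df):
        a = spam_df[term] + smoothing               # spam blocks with the term
        b = n_spam - spam_df[term] + smoothing      # spam blocks without it
        c = normal_df[term] + smoothing             # normal blocks with the term
        d = n_normal - normal_df[term] + smoothing  # normal blocks without it
        scores[term] = math.log((a * d) / (b * c))
    return scores


# Toy, made-up blocks purely for illustration.
spam_blocks = [["casino", "bonus"], ["casino", "win"]]
normal_blocks = [["campus", "news"], ["campus", "bonus"]]
scores = term_odds_ratios(spam_blocks, normal_blocks)
```

Under these assumptions, terms that occur mostly in spam blocks receive positive scores, terms that occur mostly in normal blocks receive negative scores, and terms spread evenly across both score near zero. Such scores could then serve as content features alongside the structural features.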

Key words: web spam link detection, black SEO, one-hot rate, machine learning, URL injection
