Journal of Northeastern University (Natural Science) ›› 2020, Vol. 41 ›› Issue (8): 1091-1096. DOI: 10.12068/j.issn.1005-3026.2020.08.005

• Information and Control •

A Web Spam Link Detection Method Based on Web Page Structure and Text Features

YANG Wang, JIANG Yong-han, ZHANG San-feng

  1. School of Cyber Science, Southeast University, Nanjing 211189, China.
  • Received: 2019-09-28 Revised: 2019-09-28 Online: 2020-08-15 Published: 2020-08-28
  • Contact: YANG Wang
  • About author: YANG Wang (1979-), male, born in Xuancheng, Anhui; lecturer at Southeast University, Ph.D.
  • Supported by:
    National Key Research and Development Program of China (2017YFB0801703); National Natural Science Foundation of China (61602114).

Abstract: Existing spam website detection methods mainly target self-built spam websites and are inefficient at detecting spam links injected into legitimate websites that have been compromised. This paper proposes a detection framework based on multi-dimensional features of web page structure and text. The framework first divides a web page into blocks. Content features are extracted by computing odds ratios, and structural features are extracted with the one-hot rate method from the number of tags, the attribute keys, and the attribute values. A machine learning algorithm is then trained on these features to produce a detection model that identifies spam links. Compared with a content-analysis-based detection algorithm and a blacklist-matching algorithm, the proposed method improves detection accuracy by up to 13%.

Key words: web spam link detection, black SEO, one-hot rate, machine learning, URL injection
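
The abstract names odds ratios as the basis of the content features but does not give the formula used in the paper. As a rough illustration only, the sketch below computes the standard log odds ratio of each term between spam blocks and normal blocks. The function name `odds_ratio_scores`, the block-level document-frequency counting, and the add-one smoothing are all assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def odds_ratio_scores(spam_blocks, normal_blocks, smoothing=1.0):
    """Return {term: log odds ratio} for two collections of tokenized blocks.

    Each block is a list of tokens. A term is counted at most once per block.
    The smoothing constant (an assumption of this sketch) keeps every
    probability strictly between 0 and 1, so no division by zero occurs.
    """
    spam_df = Counter(t for block in spam_blocks for t in set(block))
    normal_df = Counter(t for block in normal_blocks for t in set(block))
    n_spam, n_normal = len(spam_blocks), len(normal_blocks)

    scores = {}
    for term in set(spam_df) | set(normal_df):
        # Smoothed probability that a block of each class contains the term.
        p_spam = (spam_df[term] + smoothing) / (n_spam + 2 * smoothing)
        p_normal = (normal_df[term] + smoothing) / (n_normal + 2 * smoothing)
        odds_spam = p_spam / (1 - p_spam)
        odds_normal = p_normal / (1 - p_normal)
        scores[term] = math.log(odds_spam / odds_normal)
    return scores


# Toy data, invented for illustration.
spam_blocks = [["casino", "bonus"], ["casino", "win"]]
normal_blocks = [["news", "weather"], ["news", "bonus"]]
scores = odds_ratio_scores(spam_blocks, normal_blocks)
```

Under this convention a positive score means the term appears more often in spam blocks, a negative score means it appears more often in normal blocks, and a score near zero means the term does not discriminate between the two classes.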

CLC number: