A Web Spam Link Detection Method Based on Web Page Structure and Text Features

doi:10.12068/j.issn.1005-3026.2020.08.005

Journal of Northeastern University (Natural Science), 2020, Vol. 41, Issue (8): 1091-1096.

YANG Wang, JIANG Yong-han, ZHANG San-feng

(School of Cyber Science, Southeast University, Nanjing 211189, Jiangsu, China)

Received: 2019-09-28 Revised: 2019-09-28 Online: 2020-08-15 Published: 2020-08-28
Corresponding author: YANG Wang
About the author: YANG Wang (1979-), male, born in Xuancheng, Anhui; lecturer at Southeast University; Ph.D.
Supported by:
National Key Research and Development Program of China (2017YFB0801703); National Natural Science Foundation of China (61602114).

Abstract

Abstract: Existing spam website detection methods mainly target self-built spam websites and are inefficient at detecting spam links injected into legitimate websites that have been compromised. This paper proposes a detection framework based on multi-dimensional features of web page structure and text. The framework first divides each web page into blocks. Content features are extracted by computing odds ratios, and structural features are extracted from tag counts, attribute keys and attribute values using the one-hot rate. A machine learning algorithm is then trained on these features to obtain a detection model that effectively detects spam links. Compared with a content-analysis-based detection algorithm and a blacklist-matching algorithm, the proposed method improves detection accuracy by up to 13%.

Key words: web spam link detection, black-hat SEO, one-hot rate, machine learning, URL injection
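
The two feature families named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function and class names (`log_odds_ratio`, `BlockStructure`, `structure_features`), the 0.5 smoothing constant and the toy data are all assumptions made for illustration. The abstract does not define the "one-hot rate", so the structural part only gathers the raw tag, attribute-key and attribute-value counts from which such a feature could be derived.

```python
import math
from collections import Counter
from html.parser import HTMLParser


def log_odds_ratio(term, spam_blocks, normal_blocks, smoothing=0.5):
    """Log odds ratio of `term` occurring in spam vs. normal text blocks.

    Each block is a list of tokens. `smoothing` is an assumption of this
    sketch: it keeps the ratio finite when a cell count is zero.
    """
    a = sum(term in block for block in spam_blocks)    # spam blocks with term
    b = len(spam_blocks) - a                           # spam blocks without term
    c = sum(term in block for block in normal_blocks)  # normal blocks with term
    d = len(normal_blocks) - c                         # normal blocks without term
    return math.log(((a + smoothing) * (d + smoothing)) /
                    ((b + smoothing) * (c + smoothing)))


class BlockStructure(HTMLParser):
    """Counts tags, attribute keys and attribute values in one page block."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attr_keys = Counter()
        self.attr_values = Counter()

    def handle_starttag(self, tag, attrs):
        # `attrs` is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        self.tags[tag] += 1
        for key, value in attrs:
            self.attr_keys[key] += 1
            if value is not None:
                self.attr_values[value] += 1


def structure_features(block_html):
    """Return (tag counts, attribute-key counts, attribute-value counts)."""
    parser = BlockStructure()
    parser.feed(block_html)
    parser.close()
    return parser.tags, parser.attr_keys, parser.attr_values


if __name__ == "__main__":
    # Toy, hand-made token lists; not data from the paper.
    spam = [["casino", "bonus"], ["casino", "win"], ["news"]]
    normal = [["news", "weather"], ["news"], ["sports"]]
    print(log_odds_ratio("casino", spam, normal))
    print(structure_features(
        '<div style="display:none"><a href="http://example.com/x">win</a></div>'))
```

A positive log odds ratio marks a term as more typical of spam blocks and a negative one as more typical of normal blocks. Both the per-term scores and the structural counts could then be fed to any standard classifier.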

CLC number: TP393.08

YANG Wang, JIANG Yong-han, ZHANG San-feng. A Web Spam Link Detection Method Based on Web Page Structure and Text Features[J]. Journal of Northeastern University (Natural Science), 2020, 41(8): 1091-1096.
