Journal of Northeastern University (Natural Science) ›› 2016, Vol. 37 ›› Issue (9): 1245-1249. DOI: 10.12068/j.issn.1005-3026.2016.09.007

• Information and Control •

Online Classification Algorithm for Uncertain Data Stream in Big Data

LYU Yan-xia, WANG Cui-rong, WANG Cong, YU Chang-yong

  1. School of Information Science & Engineering, Northeastern University, Shenyang 110819, Liaoning, China
  • Received: 2015-05-24 Revised: 2015-05-24 Online: 2016-09-15 Published: 2016-09-18
  • Contact: LYU Yan-xia
  • About the authors: LYU Yan-xia (1982-), female, born in Cangzhou, Hebei; lecturer at Northeastern University, Ph.D. WANG Cui-rong (1964-), female, born in Tangshan, Hebei; professor at Northeastern University.
  • Supported by:
    National Natural Science Foundation of China (61300195); Natural Science Foundation of Hebei Province (F2014501078); Scientific Research Project of the Education Department of Liaoning Province (L2013099); Research Fund of Northeastern University at Qinhuangdao (XNK201402).

Abstract: In big data environments, data are often uncertain because of privacy protection, data loss, and similar causes. In a data stream system, records arrive continuously, can be scanned only once, and are never available all at once. An incremental classification model is therefore needed to classify uncertain data streams. Building on the VFDT (very fast decision tree) algorithm, this paper presents WBVFDTu, a weighted Bayes classifier based on VFDT for uncertain data streams. The algorithm analyses uncertain information quickly and effectively in both the learning stage and the classification stage. In the learning stage, it builds the decision tree model using the Hoeffding bound. In the classification stage, a weighted Bayes classifier at the leaf nodes of the tree improves the classification accuracy and the execution efficiency of the model. Experimental results show that the proposed algorithm learns uncertain data streams very quickly and improves classification accuracy.

Key words: uncertain data stream, weighted Bayes, VFDT (very fast decision tree), classification algorithm, big data
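
The abstract names the Hoeffding bound as the basis of the learning stage. As a rough illustration, the sketch below shows the split test used by the standard VFDT algorithm, not the WBVFDTu code itself, which this page does not give. The function names, the tie threshold default, and the use of information gain with range log2(number of classes) are assumptions taken from the usual VFDT formulation.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean observed
    over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, num_classes: int,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Standard VFDT split test at a leaf that has seen n examples.

    Split when the best attribute beats the runner-up by more than epsilon,
    or when epsilon has shrunk below the tie threshold (assumed default),
    meaning the two candidates are too close to be worth waiting on.
    """
    # Information gain is bounded by log2(number of classes).
    value_range = math.log2(num_classes)
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain > eps) or (eps < tie_threshold)


if __name__ == "__main__":
    # Hypothetical gains for the two best attributes at one leaf.
    print(hoeffding_bound(1.0, 1e-7, 200))
    print(should_split(0.30, 0.05, num_classes=2, delta=1e-7, n=200))
```

Because epsilon shrinks as 1/sqrt(n), a leaf that cannot yet separate two attributes simply waits for more stream records, which is what lets the tree be built in a single pass.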
