Journal of Northeastern University (Natural Science) ›› 2016, Vol. 37 ›› Issue (12): 1677-1682. DOI: 10.12068/j.issn.1005-3026.2016.12.002

• Information & Control •

A Cluster Algorithm for Uncertain Data Stream

HAN Dong-hong1, WANG Kun1, SHAO Chong-lei2, MA Chang1

  1. School of Computer Science & Engineering, Northeastern University, Shenyang 110169, China; 2. School of Mechanical Engineering, Shenyang Ligong University, Shenyang 110159, China.
  • Received: 2015-08-28 Revised: 2015-08-28 Online: 2016-12-15 Published: 2016-12-23
  • Contact: HAN Dong-hong
  • About author: HAN Dong-hong (born 1968), female, from Pingshan, Hebei; associate professor at Northeastern University.
  • Supported by:
    National Natural Science Foundation of China (61173029, 61332006, 61672144).

Abstract: Uncertain streaming data generated by sensors, mobile phones, social networks, and similar sources are an important component of big data. Because such streams have variable arrival rates, very large scale, a single-pass scanning constraint, and inherent uncertainty, traditional clustering algorithms cannot meet users' requirements for efficient, real-time queries. First, a minimum bounding rectangle (MBR) is used to describe the distribution of each uncertain tuple, and a clustering algorithm for uncertain data streams based on expected distance is proposed; it computes upper and lower bounds on the expected distance and uses them to prune distant clusters, which reduces the amount of computation. Second, the concept of a cluster MBR is introduced to capture the distribution of the tuples within a cluster, and a clustering algorithm based on spatial position relationships is proposed; it uses the spatial relationship between the MBR of an uncertain tuple and each cluster MBR to exclude clusters that are far from the tuple, which improves clustering efficiency. Finally, experiments on synthetic and real datasets verify that the proposed algorithms are effective and efficient.

Key words: uncertain data stream, clustering, big data, data mining, minimum bounding rectangle (MBR)
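
The bound-based pruning described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm, whose exact bounds and data structures are not given on this page. It assumes that each uncertain tuple is a finite set of weighted sample points, that each cluster is represented by a single center point, and that distance is Euclidean. All function names are hypothetical. Under these assumptions, the minimum and maximum distances from a center to the tuple's MBR bound the expected distance from below and above, so any cluster whose lower bound exceeds the smallest upper bound cannot be the nearest one and is skipped.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a set of sample points: (low corner, high corner)."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def min_dist(box, c):
    """Smallest Euclidean distance from point c to any point of the box (lower bound)."""
    lo, hi = box
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2 for l, h, x in zip(lo, hi, c)))

def max_dist(box, c):
    """Largest Euclidean distance from point c to any point of the box (upper bound)."""
    lo, hi = box
    return math.sqrt(sum(max(abs(x - l), abs(x - h)) ** 2 for l, h, x in zip(lo, hi, c)))

def expected_dist(samples, probs, c):
    """Expected distance from an uncertain tuple (weighted samples) to center c."""
    return sum(p * math.dist(s, c) for s, p in zip(samples, probs))

def nearest_cluster(samples, probs, centers):
    """Return (index of nearest center by expected distance, indices that survived pruning)."""
    box = mbr(samples)
    bounds = [(min_dist(box, c), max_dist(box, c)) for c in centers]
    best_upper = min(u for _, u in bounds)
    # A cluster whose lower bound exceeds the best upper bound cannot be nearest.
    candidates = [i for i, (l, _) in enumerate(bounds) if l <= best_upper]
    best = min(candidates, key=lambda i: expected_dist(samples, probs, centers[i]))
    return best, candidates

# Illustrative data only: one uncertain tuple with four equally likely samples.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
probs = [0.25, 0.25, 0.25, 0.25]
centers = [(0.5, 0.5), (10.0, 10.0), (2.0, 0.5)]
best, candidates = nearest_cluster(samples, probs, centers)
```

In this sketch the exact expected distance is evaluated only for the clusters that survive pruning, which is where the saving in computation would come from on a high-rate stream.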

CLC Number: