Journal of Northeastern University (Natural Science) ›› 2015, Vol. 36 ›› Issue (1): 19-23. DOI: 10.12068/j.issn.1005-3026.2015.01.005

• Information and Control •

An Adaptive Clustering Method on Medical Short Text

LI Wei1, XU Hong-tao2, ZHAO Da-zhe1,3, LIU Ji-ren3

  1. Key Laboratory of Medical Image Computing, Ministry of Education, Northeastern University, Shenyang 110819, China; 2. Zhengzhou Municipal Human Resources and Social Security Data Management Center, Zhengzhou 450000, China; 3. Neusoft Group Ltd., Shenyang 110179, China
  • Received: 2013-12-05 Revised: 2013-12-05 Online: 2015-01-15 Published: 2014-11-07
  • Contact: LI Wei
  • About the authors: LI Wei (born 1980), male, from Zhumadian, Henan, doctoral candidate at Northeastern University; ZHAO Da-zhe (born 1960), female, from Shenyang, Liaoning, professor and doctoral supervisor at Northeastern University; LIU Ji-ren (born 1955), male, from Dandong, Liaoning, professor and doctoral supervisor at Northeastern University.
  • Supported by:
    National Natural Science Foundation of China (61172002); National Key Technology R&D Program of China (2014BAI17B01); National High Technology Research and Development Program of China (2012AA02A607).

Abstract: An adaptive text clustering method is proposed to address synonym recognition and name standardization for disease diagnosis texts in electronic medical records. First, a new set-based text similarity measure is proposed. Next, a text clustering algorithm based on the similarity distribution, which determines the number of clusters automatically, is used to identify synonymous diagnosis texts. Finally, a central concept extraction algorithm based on frequent sequence patterns standardizes the disease names, while the clusters are merged and optimized to further improve clustering accuracy. Test results show that the method achieves high accuracy and clustering efficiency, and that it is broadly applicable to the preprocessing, classification and analysis of medical record texts.

Key words: clustering analysis, similarity measurement, frequent sequence pattern, electronic medical record, similarity distribution
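
The abstract does not give the formulas behind the similarity measure or the clustering rule, so the following is only a minimal sketch of the general idea of the first two steps: comparing short texts as sets and grouping them without fixing the number of clusters in advance. It assumes plain Jaccard similarity over character sets as a stand-in for the paper's set-based measure, and a fixed threshold as a stand-in for its similarity-distribution criterion. The names `jaccard`, `cluster`, and `threshold`, and the sample strings, are hypothetical and not taken from the paper.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two strings.

    Assumption: a stand-in for the paper's set-based similarity measure.
    """
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def cluster(texts, threshold=0.6):
    """Greedy single-pass clustering; the number of clusters is not preset.

    Each text joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster. Assumption: the fixed threshold replaces the paper's
    similarity-distribution rule, which the abstract does not specify.
    """
    clusters = []
    for t in texts:
        for c in clusters:
            if jaccard(t, c[0]) >= threshold:
                c.append(t)
                break
        else:
            # No existing cluster was similar enough: open a new one.
            clusters.append([t])
    return clusters


# Hypothetical diagnosis strings, for illustration only.
diagnoses = [
    "type 2 diabetes",
    "diabetes type 2",
    "hypertension",
    "hypertension stage 1",
]
groups = cluster(diagnoses)
```

With these sample strings and the default threshold, the two diabetes variants fall into one group and the two hypertension variants into another. Character sets ignore word order, which suits reordered synonyms but would also merge unrelated texts that happen to share characters; a real system would need a more discriminating measure.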
