Journal of Northeastern University ›› 2008, Vol. 29 ›› Issue (3): 328-331.DOI: -

• OriginalPaper

A method generating data sets to test data mining algorithms

Wei, Wei-Jie (1); Zhang, Bin (1); Wang, Bo (1); Zhang, Ming-Wei (1)   

  1. (1) School of Information Science and Engineering, Northeastern University, Shenyang 110004, China
  • Received:2013-06-22 Revised:2013-06-22 Online:2008-03-15 Published:2013-06-22
  • Contact: Wei, W.-J.

Abstract: Security restrictions, timing uncertainty, and the diversity of data make it difficult to acquire data sets for testing data mining algorithms, a problem that has long hampered data mining research. A simulation method based on the genetic algorithm (GA) and entropy is therefore proposed for generating such data sets. The method uses the GA to extend a small amount of data collected from the real world, evaluates the similarity between each extended data set and the real one using entropy, and selects the large extended data set most similar to the real one as the data set for testing data mining algorithms. A generation algorithm is also given. The generated test data set has the same attributes, attribute value ranges, and attribute value distributions as the real data set, and it preserves the correlations among the attributes. Such test data sets should accelerate research on data mining algorithms.
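
The abstract does not give the generation algorithm itself, so the following Python sketch is only one possible reading of the three steps it names: extend, compare by entropy, select. Everything in it is an assumption made for illustration: the function names (`extend`, `entropy_profile`, `distance`, `generate_test_set`), the restriction to discrete attribute values, the use of a single round of crossover and mutation instead of a full evolutionary loop, and the similarity measure (absolute differences between per-attribute and pairwise joint Shannon entropies). It is not the authors' algorithm.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(rows):
    """Entropy of each attribute, then joint entropy of each attribute pair.

    The pairwise terms are an assumed proxy for correlations among attributes.
    """
    k = len(rows[0])
    singles = [entropy([r[i] for r in rows]) for i in range(k)]
    pairs = [
        entropy([(r[i], r[j]) for r in rows])
        for i in range(k)
        for j in range(i + 1, k)
    ]
    return singles + pairs


def distance(rows_a, rows_b):
    """Assumed dissimilarity: summed absolute gaps between entropy profiles."""
    pa, pb = entropy_profile(rows_a), entropy_profile(rows_b)
    return sum(abs(x - y) for x, y in zip(pa, pb))


def extend(seed, size, rng, mutation_rate=0.05):
    """Build `size` rows by one-point crossover of seed rows plus mutation.

    Mutation redraws a value from those observed for that attribute in the
    seed, so generated values stay inside the seed's value ranges.
    Requires at least two attributes per row.
    """
    k = len(seed[0])
    domains = [[r[i] for r in seed] for i in range(k)]
    rows = []
    for _ in range(size):
        a, b = rng.choice(seed), rng.choice(seed)
        cut = rng.randrange(1, k)
        child = list(a[:cut] + b[cut:])
        for i in range(k):
            if rng.random() < mutation_rate:
                child[i] = rng.choice(domains[i])
        rows.append(tuple(child))
    return rows


def generate_test_set(seed, size, candidates=20, rng_seed=0):
    """Generate several extended sets and keep the one closest to the seed."""
    rng = random.Random(rng_seed)
    pool = [extend(seed, size, rng) for _ in range(candidates)]
    return min(pool, key=lambda rows: distance(seed, rows))
```

With a small seed such as `[("a", 0, "x"), ("a", 1, "y"), ("b", 0, "x"), ("b", 1, "y")]`, calling `generate_test_set(seed, size=50)` returns 50 rows whose values all come from the seed. In this seed the second and third attributes move together, and the pairwise entropy terms penalize candidates that break that link.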
