东北大学学报(自然科学版) ›› 2025, Vol. 46 ›› Issue (5): 37-45.DOI: 10.12068/j.issn.1005-3026.2025.20240015

• 信息与控制 • 上一篇    下一篇

一种高效的分布式FDR假阳性控制算法

刘旭泽1, 王慧颖2, 褚良宇3, 赵宇海1()   

  1. 1.东北大学 计算机科学与工程学院,辽宁 沈阳 110819
    2.国家电网辽宁省电力有限公司 信息通信分公司,辽宁 沈阳 110065
    3.东北大学 医学与生物信息工程学院,辽宁 沈阳 110819
  • 收稿日期:2024-01-17 出版日期:2025-05-15 发布日期:2025-08-07
  • 通讯作者: 赵宇海
  • 作者简介:刘旭泽(1992—),男,辽宁盘锦人,东北大学博士研究生
  • 基金资助:
    国家自然科学基金资助项目(62432003)

An Efficient Distributed False Positive Control Algorithm for FDR

Xu-ze LIU1, Hui-ying WANG2, Liang-yu CHU3, Yu-hai ZHAO1()   

  1. 1.School of Computer Science & Engineering,Northeastern University,Shenyang 110819,China
    2.Information and Communication Branch of Liaoning Electric Power Company,State Grid,Shenyang 110065,China
    3.School of Medicine & Bioinformatics Engineering,Northeastern University,Shenyang 110819,China.
  • Received:2024-01-17 Online:2025-05-15 Published:2025-08-07
  • Contact: Yu-hai ZHAO

摘要:

为了解决大数据挖掘中多重假设检验导致的假阳性问题,以及控制伪发现率(false discovery rate,FDR) 理论结果计算过程极其耗时的问题,针对理论FDR值的计算效率问题,提出了一种分布式假阳性控制算法DPFDR(distributed permutation testing-based false discovery rat, DPFDR).该算法首先基于条件频繁模式树(conditional frequent pattern tree,CFP)方法进行代表模式挖掘,利用代表模式对模式空间进行压缩.然后,根据代表模式对相应任务的工作量进行预估,按照工作量进行数据划分,并通过负载均衡策略将任务分配到各计算结点上.最后,通过合并、排序各结点的计算结果,获得有效的FDR假阳性控制阈值.真实数据集上的一系列实验结果表明,提出的DPFDR算法能极大提升FDR假阳性控制阈值的计算效率.

关键词: 假阳性, 数据挖掘, 分布式计算, 伪发现率, 显著性阈值

Abstract:

To address the issue of false positives caused by multiple hypothesis testing in big data mining, as well as the extremely time-consuming nature of calculating theoretical results for controlling the false discovery rate (FDR). Aiming at the computational efficiency of theoretical FDR values, a distributed false-positive control algorithm based on DPFDR(distributed permutation testing-based false discovery rate) is proposed. The algorithm firstly mining the representative patterns based on the conditional frequent pattern tree (CFP) method, and using the representative patterns to compress the pattern space. Then, the workload of the corresponding task is estimated according to the representative mode, the data is divided according to the workload, and the task is allocated to each compute node through the load balancing policy. Finally, the effective FDR false-positive control threshold is obtained by merging and sorting the calculation results of each node. A series of experimental results on real data sets show that the proposed DPFDR algorithm can greatly improve the computational efficiency of FDR false positive control threshold.

Key words: false positive, data mining, distributed computing, false discovery rate, significance threshold

中图分类号: