An Efficient Distributed False Positive Control Algorithm for FDR

doi:10.12068/j.issn.1005-3026.2025.20240015

Abstract

Multiple hypothesis testing in big data mining produces false positives, and computing the theoretical results needed to control the false discovery rate (FDR) is extremely time-consuming. To improve the efficiency of this computation, a distributed false-positive control algorithm, DPFDR (distributed permutation testing-based false discovery rate), is proposed. The algorithm first mines representative patterns using the conditional frequent pattern tree (CFP) method and uses them to compress the pattern space. It then estimates the workload of each task from its representative pattern, partitions the data according to the estimated workload, and assigns the tasks to compute nodes under a load-balancing policy. Finally, the effective FDR false-positive control threshold is obtained by merging and sorting the results computed on each node. Experiments on real data sets show that DPFDR greatly improves the efficiency of computing the FDR false-positive control threshold.
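The abstract does not say which load-balancing policy is used. As an illustration only, the sketch below assumes a greedy "heaviest task first, lightest node next" rule; the function name `assign_tasks`, the task ids, and the workload numbers are all hypothetical and are not taken from the paper.

```python
import heapq

def assign_tasks(workloads, num_nodes):
    """Greedy assignment: heaviest task first, always to the lightest node.

    workloads: dict mapping a task id to its estimated cost (assumed input).
    Returns one (total_cost, [task ids]) pair per node, indexed by node.
    """
    # Min-heap of (current load, node index): the lightest node is on top.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_nodes)]
    # Sort by descending cost; ties are broken by task id for determinism.
    for task, cost in sorted(workloads.items(), key=lambda kv: (-kv[1], kv[0])):
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + cost, node))
    loads = [0] * num_nodes
    for load, node in heap:
        loads[node] = load
    return list(zip(loads, assignment))

# Hypothetical estimated workloads for five tasks, spread over two nodes.
example = {"p1": 7, "p2": 5, "p3": 4, "p4": 3, "p5": 1}
for node, (load, tasks) in enumerate(assign_tasks(example, 2)):
    print(node, load, tasks)
```

With these made-up numbers both nodes end up with a total estimated cost of 10. A greedy rule of this kind is a heuristic and does not guarantee an optimal split in general.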

Key words: false positive, data mining, distributed computing, false discovery rate, significance threshold

CLC Number: TP 311

Xu-ze LIU, Hui-ying WANG, Liang-yu CHU, Yu-hai ZHAO. An Efficient Distributed False Positive Control Algorithm for FDR[J]. Journal of Northeastern University (Natural Science), 2025, 46(5): 37-45.

Figures/Tables 16

Item	Fail to reject H₀	Reject H₀	Total
H₀ true	U	V	n₀
H₀ false	T	S	n-n₀
Total	n-R	R	n
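
In the notation of the table above, V is the number of true null hypotheses that are rejected (false positives) and R = V + S is the total number of rejections. The table itself does not spell out the quantity being controlled; the standard Benjamini-Hochberg definition of the false discovery rate, stated here as background rather than as a formula quoted from the paper, is the expected proportion of false rejections, with the ratio taken to be 0 when nothing is rejected:

```latex
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right]
```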

TID	Transaction
1	a,b,c,j
2	a,b,c,d,j
3	a,b,c
4	c,d,e,f
5	d,e,f,j
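
The support of a pattern is the number of transactions that contain every item in it. A minimal sketch over the five transactions above follows; the helper name `support` and the dictionary layout are illustrative choices, not the paper's implementation.

```python
# Transaction database copied from the table above (TID -> set of items).
transactions = {
    1: {"a", "b", "c", "j"},
    2: {"a", "b", "c", "d", "j"},
    3: {"a", "b", "c"},
    4: {"c", "d", "e", "f"},
    5: {"d", "e", "f", "j"},
}

def support(pattern, db):
    """Count the transactions in db that contain every item of pattern."""
    items = set(pattern)
    return sum(1 for t in db.values() if items <= t)

patterns = [("a",), ("b", "c"), ("a", "c"), ("a", "b", "c"), ("a", "b", "c", "d")]
for pattern in patterns:
    print(",".join(pattern), support(pattern, transactions))
```

For these five patterns the computed supports are 3, 3, 3, 3 and 1, which agrees with the support column of the pattern table.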

ID	Pattern	Support
1	a	3
2	b,c	3
3	a,c	3
4	a,b,c	3
5	a,b,c,d	1