Efficient Mining of the Most Significant Patterns with Permutation Testing
Leonardo Pellegrina (University of Padova); Fabio Vandin (University of Padova)
The extraction of patterns displaying significant association with a class label is a key data mining task with wide application in many domains. We study a variant of the problem that requires to mine the top-k statistically significant patterns, thus providing tight control on the number of patterns reported in output. We develop TopKWY, the first algorithm to mine the top-k significant patterns while rigorously controlling the family-wise error rate of the output and provide theoretical evidence of its effectiveness. TopKWY crucially relies on a novel strategy to explore statistically significant patterns and on several key implementation choices, which may be of independent interest. Our extensive experimental evaluation shows that TopKWY enables the extraction of the most significant patterns from large datasets which could not be analyzed by the state-of-the-art. In addition, TopKWY improves over the state-of-the-art even for the extraction of all significant patterns.