What is data dredging

11/4/2023

Let’s proceed with our parade of fraudulent data practices, shall we? Next up is data dredging (a/k/a “p-hacking”), a more sophisticated (and less transparent) form of cherry picking. Data dredging is the abuse of data mining: the same data set is examined so many times that some pattern is bound to turn up by chance. In the words of Wikipedia: “The process of data dredging involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching … for combinations of variables that might show a correlation ….” This form of data fraud thus occurs when researchers perform multiple statistical tests on a single set of data and then selectively publish only those results that satisfy some test of statistical significance. Such ex post results, however, are often just spurious correlations: at the conventional 5% significance threshold, roughly one in twenty tests of a true null hypothesis will come out “significant” by chance alone.

Data mining has legitimate uses, of course. Classification is the most frequently used data mining function, implemented predominantly with Bayesian classifiers, neural networks, and support vector machines (SVMs). Clinical data mining, for example, has three objectives: understanding the clinical data, assisting healthcare professionals, and developing a data analysis methodology suitable for medical data.

The lesson here is this: beware of so-called “statistically significant” results. To avoid perpetrating this form of data fraud (and reduce positive-results bias to boot), some journals and funding organizations now require researchers to preregister their clinical trials, stating in advance which hypotheses they are going to test.
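To see how easily repeated testing produces spurious “significant” results, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from a real study: the function names (`pearson_r`, `approx_p_value`, `dredge`), the sample size, the number of predictors, and the use of a Fisher z-transform as an approximate significance test. The outcome and every predictor are pure random noise, so any “hit” is a false positive by construction.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_samples=100, n_predictors=200, alpha=0.05, seed=0):
    """Test many noise predictors against one noise outcome.

    Returns the number of "significant" correlations at the naive
    threshold alpha and at the Bonferroni-corrected threshold
    alpha / n_predictors.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    p_values = []
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson_r(predictor, outcome)
        p_values.append(approx_p_value(r, n_samples))
    naive_hits = sum(p < alpha for p in p_values)
    bonferroni_hits = sum(p < alpha / n_predictors for p in p_values)
    return naive_hits, bonferroni_hits


naive_hits, bonferroni_hits = dredge()
print(f"'Significant' at p < 0.05: {naive_hits} of 200 noise predictors")
print(f"Still significant after Bonferroni correction: {bonferroni_hits}")
```

With 200 tests at a 5% threshold, about ten false positives are expected on average even though no real relationship exists. A researcher who reports only those hits is data dredging. The Bonferroni correction shown here is one simple (and conservative) way to account for the number of tests performed; it is offered as an example, not as the only remedy.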
