Text Mining in Clinical Domain: Dealing with Noise.
Hoang Nguyen*, National ICT Australia; Jon Patrick, University of Sydney
Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). A large number of unknown words, unknown non-words and ungrammatical sentences make up the noise in a clinical corpus. Unknown words are typically complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measurements. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in the annotators' attentiveness. The clinical domain also suffers from the problem of imbalanced data distributions. These kinds of noise are pervasive and can degrade overall information extraction performance, yet they have not been carefully investigated in most published health informatics systems.
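The abstract does not give implementation details, but as an illustration of the lexical noise it describes, the following Python sketch flags unknown words and recognises non-word score/measurement patterns with a simple regular expression. The toy lexicon, the regex and the example sentence are all hypothetical; a real system would check tokens against a medical vocabulary such as SNOMED CT or UMLS.

```python
import re

# Hypothetical toy lexicon; a real system would use a medical
# vocabulary (e.g. SNOMED CT, UMLS) plus a general word list.
LEXICON = {"patient", "presented", "with", "chest", "pain", "bp", "gcs"}

# Non-word clinical patterns: measurements and scores such as
# "120/80" (blood pressure), "37.5C" (temperature) or "15" (GCS).
MEASURE_RE = re.compile(r"^\d+(\.\d+)?(/\d+(\.\d+)?)?[a-zA-Z%]*$")

def classify_token(token: str) -> str:
    """Label a token as known, a non-word pattern, or unknown noise."""
    t = token.lower().strip(".,;:")
    if t in LEXICON:
        return "known"
    if MEASURE_RE.match(t):
        return "non-word pattern"   # score or measurement
    return "unknown"                # possible misspelling/abbreviation

for tok in "Patient BP 120/80, GCS 15, c/o chset pain".split():
    print(f"{tok:10s} -> {classify_token(tok)}")
```

On this example, "120/80" and "15" are recognised as non-word patterns, while the abbreviation "c/o" and the misspelling "chset" fall through as unknown tokens, the kind of noise the paper's proofreading step and pattern recogniser are meant to handle.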
This paper introduces a general clinical data mining architecture with the potential to address all of these challenges through an automatic proofreading process, a trainable finite-state pattern recogniser, iterative model development and active learning. The reportability classifier built on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data that would otherwise be required for supervised ML.
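The paper's exact querying strategy and classifier are not specified in the abstract, so the following is only a minimal sketch of pool-based active learning with uncertainty sampling, the standard technique for reducing annotation cost. The synthetic imbalanced dataset, the logistic regression model, the batch size and the round count are all assumptions for illustration; sensitivity and specificity are computed on a held-out split as in the reported evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for annotated clinical reports.
X, y = make_classification(n_samples=3000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Seed the labelled set with a small random sample; the rest is the pool.
rng = np.random.default_rng(0)
labelled = list(rng.choice(len(X_pool), size=20, replace=False))
pool = [i for i in range(len(X_pool)) if i not in labelled]

# class_weight="balanced" is one common answer to imbalanced data.
model = LogisticRegression(max_iter=1000, class_weight="balanced")
for _ in range(10):
    model.fit(X_pool[labelled], y_pool[labelled])
    # Uncertainty sampling: query the pool items whose predicted
    # positive-class probability is closest to 0.5.
    proba = model.predict_proba(X_pool[pool])[:, 1]
    query = [pool[i] for i in np.argsort(np.abs(proba - 0.5))[:20]]
    labelled.extend(query)
    pool = [i for i in pool if i not in query]

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"sensitivity = {tp / (tp + fn):.4f}, "
      f"specificity = {tn / (tn + fp):.4f}, "
      f"labelled {len(labelled)} of {len(X_pool)} pool examples")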
Filed under: Classification