Ali Baba - PubMed as a graph
Corpora for evaluation of text mining components
Word sense disambiguation
Many words that refer to entities recognized by Ali Baba are ambiguous in their meaning. The name of the drug 'Duration' can also be a common English word, as can the protein 'lamp'. 'Hippocampus' can refer to a brain area or to a seahorse. Currently, Ali Baba disambiguates 304 such words, with an average accuracy of 89.7%. On an enriched data set that covers 422 terms, our method achieves 93% (average F-measure of 90%).
We collected a set of texts for each meaning of each word. On these texts, we trained support vector machine models that help to decide on the meaning of a new occurrence. Our corpus consists of ambiguous names and, for each meaning of a name, a set of example texts.
The WSD collection is organized as follows. For every category of entities (proteins, cells, common English words, etc.), we provide sets of texts for all terms that belong to that category and could also have a meaning in any of the other categories. We used three different methods to collect high-quality texts for each meaning. One identified abbreviations whose long forms also appeared in the same text. While an abbreviation ('CAT') might be ambiguous, the long form is often unambiguous. Such texts are contained in the subdirectory 'LF' (long form) in each category. Another looked for known and unambiguous synonyms of each term, which also had to appear in the same text. These texts can be found in the 'synSet1' and 'synSet2' subdirectories. Each file is named after the term and contains a Java String array with all texts: tin.obj contains all texts for the term 'tin'. The meaning is given by the directory structure (tin can be a protein or a common English word).
- wsd_corpus.tar.gz (ca. 0.9GB)
- Readme.txt
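As an illustration of the file format described above, the following minimal sketch reads the texts of one term file. It assumes that each .obj file holds a single String[] written with standard Java serialization (ObjectOutputStream); the text above does not state this explicitly, so check Readme.txt first. The class name WsdCorpusReader and the method readTexts are hypothetical names chosen for this example and are not part of the corpus distribution.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not shipped with the corpus.
final class WsdCorpusReader {

    // Reads one per-term file. Assumption: the file contains exactly one
    // String[] written with standard Java serialization.
    static String[] readTexts(Path termFile) throws IOException, ClassNotFoundException {
        try (InputStream in = Files.newInputStream(termFile);
             ObjectInputStream objects = new ObjectInputStream(in)) {
            Object value = objects.readObject();
            // Check the type before casting so that a file with a different
            // layout fails with a readable message.
            if (!(value instanceof String[])) {
                throw new IOException("Expected String[] in " + termFile + " but found "
                        + (value == null ? "null" : value.getClass().getName()));
            }
            return (String[]) value;
        }
    }

    public static void main(String[] args) throws Exception {
        // The path to a term file (for example a tin.obj inside one of the
        // category subdirectories) is passed as the first argument.
        Path termFile = Paths.get(args[0]);
        String[] texts = readTexts(termFile);
        System.out.println(termFile + ": " + texts.length + " texts");
    }
}
```

Because the meaning of a term is encoded in the directory structure rather than in the file itself, a caller would keep track of the category directory from which each file was loaded and use it as the label for the returned texts.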
Other corpora related to text mining
We provide more corpora related to text mining on this page.