Reginald Ferber
Projects & Areas of Interest

Knowledge Discovery or Text Mining

The assumption that relations between words or terms are an essential part of the knowledge of a domain is also one of the basic assumptions behind a traditional thesaurus. A thesaurus is defined as a collection of words or terms and the relations between them. Several types of relations can be used in such a thesaurus: broader-term, narrower-term, is-the-opposite-of, is-part-of, has-part, is-instance-of, or just is-related-to. In general, such a thesaurus is constructed manually, i.e. by an expert or a team of experts.
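The definition above can be illustrated with a small data structure. The following is only a minimal sketch: the class name `Thesaurus`, the `INVERSE` table, and the example terms are assumptions made for illustration and are not taken from any particular thesaurus.

```python
from collections import defaultdict

# Relation types named in the text. Which relations are inverses of
# each other is an assumption made for this sketch.
INVERSE = {
    "broader-term": "narrower-term",
    "narrower-term": "broader-term",
    "is-part-of": "has-part",
    "has-part": "is-part-of",
    "is-the-opposite-of": "is-the-opposite-of",
    "is-related-to": "is-related-to",
}


class Thesaurus:
    """A collection of terms plus typed relations between them."""

    def __init__(self):
        # Maps (term, relation) to the set of target terms.
        self._relations = defaultdict(set)

    def add(self, source, relation, target):
        """Store a relation and, where one is known, its inverse."""
        self._relations[(source, relation)].add(target)
        inverse = INVERSE.get(relation)
        if inverse is not None:
            self._relations[(target, inverse)].add(source)

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation`, sorted."""
        return sorted(self._relations.get((term, relation), set()))


# Hypothetical entries, chosen only to show the lookup.
thesaurus = Thesaurus()
thesaurus.add("information retrieval", "narrower-term", "text mining")
thesaurus.add("thesaurus", "has-part", "descriptor")
```

With these entries, `thesaurus.related("text mining", "broader-term")` returns `["information retrieval"]`, because the inverse relation is stored together with the one that was entered.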

Associative Thesauri

Instead of relations defined or singled out by experts, one can use automatically generated associations to build a thesaurus. In general, associations do not have different types like the relations defined by experts; they describe only a kind of similarity. Sometimes this is a drawback compared to manually built thesauri. However, the automated generation of associations and their use in so-called associative thesauri also has some advantages: it is less expensive (if the necessary collection of text documents or examples is available), and it is faster, so the thesaurus can be more up to date and can be built for more specialized domains. In addition, the way the associations are generated is often more transparent than the decisions made by a variety of experts over a long period of time.

IMAGINE - a System to Generate and Use Associative Thesauri

Based on the models of associative word nets, I developed IMAGINE in 1997, the Interaction Merger for Associations Gained by Inspection of Numerous Exemplars. The program generates co-occurrence based associations from large text collections, optimizes this calculation with a collection of examples, and makes it possible to test and apply the result. The system has been used successfully to simulate the indexing of bibliographic records from the British Library of Development Studies with the OECD thesaurus.
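The general idea of co-occurrence based associations can be sketched in a few lines. This is not the calculation used in IMAGINE, which is not specified here: the function name `cooccurrence_associations`, the whitespace tokenization, the use of the Dice coefficient as association strength, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_associations(documents, min_count=1):
    """Return {(a, b): strength} for term pairs that share documents.

    Strength is the Dice coefficient 2 * n_ab / (n_a + n_b), where n_a
    is the number of documents containing a and n_ab the number of
    documents containing both terms. The measure is an assumption of
    this sketch, not the one used by IMAGINE.
    """
    term_freq = Counter()
    pair_freq = Counter()
    for text in documents:
        # Each term is counted once per document; sorting makes the
        # pair keys canonical, so (a, b) always has a < b.
        terms = sorted(set(text.lower().split()))
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return {
        (a, b): 2 * n / (term_freq[a] + term_freq[b])
        for (a, b), n in pair_freq.items()
        if n >= min_count
    }


# Hypothetical mini collection.
docs = [
    "rural development policy",
    "rural water supply",
    "development policy analysis",
]
assoc = cooccurrence_associations(docs)
```

In this toy collection "development" and "policy" always occur together, so their association strength is 1.0, while "development" and "rural" share only one of their documents and get 0.5. Terms that never share a document, such as "policy" and "water", have no entry at all. The associations are untyped: they express only a degree of similarity, as described for associative thesauri.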

Further Reading

The indexing study with IMAGINE is described in more detail in "Automated Indexing with Thesaurus Descriptors: A Co-occurrence based Approach to Multilingual Retrieval" (Ferber 1997). More details can be found here. A German introduction to data mining and machine learning is available in my lectures "Data Mining und Information Retrieval" and in the second part of my book "Information Retrieval - Suchmodelle und Data-Mining-Verfahren für Textsammlungen und das Web".

HTML file generated 18. 3. 2004 by R. Ferber