as described in Salton and McGill (1983)

- Decompose the text into words
- remove stop words
- use a replacement-rule-based stemmer to generate terms
- these terms are weighted or replaced according to the following steps:
- for terms with medium document frequency, use a weighting scheme like $w_{i,k} = \frac{h(i,k)}{d(k)}$
- terms with very high document frequencies are replaced by term pairs built with the other terms in a neighborhood of a given size. Weights are constructed based on the frequencies of the two terms of a pair.
- terms with very low document frequencies are replaced by more general terms from a thesaurus or by groups of related terms
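The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure exactly as given by Salton and McGill: it reads h(i,k) as the number of occurrences of term k in document i and d(k) as the number of documents containing term k. The stop list, the suffix rules, the cutoffs `low` and `high`, and all function names are hypothetical. Thesaurus replacement for low-frequency terms and the weighting of term pairs are left out, because the text does not specify them.

```python
import re
from collections import Counter

# Hypothetical stop list and suffix replacement rules, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to"}
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def stem(word):
    """Apply the first matching suffix rule, keeping at least 3 characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


def to_terms(text):
    """Decompose text into words, drop stop words, and stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def weight_documents(docs, low=1, high=3):
    """Weight medium-frequency terms and report the high and low bands.

    Assumption: w(i,k) = h(i,k) / d(k), with h the in-document term count
    and d the document frequency. `low` and `high` are hypothetical cutoffs.
    """
    term_lists = [to_terms(d) for d in docs]
    # d(k): number of documents in which term k occurs at least once.
    df = Counter(t for terms in term_lists for t in set(terms))
    weights = []
    for terms in term_lists:
        tf = Counter(terms)  # h(i,k) for this document
        weights.append({t: tf[t] / df[t] for t in tf if low < df[t] < high})
    high_terms = {t for t in df if df[t] >= high}
    low_terms = {t for t in df if df[t] <= low}
    return term_lists, weights, high_terms, low_terms


def neighbor_pairs(terms, high_terms, window=1):
    """Pair each high-frequency term with the terms within `window` positions."""
    pairs = []
    for pos, term in enumerate(terms):
        if term in high_terms:
            start = max(0, pos - window)
            for other in terms[start:pos] + terms[pos + 1 : pos + 1 + window]:
                pairs.append((term, other))
    return pairs


docs = [
    "The cats are chasing the mice",
    "Cats and dogs",
    "Dogs chased the cats in the garden",
]
term_lists, weights, high_terms, low_terms = weight_documents(docs)
pairs = neighbor_pairs(term_lists[2], high_terms)
```

With these three sample documents, the stem `cat` occurs in every document and lands in the high band, so it is paired with its neighbors, while `chas` and `dog` occur in two documents each and receive the weight 1/2 wherever they appear. The terms `mice` and `garden` fall in the low band and would be candidates for thesaurus replacement.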

