Basic text analysis

This page contains descriptions of the basic text analysis and
search functions. Most of the
functions described below will produce different results depending on
the mode. When ontology mode is active, common words with little
content are not included in the results. For example, words like
the, is, and on are not shown when ontology mode is active.
Something similar applies when searching for compounds. In
ontology mode, prefixing cheese will
find goat cheese, but it will not show the cheese.
In language
mode, all words and/or terms are included in the results.
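
The difference between the two modes can be pictured with a small Python sketch. This is only an illustration, not the program's actual implementation: the function name word_frequencies, the mode argument and the STOP_WORDS set are hypothetical, and the real list of words with little content is not documented on this page.

```python
from collections import Counter

# Hypothetical list of "common words with little content".
# The list actually used in ontology mode is an assumption here.
STOP_WORDS = {"the", "is", "on", "a", "and"}

def word_frequencies(tokens, mode="language"):
    """Count words; in ontology mode, drop the low-content words."""
    if mode not in ("language", "ontology"):
        raise ValueError("mode must be 'language' or 'ontology'")
    counts = Counter(tokens)
    if mode == "ontology":
        for word in list(counts):
            if word.lower() in STOP_WORDS:
                del counts[word]
    return counts

tokens = "the egg is on the table".split()
language_counts = word_frequencies(tokens, "language")
ontology_counts = word_frequencies(tokens, "ontology")
```

With this input, language mode keeps every word (including the, is and on), while ontology mode keeps only egg and table.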
The menu provides access to
options that summarize the frequencies of words, and subsets of words
in the corpus. 
The screenshot illustrates the user interface after the user
has selected the option that lists all lemmas. As a
result, all lemmas are displayed in the second browser.
Thereafter, the user clicked on one of the lemmas (egg). The
third browser then displays the inflections (e.g. the plural eggs)
and the first browser the documents that contain the lemma egg.

- : all words in the
corpus and their frequency. The results are case sensitive.
Selecting a result shows the case variations in the third browser.
- : all plain versions of
the words in the corpus. A plain word is a word without uppercase
letters and without diacritics. For example, the plain version of
Rôle is role.
- : all headwords of the
lemmas in the corpus. A lemma is the base or uninflected form of the word.
Selecting a lemma shows the inflected variations in the third browser.
- : all
capitalized words in the corpus. Useful to find names of people and places.
- : words that
only contain capitals. These could be abbreviations.
- : words that
contain at least one diacritic.
- : all words that
can be an adjective (little, half, new).
- : all words that
can be an adverb.
- : all words that
can be a noun.
- : all words that
can be a numeral (one, ten, hundred).
- : all words that
can be a pronoun.
- : all words that
can be a verb.
- : all words not in the
dictionary.
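
Some of the word lists above are defined purely by spelling, so they can be approximated in a few lines of Python. The sketch below is an illustration under assumptions, not the program's own code: the function names are hypothetical, a capitalized word is taken to be one that starts with an uppercase letter, and diacritics are detected through Unicode decomposition. The lemma, part-of-speech and dictionary lists need linguistic resources and are not covered.

```python
import unicodedata
from collections import Counter

def has_diacritic(word):
    """True if any character decomposes into a base letter plus a mark."""
    decomposed = unicodedata.normalize("NFD", word)
    return any(unicodedata.combining(ch) for ch in decomposed)

def plain(word):
    """Plain version of a word: no uppercase letters, no diacritics."""
    decomposed = unicodedata.normalize("NFD", word)
    # Letters without a decomposition (for example the Danish o with
    # stroke) are left unchanged by this approach.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def summarize(tokens):
    """Build case-sensitive and spelling-based word lists."""
    words = Counter(tokens)  # case sensitive, as on this page
    return {
        "words": words,
        "plain": Counter(plain(t) for t in tokens),
        # Assumption: capitalized means the first character is uppercase.
        "capitalized": {w for w in words if w[:1].isupper()},
        "uppercase": {w for w in words if w.isupper()},
        "diacritics": {w for w in words if has_diacritic(w)},
    }

summary = summarize(["Rôle", "role", "NATO", "Paris", "egg", "Egg"])
```

In this example Rôle and role are counted separately in the case-sensitive list but together (as role) in the plain list.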
Prefixing, infixing
and postfixing (also called suffixing) are simple and powerful methods
to find compound terms. In English, the general rule is that the last
word of a compound determines what it is. For example, chocolate
cake is a cake (and not a kind of chocolate).
Enter a term into the text entry field and then click on the prefix, infix
or postfix icon. The results are shown in the third browser. Prefix
looks before the term: prefixing cheese could result in
compounds like goat cheese or fresh cheese. Postfix
looks after the term: postfixing cheese results in cheese
platter. Infix looks both before and after the term; it is rarely
used. When the third browser is showing a list of "fix" terms,
dragging and dropping a term from any browser applies the same "fix" to
the dropped term.

The drop-down menu provides a Sub-string function which searches for all words that
contain the sub-string provided in the text entry field. For example,
with the sub-string xt, possible results are next and
extra. The results are displayed in the second browser.

The special character ^ matches the beginning of a word and
$ matches the end of a word. Thus, xt$ matches all
words ending in xt. English is particularly regular with
regard to endings; nearly any verb can be turned into a noun by
suffixing the verb with er (work, worker; read, reader; etc.).

All five browsers contain a popup menu under the right mouse
button. The options in these popups are described below.

- . Saves the content of
the browser in a text file.
- . Saves the content of
the browser in a comma-separated file, suitable for loading into a
program that can read such files (e.g. Microsoft's Excel or
OpenOffice).
- . Saves the content in
an XML format.
- . Sorts the entries in the browser in
lexicographic order.
- . Sorts the entries in the browser
by frequency (shown between square brackets).
- . Sorts the entries in the browser
by score (shown between angle brackets).
- . Reverses the order in
which the entries are displayed.
- . Removes all
entries in the browser which already appear in the ontology.
- . The next time a
document is selected the (first 100) terms in the browser are
highlighted in the document. Highlighted words are shown with a light
blue background as shown below.

- . If the browser contains
compound terms, for each such term the first word (the head) is
removed. This is mainly intended for use in combination with postfix. For example, after postfixing
cheese, applying this option
removes the cheese part, leaving all words that come after cheese in
the corpus.
- . Rather than removing the
first word of compounds it removes the last word (the tail). Useful
after a prefix.
- . Removes the spaces in compounds.
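
The prefix, postfix and sub-string searches, together with the last three popup options above, can be sketched in Python as follows. This is a minimal illustration under assumptions, not the program's implementation: all function names are hypothetical, compounds are limited to two adjacent words, and only ^ and $ are treated as special characters in the sub-string pattern.

```python
import re
from collections import Counter

def prefix(tokens, term):
    """Count two-word compounds that end in term (goat cheese)."""
    return Counter(f"{a} {b}" for a, b in zip(tokens, tokens[1:]) if b == term)

def postfix(tokens, term):
    """Count two-word compounds that start with term (cheese platter)."""
    return Counter(f"{a} {b}" for a, b in zip(tokens, tokens[1:]) if a == term)

def substring(words, pattern):
    """Words containing pattern; ^ and $ anchor it to word start and end."""
    at_start = pattern.startswith("^")
    at_end = pattern.endswith("$")
    core = pattern[1:] if at_start else pattern
    core = core[:-1] if at_end else core
    # Everything except the two anchors is matched literally.
    regex = re.compile(
        ("^" if at_start else "") + re.escape(core) + ("$" if at_end else "")
    )
    return [w for w in words if regex.search(w)]

def remove_head(entries):
    """Drop the first word of each compound; single words are kept."""
    return [e.split(" ", 1)[1] if " " in e else e for e in entries]

def remove_tail(entries):
    """Drop the last word of each compound; single words are kept."""
    return [e.rsplit(" ", 1)[0] if " " in e else e for e in entries]

def remove_spaces(entries):
    """Join the words of each compound into one string."""
    return [e.replace(" ", "") for e in entries]

tokens = "fresh goat cheese and a cheese platter with goat cheese".split()
before = prefix(tokens, "cheese")   # goat cheese (2), a cheese (1)
after = postfix(tokens, "cheese")   # cheese and (1), cheese platter (1)
followers = remove_head(list(after))
```

Here followers contains the words that come directly after cheese (and, platter), which mirrors the postfix and remove-head combination described above. The results correspond to language mode, since entries such as a cheese are not filtered out.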