Pattern search

The pattern search facility in tOKo provides an intuitive and powerful mechanism to search for (short) phrases in a corpus. Pattern search differs from typing a query into a search engine (e.g. Google). A search engine returns documents related to the query, normally documents that contain the words in the query. Pattern search returns the phrases that precisely match what you are looking for.

On this page a pattern looks like this: `This is a pattern`. In the user interface, the results of a pattern search are shown in the fourth browser.
Overview

To illustrate the possibilities of the pattern search facility, consider searching for fruit. The pattern `fruit` matches the word fruit, including case variations, and nothing else. When we take fruit as a lemma of the English language, we may also want it to match inflections such as fruits. The pattern then becomes `(fruit)`; round brackets are the notational convention for a lemma.

The next level is to consider fruit as a concept in a user-defined ontology. The notion of fruit may then have sub-concepts called apple, pear, banana and so forth. The pattern that refers to fruit as a concept is `[fruit]`, and it matches all lexical variations of fruit and its sub-concepts.

Grammar-based patterns can be created from primitives that represent a word class. The notation is `<word class>`. For example, `<noun>` matches all words in the dictionary that can be a noun. When we want all matches for certain amounts of fruit, the pattern `<numeral> [fruit]` would match two apples. The space is itself a pattern element: it matches white space.

Syntax and special characters

The syntax follows common practice by using a simple quoting scheme and the `\` (backslash) as the escape character. The pattern `"(a)"` matches precisely (a). A quote itself can be escaped: `"\""` matches a double quote.

The pattern language defines several special characters. To match such a character literally, escape it with a backslash. The special characters are:

Content: `#` `*` `@` `_`

Curly brackets are used for grouping, and largely determine how a pattern is parsed. For example, `a|b c` is potentially ambiguous. It could be interpreted as `{a|b} c` or as `a|{b c}`.

Tokens and white space handling

tOKo uses a special definition of what a token is. In most text analysis tools a space is used to delimit a token. A sentence like "BBC1 starts at 12 o'clock" then contains five tokens. tOKo defines the following kinds of tokens:
For example, `*#` matches a word immediately followed by an integer, such as BBC1, and `* #` matches BBC 1. The pattern for any single token is `_` (an underscore). The pattern consisting of a single space matches any amount of white space, which may include both spaces and newlines.

Sequences, disjunctions and optional patterns

A pattern is normally matched from left to right against the corpus. For example, if the pattern is `a b` (where a and b are arbitrary patterns), a match can only occur if the corpus contains a phrase that matches a, immediately followed by white space, immediately followed by a phrase that matches b. The only construct that modifies this order is a disjunction (also called alternation), written with the `|` (vertical bar) operator. For example, `apple|pear` matches both apple and pear. More complex patterns are constructed by using curly brackets for grouping: `one {apple|pear}` or `I like {[fruit]|[drink]}`.

There is no special symbol for optional patterns. They are constructed as a disjunction of the pattern itself and the empty pattern. For example, `one {*|} apple` matches one apple as well as one tasty apple.

Syntactic glue: near and further

The pattern language provides two constructs to cope with the enormous flexibility of natural language. These syntactic glue operators are an alternative to extensive grammatical rules that try to take every possible construction into account.

The near operator is written `...` (an ellipsis, three dots). For example, `<verb> ... <noun>` matches eat an apple. By default the near operator matches at most five tokens, and would thus also match eat a tasty apple. In general, for high recall you should use the near operator liberally.

There is a subtle difference between `a...b` and `a ... b`. In the first pattern, spaces between a and b have to be matched by the near operator; in the second pattern they need not be. The near operator may be suffixed with a number indicating the maximum number of tokens it can match: `an ...1 apple`.
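The bounded behaviour of the near operator can be illustrated outside tOKo with a few lines of plain Prolog. The sketch below is not tOKo's implementation: `near_phrase/5` is a hypothetical helper, it assumes SWI-Prolog, and it treats the corpus as a list of word tokens only (white space is ignored here, whereas tOKo treats it as a token of its own). It merely shows what "at most Max tokens in between" means.

```prolog
:- use_module(library(lists)).

%!  near_phrase(+First, +Last, +Max, +Tokens, -Phrase) is nondet.
%
%   Phrase is a contiguous sublist of Tokens that starts with First,
%   ends with Last and has at most Max tokens in between.
%   Hypothetical helper mimicking the idea behind `First ...Max Last`.
near_phrase(First, Last, Max, Tokens, Phrase) :-
    append(_, [First|Tail], Tokens),   % locate First anywhere in Tokens
    between(0, Max, N),                % try gaps of 0..Max tokens
    length(Gap, N),
    append(Gap, [Last|_], Tail),       % the gap must be followed by Last
    append([First|Gap], [Last], Phrase).

% Example queries:
%
% ?- near_phrase(eat, apple, 5, [we, eat, a, tasty, apple, daily], P).
% P = [eat, a, tasty, apple]
%
% ?- near_phrase(eat, apple, 1, [we, eat, a, tasty, apple, daily], P).
% false.
```

With a bound of 5 the two intervening tokens fit; with a bound of 1 they do not, which mirrors the effect of a numeric suffix such as `...1`.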
The further operator `,,,` (three commas) is specifically for natural language analysis: it matches until the end of the sentence if necessary (sentence boundaries are described below). Neither near nor further can start a pattern. Both can end a pattern, however. In that case, near matches as many tokens as it can, whereas further matches until the end of the sentence. Both are useful for exploration purposes, for example `Fred...`.

Sentences and other boundaries

In most cases phrases that match a pattern should not run over sentence boundaries. Unfortunately it is not always clear where sentences start and end. The pattern language provides a solution for two special cases.

Firstly, if the input is HTML, a special null token is inserted wherever a natural language phrase cannot be continued (e.g. across paragraph boundaries). The following HTML tags insert null tokens before and after them: blockquote, br, dd, div, dl, dt, h1, h2, h3, h4, h5, h6, hr, li, ol, p, table, tbody, td, tr, ul.

Secondly, the most common way to end a sentence and start a new one is the pattern `{.|"!"|?} <word;capitalized>`. That is, a sentence ends with a dot, exclamation mark or question mark, and the next sentence starts with a capitalized word.

In both cases, `<sos>` matches the start of a sentence and `<eos>` matches the end of a sentence. This implies that one way to find questions is `<sos>,,,?<eos>`. The automatically inserted null token can be matched explicitly with `<null>`.

Set operations on phrases

Consider the phrases that is good and that was not good. Both match `that...good`. In cases such as this one it makes sense to think of the matching phrases as a set and to introduce operators that operate on sets. The operators are `!`, which excludes some elements, and `^`, which selects (or includes) some elements. Applied to the example: `{that...good}!not` excludes the phrases that contain not, whereas `{that...good}^not` selects the phrases that contain not.

Language is subtle at times. One application had to distinguish between A influences B and B was influenced by A. The two sentences are semantically identical, although A and B are reversed. In this case the word by can be used to decide on the direction.

Named patterns
Named patterns are created as follows:
Sigmund terms

If Sigmund has extracted a set of meaningful terms from the corpus, `@` can be used to refer to any of these terms. This can be very useful, as Sigmund terms can be compound and, consequently, the results might be interesting to include in an ontology. For example, `@ and @` applied to a cooking weblog yields salt and pepper as the most frequent match, but also olive oil and balsamic vinegar.

Lists and other recursive structures

A pattern that should match a list of something can be defined as a named pattern. For example, a list of nouns is either a noun or a noun followed by a space followed by a list of nouns. The named pattern `$noun_list$` is therefore: `{<noun>|<noun> $noun_list$}`.

Overlapping matches, disjointness

In an ontology about food, the concepts apple and apple pie overlap. The default is to consider both matches in the search. In many cases this is not desirable, and the disjoint option for concept and named patterns prevents the matching of overlaps. For example, with `[food;disjoint]` in our food ontology, the phrase "an apple pie is a pie made of apples" will match apple only once.

Reference

The elements that can make up a pattern are:
API

This is an initial description of how the pattern search facility can be accessed by researchers.

Running pattern search

pattern_search_corpus(+Pattern, -Matches, +Options)
Runs a pattern search for Pattern; the matching phrases are returned in Matches. The list of Options is normally empty for non-interactive searches. An example is:
:- use_module(load).
:- use_module(toko(pattern_search)).
?- pattern_search_corpus('a ... apple', Matches, []).
Matches = [tf('a warm apple', 1),
tf('a splash of apple', 1),
tf('a regular apple', 1),
tf('a truly delicious apple', 1)]

Examining the results

As may be obvious from the example, it is not entirely trivial to relate matches to pattern elements (e.g. that the dots matched warm). A provisional predicate relating a pattern and its results is:

pattern_search_match(-LinearPattern, -Match)
Unifies a match of the previous pattern search with the pattern itself. LinearPattern is a pattern in the internal list representation that contains no disjunctions. Match is a list of the matching atoms in the corpus. The pattern elements in LinearPattern and the atoms in Match are aligned:
:- use_module(load).
:- use_module(toko(pattern_search)).
?- pattern_search_match(LP, Match).
LP = [plain(a), word_break, near(5), plain(apple)]
Match = [a, ' ', 'warm ', apple]

For example, the third element of the pattern is near(5) (the internal representation of `...`), and the third element of the match is warm.

Technical issues

Depending on the size and complexity of the ontology, the initial call to pattern search can take some time, because the ontology has to be indexed. Thereafter, pattern search is extremely fast.

We are looking for suggestions on how to present the results of pattern search, both to end-users and to researchers who want to analyse the results.
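As one possible starting point for analysing results, the alignment between LinearPattern and Match described under "Examining the results" can be made explicit by pairing the two lists. The sketch below uses only standard SWI-Prolog library predicates and does not call tOKo. `aligned/3` and `near_text/3` are hypothetical helper names, and the terms in the example queries are copied from the pattern_search_match output shown earlier.

```prolog
:- use_module(library(pairs)).
:- use_module(library(lists)).

%!  aligned(+LinearPattern, +Match, -Pairs) is semidet.
%
%   Pairs is a list of Element-Text pairs, one per pattern element.
%   Fails if the two lists differ in length.
aligned(LinearPattern, Match, Pairs) :-
    pairs_keys_values(Pairs, LinearPattern, Match).

%!  near_text(+LinearPattern, +Match, -Text) is nondet.
%
%   Text is the corpus text matched by a near operator (`...`).
near_text(LinearPattern, Match, Text) :-
    aligned(LinearPattern, Match, Pairs),
    member(near(_)-Text, Pairs).

% Example queries:
%
% ?- aligned([plain(a), word_break, near(5), plain(apple)],
%            [a, ' ', 'warm ', apple], Pairs).
% Pairs = [plain(a)-a, word_break-' ', near(5)-'warm ', plain(apple)-apple]
%
% ?- near_text([plain(a), word_break, near(5), plain(apple)],
%              [a, ' ', 'warm ', apple], Text).
% Text = 'warm '
```

Keeping each pattern element next to the text it matched makes it straightforward to tabulate, for instance, everything the near operator absorbed across all matches.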