SearchEngine:
Eliminating Words
This chapter discusses the filters available for eliminating all words in entire
files, eliminating useless words such as "and" or "the", reducing words
such as "www.javasoft.com", and removing words within specific HTML
tags.
Before looking at the various methods of eliminating words, it is necessary
to describe what the compiler considers a 'word' to be. The word parser, incorporated
into the compiler, parses words according to two separate algorithms, one for
numbers and one for words.
Numbers
- Any numeric value (0 to 9 or a valid ISO-Latin1 numeric value) followed
by other numeric values, or "." or ","
is considered to be a number. Trailing "." or ","
characters are ignored.
Words
- Any letter, followed by letters, numeric values, ".",
"-", or "_" is considered to
be a word. Trailing ".", "-",
or "_" characters are ignored.
If you want a hyphenated word to be split into its components, use the &shy;
(&#173;) ampersand entity, also known as a soft hyphen, instead
of the hyphen character '-', for example profit&shy;margin.
Values such as "1.0" or "1,000",
or even Dewey decimal values such as "1.2.3", are all
considered to be numbers. Note, however, that "1..6"
is also considered to be a number.
The compiler provides the -xn option, which removes all numbers
from the word list.
Values such as "wasn't" are considered to be two
separate words: "wasn" and "t".
The apostrophe is not tested by the word parser, because the parser would then
have to understand single-quoted phrases. Since there are no syntactical
rules in HTML for #PCDATA (the text within tags),
it would be impossible to tell when an apostrophe marks the start or end of
a single-quoted phrase, and when it is, well, just an apostrophe. Some people
also prefer to use the "`" character to start a single-quoted
phrase.
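The two parsing rules can be approximated with a single regular expression. The following Python sketch is illustrative only and is not the compiler's own code: the function name tokenize and the "word"/"number" labels are hypothetical, and Python's Unicode letter and digit classes stand in for the ISO-Latin1 character set.

```python
import re

# Assumed approximation of the two rules:
#   word   = a letter, then letters, digits, ".", "-" or "_"
#   number = a digit, then digits, "." or ","
TOKEN = re.compile(r"[^\W\d_][\w.\-]*|\d[\d.,]*")

def tokenize(text):
    """Return (kind, token) pairs, with trailing punctuation trimmed."""
    tokens = []
    for match in TOKEN.finditer(text):
        raw = match.group()
        if raw[0].isdigit():
            # Trailing "." or "," characters are ignored for numbers.
            tokens.append(("number", raw.rstrip(".,")))
        else:
            # Trailing ".", "-" or "_" characters are ignored for words.
            tokens.append(("word", raw.rstrip(".-_")))
    return tokens

print(tokenize("wasn't 1..6 www.javasoft.com."))
```

Because the apostrophe and the soft hyphen (U+00AD) are not part of either rule, they act as separators in this sketch, whereas the ordinary hyphen keeps a word in one piece.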
Removing documents from the word list
A table of contents (TOC) document is an ideal candidate for word removal.
Although needed to generate the dependency list, it would be unproductive for
the TOC document contents to appear in the word database, since the descriptors
(words) in that document invariably link the user to other pages.
In this case, all words within a document can be removed from the word list
in the same way as documents are removed from the dependency list, described
below.
Removing a specific document from the word list
To remove all words in a specific document from the word list, use the -xwu
option, and specify the document's URL path and filename components,
for example:
-xwu /www/rational/application/search/doc/TOC.html
Removing multiple documents from the word list
To remove all words in multiple documents from the word list, use the -xwu
option, and a filter using the wildcard character '*'. For example:
-xwu */TOC.html
In this example, all words in all URLs ending with /TOC.html
will be excluded from the word list.
Another, more dangerous, example of filtering is:
-xwu /www/extawt/*
In the above example, all words in all URLs beginning with
/www/extawt/ will be excluded from the word list.
Finally, an even more dangerous example of filtering is:
-xwu */extawt/*
In this example, all words in all URLs containing /extawt/
will be excluded from the word list.
No other combinations of the wildcard character '*' are valid.
A filter definition of */extawt/*remove.* will result in a (probably
useless) filter that removes all words from URLs containing the literal text
/extawt/*remove., and not the probable intention of removing all words in all
URLs containing /extawt/ and also remove.
The wildcard character '*' can appear at the start of the URL,
at the end of the URL, or both. Anywhere else it is treated as an
ordinary character.
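The wildcard rules above can be sketched in a few lines of Python. This is an illustration of the described behaviour, not the compiler's implementation; the function name url_filter_matches is hypothetical, and it is an assumption that a filter without any wildcard must match the whole URL.

```python
def url_filter_matches(pattern, url):
    """Match a URL against a filter with an optional leading/trailing '*'."""
    leading = pattern.startswith("*")
    trailing = len(pattern) > 1 and pattern.endswith("*")
    # Strip only the outermost wildcards; any other '*' stays literal.
    core = pattern[int(leading): len(pattern) - int(trailing)]
    if leading and trailing:
        return core in url            # "contains"
    if leading:
        return url.endswith(core)     # "ends with"
    if trailing:
        return url.startswith(core)   # "begins with"
    return url == core                # assumed: exact match

print(url_filter_matches("*/TOC.html", "/www/rational/application/search/doc/TOC.html"))
print(url_filter_matches("*/extawt/*remove.*", "/www/extawt/remove.html"))
```

The second call shows the pitfall described above: the inner '*' is compared literally, so a URL that contains both /extawt/ and remove. is still not matched.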
Generating a word list
Before individual words can be removed, you have to know what words appear
in the search database. The compiler provides the -lw filename
option, which writes all filtered words, in HTML document format,
to the specified file.
The following is an excerpt from the generated word list:
<dl>
<dt>absolute
<dt>accept
<dt>acceptable
<dt>access
<dt>according
<dt>accumulates
<dt>achieve
<dt>achieved
<dt>acronyms
<dt>add
<dt>added
<dt>addition
<dt>address
...
</dl>
Creating word filter documents
Commonly used words, or useless words, can be removed from the database using
word lists, which are stored in an HTML document known as a word
filter document. The format is the same as that of the parsed documents in the
dependency list, so HTML entity characters (&) can
be used to represent ISO-Latin1 characters in ASCII files. The current list
of valid ampersand entities is given in the appendix Ampersand
entities.
Since the word filter document (see below) and generated word list file are
both in HTML format, you can use your favorite text editor to cut
and paste words to be removed from the word list to the word filter document.
Eliminating a word 
A specific word can be eliminated by simply having the word appear in a word
filter document. This is a file in HTML format, which lists the
specific words or word filters to be used when removing words. It is a good
idea to list them one per line, for readability, and ease of editing. The following
is an excerpt from the exclude.english.html file:
<dl>
<dt>a
<dt>able
<dt>about
<dt>above
<dt>accomplish
<dt>accomplished
<dt>accomplishes
<dt>across
<dt>act
<dt>acts
<dt>actual
...
</dl>
Word filter documents are specified using the -xwf option, for
example:
-xwf exclude.english.html
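The effect of a word filter document can be illustrated with a short Python sketch. It is not the compiler's code: the function names are hypothetical, the sample document is made up for the illustration, and it is an assumption that entries containing '*' are reduction filters rather than plain words to exclude.

```python
import html
import re

# Each entry is assumed to be the first run of non-space text after a <dt> tag.
DT_ENTRY = re.compile(r"<dt>\s*([^<\s]+)", re.IGNORECASE)

def read_filter_entries(document_text):
    """Return the <dt> entries of a word filter document, entities decoded."""
    return [html.unescape(entry) for entry in DT_ENTRY.findall(document_text)]

def remove_excluded(words, entries):
    """Drop every word that appears as a plain entry in the filter document."""
    # Assumption: entries containing '*' are reduction filters, so skip them.
    excluded = {entry.lower() for entry in entries if "*" not in entry}
    return [word for word in words if word not in excluded]

sample = """<dl>
<dt>a
<dt>able
<dt>about
<dt>caf&eacute;
</dl>"""

entries = read_filter_entries(sample)
kept = remove_excluded(["a", "compiler", "about", "café"], entries)
print(entries, kept)
```

Decoding the entities first means that an entry written as caf&eacute; in an ASCII file still matches the ISO-Latin1 word café.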
Reducing words

The compiler also provides simple, though potentially dangerous, word reduction
filters, which trim or reduce words. Generally, word reduction filters should
be avoided, since they can have unexpected side effects, similar to the filters
used for eliminating URLs from the dependency list or word list.
In addition, word reduction filters slow down compilation, since
each word parsed (there may be several thousand of them) has to be checked against
each filter, until a filter is matched or all the filters have been checked.
Word reduction filters have the same form as URL filters, except
that, instead of being declared on the command line, they are placed in a word
filter document. If a word matches a filter, that word is not eliminated, but
reduced and put back into the word list.
For example, after a first compilation, the word list might contain words (taken
from the text of links) such as:
ftp.javasoft.com
...
splash.javasoft.com
...
www.javasoft.com
In this case, say you are interested in keeping the javasoft part
as a word in the database, and discarding the rest. You can achieve this by
creating the word reduction filter (in your word filter document) as follows:
<dl>
<dt>*javasoft*
...
</dl>
You might think that such filters can be used for reducing plurals, or reducing
adjectives, but this is not the case. If you create word reduction filters
such as:
<dl>
<dt>*s
<dt>*ing
...
</dl>
they will reduce, for example, cards to card and playing
to play, but will also reduce miss to mis
and king to k. Caveat emptor.
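The examples in this section can be reproduced with the following Python sketch. The chapter does not spell out the exact reduction rule, so the sketch rests on an assumption chosen only because it agrees with the examples: a filter with a wildcard at both ends keeps the literal text, while a filter with a single wildcard strips the literal text. The function name reduce_word is hypothetical, and the first matching filter wins, as described above.

```python
def reduce_word(word, filters):
    """Apply the first matching reduction filter; return the word otherwise."""
    for pattern in filters:
        leading = pattern.startswith("*")
        trailing = len(pattern) > 1 and pattern.endswith("*")
        core = pattern[int(leading): len(pattern) - int(trailing)]
        if not core:
            continue
        if leading and trailing and core in word:
            return core                  # assumed: keep only the literal part
        if leading and not trailing and word.endswith(core):
            return word[:-len(core)]     # assumed: strip the literal suffix
        if trailing and not leading and word.startswith(core):
            return word[len(core):]      # assumed: strip the literal prefix
    return word

for word in ["www.javasoft.com", "cards", "playing", "miss", "king"]:
    print(word, "->", reduce_word(word, ["*javasoft*", "*s", "*ing"]))
```

Running the loop shows both the intended reductions and the unwanted ones, which is why such filters are best kept narrow.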
Removing words in specific HTML tags 
The compiler can remove words found in specific tags. There are four such tag
groups:
- -nt: exclude <TITLE> tagged words.
- -nh: exclude <H1> to <H6> and <CAPTION> tagged words.
- -nl: exclude <DT> and <LI> tagged words.
- -nb: exclude words not inside the above listed tags.
The order of filtering
The compiler takes the parsed word list and filters it to produce the final word
list, in the following order:
1. All words are converted to lower case.
2. If any of the -nb, -nh, -nl, or -nt
flags are set, all words corresponding to those HTML tags are removed from
the list.
3. If the -xn flag is set, all numbers are removed from the list.
4. The resulting word list is tested against the word reduction filters; matching
words are removed, reduced, and put back into the list.
5. The resulting word list is tested against the exclusion word lists, and
matching words are removed.
This ordering allows words that were reduced to then be removed.
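The ordering can be pictured as a small pipeline. The Python sketch below is illustrative only: the function name filter_words, the tag group labels ("title", "heading", "list", "body") and the reduce callable are hypothetical stand-ins for the -nt, -nh, -nl, and -nb flags and for the word reduction filters, and it processes one word at a time, which gives the same result as filtering the whole list stage by stage.

```python
def filter_words(tokens, drop_groups=frozenset(), drop_numbers=False,
                 reduce=lambda word: word, excluded=frozenset()):
    """tokens is an iterable of (text, tag_group) pairs."""
    result = []
    for text, group in tokens:
        word = text.lower()                        # lower case first
        if group in drop_groups:                   # tag group flags
            continue
        if drop_numbers and word[:1].isdigit():    # the -xn flag
            continue
        word = reduce(word)                        # reduction filters
        if word and word not in excluded:          # exclusion word lists
            result.append(word)
    return result

tokens = [("WWW.JavaSoft.com", "body"), ("The", "body"),
          ("1.0", "body"), ("Index", "title")]
shrink = lambda word: "javasoft" if "javasoft" in word else word

print(filter_words(tokens, {"title"}, True, shrink, {"the"}))
print(filter_words(tokens, {"title"}, True, shrink, {"the", "javasoft"}))
```

The second call excludes javasoft, which only works because reduction runs before exclusion: the unreduced word www.javasoft.com would not have matched the exclusion list.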
Copyright
© 1987 - 2001 Rational Software Corporation