Full-text search

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).

In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques appeared in the 1960s, for example IBM STAIRS from 1969, and became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as the former AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.^[1]

Indexing[edit]

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning". This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.^[2]

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive".

False-positive problem[edit]

Full-text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives (see Type I error). The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram to the right, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of "bank", clustering can be used to categorize the document/data universe into "financial institution", "place to sit", "place to store" etc. Depending on the occurrences of words relevant to the categories, search terms or a search result can be placed in one or more of the categories. This technique is being extensively deployed in the e-discovery domain.

. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.

Keywords

. Some search engines enable users to limit full text searches to a particular field within a stored data record, such as "Title" or "Author."

Field-restricted search

Boolean queries. Searches that use operators (for example, "encyclopedia" AND "online" NOT "Encarta") can dramatically increase the precision of a full text search. The AND operator says, in effect, "Do not retrieve any document unless it contains both of these terms." The NOT operator says, in effect, "Do not retrieve any document that contains this word." If the retrieval list retrieves too few documents, the OR operator can be used to increase recall; consider, for example, "encyclopedia" AND "online" OR "Internet" NOT "Encarta". This search will retrieve documents about online encyclopedias that use the term "Internet" instead of "online." This increase in precision is very commonly counter-productive since it usually comes with a dramatic loss of recall.^[5]

Boolean

. A phrase search matches only those documents that contain a specified phrase, such as "Wikipedia, the free encyclopedia."

Phrase search

. A search that is based on multi-word concepts, for example Compound term processing. This type of search is becoming popular in many e-discovery solutions.

Concept search

. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.

Concordance search

. A phrase search matches only those documents that contain two or more words that are separated by a specified number of words; a search for "Wikipedia" WITHIN2 "free" would retrieve only those documents in which the words "Wikipedia" and "free" occur within two words of each other.

Proximity search

. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.

Regular expression

will search for document that match the given terms and some variation around them (using for instance edit distance to threshold the multiple variation)

Fuzzy search

. A search that substitutes one or more characters in a search query for a wildcard character such as an asterisk. For example, using the asterisk in a search query "s*n" will find "sin", "son", "sun", etc. in a text.

Full-text search

Indexing[edit]

False-positive problem[edit]

Keywords

Field-restricted search

Boolean

Phrase search

Concept search

Concordance search

Proximity search

Regular expression

Fuzzy search

Wildcard search

Pattern matching

Compound term processing

Enterprise search

Information extraction

Information retrieval

Faceted search

WebCrawler

Search engine indexing