Compound Term Processing
Concept Searching’s unique Compound Term Processing performs matching on the basis of compound terms as opposed to keywords. Compound terms are built by combining two (or more) simple terms. For example, ‘triple’ is a single-word term, but ‘triple heart bypass’ is a compound term. The ambiguity in single-word terms results in inefficient search. For example, does the word ‘triple’ mean three, or is it a baseball term? Does ‘heart’ mean an organ or a center? Is a ‘bypass’ a highway, or does it mean to avoid? A traditional keyword search would return all documents that contain the word ‘triple’, all documents that contain ‘heart’, and all documents that contain ‘bypass’.
By identifying and forming compound (multi-word) terms and placing these in the search engine’s index, the search can be performed with a greater degree of accuracy, because the ambiguity inherent in single words is no longer a problem. A search for ‘survival rate after triple heart bypass’ will locate documents about this topic even if the precise phrase is not contained in any of the documents.
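The idea of indexing multi-word terms can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Concept Searching’s actual algorithm: compound terms are approximated as runs of up to three adjacent words after stop-word removal, and the stop-word list, the function names, the sample documents and the scoring rule (a matching term of n words scores n squared) are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "the", "of", "after", "for", "in", "on", "to", "and", "is", "at"}

def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def compound_terms(text, max_len=3):
    """Return single words plus every run of 2..max_len adjacent words."""
    words = tokenize(text)
    terms = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms

def build_index(docs, max_len=3):
    """Map each single or compound term to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in compound_terms(text, max_len):
            index[term].add(doc_id)
    return index

def search(index, query, max_len=3):
    """Rank documents; a matching term of n words adds n * n to the score,
    so compound matches outweigh isolated keyword matches (assumed rule)."""
    scores = defaultdict(int)
    for term in compound_terms(query, max_len):
        n = len(term.split())
        for doc_id in index.get(term, ()):
            scores[doc_id] += n * n
    return sorted(scores, key=lambda d: (-scores[d], d))

# Invented sample documents, one per sense of the ambiguous words.
docs = {
    "cardiology": "Patient survival following a triple heart bypass operation.",
    "baseball": "He hit a triple, showing real heart at the plate.",
    "roads": "The new highway bypass opens in June.",
}
index = build_index(docs)
ranking = search(index, "survival rate after triple heart bypass")
print(ranking)
```

None of the sample documents contains the query as an exact phrase, yet the cardiology document ranks first because it shares the compound term ‘triple heart bypass’ with the query. The other two documents match only on single words and fall to the bottom of the list.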
The Metadata Issue
Metadata generation is a growing concern in large enterprises. A comprehensive approach requires more than syntactic metadata (such as date, author, and title), and requiring end users to add rich metadata is haphazard and subjective at best. Since Concept Searching’s technology is not restricted to keyword identification, compound term metadata can be generated automatically when the content is either created or ingested. Concept-based metadata generation extracts, from a document or corpus of documents, the compound terms and keywords that are highly correlated to a particular concept. By identifying the most significant patterns in any text, these compound terms can then be used to generate non-subjective metadata based on an understanding of conceptual meaning.
The ability to identify ‘concepts in context’ generates far richer metadata, improving precision and relevancy in the information retrieval process. Meta-tags are automatically added to the properties field of each document. This makes the document more valuable to the organization, because it is more readily retrieved by Microsoft Search Products that use keywords and metadata to retrieve information.
Precision versus Recall
Precision and recall are the two key performance measures for information retrieval. Precision is the proportion of retrieved items that are relevant to the query; a perfectly precise search retrieves only relevant items. Recall is the proportion of relevant items that are retrieved; a search with perfect recall retrieves all of them. Yet most information retrieval technologies are less than 22% accurate for both precision and recall. The ideal is to balance the two, since a gain in one usually comes at the expense of the other. Compound Term Processing has the ability to increase precision with no loss of recall.
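As a worked illustration of the two measures, the short Python sketch below computes precision and recall for two result sets. The document identifiers and result sets are invented for the example and are not measured results; the function name is likewise hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items,
    given the set of items that are truly relevant to the query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented data: four documents are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A keyword search finds all four but also returns four irrelevant documents.
keyword_results = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}

# A more selective search still finds all four, with one irrelevant document.
compound_results = {"d1", "d2", "d3", "d4", "d5"}

print(precision_recall(keyword_results, relevant))   # (0.5, 1.0)
print(precision_recall(compound_results, relevant))  # (0.8, 1.0)
```

In this made-up case the second result set raises precision from 0.5 to 0.8 while recall stays at 1.0, which is the kind of improvement described above: fewer irrelevant results with no relevant document lost.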
Managing Content
Taxonomy development and maintenance have traditionally been a laborious, ongoing and costly challenge. The most effective approach is to use rules-based categorization, which gives enterprises complete control of rules-based descriptors unique to their organization. Since all rules can be defined and managed, the error-prone results of the ‘training’ algorithms typically found in other approaches are eliminated.
A concept-based automatic classification process identifies, during indexing, the categories to which each document belongs. Each category is identified by a unique descriptor and is associated with key descriptive words and/or phrases held in the database. This approach enables rapid implementation of a corporate taxonomy, with all documents classified to multiple nodes at index time. Ideally, the taxonomy can be used to browse the document collection or as a filter when running ad hoc searches.
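A minimal Python sketch of this kind of rules-based classification follows. It assumes each taxonomy node is a descriptor with a set of weighted clue words and phrases, and that a document is assigned to every node whose matched clues reach a threshold. The taxonomy contents, the weights, the threshold and the function name are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical taxonomy: unique descriptor -> clue phrases with weights.
TAXONOMY = {
    "Medicine/Cardiology": {"heart bypass": 3, "cardiac surgery": 3, "heart": 1},
    "Medicine/Oncology": {"chemotherapy": 3, "tumour": 2},
    "Transport/Roads": {"highway bypass": 3, "highway": 1},
}

def classify(text, taxonomy=TAXONOMY, threshold=3):
    """Return {descriptor: score} for every node whose matched clue
    weights add up to at least the threshold (assumed scoring rule)."""
    normalized = " ".join(text.lower().split())
    result = {}
    for descriptor, clues in taxonomy.items():
        score = sum(
            weight
            for phrase, weight in clues.items()
            # Match whole words or phrases only, not fragments of longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", normalized)
        )
        if score >= threshold:
            result[descriptor] = score
    return result

doc = ("Recovery after heart bypass surgery can be delayed when the "
       "hospital is only reachable via the highway bypass.")
nodes = classify(doc)
print(nodes)
```

Because the rules are explicit, the sample document is classified to two nodes at once, and it is easy to see which clues were responsible. The resulting descriptors could then be stored with the document at index time and used as a browse path or search filter.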
An easy-to-use taxonomy and automatic classification tool creates the framework to classify content, based on concepts, to one or more nodes in the taxonomy. Features that enable Subject Matter Experts to interact with the taxonomy can simplify ongoing maintenance. For example, automatically generating compound term clues from the document corpus, dynamically showing the effect of changes on the taxonomy, and weighting classes according to their parent, child, and sibling nodes can reduce taxonomy development and ongoing maintenance effort by 66%-80%.