METHOD AND APPARATUS FOR GENERATION AND AUGMENTATION OF SEARCH TERMS FROM EXTERNAL AND INTERNAL SOURCES
First Claim
1. A method for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository, comprising:
extracting candidate search terms from unstructured published content by any of:
named entity extraction (NEE);
topic detection and tracking (TDT);
direct human intervention;
natural language processing; and
a combination of NEE, TDT, direct human intervention, and natural language processing;
storing said candidate search terms in a historical database of candidate search terms;
storing a history of said extracted search term candidates;
extracting verified search terms from internal sources of said repository;
matching candidate search terms against verified search terms by applying linguistic edit distance techniques to obtain plausible linguistic variants of verified search terms;
using said linguistic variants to generate augmented verified search terms;
storing a history of said augmented verified search terms;
establishing a set of null search terms comprising candidate search terms having a threshold incidence count in said history of said extracted search term candidates and in said history of said augmented verified search terms; and
adding a set of search terms comprising any of said augmented verified search terms and said null search terms to any of an automatic speech recognition or natural language processing system.
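The matching step in the claim above can be illustrated with a minimal sketch. The claim names "linguistic edit distance techniques" without specifying one, so plain Levenshtein distance is used here as an assumption. The function names `edit_distance` and `find_variants`, the `max_distance` cutoff of 2, and the sample terms are all hypothetical and do not come from the patent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def find_variants(candidates, verified, max_distance=2):
    """Map each verified term to the candidate terms within max_distance edits.

    max_distance is an illustrative cutoff, not a value from the patent.
    Exact matches (ignoring case) are skipped because they are not variants.
    """
    variants = {}
    for term in verified:
        close = [
            c for c in candidates
            if c.lower() != term.lower()
            and edit_distance(c.lower(), term.lower()) <= max_distance
        ]
        if close:
            variants[term] = close
    return variants


if __name__ == "__main__":
    candidates = ["John Stewart", "Radiohead", "Jon Stewart"]
    verified = ["Jon Stewart"]
    print(find_variants(candidates, verified))
```

In this sketch a verified term such as "Jon Stewart" picks up the nearby spelling "John Stewart" as a plausible variant, while unrelated candidates are left for later steps.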
Abstract
A method and apparatus for identifying names, personalities, titles, and topics that are present in a repository, and those that are not present in it, use information from external data sources, notably the text used in non-speech, text-based searches, to expand the search terms. The expansion takes two forms: (1) finding plausible linguistic variants of existing search terms that are already comprehended in the repository, but that are present under slightly different names; and (2) expanding the existing search term list with items that should be there by virtue of their currency in popular culture, but which for whatever reason have not yet been reflected with content items in the repository.
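The second form of expansion can be sketched as a simple counting step. This is only one possible reading: it assumes the external sources are sampled as periodic snapshots, that "currency in popular culture" can be approximated by how many snapshots mention a term, and that a term is new when it is absent from the repository's term list. The function name `select_new_terms`, the `threshold` value, and the sample data are hypothetical.

```python
from collections import Counter


def select_new_terms(candidate_history, repository_terms, threshold=3):
    """Return candidate terms seen in at least `threshold` snapshots
    that are not already among the repository's terms.

    candidate_history: list of snapshots, each a list of candidate terms.
    threshold: illustrative cutoff; the abstract does not give a value.
    """
    # Count each term at most once per snapshot, ignoring case.
    counts = Counter(
        term
        for snapshot in candidate_history
        for term in {t.lower() for t in snapshot}
    )
    known = {t.lower() for t in repository_terms}
    return sorted(
        term for term, n in counts.items()
        if n >= threshold and term not in known
    )


if __name__ == "__main__":
    history = [
        ["new artist", "old favorite"],
        ["new artist", "one-off story"],
        ["New Artist", "old favorite"],
        ["old favorite"],
    ]
    print(select_new_terms(history, repository_terms=["Old Favorite"]))
```

Counting per snapshot rather than per mention keeps a single source that repeats a term many times from dominating the result. That is a design choice of this sketch, not something the abstract requires.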
6 Claims
1. A method for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository, comprising:
extracting candidate search terms from unstructured published content by any of:
named entity extraction (NEE);
topic detection and tracking (TDT);
direct human intervention;
natural language processing; and
a combination of NEE, TDT, direct human intervention, and natural language processing;
storing said candidate search terms in a historical database of candidate search terms;
storing a history of said extracted search term candidates;
extracting verified search terms from internal sources of said repository;
matching candidate search terms against verified search terms by applying linguistic edit distance techniques to obtain plausible linguistic variants of verified search terms;
using said linguistic variants to generate augmented verified search terms;
storing a history of said augmented verified search terms;
establishing a set of null search terms comprising candidate search terms having a threshold incidence count in said history of said extracted search term candidates and in said history of said augmented verified search terms; and
adding a set of search terms comprising any of said augmented verified search terms and said null search terms to any of an automatic speech recognition or natural language processing system.
(Dependent claim: 2)
3. An apparatus for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository, comprising:
a plurality of external data sources, comprising non-speech, published lists of the text of frequent searches presented to popular text-based search engines, published lists of popular artists and song titles, published lists of most popular tags, published lists of most-emailed stories, and published news feeds;
a processor configured for extracting search term candidates from said external sources, the step of extracting further comprising:
extracting candidate search terms from at least one document from among a plurality of documents available from a plurality of sources of unstructured published content available over a computer network, wherein said sources of unstructured published content at least includes sources selected from among a group of sources consisting of published lists of most-emailed stories and published news feeds, and wherein extracting further comprises an automatic extraction means selected from among:
named entity extraction (NEE);
topic detection and tracking (TDT);
direct human intervention; and
a combination of NEE, TDT, and direct human intervention;
storing said candidate search terms in a historical database of candidate search terms;
said processor configured for extracting verified search terms from one or more internal sources;
said processor configured for expanding search terms entered using information from said external data sources, said means for expanding search terms comprising means for matching candidate search terms against verified search terms by applying linguistic edit distance techniques to obtain plausible linguistic variants of verified search terms and further comprising any of:
said processor configured for finding plausible linguistic variants of existing search terms that are already comprehended in the repository, but that are under slightly different names; and
said processor configured for expanding an existing search term list with items that should be in said list by virtue of their currency in popular culture, but which for whatever reason have not yet been reflected with content items in the repository;
said processor configured for using said linguistic variants to generate augmented verified search terms;
said processor configured for storing said augmented verified search terms in a historical database of verified search terms;
said processor configured for establishing a set of null search terms comprising candidate search terms having a high incidence count in said historical database of candidate search terms and in said historical database of verified search terms; and
said processor configured for adding said set of search terms comprising any of said augmented verified search terms and said null search terms to any of an automatic speech recognition or natural language processing system.
(Dependent claims: 4, 5, 6)
Specification