Method and apparatus for generation and augmentation of search terms from external and internal sources

US 8,321,427 B2
Filed: 10/31/2007
Issued: 11/27/2012
Est. Priority Date: 10/31/2002
Status: Active Grant

Abstract

A method and apparatus to identify names, personalities, titles, and topics that are present in a repository, and place them into a grammar, and to identify names, personalities, titles, and topics that are not present in the repository, and place them into a grammar, uses information from external data sources, notably the text used in non-speech, text-based searches, to expand the search terms entered into the ASR grammars. The expansion takes place in two forms: (1) finding plausible linguistic variants of existing search terms that are already comprehended in the repository, but that are present under slightly different names; and (2) expanding the existing search term list with items that should be there by virtue of their currency in popular culture, but which for whatever reason have not yet been reflected with content items in the repository.
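
The first form of expansion, finding plausible linguistic variants of terms already in the repository, can be illustrated with a small sketch. This is a minimal illustration only, not the patented implementation: it assumes plain Levenshtein distance, case-insensitive comparison, and a cutoff of two edits. The function names (`levenshtein`, `find_variants`), the `max_distance` parameter, and the sample terms are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_variants(candidates, verified, max_distance=2):
    """Map each verified term to the candidates within max_distance edits of it.

    Exact matches (distance 0) are skipped because they are not variants.
    Case-insensitive comparison is an assumption of this sketch.
    """
    variants = {}
    for v in verified:
        for c in candidates:
            d = levenshtein(c.lower(), v.lower())
            if 0 < d <= max_distance:
                variants.setdefault(v, []).append(c)
    return variants


# Hypothetical data: verified terms come from the repository,
# candidate terms from external published lists.
verified_terms = ["Beyonce Knowles", "The Sopranos"]
candidate_terms = ["Beyonce Knowels", "The Soprano", "Harry Potter"]
variants = find_variants(candidate_terms, verified_terms)
```

In this example the misspelled and truncated candidates are attached to their verified counterparts, while "Harry Potter" matches nothing and would remain a candidate for the second form of expansion.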

97 Citations

7 Claims

1. A method for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository, and for placing them into a grammar for use in an automatic speech recognition (ASR) system, comprising the steps of:
- extracting search term candidates from published lists of the text of frequent searches presented to popular text-based search engines, published lists of popular artists and song titles, published lists of most popular tags, published lists of most-emailed stories, and published news feeds, the step of extracting further comprising:
  - automatically identifying explicitly marked candidate search terms from at least one structured published list of content; and
  - extracting candidate search terms from unstructured published content by performing an extraction means selected from among:
    - available named entity extraction (NEE);
    - topic detection and tracking (TDT);
    - direct human intervention; and
    - a combination of NEE, TDT, and direct human intervention;
- storing said candidate search terms in a historical database of candidate search terms;
- storing a history of said extracted search term candidates;
- extracting verified search terms from internal sources of said repository;
- matching candidate search terms against verified search terms by edit distance techniques to obtain plausible linguistic variants of verified search terms;
- using said linguistic variants to generate augmented verified search terms;
- storing a history of said augmented verified search terms;
- establishing a set of null search terms comprising candidate search terms having a threshold incidence count in said history of said extracted search term candidates and in said history of said augmented verified search terms; and
- expanding said grammar by adding said set of null search terms to said grammar of said automatic speech recognition (ASR) system.
- Dependent claims (2, 5):
  - 2. The method of claim 1, said internal sources comprising explicitly marked titles, authors, artist names, etc. that are associated to content elements in said repository.
  - 5. The method of claim 1, said internal sources comprising:
    - sources obtained by application of named entity extraction (NEE) and/or topic detection and tracking (TDT) methods to descriptive text associated to content elements in said repository.
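
The last two steps of claim 1 (establishing null search terms by incidence count, then expanding the grammar) can be sketched as follows. The claim does not spell out how the two histories are combined or how an incidence is counted, so this sketch makes explicit assumptions: each history is a list of dated snapshots, a term counts once per snapshot, the counts from both histories are summed, and the grammar is modelled as a plain set of phrases. All names (`incidence_counts`, `null_search_terms`, `expand_grammar`) and the sample data are hypothetical.

```python
from collections import Counter


def incidence_counts(history):
    """Count in how many snapshots of a history each term appears."""
    counts = Counter()
    for snapshot in history:
        counts.update(set(snapshot))  # a term counts once per snapshot
    return counts


def null_search_terms(candidate_history, augmented_history, threshold):
    """Select candidate terms whose combined incidence reaches the threshold.

    Summing the counts from both histories is an assumption of this sketch.
    """
    combined = incidence_counts(candidate_history) + incidence_counts(augmented_history)
    candidates_seen = {term for snapshot in candidate_history for term in snapshot}
    return {term for term in candidates_seen if combined[term] >= threshold}


def expand_grammar(grammar, null_terms):
    """Return a new grammar (here just a set of phrases) including the null terms."""
    return set(grammar) | set(null_terms)


# Hypothetical histories: one list of terms per collection run.
candidate_history = [
    ["new artist", "one-off meme"],
    ["new artist"],
    ["new artist", "rising show"],
]
augmented_history = [["rising show"], ["rising show"]]

null_terms = null_search_terms(candidate_history, augmented_history, threshold=3)
grammar = expand_grammar({"the sopranos"}, null_terms)
```

A real ASR grammar would carry pronunciations and weights rather than bare strings; the set is used here only to show where the selected terms end up.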

3. An apparatus for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository, and for placing them into a grammar, comprising:
- a plurality of external data sources, comprising non-speech, published lists of the text of frequent searches presented to popular text-based search engines, published lists of popular artists and song titles, published lists of most popular tags, published lists of most-emailed stories, and published news feeds;
- means for extracting search term candidates from said external sources, wherein search term candidates are either explicitly marked candidates or extracted candidates, the step of extracting further comprising:
  - automatically identifying explicitly marked candidate search terms from at least one structured published list of content from among a plurality of structured lists of content available over a computer network, wherein said structured published lists of content are organized by an attribute selected from among a group of attributes consisting of:
    - popular search engine search terms, popular artists, popular songs, and popular news feed tags; and
  - extracting candidate search terms from at least one document from among a plurality of documents available from a plurality of sources of unstructured published content available over a computer network, wherein said sources of unstructured published content at least includes sources selected from among a group of sources consisting of published lists of most-emailed stories and published news feeds, and wherein extracting further comprises an automatic extraction means selected from among:
    - named entity extraction (NEE);
    - topic detection and tracking (TDT);
    - direct human intervention; and
    - a combination of NEE, TDT, and direct human intervention;
- storing said candidate search terms in a historical database of candidate search terms;
- means for extracting verified search terms from said internal sources;
- means for expanding search terms entered into one or more automatic speech recognition ASR grammars by using information from said external data sources, said means for expanding search terms comprising means for matching candidate search terms against verified search terms by edit distance techniques to obtain plausible linguistic variants of verified search terms and further comprising any of:
  - means for finding plausible linguistic variants of existing search terms that are already comprehended in the repository, but that are under slightly different names; and
  - means for expanding an existing search term list with items that should be in said list by virtue of their currency in popular culture, but which for whatever reason have not yet been reflected with content items in the repository;
- means for using said linguistic variants to generate augmented verified search terms;
- means for storing said augmented verified search terms in a historical database of verified search terms;
- means for establishing a set of null search terms comprising candidate search terms having a high incidence count in said historical database of candidate search terms and in said historical database of verified search terms; and
- means for expanding said grammar by adding said set of null search terms to said grammar of said automatic speech recognition (ASR) system.
- Dependent claims (4, 6):
  - 4. The apparatus of claim 3, said internal sources comprising any of:
    - explicitly marked titles, authors, artist names, etc. that are associated to content elements in said repository.
  - 6. The apparatus of claim 3, said internal sources comprising:
    - sources obtained by application of named entity extraction (NEE) and/or topic detection and tracking (TDT) methods to descriptive text associated to content elements in said repository.

7. A method for identifying names, personalities, titles, and topics, whether or not said names, personalities, titles and topics are present in a given repository and for placing them into a grammar for use in an automatic speech recognition (ASR) system, comprising the steps of:
- extracting search term candidates from external sources, wherein search term candidates are either explicitly marked candidates or extracted candidates, the step of extracting further comprising:
  - automatically identifying explicitly marked candidate search terms from at least one structured published list of content from among a plurality of structured lists of content available over a computer network, wherein said structured published lists of content are organized by an attribute selected from among a group of attributes consisting of:
    - popular search engine search terms, popular artists, popular songs, and popular news feed tags; and
  - extracting candidate search terms from at least one document from among a plurality of documents available from a plurality of sources of unstructured published content available over a computer network, wherein said sources of unstructured published content at least includes sources selected from among a group of sources consisting of published lists of most-emailed stories and published news feeds, and wherein extracting further comprises an automatic extraction means selected from among:
    - named entity extraction (NEE);
    - topic detection and tracking (TDT);
    - direct human intervention; and
    - a combination of NEE, TDT, and direct human intervention;
- storing said candidate search terms in a historical database of candidate search terms;
- extracting verified search terms from internal sources;
- matching candidate search terms against verified search terms by edit distance techniques to obtain plausible linguistic variants of verified search terms that were extracted from said internal sources;
- using said linguistic variants to generate augmented verified search terms;
- storing said augmented verified search terms in a historical database of verified search terms;
- establishing a set of null search terms comprising candidate search terms having a high incidence count in said historical database of candidate search terms and in said historical database of verified search terms; and
- expanding said grammar by adding said set of null search terms to said grammar of said automatic speech recognition (ASR) system.
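
The distinction the claims draw between explicitly marked candidates (structured lists) and extracted candidates (unstructured content) might look like this in code. This is a rough sketch under stated assumptions: the structured list is imagined as a small XML document with a hypothetical `<term>` element, and the unstructured path uses a naive capitalized-phrase regular expression purely as a stand-in for real named entity extraction (NEE), which the patent does not specify. The function names and sample inputs are illustrative only.

```python
import re
import xml.etree.ElementTree as ET


def extract_marked_terms(feed_xml, tag="term"):
    """Structured source: terms are explicitly marked, so just read them out.

    The <term> element name is a hypothetical feed format.
    """
    root = ET.fromstring(feed_xml)
    return [el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()]


# Stand-in for NEE: runs of two or more capitalized words.
# A production system would use a trained named entity extractor instead.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_unmarked_terms(text):
    """Unstructured source: guess candidate names from running text."""
    return CAPITALIZED_RUN.findall(text)


# Hypothetical inputs.
popular_artists_xml = "<list><term>Norah Jones</term><term>Coldplay</term></list>"
news_text = "Fans lined up early to see Norah Jones perform in Austin on Friday."

marked = extract_marked_terms(popular_artists_xml)
unmarked = extract_unmarked_terms(news_text)
```

The heuristic misses single-word names such as "Austin" and would wrongly pick up capitalized headline text, which is one reason the claims also allow topic detection and tracking or direct human intervention.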

Current Assignee: Promptu Systems Corporation
Original Assignee: Promptu Systems Corporation
Inventors: Printz, Harry; Stampleman, Joseph Bruce
Primary Examiner(s): Hu, Jensen
Application Number: US11/930,951
Publication Number: US 20080104072A1
Time in Patent Office: 1,854 Days
Field of Search: 707/3, 707/999.003, 707/736, 707/749
US Class Current: 707/749

CPC Class Codes

G06F 16/95   Retrieval from the web
G06F 16/9535   Search customisation based ...
G06Q 30/02   Marketing; Price estimation...
G10L 15/02   Feature extraction for spee...
G10L 15/142   Hidden Markov Models [HMMs]
G10L 15/18   using natural language mode...
G10L 15/187   Phonemic context, e.g. pron...
G10L 15/22   Procedures used during a sp...
G10L 17/26   Recognition of special voic...
G10L 2015/025   Phonemes, fenemes or fenone...
