Methods and apparatus for semantic unit based automatic indexing and searching in data archive systems

US 7,177,795 B1
Filed: 11/10/1999
Issued: 02/13/2007
Est. Priority Date: 11/10/1999
Status: Expired due to Term

Abstract

An audio-based data indexing and retrieval system for processing audio-based data associated with a particular language, comprising: (i) memory for storing the audio-based data; (ii) a semantic unit based speech recognition system for generating a textual representation of the audio-based data, the textual representation being in the form of one or more semantic units corresponding to the audio-based data; (iii) an indexing and storage module, operatively coupled to the semantic unit based speech recognition system and the memory, for indexing the one or more semantic units and storing the one or more indexed semantic units; and (iv) a search engine, operatively coupled to the indexing and storage module and the memory, for searching the one or more indexed semantic units for a match with one or more semantic units associated with a user query, and for retrieving the stored audio based data based on the one or more indexed semantic units. The semantic unit may preferably be a syllable or morpheme. Further, the invention is particularly well suited for use with Asian and Slavic languages.
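The indexing and storage module described in the abstract can be pictured as an inverted index from semantic units to time-stamped locations in the stored audio. The following is only a minimal sketch under stated assumptions: the names `Posting`, `SemanticUnitIndex`, and `locate`, the recording identifier `tape-001`, and the sample syllables and time stamps are all hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """Where one recognized semantic unit occurs in the stored audio."""
    recording_id: str   # hypothetical identifier of a stored audio file
    start_sec: float    # time stamp where the unit begins
    end_sec: float      # time stamp where the unit ends


class SemanticUnitIndex:
    """Inverted index: semantic unit (e.g. a syllable) -> list of postings."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, unit, recording_id, start_sec, end_sec):
        """Index one semantic unit together with its location."""
        self._postings[unit].append(Posting(recording_id, start_sec, end_sec))

    def locate(self, unit):
        """Return the stored locations that correspond directly to `unit`."""
        return list(self._postings.get(unit, []))


index = SemanticUnitIndex()
# Hypothetical recognizer output: (syllable, start, end) for one recording.
decoded = [("ni", 0.00, 0.21), ("hao", 0.21, 0.48), ("ma", 0.48, 0.70)]
for unit, start, end in decoded:
    index.add(unit, "tape-001", start, end)

hits = index.locate("hao")
```

Here `hits` holds one posting that points at the segment of `tape-001` between 0.21 and 0.48 seconds, which is one way to read the "direct correspondence" between an indexed unit and a stored segment.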

29 Claims

1. A method of processing audio-based data associated with a particular language, the method comprising the steps of:
- storing the audio-based data;
- generating a textual representation of the audio-based data, the textual representation being in the form of one or more semantic units corresponding to the audio-based data, wherein each of at least a portion of the one or more semantic units comprise a sub-unit of a word and not a complete word itself; and
- indexing the one or more semantic units and storing the one or more indexed semantic units for use in searching the stored audio-based data in response to a user query, wherein at least one segment of the stored audio-based data is retrievable by obtaining a location indicative of where the at least one segment is stored from a direct correspondence between at least one of the indexed semantic units and the at least one segment.
- Dependent claims 2 to 27:
  - 2. The method of claim 1, wherein the semantic unit is a syllable.
  - 3. The method of claim 2, wherein the syllable is a phonetically based syllable.
  - 4. The method of claim 3, wherein a phonetically-based syllable comprises a toneme.
  - 5. The method of claim 3, wherein two or more different pronunciations are associated with a phonetically-based syllable.
  - 6. The method of claim 1, wherein the semantic unit is a morpheme.
  - 7. The method of claim 1, wherein the generating step comprises decoding the audio-based data in accordance with a speech recognition system.
  - 8. The method of claim 7, wherein the speech recognition system employs a syllable language model.
  - 9. The method of claim 8, wherein production of the syllable language model comprises the steps of:
    - transcribing audio data to generate syllables;
    - deriving conditional probabilities of distribution based on the generated syllables; and
    - using syllable counts and the conditional probabilities to construct the syllable language model.
  - 10. The method of claim 7, wherein the speech recognition system employs a semantic unit based language model.
  - 11. The method of claim 1, wherein the indexing step comprises time stamping the one or more semantic units.
  - 12. The method of claim 1, wherein the searching step comprises:
    - processing the user query to generate one or more semantic units representing the information that the user seeks to retrieve;
    - searching the one or more indexed semantic units to find a substantial match with the one or more semantic units associated with the user query; and
    - retrieving one or more segments of the audio-based data using the one or more indexed semantic units that match the one or more semantic units associated with the user query.
  - 13. The method of claim 12, wherein the searching step further comprises presenting the retrieved data to the user.
  - 14. The method of claim 1, wherein the particular language is an Asian based language.
  - 15. The method of claim 14, wherein the particular language is Chinese.
  - 16. The method of claim 15, wherein the semantic unit is a Chinese character.
  - 17. The method of claim 1, wherein the particular language is a Slavic based language.
  - 18. The method of claim 1, wherein the one or more semantic units are indexed according to speaker attributes.
  - 19. The method of claim 1, wherein the one or more semantic units are indexed according to at least one of when the audio based data was produced and where the audio based data was produced.
  - 20. The method of claim 1, further comprising the step of storing video based data associated with the audio based data for use in searching the stored audio based data and the video based data in response to a user query.
  - 21. The method of claim 20, wherein the searching step includes a hierarchical search routine.
  - 22. The method of claim 1, wherein the generating step comprises stenographically transcribing the audio-based data to generate the textual representation.
  - 23. The method of claim 1, wherein the user query comprises a word.
  - 24. The method of claim 23, wherein the searching step further comprises transforming the word into a sequence of syllables using a text-to-phonetic syllable map.
  - 25. The method of claim 1, wherein the generating step comprises producing the textual representation via stenography.
  - 26. The method of claim 1, wherein the searching step comprises use of a hierarchical index.
  - 27. The method of claim 1, wherein the searching step comprises use of an automatic boundary marking system.
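The three steps recited in claim 9 (transcribe to syllables, derive conditional probabilities, combine them with syllable counts) can be sketched as follows. This is an illustration only: the claim does not fix an n-gram order or a smoothing scheme, so the bigram order and the unsmoothed maximum-likelihood estimate used here are assumptions, as are the function name `build_syllable_bigram_model` and the sample transcripts.

```python
from collections import Counter


def build_syllable_bigram_model(transcripts):
    """Return (unigram_counts, cond_prob) from syllable-transcribed utterances.

    cond_prob[(prev, cur)] is the maximum-likelihood estimate
    count(prev, cur) / count(prev used as a left context).
    Assumption: bigram order, no smoothing.
    """
    unigram_counts = Counter()
    context_counts = Counter()
    bigram_counts = Counter()
    for syllables in transcripts:
        unigram_counts.update(syllables)
        # Pair each syllable with its successor inside the same utterance.
        for prev, cur in zip(syllables, syllables[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    cond_prob = {
        pair: count / context_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }
    return unigram_counts, cond_prob


# Hypothetical syllable transcripts standing in for transcribed audio data.
transcripts = [
    ["in", "dex", "ing"],
    ["in", "dex"],
    ["in", "put"],
]
counts, cond_prob = build_syllable_bigram_model(transcripts)
```

With these transcripts, `counts["in"]` is 3 and `cond_prob[("in", "dex")]` is 2/3, since "dex" follows "in" in two of the three utterances. A practical model would likely add smoothing for unseen syllable pairs.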

28. Apparatus for processing audio-based data associated with a particular language, the apparatus comprising:
- a memory; and
- at least one processor coupled to the memory and operative to:
  - (i) store the audio-based data in the memory;
  - (ii) generate a textual representation of the audio-based data, the textual representation being in the form of one or more semantic units corresponding to the audio-based data, wherein each of at least a portion of the one or more semantic units comprise a sub-unit of a word and not a complete word itself; and
  - (iii) index the one or more semantic units and store the one or more indexed semantic units for use in searching the stored audio-based data in response to a user query, wherein at least one segment of the stored audio-based data is retrievable by obtaining a location indicative of where the at least one segment is stored from a direct correspondence between at least one of the indexed semantic units and the at least one segment.

29. An audio-based data indexing and retrieval system for processing audio-based data associated with a particular language, the system comprising:
- memory for storing the audio-based data;
- a semantic unit based speech recognition system for generating a textual representation of the audio-based data, the textual representation being in the form of one or more semantic units corresponding to the audio-based data, wherein each of at least a portion of the one or more semantic units comprise a sub-unit of a word and not a complete word itself;
- an indexing and storage module, operatively coupled to the semantic unit based speech recognition system and the memory, for indexing the one or more semantic units and storing the one or more indexed semantic units; and
- a search engine, operatively coupled to the indexing and storage module and the memory, for searching the one or more indexed semantic units for a match with one or more semantic units associated with a user query, and for retrieving the stored audio based data based on the one or more indexed semantic units, wherein at least one segment of the stored audio-based data is retrievable by obtaining a location indicative of where the at least one segment is stored from a direct correspondence between at least one of the indexed semantic units and the at least one segment.
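The search engine element, read together with claims 12 and 24, suggests a flow in which a query word is transformed into syllables and matched against the indexed units. The sketch below is a simplified illustration under stated assumptions: `SYLLABLE_MAP`, `INDEXED_UNITS`, `query_to_syllables`, and `find_segments` are hypothetical names with made-up data, and exact consecutive matching stands in for the "substantial match" of claim 12, which the claims do not define further.

```python
# Hypothetical text-to-phonetic-syllable map; a real one would be far larger.
SYLLABLE_MAP = {"beijing": ["bei", "jing"], "shanghai": ["shang", "hai"]}

# Hypothetical indexed recognizer output: (syllable, start time in seconds).
INDEXED_UNITS = [
    ("wo", 0.0), ("zai", 0.3), ("bei", 0.6),
    ("jing", 0.9), ("gong", 1.2), ("zuo", 1.5),
]


def query_to_syllables(word):
    """Transform a query word into syllables; empty list if it is unmapped."""
    return SYLLABLE_MAP.get(word.lower(), [])


def find_segments(query_word, indexed_units):
    """Return start times where the query's syllables occur consecutively.

    Assumption: exact sequence matching, a simplification of the
    "substantial match" recited in claim 12.
    """
    target = query_to_syllables(query_word)
    if not target:
        return []
    units = [unit for unit, _ in indexed_units]
    n = len(target)
    return [
        indexed_units[i][1]
        for i in range(len(units) - n + 1)
        if units[i:i + n] == target
    ]


starts = find_segments("Beijing", INDEXED_UNITS)
```

Each returned start time gives the location from which the matching audio segment could be retrieved and presented to the user. A word missing from the map, such as "Tokyo" here, yields no segments.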

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Kanevsky, Dimitri; Chen, Chengjun Julian
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Spooner, Lamont M
Application Number: US09/437,971
Time in Patent Office: 2,652 Days
Field of Search: 707/536, 707/535, 707/533, 704/7, 704/8, 704/9, 704/235, 704/2, 704/10, 704/231, 704/251
US Class Current: 704/9
CPC Class Codes: G06F 16/685 using automatically derived...; G10L 15/1815 Semantic context, e.g. disa...
