Phrase recognition method and apparatus

US 5,819,260 A
Filed: 01/22/1996
Issued: 10/06/1998
Est. Priority Date: 01/22/1996
Status: Expired due to Term

Abstract

A phrase recognition method breaks streams of text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching. The phrase recognition method uses a carefully assembled list of partition elements to partition the text into the chunks, and selects phrases from the chunks according to a small number of frequency based definitions. The method can also incorporate additional processes such as categorization of proper names to enhance phrase recognition. The method selects phrases quickly and efficiently, referring simply to the phrases themselves and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or on collocation schemes of limited applicability, or on heuristical text analysis of limited reliability or utility.
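
As a rough illustration of the chunking and counting the abstract describes, the following minimal sketch splits text at punctuation and at a small set of partition words, then counts the resulting chunks. The names `PARTITION_WORDS`, `partition`, and `chunk_frequencies` are hypothetical, and the partition list shown is a tiny stand-in assumed for the example, not the list used in the patent.

```python
import re
from collections import Counter

# Hypothetical partition list for illustration; the patent's own list is not shown here.
PARTITION_WORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "or", "to", "by"}

def partition(text):
    """Split text into chunks, breaking at punctuation and at partition words."""
    chunks, current = [], []
    # Tokens are either ASCII alphanumeric words or single other characters.
    for token in re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text):
        if token.isalnum() and token.lower() not in PARTITION_WORDS:
            current.append(token)  # token belongs to the chunk being built
        elif current:
            chunks.append(" ".join(current))  # a partition entity closes the chunk
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_frequencies(text):
    """Count how often each chunk occurs in the text."""
    return Counter(partition(text))

text = ("The search engine is in the data center. "
        "A search engine is the core of the data center.")
print(chunk_frequencies(text).most_common())
```

Splitting directly at partition entities has the same effect as first substituting a partition tag and then splitting on the tag; the direct form is used here only to keep the sketch short.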

30 Claims

1. A computer-implemented method of processing a stream of document text to form a list of phrases that are indicative of conceptual content of the document, the phrases being used as index terms and search query terms in full text document searching performed after the phrase list is formed, the method comprising:
- partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  selecting certain chunks as the phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of document text.
- - 2. The method of claim 1, wherein the partitioning step includes:
    - scanning a portion of the document text stream;
      
      comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      substituting a partition tag for portions of the document text stream which match a partition entity;
      
      generating a text chunk list;
      
      scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks.
  - 3. The method of claim 1, wherein the selecting step includes:
    - selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.
  - 4. The method of claim 1, wherein:
    - a) the partitioning step includes:
      
      a1) scanning a portion of the document text stream;
      
      a2) comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      a3) substituting a partition tag for portions of the document text stream which match a partition entity;
      
      a4) generating a text chunk list;
      
      a5) scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      a6) revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and
      
      b) the selecting step includes selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.
  - 5. The method of claim 1, wherein the selecting step includes:
    - excluding a chunk from being determined as a phrase if the chunk is a single word beginning with a lower case letter.
  - 6. The method of claim 1, wherein the selecting step includes:
    - determining a chunk as being a phrase if the chunk includes a plurality of words each constituting lower case letters only if the chunk occurs at least twice in the document text stream.
  - 7. The method of claim 1, wherein the selecting step includes:
    - determining a chunk as being a proper name if the chunk includes a plurality of words each having at least a first letter which is upper case.
  - 8. The method of claim 1, wherein the selecting step includes:
    - mapping a sub-phrase to a phrase.
  - 9. The method of claim 1, wherein the selecting step includes:
    - mapping single upper case words to their respective proper names.
  - 10. The method of claim 1, wherein the selecting step includes:
    - detecting presence of acronyms;
      
      incrementing a count of a proper name corresponding to the respective detected acronyms; and
      
      copying the proper name and the acronym to an acronym list.
  - 11. The method of claim 1, wherein the selecting step includes:
    - combining a phrase list of lower case words with a phrase list of proper names.
  - 12. The method of claim 1, further comprising:
    - reducing the phrase list by consolidating phrases in the phrase list by using a synonym thesaurus.
  - 13. The method of claim 1, further comprising:
    - adding phrases to the phrase list by combining phrases which are separated in the document text stream only by prepositions.
  - 14. The method of claim 1, further comprising:
    - trimming the phrase list by eliminating phrases which occur in fewer than a threshold number of document text streams.
  - 15. The method of claim 1, further comprising:
    - categorizing proper names in the proper name list into groups based on corresponding group lists.
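
The selection rules in claims 5 to 7 can be sketched as a small filter over a chunk-to-frequency mapping. This is a simplified reading under stated assumptions: the function name `classify_chunks` is hypothetical, all single-word chunks are skipped (claim 5 only excludes lower-case ones; the handling of single upper-case words in claim 9 is not modeled), and mixed-case chunks are ignored.

```python
from collections import Counter

def classify_chunks(freqs):
    """Return (phrases, proper_names) chosen from a chunk -> count mapping."""
    phrases, proper_names = {}, {}
    for chunk, count in freqs.items():
        words = chunk.split()
        if len(words) < 2:
            # Simplification: skip every single-word chunk. Claim 5 excludes the
            # lower-case ones; upper-case single words (claim 9) are not modeled.
            continue
        if all(w[0].isupper() for w in words):
            # Claim 7: several words, each starting with an upper-case letter.
            proper_names[chunk] = count
        elif all(w.islower() for w in words) and count >= 2:
            # Claim 6: several lower-case words, seen at least twice.
            phrases[chunk] = count
    return phrases, proper_names

freqs = Counter({"search engine": 2, "data center": 1,
                 "New York Times": 1, "engine": 5, "Times": 3})
phrases, proper_names = classify_chunks(freqs)
print(phrases, proper_names)
```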

16. An apparatus of processing a stream of document text to form a list of phrases that are indicative of conceptual content of the document, the phrases being used as index terms and search query terms in full text document searching performed after the phrase list is formed, the apparatus comprising:
- means for partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  means for selecting certain chunks as the phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of document text.
- - 17. The apparatus of claim 16, wherein the partitioning means includes:
    - means for scanning a portion of the document text stream;
      
      means for comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      means for substituting a partition tag for portions of the document text stream which match a partition entity;
      
      means for generating a text chunk list;
      
      means for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      means for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks.
  - 18. The apparatus of claim 16, wherein the selecting means includes:
    - means for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.
  - 19. The apparatus of claim 16, wherein:
    - a) the partitioning means includes:
      
      a1) means for scanning a portion of the document text stream;
      
      a2) means for comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      a3) means for substituting a partition tag for portions of the document text stream which match a partition entity;
      
      a4) means for generating a text chunk list;
      
      a5) means for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      a6) means for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and
      
      b) the selecting means includes means for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.

20. A computer-readable memory which, when used in conjunction with a computer, can carry out a phrase recognition method to form a phrase list containing phrases that are indicative of conceptual content of a document, the phrases being used as index terms and search query terms in full-text document searching performed after the phrase list is formed, the computer-readable memory comprising:
- computer-readable code for partitioning document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  computer-readable code for selecting certain chunks as the phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text.
- - 21. The computer-readable memory of claim 20, wherein the computer-readable code for partitioning includes:
    - computer-readable code for scanning a portion of the document text stream;
      
      computer-readable code for comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      computer-readable code for substituting a partition tag for portions of the document text stream which match a partition entity;
      
      computer-readable code for generating a text chunk list;
      
      computer-readable code for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      computer-readable code for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks.
  - 22. The computer-readable memory of claim 20, wherein the computer-readable code for selecting includes:
    - computer-readable code for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.
  - 23. The computer-readable memory of claim 20, wherein:
    - a) the computer-readable code for partitioning includes:
      
      a1) computer-readable code for scanning a portion of the document text stream;
      
      a2) computer-readable code for comparing the scanned portion of the document text stream to partition entities in the partition list;
      
      a3) computer-readable code for substituting a partition tag for portions of the document text stream which match a partition entity;
      
      a4) computer-readable code for generating a text chunk list;
      
      a5) computer-readable code for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
      
      a6) computer-readable code for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and
      
      b) the computer-readable code for selecting includes computer-readable code for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.

24. A computer-implemented method of full-text, on-line searching, the method comprising:
- a) receiving and executing a search query to display at least one current document;
  
  b) receiving a command to search for documents having similar conceptual content to the current document;
  
  c) executing a phrase recognition process to extract phrases allowing full text searches for documents having similar conceptual content to the current document, the phrase recognition process including the steps of:
  
  c1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  c2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and
  
  d) automatically forming a second search query based at least on the phrases determined in the phrase recognition process so as to allow automated searching for documents having similar conceptual content to the current document.
- - 25. The method of claim 24, further comprising:
    - validating phrases recognized by the phrase recognition process against phrases in a phrase dictionary before automatically forming the second search query.
  - 26. The method of claim 24, further comprising:
    - displaying an error message if less than a threshold number of phrases are recognized for the current document.
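
One way to picture steps c) and d) of claim 24, together with the dictionary check of claim 25 and the error case of claim 26, is the sketch below. Everything specific in it is an assumption made for illustration: the function name `build_similarity_query`, the threshold and term-limit defaults, and the query syntax (quoted phrases joined by `OR`), none of which are specified in the claims.

```python
def build_similarity_query(phrases, dictionary=None, min_phrases=3, max_terms=5):
    """Form a second search query from recognized phrases (phrase -> frequency).

    dictionary, min_phrases, max_terms and the OR syntax are illustrative
    assumptions, not values taken from the patent.
    """
    if dictionary is not None:
        # Claim 25: keep only phrases that validate against a phrase dictionary.
        phrases = {p: c for p, c in phrases.items() if p in dictionary}
    if len(phrases) < min_phrases:
        # Claim 26: report an error when too few phrases are recognized.
        raise ValueError("too few phrases recognized for the current document")
    # Most frequent phrases first; ties broken alphabetically for determinism.
    top = sorted(phrases.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return " OR ".join(f'"{phrase}"' for phrase, _ in top)

query = build_similarity_query(
    {"search engine": 3, "phrase list": 2, "data center": 2})
print(query)
```

Raising an exception stands in for the displayed error message; an interactive system would catch it and show the message to the user.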

27. A computer-implemented method of forming a phrase list containing phrases that are indicative of conceptual content of each of a plurality of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
- a) selecting document text from the plurality of documents;
  
  b) executing a phrase recognition process including the steps of:
  
  b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  b2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and
  
  c) forming the phrase list, wherein the phrase list includes:
  
  1) phrases extracted by the phrase recognition process; and
  
  2) respective frequencies of occurrence of the extracted phrases.

28. The method of claim 27, further comprising:
- forming a modified phrase list having only those phrases whose respective frequencies of occurrence are greater than a threshold number of occurrences.

29. The method of claim 27, further comprising:
- forming a phrase dictionary based on the phrase list formed in the forming step.
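
Claims 27 to 29 describe a collection-level phrase list with frequencies, a thresholded version of it, and a phrase dictionary. A minimal sketch follows, assuming each document has already been reduced to a phrase-to-frequency mapping; the function names and the choice of a plain set for the dictionary are hypothetical.

```python
from collections import Counter

def merge_phrase_lists(per_document_phrases):
    """Claim 27 c): combine per-document phrase counts into one phrase list."""
    total = Counter()
    for phrases in per_document_phrases:
        total.update(phrases)  # adds each phrase's count to the running total
    return total

def apply_threshold(phrase_list, threshold):
    """Claim 28: keep phrases occurring more than `threshold` times."""
    return {p: c for p, c in phrase_list.items() if c > threshold}

def build_dictionary(phrase_list):
    """Claim 29: a phrase dictionary, modeled here as a set of phrases."""
    return set(phrase_list)

docs = [{"search engine": 2, "data center": 1},
        {"search engine": 3, "phrase list": 2}]
merged = merge_phrase_lists(docs)
print(build_dictionary(apply_threshold(merged, 1)))
```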

30. A computer-implemented method of forming phrase lists containing phrases that are indicative of conceptual content of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
- a) selecting document text from a sampling of documents from among a larger collection of documents; and
  
  b) executing a phrase recognition process to extract phrases to form a phrase list for each document processed, the phrase recognition process including:
  
  b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
  
  b2) selecting certain chunks as the phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text.

Current Assignee: Brain Gaylord-Tousana Brain, Jennifer Williams, RELX Inc. (RELX PLC)
Original Assignee: Lexis-Nexis
Inventors: Wassum, John Richard; Miller, David James; Lu, Xin Allan
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Robinson, Greta Lee
Application Number: US08/589,468
Time in Patent Office: 988 Days
Field of Search: 395/604, 395/603, 395/605, 395/600, 395/602, 707/3, 707/1, 707/4, 707/5
US Class Current: 707/700
CPC Class Codes:
G06F 16/30   of unstructured textual dat...
Y10S 707/917   Text
Y10S 707/968   Partitioning
Y10S 707/99931   Database or file accessing
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
