Phrase recognition method and apparatus
Abstract
A phrase recognition method breaks streams of text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching. The phrase recognition method uses a carefully assembled list of partition elements to partition the text into the chunks, and selects phrases from the chunks according to a small number of frequency based definitions. The method can also incorporate additional processes such as categorization of proper names to enhance phrase recognition. The method selects phrases quickly and efficiently, referring simply to the phrases themselves and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or on collocation schemes of limited applicability, or on heuristical text analysis of limited reliability or utility.
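The chunk-and-count idea described in the abstract might be sketched roughly as follows. This is an illustration only: the small `PARTITION_WORDS` set, the treatment of punctuation as a partition entity, the two-word minimum, and the `min_count` threshold are all assumptions made for the example, since the abstract does not give the actual partition list or the frequency-based definitions.

```python
import re
from collections import Counter

# Hypothetical partition list (a handful of function words). Punctuation is
# also treated as a partition entity. The real list is not given in the text.
PARTITION_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and",
                   "or", "is", "are", "was", "by", "with", "that"}

# A token is either a word (optionally with inner ' or -) or one punctuation mark.
TOKEN_RE = re.compile(r"(?P<word>\w+(?:['-]\w+)*)|(?P<punct>[^\w\s])")


def chunk_text(text):
    """Split text into chunks separated by partition words or punctuation."""
    chunks, current = [], []
    for match in TOKEN_RE.finditer(text.lower()):
        word = match.group("word")
        if word and word not in PARTITION_WORDS:
            current.append(word)
        else:
            # Partition entity reached: close the chunk in progress, if any.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


def select_phrases(text, min_count=2, min_words=2):
    """Keep multi-word chunks that occur at least min_count times."""
    counts = Counter(chunk_text(text))
    return {chunk: n for chunk, n in counts.items()
            if n >= min_count and len(chunk.split()) >= min_words}
```

No grammatical parsing is involved in this sketch: a chunk is whatever lies between two partition entities, and selection depends only on how often the same chunk recurs.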
30 Claims
1. A computer-implemented method of processing a stream of document text to form a list of phrases that are indicative of conceptual content of the document, the phrases being used as index terms and search query terms in full text document searching performed after the phrase list is formed, the method comprising:
partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
selecting certain chunks as the phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of document text.
Dependent claims: 2-15
-
16. An apparatus of processing a stream of document text to form a list of phrases that are indicative of conceptual content of the document, the phrases being used as index terms and search query terms in full text document searching performed after the phrase list is formed, the apparatus comprising:
means for partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
means for selecting certain chunks as the phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of document text.
Dependent claims: 17-19
-
20. A computer-readable memory which, when used in conjunction with a computer, can carry out a phrase recognition method to form a phrase list containing phrases that are indicative of conceptual content of a document, the phrases being used as index terms and search query terms in full-text document searching performed after the phrase list is formed, the computer-readable memory comprising:
computer-readable code for partitioning document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
computer-readable code for selecting certain chunks as the phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text.
Dependent claims: 21-23
-
24. A computer-implemented method of full-text, on-line searching, the method comprising:
a) receiving and executing a search query to display at least one current document;
b) receiving a command to search for documents having similar conceptual content to the current document;
c) executing a phrase recognition process to extract phrases allowing full text searches for documents having similar conceptual content to the current document, the phrase recognition process including the steps of;
c1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
c2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and
d) automatically forming a second search query based at least on the phrases determined in the phrase recognition process so as to allow automated searching for documents having similar conceptual content to the current document.
Dependent claims: 25, 26
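Step d) of claim 24 could look something like the sketch below, which turns already-extracted phrases into a second query. The function name `build_similarity_query`, the `max_terms` cutoff, and the quoted-phrase `OR` syntax are hypothetical choices for illustration; the claim does not specify a query syntax or how many phrases are used.

```python
def build_similarity_query(phrase_counts, max_terms=5):
    """Join the most frequent phrases into an OR query of quoted phrases.

    phrase_counts maps each extracted phrase to its frequency of occurrence
    in the current document (assumed to come from the phrase recognition step).
    """
    # Rank by descending frequency, then alphabetically for a stable order.
    ranked = sorted(phrase_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return " OR ".join(f'"{phrase}"' for phrase, _ in ranked[:max_terms])
```

For example, calling it with `{"supreme court": 3, "united states": 2}` yields a query string that a full-text engine supporting quoted phrases and `OR` could run without the user typing anything.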
-
27. A computer-implemented method of forming a phrase list containing phrases that are indicative of conceptual content of each of a plurality of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
a) selecting document text from the plurality of documents;
b) executing a phrase recognition process including the steps of;
b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
b2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and
c) forming the phrase list, wherein the phrase list includes;
1) phrases extracted by the phrase recognition process; and
2) respective frequencies of occurrence of the extracted phrases.
-
28. The method of claim 27, further comprising:
forming a modified phrase list having only those phrases whose respective frequencies of occurrence are greater than a threshold number of occurrences.
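The phrase list with frequencies (claim 27) and the thresholded list (claim 28) can be pictured with a short sketch. It assumes each document has already been reduced to a phrase-to-count mapping; the helper names `merge_phrase_lists` and `filter_by_threshold` are hypothetical and not taken from the patent.

```python
from collections import Counter


def merge_phrase_lists(per_document_counts):
    """Combine per-document phrase counts into one phrase list with totals."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)  # adds counts phrase by phrase
    return total


def filter_by_threshold(phrase_list, threshold):
    """Keep only phrases occurring more than `threshold` times (strictly)."""
    return {phrase: n for phrase, n in phrase_list.items() if n > threshold}
```

The comparison is strict (`>`), following the claim's wording "greater than a threshold number of occurrences".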
-
29. The method of claim 27, further comprising:
forming a phrase dictionary based on the phrase list formed in the forming step.
-
30. A computer-implemented method of forming phrase lists containing phrases that are indicative of conceptual content of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
a) selecting document text from a sampling of documents from among a larger collection of documents; and
b) executing a phrase recognition process to extract phrases to form a phrase list for each document processed, the phrase recognition process including;
b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
b2) selecting certain chunks as the phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text.
Specification