SELF-ORGANIZED CONCEPT SEARCH AND DATA STORAGE METHOD
Abstract
A document search and retrieval system and method stores documents in groups based on content. The documents are self-organized into a hierarchy of conceptual clusters, and branches of the hierarchy are stored separately in distinct physical stores, each having an index. In response to a query, the system finds the concepts (clusters) that best match the search criteria and returns the documents from those content categories. The indexing, clustering, and searching are performed using document themes and/or summaries. Themes are automatically developed by stemming and scoring phrases from the sentences in each document, and clustering the sentences containing the highest-scoring stems. A set of phrases (themes) is taken from each cluster. Document summaries are taken from text segments for each cluster of sentences within a document, then strung together to create a summary.
107 Citations
36 Claims
1. A system for indexing and retrieving information regarding a plurality of documents, comprising:
a plurality of data stores, each having an index and a search engine for finding documents in the data store that meet one or more search criteria;
a plurality of document concepts, each associated with exactly one of the data stores;
a clustering engine that, for each of the plurality of documents:
associates the document with one or more of the concepts; and
adds information about the document to the index of each data store with which the one or more concepts is associated; and
updates organization of the concepts according to one or more predetermined criteria.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
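The indexing path recited in claim 1 might be sketched as follows. This is only an illustration under stated assumptions: the `DataStore` class, the keyword-overlap rule for associating a document with concepts, and all names are hypothetical and do not come from the patent. The step that updates the organization of the concepts is omitted.

```python
class DataStore:
    """Hypothetical data store with its own inverted index and search."""

    def __init__(self):
        self.index = {}  # term -> set of document ids

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.index.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return self.index.get(term.lower(), set())


def index_document(doc_id, text, concept_keywords, concept_store):
    """Associate a document with concepts, then index it in each concept's store.

    Assumption: a document belongs to a concept when it shares at least one
    word with that concept's keyword set.
    """
    words = set(text.lower().split())
    matched = [c for c, kws in concept_keywords.items() if words & kws]
    # Each concept maps to exactly one store; index once per distinct store.
    stores = {id(concept_store[c]): concept_store[c] for c in matched}
    for store in stores.values():
        store.add(doc_id, text)
    return matched


kitchen, money = DataStore(), DataStore()
concept_keywords = {
    "cooking": {"oven", "recipe"},
    "baking": {"oven", "flour"},
    "finance": {"stock"},
}
concept_store = {"cooking": kitchen, "baking": kitchen, "finance": money}
matched = index_document("d1", "Preheat the oven", concept_keywords, concept_store)
```

Here two concepts share one store, so the document is indexed once in `kitchen` and not at all in `money`.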
9. A method of self-organizing and storing a plurality of electronic documents in a plurality of physical storage partitions, including:
clustering a plurality of electronic documents so that each document is in at least one of a plurality of concept clusters, the plurality of concept clusters forming a hierarchy and including:
a first concept cluster and a second concept cluster that is not a super-cluster of the first concept cluster;
for each concept cluster in the plurality of concept clusters, storing each document in the concept cluster in one of the one or more physical storage partitions;
wherein all documents in the first concept cluster are stored in a first storage partition;
all documents in the second concept cluster are stored in a second storage partition; and
there is no document that is simultaneously in the second concept cluster, stored in the first storage partition, and not in the first concept cluster.
Dependent claims: 10, 11, 12
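The three "wherein" conditions of claim 9 can be read as a check over cluster membership and storage placement. The sketch below is one possible reading, not the patented method: the function name, the dictionary layout, and the assumption that a document may be stored in more than one partition are all hypothetical.

```python
def satisfies_partition_rule(cluster_docs, stored_in, first, second, p1, p2):
    """Check the three 'wherein' conditions for two concept clusters.

    cluster_docs: cluster name -> set of document ids (assumed layout)
    stored_in:    document id -> set of partitions holding it (assumed layout)
    """
    docs1, docs2 = cluster_docs[first], cluster_docs[second]
    # Every document of the first cluster is in the first partition.
    all_first_in_p1 = all(p1 in stored_in[d] for d in docs1)
    # Every document of the second cluster is in the second partition.
    all_second_in_p2 = all(p2 in stored_in[d] for d in docs2)
    # No document is in the second cluster, stored in the first partition,
    # and absent from the first cluster.
    no_stray = not any(p1 in stored_in[d] and d not in docs1 for d in docs2)
    return all_first_in_p1 and all_second_in_p2 and no_stray


cluster_docs = {"sports": {"d1", "d2"}, "finance": {"d2", "d3"}}
good_layout = {"d1": {"A"}, "d2": {"A", "B"}, "d3": {"B"}}
bad_layout = {"d1": {"A"}, "d2": {"A", "B"}, "d3": {"A", "B"}}
ok = satisfies_partition_rule(cluster_docs, good_layout, "sports", "finance", "A", "B")
bad = satisfies_partition_rule(cluster_docs, bad_layout, "sports", "finance", "A", "B")
```

In `bad_layout`, document `d3` belongs only to the second cluster yet is also stored in partition `A`, which violates the last condition.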
13. A method of searching electronic documents, comprising:
receiving a query signal that includes one or more search terms;
responsively to receiving the query signal, searching a plurality of concept indexes, each providing an index to a plurality of electronic documents that relate to a common concept, including:
quantifying the relationship between the one or more search terms and each of the concept indexes as a similarity value; and
selecting the concept indexes having a similarity value indicating a relationship closer than a threshold; and
retrieving references to each of the electronic documents in each of the selected concept indexes.
Dependent claims: 14, 15, 16, 17, 18, 19
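The search steps of claim 13 might look like the sketch below. The claim does not name a similarity measure, so cosine similarity over term counts is an assumption here, as are the threshold value, the index layout, and every identifier.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query_terms, concept_indexes, threshold=0.3):
    """Return document references from every concept index close to the query."""
    query = Counter(query_terms)
    refs = []
    for index in concept_indexes.values():
        # Quantify the relationship, then keep indexes closer than the threshold.
        if cosine(query, index["terms"]) > threshold:
            refs.extend(index["doc_refs"])
    return refs


# Hypothetical concept indexes: a term profile plus references to documents.
concept_indexes = {
    "cooking": {"terms": Counter({"recipe": 3, "oven": 2}), "doc_refs": ["doc-1", "doc-2"]},
    "finance": {"terms": Counter({"stock": 4, "bond": 1}), "doc_refs": ["doc-3"]},
}
hits = search(["oven", "recipe"], concept_indexes)
```

Only the selected concept indexes are consulted for document references, so the query never touches the documents filed under unrelated concepts.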
20. A system for storing and retrieving electronic documents, including:
a search string layer that receives a search query;
one or more physical data stores; and
a concept index layer that includes a plurality of indexes, each index being associated with one of the physical data stores, and each index containing data that relates to a plurality of electronic documents;
wherein the system quantifies the closeness of the conceptual relationship between each of the indexes and the search query;
based on the quantification, identifies one or more indexes that best match the search query;
identifies the documents indexed by the one or more identified indexes; and
provides a result signal as a function of the identified documents.
Dependent claims: 21, 22, 23
24. A system for generating a list of one or more themes from an electronic document, comprising a processor and a memory in communication with the processor, the memory being encoded with programming instructions executable by the processor to:
identify sentences in the document;
parse the sentences into tokens;
list all phrases in the document having no more than a predetermined number of tokens;
count the frequency of the phrases;
stem the phrases to a predetermined length;
score each stem as a function of the stem's length and the frequency of the corresponding phrases in the document;
cluster the sentences based at least in part on the scores of the stems they contain; and
generate a phrase set containing phrases from those sentences that were clustered into a cluster with at least one other sentence.
Dependent claims: 25, 26, 27, 28, 29, 30, 31, 32
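The theme-generation steps of claim 24 can be walked through in a short sketch. Several choices here are assumptions rather than details from the patent: stemming is modeled as truncating each token to `STEM_LEN` characters, a stem's score is its character length times the frequency of its phrases, and sentences are clustered by their single highest-scoring stem.

```python
import re
from collections import Counter

MAX_TOKENS = 2  # assumed limit on phrase length, in tokens
STEM_LEN = 5    # assumed: stemming truncates each token to this many characters


def sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def phrases(toks):
    """All runs of 1..MAX_TOKENS consecutive tokens."""
    return [tuple(toks[i:i + n])
            for n in range(1, MAX_TOKENS + 1)
            for i in range(len(toks) - n + 1)]


def stem(phrase):
    return tuple(t[:STEM_LEN] for t in phrase)


def themes(text):
    sent_phrases = [phrases(tokens(s)) for s in sentences(text)]
    freq = Counter(p for ps in sent_phrases for p in ps)
    stem_freq = Counter()
    for phrase, n in freq.items():
        stem_freq[stem(phrase)] += n
    # Assumed scoring: stem length in characters times phrase frequency.
    score = {st: n * sum(len(t) for t in st) for st, n in stem_freq.items()}
    # Assumed clustering: group sentences by their highest-scoring stem.
    clusters = {}
    for i, ps in enumerate(sent_phrases):
        if ps:
            best = max((stem(p) for p in ps), key=lambda st: (score[st], st))
            clusters.setdefault(best, []).append(i)
    # Keep phrases only from clusters holding at least two sentences.
    result = set()
    for best, members in clusters.items():
        if len(members) >= 2:
            for i in members:
                result.update(" ".join(p) for p in sent_phrases[i] if stem(p) == best)
    return result


text = "Concept clusters organize documents. Concept clusters speed up search. Cats sleep."
found = themes(text)
```

The third sentence forms a cluster of its own, so it contributes no phrases to the result.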
33. A system for generating a summary of an electronic document, comprising a processor and a memory in communication with the processor, the memory being encoded with programming instructions executable by the processor to:
identify coherent segments of text in an electronic document, each sentence being part of at least one coherent segment;
cluster sentences in the document based on their content;
for each cluster of sentences, generate a passage by:
sorting the sentences in the cluster based on their position in the original document;
selecting a first number of sentences from the beginning of the sorted list; and
for each of the first number of sentences, adding to the passage the smallest coherent segment of which the sentence is a part.
Dependent claims: 34, 35, 36
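The passage-building steps of claim 33 might be sketched as below. The sketch assumes that segments and sentence clusters have already been computed and are passed in: segments as half-open `(start, end)` ranges of sentence positions, clusters as lists of sentence positions. These representations and all names are hypothetical.

```python
def smallest_segment(idx, segments):
    """Return the shortest (start, end) segment that contains sentence idx."""
    containing = [(end - start, start, end)
                  for start, end in segments if start <= idx < end]
    _, start, end = min(containing)
    return start, end


def build_passages(sents, segments, clusters, first_n=1):
    """Build one passage per cluster of sentences."""
    passages = []
    for cluster in clusters:
        # Sort by position in the document and keep the first few sentences.
        chosen = sorted(cluster)[:first_n]
        spans = []
        for idx in chosen:
            span = smallest_segment(idx, segments)
            if span not in spans:  # avoid repeating a segment within a passage
                spans.append(span)
        passages.append(" ".join(" ".join(sents[s:e]) for s, e in spans))
    return passages


sents = ["A.", "B.", "C.", "D."]
segments = [(0, 4), (0, 2), (2, 4), (3, 4)]  # nested coherent segments
clusters = [[2, 0], [3]]
passages = build_passages(sents, segments, clusters)
summary = " ".join(passages)  # passages strung together, as the abstract describes
```

Choosing the smallest enclosing segment keeps each passage short while still pulling in the neighboring text that makes the selected sentence readable.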
Specification