Methods and apparatus for similarity text search based on conceptual indexing
Abstract
In one aspect of the invention, a method of performing a conceptual similarity search comprises the steps of: generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search; building a conceptual index of documents with the one or more word-chains; and evaluating a similarity query using the conceptual index. The evaluating step preferably returns one or more of the closest documents resulting from the search; one or more matching word-chains in the one or more documents; and one or more matching topical words of the one or more documents.
39 Claims
-
1. A method of performing a conceptual similarity search, the method comprising the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search, wherein at least one of the one or more word-chains comprises a meta-document formed by applying a damping function to a set of the one or more documents, concatenating the document set after application of the damping function, and removing from the meta-document one or more least-weighted words;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index.
-
for each word-chain, finding the one or more documents with conceptual similarity to the word-chain; and
retaining a list of identities of the one or more documents which have conceptual similarity not less than a predefined threshold value.
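The meta-document recited in claim 1 (damp each document, concatenate, drop the least-weighted words) can be illustrated with a minimal Python sketch. The claim does not name a damping function or a pruning rule, so the square-root damping, the `keep` cutoff, and the function name `build_meta_document` are assumptions made only for illustration.

```python
import math
from collections import Counter


def build_meta_document(docs, keep, damp=math.sqrt):
    """Form one meta-document from a set of tokenized documents.

    docs: list of token lists.
    keep: number of heaviest words to retain (hypothetical parameter).
    damp: damping function applied to each per-document word frequency
          (square root is an assumption, not taken from the claims).
    """
    meta = Counter()
    for tokens in docs:
        # Damp each document's word frequencies, then "concatenate"
        # the damped documents by summing their weights per word.
        for word, freq in Counter(tokens).items():
            meta[word] += damp(freq)
    # Remove the least-weighted words: keep only the `keep` heaviest.
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(meta.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:keep])


docs = [
    ["engine", "engine", "engine", "engine", "car"],
    ["car", "wheel"],
    ["engine", "road"],
]
meta_document = build_meta_document(docs, keep=2)
```

Damping before concatenation keeps one long, repetitive document from dominating the meta-document; other damping functions, such as a logarithm, would fit the same outline.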
-
-
3. The method of claim 2, wherein a document identity comprises a unique integer value.
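The index-building steps of claims 2 and 3 (for each word-chain, retain the integer identities of documents whose conceptual similarity is not less than a threshold) might look like the following sketch. Plain cosine similarity over word weights is an assumed stand-in for the similarity measure, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {word: weight} mappings."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_conceptual_index(word_chains, documents, threshold):
    """Map each word-chain position to the ids of sufficiently similar documents.

    word_chains: list of {word: weight} mappings.
    documents:   {integer document id: {word: weight}}.
    threshold:   minimum similarity for a document to be listed.
    """
    index = {}
    for chain_id, chain in enumerate(word_chains):
        index[chain_id] = [
            doc_id
            for doc_id, vector in sorted(documents.items())
            if cosine(chain, vector) >= threshold
        ]
    return index


word_chains = [{"engine": 1.0, "car": 1.0}, {"bank": 1.0, "loan": 1.0}]
documents = {
    1: {"engine": 2.0, "car": 1.0},
    2: {"loan": 1.0},
    3: {"car": 1.0, "bank": 1.0},
}
index = build_conceptual_index(word_chains, documents, threshold=0.6)
```

Each entry of `index` plays the role of an inverted list: it is keyed by a concept (word-chain) and holds the document identities associated with it.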
-
4. The method of claim 1, wherein the evaluating step comprises returning one or more of the closest documents resulting from the search.
-
5. The method of claim 1, wherein the evaluating step comprises returning one or more matching word-chains in the one or more documents.
-
6. The method of claim 1, wherein the evaluating step comprises returning one or more matching topical words of the one or more documents.
-
7. The method of claim 1, wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query.
-
8. The method of claim 7, wherein the evaluating step comprises the step of finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document.
-
9. A method of performing a conceptual similarity search, the method comprising the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the word-chain generating step comprises the steps of:
initializing one or more word-chains to one or more sets of randomly selected documents;
assigning one or more other documents to the one or more sets of randomly selected documents;
concatenating the one or more documents in each set and removing less frequently occurring words from each word-chain; and
merging the word-chains.
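The four generating steps of claim 9 (random initialization, assignment, concatenation with pruning, merging) can be sketched as a single pass in Python. The claim leaves the similarity measure, the pruning cutoff, and the merge criterion open, so the word-overlap measure and the `keep` and `merge_at` parameters below are assumptions, and the function names are hypothetical.

```python
import random
from collections import Counter


def overlap(a, b):
    """Assumed similarity: number of word occurrences two bags share."""
    return sum((a & b).values())


def generate_word_chains(docs, n_seeds, keep, merge_at, rng):
    bags = [Counter(tokens) for tokens in docs]
    # Step 1: initialize each word-chain to a randomly selected document.
    seeds = rng.sample(range(len(docs)), n_seeds)
    groups = {seed: [seed] for seed in seeds}
    # Step 2: assign every other document to its most similar seed set.
    for i, bag in enumerate(bags):
        if i in groups:
            continue
        best = max(seeds, key=lambda s: overlap(bag, bags[s]))
        groups[best].append(i)
    # Step 3: concatenate each set and drop the less frequent words.
    chains = []
    for members in groups.values():
        text = Counter()
        for i in members:
            text.update(bags[i])
        chains.append(Counter(dict(text.most_common(keep))))
    # Step 4: merge word-chains that share at least `merge_at` occurrences.
    # A merged chain may hold more than `keep` words until it is pruned again.
    merged = []
    for chain in chains:
        for other in merged:
            if overlap(chain, other) >= merge_at:
                other.update(chain)
                break
        else:
            merged.append(chain)
    return merged


docs = [
    ["car", "engine", "wheel"],
    ["car", "engine", "road"],
    ["bank", "loan", "rate"],
    ["bank", "loan", "credit"],
]
chains = generate_word_chains(docs, n_seeds=4, keep=3, merge_at=2,
                              rng=random.Random(0))
```

With every document used as a seed, the two automotive documents and the two banking documents each collapse into one word-chain during the merge step, whatever order the random sample produces.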
-
-
11. A method of performing a conceptual similarity search, the method comprising the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query, and wherein the target document conceptual representation generating step comprises the steps of:
calculating a similarity measure between the target document and each conceptual word-chain;
determining whether each similarity measure is not less than a predetermined threshold value; and
generating conceptual strength measures by respectively setting a conceptual strength measure to a similarity measure minus the predetermined threshold value, when the similarity measure is not less than a predetermined threshold value.
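The thresholding in claim 11 amounts to simple arithmetic: a word-chain contributes to the target document's conceptual representation only when its similarity is not less than the threshold, and its strength is the amount by which it meets or exceeds that threshold. A minimal Python sketch, assuming the per-chain similarities have already been computed (the function and argument names are hypothetical):

```python
def conceptual_representation(similarities, threshold):
    """Return {chain id: conceptual strength} for a target document.

    similarities: {chain id: similarity of the target to that word-chain}.
    threshold:    the predetermined threshold value.
    """
    return {
        chain_id: similarity - threshold
        for chain_id, similarity in similarities.items()
        if similarity >= threshold  # "not less than" the threshold
    }


strengths = conceptual_representation({0: 0.75, 1: 0.25, 2: 0.5}, threshold=0.5)
```

A similarity exactly equal to the threshold is retained with a strength of zero, which follows from the "not less than" wording.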
-
-
12. A method of performing a conceptual similarity search, the method comprising the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query and finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document, and wherein the finding step comprises the steps of:
finding one or more concepts in the target document;
evaluating an inverted list associated with the indexed documents to find the one or more documents which have at least one concept in common with the target document;
calculating a conceptual cosine of the one or more common concept documents to the target document;
finding the closest document to the target document based on the conceptual cosine; and
reporting an output statistic between the closest matching document and the target document.
-
reporting concepts which are present in the target document and the closest matching document; and
finding a topical vocabulary which is common to the target document and the closest matching document and matching word-chains.
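The finding steps of claim 12, together with the concept report of claim 13, can be sketched in Python as below. The claims do not define "conceptual cosine" in this excerpt; the sketch assumes it is the ordinary cosine taken over concept-strength vectors, and every name is hypothetical.

```python
import math


def conceptual_cosine(a, b):
    """Assumed definition: cosine between {concept id: strength} mappings."""
    dot = sum(s * b.get(c, 0.0) for c, s in a.items())
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_document(target, doc_concepts, inverted):
    """Find the indexed document closest to the target.

    target:       {concept id: strength} for the target document.
    doc_concepts: {document id: {concept id: strength}}.
    inverted:     {concept id: [document ids containing that concept]}.
    Returns (document id, conceptual cosine, shared concept ids) or None.
    """
    # Use the inverted lists to collect documents that share at least
    # one concept with the target.
    candidates = set()
    for concept in target:
        candidates.update(inverted.get(concept, []))
    if not candidates:
        return None
    # Pick the candidate with the highest conceptual cosine.
    best = max(sorted(candidates),
               key=lambda d: conceptual_cosine(target, doc_concepts[d]))
    score = conceptual_cosine(target, doc_concepts[best])
    # Report the concepts present in both documents.
    shared = sorted(set(target) & set(doc_concepts[best]))
    return best, score, shared


doc_concepts = {1: {0: 1.0}, 2: {0: 1.0, 1: 1.0}, 3: {2: 1.0}}
inverted = {0: [1, 2], 1: [2], 2: [3]}
result = closest_document({0: 1.0, 1: 1.0}, doc_concepts, inverted)
```

Only documents reachable through the target's inverted lists are scored, so documents with no concept in common (document 3 here) are never compared.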
-
-
14. Apparatus for performing a conceptual similarity search, the apparatus comprising:
-
at least one processor operative to:
(i) generate one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search, wherein at least one of the one or more word-chains comprises a meta-document formed by applying a damping function to a set of the one or more documents, concatenating the document set after application of the damping function, and removing from the meta-document one or more least-weighted words;
(ii) build a conceptual index of documents with the one or more word-chains; and
(iii) evaluate a similarity query using the conceptual index; and
memory, coupled to the at least one processor, for storing at least one of the conceptual word-chains and the conceptual index.
-
-
22. Apparatus for performing a conceptual similarity search, the apparatus comprising:
-
at least one processor operative to:
(i) generate one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
(ii) build a conceptual index of documents with the one or more word-chains; and
(iii) evaluate a similarity query using the conceptual index; and
memory, coupled to the at least one processor, for storing at least one of the conceptual word-chains and the conceptual index;
wherein the processor is further operative to perform the word-chain generating operation by initializing one or more word-chains to one or more sets of randomly selected documents;
assigning one or more other documents to the one or more sets of randomly selected documents;
concatenating the one or more documents in each set and removing less frequently occurring words from each word-chain; and
merging the word-chains.
-
-
24. Apparatus for performing a conceptual similarity search, the apparatus comprising:
-
at least one processor operative to:
(i) generate one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
(ii) build a conceptual index of documents with the one or more word-chains; and
(iii) evaluate a similarity query using the conceptual index; and
memory, coupled to the at least one processor, for storing at least one of the conceptual word-chains and the conceptual index;
wherein the processor is further operative to perform the evaluating operation by generating a conceptual representation of a target document associated with the similarity query, and wherein the processor is further operative to perform the target document conceptual representation generating operation by calculating a similarity measure between the target document and each conceptual word-chain;
determining whether each similarity measure is not less than a predetermined threshold value; and
generating conceptual strength measures by respectively setting a conceptual strength measure to a similarity measure minus the predetermined threshold value, when the similarity measure is not less than a predetermined threshold value.
-
-
25. Apparatus for performing a conceptual similarity search, the apparatus comprising:
-
at least one processor operative to:
(i) generate one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
(ii) build a conceptual index of documents with the one or more word-chains; and
(iii) evaluate a similarity query using the conceptual index; and
memory, coupled to the at least one processor, for storing at least one of the conceptual word-chains and the conceptual index;
wherein the processor is further operative to perform the evaluating operation by generating a conceptual representation of a target document associated with the similarity query and finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document, and wherein the processor is further operative to perform the finding operation by finding one or more concepts in the target document;
evaluating an inverted list associated with the indexed documents to find the one or more documents which have at least one concept in common with the target document;
calculating a conceptual cosine of the one or more common concept documents to the target document;
finding the closest document to the target document based on the conceptual cosine; and
reporting an output statistic between the closest matching document and the target document.
-
-
27. An article of manufacture for performing a conceptual similarity search, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search, wherein at least one of the one or more word-chains comprises a meta-document formed by applying a damping function to a set of the one or more documents, concatenating the document set after application of the damping function, and removing from the meta-document one or more least-weighted words;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index.
-
for each word-chain, finding the one or more documents with conceptual similarity to the word-chain; and
retaining a list of identities of the one or more documents which have conceptual similarity not less than a predefined threshold value.
-
-
29. The article of claim 28, wherein a document identity comprises a unique integer value.
-
30. The article of claim 27, wherein the evaluating step comprises returning one or more of the closest documents resulting from the search.
-
31. The article of claim 27, wherein the evaluating step comprises returning one or more matching word-chains in the one or more documents.
-
32. The article of claim 27, wherein the evaluating step comprises returning one or more matching topical words of the one or more documents.
-
33. The article of claim 27, wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query.
-
34. The article of claim 33, wherein the evaluating step comprises the step of finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document.
-
35. An article of manufacture for performing a conceptual similarity search, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the word-chain generating step comprises the steps of:
initializing one or more word-chains to one or more sets of randomly selected documents;
assigning one or more other documents to the one or more sets of randomly selected documents;
concatenating the one or more documents in each set and removing less frequently occurring words from each word-chain; and
merging the word-chains.
-
-
37. An article of manufacture for performing a conceptual similarity search, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query, and wherein the target document conceptual representation generating step comprises the steps of:
calculating a similarity measure between the target document and each conceptual word-chain;
determining whether each similarity measure is not less than a predetermined threshold value; and
generating conceptual strength measures by respectively setting a conceptual strength measure to a similarity measure minus the predetermined threshold value, when the similarity measure is not less than a predetermined threshold value.
-
-
38. An article of manufacture for performing a conceptual similarity search, comprising a machine readable medium containing one or more programs which when executed implement the steps of:
-
generating one or more conceptual word-chains from one or more documents to be used in the conceptual similarity search;
building a conceptual index of documents with the one or more word-chains; and
evaluating a similarity query using the conceptual index;
wherein the evaluating step comprises the step of generating a conceptual representation of a target document associated with the similarity query and finding a substantially close match to a target document among a plurality of indexed documents using the conceptual representation of the target document, and wherein the finding step comprises the steps of:
finding one or more concepts in the target document;
evaluating an inverted list associated with the indexed documents to find the one or more documents which have at least one concept in common with the target document;
calculating a conceptual cosine of the one or more common concept documents to the target document;
finding the closest document to the target document based on the conceptual cosine; and
reporting an output statistic between the closest matching document and the target document.
-
reporting concepts which are present in the target document and the closest matching document; and
finding a topical vocabulary which is common to the target document and the closest matching document and matching word-chains.
-
Specification