ACCESSING DOCUMENTS USING PREDICTIVE WORD SEQUENCES
Abstract
Methods and systems for accessing documents in document collections using predictive word sequences are disclosed. A method for accessing documents using predictive word sequences includes creating a candidate list of word sequences, where respective ones of the word sequences comprise one or more elements derived from a document corpus; expanding the candidate list by adding one or more new word sequences, where each new word sequence (pattern) is created by combining one or more elements derived from the document corpus with one of the word sequences currently in the candidate list; determining a predictive power with respect to a subject for respective ones of entries of the candidate list, where the entries include the word sequences and the new word sequences; pruning from the candidate list those entries whose determined predictive power is less than a predetermined threshold; and accessing documents from the document corpus based on the pruned candidate list. The expanding of the candidate list can include creating each new pattern as a gapped sequence, where the gapped sequence comprises one of the word sequences and one of said elements separated by zero or more words. Corresponding system and computer readable media embodiments are also disclosed.
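The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: documents are pre-tokenized word lists with a boolean on-subject label, "predictive power" is taken to be precision (the fraction of matching documents that are on-subject), expansion appends a single vocabulary word, and the names `matches`, `predictive_power`, `find_predictive_sequences`, `access_documents`, and the `max_gap` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

WordSeq = Tuple[str, ...]


def matches(seq: WordSeq, words: Sequence[str], max_gap: int) -> bool:
    """True if the elements of seq occur in order in words, with at most
    max_gap other words between consecutive elements (a gapped sequence)."""
    # Positions where the prefix matched so far can end.
    ends = {i for i, w in enumerate(words) if w == seq[0]}
    for elem in seq[1:]:
        ends = {j for j, w in enumerate(words)
                if w == elem and any(0 < j - i <= max_gap + 1 for i in ends)}
    return bool(ends)


def predictive_power(seq: WordSeq, docs, labels, max_gap: int) -> float:
    """Assumed measure: precision of seq as an indicator of the subject."""
    hits = [label for words, label in zip(docs, labels)
            if matches(seq, words, max_gap)]
    return sum(hits) / len(hits) if hits else 0.0


def find_predictive_sequences(docs, labels, threshold: float,
                              rounds: int = 1, max_gap: int = 2) -> List[WordSeq]:
    vocab = sorted({w for words in docs for w in words})
    # Create: start from single-word sequences derived from the corpus.
    candidates: List[WordSeq] = [(w,) for w in vocab]
    for _ in range(rounds):
        # Expand: combine each current sequence with one more corpus element.
        expanded = candidates + [s + (w,) for s in candidates for w in vocab]
        unique = list(dict.fromkeys(expanded))  # drop duplicates, keep order
        # Determine predictive power, then prune entries below the threshold.
        candidates = [s for s in unique
                      if predictive_power(s, docs, labels, max_gap) >= threshold]
    return candidates


def access_documents(docs, kept: List[WordSeq], max_gap: int) -> List[int]:
    """Access: indices of documents matched by any surviving sequence."""
    return [i for i, words in enumerate(docs)
            if any(matches(s, words, max_gap) for s in kept)]


# Toy corpus; the subject is "mergers" and the labels mark on-subject documents.
docs = [
    "the company announced a merger with a rival".split(),
    "merger talks collapsed on friday".split(),
    "the team announced a new coach".split(),
    "rain is expected on friday".split(),
]
labels = [True, True, False, False]

kept = find_predictive_sequences(docs, labels, threshold=1.0, rounds=1, max_gap=2)
print(access_documents(docs, kept, max_gap=2))
```

In this toy corpus the single word "announced" appears in one on-subject and one off-subject document, so it is pruned, while the gapped sequence ("announced", "merger") survives. That is the kind of case the expansion step is meant to capture: a longer sequence can be more predictive than its parts.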
20 Claims
1. A method for accessing documents related to a subject from a document corpus, comprising:
creating a candidate list of word sequences, respective ones of the word sequences comprising one or more elements derived from the document corpus;
expanding the candidate list by adding one or more new word sequences, wherein each new pattern is created by combining one or more elements derived from the document corpus with one of said word sequences;
determining a predictive power with respect to the subject for respective ones of entries of the candidate list, wherein the entries comprise said word sequences and said new word sequences;
pruning from the candidate list ones of said entries with the determined predictive power less than a predetermined threshold; and
accessing documents from the document corpus based on the pruned candidate list.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
18. A system for accessing documents related to a subject from a document corpus, comprising:
at least one processor;
at least one memory coupled to the processor and configured to store a candidate list of word sequences; and
a word sequence determining and document accessing module implemented on the at least one processor, including:
  a word sequence generator configured to:
    create a candidate list of word sequences, respective ones of the word sequences comprising one or more elements derived from the document corpus;
    expand the candidate list by adding one or more new word sequences, wherein each new pattern is created by combining one or more elements derived from the document corpus with one of said word sequences;
    determine a predictive power with respect to the subject for respective ones of entries of the candidate list, wherein the entries comprise said word sequences and said new word sequences;
    prune from the candidate list ones of said entries with the determined predictive power less than a predetermined threshold.
Dependent claims: 19
20. A computer readable media storing instructions wherein said instructions when executed are adapted to access documents related to a subject from a document corpus with a method comprising:
creating a candidate list of word sequences, respective ones of the word sequences comprising one or more elements derived from the document corpus;
expanding the candidate list by adding one or more new word sequences, wherein each new pattern is created by combining one or more elements derived from the document corpus with one of said word sequences;
determining a predictive power with respect to the subject for respective ones of entries of the candidate list, wherein the entries comprise said word sequences and said new word sequences;
pruning from the candidate list ones of said entries with the determined predictive power less than a predetermined threshold; and
accessing documents from the document corpus based on the pruned candidate list.
Specification