System and method for identifying passages in electronic documents
Abstract
The method proposed here deconstructs training sentences into a stream of features that represent the sentences, the tokens used in the text, their sequence, and other ancillary features extracted using natural language processing. We then use a conditional random field in which the concept being sought is represented as State A and the background (everything that is not the concept) as State B. The model created by this training phase is then used to locate the concept as a sequence of sentences within a document. This has distinct advantages in accuracy and speed over methods that classify each sentence individually and then use a secondary method to group the classified sentences into passages. Furthermore, while previous methods were based only on searching for the occurrence of tokens, the use of a wider set of features enables this method to locate relevant passages even when different terminology is used.
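To illustrate why labelling the sentences as one sequence differs from classifying each sentence on its own, the following is a minimal sketch of Viterbi decoding over the two states, followed by grouping the State A sentences into passages. It is not the patented implementation: the function names `viterbi` and `passages`, and all of the scores, are hypothetical, and in practice the scores would come from a trained conditional random field.

```python
STATES = ("A", "B")  # A = concept being searched for, B = background


def viterbi(emission_scores, transition_scores):
    """Return the highest-scoring label sequence for a list of sentences.

    emission_scores: one dict {state: score} per sentence.
    transition_scores: dict {(previous_state, current_state): score}.
    """
    best = [{s: emission_scores[0][s] for s in STATES}]
    back = []
    for scores in emission_scores[1:]:
        row, ptr = {}, {}
        for cur in STATES:
            # Pick the previous state that gives the best score for `cur`.
            prev = max(STATES, key=lambda p: best[-1][p] + transition_scores[(p, cur)])
            row[cur] = best[-1][prev] + transition_scores[(prev, cur)] + scores[cur]
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    path = [max(STATES, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def passages(labels):
    """Group consecutive State A sentence indices into (start, end) spans."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "A" and start is None:
            start = i
        elif label != "A" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans


# Hypothetical scores for five sentences. Taken alone, sentence 2 leans
# slightly towards background (1.0 vs 0.8).
emissions = [
    {"A": 0.0, "B": 2.0},
    {"A": 2.0, "B": 0.0},
    {"A": 0.8, "B": 1.0},
    {"A": 2.0, "B": 0.0},
    {"A": 0.0, "B": 2.0},
]
# Hypothetical transition scores that reward staying in the same state.
transitions = {("A", "A"): 1.0, ("A", "B"): 0.0, ("B", "A"): 0.0, ("B", "B"): 1.0}

labels = viterbi(emissions, transitions)
print(labels, passages(labels))
```

With these assumed scores, the sequence model keeps sentence 2 inside the passage because its neighbours are strongly State A, so a single passage is returned without a separate grouping step.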
12 Claims
1. A method for searching an electronic document for passages relating to a concept being searched for, where the concept is expressed as a word or plurality of words, the method comprising:
deconstructing by a computer processor training electronic texts stored on a computer readable [medium] into a stream of features;
storing the stream of features in a data store;
wherein the features include the text of complete sentences, tokens used by the text in each sentence, the sequence of sentences, layout of text and typography of text;
executing by a computer processor a conditional random field algorithm to label sentences in the electronic document as either being relevant to the concept being searched for (“State A”) or as background information (“State B”) based on the stream of features;
executing by the computer processor a search algorithm which returns those sentences labelled as State A;
wherein the conditional random field algorithm generates a probability of a sentence being relevant to State A;
wherein the probability includes a tolerance for words or portions of words which cannot be resolved into computer-readable text;
wherein, given a document containing multiple sentences S = {s1, s2, ..., sm} and the corresponding concept label for each sentence Concept = {concept1, concept2, ..., conceptm}, the conditional random field function defining the probability of the Concept applied to S, Pr(Concept|S), is expressed as:
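For orientation only, a generic linear-chain conditional random field over the sentence labels is conventionally written as below. This is the textbook form and is an assumption here, not necessarily the expression recited in the claim. The feature functions f_k, the weights λ_k, the normaliser Z(S) and the start label concept_0 are symbols introduced for illustration.

```latex
\[
\Pr(\mathrm{Concept} \mid S)
  = \frac{1}{Z(S)}
    \exp\!\left( \sum_{i=1}^{m} \sum_{k} \lambda_k \,
      f_k(\mathrm{concept}_{i-1}, \mathrm{concept}_i, S, i) \right),
\qquad
Z(S) = \sum_{\mathrm{Concept}'}
    \exp\!\left( \sum_{i=1}^{m} \sum_{k} \lambda_k \,
      f_k(\mathrm{concept}'_{i-1}, \mathrm{concept}'_i, S, i) \right)
\]
```

Here each concept_i takes the value State A or State B, and Z(S) sums over every possible labelling of the m sentences so that the probabilities sum to one.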
7. A system for searching an electronic document for passages relating to a concept being searched for, where the concept is expressed as a word or plurality of words, the system comprising:
a computer processor deconstructing training electronic texts stored on a computer readable [medium] into a stream of features;
a data store storing the stream of features;
wherein the features include the text of complete sentences, tokens used by the text in each sentence, the sequence of sentences, layout of text and typography of text;
wherein the computer processor executes a conditional random field algorithm to label sentences in the electronic document as either being relevant to the concept being searched for (“State A”) or as background information (“State B”) based on the stream of features;
and wherein the computer processor executes a search algorithm which returns those sentences labelled as State A;
wherein the conditional random field algorithm generates a probability of a sentence being relevant to State A;
wherein the probability includes a tolerance for words or portions of words which cannot be resolved into computer-readable text;
wherein, given a document containing multiple sentences S = {s1, s2, ..., sm} and the corresponding concept label for each sentence Concept = {concept1, concept2, ..., conceptm}, the conditional random field function defining the probability of the Concept applied to S, Pr(Concept|S), is expressed as:
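The deconstruction of text into a stream of features could look something like the sketch below. It is only an illustration under stated assumptions: the `deconstruct` and `tokenize` helpers, the `<UNRESOLVED>` placeholder, and the particular layout and typography cues (indentation, all capitals) are hypothetical choices, and the claims do not specify how unresolved words are detected or represented.

```python
UNRESOLVED = "<UNRESOLVED>"  # hypothetical placeholder for unreadable tokens


def tokenize(sentence):
    """Split a sentence into lower-cased tokens.

    Assumption: a token containing anything other than letters, digits,
    apostrophes or hyphens (for example the U+FFFD replacement character
    left by failed text recognition) counts as unresolved.
    """
    tokens = []
    for raw in sentence.split():
        word = raw.strip(".,;:!?()\"'")
        if not word:
            continue
        if all(ch.isalnum() or ch in "'-" for ch in word):
            tokens.append(word.lower())
        else:
            tokens.append(UNRESOLVED)
    return tokens


def deconstruct(sentences):
    """Return one feature dict per sentence, in document order."""
    stream = []
    for index, text in enumerate(sentences):
        tokens = tokenize(text)
        stream.append({
            "text": text,                                  # complete sentence
            "tokens": tokens,                              # tokens in the sentence
            "position": index,                             # sequence of sentences
            "is_indented": text.startswith((" ", "\t")),   # crude layout cue
            "is_upper": text.isupper(),                    # crude typography cue
            "unresolved_ratio": (
                tokens.count(UNRESOLVED) / len(tokens) if tokens else 0.0
            ),
        })
    return stream


# Hypothetical two-sentence document with one unreadable character.
document = ["TERMINATION", "  Either party may t\ufffdrminate this agreement."]
stream = deconstruct(document)
for features in stream:
    print(features["position"], features["tokens"], features["unresolved_ratio"])
```

Keeping the unresolved token as a placeholder, rather than dropping the sentence, lets the remaining tokens and the neighbouring sentences still contribute to the label, which is one plausible way to provide the tolerance the claim describes.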
Specification