System and method for identifying passages in electronic documents

US 9,645,988 B1
Filed: 08/25/2016
Issued: 05/09/2017
Est. Priority Date: 08/25/2016
Status: Active Grant



Abstract

The method proposed here deconstructs training sentences into a stream of features that represent the sentences, the tokens used by the text, their sequence, and other ancillary features extracted using natural language processing. A conditional random field is then used in which the concept being searched for is represented as state A and the background (everything that is not the concept) as state B. The model created by this training phase is then used to locate the concept as a sequence of sentences within a document. This has distinct advantages in accuracy and speed over methods that classify each sentence individually and then use a secondary method to group the classified sentences into passages. Furthermore, while previous methods were based on searching for the occurrence of tokens only, the use of a wider set of features enables this method to locate relevant passages even when different terminology is in use.
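
The labelling step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the weights are hand-set rather than learned, only token features are used, and the names `EMISSION`, `STAY`, `SWITCH`, `label_sentences`, and `passages` are hypothetical. It decodes the best A/B label sequence for a whole document with the Viterbi algorithm, so that neighbouring sentences influence each other.

```python
import re

STATES = ("A", "B")  # A = sentence belongs to the concept, B = background

# Hypothetical hand-set weights standing in for a trained model.
EMISSION = {
    ("A", "tok=terminate"): 5.0,
    ("A", "tok=termination"): 5.0,
    ("A", "tok=notice"): 0.5,
    ("B", "bias"): 2.0,
}
STAY, SWITCH = 0.5, -1.0  # reward staying in a state, penalize switching


def sentence_features(sentence):
    """Deconstruct one sentence into a set of token features plus a bias."""
    feats = {"tok=" + t for t in re.findall(r"[a-z]+", sentence.lower())}
    feats.add("bias")
    return feats


def emission(state, feats):
    """Score of a state for one sentence, summed over its active features."""
    return sum(EMISSION.get((state, f), 0.0) for f in feats)


def transition(prev, cur):
    """Score of moving from the previous sentence's state to the current one."""
    return STAY if prev == cur else SWITCH


def label_sentences(sentences):
    """Viterbi decoding: return the highest-scoring A/B label sequence."""
    feats = [sentence_features(s) for s in sentences]
    best = [{s: emission(s, feats[0]) for s in STATES}]
    back = []
    for f in feats[1:]:
        row, ptr = {}, {}
        for cur in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + transition(p, cur))
            row[cur] = best[-1][prev] + transition(prev, cur) + emission(cur, f)
            ptr[cur] = prev
        best.append(row)
        back.append(ptr)
    path = [max(STATES, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def passages(labels):
    """Group consecutive State A sentences into (first, last) index pairs."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["B"]):  # sentinel closes a trailing span
        if lab == "A" and start is None:
            start = i
        elif lab != "A" and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans


DOC = [
    "This agreement is governed by the laws of Ontario.",
    "Either party may terminate this agreement on notice.",
    "Such notice must be given in writing.",
    "Termination takes effect thirty days later.",
    "Fees are payable monthly.",
]
labels = label_sentences(DOC)
print(labels, passages(labels))
```

With these illustrative weights the labels come out as B, A, A, A, B, giving one passage spanning sentences 1 to 3. Taken alone, the third sentence scores higher as background (2.0 against 0.5), but the transition scores keep it inside the passage because its neighbours are strongly State A. That is the behaviour the abstract contrasts with classifying each sentence separately and grouping afterwards.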

26 Citations


12 Claims

1. A method for searching an electronic document for passages relating to a concept being searched for, where the concept is expressed as a word or plurality of words, the method comprising:
- deconstructing by a computer processor training electronic texts stored on a computer readable into a stream of features;
  
  storing the stream of features in a data store;
  
  wherein the features include the text of complete sentences, tokens used by the text in each sentence, the sequence of sentences, layout of text and typography of text;
  
  executing by a computer processor a conditional random field algorithm to label sentences in the electronic document as either being relevant to the concept being searched for (“State A”) or as background information (“State B”) based on the stream of features;
  
  executing by the computer processor a search algorithm which returns those sentences labelled as State A;
  
  wherein the conditional random field algorithm generates a probability of a sentence being relevant to State A;
  
  wherein the probability includes a tolerance for words or portions of words which cannot be resolved into computer-readable text;
  
  wherein, given a document containing multiple sentences S = {s₁, s₂, …, s_m} and the corresponding concept label for each sentence Concept = {concept₁, concept₂, …, concept_m}, the conditional random field function defining the probability of the Concept applied to S, Pr(Concept|S), is expressed as;
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method according to claim 1, wherein said words which cannot be resolved into computer-readable text have properties selected from the group consisting of being spelled incorrectly, being of poor optical character recognition quality, and being in a foreign language.
  - 3. The method according to claim 2, wherein the conditional random field algorithm is agnostic to the property which cause said words to be unresolvable into computer-readable text.
  - 4. The method according to claim 1, wherein the stream of features are generated, at least in part, from n-gram segments of word vectors within each sentence.
  - 5. The method according to claim 1, wherein each feature in the stream of features is tagged using natural language processing techniques.
  - 6. The method according to claim 1, wherein the stream of features includes grid-based layout information.
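
The expression that claim 1 refers to is not reproduced in this text. For orientation only, the general textbook form of a linear-chain conditional random field, written with the claim's symbols S and Concept, is shown here. The feature functions f_k, the weights λ_k, and the normalizer Z(S) are assumptions taken from that standard formulation, and the claimed expression may differ.

```latex
\Pr(\mathrm{Concept} \mid S)
  = \frac{1}{Z(S)}
    \exp\!\Bigg( \sum_{i=1}^{m} \sum_{k}
      \lambda_k \, f_k(\mathrm{concept}_{i-1}, \mathrm{concept}_i, S, i) \Bigg),
\qquad
Z(S) = \sum_{\mathrm{Concept}'}
    \exp\!\Bigg( \sum_{i=1}^{m} \sum_{k}
      \lambda_k \, f_k(\mathrm{concept}'_{i-1}, \mathrm{concept}'_i, S, i) \Bigg)
```

Here Z(S) sums over every possible labelling of the m sentences, so the probabilities of all label sequences add up to one.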

7. A system for searching an electronic document for passages relating to a concept being searched for, where the concept is expressed as a word or plurality of words, the system comprising:
- a computer processor deconstructing training electronic texts stored on a computer readable into a stream of features;
  
  a data store storing the stream of features;
  
  wherein the features include the text of complete sentences, tokens used by the text in each sentence, the sequence of sentences, layout of text and typography of text;
  
  wherein the computer processor executes a conditional random field algorithm to label sentences in the electronic document as either being relevant to the concept being searched for (“State A”) or as background information (“State B”) based on the stream of features;
  
  and wherein the computer processor executes a search algorithm which returns those sentences labelled as State A;
  
  wherein the conditional random field algorithm generates a probability of a sentence being relevant to State A;
  
  wherein the probability includes a tolerance for words or portions of words which cannot be resolved into computer-readable text;
  
  wherein, given a document containing multiple sentences S = {s₁, s₂, …, s_m} and the corresponding concept label for each sentence Concept = {concept₁, concept₂, …, concept_m}, the conditional random field function defining the probability of the Concept applied to S, Pr(Concept|S), is expressed as;
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The system according to claim 7, wherein said words which cannot be resolved into computer-readable text have properties selected from the group consisting of being spelled incorrectly, being of poor optical character recognition quality, and being in a foreign language.
  - 9. The system according to claim 8, wherein the conditional random field algorithm is agnostic to the property which cause said words to be unresolvable into computer-readable text.
  - 10. The system according to claim 7, wherein the stream of features are generated, at least in part, from n-gram segments of word vectors within each sentence.
  - 11. The system according to claim 7, wherein each feature in the stream of features is tagged using natural language processing techniques.
  - 12. The system according to claim 7, wherein the stream of features includes grid-based layout information.
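
The dependent claims mention n-gram features and a tolerance for words that cannot be resolved into computer-readable text, whatever the cause. One way to picture this is the sketch below. It is an illustration under stated assumptions, not the claimed method: it uses plain tokens rather than word vectors, and the `UNK` placeholder, the `vocabulary` set, and the function names are hypothetical.

```python
import re

UNK = "<UNK>"  # hypothetical placeholder for any token that cannot be resolved


def normalize(tokens, vocabulary):
    """Map every out-of-vocabulary token to one placeholder, whatever the cause."""
    return [t if t in vocabulary else UNK for t in tokens]


def ngrams(tokens, n):
    """All contiguous n-token windows of the sentence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(sentence, vocabulary, max_n=2):
    """Unigram up to max_n-gram string features for one sentence."""
    tokens = normalize(re.findall(r"\w+", sentence.lower()), vocabulary)
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(" ".join(g) for g in ngrams(tokens, n))
    return feats


VOCAB = {"either", "party", "may", "terminate", "this", "agreement"}

# "termlnate" imitates an OCR error; "résilier" is a foreign-language word.
ocr = ngram_features("Either party may termlnate this agreement", VOCAB)
foreign = ngram_features("Either party may résilier this agreement", VOCAB)
print(ocr)
print(ocr == foreign)
```

Because both unresolved words collapse to the same placeholder, the two sentences yield identical features. A model built on such features would therefore not need to know whether a token was misspelled, badly scanned, or foreign.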


Current Assignee: Kira, Inc.; ZUVA Incorporated
Original Assignee: Kira, Inc.
Inventors: Warren, Robert Henry; Hudek, Alexander Karl
Primary Examiner(s): Uddin, Mohammed R
Application Number: US15/246,659
Time in Patent Office: 257 Days
Field of Search: 707728, 707730, 707737, 707738, 707999006
CPC Class Codes:
G06F 16/3344   using natural language anal...
G06F 40/205   Parsing
G06F 40/284   Lexical analysis, e.g. toke...
