SYSTEM AND METHOD FOR STORING AND SEARCHING DATA EXTRACTED FROM TEXT DOCUMENTS

US 20160275180A1
Filed: 05/20/2015
Published: 09/22/2016
Est. Priority Date: 03/19/2015
Status: Abandoned Application

Abstract

Disclosed are a system and a method for storing, searching and updating extracted data for natural language processing of text. An example method comprises extracting at least one first information object from a text document; generating one or more subject-predicate-object triplets for the first information object; accessing a storage of extracted data that contains an RDF graph comprising a plurality of subject-predicate-object triplets for a plurality of different information objects; searching the storage of extracted data for a second information object related to the first information object, wherein searching includes selecting and searching at least one of three types of N-gram identifier tables containing one of a double, a triple and a quad search indices associated with at least two of a subject, a predicate, an object and a document; when at least one second information object related to the first information object is found, wherein two objects are related when said two objects have at least one of a subject, a predicate and an object in common, updating the storage of extracted data by adding the at least one subject-predicate-object triplet of the first information object to the master RDF graph and associating the first and second information objects with each other.
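The relatedness test and the conditional update described in the abstract can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `subjects_related_to` and `store_if_related`, the plain in-memory set standing in for the RDF graph, and the string identifiers are all assumptions made for illustration, and the linear scan stands in for the index lookup.

```python
# Hypothetical sketch: a Python set of (subject, predicate, object) tuples
# stands in for the RDF graph; all names here are illustrative assumptions.
Triple = tuple[str, str, str]


def subjects_related_to(graph: set[Triple], new_triples: list[Triple]) -> set[str]:
    """Subjects of stored triples sharing a subject, predicate or object with a new triple."""
    return {
        ts
        for (ts, tp, to) in graph
        for (s, p, o) in new_triples
        if ts == s or tp == p or to == o
    }


def store_if_related(
    graph: set[Triple],
    associations: set[frozenset],
    new_triples: list[Triple],
) -> set[str]:
    """Add the new triples only when a related stored object exists, and link the objects."""
    related = subjects_related_to(graph, new_triples)
    if related:
        graph.update(new_triples)
        for s, _, _ in new_triples:
            for other in related:
                if other != s:
                    associations.add(frozenset((s, other)))
    return related


graph = {("person:1", "worksFor", "org:acme")}
associations: set[frozenset] = set()
found = store_if_related(graph, associations, [("person:2", "worksFor", "org:acme")])
not_found = store_if_related(graph, associations, [("city:9", "locatedIn", "country:3")])
```

In this sketch the second call leaves the graph unchanged because nothing stored shares a subject, predicate or object with the new triple.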

26 Claims

1. A computer-implemented method for storing in a computer system, searching and updating data extracted from text documents, the method comprising:
- extracting at least one first information object from a text document;
  
  generating one or more subject-predicate-object triplets for the first information object;
  
  accessing a storage of extracted data that contains an RDF graph comprising a plurality of subject-predicate-object triplets for a plurality of different information objects extracted from different text documents;
  
  searching the storage of extracted data for a second information object related to the same object in real world as the first information object, wherein two information objects are related when said two information objects have at least the subject parameter in common, and wherein searching includes selecting and searching at least one of three types of identifier tables containing one of a double, a triple and a quad search indices, wherein each search index is based on at least two parameters selected from a subject, a predicate, an object and a document;
  
  when at least one second information object related to the same object in real world as the first information object is found, updating the storage of extracted data by adding the at least one subject-predicate-object triplet of the first information object to the RDF graph and updating at least one of the three types of index tables.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9):
  - 2. The method of claim 1, wherein selection of a search index is based on the type of searched object and its features.
  - 3. The method of claim 1, wherein the lines of each identifier table are sorted lexicographically.
  - 4. The method of claim 1, wherein the double index includes a table with two columns that stores subject (s) and document (d) identifiers.
  - 5. The method of claim 1, wherein the triple index includes one or more tables with three columns that store one or more permutations of subject (s), predicate (p) and object (o) identifiers.
  - 6. The method of claim 1, wherein the quad index includes a table with four columns that stores document (d), subject (s), predicate (p) and object (o) identifiers.
  - 7. The method of claim 1, wherein when at least one second information object related to the same object in real world as the first information object is found in the storage of extracted data, updating the storage further comprises:
    - determining subject identifier of the second information object in the storage; and
      
      adding one or more new features of the first information object to the features of the subject identifier of the second information object in the storage.
  - 8. The method of claim 1, wherein when at least one second information object related to the same object in real world as the first information object is not found in the storage of extracted data, updating the storage further comprises:
    - assigning a new subject identifier to the first information object; and
      
      adding one or more new features of the first information object to the three types of identifier tables.
  - 9. The method of claim 1 further comprising:
    - generating an annotation for the first information object that indicates a relationship of the annotated first information object to the text document;
      
      marking in the text document the annotated first information object; and
      
      storing in the storage of extracted data the annotation and at least a portion of the text document containing the annotated first information object.
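The table layouts named in claims 3 to 6 can be sketched as follows. This is an illustrative sketch only: integer identifiers, the class names `SortedTable` and `Indexes`, and the choice of the s-p-o and p-o-s permutations for the triple index are assumptions, since the claims only require "one or more permutations". Rows are kept in lexicographic order so that a prefix lookup can use binary search from Python's standard `bisect` module.

```python
import bisect


class SortedTable:
    """Rows of integer identifiers kept in lexicographic order (claim 3)."""

    def __init__(self):
        self.rows = []

    def add(self, row):
        # Insert at the sorted position, skipping exact duplicates.
        i = bisect.bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)

    def prefix(self, *prefix):
        # Binary-search to the first candidate row, then scan while the prefix matches.
        lo = bisect.bisect_left(self.rows, prefix)
        out = []
        for row in self.rows[lo:]:
            if row[: len(prefix)] != prefix:
                break
            out.append(row)
        return out


class Indexes:
    """Hypothetical bundle of the double, triple and quad identifier tables."""

    def __init__(self):
        self.sd = SortedTable()    # double index: (subject, document), claim 4
        self.spo = SortedTable()   # triple index, s-p-o permutation, claim 5
        self.pos = SortedTable()   # triple index, p-o-s permutation (assumed choice)
        self.dspo = SortedTable()  # quad index: (document, subject, predicate, object), claim 6

    def add(self, d, s, p, o):
        self.sd.add((s, d))
        self.spo.add((s, p, o))
        self.pos.add((p, o, s))
        self.dspo.add((d, s, p, o))


idx = Indexes()
idx.add(1, 10, 20, 30)
idx.add(2, 10, 21, 31)
idx.add(2, 11, 20, 30)
```

Keeping several permutations trades storage for lookup speed: each access pattern is answered by a contiguous run of rows in one table.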

10. A system for storing, searching and updating extracted data, the system comprising:
- a storage of extracted data containing an RDF graph comprising a plurality of subject-predicate-object triplets for a plurality of different information objects;
  
  a hardware processor coupled to the storage, the processor being configured to:
  
  extract at least one first information object from a text document;
  
  generate one or more subject-predicate-object triplets for the first information object;
  
  search the storage of extracted data for a second information object related to the same object in real world as the first information object, wherein two information objects are related when said two information objects have at least the subject parameter in common, and wherein searching includes selecting and searching at least one of three types of N-gram identifier tables containing one of a double, a triple and a quad search indices, wherein each search index is based on at least two parameters selected from a subject, a predicate, an object and a document;
  
  when at least one second information object related to the same object in real world as the first information object is found, update the storage of extracted data by adding the at least one subject-predicate-object triplet of the first information object to the RDF graph and updating at least one of the three types of index tables.
- Dependent claims (11, 12, 13, 14, 15, 16, 17, 18):
  - 11. The system of claim 10, wherein selection of a search index is based on the type of searched object and its features.
  - 12. The system of claim 10, wherein the lines of each identifier table are sorted lexicographically.
  - 13. The system of claim 10, wherein the double index includes a table with two columns that stores subject (s) and document (d) identifiers.
  - 14. The system of claim 10, wherein the triple index includes one or more tables with three columns that store one or more permutations of subject (s), predicate (p) and object (o) identifiers.
  - 15. The system of claim 10, wherein the quad index includes a table with four columns that stores document (d), subject (s), predicate (p) and object (o) identifiers.
  - 16. The system of claim 10, wherein when at least one second information object related to the same object in real world as the first information object is found in the storage of extracted data, updating the storage of extracted data further comprises:
    - determining subject identifier of a second information object in the storage; and
      
      adding one or more new features of the first information object to the features of the subject identifier of a second information object in the storage.
  - 17. The system of claim 10, wherein when at least one second information object related to the same object in real world as the first information object is not found in the storage of extracted data, updating the storage further comprises:
    - generating an annotation for the first information object that indicates a relationship of the annotated first information object to the text document; and
      
      storing in the storage of extracted data the annotation and at least a portion of the text document containing the annotated first information object.
  - 18. The system of claim 10, wherein the processor is further configured to:
    - generate an annotation for the first information object that indicates a relationship of the annotated first information object to the text document;
      
      mark in the text document the annotated first information object; and
      
      store in the storage of extracted data the annotation and at least a portion of the text document containing the annotated first information object.
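Claim 11 says only that index selection depends on the type of the searched object and its features. One plausible reading, sketched below purely as an assumption, is to choose the table whose column order puts the known parameters first. The function name `select_index`, the table names, and the extra `osp` permutation are hypothetical and do not come from the claims.

```python
def select_index(s=None, p=None, o=None, d=None):
    """Return (table_name, key_prefix) for the parameters that are known.

    Assumed tables: "sd" (double), "spo" / "pos" / "osp" (triple permutations),
    "dspo" (quad). Known parameters that do not fit the prefix still need a
    filtering pass over the rows returned by the chosen table.
    """
    if d is not None and s is not None and p is None and o is None:
        return "sd", (s, d)
    if d is not None:
        prefix = [d]
        for value in (s, p, o):
            if value is None:
                break
            prefix.append(value)
        return "dspo", tuple(prefix)
    if s is not None:
        prefix = [s]
        for value in (p, o):
            if value is None:
                break
            prefix.append(value)
        return "spo", tuple(prefix)
    if p is not None:
        return "pos", ((p,) if o is None else (p, o))
    if o is not None:
        return "osp", (o,)
    raise ValueError("at least one of s, p, o, d must be given")


choice_subject_in_document = select_index(s=10, d=2)
choice_predicate_object = select_index(p=20, o=30)
```

The returned prefix is meant to be passed to a prefix search over the lexicographically sorted rows of the chosen table.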

19. A computer program product stored on a non-transitory computer-readable storage medium, the computer program product comprising computer-executable instructions for storing, searching and updating extracted data, comprising instructions for:
- extracting at least one first information object from a text document;
  
  generating one or more subject-predicate-object triplets for the first information object;
  
  accessing a storage of extracted data that contains an RDF graph comprising a plurality of subject-predicate-object triplets for a plurality of different information objects;
  
  searching the storage of extracted data for a second information object related to the same object in real world as the first information object, wherein two information objects are related when said two information objects have at least the subject parameter in common, and wherein searching includes selecting and searching at least one of three types of identifier tables containing one of a double, a triple and a quad search indices, wherein each search index is based on at least two parameters selected from a subject, a predicate, an object and a document;
  
  when at least one second information object related to the same object in real world as the first information object is found, updating the storage of extracted data by adding the at least one subject-predicate-object triplet of the first information object to the RDF graph and updating at least one of the three types of index tables.
- Dependent claims (20, 21, 22, 23, 24, 25, 26):
  - 20. The computer program product of claim 19, wherein selection of a search index is based on the type of searched extracted data.
  - 21. The computer program product of claim 19, wherein the lines of each identifier table are sorted lexicographically.
  - 22. The computer program product of claim 19, wherein the double index includes a table with two columns that stores object (o) and document (d) identifiers.
  - 23. The computer program product of claim 19, wherein the triple index includes one or more tables with three columns that store one or more permutations of subject (s), predicate (p) and object (o) identifiers.
  - 24. The computer program product of claim 19, wherein the quad index includes a table with four columns that stores document (d), subject (s), predicate (p) and object (o) identifiers.
  - 25. The computer program product of claim 19, wherein adding the at least one subject-predicate-object triplet of the first information object to the master RDF graph comprises assigning to the first information object a unique global identifier in the storage of extracted data.
  - 26. The computer program product of claim 19 further comprising instructions for:
    - generating an annotation for the first information object that indicates a relationship of the annotated first information object to the text document;
      
      marking in the text document the annotated first information object; and
      
      storing in the storage of extracted data the annotation and at least a portion of the text document containing the annotated first information object.
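The annotation steps of claim 26, together with the unique global identifier of claim 25, might look like the sketch below. Everything specific in it is an assumption for illustration: the `Annotation` record, the `[[...]]` marking syntax, the character-offset span, the context window, and the counter used to issue identifiers are not taken from the application.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Annotation:
    object_id: int    # unique global identifier (assumed to be a running counter)
    document_id: str  # which text document the object came from
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    snippet: str      # portion of the text containing the annotated object


_next_object_id = count(1)


def annotate(text, document_id, mention, context=10):
    """Mark the first occurrence of `mention` and build its annotation record."""
    start = text.find(mention)
    if start < 0:
        raise ValueError(f"{mention!r} does not occur in the document")
    end = start + len(mention)
    marked_text = text[:start] + "[[" + mention + "]]" + text[end:]
    snippet = text[max(0, start - context): end + context]
    annotation = Annotation(next(_next_object_id), document_id, start, end, snippet)
    return marked_text, annotation


text = "Alice joined Acme in 2015."
marked, first = annotate(text, "doc-1", "Acme")
_, second = annotate(text, "doc-1", "Alice")
```

Storing the snippet alongside the offsets lets a search result show the object in context without reloading the whole document.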

Current Assignee: ABBYY Production LLC (ABBYY Software)
Original Assignee: ABBYY InfoPoisk LLC
Inventors: Matskevich, Stepan
Application Number: US14/717,647
Publication Number: US 20160275180A1
US Class Current: 1/1
CPC Class Codes:
G06F 16/31   Indexing; Data structures t...
G06F 16/3344   using natural language anal...
G06F 16/93   Document management systems
G06F 40/169   Annotation, e.g. comment da...
