Systems and methods of semantically annotating documents of different structures

US 8,924,374 B2
Filed: 02/22/2008
Issued: 12/30/2014
Est. Priority Date: 02/22/2008
Status: Active Grant

Abstract

A computer retrieves a document from a data source, wherein the document has a structure type. The computer generates a customized data model for the document in accordance with its structure type. The computer identifies one or more candidate chunks within the customized data model in accordance with a set of heuristic rules associated with the structure type.
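The flow described in the abstract can be illustrated with a minimal sketch. Everything named here (the `Node` class, the `RULES` table, and the tag labels used as rules) is a hypothetical stand-in chosen for illustration; the patent text shown here does not specify concrete rules or data structures.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One node of a hypothetical hierarchical data model."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


# A heuristic rule is modeled as a predicate over a node (an assumption).
Rule = Callable[[Node], bool]

# Hypothetical rule sets, keyed by the structure type of the document.
RULES: Dict[str, List[Rule]] = {
    "semi-structured": [lambda n: n.label in {"p", "li", "td"}],
    "unstructured": [lambda n: n.label == "paragraph"],
}


def candidate_chunks(root: Node, structure_type: str) -> List[Node]:
    """Return nodes that satisfy at least one rule for the structure type,
    visited in document (pre-order) order."""
    rules = RULES.get(structure_type, [])
    found: List[Node] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if any(rule(node) for rule in rules):
            found.append(node)
        # Push children reversed so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return found


doc = Node("body", children=[
    Node("p", "first"),
    Node("div", children=[Node("li", "second")]),
])
chunk_texts = [n.text for n in candidate_chunks(doc, "semi-structured")]
```

With the sample tree above, the `p` and `li` nodes are picked as candidate chunks because they match the assumed semi-structured rule set, while `body` and `div` are skipped.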

33 Claims

1. A computer-implemented method, comprising:
- at a computer having memory and one or more processors:
  - receiving one or more search keywords from a user;
  - selecting a plurality of candidate document identifiers in accordance with the one or more search keywords, each candidate document identifier corresponding to a respective document at a respective data source;
  - for a respective candidate document identifier of the plurality of candidate document identifiers:
    - retrieving a document corresponding to the respective candidate document identifier from a data source, wherein the document has a structure type;
    - converting the document into a node stream, wherein the document conversion is initiated immediately after retrieving a portion of the document;
    - generating a customized data model for the document using the node stream in accordance with the structure type of the document;
    - identifying one or more candidate chunks within the customized data model in accordance with a set of heuristic rules associated with the structure type; and
    - selecting one or more chunks of the candidate chunks that satisfy the one or more search keywords; and
  - providing at least one of the selected one or more chunks for display to the user.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, wherein the customized data model is a hierarchical data model.
  - 3. The method of claim 1, wherein the structure type is one selected from the group consisting of structured, semi-structured, and unstructured.
  - 4. The method of claim 1, wherein the data source is a web server.
  - 5. The method of claim 1, wherein the document is an HTML web page.
  - 6. The method of claim 5, wherein the HTML web page includes multiple pairs of HTML tags, further comprising:
    - identifying a first subset of the HTML web page between a first pair of HTML tags as a first candidate chunk if the first pair of HTML tags satisfies one of the set of heuristic rules.
  - 7. The method of claim 6, further comprising:
    - recursively identifying a second subset of the HTML web page within the first subset of the HTML web page between a second pair of HTML tags as a second candidate chunk if the second pair of HTML tags satisfy one of the set of heuristic rules, wherein the second pair of HTML tags is distinct from the first pair of HTML tags.
  - 8. The method of claim 7, including:
    - recursively identifying a third subset of the HTML web page within the second subset of the HTML web page between a third pair of HTML tags as a third candidate chunk if the third pair of HTML tags satisfy one of the set of heuristic rules, wherein the third pair of HTML tags is distinct from the second pair of HTML tags.
  - 9. The method of claim 1, wherein the document is a plain-text document.
  - 10. The method of claim 9, wherein generating the customized data model further includes inserting metadata into the data model that separates the plain-text document into multiple candidate chunks.
  - 11. The method of claim 10, wherein the metadata in the data model is one or more XML tags and the text following at least one of the XML tags is identified as a candidate chunk.
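The recursive identification of nested candidate chunks in claims 6 to 8 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `CHUNK_TAGS` set stands in for the unspecified heuristic rules, and `Element`, `TreeBuilder`, `find_chunks`, and `chunk_html` are hypothetical names. Only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Assumed heuristic: a pair of tags with one of these names delimits a chunk.
CHUNK_TAGS = {"div", "p", "li"}


class Element:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []  # strings and child Elements, in document order

    def text(self):
        parts = [i if isinstance(i, str) else i.text() for i in self.items]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree; void tags such as <br> are not special-cased."""

    def __init__(self):
        super().__init__()
        self.root = Element("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        child = Element(tag, self.current)
        self.current.items.append(child)
        self.current = child

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(data.strip())


def find_chunks(element, depth=0):
    """Return (depth, tag, text) for each matching tag pair, recursing into
    every subset so nested chunks are found inside their enclosing chunk."""
    chunks = []
    for item in element.items:
        if isinstance(item, str):
            continue
        matched = item.tag in CHUNK_TAGS
        if matched:
            chunks.append((depth, item.tag, item.text()))
        chunks.extend(find_chunks(item, depth + 1 if matched else depth))
    return chunks


def chunk_html(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find_chunks(builder.root)


nested = chunk_html("<div><p>alpha</p><span>beta</span></div>")
```

Here the outer `div` pair yields a first candidate chunk at depth 0, and the `p` pair inside it yields a second candidate chunk at depth 1; the `span` pair does not satisfy the assumed rule and contributes only its text to the enclosing chunk.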

12. A computer system, comprising:
- memory;
- one or more processors;
- one or more programs stored in the memory and configured for execution by the one or more processors, the one or more programs including instructions for:
  - receiving one or more search keywords from a user;
  - selecting a plurality of candidate document identifiers in accordance with the one or more search keywords, each candidate document identifier corresponding to a respective document at a respective data source;
  - for a respective candidate document identifier of the plurality of candidate document identifiers:
    - retrieving a document corresponding to the respective candidate document identifier from a data source, wherein the document has a structure type;
    - converting the document into a node stream, wherein the document conversion is initiated immediately after retrieving a portion of the document;
    - generating a customized data model for the document using the node stream in accordance with the structure type of the document;
    - identifying one or more candidate chunks within the customized data model in accordance with a set of heuristic rules associated with the structure type; and
    - selecting one or more chunks of the candidate chunks that satisfy the one or more search keywords; and
  - providing at least one of the selected one or more chunks for display to the user.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22):
  - 13. The computer system of claim 12, wherein the customized data model is a hierarchical data model.
  - 14. The computer system of claim 12, wherein the structure type is one selected from the group consisting of structured, semi-structured, and unstructured.
  - 15. The computer system of claim 12, wherein the data source is a web server.
  - 16. The computer system of claim 12, wherein the document is an HTML web page.
  - 17. The computer system of claim 16, wherein the HTML web page includes multiple pairs of HTML tags, further comprising:
    - instructions for identifying a first subset of the HTML web page between a first pair of HTML tags as a first candidate chunk if the first pair of HTML tags satisfy one of the set of heuristic rules.
  - 18. The computer system of claim 17, further comprising:
    - instructions for recursively identifying a second subset of the HTML web page within the first subset of the HTML web page between a second pair of HTML tags as a second candidate chunk if the second pair of HTML tags satisfy one of the set of heuristic rules, wherein the second pair of HTML tags is distinct from the first pair of HTML tags.
  - 19. The computer system of claim 18, wherein the one or more programs include instructions for:
    - recursively identifying a third subset of the HTML web page within the second subset of the HTML web page between a third pair of HTML tags as a third candidate chunk if the third pair of HTML tags satisfy one of the set of heuristic rules, wherein the third pair of HTML tags is distinct from the second pair of HTML tags.
  - 20. The computer system of claim 12, wherein the document is a plain-text document.
  - 21. The computer system of claim 20, wherein the instructions for generating the customized data model further include instructions for inserting metadata into the data model that separates the plain-text document into multiple candidate chunks.
  - 22. The computer system of claim 21, wherein the metadata in the data model is one or more XML tags and the text following at least one of the XML tags is identified as a candidate chunk.
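For the plain-text case in claims 20 to 22, one possible reading is sketched below: an empty XML marker tag is inserted before each paragraph, and the text following each tag becomes a candidate chunk. The blank-line paragraph rule, the `chunk` and `document` tag names, and both function names are assumptions made for this example; the claims do not specify how the separation points are chosen.

```python
import re
import xml.etree.ElementTree as ET


def build_model(plain_text):
    """Build a data model in which an empty <chunk/> marker tag precedes
    each paragraph. Paragraphs are assumed to be separated by blank lines."""
    root = ET.Element("document")
    for paragraph in re.split(r"\n\s*\n", plain_text.strip()):
        if not paragraph.strip():
            continue
        marker = ET.SubElement(root, "chunk")
        # ElementTree stores the text that follows an element as its "tail".
        marker.tail = " ".join(paragraph.split())
    return root


def plain_text_chunks(root):
    """Return the text following each marker tag as a candidate chunk."""
    return [m.tail for m in root.iter("chunk") if m.tail]


model = build_model("First paragraph\nstill first.\n\nSecond paragraph.")
chunks = plain_text_chunks(model)
```

Using the element tail keeps the metadata (the tags) separate from the document text, so the original wording of each paragraph is preserved apart from whitespace normalization.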

23. A non-transitory computer readable storage medium having stored therein instructions, which when executed by a computer system cause the computer system to:
- receive one or more search keywords from a user;
- select a plurality of candidate document identifiers in accordance with the one or more search keywords, each candidate document identifier corresponding to a respective document at a respective data source;
- for a respective candidate document identifier of the plurality of candidate document identifiers:
  - retrieve a document corresponding to the respective candidate document identifier from a data source, wherein the document has a structure type;
  - convert the document into a node stream, wherein the document conversion is initiated immediately after retrieving a predefined portion of the document;
  - generate a customized data model for the document using the node stream in accordance with the structure type of the document;
  - identify one or more candidate chunks within the customized data model in accordance with a set of heuristic rules associated with the structure type; and
  - select one or more chunks of the candidate chunks that satisfy the one or more search keywords; and
- provide at least one of the selected one or more chunks for display to the user.
- Dependent claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33):
  - 24. The computer readable storage medium of claim 23, wherein the customized data model is a hierarchical data model.
  - 25. The computer readable storage medium of claim 23, wherein the structure type is one selected from the group consisting of structured, semi-structured, and unstructured.
  - 26. The computer readable storage medium of claim 23, wherein the data source is a web server.
  - 27. The computer readable storage medium of claim 23, wherein the document is an HTML web page.
  - 28. The computer readable storage medium of claim 27, wherein the HTML web page includes multiple pairs of HTML tags, further comprising:
    - instructions for identifying a first subset of the HTML web page between a first pair of HTML tags as a first candidate chunk if the first pair of HTML tags satisfy one of the set of heuristic rules.
  - 29. The computer readable storage medium of claim 28, further comprising:
    - instructions for recursively identifying a second subset of the HTML web page within the first subset of the HTML web page between a second pair of HTML tags as a second candidate chunk if the second pair of HTML tags satisfy one of the set of heuristic rules, wherein the second pair of HTML tags is distinct from the first pair of HTML tags.
  - 30. The computer readable storage medium of claim 29, wherein the computer readable storage medium includes instructions for:
    - recursively identifying a third subset of the HTML web page within the second subset of the HTML web page between a third pair of HTML tags as a third candidate chunk if the third pair of HTML tags satisfy one of the set of heuristic rules, wherein the third pair of HTML tags is distinct from the second pair of HTML tags.
  - 31. The computer readable storage medium of claim 23, wherein the document is a plain-text document.
  - 32. The computer readable storage medium of claim 31, wherein the instructions for generating the customized data model further include instructions for inserting metadata into the data model that separates the plain-text document into multiple candidate chunks.
  - 33. The computer readable storage medium of claim 32, wherein the metadata in the data model is one or more XML tags and the text following at least one of the XML tags is identified as a candidate chunk.
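The independent claims recite that conversion into a node stream starts as soon as a portion of the document has been retrieved. A rough sketch of that idea, assuming an HTML document and using the incremental `feed` method of Python's standard `html.parser.HTMLParser`, is shown below. The `(kind, value)` event tuples and the names `NodeStreamParser` and `node_stream` are hypothetical; the claims do not define the node format.

```python
from html.parser import HTMLParser
from typing import Iterable, Iterator, List, Tuple

NodeEvent = Tuple[str, str]  # (kind, value); a hypothetical node format


class NodeStreamParser(HTMLParser):
    """Collects parse events so they can be handed out after each portion."""

    def __init__(self):
        super().__init__()
        self.pending: List[NodeEvent] = []

    def handle_starttag(self, tag, attrs):
        self.pending.append(("start", tag))

    def handle_endtag(self, tag):
        self.pending.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.pending.append(("text", data.strip()))


def node_stream(portions: Iterable[str]) -> Iterator[NodeEvent]:
    """Yield nodes as soon as each retrieved portion has been parsed,
    without waiting for the rest of the document."""
    parser = NodeStreamParser()
    for portion in portions:
        parser.feed(portion)
        events, parser.pending = parser.pending, []
        yield from events
    parser.close()  # flush any input still buffered by the parser
    yield from parser.pending


consumed = []


def portions():
    # Stand-in for a network download that arrives in two pieces.
    for part in ["<div><p>Hello</p>", "<p>World</p></div>"]:
        consumed.append(part)
        yield part


stream = node_stream(portions())
first_node = next(stream)
portions_read_so_far = len(consumed)
remaining_nodes = list(stream)
```

Because `node_stream` is a generator, the first node is available after only the first portion has been read, which mirrors the claimed behavior of starting conversion before retrieval completes.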

Current Assignee: Stripe, Inc.
Original Assignee: TigerLogic Corporation
Inventors: Dexter, Jeffrey Matthew
Primary Examiner(s): WILSON, KIMBERLY LOVEL
Application Number: US12/035,597
Publication Number: US 20090216715A1
Time in Patent Office: 2,503 Days
Field of Search: 707/999.003, 707/722
US Class Current: 707/722
CPC Class Codes:
- G06F 16/334 Query execution G06F16/335 ...
- G06F 40/169 Annotation, e.g. comment da...
