Methods and systems to train models to extract and integrate information from data sources

US 8,805,861 B2
Filed: 05/15/2009
Issued: 08/12/2014
Est. Priority Date: 12/09/2008
Status: Expired due to Fees

Abstract

Methods and systems to model and acquire data from a variety of data and information sources, to integrate the data into a structured database, and to manage the continuing reintegration of updated data from those sources over time. For any given domain, a variety of individual information and data sources that contain information relevant to the schema can be identified. Data elements associated with a schema may be identified in a training source, such as by user tagging. A formal grammar may be induced appropriate to the schema and layout of the training source. A Hidden Markov Model (HMM) corresponding to the grammar may learn where in the sources the elements can be found. The system can automatically mutate its schema into a grammar matching the structure of the source documents. By following an inverse transformation sequence, data that is parsed by the mutated grammar can be fit back into the original grammar structure, matching the original data schema defined through domain modeling. Features disclosed herein may be implemented with respect to web-scraping and data acquisition, and to represent data in support of data-editing and data-merging tasks. A schema may be defined with respect to a graph-based domain model.
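
The inverse-transformation idea in the abstract can be illustrated with a toy sketch. It is not the patented method: it assumes a grammar is simply an ordered tuple of field names and uses one hypothetical invertible transformation, `Permute`, whose name and behavior are invented here for illustration. A domain grammar is mutated into a page grammar by a forward sequence, and data parsed under the page grammar is fitted back by applying the inverses in reverse order.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Permute:
    """Hypothetical invertible transformation: reorder the fields of a grammar.

    order[i] is the index of the input element that lands at output position i.
    """
    order: Tuple[int, ...]

    def apply(self, seq: Sequence[str]) -> Tuple[str, ...]:
        return tuple(seq[i] for i in self.order)

    def invert(self) -> "Permute":
        inverse = [0] * len(self.order)
        for new_pos, old_pos in enumerate(self.order):
            inverse[old_pos] = new_pos
        return Permute(tuple(inverse))


def to_page_grammar(domain: Sequence[str], transforms: List[Permute]) -> Tuple[str, ...]:
    """Apply the forward sequence (G -> G')_1 ... (G -> G')_n to the domain grammar."""
    grammar = tuple(domain)
    for t in transforms:
        grammar = t.apply(grammar)
    return grammar


def to_domain_record(values: Sequence[str], transforms: List[Permute]) -> Tuple[str, ...]:
    """Undo the sequence on parsed data: inverses are applied in reverse order."""
    record = tuple(values)
    for t in reversed(transforms):
        record = t.invert().apply(record)
    return record


# Illustrative data only.
domain_grammar = ("title", "author", "year")
sequence = [Permute((2, 0, 1)), Permute((1, 0, 2))]

page_grammar = to_page_grammar(domain_grammar, sequence)
parsed_from_page = ("Dune", "1965", "Herbert")  # values in page-grammar order
restored = to_domain_record(parsed_from_page, sequence)
```

In this example the page grammar orders fields as title, year, author, and `restored` returns the values to the domain order of title, author, year.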

24 Claims

1. A non-transitory computer readable storage medium storing at least one program configured for execution by at least one processor of a computer system, the at least one program comprising instructions to:
- obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
  
  receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
  
  (i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
  
  (ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
  
  select a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
  
  extract information from a third source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
  
  transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
- Dependent Claims (2, 3, 6, 9, 12, 15, 21, 22)
- - 2. The non-transitory computer readable storage medium of claim 1, wherein the instructions to select the page grammar for the first set of source documents comprises heuristically identifying a first sequence of grammar transformations {(G→G′)₁, . . . , (G→G′)ₙ} that transforms the domain grammar to the page grammar.
  - 3. The non-transitory computer readable storage medium of claim 2, wherein each respective grammar transformations (G→G′)ᵢ in the first sequence of grammar transformations is invertible by a corresponding grammar transformation (G′→G)ᵢ, in a second sequence of grammar transformations {(G′→G)₁, . . . , (G′→G)ₙ}, that undoes an effect of the transformation (G→G′)ᵢ with respect to the domain grammar, and the instructions to transform information comprise using the second sequence of grammar transformations to structurally transform information extracted from the second set of source documents to the format of the domain grammar.
  - 6. The non-transitory computer readable storage medium of claim 1 wherein the page grammar for the first set of documents is selected by running a Viterbi algorithm on the tag layout.
  - 9. The non-transitory computer readable storage medium of claim 1, wherein a token in the corresponding user-identified tokens comprises a word, a number, a punctuation character, an HTML element, a link or hyperlink, a form button, a control character, an image, an audio file, or a video file in a source document in the first set of source documents.
  - 12. The non-transitory computer readable storage medium of claim 1, wherein a user-provided navigational tag in the plurality of user-provided navigational tags comprises one or more tokens in a source document in the first set of source documents.
  - 15. The non-transitory computer readable storage medium of claim 1, wherein the first information associated with the domain model comprise a database, a spreadsheet, a web service feed, or an external website.
  - 21. The computer-implemented method of claim 2, wherein each respective grammar transformations (G→G′)ᵢ in the first sequence of grammar transformations is invertible by a corresponding grammar transformation (G′→G)ᵢ, in a second sequence of grammar transformations {(G′→G)₁, . . . , (G′→G)ₙ}, that undoes an effect of the transformation (G→G′)ᵢ with respect to the domain grammar, and the instructions to transform information comprise using the second sequence of grammar transformations to structurally transform information extracted from the second set of source documents to the format of the domain grammar.
  - 22. The non-transitory computer readable storage medium of claim 2, wherein a grammar transformation in the first sequence of grammar transformations comprises a lift(R), permute(P), multi-choice-permute(P), factor(P), require(R), unloop(R), set-cardinality(G), choice(P) or interleave(R) grammar operator.
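
Claim 6 recites selecting the page grammar by running a Viterbi algorithm on the tag layout. As a rough illustration only, the textbook Viterbi decoder below finds the most likely sequence of hidden states for a short sequence of observed tags. The state names, tag names, and probabilities are invented for the example, and nothing here is taken from the patent's specification.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _log(p: float) -> float:
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[float, List[str]]:
    """Return (log-probability, state path) of the best path through the HMM."""
    best = [{s: _log(start_p.get(s, 0.0)) + _log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the score of reaching s.
            prev = max(states, key=lambda p: best[t - 1][p] + _log(trans_p[p].get(s, 0.0)))
            score = best[t - 1][prev] + _log(trans_p[prev].get(s, 0.0))
            best[t][s] = score + _log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path


# Hypothetical two-state model over observed markup tags.
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.2, "author": 0.8}, "author": {"title": 0.5, "author": 0.5}}
emit_p = {"title": {"h1": 0.8, "span": 0.2}, "author": {"h1": 0.1, "span": 0.9}}

score, path = viterbi(["h1", "span"], states, start_p, trans_p, emit_p)
```

One way to use such a decoder for selection, again as an assumption, would be to build one model per candidate page grammar and keep the grammar whose best path scores highest on the tag layout.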

4. A system for extracting and integrating information from one or more sources, comprising:
- at least one processor;
  
  memory; and
  
  at least one program stored in the memory and executable by the at least one processor, the at least one program comprising instructions to:
  
  obtain a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
  
  receive a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
  
  (i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
  
  (ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
  
  select a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
  
  extract information from a third of source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
  
  transform the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
- Dependent Claims (7, 10, 13, 16, 18, 19, 23)
- - 7. The system of claim 4 wherein the page grammar for the first set of documents is selected by running a Viterbi algorithm on the tag layout.
  - 10. The system of claim 4, wherein a token in the corresponding user-identified tokens comprises a word, a number, a punctuation character, an HTML element, a link or hyperlink, a form button, a control character, an image, an audio file, or a video file in a source document in the first set of source documents.
  - 13. The system of claim 4, wherein a user-provided navigational tag in the plurality of user-provided navigational tags comprises one or more tokens in a source document in the first set of source documents.
  - 16. The system of claim 4, wherein the first information associated with the domain model comprise a database, a spreadsheet, a web service feed, or an external website.
  - 18. The system of claim 4, wherein the instructions to select the page grammar for the first set of source documents comprises heuristically identifying a first sequence of grammar transformations {(G→G′)₁, . . . , (G→G′)ₙ} that transforms the domain grammar to the page grammar.
  - 19. The system of claim 18, wherein each respective grammar transformations (G→G′)ᵢ in the first sequence of grammar transformations is invertible by a corresponding grammar transformation (G′→G)ᵢ, in a second sequence of grammar transformations {(G′→G)₁, . . . , (G′→G)ₙ}, that undoes an effect of the transformation (G→G′)ᵢ with respect to the domain grammar, and the instructions to transform information comprise using the second sequence of grammar transformations to structurally transform information extracted from the second set of source documents to the format of the domain grammar.
  - 23. The system of claim 18, wherein a grammar transformation in the first sequence of grammar transformations comprises a lift(R), permute(P), multi-choice-permute(P), factor(P), require(R), unloop(R), set-cardinality(G), choice(P) or interleave(R) grammar operator.
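
The independent claims refer to a "predefined degree of tag layout similarity" between source documents. The claim text shown here does not fix a particular measure, so the sketch below is only one plausible reading: it assumes a tag layout can be flattened into a list of tag names and compares two layouts with the standard library's `difflib.SequenceMatcher`. The function name `similar_layout` and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence


def layout_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Return a similarity ratio in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, list(a), list(b)).ratio()


def similar_layout(a: Sequence[str], b: Sequence[str], threshold: float = 0.8) -> bool:
    """True when two layouts meet the (assumed) predefined degree of similarity."""
    return layout_similarity(a, b) >= threshold


# Illustrative tag layouts only.
trained_page = ["h1", "span", "a", "a"]
candidate_page = ["h1", "span", "a"]
unrelated_page = ["table", "td", "td"]

use_same_grammar = similar_layout(trained_page, candidate_page)
skip_page = not similar_layout(trained_page, unrelated_page)
```

Under this reading, a page grammar trained on one document would be reused only for documents whose layouts clear the threshold.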

5. A computer-implemented method for extracting and integrating information from one or more sources, comprising:
- obtaining a domain model comprising a set of entity types having corresponding properties and relationships between entities in a set of entities, wherein the domain model is characterized by a domain grammar;
  
  receiving a first tag layout of a first source document obtained from a first information source associated with the domain model, the first tag layout comprising:
  
  (i) a plurality of user-provided navigational tags, wherein a user-provided navigational tag in the plurality of a user-provided navigational tags indicates a navigational position of the first source document relative to a second source document, from the first information source, navigationally connected with the first source document, and
  
  (ii) a plurality of corresponding user-identified tokens in the first source document, wherein a user-identified token in the plurality of corresponding user-identified tokens includes a portion of content of the first source document;
  
  selecting a page grammar in plurality of page grammars for the first source document in accordance with the plurality of user provided navigational tags;
  
  extracting information from a third of source document having a predefined degree of tag layout similarity to the first source document using the page grammar, wherein the second source document is obtained from a second information source; and
  
  transforming the information extracted from the second source document in accordance with the domain grammar, thereby extracting and integrating information from a plurality of information sources.
- Dependent Claims (8, 11, 14, 17, 20, 24)
- - 8. A computer-implemented method of claim 5 wherein the page grammar for the first set of documents is selected by running a Viterbi algorithm on the tag layout.
  - 11. The computer-implemented method of claim 5, wherein a token in the corresponding user-identified tokens comprises a word, a number, a punctuation character, an HTML element, a link or hyperlink, a form button, a control character, an image, an audio file, or a video file in a source document in the first set of source documents.
  - 14. The computer-implemented method of claim 5, wherein a user-provided navigational tag in the plurality of user-provided navigational tags comprises one or more tokens in a source document in the first set of source documents.
  - 17. The computer-implemented method of claim 5, wherein the first information associated with the domain model comprise a database, a spreadsheet, a web service feed, or an external website.
  - 20. The computer-implemented method of claim 5, wherein the instructions to select the page grammar for the first set of source documents comprises heuristically identifying a first sequence of grammar transformations {(G→G′)₁, . . . , (G→G′)ₙ} that transforms the domain grammar to the page grammar.
  - 24. The computer-implemented method of claim 20, wherein a grammar transformation in the first sequence of grammar transformations comprises a lift(R), permute(P), multi-choice-permute(P), factor(P), require(R), unloop(R), set-cardinality(G), choice(P) or interleave(R) grammar operator.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Boyan, Justin; McDonald, Glenn; Benthall, Margaret; Molnar, Ray
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Black, Linh
Application Number: US12/467,235
Publication Number: US 20100145902A1
Time in Patent Office: 1,915 Days
Field of Search: 704/231, 704/234, 704/250, 704/256, 704/256.1, 704/256.2, 704/256.3, 704/256.4, 704/224, 704/242, 706/16, 706/19, 706/12-13, 706/25, 395/23, 395/77, 707/791, 707/793, 707/797, 707/798, 707/602, 707/756, 707/742, 700/28-31, 700/97-99
US Class Current: 707/756
CPC Class Codes: G06F 16/958 Organisation or management ...
