Machine learning system for extracting structured records from web pages and other text sources

US 20060123000A1
Filed: 12/02/2005
Published: 06/08/2006
Est. Priority Date: 12/03/2004
Status: Abandoned Application

Abstract

A method for extracting a structured record (190) from a document (100) is described, where the structured record includes information related to a predetermined subject matter (120), with this information organized into categories within the structured record. The method comprises the steps of identifying a span of text (130) in the document (100) according to criteria associated with the predetermined subject matter, and processing (150) the span of text to extract at least one text element associated with at least one of the categories of the structured record (190) from the document (100).
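
The two steps in the abstract can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `StructuredRecord` class, the keyword list standing in for the subject-matter criteria, and the regular expressions standing in for the extraction step are hypothetical. The dependent claims describe feature vector sequences labeled by a predictive algorithm, not keyword or pattern matching.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StructuredRecord:
    """Maps each category to the text elements extracted for it."""
    categories: dict = field(default_factory=dict)


# Hypothetical criteria for a "contact details" subject matter (assumption).
SUBJECT_KEYWORDS = ("contact", "email", "phone")
# Hypothetical per-category extractors (assumption).
CATEGORY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\+?\d[\d ()-]{6,}\d"),
}


def identify_span(document: str) -> str:
    """Step 1: keep only paragraphs that match the subject-matter criteria."""
    paragraphs = [p.strip() for p in document.split("\n\n")]
    hits = [p for p in paragraphs
            if any(k in p.lower() for k in SUBJECT_KEYWORDS)]
    return "\n\n".join(hits)


def process_span(span: str) -> StructuredRecord:
    """Step 2: extract text elements and file them under record categories."""
    record = StructuredRecord()
    for category, pattern in CATEGORY_PATTERNS.items():
        found = pattern.findall(span)
        if found:
            record.categories[category] = found
    return record


document = (
    "Welcome to our site. We sell garden tools.\n\n"
    "Contact: Jane Doe, email jane.doe@example.com, phone +61 2 5550 1234.\n\n"
    "Copyright notice."
)
record = process_span(identify_span(document))
print(record.categories)
# {'email': ['jane.doe@example.com'], 'phone': ['+61 2 5550 1234']}
```

Restricting extraction to the identified span means the per-category extractors never see unrelated parts of the page, which is the point of separating the two steps.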

27 Claims

1. A method for extracting a structured record from a document, said structured record including information related to a predetermined subject matter, said information to be organized into categories within said structured record, said method comprising the steps of:
- identifying a span of text in said document according to criteria associated with said predetermined subject matter; and
  
  processing said span of text to extract at least one text element associated with at least one of said categories of said structured record from said document.
- - 2. The method for extracting a structured record from a document as claimed in claim 1, wherein said step of processing said span of text further comprises:
    - identifying an entity within said span of text, said entity including at least one entity text element, wherein said entity is associated with at least one of said categories of said structured record.
  - 3. The method for extracting a structured record from a document as claimed in claim 2, wherein said step of processing said span of text further comprises:
    - identifying a sub-entity within said entity, said sub-entity including at least one sub-entity text element, wherein said sub-entity is associated with at least one of said categories of said structured record.
  - 4. The method for extracting a structured record from a document as claimed in claim 3, wherein said step of processing said span of text further comprises:
    - where a plurality of said entity are identified, associating said entities within said span of text, wherein said step of associating said entities includes linking related entities together for storage in a category of said structured record.
  - 5. The method for extracting a structured record from a document as claimed in claim 4, wherein said step of processing said span of text further comprises:
    - normalizing said entities within said span of text, wherein said step of normalizing said entities includes determining whether two or more identified entities refer to the same entity that is to be organized in a category of said structured record.
  - 6. The method for extracting a structured record from a document as claimed in claim 1, wherein said step of identifying a span of text further comprises:
    - dividing said document into a plurality of text nodes, said text nodes each including at least one text element;
      
      generating a text node feature vector for each of said text nodes, said text node feature vector generated in part according to features relevant to said criteria, thereby generating a text node feature vector sequence for said document; and
      
      calculating a text node label sequence corresponding to said text node feature vector sequence, said text node label sequence calculated by a predictive algorithm adapted to generate said text node label sequence from an input text node feature vector sequence, wherein said labels forming said text node label sequence identify a given text node as being associated with said predetermined subject matter, thereby identifying said span of text.
  - 7. The method for extracting a structured record from a document as claimed in claim 6, wherein said predictive model is a classifier based on a Markov model trained on labeled text node feature vector sequences.
  - 8. The method for extracting a structured record from a document as claimed in claim 6, wherein said predictive model is a hand tuned decision tree based procedure.
  - 9. The method for extracting a structured record from a document as claimed in claim 6, wherein said step of processing said span of text further comprises:
    - identifying an entity within said span of text, said entity including at least one entity text element, wherein said entity is associated with at least one of said categories of said structured record.
  - 10. The method for extracting a structured record from a document as claimed in claim 9, wherein said step of identifying an entity within said span of text further comprises:
    - dividing said span of text into a plurality of text elements;
      
      generating an entity feature vector for each of said text elements, said entity feature vector generated in part according to features relevant to said criteria, thereby generating an entity feature vector sequence for said span of text; and
      
      calculating an entity label sequence corresponding to said entity feature vector sequence, said entity label sequence calculated by a predictive algorithm adapted to generate said entity label sequence from an input entity feature vector sequence, wherein said labels forming said entity label sequence identify a given entity text element as being associated with said entity.
  - 11. The method for extracting a structured record from a document as claimed in claim 10, wherein said predictive model is a classifier based on a Markov model trained on labeled entity feature vector sequences.
  - 12. The method for extracting a structured record from a document as claimed in claim 10, wherein said predictive model is a hand tuned decision tree based procedure.
  - 13. The method for extracting a structured record from a document as claimed in claim 10, wherein said step of processing said span of text further comprises:
    - identifying a sub-entity within said entity, said sub-entity including at least one sub-entity text element, wherein said sub-entity is associated with at least one of said categories of said structured record.
  - 14. The method for extracting a structured record from a document as claimed in claim 13, wherein said step of identifying a sub-entity within said entity further comprises:
    - dividing said entity into a plurality of text elements;
      
      generating a sub-entity feature vector for each of said text elements, said sub-entity feature vector generated in part according to features relevant to said criteria, thereby generating a sub-entity feature vector sequence for said entity; and
      
      calculating a sub-entity label sequence corresponding to said sub-entity feature vector sequence, said sub-entity label sequence calculated by a predictive algorithm adapted to generate said sub-entity label sequence from an input entity feature vector sequence, wherein said labels forming said sub-entity label sequence identify a given sub-entity text element as being associated with said sub-entity.
  - 15. The method for extracting a structured record from a document as claimed in claim 14, wherein said predictive model is a classifier based on a Markov model trained on labeled sub-entity feature vector sequences.
  - 16. The method for extracting a structured record from a document as claimed in claim 14, wherein said predictive model is a hand tuned decision tree based procedure.
  - 17. The method for extracting a structured record from a document as claimed in claim 14, wherein said step of processing said span of text further comprises:
    - where a plurality of said entity are identified, associating said entities within said span of text, wherein said step of associating said entities includes linking related entities together for storage in a category of said structured record.
  - 18. The method for extracting a structured record from a document as claimed in claim 17, wherein said step of associating said entities within said span of text further comprises:
    - forming pairs of entities to determine if they are to be associated;
      
      generating an entity pair feature vector for each pair of entities, said entity pair feature vector generated in part according to features relevant to associations between entity pairs;
      
      calculating an association label based on said entity pair feature vector to determine if a given pair of entities are linked, said association label calculated by a predictive algorithm adapted to generate said association label from an input entity pair feature vector.
  - 19. The method for extracting a structured record from a document as claimed in claim 18, wherein said step of forming pairs of entities to determine if they are to be associated further comprises:
    - forming only those pairs of entities which are within a predetermined number of text elements from each other.
  - 20. The method for extracting a structured record from a document as claimed in claim 18, wherein said step of processing said span of text further comprises:
    - normalizing said entities within said span of text, wherein said step of normalizing said entities includes determining whether two or more identified entities refer to the same entity that is to be organized in a category of said structured record.
  - 21. The method for extracting a structured record from a document as claimed in claim 20, wherein said step of normalizing said entities within said span of text further comprises:
    - selecting those associated entities sharing a predetermined number of features; and
      
      normalizing these associated entities to refer to said same entity.
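
The association and normalization steps of claims 18, 19 and 21 can be sketched as follows. This is only an illustration under stated assumptions: the `Entity` fields, the two pair features, the hand-written rule used in place of a trained predictive algorithm, and the use of shared lowercase tokens as the "features" compared during normalization are all hypothetical choices, not details given in the claims.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Entity:
    text: str   # surface text of the entity
    kind: str   # hypothetical entity type, e.g. "name" or "title"
    start: int  # index of the entity's first text element in the span
    end: int    # index one past its last text element


def candidate_pairs(entities, max_gap):
    """Form only pairs that lie within max_gap text elements of each other."""
    ordered = sorted(entities, key=lambda e: e.start)
    for a, b in combinations(ordered, 2):
        gap = b.start - a.end
        if 0 <= gap <= max_gap:  # overlapping entities are skipped
            yield a, b


def pair_features(a, b):
    """Entity pair feature vector: [gap in text elements, kinds differ?]."""
    return [b.start - a.end, int(a.kind != b.kind)]


def association_label(features):
    """Stand-in for the predictive algorithm: a fixed rule, not a trained model."""
    gap, kinds_differ = features
    return kinds_differ == 1 and gap <= 1


def same_entity(a, b, min_shared=2):
    """Normalization test: same kind and at least min_shared common tokens."""
    shared = set(a.text.lower().split()) & set(b.text.lower().split())
    return a.kind == b.kind and len(shared) >= min_shared


# Text elements: Jane Doe , Chief Executive Officer . Contact Jane Doe
e1 = Entity("Jane Doe", "name", 0, 2)
e2 = Entity("Chief Executive Officer", "title", 3, 6)
e3 = Entity("Jane Doe", "name", 8, 10)

pairs = list(candidate_pairs([e1, e2, e3], max_gap=3))
linked = [(a, b) for a, b in pairs if association_label(pair_features(a, b))]

print([(a.text, b.text) for a, b in linked])
print(same_entity(e1, e3))
```

Limiting candidate pairs to a window keeps the number of pairs roughly linear in the number of entities instead of quadratic, which is a plausible motivation for the restriction in claim 19.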

22. A method for training a classifier to classify for text based elements in a collection of text based elements according to a characteristic, said method comprising the steps of:
- forming a feature vector corresponding to each text based element;
  
  forming a sequence of said feature vectors corresponding to each of said text based elements in said collection of text based elements;
  
  labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and
  
  training a predictive algorithm based on said sequence of labels and said corresponding sequence of said feature vectors, said algorithm trained to generate new label sequences from an input sequence of feature vectors thereby classifying text based elements that form said input sequence of feature vectors.
- - 23. The method for training a classifier to classify for text based elements in a collection of text based elements according to claim 22, wherein said text based element is a span of text elements and said collection of text based elements is a document.
  - 24. The method for training a classifier to classify for text based elements in a collection of text based elements according to claim 22, wherein said text based element is an entity comprising at least one text element and said collection of entities forms a span of text elements.
  - 25. The method for training a classifier to classify for text based elements in a collection of text based elements according to claim 22, wherein said text based element is a sub-entity comprising at least one text element and said collection of text based elements is an entity.
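
The training method of claim 22 can be illustrated with a small sketch. It assumes one particular reading that the claims do not spell out: a first-order hidden Markov model whose parameters are estimated by counting over labeled sequences with add-one smoothing, and whose new label sequences are produced by Viterbi decoding. The binary features (contains a contact keyword, contains a digit) and the `IN`/`OUT` labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(sequences, smoothing=1.0):
    """Estimate HMM log-probabilities by counting over labeled sequences.

    Each sequence is a list of (feature_vector, label) pairs. Feature vectors
    are tuples so they can act as discrete observation symbols.
    Assumes smoothing > 0.
    """
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    labels, symbols = set(), set()
    for seq in sequences:
        prev = None
        for features, label in seq:
            labels.add(label)
            symbols.add(features)
            emit[label][features] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    labels = sorted(labels)
    n_sym = len(symbols) + 1  # one extra slot for unseen feature vectors

    def log_prob(counter, key, n_outcomes):
        total = sum(counter.values())
        return math.log((counter[key] + smoothing)
                        / (total + smoothing * n_outcomes))

    return {
        "labels": labels,
        "start": {l: log_prob(start, l, len(labels)) for l in labels},
        "trans": {p: {l: log_prob(trans[p], l, len(labels)) for l in labels}
                  for p in labels},
        "emit": lambda l, x: log_prob(emit[l], x, n_sym),
    }


def decode(model, features_seq):
    """Viterbi decoding: most probable label sequence for the input vectors."""
    labels = model["labels"]
    score = {l: model["start"][l] + model["emit"](l, features_seq[0])
             for l in labels}
    back = []
    for x in features_seq[1:]:
        prev_best, new_score = {}, {}
        for l in labels:
            p = max(labels, key=lambda q: score[q] + model["trans"][q][l])
            prev_best[l] = p
            new_score[l] = score[p] + model["trans"][p][l] + model["emit"](l, x)
        back.append(prev_best)
        score = new_score
    path = [max(labels, key=score.get)]
    for prev_best in reversed(back):
        path.append(prev_best[path[-1]])
    return path[::-1]


# Feature vector per text node: (has contact keyword, has digit).
training = [
    [((0, 0), "OUT"), ((1, 0), "IN"), ((0, 1), "IN"), ((0, 0), "OUT")],
    [((0, 0), "OUT"), ((0, 0), "OUT"), ((1, 1), "IN"), ((0, 1), "IN"),
     ((0, 0), "OUT")],
]
model = train(training)
predicted = decode(model, [(0, 0), (1, 0), (0, 1), (0, 0)])
span = [i for i, label in enumerate(predicted) if label == "IN"]
print(predicted, span)
# ['OUT', 'IN', 'IN', 'OUT'] [1, 2]
```

The same two functions would apply at each level named in claims 23 to 25 (text nodes in a document, entities in a span, sub-entities in an entity); only the feature vectors and label sets would change.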

26. An apparatus adapted for extracting a structured record from a document, said structured record including information related to a predetermined subject matter, said information to be organized into categories within said structured record, said apparatus comprising:
- processor means adapted to operate in accordance with a predetermined instruction set;
  
  said apparatus in conjunction with said instruction set, being adapted to perform the method of;
  
  identifying a span of text in said document according to criteria associated with said predetermined subject matter; and
  
  processing said span of text to extract at least one text element associated with at least one of said categories of said structured record from said document.

27. An apparatus adapted to train a classifier to classify for text based elements in a collection of text based elements according to a characteristic, said apparatus comprising:
- processor means adapted to operate in accordance with a predetermined instruction set;
  
  said apparatus in conjunction with said instruction set, being adapted to perform the method of;
  
  forming a feature vector corresponding to each text based element;
  
  forming a sequence of said feature vectors corresponding to each of said text based elements in said collection of text based elements;
  
  labeling each text based element according to said characteristic thereby forming a sequence of labels corresponding to said sequence of feature vectors; and
  
  training a predictive algorithm based on said sequence of labels and said corresponding sequence of said feature vectors, said algorithm trained to generate new label sequences from an input sequence of feature vectors thereby classifying text based elements that form said input sequence of feature vectors.

Current Assignee: Panscient Incorporated
Original Assignee: Panscient Incorporated
Inventors: Baxter, Jonathan; Seymore, Kristie
Application Number: US11/291,740
Publication Number: US 20060123000A1
US Class Current: 1/1
CPC Class Codes:
- G06F 16/30   of unstructured textual dat...
- G06F 16/86   Mapping to a database
- G06F 2216/03   Data mining
