HIERARCHICAL CONDITIONAL RANDOM FIELDS FOR WEB EXTRACTION
First Claim
1. A method for labeling observations, the method comprising:
- receiving observations having hierarchical relationships; and
- determining a labeling for the observations using a conditional random fields technique that factors in the hierarchical relationships.
Abstract
A method and system for labeling object information of an information page is provided. A labeling system identifies an object record of an information page based on the labeling of object elements within an object record and labels object elements based on the identification of an object record that contains the object elements. To identify the records and label the elements, the labeling system generates a hierarchical representation of blocks of an information page. The labeling system identifies records and elements within the records by propagating probability-related information of record labels and element labels through the hierarchy of the blocks. The labeling system generates a feature vector for each block to represent the block and calculates a probability of a label for a block being correct based on a score derived from the feature vectors associated with related blocks. The labeling system searches for the labeling of records and elements that has the highest probability of being correct.
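The search for the highest-probability labeling described in the abstract can be illustrated with a max-product pass over a block tree. The following is a minimal sketch under stated assumptions, not the patented implementation: the label set, the `Block` class, the feature names, and the weight tables `w_node` and `w_edge` are all hypothetical, and the score of a labeling is assumed to be a sum of per-block feature scores and parent-child label compatibility scores.

```python
from dataclasses import dataclass, field

LABELS = ["record", "name", "price"]  # hypothetical label set


@dataclass
class Block:
    name: str
    features: dict                      # hypothetical feature vector for the block
    children: list = field(default_factory=list)


def node_score(block, label, w_node):
    # Linear score of assigning `label` to `block`, derived from its feature vector.
    return sum(w_node.get((label, f), 0.0) * v for f, v in block.features.items())


def map_labeling(root, w_node, w_edge):
    """Return (best score, {block name: label}) for the tree rooted at `root`."""
    best = {}  # block name -> {label: best score of the subtree under that label}
    back = {}  # block name -> {label: best label for each child, in order}

    def upward(block):
        # Bottom-up pass: children are scored before their parent.
        for child in block.children:
            upward(child)
        best[block.name], back[block.name] = {}, {}
        for lab in LABELS:
            total = node_score(block, lab, w_node)
            picks = []
            for child in block.children:
                cand = {c: best[child.name][c] + w_edge.get((lab, c), 0.0)
                        for c in LABELS}
                c_star = max(cand, key=cand.get)
                total += cand[c_star]
                picks.append(c_star)
            best[block.name][lab] = total
            back[block.name][lab] = picks

    def downward(block, lab, out):
        # Top-down pass: follow the stored choices to recover the labeling.
        out[block.name] = lab
        for child, c_lab in zip(block.children, back[block.name][lab]):
            downward(child, c_lab, out)

    upward(root)
    root_lab = max(best[root.name], key=best[root.name].get)
    labels = {}
    downward(root, root_lab, labels)
    return best[root.name][root_lab], labels


# Toy page fragment: one container block holding a text block and a numeric block.
page = Block("row", {"is_container": 1.0}, [
    Block("title", {"is_text": 1.0}),
    Block("cost", {"has_digit": 1.0}),
])
w_node = {("record", "is_container"): 2.0,
          ("name", "is_text"): 1.0,
          ("price", "has_digit"): 1.5}
w_edge = {("record", "name"): 0.5, ("record", "price"): 0.5}
score, labels = map_labeling(page, w_node, w_edge)
```

Because the record label and the element labels are chosen in a single pass over the tree, the label of the container block and the labels of the blocks inside it influence each other, which is the joint behavior the abstract describes. Block names are assumed to be unique in this sketch.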
20 Claims
1. A method for labeling observations, the method comprising:
- receiving observations having hierarchical relationships; and
- determining a labeling for the observations using a conditional random fields technique that factors in the hierarchical relationships.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A system for identifying object records and object elements of a page, comprising:
- a component that identifies blocks of the page; and
- a component that labels object records and object elements based on a probability of a label for a block being correct that is based on probabilities of labels for blocks within the block being correct.
Dependent claims: 11, 12, 13, 14, 15, 16
17. A computer-readable medium containing instructions for controlling a computer system to identify object records and object elements of a web page, by a method comprising:
- providing a hierarchical representation of blocks of the web page, each block representing an object record or an object element; and
- applying a hierarchical conditional random fields technique to jointly identify a set of record labels and element labels for the blocks based on the hierarchical relationship of the blocks of the web page.
Dependent claims: 18, 19, 20
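The idea in claim 10, that the probability of a label for a block depends on the probabilities of labels for the blocks within it, can be illustrated with an upward sum-product pass over the block tree. This is a hedged sketch only: the two-label set, the `(node_scores, children)` tuple representation, and the `edge` compatibility table are assumptions made for illustration and do not come from the claims.

```python
import math

LABELS = ("record", "element")  # hypothetical two-label set


def logsumexp(xs):
    # Numerically stable log(sum(exp(x) for x in xs)).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def upward(block, edge):
    """block = (node_scores, children).

    Returns {label: log of the summed exponentiated scores of every labeling
    of the subtree in which this block takes that label}.
    """
    node_scores, children = block
    child_msgs = [upward(child, edge) for child in children]
    out = {}
    for lab in LABELS:
        total = node_scores[lab]
        for msg in child_msgs:
            # Sum over every label the contained block could take.
            total += logsumexp([msg[c] + edge[(lab, c)] for c in LABELS])
        out[lab] = total
    return out


def root_marginals(root, edge):
    # Normalize the root beliefs into a probability for each root label.
    beliefs = upward(root, edge)
    log_z = logsumexp(list(beliefs.values()))
    return {lab: math.exp(b - log_z) for lab, b in beliefs.items()}


# Toy example: an outer block containing two inner blocks.
leaf_a = ({"record": 0.0, "element": 1.0}, [])
leaf_b = ({"record": 0.2, "element": 0.7}, [])
outer = ({"record": 1.0, "element": 0.0}, [leaf_a, leaf_b])
edge = {("record", "record"): -1.0, ("record", "element"): 0.5,
        ("element", "record"): 0.0, ("element", "element"): 0.0}
probs = root_marginals(outer, edge)
```

In this toy example the outer block is more likely to be a record because its inner blocks look like elements and the assumed `edge` table rewards a record that contains elements.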
Specification