CLASSIFYING FUNCTIONS OF WEB BLOCKS BASED ON LINGUISTIC FEATURES
First Claim
1. A method in a computing device for classifying a block of a document based on its function, the method comprising:
identifying (203) blocks of training documents;
for each identified block, receiving (207) a classification label for the identified block indicating its function; and
generating (206) a feature vector for the identified block, the feature vector including a linguistic feature;
training a classifier using the feature vectors and classification labels; and
classifying a block of a document based on its function by applying the trained classifier to a feature vector for the block.
Abstract
A classification system trains a classifier to classify blocks of a web page according to their function. To train the classifier, the classification system identifies the blocks of training web pages, generates a feature vector for each block that includes a linguistic feature, and inputs a classification label for each block. The classification system learns the coefficients of the classifier using any of a variety of machine learning techniques. The classification system can then use the classifier to classify blocks of web pages.
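The training and classification steps summarized above can be sketched as follows. This is a minimal illustration only: the abstract says the coefficients may be learned with any of a variety of machine learning techniques, and logistic regression trained by stochastic gradient descent is swapped in here as one assumed choice. The function names, the two toy features (share of content words, link density), and the training data are all hypothetical and do not come from the patent.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(coeffs: Sequence[float], x: Sequence[float]) -> float:
    """Weighted sum of the features plus the bias stored in coeffs[-1]."""
    return sum(w * xi for w, xi in zip(coeffs, x)) + coeffs[-1]


def train_classifier(vectors: List[List[float]], labels: List[int],
                     epochs: int = 200, lr: float = 0.5) -> List[float]:
    """Learn coefficients (one weight per feature, then a bias) from
    feature vectors and 0/1 classification labels."""
    coeffs = [0.0] * (len(vectors[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            err = y - sigmoid(score(coeffs, x))
            for i, xi in enumerate(x):
                coeffs[i] += lr * err * xi
            coeffs[-1] += lr * err
    return coeffs


def classify_block(coeffs: Sequence[float], x: Sequence[float]) -> int:
    """Apply the trained classifier to one block's feature vector."""
    return 1 if sigmoid(score(coeffs, x)) >= 0.5 else 0


# Hypothetical training data: [share of content words, link density].
# Label 1 = block carries information, 0 = it does not (e.g. navigation).
train_vectors = [[0.7, 0.1], [0.6, 0.2], [0.8, 0.05],
                 [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]]
train_labels = [1, 1, 1, 0, 0, 0]

coeffs = train_classifier(train_vectors, train_labels)
article_like = classify_block(coeffs, [0.75, 0.1])
menu_like = classify_block(coeffs, [0.15, 0.85])
```

Any learner that produces a set of stored coefficients could be substituted for `train_classifier` without changing the surrounding steps.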
20 Claims
1. A method in a computing device for classifying a block of a document based on its function, the method comprising:
identifying (203) blocks of training documents;
for each identified block, receiving (207) a classification label for the identified block indicating its function; and
generating (206) a feature vector for the identified block, the feature vector including a linguistic feature;
training a classifier using the feature vectors and classification labels; and
classifying a block of a document based on its function by applying the trained classifier to a feature vector for the block.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computing device generating a classifier for classifying blocks of web pages into functional classifications, comprising:
a training data store that includes training web pages, the web pages having blocks;
a block identification component that identifies blocks within a web document;
a feature generation component that generates a feature vector for a block of a web page, the feature vector including layout features and linguistic features;
a labeler component that inputs a classification label for each block of each training web page; and
a component that learns coefficients of a classifier using the feature vectors of the training web page and the label classifications and stores the coefficients in a classifier coefficients store.
Dependent claims: 13, 14, 15, 16, 17, 18
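As an illustration of the feature generation component recited in claim 12, the sketch below builds one feature vector from layout features and linguistic features. The claim does not enumerate specific features at this point, so every feature here (normalized position and size, word count, average word length, share of capitalized words) is an assumption, as are the `Block` type and the function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical block record: its text plus its rendered rectangle."""
    text: str
    x: float
    y: float
    width: float
    height: float


def layout_features(block: Block, page_width: float,
                    page_height: float) -> List[float]:
    """Position and size of the block, normalized by the page dimensions."""
    return [block.x / page_width, block.y / page_height,
            block.width / page_width, block.height / page_height]


def linguistic_features(block: Block) -> List[float]:
    """Simple text statistics: word count, mean word length, and the
    share of words that start with an upper-case letter."""
    words = block.text.split()
    if not words:
        return [0.0, 0.0, 0.0]
    n = len(words)
    avg_len = sum(len(w) for w in words) / n
    capitalized = sum(1 for w in words if w[0].isupper()) / n
    return [float(n), avg_len, capitalized]


def feature_vector(block: Block, page_width: float,
                   page_height: float) -> List[float]:
    """Concatenate layout and linguistic features into one vector."""
    return (layout_features(block, page_width, page_height)
            + linguistic_features(block))


# A navigation bar spanning the top of a hypothetical 1000 x 2000 page.
nav = Block(text="Home About Contact", x=0, y=0, width=1000, height=50)
nav_vector = feature_vector(nav, page_width=1000, page_height=2000)
```

Keeping the two feature groups in separate functions makes it straightforward to train with or without the layout features and compare the results.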
19. A computer-readable storage medium encoded with instructions for controlling a computing device to classify blocks of web pages based on their function, by a method comprising:
identifying blocks of training web pages;
for each identified block, receiving a classification label for the identified block, the classifications including information and non-information; and
generating a feature vector for the identified block, the feature vector including a linguistic feature and a layout feature, the linguistic feature based on parts of speech of words within the text of the block;
training a classifier using the feature vectors and classification labels; and
classifying a block of a web page as information or non-information by applying the trained classifier to a feature vector for the block.
Dependent claims: 20
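Claim 19 bases the linguistic feature on the parts of speech of the words in a block. One possible reading, sketched below, is the share of a block's words that fall under each part-of-speech tag. The claim does not name a tagger or a tag set, so the tiny lookup table that stands in for a real part-of-speech tagger, the tag names, and the function name are all hypothetical.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stand-in for a real part-of-speech tagger: a tiny lookup
# table. Words not listed here are tagged "OTHER".
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "storm": "NOUN", "city": "NOUN", "streets": "NOUN",
    "hit": "VERB", "flooded": "VERB",
}

# Fixed tag order so every block yields a vector of the same length.
TAGS = ("NOUN", "VERB", "DET", "OTHER")


def pos_feature(text: str) -> List[float]:
    """Return the fraction of words carrying each tag in TAGS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return [0.0] * len(TAGS)
    counts = Counter(TOY_LEXICON.get(w, "OTHER") for w in words)
    return [counts[tag] / len(words) for tag in TAGS]


sentence_vector = pos_feature("The storm hit the city")
menu_vector = pos_feature("Home | About | Contact")
```

With this toy lexicon, running prose produces a mix of nouns, verbs, and determiners, while a menu-style block falls entirely under "OTHER". That difference is the kind of signal a classifier could use to separate information blocks from non-information blocks.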
Specification