SYSTEM FOR AUTOMATICALLY EXTRACTING BY-LINE INFORMATION
Abstract
A by-line extraction system detects a set of potential headlines from a title meta-tag of a crawled document, selects a candidate headline from the set of potential headlines, and extracts the by-line information from the document using the location of the selected candidate headline. The system constructs the set of potential headlines based on the title meta-tag. The system selects a candidate headline by evaluating the set of potential headlines in order of the lengths of the potential headlines. The system extracts the by-line information from the document by using the location of the selected candidate headline to extract a string representing a date, a name, or a source located within a minimum distance from the location of the selected candidate headline.
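By way of non-limiting illustration, the headline selection and by-line extraction summarized above may be sketched as follows. The sketch rests on several assumptions that are not stated in this record: that potential headlines are formed by splitting the title on common delimiters, that candidates are tried longest first, that "minimum distance" means the nearest match inside a fixed character window, and that names and dates follow simple English patterns. All function names, patterns, and the `window` parameter are hypothetical.

```python
import re

# Assumption: a title often joins the headline and the site name with " | ", " - " or " : ".
_TITLE_DELIMS = re.compile(r"\s+[|\-:]\s+")
# Assumption: simplified English patterns for an author name and a date.
_BYLINE = re.compile(r"\bBy ((?:[A-Z][\w.'-]*)(?: [A-Z][\w.'-]*){0,3})")
_DATE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?\s+\d{1,2},\s+\d{4}\b"
)


def potential_headlines(title):
    """Return the full title plus its delimiter-separated segments, without duplicates."""
    parts = [title.strip()] + [p.strip() for p in _TITLE_DELIMS.split(title)]
    seen, out = set(), []
    for part in parts:
        if part and part not in seen:
            seen.add(part)
            out.append(part)
    return out


def select_headline(candidates, text):
    """Try candidates longest first; return (headline, offset) for the first one found in text."""
    for cand in sorted(candidates, key=len, reverse=True):
        pos = text.find(cand)
        if pos != -1:
            return cand, pos
    return None


def nearest_match(pattern, text, anchor, window=300):
    """Return the matched string closest to `anchor` within `window` characters, else None."""
    best = None
    for m in pattern.finditer(text):
        dist = abs(m.start() - anchor)
        if dist <= window and (best is None or dist < best[0]):
            # Use the first capturing group when the pattern has one, else the whole match.
            best = (dist, m.group(m.lastindex or 0))
    return best[1] if best else None


def extract_byline(title, text):
    """Locate the headline in the de-tagged text, then pick the nearest name and date."""
    picked = select_headline(potential_headlines(title), text)
    if picked is None:
        return None
    headline, pos = picked
    return {
        "headline": headline,
        "author": nearest_match(_BYLINE, text, pos),
        "date": nearest_match(_DATE, text, pos),
    }
```

Calling `extract_byline(title, text)` with the page title and the de-tagged text returns a dictionary holding the selected headline and the nearest author and date strings, or `None` when no candidate headline occurs in the text.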
22 Claims
1. A processor-implemented system for automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
a detagging module for removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
a headline detection module for detecting a set of potential headlines of the document from a title meta-tag of the crawled document;
a headline evaluation module for selecting a candidate headline from the set of potential headlines; and
a by-line extraction module for extracting the by-line information from the de-tagged version of said crawled document using the location of the selected candidate headline.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
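By way of non-limiting illustration, the removal of formatting tags and the reading of the title recited in claim 1 may be sketched with the `html.parser` module of the Python standard library. The class name `Detagger`, the function `detag`, the decision to skip `script` and `style` content, and the choice to place each text fragment on its own line are assumptions made for this sketch and are not taken from the claims.

```python
from html.parser import HTMLParser


class Detagger(HTMLParser):
    """Collects visible text and the <title> content; skips script and style bodies."""

    _SKIP = {"script", "style"}  # assumption: these elements carry no article text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def detag(html):
    """Return (title, de_tagged_text), with one text fragment per line."""
    parser = Detagger()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip(), "\n".join(parser.text_parts)
```

The returned title can serve as the source of potential headlines, and the returned text as the de-tagged version in which the selected candidate headline is located.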
9. A computer program product having program codes stored on a computer-usable medium for automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
a program code for removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
a program code for detecting a set of potential headlines of the crawled document from a title meta-tag of the crawled document;
a program code for selecting a candidate headline from the set of potential headlines; and
a program code for extracting the by-line information from the de-tagged version of the crawled document using the location of the selected candidate headline.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A processor-implemented service for automatically extracting by-line information in a crawled document, wherein said crawled document contains a single news article, comprising:
receiving the document;
invoking an autonomic hardware configuration utility, wherein the document is made available to the autonomic hardware configuration utility for automatically extracting by-line information in the document by:
removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
detecting a set of potential headlines of the crawled document from a title meta-tag of the document;
selecting a candidate headline from the set of potential headlines; and
extracting the by-line information from the de-tagged version of said crawled document using the location of the selected candidate headline.
Dependent claims: 18, 19, 20, 21, 22
Specification