Method for automatically extracting by-line information
First Claim
1. A processor-implemented method of automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
inputting the single news article wherein the single news article comprises a single title meta-tag;
removing formatting tags from the single news article to create a de-tagged version of the single news article;
detecting a set of potential headlines of the single news article from among the substrings of the single title meta-tag of the single news article and their bi-grams and n-grams, the detecting further comprising constructing the set of potential headlines based on the title meta-tag and splitting the title meta-tag at all punctuation marks in the title meta-tag, resulting in a set of sub-strings of the title meta-tag;
adding any of a plurality of bi-grams of the sub-strings and a plurality of n-grams of the sub-strings to the set of potential headlines;
selecting a candidate headline from the set of potential headlines; and
extracting the by-line information from the de-tagged version of the single news article using the location of the selected candidate headline.
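The de-tagging and headline-detection steps recited above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `strip_tags` and `potential_headlines` are hypothetical, a punctuation mark is assumed to be any character that is neither a word character nor whitespace, and bi-grams and n-grams are assumed to be runs of adjacent sub-strings kept together with their original separators.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and discards every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: text inside <script>/<style> also arrives here; a real
        # crawler would need to filter it. Kept simple for illustration.
        self.parts.append(data)


def strip_tags(html):
    """Return a de-tagged version of the article with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def potential_headlines(title):
    """Build the set of potential headlines from a title meta-tag string."""
    # Assumption: punctuation = anything that is not a word character or
    # whitespace (so '-', '|', ':' split the title; '_' does not).
    spans = []
    for m in re.finditer(r"[\w\s]+", title):
        text = m.group()
        if not text.strip():
            continue  # run of whitespace between two punctuation marks
        start = m.start() + (len(text) - len(text.lstrip()))
        end = m.start() + len(text.rstrip())
        spans.append((start, end))

    # Single sub-strings, bi-grams, and longer n-grams of adjacent
    # sub-strings, each taken as a slice of the original title.
    candidates = set()
    for i in range(len(spans)):
        for j in range(i, len(spans)):
            candidates.add(title[spans[i][0]:spans[j][1]])
    return candidates


if __name__ == "__main__":
    for c in sorted(potential_headlines("Storm hits coast - Daily News | World")):
        print(c)
```

Slicing the original title, rather than re-joining the pieces with spaces, keeps each multi-part candidate identical to how it would appear verbatim in the article text; that is a design choice of this sketch, not something the claim specifies.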
Abstract
A by-line extraction method detects a set of potential headlines from a title meta-tag of a crawled document, selects a candidate headline from the set of potential headlines, and extracts the by-line information from the document using the location of the selected candidate headline. The method constructs the set of potential headlines based on the title meta-tag. The method selects a candidate headline by evaluating the set of potential headlines in order of the lengths of the potential headlines. The method extracts the by-line information from the document by using the location of the selected candidate headline to extract a string representing a date, a name, or a source located within a minimum distance from the location of the potential headline.
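The selection and extraction steps summarized in the abstract might look like the following Python sketch. It rests on several assumptions that the abstract does not settle: candidates are tried longest first, "minimum distance" is read as the pattern match whose start is closest to the headline position, and the `BYLINE` and `DATE` patterns are deliberately simple, hypothetical examples (a two-word capitalized name after "By", and an English "Month D, YYYY" date).

```python
import re

# Hypothetical, intentionally narrow patterns for illustration only.
BYLINE = re.compile(r"\bBy\s+([A-Z][a-z]+\s+[A-Z][a-z]+)")
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def select_headline(candidates, text):
    """Return (headline, position) for the first candidate found in text.

    Assumption: candidates are evaluated from longest to shortest; ties are
    broken alphabetically so the result is deterministic.
    """
    for cand in sorted(candidates, key=lambda c: (-len(c), c)):
        pos = text.find(cand)
        if pos != -1:
            return cand, pos
    return None


def _nearest(pattern, text, pos):
    """Return the match of pattern whose start is closest to pos, if any."""
    matches = list(pattern.finditer(text))
    if not matches:
        return None
    return min(matches, key=lambda m: abs(m.start() - pos))


def extract_byline(text, pos):
    """Extract the name and date strings located nearest the headline."""
    name = _nearest(BYLINE, text, pos)
    date = _nearest(DATE, text, pos)
    return {
        "name": name.group(1) if name else None,
        "date": date.group(0) if date else None,
    }


if __name__ == "__main__":
    article = (
        "Daily News Storm hits coast By Jane Doe March 3, 2009 "
        "Heavy rain fell overnight. Related: Floods ease "
        "By John Roe February 1, 2009"
    )
    candidates = {"Storm hits coast", "Daily News",
                  "Storm hits coast - Daily News"}
    headline, pos = select_headline(candidates, article)
    print(headline, pos, extract_byline(article, pos))
```

Anchoring the search at the headline position is what lets the sketch prefer the article's own by-line over a second by-line belonging to a related-story teaser later on the page.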
18 Citations
5 Claims
Specification