Method for automatically extracting by-line information

US 7,464,078 B2
Filed: 10/25/2005
Issued: 12/09/2008
Est. Priority Date: 10/25/2005
Status: Expired due to Fees

Abstract

A by-line extraction method detects a set of potential headlines from a title meta-tag of a crawled document, selects a candidate headline from the set of potential headlines, and extracts the by-line information from the document using the location of the selected candidate headline. The method constructs the set of potential headlines based on the title meta-tag. The method selects a candidate headline by evaluating the set of potential headlines in order of the lengths of the potential headlines. The method extracts the by-line information from the document by using the location of the selected candidate headline to extract a string representing a date, a name, or a source located within a minimum distance from the location of the potential headline.
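The construction of the set of potential headlines can be illustrated with a minimal sketch. Claim 1 adds that the title is split at every punctuation mark and that bi-grams and n-grams of the resulting sub-strings are added to the set. The function name `potential_headlines`, the choice to rebuild n-grams as spans of the original title (so the separating punctuation is preserved), and the longest-first output order are assumptions made for illustration; the record does not specify them.

```python
import re
import string

# Runs of characters that contain no punctuation mark (assumption: ASCII
# punctuation as listed in string.punctuation).
_SEGMENT = re.compile(f"[^{re.escape(string.punctuation)}]+")


def potential_headlines(title: str) -> list[str]:
    """Return candidate headlines derived from a title meta-tag string."""
    # Positions of the sub-strings obtained by splitting at punctuation.
    spans = [(m.start(), m.end()) for m in _SEGMENT.finditer(title)
             if m.group().strip()]
    candidates = set()
    for i in range(len(spans)):
        for j in range(i, len(spans)):
            # j == i: a single sub-string; j == i + 1: a bi-gram;
            # larger gaps: n-grams of consecutive sub-strings.
            candidates.add(title[spans[i][0]:spans[j][1]].strip())
    # Longest first (an assumption), ties broken alphabetically.
    return sorted(candidates, key=lambda s: (-len(s), s))
```

For a hypothetical title such as "Storm hits coast | Example News: Europe", this yields the three sub-strings, the two bi-grams, and the full tri-gram.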

18 Citations

5 Claims

1. A processor-implemented method of automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
- inputting the single news article wherein the single news article comprises a single title meta-tag;
  
  removing formatting tags from the single news article to create a de-tagged version of the single news article;
  
  detecting a set of potential headlines of the single news article from among the substrings of the single title meta-tag of the single news article and their bi-grams and n-grams, the detecting further comprising constructing the set of potential headlines based on the title meta-tag and splitting the title meta-tag at all punctuation marks in the title meta-tag, resulting in a set of sub-strings of the title meta-tag;
  
  adding any of a plurality of bi-grams of the sub-strings and a plurality of n-grams of the sub-strings to the set of potential headlines;
  
  selecting a candidate headline from the set of potential headlines; and
  
  extracting the by-line information from the de-tagged version of the single news article using the location of the selected candidate headline.
- Dependent claims (2, 3, 4, 5)
  - 2. The method of claim 1 wherein selecting comprises evaluating the potential headlines in order of the lengths of the potential headlines, wherein evaluating comprises:
    - identifying a location of the selected candidate headline being evaluated in a de-tagged version of the single news article;
      
      verifying the selected candidate headline as comprising a complete line at the identified location in the de-tagged content;
      
      verifying the length of the selected candidate headline exceeds a minimum length in the de-tagged content; and
      
      ensuring that the selected candidate headline comprises regular text in the de-tagged version of the single news article.
  - 3. The method of claim 1 wherein extracting comprises extracting a string representing a date located within a minimum distance from the location of the potential headline.
  - 4. The method of claim 1 wherein extracting comprises extracting a string representing a name that is located within a minimum distance from the location of the potential headline.
  - 5. The method of claim 1 wherein extracting comprises extracting a string representing a source of the document that is located within a minimum distance from the location of the potential headline.
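The selecting and extracting steps of claims 1 to 4 might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the helper names (`detag`, `select_headline`, `extract_byline`), the regex-based tag stripping, the longest-first evaluation order, the `min_len` and `max_dist` values, the "contains a letter" stand-in for the "regular text" check, and the date and name patterns are all hypothetical. A source string (claim 5) could be matched in the same way with a further pattern.

```python
import re
from html import unescape

# Hypothetical patterns; real by-lines vary far more than this.
DATE_RE = re.compile(r"\b((?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
                     r"[a-z]*\.? \d{1,2}, \d{4})\b")
NAME_RE = re.compile(r"\bBy ((?:[A-Z][\w.'-]*)(?: [A-Z][\w.'-]*)+)")


def detag(html: str) -> list[str]:
    """Crude de-tagging: drop head/script/style, strip tags, keep text lines."""
    html = re.sub(r"(?is)<(head|script|style)\b.*?</\1>", "", html)
    html = re.sub(r"(?i)<br\s*/?>|</(?:p|div|h[1-6]|li|tr)>", "\n", html)
    text = unescape(re.sub(r"<[^>]+>", "", html))
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def select_headline(candidates, lines, min_len=10):
    """Return (line_index, headline) for the first candidate that passes."""
    for cand in sorted(candidates, key=len, reverse=True):  # assumed order
        if len(cand) <= min_len:                  # must exceed minimum length
            continue
        if not any(ch.isalpha() for ch in cand):  # stand-in for "regular text"
            continue
        for idx, line in enumerate(lines):
            if line == cand:                      # candidate is a complete line
                return idx, cand
    return None


def extract_byline(lines, idx, max_dist=3):
    """Collect the nearest date and name within max_dist lines of the headline."""
    found = {}
    nearby = sorted((i for i in range(len(lines))
                     if i != idx and abs(i - idx) <= max_dist),
                    key=lambda i: abs(i - idx))   # nearest lines first
    for i in nearby:
        for field, pattern in (("date", DATE_RE), ("name", NAME_RE)):
            match = pattern.search(lines[i])
            if match and field not in found:
                found[field] = match.group(1)
    return found
```

In use, the de-tagged lines and the candidate set are passed to `select_headline`, and the returned line index is then passed to `extract_byline`, so that only text close to the located headline is searched.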

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Korupolu, Madhukar R.; Tomkins, Andrew S.; Dill, Stephen
Primary Examiner(s): Trujillo, James K.
Assistant Examiner(s): Kneitel, Justin M
Application Number: US11/259,608
Publication Number: US 20070094232A1
Time in Patent Office: 1,141 Days
Field of Search: 707/3, 707/1, 707/2, 715/234, 715/249
US Class Current: 1/1
CPC Class Codes:
G06F 16/345   Summarisation for human users
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
