SYSTEM FOR AUTOMATICALLY EXTRACTING BY-LINE INFORMATION

US 20080306941A1
Filed: 08/15/2008
Published: 12/11/2008
Est. Priority Date: 10/25/2005
Status: Active Grant



Abstract

A by-line extraction system detects a set of potential headlines from a title meta-tag of a crawled document, selects a candidate headline from the set of potential headlines, and extracts the by-line information from the document using the location of the selected candidate headline. The system constructs the set of potential headlines based on the title meta-tag. The system selects a candidate headline by evaluating the set of potential headlines in order of the lengths of the potential headlines. The system extracts the by-line information from the document by using the location of the selected candidate headline to extract a string representing a date, a name, or a source located within a minimum distance from the location of the potential headline.
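
Before headlines can be detected, the crawled page has to yield two inputs: the title text and a de-tagged, line-oriented version of the content. The following is a minimal sketch of that preparation step using Python's standard `html.parser`. It assumes the "title meta-tag" corresponds to the HTML `<title>` element; the `Detagger` class, the `detag` function, and the choice of which tags count as line breaks are illustrative assumptions, not details taken from the patent.

```python
from html.parser import HTMLParser

# Assumption: these tags mark line boundaries in the de-tagged text.
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "li", "tr", "td", "title"}
# Assumption: the contents of these tags are never article text.
SKIP_TAGS = {"script", "style"}


class Detagger(HTMLParser):
    """Collects the <title> text and the visible text of a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in SKIP_TAGS:
            self._skip_depth += 1
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth:
            self.chunks.append(data)


def detag(html):
    """Return (title, lines): the title text and the non-empty de-tagged lines."""
    parser = Detagger()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return parser.title.strip(), [line for line in lines if line]
```

Keeping the de-tagged content as a list of lines makes a later "complete line" check and a line-based distance measure straightforward, which is why this sketch splits on block-level tags rather than returning one long string.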


22 Claims

1. A processor-implemented system for automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
- a detagging module for removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
  
  a headline detection module for detecting a set of potential headlines of the document from a title meta-tag of the crawled document;
  
  a headline evaluation module for selecting a candidate headline from the set of potential headlines; and
  
  a by-line extraction module for extracting the by-line information from the de-tagged version of said crawled document using the location of the selected candidate headline.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The system of claim 1 wherein the headline detection module constructs the set of potential headlines based on the title meta-tag.
  - 3. The system of claim 2 wherein the headline detection module constructs the set of potential headlines by splitting the title meta-tag at all punctuation marks in the title meta-tag, resulting in a set of sub-strings of the title meta-tag.
  - 4. The system of claim 3 wherein the headline detection module further adds any of a plurality of bi-grams of the sub-strings and a plurality of n-grams of the sub-strings to the set of potential headlines.
  - 5. The system of claim 1 wherein the headline evaluation module evaluates the potential headlines in order of the lengths of the potential headlines, by:
    - identifying a location of the selected candidate headline being evaluated in a de-tagged version of the crawled document;
      
      verifying the selected candidate headline as comprising a complete line at the identified location in the de-tagged content;
      
      verifying the length of the selected candidate headline exceeds a minimum length in the de-tagged content; and
      
      ensuring that the selected candidate headline comprises regular text in the de-tagged version of said crawled document.
  - 6. The system of claim 1 wherein the by-line extraction module extracts a string representing a date located within a minimum distance from the location of the potential headline.
  - 7. The system of claim 1 wherein the by-line extraction module extracts a string representing a name located within a minimum distance from the location of the potential headline.
  - 8. The system of claim 1 wherein the by-line extraction module extracts a string representing a source of the document that is located within a minimum distance from the location of the potential headline.
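
Claims 2 through 4 describe how the set of potential headlines is built: the title is split at every punctuation mark, and bi-grams and n-grams of the resulting sub-strings are added. The sketch below is one possible reading of that step, not the patented implementation. It assumes an n-gram means a run of n adjacent sub-strings taken together with the punctuation that originally separated them, and it returns the candidates longest first; the function name `potential_headlines` and the `max_n` cap are hypothetical.

```python
import re
import string

# Matches maximal runs of characters that contain no ASCII punctuation.
NON_PUNCT_RUN = re.compile("[^" + re.escape(string.punctuation) + "]+")


def potential_headlines(title, max_n=3):
    """Build candidate headlines from a title string.

    Sub-strings come from splitting at every punctuation mark. An n-gram is
    taken here to be the slice of the original title that spans n adjacent
    sub-strings (an assumption; the claims do not define it further).
    """
    spans = [
        (m.start(), m.end())
        for m in NON_PUNCT_RUN.finditer(title)
        if m.group().strip()  # ignore runs that are only whitespace
    ]
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(spans) - n + 1):
            start, end = spans[i][0], spans[i + n - 1][1]
            candidates.add(title[start:end].strip())
    # Longest first; ties broken alphabetically so the order is deterministic.
    return sorted(candidates, key=lambda c: (-len(c), c))
```

For a title such as `Storm hits coast, thousands flee - Example News`, this yields the three sub-strings, the two adjacent pairs, and the full title. Slicing the original string keeps internal punctuation intact, so a candidate can still match a headline that itself contains a comma.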

9. A computer program product having program codes stored on a computer-usable medium for automatically extracting by-line information in a crawled document, wherein said document contains a single news article, comprising:
- a program code for removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
  
  a program code for detecting a set of potential headlines of the crawled document from a title meta-tag of the crawled document;
  
  a program code for selecting a candidate headline from the set of potential headlines; and
  
  a program code for extracting the by-line information from the detagged version of the crawled document using the location of the selected candidate headline.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The computer program product of claim 9 wherein the program code for detecting the set of potential headlines constructs the set of potential headlines based on the title meta-tag.
  - 11. The computer program product of claim 10 wherein the program code for detecting the set of potential headlines constructs the set of potential headlines by splitting the title meta-tag at all punctuation marks in the title meta-tag, resulting in a set of sub-strings of the title meta-tag.
  - 12. The computer program product of claim 11 wherein the program code for detecting the set of potential headlines further adds any of a plurality of bi-grams of the sub-strings and a plurality of n-grams of the sub-strings to the set of potential headlines.
  - 13. The computer program product of claim 9 wherein the program code for selecting the candidate headline evaluates the potential headlines in order of the lengths of the potential headlines, by:
    - identifying a location of the selected candidate headline being evaluated in a de-tagged version of the document;
      
      verifying the selected candidate headline as comprising a complete line at the identified location in the de-tagged content;
      
      verifying the length of the selected candidate headline exceeds a minimum length in the de-tagged content; and
      
      ensuring that the selected candidate headline comprises regular text in the de-tagged version of said crawled document.
  - 14. The computer program product of claim 9 wherein the program code for extracting the by-line information extracts a string representing a date located within a minimum distance from the location of the potential headline.
  - 15. The computer program product of claim 9 wherein the program code for extracting the by-line information extracts a string representing a name located within a minimum distance from the location of the potential headline.
  - 16. The computer program product of claim 9 wherein the program code for extracting the by-line information extracts a string representing a source of the document that is located within a minimum distance from the location of the potential headline.
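
Claim 13 (like claim 5) lists four checks applied to each potential headline in order of length: locate it in the de-tagged content, confirm it occupies a complete line, confirm it exceeds a minimum length, and confirm it is regular text. A minimal sketch of that loop follows. The claims do not define "regular text" or give a threshold, so the `looks_like_regular_text` heuristic, the `min_length` default, and the longest-first ordering are all assumptions made for illustration.

```python
def looks_like_regular_text(text, min_ratio=0.8):
    """Hypothetical stand-in for the 'regular text' test: most characters
    must be letters or spaces."""
    if not text:
        return False
    plain = sum(ch.isalpha() or ch.isspace() for ch in text)
    return plain / len(text) >= min_ratio


def select_headline(candidates, lines, min_length=10):
    """Return (headline, line_index) for the first candidate passing all
    checks, or None if no candidate qualifies.

    candidates: potential headlines derived from the title
    lines: the de-tagged document, one string per line
    """
    # Assumption: "in order of the lengths" is read as longest first.
    for candidate in sorted(candidates, key=len, reverse=True):
        if len(candidate) <= min_length:
            continue  # must exceed the minimum length
        if not looks_like_regular_text(candidate):
            continue
        for index, line in enumerate(lines):
            # The candidate must make up the complete line, not part of it.
            if line.strip() == candidate:
                return candidate, index
    return None
```

Returning the line index along with the headline gives the extraction step a location to measure distance from.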

17. A processor-implemented service for automatically extracting by-line information in a crawled document, wherein said crawled document contains a single news article, comprising:
- receiving the document;
  
  invoking an autonomic hardware configuration utility, wherein the document is made available to the autonomic hardware configuration utility for automatically extracting by-line information in the document by:
  
  removing formatting tags from said crawled document to create a de-tagged version of said crawled document;
  
  detecting a set of potential headlines of the crawled document from a title meta-tag of the document;
  
  selecting a candidate headline from the set of potential headlines; and
  
  extracting the by-line information from the de-tagged version of said crawled document using the location of the selected candidate headline.
- Dependent claims (18, 19, 20, 21, 22):
  - 18. The service of claim 17 wherein detecting comprises constructing the set of potential headlines based on the title meta-tag.
  - 19. The service of claim 18 wherein constructing comprises splitting the title meta-tag at all punctuation marks in the title meta-tag, resulting in a set of sub-strings of the title meta-tag.
  - 20. The service of claim 19 further comprising adding any of a plurality of bi-grams of the sub-strings and a plurality of n-grams of the sub-strings to the set of potential headlines.
  - 21. The service of claim 17 wherein selecting comprises evaluating the potential headlines in order of the lengths of the potential headlines, wherein evaluating comprises:
    - identifying a location of the selected candidate headline being evaluated in a de-tagged version of the document;
      
      verifying the selected candidate headline as comprising a complete line at the identified location in the de-tagged content;
      
      verifying the length of the selected candidate headline exceeds a minimum length in the de-tagged content; and
      
      ensuring that the selected candidate headline comprises regular text in the de-tagged version of the document.
  - 22. The service of claim 17 wherein extracting comprises extracting a string representing a date located within a minimum distance from the location of the potential headline.
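
Claim 22, along with claims 6, 7, 14, and 15, has the extraction step look for a date or name string close to the selected headline. The sketch below illustrates that idea under several assumptions: distance is measured in lines, the `window` size is arbitrary, lines below the headline are preferred over lines above at equal distance, and the two regular expressions cover only a few English date and "By Name" formats. None of these specifics come from the patent.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Assumption: dates look like "March 3, 2005" or "3/3/2005".
DATE_RE = re.compile(
    r"\b(?:%s) \d{1,2}, \d{4}\b|\b\d{1,2}/\d{1,2}/\d{4}\b" % MONTHS
)
# Assumption: author lines look like "By Jane Doe".
NAME_RE = re.compile(r"^By ([A-Z][\w.'-]*(?: [A-Z][\w.'-]*)+)")


def extract_byline(lines, headline_index, window=3):
    """Search lines within `window` lines of the headline for a date and a name.

    Returns a dict with keys "date" and "name"; a value is None if not found.
    """
    result = {"date": None, "name": None}
    low = max(0, headline_index - window)
    high = min(len(lines), headline_index + window + 1)
    # Visit the nearest lines first; at equal distance, lines below come first.
    nearby = sorted(
        (i for i in range(low, high) if i != headline_index),
        key=lambda i: (abs(i - headline_index), i < headline_index),
    )
    for i in nearby:
        if result["date"] is None:
            match = DATE_RE.search(lines[i])
            if match:
                result["date"] = match.group(0)
        if result["name"] is None:
            match = NAME_RE.match(lines[i])
            if match:
                result["name"] = match.group(1)
    return result
```

Anchoring the search to the headline location is what keeps unrelated dates elsewhere on the page, such as a footer timestamp, out of the result.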


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Korupolu, Madhukar R.; Tomkins, Andrew S.; Dill, Stephen
Granted Patent: US 8,321,396 B2
US Class Current: 707/5
CPC Class Codes:
G06F 16/345   Summarisation for human users
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
