Article extraction

  • US 7,756,871 B2
  • Filed: 10/13/2004
  • Issued: 07/13/2010
  • Est. Priority Date: 10/13/2004
  • Status: Expired due to Fees
First Claim

1. A method executed by a computer, the method comprising:

  • in a document including multiple articles, determining, by the computer, a layout structure that separates text regions from graphic regions on each page of the document;

    determining, by the computer, where each of the articles begins and each of the articles ends to distinguish the articles from each other;

    extracting, by the computer, the articles from the document using text flow analysis to generate a plurality of reading order alternatives of separate body text regions;

    evaluating, by the computer, topics discussed in the separate body text regions to generate a first score that indicates whether the separate body text regions belong to a same article;

    evaluating, by the computer, punctuation in first and last sentences in the separate body text regions to generate a second score that indicates whether the separate body text regions belong to the same article;

    using, by the computer, a decision combiner to process the first and second scores and the plurality of reading order alternatives to determine which of the separate body text regions belong to the same article; and

    outputting, by the computer, an extracted article from the document.
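
The scoring and combining steps of the claim can be illustrated with a minimal sketch. The claim does not say how either score is computed or how the decision combiner works, so everything below is an assumption chosen for illustration: the first score is approximated by cosine similarity of word counts, the second by a simple check of how one region ends and the next begins, and the combiner by a weighted average with a threshold. The function names (`topic_score`, `punctuation_score`, `combine`, `group_regions`), the weight, and the threshold are hypothetical and do not come from the patent.

```python
import math
import re
from collections import Counter


def topic_score(region_a: str, region_b: str) -> float:
    """Stand-in for the 'first score': cosine similarity of word counts (assumed)."""
    words_a = Counter(re.findall(r"[a-z]+", region_a.lower()))
    words_b = Counter(re.findall(r"[a-z]+", region_b.lower()))
    dot = sum(count * words_b[word] for word, count in words_a.items())
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def punctuation_score(region_a: str, region_b: str) -> float:
    """Stand-in for the 'second score' (assumed heuristic).

    Looks at the last character of region_a and the first character of
    region_b: an unfinished sentence followed by a lowercase start is
    treated as strong evidence that the text continues.
    """
    tail, head = region_a.rstrip(), region_b.lstrip()
    if not tail or not head:
        return 0.0
    ends_open = tail[-1] not in ".!?\"'"   # no sentence-final punctuation
    starts_lower = head[0].islower()       # continues mid-sentence
    if ends_open and starts_lower:
        return 1.0
    if ends_open or starts_lower:
        return 0.7
    return 0.5  # neutral: a new sentence may or may not be a new article


def combine(first: float, second: float, w_topic: float = 0.6) -> float:
    """Hypothetical decision combiner: weighted average of the two scores."""
    return w_topic * first + (1.0 - w_topic) * second


def group_regions(regions, alternatives, threshold=0.5):
    """Pick the best-scoring reading order, then split it into articles.

    regions: dict mapping a region id to its body text.
    alternatives: list of candidate reading orders (each a non-empty list of ids).
    """
    def pair_score(x, y):
        return combine(topic_score(regions[x], regions[y]),
                       punctuation_score(regions[x], regions[y]))

    def order_score(order):
        pairs = list(zip(order, order[1:]))
        return sum(pair_score(x, y) for x, y in pairs) / len(pairs) if pairs else 0.0

    best = max(alternatives, key=order_score)
    articles = [[best[0]]]
    for prev, cur in zip(best, best[1:]):
        if pair_score(prev, cur) >= threshold:
            articles[-1].append(cur)   # same article continues
        else:
            articles.append([cur])     # a new article starts here
    return articles
```

Calling `group_regions` with a dictionary of region texts and two or more candidate orderings returns a list of articles, each a list of region ids in reading order. A weighted average keeps the sketch short; a real combiner could equally be a trained classifier or a voting scheme, and the claim leaves that choice open.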

  • 8 Assignments