Article extraction
Abstract
An article is extracted from a document using a decision combiner to process a plurality of reading order alternatives. The text flow analysis generates the plurality of reading order alternatives of separate body text regions.
28 Claims
1. A method executed by a computer, the method comprising:
in a document including multiple articles, determining, by the computer, a layout structure that separates text regions from graphic regions on each page of the document;
determining, by the computer, where each of the articles begins and each of the articles ends to distinguish the articles from each other;
extracting, by the computer, the articles from the document using text flow analysis to generate a plurality of reading order alternatives of separate body text regions;
evaluating, by the computer, topics discussed in the separate body text regions to generate a first score that indicates whether the separate body text regions belong to a same article;
evaluating, by the computer, punctuation in first and last sentences in the separate body text regions to generate a second score that indicates whether the separate body text regions belong to the same article;
using, by the computer, a decision combiner to process the first and second scores and the plurality of reading order alternatives to determine which of the separate body text regions belong to the same article; and
outputting, by the computer, an extracted article from the document.
Dependent claims: 2, 3, 4, 5, 6
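
The two scores recited in claim 1 can be illustrated with a short Python sketch. The claim does not say how either score is computed, so everything here is an assumption made for illustration: the function names (`topic_score`, `punctuation_score`, `combine`), the use of cosine similarity over word counts as the topic measure, the open-sentence heuristic for punctuation, and the equal default weighting.

```python
import math
import re
from collections import Counter

# Assumed set of sentence-ending marks; the claim does not enumerate them.
_SENTENCE_END = (".", "!", "?", '."', '!"', '?"')


def topic_score(region_a: str, region_b: str) -> float:
    """First score (assumed form): cosine similarity of word counts, in [0, 1]."""
    words_a = Counter(re.findall(r"[a-z]+", region_a.lower()))
    words_b = Counter(re.findall(r"[a-z]+", region_b.lower()))
    dot = sum(words_a[w] * words_b[w] for w in words_a)
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def punctuation_score(region_a: str, region_b: str) -> float:
    """Second score (assumed heuristic): high when the last sentence of
    region_a is left open and the first sentence of region_b starts in
    lower case, which suggests the text continues across the regions."""
    a, b = region_a.rstrip(), region_b.lstrip()
    if not a or not b:
        return 0.0
    ends_open = not a.endswith(_SENTENCE_END)
    starts_lower = b[0].islower()
    if ends_open and starts_lower:
        return 1.0
    if ends_open or starts_lower:
        return 0.75
    return 0.5  # clean sentence boundary: punctuation alone is inconclusive


def combine(first: float, second: float, w_topic: float = 0.5) -> float:
    """Hypothetical decision combiner: weighted average of the two scores."""
    return w_topic * first + (1.0 - w_topic) * second
```

A pair of regions would be scored with `combine(topic_score(a, b), punctuation_score(a, b))`; a real combiner would also have to weigh the reading order alternatives, which this sketch leaves out.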

7. A method being executed by a computer, the method comprising:
separating, by the computer, a plurality of separate body text regions from a plurality of separate body graphic regions for each page of a document that includes plural different articles;
generating, by the computer, a plurality of reading order alternatives of the plurality of separate body text regions with a text flow analysis;
evaluating, by the computer, topics discussed in two consecutive separate body text regions to determine whether the two consecutive separate body text regions belong to a same article;
comparing, by the computer, punctuation marks between the two consecutive separate body text regions to determine whether the two consecutive separate body text regions belong to the same article;
applying, by the computer, a decision combiner to process the plurality of reading order alternatives to extract a reading order for each article in the document; and
determining, by the computer, a beginning and an end for the articles to extract different articles from the document.
Dependent claims: 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
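
One way to picture the step of processing several reading order alternatives is to score each candidate order by how well its consecutive regions fit together and keep the best one. The sketch below is only an illustration under assumptions: the pairwise continuity scores are taken as given inputs, and the names `order_score` and `best_reading_order`, the region ids, and the summing rule are hypothetical rather than taken from the claims.

```python
from typing import Dict, List, Sequence, Tuple

# (region_id, next_region_id) -> continuity score in [0, 1]; assumed input.
PairScores = Dict[Tuple[str, str], float]


def order_score(order: Sequence[str], pair_scores: PairScores) -> float:
    """Sum the continuity scores of each consecutive pair in the order.
    Pairs without a known score contribute 0."""
    return sum(pair_scores.get((a, b), 0.0) for a, b in zip(order, order[1:]))


def best_reading_order(
    alternatives: List[List[str]], pair_scores: PairScores
) -> List[str]:
    """Return the alternative with the highest total continuity score.
    On a tie, the earliest alternative in the list wins."""
    if not alternatives:
        raise ValueError("at least one reading order alternative is required")
    return max(alternatives, key=lambda order: order_score(order, pair_scores))


# Two articles (A and B) laid out in two columns, with two candidate orders.
pair_scores = {
    ("A1", "A2"): 0.9, ("B1", "B2"): 0.8, ("A2", "B1"): 0.15,
    ("A1", "B1"): 0.1, ("B1", "A2"): 0.2, ("A2", "B2"): 0.1,
}
row_wise = ["A1", "B1", "A2", "B2"]
column_wise = ["A1", "A2", "B1", "B2"]
chosen = best_reading_order([row_wise, column_wise], pair_scores)
```

In this example `chosen` is the column-wise order, because its consecutive pairs carry the higher scores.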

22. Computer executable instructions stored in memory of a computer and executable by the computer to perform a method comprising:
separating, by the computer, a plurality of separate body text regions from a plurality of separate body graphic regions on multiple pages of a document to determine a layout structure of the multiple pages;
generating, by the computer, a plurality of reading order alternatives from the plurality of separate body text regions for each page of the document that includes multiple different articles;
evaluating, by the computer, topics discussed in the separate body text regions to generate a first score that indicates whether the separate body text regions belong to a same article;
evaluating, by the computer, punctuation marks of first and last sentences in the separate body text regions to generate a second score that indicates whether the separate body text regions belong to the same article;
evaluating, by the computer, the first and second scores to determine a beginning and an end for each article in the document to extract different articles from the document; and
processing, by the computer, the plurality of reading order alternatives to extract a reading order for each article in the document.
Dependent claims: 23, 24
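
Determining where each article begins and ends from the scores could, in the simplest case, amount to cutting a reading order wherever the combined score between neighbouring regions is low. The following sketch assumes exactly that; the function name `split_into_articles`, the `link_scores` input, and the 0.5 threshold are illustrative choices that the claim does not specify.

```python
from typing import List, Sequence


def split_into_articles(
    order: Sequence[str],
    link_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[List[str]]:
    """Split a reading order into articles.

    link_scores[i] is the assumed combined score between order[i] and
    order[i + 1]. A score below the threshold is treated as an article
    boundary: order[i] ends one article and order[i + 1] begins the next.
    """
    if len(link_scores) != max(len(order) - 1, 0):
        raise ValueError("expected one link score per consecutive pair")

    articles: List[List[str]] = []
    current: List[str] = []
    for i, region in enumerate(order):
        current.append(region)
        is_last = i == len(order) - 1
        if is_last or link_scores[i] < threshold:
            articles.append(current)
            current = []
    return articles
```

For a reading order `["A1", "A2", "B1", "B2"]` with link scores `[0.9, 0.15, 0.8]`, the low middle score marks the end of the first article and the beginning of the second.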

25. A computer-readable storage medium being stored on a computer and having instructions for causing the computer to execute a method, the method comprising:
a plurality of separate body text regions separated from a plurality of separate body graphic regions generated by a region determination of each page of a document that includes plural different articles;
a plurality of reading order alternatives generated by a text flow analysis of the plurality of separate body text regions, wherein the text flow analysis evaluates topics discussed in the separate body text regions to determine whether the separate body text regions belong to a same article, compares end of sentence indicators between two consecutive separate body text regions to determine whether the two consecutive separate body text regions belong to the same article, and determines beginnings and ends for the plural different articles and each of the plurality of reading order alternatives identifies one or more hypothetical article sets; and
an algorithm to process each of the plurality of reading order alternatives and the one or more hypothetical article sets to generate a reading order of an article in the document and extract the article from the plural different articles in the document.
Dependent claims: 26, 27, 28
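
Claim 25 leaves open what the "algorithm" that processes the alternatives and their hypothetical article sets looks like. As one possible reading, offered purely as an assumption, the sketch below lets the alternatives vote: each hypothesized article counts once per alternative that proposes it, and the alternative whose articles are on average most widely supported is selected. The `Alternative` class, the `most_supported` function, and the voting rule are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple

Article = Tuple[str, ...]  # region ids in reading order


@dataclass(frozen=True)
class Alternative:
    """One reading order alternative and its hypothetical article set."""
    name: str
    articles: Tuple[Article, ...]


def most_supported(alternatives: List[Alternative]) -> Alternative:
    """Pick the alternative whose articles are proposed most often overall."""
    if not alternatives:
        raise ValueError("at least one alternative is required")

    # Each alternative votes once for every distinct article it proposes.
    votes = Counter(
        article for alt in alternatives for article in set(alt.articles)
    )

    def average_support(alt: Alternative) -> float:
        if not alt.articles:
            return 0.0
        return sum(votes[a] for a in alt.articles) / len(alt.articles)

    return max(alternatives, key=average_support)


alternatives = [
    Alternative("column-wise", (("A1", "A2"), ("B1", "B2"))),
    Alternative("split-B", (("A1", "A2"), ("B1",), ("B2",))),
    Alternative("row-wise", (("A1", "B1"), ("A2", "B2"))),
]
winner = most_supported(alternatives)
```

Here `winner` is the column-wise alternative, since the article `("A1", "A2")` is proposed by two alternatives and raises its average support above the others. Averaging rather than summing keeps alternatives that fragment the page into many small articles from winning on count alone.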
Specification