Apparatus and method for text extraction
Abstract
A method of determining main text in a mark-up document is provided. The method comprises determining a length of each paragraph in the mark-up document, and determining one or more main paragraphs of the mark-up document based upon the lengths of the paragraphs in the mark-up document.
21 Claims
1. A method of determining main text in a mark-up document, comprising:
- removing, by a system having a processor, first predetermined mark-up tags from the mark-up document, and replacing second predetermined mark-up tags in the mark-up document with separation elements, wherein the removing and the replacing cause the mark-up document to contain text paragraphs and the separation elements without the first and second predetermined mark-up tags;
- determining, by the system, a length of each of the text paragraphs in the mark-up document; and
- determining, by the system, one or more main paragraphs of the mark-up document based upon the lengths of the text paragraphs in the mark-up document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. An apparatus for determining main text in a mark-up document, comprising:
- a memory having a mark-up document stored therein; and
- a processor configured to:
  - remove first predetermined mark-up tags from the mark-up document, and replace second predetermined mark-up tags in the mark-up document with separation elements, wherein the removing and the replacing cause the mark-up document to contain text paragraphs and the separation elements without the first and second predetermined mark-up tags;
  - determine lengths of the text paragraphs in the mark-up document; and
  - determine one or more main paragraphs of the mark-up document based upon the lengths of the text paragraphs in the mark-up document.
Dependent claims: 11, 12
13. A non-transitory computer-readable storage medium storing instructions that upon execution cause a system to:
- remove first predetermined mark-up tags from the mark-up document, and replace second predetermined mark-up tags in the mark-up document with separation elements, wherein the removing and the replacing cause the mark-up document to contain text paragraphs and the separation elements without the first and second predetermined mark-up tags;
- determine a length of each of the text paragraphs in the mark-up document; and
- determine one or more main paragraphs of the mark-up document based upon the lengths of the text paragraphs in the mark-up document.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21
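The three steps recited in the independent claims (strip one set of tags, turn another set into separators, then rank paragraphs by length) can be illustrated with a minimal Python sketch. This is not the patented implementation. The claims do not say which tags belong to the "first" and "second" predetermined sets, nor how lengths are turned into a selection, so the tag lists `INLINE_TAGS` and `BLOCK_TAGS`, the separator character, the function names, and the "at least half the longest paragraph" rule are all assumptions made for illustration.

```python
import re
from html import unescape

# Assumed "first predetermined" tags: inline tags removed outright.
INLINE_TAGS = ("a", "b", "i", "em", "strong", "span", "font")
# Assumed "second predetermined" tags: block-level tags replaced by a separator.
BLOCK_TAGS = ("p", "div", "br", "td", "tr", "li", "h1", "h2", "h3", "table")
# Assumed separation element: a control character unlikely to occur in page text.
SEPARATOR = "\x1e"


def _tag_pattern(names):
    """Match opening, closing, and self-closing forms of the named tags."""
    return re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(names), re.IGNORECASE)


def split_paragraphs(markup):
    """Remove the first tag set, replace the second with separators, split."""
    text = _tag_pattern(INLINE_TAGS).sub("", markup)
    text = _tag_pattern(BLOCK_TAGS).sub(SEPARATOR, text)
    text = unescape(text)  # decode entities such as &amp;
    # Collapse internal whitespace and drop empty fragments.
    paragraphs = (" ".join(part.split()) for part in text.split(SEPARATOR))
    return [p for p in paragraphs if p]


def main_paragraphs(markup, ratio=0.5):
    """Keep paragraphs at least `ratio` times as long as the longest one.

    The ratio rule is a hypothetical selection criterion; the claims only
    require that selection be "based upon the lengths".
    """
    paragraphs = split_paragraphs(markup)
    if not paragraphs:
        return []
    lengths = [len(p) for p in paragraphs]
    threshold = ratio * max(lengths)
    return [p for p, n in zip(paragraphs, lengths) if n >= threshold]
```

With a page made of a short navigation link, one long article paragraph, and a short footer, `main_paragraphs` would keep only the article paragraph, because the other two fall well below half its length. Tags outside the two assumed sets are left in place, which mirrors the claim wording that only the first and second predetermined tags are eliminated.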
Specification