Text extraction method for HTML pages
4 Assignments
0 Petitions
Abstract
An object of the present invention is to extract only the relevant information from a document (such as an HTML web page) to facilitate the summarizing of the document. There is provided a method of extracting a portion of text from a document including at least one table and cells within the at least one table, for the purposes of generating a summary of contents of the document. The method comprises: identifying cells within the document; determining a text size of the cells; selecting some of the cells using the text size of the cells; extracting in a text only output a text content of the selected cells; whereby the text only output extracted can be used to produce a summary of a portion of text of the document excluding text from non-selected cells.
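The steps named in the abstract (identify cells, determine their text size, select by text size, emit text only) can be illustrated with a minimal sketch. It is not the patented implementation. It assumes that "text size" means the character count of a cell's whitespace-normalized text, that text inside nested tables belongs to the innermost cell, and that cells are closed with explicit end tags. The names `CellTextCollector`, `extract_text_only`, and `min_chars` are hypothetical.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collects the text of each table cell; text goes to the innermost open cell."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of text-fragment lists, one per currently open cell
        self.cells = []   # normalized text of each closed cell, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._open.append([])

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._open:
            fragments = self._open.pop()
            # Collapse runs of whitespace so the size reflects visible text only.
            self.cells.append(" ".join(" ".join(fragments).split()))

    def handle_data(self, data):
        if self._open:
            self._open[-1].append(data)


def extract_text_only(html, min_chars=40):
    """Return the text of cells whose text size is at least min_chars (assumed cutoff)."""
    collector = CellTextCollector()
    collector.feed(html)
    collector.close()
    selected = [text for text in collector.cells if len(text) >= min_chars]
    return "\n".join(selected)
```

In this sketch a short navigation cell such as "Home | News | Contact" falls below the cutoff and is dropped, while a cell holding a full sentence is kept. Script or style content inside a cell would also be counted, so a real extractor would need to filter it out.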
97 Citations
29 Claims
1. A method of extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, the method comprising:
identifying layout cells within said document, said layout cells defining a layout of text entities within said document;
calculating statistics parameters of the layout cells, at least one of said statistics parameters being the number of words in said layout cells;
attributing a point value for each of said layout cells using at least one of said statistics parameters;
ranking said layout cells according to said point value;
selecting at least one of said layout cells whose point value is above a predetermined threshold;
extracting a text content of said selected layout cells.
Dependent claims: 2-16.
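The scoring, ranking, and selection steps of claim 1 can be sketched as follows, taking the cell texts as already identified. The claim does not state how the point value is derived from the statistics. This sketch assumes a point value proportional to the word count. The names `extract_by_points`, `threshold`, and `points_per_word` are hypothetical.

```python
def word_count(text):
    """Statistics parameter named in the claim: number of words in the cell."""
    return len(text.split())


def extract_by_points(cell_texts, threshold=20.0, points_per_word=1.0):
    """Score, rank, and select cells; returns selected texts in rank order."""
    # Attribute a point value to each cell (assumed rule: weight * word count).
    scored = [(points_per_word * word_count(text), text) for text in cell_texts]
    # Rank cells from highest to lowest point value (ties keep document order).
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Select the cells strictly above the predetermined threshold.
    return [text for points, text in ranked if points > threshold]
```

Returning the selected cells in rank order is a choice made for this sketch. Emitting them in document order would satisfy the same steps.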
17. A text extractor for extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, comprising:
a cell identifier for identifying layout cells within said document, said layout cells defining a layout of text entities within said document;
a statistics calculator for calculating statistics parameters of the layout cells, at least one of said statistics parameters being the number of words in said layout cells;
a point value determiner for attributing a point value for each of said layout cells using at least one of said statistics parameters;
a cell ranker for ranking said layout cells according to said point value;
a cell selector for selecting at least one of said layout cells whose point value is above a predetermined threshold;
a text provider for extracting a text content of said selected layout cells.
Dependent claims: 18-29.
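The component structure of claim 17 can be pictured as a small object that wires pluggable parts together. This is a hedged sketch: the class `TextExtractor`, its field names, and the callable signatures are assumptions made for illustration and do not come from the patent. The ranker, selector, and text provider are folded into one method for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TextExtractor:
    # Each field stands in for one claimed component (hypothetical signatures).
    cell_identifier: Callable[[str], List[str]]
    statistics_calculator: Callable[[str], Dict[str, float]]
    point_value_determiner: Callable[[Dict[str, float]], float]
    threshold: float

    def extract(self, document: str) -> List[str]:
        cells = self.cell_identifier(document)
        points = [
            self.point_value_determiner(self.statistics_calculator(cell))
            for cell in cells
        ]
        # Cell ranker: order cells by descending point value.
        ranked = sorted(zip(points, cells), key=lambda pair: pair[0], reverse=True)
        # Cell selector and text provider: keep text of cells above the threshold.
        return [cell for value, cell in ranked if value > self.threshold]
```

For example, a toy instance can treat `"||"` as a stand-in cell separator and use the word count as the only statistic:

```python
extractor = TextExtractor(
    cell_identifier=lambda doc: doc.split("||"),
    statistics_calculator=lambda cell: {"words": float(len(cell.split()))},
    point_value_determiner=lambda stats: stats["words"],
    threshold=2,
)
selected = extractor.extract("a b||c d e f||g h i")
```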
Specification