Text extraction method for HTML pages

US 20030229854A1
Filed: 04/07/2003
Published: 12/11/2003
Est. Priority Date: 10/19/2000
Status: Abandoned Application

Abstract

An object of the present invention is to extract only the relevant information from a document (such as an HTML web page) to facilitate the summarizing of the document. There is provided a method of extracting a portion of text from a document including at least one table and cells within the at least one table, for the purpose of generating a summary of the contents of the document. The method comprises: identifying cells within the document; determining a text size of the cells; selecting some of the cells using the text size of the cells; and extracting, in a text-only output, a text content of the selected cells; whereby the extracted text-only output can be used to produce a summary of a portion of text of the document, excluding text from non-selected cells.
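
The steps named in the abstract (identify cells, measure their text size, select, emit text-only output) can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes that layout cells are `<td>`/`<th>` elements, that "text size" means a word count, and that a fixed `min_words` cutoff drives selection. The names `CellCollector`, `extract_main_text`, and `min_words` are hypothetical.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every table cell; text is credited to the innermost open cell."""

    def __init__(self):
        super().__init__()
        self._open = []   # stack of text-fragment lists, one per currently open cell
        self.cells = []   # finished cell texts, in the order the cells close

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._open.append([])

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._open:
            fragments = self._open.pop()
            # Collapse runs of whitespace so word counting is stable.
            self.cells.append(" ".join(" ".join(fragments).split()))

    def handle_data(self, data):
        if self._open:
            self._open[-1].append(data)


def extract_main_text(html, min_words=20):
    """Return the text of cells holding at least `min_words` words (assumed cutoff)."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    selected = [text for text in collector.cells if len(text.split()) >= min_words]
    return "\n\n".join(selected)


# Example page: a navigation cell next to a cell carrying the actual content.
page = """
<table><tr>
  <td><a href="/">Home</a> <a href="/news">News</a></td>
  <td>The council approved the new budget on Monday after a long debate about road repairs.</td>
</tr></table>
"""
print(extract_main_text(page, min_words=10))
```

With the example page, the two-word navigation cell falls below the cutoff and only the sentence in the second cell is kept.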

97 Citations

29 Claims

1. A method of extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, the method comprising:
- identifying layout cells within said document, said layout cells defining a layout of text entities within said document;
- calculating statistics parameters of the layout cells, at least one of said statistics parameters being the number of words in said layout cells;
- attributing a point value for each of said layout cells using at least one of said statistics parameters;
- ranking said layout cells according to said point value;
- selecting at least one of said layout cells whose point value is above a predetermined threshold;
- extracting a text content of said selected layout cells.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
  - 2. A method as claimed in claim 1, wherein said identifying layout cells within said document comprises building a hierarchical tree structure for said document and said calculating statistics parameters comprises using said hierarchical tree structure to determine a depth of said layout cells within said structure and said selecting comprises selecting cells having a large number of words value and a low depth value.
  - 3. A method as claimed in claim 1, wherein said number of words is calculated by determining a number of hyperlinked words contained in said layout cells and subtracting said number of hyperlinked words from a total number of words contained in said layout cells to obtain a number of words of a text content of said layout cells.
  - 4. A method as claimed in claim 1, wherein said number of words is calculated by determining a number of words of an alternate text element contained in said layout cells and adding said number of words of said alternate text element to a total number of words contained in said layout cells to obtain a complete number of words of said layout cells.
  - 5. A method as claimed in claim 1, wherein said identifying layout cells comprises identifying at least one table defining a layout of said document; and identifying at least one layout cell within each said at least one table.
  - 6. A method as claimed in claim 5, wherein said at least one layout cell within each said at least one table comprises at least one sub-table within said at least one layout cell.
  - 7. A method as claimed in claim 5, wherein said calculating statistics parameters comprises determining at least one of a number of words in said table, a number of words in links or images of said table, a number of layout cells in said table, a number of words per layout cell in said table, a depth of said table and a maximum number of words per layout cell; and wherein said selecting comprises:
    - calculating a score for said table;
    - if said score is lower than a low threshold value, eliminating said table;
    - if said score is higher than a high threshold value, selecting said table.
  - 8. A method as claimed in claim 7, wherein said high threshold value is equal to said low threshold value.
  - 9. A method as claimed in claim 5, wherein, for each sub-table included in a layout cell within said selected table, the method further comprises:
    - calculating a sub-score for each said sub-table;
    - if said sub-score is higher than a sub-table threshold value, selecting said sub-table to be a selected table.
  - 10. A method as claimed in claim 1, wherein said calculating statistics parameters comprises:
    - determining a number of words contained in said layout cells; and
    - determining a number of words in links or images of said layout cells;

    and wherein said attributing comprises calculating a layout cell score value for said layout cells using said number of words in links or images and said number of words.
  - 11. A method as claimed in claim 5, wherein said calculating statistics parameters comprises:
    - determining a number of words contained in each of said layout cells of said selected table; and
    - determining a number of words in links or images of said layout cells of said selected table;

    and wherein said attributing comprises calculating a layout cell score value for said layout cells of said selected table using said number of words in links or images and said number of words;

    and wherein said selecting comprises, if said layout cell score value is higher than a layout cell threshold value, selecting said layout cell.
  - 12. A method as claimed in claim 1, wherein said document is an HTML source code file and said identifying layout cells within said document comprises using HTML source code from said file to identify said layout cells.
  - 13. A method as claimed in claim 12, wherein said using HTML source code comprises recognizing HTML layout tags identifying layout cells within said document.
  - 14. A computer readable memory for storing programmable instructions for use in the execution in a computer of the method of claim 1 to.
  - 15. A method of extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, the method comprising:
    - receiving a signal, said signal containing text extracted according to the method as defined in claim 1.
  - 16. In a method of extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, a computer data signal embodied in a carrier wave comprising:
    - text extracted according to the method as defined in claim 1.
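
The statistics and thresholds recited in claims 2 to 4 and 7 to 8 can be illustrated with a small scoring sketch. The claims do not disclose a scoring formula, so everything numeric here is an assumption: the word count subtracts hyperlinked words (claim 3) and adds alternate-text words (claim 4), and the hypothetical `point_value` divides that count by `1 + depth` so that wordy, shallow cells rank highest (claim 2). The names `CellStats`, `effective_words`, `point_value`, `select_cells`, and `classify_table`, as well as the "undecided" outcome between the two table thresholds, are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class CellStats:
    name: str
    total_words: int    # all words found in the cell
    linked_words: int   # words inside hyperlinks (subtracted, as in claim 3)
    alt_words: int      # words of alternate text (added, as in claim 4)
    depth: int          # depth of the cell in the document tree (claim 2)


def effective_words(cell):
    """Word count after removing link words and adding alternate-text words."""
    return cell.total_words - cell.linked_words + cell.alt_words


def point_value(cell):
    """Assumed formula: more words raise the value, greater depth lowers it."""
    return effective_words(cell) / (1 + cell.depth)


def select_cells(cells, threshold):
    """Rank cells by point value, then keep those strictly above the threshold."""
    ranked = sorted(cells, key=point_value, reverse=True)
    return [cell for cell in ranked if point_value(cell) > threshold]


def classify_table(score, low, high):
    """Two-threshold test of claim 7; claim 8 is the special case low == high."""
    if score < low:
        return "eliminate"
    if score > high:
        return "select"
    return "undecided"  # the claims do not say what happens between the thresholds


cells = [
    CellStats("nav", total_words=12, linked_words=12, alt_words=0, depth=2),
    CellStats("article", total_words=300, linked_words=20, alt_words=5, depth=2),
    CellStats("footer", total_words=30, linked_words=10, alt_words=0, depth=3),
]
print([cell.name for cell in select_cells(cells, threshold=10)])
```

In this example the navigation cell scores 0 because every word is hyperlinked, the footer scores 5.0, and the article cell scores 95.0, so only the article cell clears a threshold of 10.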

17. A text extractor for extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, comprising:
- a cell identifier for identifying layout cells within said document, said layout cells defining a layout of text entities within said document;
- a statistics calculator for calculating statistics parameters of the layout cells, at least one of said statistics parameters being the number of words in said layout cells;
- a point value determiner for attributing a point value for each of said layout cells using at least one of said statistics parameters;
- a cell ranker for ranking said layout cells according to said point value;
- a cell selector for selecting at least one of said layout cells whose point value is above a predetermined threshold;
- a text provider for extracting a text content of said selected layout cells.
- Dependent claims (18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29)
  - 18. A text extractor as claimed in claim 17, wherein said cell identifier comprises a tree builder for building a hierarchical tree structure for said document and wherein said statistics calculator comprises a depth determiner for determining a depth of said layout cells within said structure using said hierarchical tree structure and wherein said cell selector selects some of said layout cells having a large number of words value and a low depth value.
  - 19. A text extractor as claimed in claim 17, wherein said statistics calculator comprises a hyperlinked word calculator for calculating a number of hyperlinked words contained in said layout cells and subtracting said number of hyperlinked words from a total number of words contained in said layout cells to obtain a number of words of a text content of said layout cells.
  - 20. A text extractor as claimed in claim 17, wherein said statistics calculator comprises an alternate text calculator for calculating a number of words of an alternate text element contained in said layout cells and adding said number of words of said alternate text element to a total number of words contained in said layout cells to obtain a complete number of words of said layout cells.
  - 21. A text extractor as claimed in claim 17, wherein said cell identifier identifies at least one table defining a layout of said document; and identifies at least one layout cell within each said at least one table.
  - 22. A text extractor as claimed in claim 21, wherein said at least one layout cell within each said at least one table comprises at least one sub-table within said at least one layout cell.
  - 23. A text extractor as claimed in claim 21, wherein said statistics calculator determines at least one of a number of words in said table, a number of words in links or images of said table, a number of layout cells in said table, a number of words per layout cell in said table, a depth of said table and a maximum number of words per layout cell; and wherein said cell ranker calculates a score for said table; and wherein said cell selector:
    - if said score is lower than a low threshold value, eliminates said table;
    - if said score is higher than a high threshold value, selects said table.
  - 24. A text extractor as claimed in claim 23, wherein said high threshold value is equal to said low threshold value.
  - 25. A text extractor as claimed in claim 21, wherein said statistics calculator calculates a sub-score for each said sub-table included in a layout cell within said selected table; and said cell selector, if said sub-score is higher than a sub-table threshold value, selects said sub-table to be a selected table.
  - 26. A text extractor as claimed in claim 17, wherein said statistics calculator determines a number of words contained in said layout cells; and determines a number of words in links or images of said layout cells; and wherein said cell ranker calculates a layout cell score value for said layout cells using said number of words in links or images and said number of words; and wherein said cell selector, if said layout cell score value is higher than a layout cell threshold value, selects said layout cell.
  - 27. A text extractor as claimed in claim 21, wherein said statistics calculator:
    - determines a number of words contained in each of said layout cells of said selected table; and
    - determines a number of words in links or images of said layout cells of said selected table;

    and wherein said cell ranker calculates a layout cell score value for said layout cells of said selected table using said number of words in links or images and said number of words;

    and wherein said cell selector, if said layout cell score value is higher than a layout cell threshold value, selects said layout cell.
  - 28. A text extractor as claimed in claim 17, wherein said document is an HTML source code file and said cell identifier uses HTML source code from said file to identify said layout cells.
  - 29. A text extractor as claimed in claim 28, wherein said cell identifier recognizes HTML layout tags identifying layout cells within said document.
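
The component breakdown of claim 17 (cell identifier, statistics calculator, point value determiner, cell ranker, cell selector, text provider) maps naturally onto a small composed object. The sketch below is one possible arrangement under stated assumptions, not the structure disclosed in the specification: each component is a plain callable, the only statistic is a word count, and the toy cell identifier splits on `|` instead of parsing HTML. The class `TextExtractor` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextExtractor:
    identify_cells: Callable[[str], List[str]]   # cell identifier
    count_words: Callable[[str], int]            # statistics calculator
    point_value: Callable[[int], float]          # point value determiner
    threshold: float                             # cutoff used by the cell selector

    def extract(self, document: str) -> List[str]:
        cells = self.identify_cells(document)
        scored = [(self.point_value(self.count_words(cell)), cell) for cell in cells]
        # Cell ranker: highest point value first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Cell selector and text provider: keep the text of cells above the threshold.
        return [cell for score, cell in scored if score > self.threshold]


# Toy wiring: '|' stands in for cell boundaries and the point value is the raw word count.
extractor = TextExtractor(
    identify_cells=lambda doc: [part.strip() for part in doc.split("|")],
    count_words=lambda cell: len(cell.split()),
    point_value=float,
    threshold=3,
)
print(extractor.extract("Home News | A long paragraph of body text here | Contact"))
```

Keeping each component as an injectable callable means that an HTML-aware cell identifier or a different scoring rule can be swapped in without touching the ranking and selection logic.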

Current Assignee: Coveo Solutions, Inc.
Original Assignee: Coveo Solutions, Inc.
Inventors: Lemay, Michel
Application Number: US10/407,203
Publication Number: US 20030229854A1
US Class Current: 715/513
CPC Class Codes:
G06F 16/345 Summarisation for human users
G06F 16/9577 Optimising the visualizatio...
