Text extraction method for HTML pages

  • US 20030229854A1
  • Filed: 04/07/2003
  • Published: 12/11/2003
  • Est. Priority Date: 10/19/2000
  • Status: Abandoned Application
First Claim

1. A method of extracting a portion of text from a document including a plurality of layout cells in at least one table defining a layout of said document, the method comprising:

  • identifying layout cells within said document, said layout cells defining a layout of text entities within said document;

  • calculating statistics parameters of the layout cells, at least one of said statistics parameters being the number of words in said layout cells;

  • attributing a point value for each of said layout cells using at least one of said statistics parameters;

  • ranking said layout cells according to said point value;

  • selecting at least one of said layout cells whose point value is above a predetermined threshold;

  • extracting a text content of said selected layout cells.
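
The steps of claim 1 can be illustrated with a minimal sketch. It is not the applicant's implementation, and several choices here are assumptions made only for illustration: each `<td>` or `<th>` element is treated as a layout cell, the point value is simply the cell's word count (the claim only requires that word count be one of the statistics used), the threshold default of 20 is arbitrary, and the names `CellCollector` and `extract_main_text` are hypothetical. The sketch also assumes cells are closed explicitly with `</td>` or `</th>`.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of each <td>/<th> cell (assumed to be a layout cell).

    Text is credited to the innermost open cell only, so a cell that merely
    wraps a nested table does not inherit the nested cells' words.
    """

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts, in the order the cells close
        self._stack = []   # one list of text fragments per currently open cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._stack.append([])

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._stack:
            fragments = self._stack.pop()
            # Collapse runs of whitespace so word counting is stable.
            self.cells.append(" ".join(" ".join(fragments).split()))

    def handle_data(self, data):
        if self._stack:
            self._stack[-1].append(data)


def extract_main_text(html, threshold=20):
    """Return the text of cells whose point value exceeds the threshold.

    Assumption: point value == number of words in the cell.
    """
    parser = CellCollector()
    parser.feed(html)
    parser.close()

    # Statistics and point value: here, just the word count per cell.
    scored = [(len(text.split()), text) for text in parser.cells]

    # Rank cells by point value, highest first.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)

    # Select cells strictly above the threshold and extract their text.
    return [text for points, text in ranked if points > threshold]
```

Calling `extract_main_text(page_html, threshold=20)` on a table-based page would return the longer cells, highest word count first, and drop short navigation or banner cells. A fuller scoring function could combine further statistics, which the claim leaves open.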

  • 4 Assignments