System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents

  • US 7,756,807 B1
  • Filed: 09/24/2008
  • Issued: 07/13/2010
  • Est. Priority Date: 06/18/2004
  • Status: Active Grant
First Claim

1. A method for automatic article and title extraction without templates, comprising:

  • building a paragraph tree from a page;

  • building paragraphs from tree nodes that include a table depth determination for each node;

  • determining the paragraphs that contain a href, url reference;

  • grouping contiguous href and non-href paragraphs into blocks and categorizing them by size as small, medium and large; and

  • declaring the largest medium block as large, if there are no large blocks.
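
The grouping and size-categorization steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes the paragraphs have already been built from the tree, reads "contiguous" as consecutive paragraphs sharing the same href status, and measures block size in characters. The names `Paragraph`, `Block`, `build_blocks`, and the thresholds `SMALL_MAX` and `MEDIUM_MAX` are hypothetical; the claim does not give any numeric values.

```python
from dataclasses import dataclass
from itertools import groupby

# Hypothetical size thresholds in characters; the claim does not specify them.
SMALL_MAX = 200
MEDIUM_MAX = 1000


@dataclass
class Paragraph:
    text: str
    has_href: bool        # True if the paragraph contains an href (URL reference)
    table_depth: int = 0  # table nesting depth of the source node (unused here)


@dataclass
class Block:
    paragraphs: list
    has_href: bool
    category: str = ""

    @property
    def size(self) -> int:
        # Assumption: block size is the total character count of its paragraphs.
        return sum(len(p.text) for p in self.paragraphs)


def categorize(size: int) -> str:
    """Map a block size to 'small', 'medium', or 'large'."""
    if size <= SMALL_MAX:
        return "small"
    if size <= MEDIUM_MAX:
        return "medium"
    return "large"


def build_blocks(paragraphs):
    """Group contiguous href / non-href paragraphs into categorized blocks."""
    blocks = []
    # groupby merges consecutive paragraphs that share the same href status.
    for has_href, run in groupby(paragraphs, key=lambda p: p.has_href):
        block = Block(list(run), has_href)
        block.category = categorize(block.size)
        blocks.append(block)

    # Fallback from the claim: with no large block, promote the largest medium one.
    if not any(b.category == "large" for b in blocks):
        mediums = [b for b in blocks if b.category == "medium"]
        if mediums:
            max(mediums, key=lambda b: b.size).category = "large"
    return blocks


if __name__ == "__main__":
    page = [
        Paragraph("a" * 50, True),
        Paragraph("b" * 50, True),
        Paragraph("c" * 300, False),
        Paragraph("d" * 300, False),
        Paragraph("e" * 30, True),
        Paragraph("f" * 250, False),
    ]
    for b in build_blocks(page):
        print(b.has_href, b.size, b.category)
```

In this sketch the promotion step guarantees that a page with at least one medium block always yields a large block, which a later stage could treat as the article candidate.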
