System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents
First Claim
1. A method for automatic article and title extraction without templates, comprising:
building a paragraph tree from a page;
building paragraphs from tree nodes that include a table depth determination for each node;
determining the paragraphs that contain a href, url reference;
grouping contiguous href and non-href paragraphs into blocks and categorizing them by size as small, medium and large; and
declaring the largest medium block as large, if there are no large blocks.
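The first three steps of the claim (walking the page, recording a table depth for each paragraph, and marking paragraphs that contain an href) can be illustrated with a minimal sketch. It is not the patented implementation: the `Paragraph` record, the `ParagraphCollector` class, the `extract_paragraphs` helper, and the choice of paragraph-breaking tags are all assumptions made for illustration, and the tree is flattened into a list that only keeps each paragraph's table nesting depth.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Paragraph:
    text: str
    table_depth: int   # how many <table> elements enclose this paragraph
    has_href: bool     # True if the paragraph contains an <a href=...>


class ParagraphCollector(HTMLParser):
    # Assumption: these tags end the current paragraph and start a new one.
    BREAK_TAGS = {"p", "div", "td", "th", "tr", "li", "br", "table", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._table_depth = 0
        self._chunks = []
        self._has_href = False

    def flush(self):
        # Emit the text collected so far as one paragraph, if it is non-empty.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(Paragraph(text, self._table_depth, self._has_href))
        self._chunks = []
        self._has_href = False

    def handle_starttag(self, tag, attrs):
        if tag in self.BREAK_TAGS:
            self.flush()            # text before a <table> keeps the outer depth
        if tag == "table":
            self._table_depth += 1
        elif tag == "a" and any(name == "href" and value for name, value in attrs):
            self._has_href = True

    def handle_endtag(self, tag):
        if tag in self.BREAK_TAGS:
            self.flush()            # text inside a <table> keeps the inner depth
        if tag == "table":
            self._table_depth = max(0, self._table_depth - 1)

    def handle_data(self, data):
        self._chunks.append(data)


def extract_paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.flush()               # emit any trailing text
    return collector.paragraphs
```

Calling `extract_paragraphs(page_html)` returns the paragraphs in page order, each tagged with its table depth and href flag, which is the input a later block-grouping step would need.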
Abstract
Provided are methods and systems that extract facts from unstructured documents and build an oracle for various domains. The present invention addresses the problem of efficiently finding and extracting facts about a particular subject domain from semi-structured and unstructured documents. It infers new facts from the extracted facts, provides ways of verifying them, and thus becomes a source of knowledge about the domain that can be queried effectively. The methods and systems can also extract temporal information from unstructured and semi-structured documents, and can find and extract dynamically generated documents from the Deep or Dynamic Web.
20 Citations
23 Claims
1. A method for automatic article and title extraction without templates, comprising:
building a paragraph tree from a page;
building paragraphs from tree nodes that include a table depth determination for each node;
determining the paragraphs that contain a href, url reference;
grouping contiguous href and non-href paragraphs into blocks and categorizing them by size as small, medium and large; and
declaring the largest medium block as large, if there are no large blocks.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
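The grouping and categorization steps of claim 1 can be sketched as follows. This is one possible reading, not the patented method: it treats each maximal run of consecutive href (or non-href) paragraphs as a block, measures block size in characters, and uses the hypothetical thresholds `SMALL_MAX` and `MEDIUM_MAX`, none of which are specified in the claim. The `Block` class and `build_blocks` function are likewise illustrative names.

```python
from dataclasses import dataclass
from itertools import groupby

# Assumed thresholds in characters; the claim does not give concrete values.
SMALL_MAX = 200
MEDIUM_MAX = 1000


@dataclass
class Block:
    is_href: bool
    paragraphs: list
    size: str = ""

    @property
    def length(self):
        return sum(len(text) for text in self.paragraphs)


def categorize(length):
    if length <= SMALL_MAX:
        return "small"
    if length <= MEDIUM_MAX:
        return "medium"
    return "large"


def build_blocks(paragraphs):
    """paragraphs: list of (text, has_href) tuples in page order."""
    blocks = []
    # groupby yields one group per run of consecutive paragraphs with the same flag.
    for is_href, run in groupby(paragraphs, key=lambda p: p[1]):
        block = Block(is_href, [text for text, _ in run])
        block.size = categorize(block.length)
        blocks.append(block)

    # Fallback from the claim: with no large block, promote the largest medium one.
    if not any(b.size == "large" for b in blocks):
        mediums = [b for b in blocks if b.size == "medium"]
        if mediums:
            max(mediums, key=lambda b: b.length).size = "large"
    return blocks
```

Under these assumptions, a page whose longest run of non-href text is only medium-sized still ends up with exactly one large block, which is the natural candidate for the article body.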
15. A method for extraction of people's names, the positions they have with companies, company names and their quotes from an article, comprising:
building a list of paragraphs; and
applying island grammar to extract quadruples that include at least one of person, position, company and quote.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23
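The idea behind claim 15 is to parse only the "islands" of interest (a quote followed by its attribution) and skip the surrounding text. A rough sketch is shown below. It approximates an island grammar with a single regular expression rather than a real grammar, and every pattern in it (`NAME`, `POSITION`, `COMPANY`, `ISLAND`), the `Quadruple` type, and the `extract_quadruples` function are assumptions made for illustration, covering only sentences of the form `"...", said Person, Position of Company`.

```python
import re
from typing import NamedTuple, Optional


class Quadruple(NamedTuple):
    person: Optional[str]
    position: Optional[str]
    company: Optional[str]
    quote: Optional[str]


# Assumed, deliberately narrow patterns; a real system would need far richer ones.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"
POSITION = r"(?:CEO|CFO|CTO|president|chairman|vice president|director|analyst)"
COMPANY = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

# The island: a quoted span, a reporting verb, a name, and an optional
# "position of/at company" attribution. Everything else is skipped.
ISLAND = re.compile(
    rf'"(?P<quote>[^"]+)"\s*,?\s*(?:said|says)\s+(?P<person>{NAME})'
    rf'(?:,\s*(?P<position>{POSITION})\s+(?:of|at)\s+(?P<company>{COMPANY}))?'
)


def extract_quadruples(paragraphs):
    """Return one Quadruple per matched island; missing parts are None."""
    results = []
    for paragraph in paragraphs:
        for match in ISLAND.finditer(paragraph):
            results.append(
                Quadruple(
                    person=match.group("person"),
                    position=match.group("position"),
                    company=match.group("company"),
                    quote=match.group("quote").rstrip(" ,."),
                )
            )
    return results
```

Because the position and company part is optional, a quadruple may carry only some of the four fields, in line with the claim's "at least one of" wording.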
Specification