System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents
Abstract
Provided are methods and systems that extract facts from unstructured documents and build an oracle for various domains. The present invention addresses the problem of efficiently finding and extracting facts about a particular subject domain from semi-structured and unstructured documents, infers new facts from the extracted facts, and provides ways of verifying those facts, thus becoming a source of knowledge about the domain that can be queried effectively. The methods and systems can also extract temporal information from unstructured and semi-structured documents, and can find and extract dynamically generated documents from the Deep or Dynamic Web.
27 Claims
1. A method for crawling the internet to locate pages relevant to an application and thus building a Web Crawler comprising:
- starting from a base set of application-dependent web pages or crystallization points; and
- applying breadth-first recursive crawling.
Dependent claims: 2, 3, 4
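The two steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the function name `crawl_breadth_first`, the `get_links` callback, the `max_pages` cap, and the in-memory `TOY_WEB` graph are all assumptions made for illustration, and the traversal uses an explicit queue rather than literal recursion.

```python
from collections import deque


def crawl_breadth_first(seeds, get_links, max_pages=1000):
    """Visit pages level by level, starting from the seed set.

    seeds     -- iterable of starting pages (the "crystallization points")
    get_links -- hypothetical callback returning the outgoing links of a page
    max_pages -- assumed safety cap, not part of the claim
    """
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()          # FIFO order gives breadth-first traversal
        order.append(page)
        for link in get_links(page):
            if link not in seen:        # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for the web (illustrative data only).
TOY_WEB = {
    "seed-a": ["p1", "p2"],
    "seed-b": ["p2", "p3"],
    "p1": ["p4"],
    "p2": [],
    "p3": ["seed-a"],
    "p4": [],
}

order = crawl_breadth_first(["seed-a", "seed-b"], lambda p: TOY_WEB.get(p, []))
print(order)  # ['seed-a', 'seed-b', 'p1', 'p2', 'p3', 'p4']
```

Both seeds are visited before any page they link to, and pages at distance one are visited before pages at distance two, which is the defining property of a breadth-first crawl.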
5. A method for automatic determination of crawling parameters for crystallization points based crawler comprising:
- applying application-specific ontology to mark relevant page hyperlinks coming out of a page; and
- applying crawling up to a pre-defined depth over relevant links and up to another pre-defined depth over irrelevant links.
Dependent claims: 6, 7, 8, 9
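One possible reading of claim 5 is sketched below. The claim text here does not say how the ontology marks a link or how the two depths interact, so everything concrete is an assumption: the ontology is reduced to a hypothetical set of terms matched against anchor text, and each path carries two hop counters, one per link type, each checked against its own limit.

```python
from collections import deque

# Hypothetical stand-in for an application-specific ontology.
ONTOLOGY_TERMS = {"crawler", "ontology", "extraction"}


def is_relevant(anchor_text, terms=ONTOLOGY_TERMS):
    """Mark a link as relevant if its anchor text mentions an ontology term."""
    return bool(set(anchor_text.lower().split()) & terms)


def crawl_with_depth_limits(seeds, get_links, relevant_depth, irrelevant_depth):
    """Breadth-first crawl with separate depth limits per link type.

    get_links is a hypothetical callback returning (target, anchor_text) pairs.
    """
    seeds = list(seeds)
    queue = deque((page, 0, 0) for page in seeds)
    seen = set(seeds)
    visited = []
    while queue:
        page, rel_hops, irr_hops = queue.popleft()
        visited.append(page)
        for target, anchor_text in get_links(page):
            if is_relevant(anchor_text):
                next_state = (target, rel_hops + 1, irr_hops)
                allowed = rel_hops + 1 <= relevant_depth
            else:
                next_state = (target, rel_hops, irr_hops + 1)
                allowed = irr_hops + 1 <= irrelevant_depth
            if allowed and target not in seen:
                seen.add(target)
                queue.append(next_state)
    return visited


# Illustrative link graph: page -> list of (target, anchor text).
TOY_LINKS = {
    "seed": [("a", "web crawler design"), ("x", "contact us")],
    "a": [("b", "ontology editor")],
    "b": [("c", "fact extraction")],
    "x": [("y", "privacy policy")],
}

visited = crawl_with_depth_limits(
    ["seed"], lambda p: TOY_LINKS.get(p, []), relevant_depth=2, irrelevant_depth=1
)
print(visited)  # ['seed', 'a', 'x', 'b']
```

With a relevant depth of 2 and an irrelevant depth of 1, the crawl follows the relevant chain two hops (`a`, `b`) but stops before `c`, and follows only one irrelevant hop (`x`) without reaching `y`.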
10. A method for building a deep web crawler, comprising:
- utilizing scout crawling rules to collect dynamic pages;
- utilizing an analyzer and extractor to determine underlying structure of queries;
- generating instructions for a harvester, wherein the harvester provides requests to a server and collects available pages from the server.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27
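The analyzer, instruction-generation, and harvester stages of claim 10 might be sketched as follows. This is an assumption-heavy illustration, not the patent's design: the scout stage is represented only by a hard-coded list of dynamic URLs, "query structure" is taken to mean the parameter names and observed values per endpoint, and the function names `analyze_queries`, `generate_instructions`, and `harvest`, along with the `fetch` callback, are hypothetical.

```python
from itertools import product
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def analyze_queries(dynamic_urls):
    """Group scouted URLs by endpoint and record the values seen per parameter."""
    structure = {}
    for url in dynamic_urls:
        parts = urlsplit(url)
        endpoint = (parts.scheme, parts.netloc, parts.path)
        params = structure.setdefault(endpoint, {})
        for name, value in parse_qsl(parts.query):
            params.setdefault(name, set()).add(value)
    return structure


def generate_instructions(structure):
    """Expand each endpoint's parameter table into concrete request URLs."""
    requests = []
    for (scheme, netloc, path), params in structure.items():
        names = sorted(params)
        value_lists = [sorted(params[name]) for name in names]
        for combo in product(*value_lists):
            query = urlencode(list(zip(names, combo)))
            requests.append(urlunsplit((scheme, netloc, path, query, "")))
    return requests


def harvest(requests, fetch):
    """Send each request through the supplied fetch callable and keep the pages."""
    return {url: fetch(url) for url in requests}


# URLs a scout stage might have collected (illustrative data only).
scouted = [
    "http://example.com/search?cat=books&page=1",
    "http://example.com/search?cat=music&page=2",
]

structure = analyze_queries(scouted)
requests = generate_instructions(structure)
# A stub replaces real network access so the sketch runs offline.
pages = harvest(requests, fetch=lambda url: "<page for " + url + ">")
print(len(pages))  # 4
```

From two scouted URLs the sketch derives one endpoint with two parameters and requests all four value combinations, including two pages the scout never saw. A real harvester would also need rate limiting and error handling, which are omitted here.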
Specification