System and method for facts extraction and domain knowledge repository creation from unstructured and semi-structured documents

Provided are methods and systems that extract facts from unstructured documents and build an oracle for various domains. The present invention addresses the problem of efficiently finding and extracting facts about a particular subject domain from semi-structured and unstructured documents. It infers new facts from the extracted facts, provides ways of verifying those facts, and thus becomes a source of knowledge about the domain that can be queried effectively. The methods and systems can also extract temporal information from unstructured and semi-structured documents, and can find and extract dynamically generated documents from the Deep or Dynamic Web.

27 Claims

1. A method for crawling the internet to locate pages relevant to an application and thus building a Web Crawler comprising:
- starting from a base set of application-dependent web pages or crystallization points; and
- applying breadth-first recursive crawling.
- Dependent Claims (2, 3, 4)
  - 2. The method of claim 1, wherein a specific dictionary of positive keywords is utilized to mark relevant page hyperlinks.
  - 3. The method of claim 2, wherein crawling is performed to a pre-defined depth over relevant links and to a pre-defined depth over irrelevant links.
  - 4. The method of claim 1, wherein associated pages are added to the list of crystallization points if they have more than a pre-defined number of relevant links, or fewer than that number but are connected to an "important" page, or contain an application-relevant fact.
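The crawl described in claims 1 to 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `get_links` callback, the parameter names, and the rule that a link is relevant when its text contains a positive keyword are all hypothetical, and depth is assumed to be counted in hops from the nearest crystallization point.

```python
from collections import deque


def crawl(crystallization_points, get_links, positive_keywords,
          relevant_depth=3, irrelevant_depth=1):
    """Breadth-first crawl starting from the crystallization points.

    get_links(url) is a hypothetical callback returning (target_url, link_text)
    pairs. A link counts as relevant if its text contains a positive keyword
    (an assumption). Relevant links are followed up to relevant_depth hops,
    irrelevant links only up to irrelevant_depth hops.
    """
    queue = deque((cp, 0) for cp in crystallization_points)
    visited = set(crystallization_points)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        for target, text in get_links(url):
            if target in visited:
                continue
            relevant = any(k in text.lower() for k in positive_keywords)
            limit = relevant_depth if relevant else irrelevant_depth
            if depth + 1 > limit:
                continue  # link lies beyond the allowed depth for its kind
            visited.add(target)
            queue.append((target, depth + 1))
    return order


# Toy link graph standing in for real pages (illustrative data only).
toy_graph = {
    "cp": [("a", "Conference speakers"), ("b", "Privacy policy")],
    "a": [("c", "Speakers bio"), ("d", "Contact")],
    "b": [("e", "Speakers")],
    "c": [("f", "Speakers list")],
}
visited_order = crawl(["cp"], lambda u: toy_graph.get(u, []), {"speakers"},
                      relevant_depth=2, irrelevant_depth=1)
print(visited_order)
```

In this toy run, page "d" is skipped because it sits behind an irrelevant link at depth 2, and page "f" is skipped because it lies beyond the relevant depth.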

5. A method for automatic determination of crawling parameters for a crystallization-point-based crawler, comprising:
- applying application-specific ontology to mark relevant page hyperlinks coming out of a page; and
- applying crawling up to a pre-defined depth over relevant links and up to another pre-defined depth over irrelevant links.
- Dependent Claims (6, 7, 8, 9)
  - 6. The method of claim 5, wherein application-specific oracles are applied to determine relevant pages.
  - 7. The method of claim 5, wherein a navigation graph consisting of paths leading from crystallization points to the terminal nodes of search is built.
  - 8. The method of claim 7, wherein positive and negative keywords leading to relevant and irrelevant pages correspondingly are determined.
  - 9. The method of claim 7, wherein the navigation graph is used to calculate forced and maximum depth parameters and the navigation rules graph.
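Claim 8 does not say how the positive and negative keywords are determined. One simple heuristic, assumed here purely for illustration, is to treat words that occur only on links leading to relevant pages as positive and words that occur only on links leading to irrelevant pages as negative. The function name and input format below are hypothetical.

```python
def split_keywords(labelled_links):
    """Derive keyword sets from links of a navigation graph.

    labelled_links is an iterable of (link_text, leads_to_relevant_page)
    pairs. The set-difference rule is an assumed heuristic, not a rule
    taken from the claims.
    """
    seen_relevant, seen_irrelevant = set(), set()
    for text, is_relevant in labelled_links:
        words = set(text.lower().split())
        (seen_relevant if is_relevant else seen_irrelevant).update(words)
    positive = seen_relevant - seen_irrelevant
    negative = seen_irrelevant - seen_relevant
    return positive, negative


# Illustrative data only.
positive, negative = split_keywords([
    ("Our Speakers", True),
    ("Speakers archive", True),
    ("Our Privacy Policy", False),
])
print(sorted(positive), sorted(negative))
```

Words such as "our" that appear on both kinds of link end up in neither set, which keeps uninformative terms out of the dictionaries.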

10. A method for building a deep web crawler, comprising:
- utilizing scout crawling rules to collect dynamic pages;
- utilizing an analyzer and extractor to determine underlying structure of queries;
- generating instructions for a harvester, wherein the harvester provides requests to a server and collects available pages from the server.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27)
  - 11. The method of claim 10, wherein the extractor extracts unstructured and semi-structured information from the collected dynamic pages and converts the collected information into a structured form.
  - 12. The method of claim 10, wherein the scout crawling rules are divided into rules dealing with static pages and rules dealing with dynamic pages.
  - 13. The method of claim 12, wherein a plurality of questions is selected to cover all possible patterns of the dynamic pages produced by a server, to allow the analyzer and the harvester to create exhaustive enumerations of questions that generate all dynamic pages that the server can produce.
  - 14. The method of claim 10, wherein all controls belonging to the same run are mapped to valid controls in a valid form.
  - 15. The method of claim 14, wherein controls are valid if their description contains one of the positive keywords and does not contain any of the negative keywords.
  - 16. The method of claim 14, wherein a mapping of the rules in the same run to the valid controls generates a bipartite graph.
  - 17. The method of claim 16, wherein the scout enumerates all possible one-to-one pairs of the rules and controls in the bipartite graph.
  - 18. The method of claim 17, wherein each map generates random choices of options and inputs for text control.
  - 19. The method of claim 10, wherein the analyzer takes a set of pages created by the scout crawling and builds a set of rules for the harvester.
  - 20. The method of claim 19, wherein pages generated by the scout crawling are pushed through the extractor, facts are extracted from the pages and are stored in a database.
  - 21. The method of claim 20, wherein pages extracted by the scout crawler represent a navigation graph stored in the database.
  - 22. The method of claim 20, wherein the navigation graph is a union of equivalency classes of paths crawled by scout from the form page to the dynamic pages extracted by scout.
  - 23. The method of claim 10, wherein the extractor is a hybrid system.
  - 24. The method of claim 10, wherein the crawl search is organized as a breadth-first search with depth and valences of URLs balanced to provide that an overall size of a search graph is limited by a pre-defined number.
  - 25. The method of claim 10, wherein first and second sets of links are provided, wherein the first links contain positive keywords and do not contain negative keywords in the URL itself or in the description of the URL, and the second links are randomly selected.
  - 26. The method of claim 25, wherein links from the first set are used as soon as the size of the crawl graph is within a limit defined independently of the distance from the crystallization point (CP).
  - 27. The method of claim 25, wherein links from the second set are used when a distance from the CP does not exceed a predefined number.
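Claims 15 to 17 can be illustrated with a short sketch. It makes several assumptions that the claims do not state: each rule carries its own positive and negative keyword sets, an edge of the bipartite graph exists when a control's description is valid for a rule, and "one-to-one pairs" means assignments that give every rule a distinct control. All names and sample data are hypothetical.

```python
from itertools import permutations


def is_valid_control(description, positive_keywords, negative_keywords):
    """Claim 15: valid if the description contains at least one positive
    keyword and none of the negative keywords."""
    text = description.lower()
    return (any(k in text for k in positive_keywords)
            and not any(k in text for k in negative_keywords))


def one_to_one_maps(rules, controls, compatible):
    """Yield every assignment of a distinct control to each rule.

    compatible(rule, control) defines the edges of the bipartite graph.
    Brute-force enumeration is used for clarity; it is only practical
    for the small forms assumed here.
    """
    for chosen in permutations(controls, len(rules)):
        if all(compatible(r, c) for r, c in zip(rules, chosen)):
            yield dict(zip(rules, chosen))


# Illustrative rules: name -> (positive keywords, negative keywords).
rules = {
    "name_rule": ({"name"}, {"user"}),
    "city_rule": ({"city", "location"}, set()),
}
# Illustrative form controls: id -> description.
controls = {"q1": "Company name", "q2": "City or location", "q3": "User name"}


def compatible(rule, control):
    pos, neg = rules[rule]
    return is_valid_control(controls[control], pos, neg)


maps = list(one_to_one_maps(list(rules), list(controls), compatible))
print(maps)
```

Here the control "User name" is rejected for `name_rule` because it contains the negative keyword "user", so only one mapping survives.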

Current Assignee: Glenbrook Networks
Original Assignee: Glenbrook Networks
Inventors: Komissarchik, Julia; Komissarchik, Edward
Primary Examiner(s): Holmes, Michael B
Application Number: US11/152,689
Time in Patent Office: 1,254 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes: G06F 16/345 Summarisation for human users
