Method for learning character patterns to interactively control the scope of a web crawler
First Claim
1. A computer implemented method for searching a network for resources, where each resource has an associated address specified as a character string and resources are connected by links in the form of the addresses, comprising:
- searching the network to locate an initial set of resources in accordance with a defined scope;
receiving data identifying positive and negative examples from the initial set of resources;
inferring a rule from the positive and negative examples to limit the scope wherein the rule comprises patterns of character strings representing addresses; and
performing a subsequent search of the network according to the scope as limited by the inferred rule to locate a subsequent set of resources.
4 Assignments
0 Petitions
Accused Products
Abstract
A method controls a Web search for server computer resources by an end-user Web crawler. Each resource, such as a Web page, is located by a resource address specified as a character string. The end-user defines a scope for an initial Web search by settings. The settings are used to search the Web for resources limited by the scope. The set of resources located during the search are rendered on output device, and positive and negative examples are selected from the set of resources to infer a rule. The rule is displayed, as well as a subset of resources that match on the rule. The selecting, inferring, and rendering steps are repeated while searching until a final rule is obtained. The rule matches resources that the crawler should process and does not match resource that it should avoid.
-
Citations
66 Claims
-
1. A computer implemented method for searching a network for resources, where each resource has an associated address specified as a character string and resources are connected by links in the form of the addresses, comprising:
-
searching the network to locate an initial set of resources in accordance with a defined scope;
receiving data identifying positive and negative examples from the initial set of resources;
inferring a rule from the positive and negative examples to limit the scope wherein the rule comprises patterns of character strings representing addresses; and
performing a subsequent search of the network according to the scope as limited by the inferred rule to locate a subsequent set of resources. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
matching candidate address patterns with addresses of as many as possible negative examples and no positive examples in a second phase.
-
-
7. The method of claim 6 including matching the rule and the patterns to determine correlation scores.
-
8. The method of claim 7 including selecting patterns to maximize the correlation scores.
-
9. The method of claim 8 including partitioning the address of each resource into tokens at transition points.
-
10. The method of claim 9 wherein the characters of each address are assigned to equivalence classes, and transitions points are between characters assigned to different equivalence classes except when a single uppercase character is followed by a sequence of lowercase characters.
-
11. The method of claim 9 wherein the first, last, and all non-alphanumeric tokens are designated preferred tokens whereby the correlation score is increased if a particular pattern includes preferred tokens.
-
12. The method of claim 6 wherein the matching is done character by character.
-
13. The method of claim 6 wherein the matching is done token by token.
-
14. The method of claim 1 wherein the resources are Web pages.
-
15. The method of claim 1, including rendering the rule and information of a subset of the initial set of resources that match the rule, wherein the subset of the initial set of resources includes resources other than the positive examples.
-
16. The method of claim 15, including storing the subsequent set of resources as the initial set of resources and repeating the receiving, inferring, rendering, and performing to allow iterative refinement of the rule.
-
17. The method of claim 15, including specifying a scope modifier for the rule.
-
18. The method of claim 17 wherein the scope modifier is stored as a first integer value to be applied to the positive examples and a second integer value applied to the negative examples.
-
19. The method of claim 17, including storing the subsequent set of resources as the initial set of resources and repeating the receiving, inferring, rendering, specifying, and performing to allow iterative refinement of the rule.
-
20. The method of claim 17, including storing the subsequent set of resources as the initial set of resources and repeating the rendering, specifying, and performing to allow iterative refinement of the rule.
-
21. The method of claim 15, including, prior to performing the subsequent search, enabling a user to edit the rule to produce an edited rule;
- and
wherein the subsequent search is performed according to the scope as limited by the edited rule to locate the subsequent set resources.
- and
-
22. The method of claim 21, wherein the enabling includes enabling the user to add and delete positive examples and negative examples.
-
23. The method of claim 22, including storing the subsequent set of resources as the initial set of resources and repeating the inferring, rendering, enabling, and performing to allow iterative refinement of the rule.
-
24. The method of claim 21, including storing the subsequent set of resources as the initial set of resources and repeating the rendering, enabling, and performing to allow iterative refinement of the rule.
-
25. The method of claim 15 including rendering resource specific information including a title, an address, and a summary of the resource.
-
26. The method of claim 15 wherein the network connects client computers and server computers, and the client computer performs the searching, receiving, inferring, rendering, and performing and the server computers store the resources at the associate addresses.
-
27. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
-
search instructions for searching a network to locate an initial set of resources in accordance with a defined scope wherein each resource has an associated address specified as a character string;
rule generating instructions for receiving data identifying positive and negative examples from the initial set of resources, and for inferring a rule from the positive and negative examples to limit the scope wherein the rule comprises patterns of character strings representing addresses; and
subsequent search instructions for performing a subsequent search of the network according to the scope as limited by the inferred rule to locate a subsequent set resources. - View Dependent Claims (28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46)
second phase matching instructions for matching candidate address patterns with addresses of as many as possible negative examples and no positive examples.
-
-
33. The computer program product of claim 32 including correlation instructions for matching the rule and the patterns to determine correlation scores and selecting patterns to maximize the correlation scores.
-
34. The computer program product of claim 33 including token instructions for partitioning the address of each resource into tokens at transition points wherein the first, last, and all non-alphanumeric tokens are designated preferred tokens whereby the correlation score is increased if a particular pattern includes preferred tokens.
-
35. The computer program product of claim 32 wherein the first phase matching instructions and the second phase matching instructions match candidate address patterns with addresses character by character.
-
36. The computer program product of claim 32 wherein the first phase matching instructions and the second phase matching instructions match candidate address patterns with addresses token by token.
-
37. The computer program product of claim 27 including rendering instructions for rendering the rule and information of a subset of the initial set of resources that match the rule, wherein the subset of the initial set of resources includes resources other than the positive examples.
-
38. The computer program product of claim 37 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions and subsequent search instructions.
-
39. The computer program product of claim 37 including modifier instructions for receiving a scope modifier for the rule wherein the scope modifier is stored as a first integer value to be applied to the positive examples and a second integer value applied to the negative examples.
-
40. The computer program product of claim 39 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions, modifier instructions, and subsequent search instructions.
-
41. The computer program product of claim 39 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rendering instructions, modifier instructions, and subsequent search instructions.
-
42. The computer program product of claim 37 including user editing instructions to allow the user to edit the rule to produce an edited rule;
- and
wherein the subsequent search instructions perform a subsequent search of the network according to the scope as limited by the edited rule to locate a subsequent set resources.
- and
-
43. The computer program product of claim 42 wherein the user editing instructions include instructions to allow the user to add and delete positive examples and negative examples.
-
44. The computer program product of claim 43 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions, user editing instructions, and subsequent search instructions.
-
45. The computer program product of claim 42 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rendering instructions, user editing instructions, and subsequent search instructions.
-
46. The computer program product of claim 37 wherein the rendering instructions include instructions for rendering resource specific information including a title, an address, and a summary of the resource.
-
47. A computer system comprising:
-
a central processing unit (CPU) for executing instructions; and
program instructions, stored in a memory and executable by the CPU, comprising;
search instructions for searching a network to locate an initial set of resources in accordance with a defined scope wherein each resource has an associated address specified as a character string;
rule generating instructions for receiving data identifying positive and negative examples from the initial set of resources, and for inferring a rule from the positive and negative examples to limit the scope wherein the rule comprises patterns of character strings representing addresses; and
subsequent search instructions for performing a subsequent search of the network according to the scope as limited by the inferred rule to locate a subsequent set resources. - View Dependent Claims (48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66)
second phase matching instructions for matching candidate address patterns with addresses of as many as possible negative examples and no positive examples.
-
-
53. The computer system of claim 52 including correlation instructions for matching the rule and the patterns to determine correlation scores and selecting patterns to maximize the correlation scores.
-
54. The computer system of claim 53 including token instructions for partitioning the address of each resource into tokens at transition points wherein the first, last, and all non-alphanumeric tokens are designated preferred tokens whereby the correlation score is increased if a particular pattern includes preferred tokens.
-
55. The computer system of claim 52 wherein the first phase matching instructions and the second phase matching instructions match candidate address patterns with addresses character by character.
-
56. The computer system of claim 52 wherein the first phase matching instructions and the second phase matching instructions match candidate address patterns with addresses token by token.
-
57. The computer system of claim 47 including a display device and rendering instructions for rendering the rule and information of a subset of the initial set of resources that match the rule on the display device, wherein the subset of the initial set of resources includes resources other than the positive examples.
-
58. The computer system of claim 57 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions and subsequent search instructions.
-
59. The computer system of claim 57 including modifier instructions for receiving a scope modifier for the rule wherein the scope modifier is stored as a first integer value to be applied to the positive examples and a second integer value applied to the negative examples.
-
60. The computer system of claim 59 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions, modifier instructions, and subsequent search instructions.
-
61. The computer system of claim 59 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rendering instructions, modifier instructions, and subsequent search instructions.
-
62. The computer system of claim 57 including user editing instructions to allow the user to edit the rule to produce an edited rule;
- and
wherein the subsequent search instructions perform a subsequent search of the network according to the scope as limited by the edited rule to locate a subsequent set resources.
- and
-
63. The computer system of claim 62 wherein the user editing instructions include instructions to allow the user to add and delete positive examples and negative examples.
-
64. The computer system of claim 63 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rule generating instructions, rendering instructions, user editing instructions, and subsequent search instructions.
-
65. The computer system of claim 62 including storing instructions for storing the subsequent set of resources as the initial set of resources and rule refinement instructions for repeating the rendering instructions, user editing instructions, and subsequent search instructions.
-
66. The computer system of claim 57 wherein the rendering instructions include instructions for rendering resource specific information including a title, an address, and a summary of the resource.
Specification