Web forum crawler
Abstract
A crawling system crawls a web site initially in a pattern detection phase and subsequently in a pattern usage phase. The pattern detection phase attempts to identify patterns of references to pages that contain informational content of interest and patterns of references to pages that contain little informational content of interest. During the pattern usage phase, the crawling system crawls the web site. When the crawling system encounters a reference contained on an accessed page, the crawling system determines whether the reference matches a reference pattern. If the reference matches a reference pattern associated with pages that contain informational content of interest, the crawling system accesses the referenced page. If, however, the reference matches a reference pattern of pages with little informational content, then the crawling system discards that reference without accessing the referenced page.
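The pattern usage phase described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regular expressions, the `should_access` and `crawl` names, and the in-memory `site` dictionary (standing in for fetching pages) are all assumptions made for the example. The abstract does not say what happens to a reference that matches no pattern; the sketch discards it.

```python
import re
from collections import deque

# Hypothetical reference patterns, as if produced by a pattern detection phase.
INFORMATIONAL = [r"/thread/\d+"]
LOW_VALUE = [r"/profile/\d+", r"/login.*"]


def should_access(reference, informational=INFORMATIONAL, low_value=LOW_VALUE):
    """Decide whether to access the page behind an encountered reference."""
    if any(re.fullmatch(p, reference) for p in low_value):
        return False  # matches a pattern of pages with little informational content
    if any(re.fullmatch(p, reference) for p in informational):
        return True   # matches a pattern of pages with content of interest
    return False      # assumption: references matching no pattern are discarded


def crawl(site, start):
    """Breadth-first crawl over `site`, a dict mapping each reference to the
    list of references found on that page (a stand-in for fetching)."""
    seen = {start}
    queue = deque([start])
    accessed = []
    while queue:
        ref = queue.popleft()
        accessed.append(ref)
        for link in site.get(ref, []):
            if link in seen:
                continue
            seen.add(link)
            if should_access(link):
                queue.append(link)  # discarded references are never accessed
    return accessed


site = {
    "/index": ["/thread/1", "/profile/7", "/thread/2"],
    "/thread/1": ["/thread/2", "/login?next=/thread/1"],
    "/thread/2": [],
}
print(crawl(site, "/index"))
```

In this toy site, the profile and login references are dropped as soon as they are encountered, so only the index and the two thread pages are accessed.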
14 Claims
1. A system with a processor and memory for crawling a site having pages, each page having a reference that identifies the page, each reference having tokens, comprising:
a grouping component that identifies groups of pages with similar content;
a pattern component that identifies a reference pattern of a group based on the references of the pages of the group, the reference pattern being identified by analyzing the tokens of the references of the pages of the group to identify sequences of tokens indicating a pattern of tokens within the references; and
a decision component that, after encountering a reference that matches a reference pattern when crawling the site, decides whether to access the page of the encountered reference based on characteristics of the pages of the group of the matching reference pattern
wherein the components are implemented as computer-executable instructions stored in the memory for execution by the processor.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system with a processor and memory for detecting web pages of a web site having multiple uniform resource locators (URLs) that reference each web page, each URL having tokens, comprising:
a grouping component that identifies groups of web pages with similar content;
a pattern component that identifies a URL pattern of a group based on the URLs that reference the web pages of the group, the URL pattern being identified by analyzing the tokens of the URLs of the web pages of the group to identify sequences of tokens indicating a pattern of tokens within the URLs;
a duplicate detection component that, when multiple URL patterns identify a similar set of web pages, indicates that the URL patterns reference multiple URL web pages wherein only one URL pattern of the multiple URL patterns that reference a multiple URL web page is selected for use when crawling the web site so that URLs that match the non-selected URL patterns are not followed based on the match; and
a crawling component that crawls the web site such that, when encountering on a web page of the web site a URL that matches a URL pattern, decides whether to access the web page referenced by the encountered URL based on characteristics of the web pages of the group of the matching URL pattern
wherein the components are implemented as computer-executable instructions stored in the memory for execution by the processor.
Dependent claims: 10, 11
12. A system with a processor and memory for crawling a web site having web pages, each web page having a URL reference that identifies the web page, each URL reference having characters, comprising:
a group of informational web pages of the web site, the informational web pages being pages that are to be visited when crawling the web site;
a pattern component that identifies a URL reference pattern of the group of informational web pages of the web site based on the references of the web pages of the group, the URL reference pattern of a group representing a pattern of characters of URL references of the web pages of the group, the URL reference pattern including a wildcard character; and
a crawling component that crawls the web site by
  encountering URL references on web pages of the web site that reference other web pages of the web site;
  after encountering a URL reference, determining whether the encountered URL reference matches the identified URL reference pattern based in part on the wildcard character of the URL reference pattern;
  when the encountered URL reference matches the identified URL reference pattern, following the encountered URL reference to retrieve the web page identified by the encountered URL reference pattern; and
  when the encountered URL reference does not match the identified URL reference pattern, not following the encountered URL references so that only encountered URL references that match the URL reference pattern of informational web pages are followed
wherein the components are implemented as computer-executable instructions stored in the memory for execution by the processor.
Dependent claims: 13, 14
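As an illustration of how a URL pattern with a wildcard might be derived from the tokens of a group's URLs and then used for matching, the following sketch may help. The claims do not fix a particular algorithm, so everything here is an assumption for the example: the delimiter set used for tokenizing, the position-by-position comparison, the `*` wildcard representation, and the function names `tokenize`, `derive_pattern`, and `matches`.

```python
import re


def tokenize(url):
    """Split a URL into tokens, keeping each delimiter as its own token."""
    return [t for t in re.split(r"([/?&=.])", url) if t]


def derive_pattern(urls):
    """Return a token pattern for a group of URLs, or None if none is found.

    Simplification: only groups whose URLs all have the same number of
    tokens are handled. Positions where every URL agrees keep the token;
    positions that vary become the wildcard "*".
    """
    token_lists = [tokenize(u) for u in urls]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None
    return ["*" if len(set(column)) > 1 else column[0]
            for column in zip(*token_lists)]


def matches(pattern, url):
    """Check a URL against a token pattern; "*" matches any single token."""
    tokens = tokenize(url)
    return len(tokens) == len(pattern) and all(
        p == "*" or p == t for p, t in zip(pattern, tokens)
    )


group = ["/showthread.php?t=101", "/showthread.php?t=2057"]
pattern = derive_pattern(group)
print("".join(pattern))
print(matches(pattern, "/showthread.php?t=9"))
print(matches(pattern, "/member.php?u=9"))
```

A crawler following the last claim would follow only those encountered URLs for which `matches` returns true for the pattern of the informational group.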
Specification