Web forum crawler
Abstract
A crawling system crawls a web site initially in a pattern detection phase and subsequently in a pattern usage phase. The pattern detection phase attempts to identify patterns of references to pages that contain informational content of interest and patterns of references to pages that contain little informational content of interest. During the pattern usage phase, the crawling system crawls the web site. When the crawling system encounters a reference contained on an accessed page, the crawling system determines whether the reference matches a reference pattern. If the reference matches a reference pattern associated with pages that contain informational content of interest, the crawling system accesses the referenced page. If, however, the reference matches a reference pattern of pages with little informational content, then the crawling system discards that reference without accessing the referenced page.
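The decision made during the pattern usage phase can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. It assumes reference patterns are stored as regular expressions and were already labeled during the pattern detection phase. The names `ReferencePattern` and `should_access` and the example references are hypothetical. The abstract does not say how to treat a reference that matches no pattern, so the sketch exposes that choice as a `default` parameter.

```python
import re
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReferencePattern:
    """A reference pattern produced by the detection phase (hypothetical form)."""
    regex: str            # assumption: the pattern is a regular expression
    informational: bool   # True if the group's pages held content of interest


def should_access(reference: str,
                  patterns: Sequence[ReferencePattern],
                  default: bool = True) -> bool:
    """Return True if the crawler should access the referenced page."""
    for pattern in patterns:
        if re.fullmatch(pattern.regex, reference):
            # Access pages of interest; discard low-content references.
            return pattern.informational
    # Assumption: unmatched references fall back to a caller-chosen default.
    return default


# Hypothetical patterns for an imaginary forum.
patterns = [
    ReferencePattern(r"/forum/thread\?id=\d+", informational=True),
    ReferencePattern(r"/forum/profile\?user=\w+", informational=False),
]
```

With these patterns, `should_access("/forum/thread?id=42", patterns)` selects the thread page for access, while a reference such as `/forum/profile?user=alice` is discarded without fetching the page.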
20 Claims
1. A system for crawling a site having pages, each page having a reference that identifies the page, comprising:
a grouping component that identifies groups of pages with similar content;
a pattern component that identifies a reference pattern of a group based on the references of the pages of the group; and
a decision component that, after encountering a reference that matches a reference pattern, decides whether to access the page of the encountered reference based on characteristics of the pages of the group of the matching reference pattern.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8)
9. A system for detecting web pages of a web site having multiple references, comprising:
a grouping component that identifies groups of web pages with similar content;
a pattern component that identifies a reference pattern of a group based on the references of the web pages of the group; and
a duplicate detection component that, when reference patterns identify a similar set of web pages, indicates that the reference patterns reference multiple reference web pages.
(Dependent claims: 10, 11, 12)
13. A system for determining web page type for web pages of a web forum, comprising:
a grouping component that identifies groups of web pages with similar content; and
a typing component that indicates that web pages within a group having more than an operational threshold number of web pages are operational web pages and that web pages within a group having not more than an informational threshold number of web pages are informational web pages.
(Dependent claims: 14, 15, 16, 17, 18, 19, 20)
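The threshold rule recited for the typing component of claim 13 can be illustrated with a short sketch. It is a hedged example rather than the claimed system. The function name `type_groups`, the dictionary representation of groups, and the threshold values in the usage note are all hypothetical. The claim does not say how to label a group whose size falls between the two thresholds, so the sketch marks such groups as "undetermined" as an assumption.

```python
from typing import Dict, List


def type_groups(groups: Dict[str, List[str]],
                operational_threshold: int,
                informational_threshold: int) -> Dict[str, str]:
    """Label each group of similar pages by the number of pages it contains."""
    types: Dict[str, str] = {}
    for name, pages in groups.items():
        count = len(pages)
        if count > operational_threshold:
            # More than the operational threshold: operational pages.
            types[name] = "operational"
        elif count <= informational_threshold:
            # Not more than the informational threshold: informational pages.
            types[name] = "informational"
        else:
            # Assumption: the claim leaves this middle range unspecified.
            types[name] = "undetermined"
    return types
```

For example, with an operational threshold of 4 and an informational threshold of 2, a group of five pages is labeled operational, a group of one page is labeled informational, and a group of three pages is left undetermined.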
Specification