Computer method and apparatus for collecting people and organization information from Web sites
Abstract
Computer processing method and apparatus for searching and retrieving Web pages to collect people and organization information are disclosed. A Web site of potential interest is accessed. A subset of Web pages from the accessed site is determined for processing. According to the types of contents found on a subject Web page, extraction of people and organization information is enabled. Internal links of a Web site are collected and recorded in a links-to-visit table. To avoid duplicate processing of Web sites, unique identifiers or Web site signatures are utilized. Respective time thresholds (time-outs) for processing a Web site and for processing a Web page are employed. A database is maintained for storing indications of domain URLs, names of respective owners of the URLs as identified from the corresponding Web sites, the type of each Web site, processing frequencies, dates and outcomes of the last processing, the size of each domain, and the number of data items found in the last processing of each Web site.
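The crawl behavior summarized in the abstract (a links-to-visit table, Web site signatures to avoid duplicate processing, and separate time-outs per site and per page) might be sketched as follows. This is an illustrative sketch only, not the disclosed implementation: the names `site_signature`, `internal_links`, and `crawl_site`, the SHA-256 signature over the home page, the regular expression for links, and the default time-out values are all assumptions, and `fetch` is a caller-supplied function rather than a real library call.

```python
import hashlib
import re
import time
from collections import deque
from urllib.parse import urljoin, urlparse

# Simplistic link pattern, assumed for illustration; a real crawler would parse HTML.
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def site_signature(home_page_html):
    """Hypothetical signature: hash of the whitespace-normalized home page."""
    normalized = " ".join(home_page_html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def internal_links(page_url, html):
    """Return absolute links on the page that stay on the same host."""
    host = urlparse(page_url).netloc
    links = []
    for href in HREF_RE.findall(html):
        absolute = urljoin(page_url, href).split("#")[0]  # drop fragments
        if urlparse(absolute).netloc == host:
            links.append(absolute)
    return links


def crawl_site(start_url, fetch, seen_signatures,
               site_timeout=60.0, page_timeout=5.0):
    """Visit internal pages of one site; return the URLs processed.

    fetch(url, timeout) -> str is supplied by the caller (assumption).
    seen_signatures is a set shared across sites to skip duplicates.
    """
    deadline = time.monotonic() + site_timeout  # per-site time-out
    links_to_visit = deque([start_url])         # the links-to-visit table
    queued = {start_url}
    visited = []
    while links_to_visit and time.monotonic() < deadline:
        url = links_to_visit.popleft()
        try:
            html = fetch(url, page_timeout)     # per-page time-out
        except OSError:
            continue                            # skip pages that fail or time out
        if not visited:                         # first page fetched is the home page
            signature = site_signature(html)
            if signature in seen_signatures:
                return []                       # same site seen under another URL
            seen_signatures.add(signature)
        visited.append(url)
        for link in internal_links(url, html):
            if link not in queued:
                queued.add(link)
                links_to_visit.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client and lets the traversal be exercised against an in-memory dictionary of pages.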
25 Claims
1. A method for collecting people and organization information from Web sites in a global computer network comprising the steps of:
accessing a Web site of potential interest, the Web site having a plurality of Web pages;
determining a subset of the plurality of Web pages to process; and
for each Web page in the subset, (i) determining types of contents found on the Web page, and (ii) based on the determined content types, enabling extraction of people and organization information from the Web page.
14. Apparatus for collecting people and organization information from Web sites in a global computer network comprising:
a domain database storing respective domain names of Web sites of potential interest; and
computer processing means coupled to the domain database, the computer processing means:
(a) obtaining from the domain database, domain name of a Web site of potential interest and accessing the Web site, the Web site having a plurality of Web pages;
(b) determining a subset of the plurality of Web pages to process; and
(c) for each Web page in the subset, the computer processing means (i) determining types of contents found on the Web page, and (ii) based on the determined content types, enabling extraction of people and organization information from the Web page.
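Steps (a) through (c) might be illustrated by the sketch below, which selects a subset of pages, determines content types for each, and enables only the matching extractors. It is a hedged illustration rather than the claimed apparatus: the claims do not say how the subset or the content types are determined, so the URL hints, the keyword cues, the `DomainRecord` fields shown, and the function names are all assumptions, and the extractors are supplied by the caller.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Assumed URL path hints used to pick the subset of pages to process.
URL_HINTS = ("about", "team", "management", "contact", "press")

# Assumed keyword cues for two content types; purely illustrative.
CONTENT_CUES = {
    "people": re.compile(r"management team|board of directors|our staff", re.IGNORECASE),
    "organization": re.compile(r"about us|company profile|headquarters", re.IGNORECASE),
}


@dataclass
class DomainRecord:
    """A few of the per-domain fields mentioned in the abstract (assumed layout)."""
    domain_url: str
    owner_name: str = ""
    last_outcome: str = ""
    items_found: int = 0


def select_subset(pages):
    """Step (b): keep pages whose URL path contains one of the hints."""
    return [url for url in pages
            if any(hint in urlparse(url).path.lower() for hint in URL_HINTS)]


def content_types(html):
    """Step (c)(i): return the set of content-type labels whose cue matches."""
    return {label for label, cue in CONTENT_CUES.items() if cue.search(html)}


def process_site(record, pages, extractors):
    """Step (c)(ii): run only the extractors enabled by the detected types.

    pages maps URL -> HTML text; extractors maps label -> callable(html) -> list.
    """
    results = []
    for url in select_subset(pages):
        html = pages[url]
        for label in sorted(content_types(html)):
            extractor = extractors.get(label)
            if extractor is not None:
                results.extend(extractor(html))
    # Record the outcome of this processing run on the domain record.
    record.items_found = len(results)
    record.last_outcome = "ok" if results else "empty"
    return results
```

Keeping type detection separate from extraction mirrors the claim language, in which the determined content types decide whether extraction is enabled for a page at all.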
Specification