Computer method and apparatus for collecting people and organization information from Web sites

US 6,983,282 B2
Filed: 03/30/2001
Issued: 01/03/2006
Est. Priority Date: 07/31/2000
Status: Active Grant

First Claim

1. A method for collecting people and organization information from Web sites in a global computer network comprising the steps of:

accessing a Web site of potential interest, the Web site having a plurality of Web pages;

determining a subset of the plurality of Web pages to process; and

for each Web page in the subset, (i) determining types of contents found on the Web page, and (ii) based on the determined content types, enabling extraction of people and organization information from the Web page.
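
The three steps of this claim can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Page` type, the URL keywords, the content-type labels (`press_release`, `management_bios`), and the `extractors` mapping are all hypothetical, since the claim names no specific content types or selection rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class Page:
    url: str
    text: str


def select_subset(pages: Sequence[Page],
                  keywords: Tuple[str, ...] = ("about", "team", "press", "news")) -> List[Page]:
    """Step 2: keep only pages whose URL contains a keyword (assumed rule)."""
    return [p for p in pages if any(k in p.url.lower() for k in keywords)]


def classify(page: Page) -> List[str]:
    """Step 3(i): determine content types with naive phrase tests (assumed labels)."""
    lowered = page.text.lower()
    types = []
    if "press release" in lowered:
        types.append("press_release")
    if "management team" in lowered or "board of directors" in lowered:
        types.append("management_bios")
    return types


def collect(pages: Sequence[Page],
            extractors: Dict[str, Callable[[Page], list]]) -> list:
    """Step 3(ii): enable the extractor that matches each detected content type."""
    results = []
    for page in select_subset(pages):
        for content_type in classify(page):
            extractor = extractors.get(content_type)
            if extractor is not None:
                results.extend(extractor(page))
    return results
```

The point of the design is that extraction is dispatched on the detected content type, so a page that is never classified is never handed to an extractor.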

7 Assignments

0 Petitions

Abstract

Computer processing method and apparatus for searching and retrieving Web pages to collect people and organization information are disclosed. A Web site of potential interest is accessed. A subset of Web pages from the accessed site are determined for processing. According to types of contents found on a subject Web page, extraction of people and organization information is enabled. Internal links of a Web site are collected and recorded in a links-to-visit table. To avoid duplicate processing of Web sites, unique identifiers or Web site signatures are utilized. Respective time thresholds (time-outs) for processing a Web site and for processing a Web page are employed. A database is maintained for storing indications of domain URLs, names of respective owners of the URLs as identified from the corresponding Web sites, type of each Web site, processing frequencies, dates of last processings, outcomes of last processings, size of each domain and number of data items found in the last processing of each Web site.
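
The links-to-visit table and the two time thresholds described in the abstract can be illustrated with a short sketch. It assumes a caller-supplied `fetch(url)` function returning the page text and its raw `href` values, and an injectable `clock`; these names, the default limits, and the breadth-first visiting order are hypothetical and not taken from the patent. The page time-out here is checked only after the fetch returns, whereas a real crawler would more likely pass a time-out to its HTTP client.

```python
import time
from collections import deque
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urljoin, urlparse


def crawl_site(home_url: str,
               fetch: Callable[[str], Tuple[str, Iterable[str]]],
               site_timeout: float = 60.0,
               page_timeout: float = 5.0,
               clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Visit internal links of one site; return URLs processed within the limits."""
    domain = urlparse(home_url).netloc
    links_to_visit = deque([home_url])   # the links-to-visit table
    seen = {home_url}                    # every link ever queued, to avoid repeats
    processed = []
    site_start = clock()
    while links_to_visit:
        if clock() - site_start > site_timeout:
            break                        # site-level time-out: stop processing the site
        url = links_to_visit.popleft()
        page_start = clock()
        _text, hrefs = fetch(url)
        if clock() - page_start > page_timeout:
            continue                     # page-level time-out: discard this page
        processed.append(url)
        for href in hrefs:
            absolute = urljoin(url, href)
            # Only internal links (same host) are recorded in the table.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                links_to_visit.append(absolute)
    return processed
```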

25 Claims

1. A method for collecting people and organization information from Web sites in a global computer network comprising the steps of:
- accessing a Web site of potential interest, the Web site having a plurality of Web pages;
  
  determining a subset of the plurality of Web pages to process; and
  
  for each Web page in the subset, (i) determining types of contents found on the Web page, and (ii) based on the determined content types, enabling extraction of people and organization information from the Web page.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13):
  - 2. A method as claimed in claim 1 wherein the step of accessing includes determining whether the Web site has previously been accessed for searching for people and organization information.
  - 3. A method as claimed in claim 2 wherein the step of determining whether the Web site has previously been accessed includes:
    - obtaining a unique identifier for the Web site; and
      
      comparing the unique identifier to identifiers of past accessed Web sites to determine duplication of accessing a same Web site.
  - 4. A method as claimed in claim 3 wherein the step of obtaining a unique identifier includes forming a signature as a function of home page of the Web site.
  - 5. A method as claimed in claim 1 wherein the step of determining the subset of Web pages to process includes processing a listing of internal links and selecting from remaining internal links as a function of keywords.
  - 6. A method as claimed in claim 5 wherein the step of determining a subset of Web pages to process includes:
    - extracting from a script a quoted phrase ending in “.ASP”, “.HTM” or “.HTML”; and
      
      treating the extracted phrase as an internal link.
  - 7. A method as claimed in claim 1 wherein the step of determining content types of Web pages includes obtaining the content owner name of the Web site as a whole by using a Bayesian Network and appropriate tests.
  - 8. A method as claimed in claim 1 wherein the step of determining content types of Web pages includes collecting external links that point to other domains and extracting new domain URLs which are added to a domain database.
  - 9. A method as claimed in claim 1 wherein the step of determining the subset of Web pages to process includes determining if a subject Web page contains a listing of press releases, and if so, following each internal link in the listing of press releases.
  - 10. A method as claimed in claim 1 wherein the step of determining the subset of Web pages to process includes determining if a subject Web page contains a listing of news articles, and if so, following each internal link in the listing of news articles.
  - 11. A method as claimed in claim 1 further comprising imposing a time limit for processing a Web site.
  - 12. A method as claimed in claim 1 further comprising imposing a time limit for processing a Web page.
  - 13. A method as claimed in claim 1 further comprising the step of maintaining a domain database storing for each Web site indications of:
    - Web site domain URL;
      
      name of content owner;
      
      site type of the Web site;
      
      frequency at which to access the Web site for processing;
      
      date of last accessing and processing;
      
      outcome of last processing;
      
      number of Web pages processed; and
      
      number of data items found in last processing.
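
The script-based link extraction of claim 6 lends itself to a small regular-expression sketch. The function name `links_from_script`, the case-insensitive matching, and the acceptance of both single and double quotes are assumptions made for illustration; the claim only specifies a quoted phrase ending in ".ASP", ".HTM" or ".HTML".

```python
import re
from typing import List

# A quoted phrase (no spaces or nested quotes) ending in .asp, .htm or .html,
# closed by the same quote character that opened it.
_SCRIPT_LINK = re.compile(r"""(["'])([^"'\s]+\.(?:asp|html?))\1""", re.IGNORECASE)


def links_from_script(script_text: str) -> List[str]:
    """Return quoted phrases in a script that look like internal page links."""
    return [match.group(2) for match in _SCRIPT_LINK.finditer(script_text)]
```

Each returned phrase would then be treated as an internal link, for example by resolving it against the page URL and adding it to the list of links still to be visited.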

14. Apparatus for collecting people and organization information from Web sites in a global computer network comprising:
- a domain database storing respective domain names of Web sites of potential interest; and
  
  computer processing means coupled to the domain database, the computer processing means:
  
  (a) obtaining from the domain database, domain name of a Web site of potential interest and accessing the Web site, the Web site having a plurality of Web pages;
  
  (b) determining a subset of the plurality of Web pages to process; and
  
  (c) for each Web page in the subset, the computer processing means (i) determining types of contents found on the Web page, and (ii) based on the determined content types, enabling extraction of people and organization information from the Web page.
- Dependent claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25):
  - 15. Apparatus as claimed in claim 14 wherein the computer processing means accessing the Web site includes determining whether the Web site has previously been accessed for searching for people and organization information.
  - 16. Apparatus as claimed in claim 15 wherein the computer processing means determining whether the Web site has previously been accessed includes:
    - obtaining a unique identifier for the Web site; and
      
      comparing the unique identifier to identifiers of past accessed Web sites to determine duplication of accessing a same Web site.
  - 17. Apparatus as claimed in claim 16 wherein the computer processing means obtaining a unique identifier includes forming a signature as a function of home page of the Web site.
  - 18. Apparatus as claimed in claim 14 wherein the computer processing means determining the subset of Web pages to process includes processing a listing of internal links and selecting from remaining internal links as a function of keywords.
  - 19. Apparatus as claimed in claim 18 wherein the computer processing means determining a subset of Web pages to process includes:
    - extracting from a script a quoted phrase ending in “.ASP”, “.HTM” or “.HTML”; and
      
      treating the extracted phrase as an internal link.
  - 20. Apparatus as claimed in claim 14 wherein the computer processing means determining content types of Web pages includes collecting external links and other domain names, and the step of obtaining domain names includes receiving the collected external links and other domain names from the step of determining content types.
  - 21. Apparatus as claimed in claim 14 wherein the computer processing means determining the subset of Web pages to process includes determining if a subject Web page contains a listing of press releases, and if so, following each internal link in the listing of press releases.
  - 22. Apparatus as claimed in claim 14 wherein the computer processing means determining the subset of Web pages to process includes determining if a subject Web page contains a listing of news articles, and if so, following each internal link in the listing of news articles.
  - 23. Apparatus as claimed in claim 14 further comprising a time limit by which the computer processing means processes a Web site.
  - 24. Apparatus as claimed in claim 14 further comprising a time limit by which the computer processing means processes a Web page.
  - 25. Apparatus as claimed in claim 14 wherein the domain database further stores for each Web site indications of:
    - name of content owner, site type of the Web site, frequency at which to access the Web site for processing, date of last accessing and processing, outcome of last processing, number of Web pages processed, and number of data items found in last processing.
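
The per-site fields listed in claims 13 and 25 map naturally onto a record type. The sketch below is one possible shape, assuming the access frequency is stored as a number of days between visits; the class name `DomainRecord`, the field names, the default values, and the `is_due` helper are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DomainRecord:
    domain_url: str                        # Web site domain URL
    owner_name: Optional[str] = None       # name of content owner
    site_type: Optional[str] = None        # site type of the Web site
    visit_every_days: int = 30             # frequency at which to access the site
    last_processed: Optional[date] = None  # date of last accessing and processing
    last_outcome: Optional[str] = None     # outcome of last processing
    pages_processed: int = 0               # number of Web pages processed
    items_found: int = 0                   # number of data items found last time

    def is_due(self, today: date) -> bool:
        """True if the site was never processed or its revisit interval has elapsed."""
        if self.last_processed is None:
            return True
        return today - self.last_processed >= timedelta(days=self.visit_every_days)
```

A scheduler could scan such records and hand the crawler every domain for which `is_due` returns true.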

Current Assignee: Zoom Information, Inc. (ZoomInfo Technologies, Inc.)
Original Assignee: Zoom Information, Inc. (ZoomInfo Technologies, Inc.)
Inventors: Stern, Jonathan; Rothman-Shore, Jeremy W.; Karadimitriou, Kosmas; Decary, Michel
Primary Examiner(s): Harrell, Robert B.
Application Number: US09/821,908
Publication Number: US 20020052928A1
Time in Patent Office: 1,740 Days
Field of Search: 706/45, 706/46, 706/59, 706/61, 707/1, 707/3, 707/100, 707/102, 709/201, 709/203, 709/217, 709/218, 709/200
US Class Current: 707/805
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
G06F 16/9535   Search customisation based ...
G06F 16/9538   Presentation of query results
Y10S 707/959   Network
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
Y10S 707/99937   Sorting
Y10S 707/99943   Generating database or data...
Y10S 707/99945   Object-oriented database st...
Y10S 707/99948   Application of database or ...
