System and method for selecting a sub-domain for a specified domain of the web

  • US 7,542,970 B2
  • Filed: 05/11/2006
  • Issued: 06/02/2009
  • Est. Priority Date: 05/11/2006
  • Status: Expired due to Fees
First Claim
1. A selection method, comprising:

    receiving, by a computing system, a taxonomy of data related to a specified domain of knowledge on the web;

    storing, by said computing system, said taxonomy of data;

    constructing, by a software application within said computing system, a taxonomy tree from said taxonomy;

    receiving, by said computing system, a user selection for a taxonomy sub-tree from said taxonomy tree, said sub tree related to a sub-domain from said specified domain;

    receiving, by said computing system from a user, a first list comprising user expected universal resource locators (URLs) related to said sub-domain, wherein said user selection is associated with a published list of URLs;

    generating, by said software application, a second list comprising topic expressions defining each node of said taxonomy sub-tree;

    receiving, by said software application, a first command for removing a first topic expression of said topic expressions from said second list;

    removing, by said software application in response to said first command, said first topic expression from said second list;

    receiving, by said software application, a second command for adding a second topic expression to said second list;

    adding, by said software application in response to said second command, said second topic expression to said second list;

    after said removing and said adding, generating by said software application, a query based on said second list by applying at least one Boolean operator on said topic expressions on said second list;

    applying, by said software application, said query on an index of URLs, said index generated from a web crawling process;

    generating, by said query, a third list comprising actual URLs located during said query;

    determining, by said software application, a first group (A) of URLs that are listed on and common to said third list and said first list;

    determining, by said software application, a second group (B) of URLs that are listed on only said first list;

    calculating, by said software application, a recall value (R) based on a number of URLs in said first group (NA) and a number of URLs in said second group (NB), wherein R=NA/NB;

    randomly sampling, by said software application, said third list to generate a sampled list (D) of URLs from said third list;

    sending, said sampled list (D) to said user of said computing system;

    receiving, by said computing system, a user selected sub-list (C) of URLs from said sampled list (D), said user selected sublist based on a selection criteria;

    calculating, by said software application, a precision value (P) based on a number of URLs on said user selected sub-list (NC) and a number of URLs on said sampled list (ND) wherein P=NC/ND;

    comparing, by said computing system, said recall value to a predetermined recall value;

    determining, by said computing system based on first results of said comparing said recall value to said predetermined recall value, that said recall value comprises an acceptable recall value;

    comparing, by said computing system, said precision value to a predetermined precision value;

    determining, by said computing system based on second results of said comparing said precision value to said predetermined precision value, that said precision value comprises an acceptable precision value;

    and saving, on said computing system in response to said first results and said second results, said sub-list (C).
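
The evaluation steps recited in claim 1 (building a Boolean query from the topic expressions, computing recall from groups A and B, sampling list D, computing precision from sub-list C, and comparing both values to predetermined thresholds) can be illustrated with a minimal sketch. This is not the patented implementation. The function names, the toy in-memory index (a dict mapping each URL to a set of topic expressions), the choice of OR as the Boolean operator, the "greater than or equal" acceptance test, and the handling of an empty group B are all assumptions made for illustration. The recall formula follows the claim wording literally (R=NA/NB), which differs from the conventional recall ratio NA/(NA+NB).

```python
import math
import random


def build_query(topic_expressions):
    """Join topic expressions with a Boolean OR (assumed operator choice)."""
    return " OR ".join(f'"{t}"' for t in topic_expressions)


def run_query(topic_expressions, index):
    """Toy stand-in for applying the query to a crawled URL index.

    `index` is a hypothetical dict: URL -> set of topic expressions.
    A URL matches if it carries at least one requested expression (OR).
    """
    wanted = set(topic_expressions)
    return [url for url, topics in index.items() if topics & wanted]


def recall_value(expected_urls, actual_urls):
    """R = NA / NB, exactly as recited in the claim."""
    expected, actual = set(expected_urls), set(actual_urls)
    group_a = expected & actual   # on both the first and third lists
    group_b = expected - actual   # only on the first list
    if not group_b:
        # Assumption: nothing was missed, so treat recall as unbounded.
        return math.inf
    return len(group_a) / len(group_b)


def sample_urls(actual_urls, k, seed=None):
    """Randomly sample up to k URLs from the third list to form list D."""
    rng = random.Random(seed)
    return rng.sample(list(actual_urls), min(k, len(actual_urls)))


def precision_value(selected_urls, sampled_urls):
    """P = NC / ND, where C is the user-selected sub-list of D."""
    if not sampled_urls:
        raise ValueError("sampled list D is empty")
    return len(selected_urls) / len(sampled_urls)


def is_acceptable(recall, precision, min_recall, min_precision):
    """Assumed acceptance test: both values meet their thresholds."""
    return recall >= min_recall and precision >= min_precision


if __name__ == "__main__":
    index = {"a": {"x"}, "b": {"y"}, "c": {"z"}, "e": {"x"}, "f": {"y"}}
    expected = ["a", "b", "c", "d"]          # first list (user expected URLs)
    topics = ["x", "y"]                      # second list (topic expressions)
    print(build_query(topics))
    actual = run_query(topics, index)        # third list (actual URLs)
    r = recall_value(expected, actual)
    sampled = sample_urls(actual, k=2, seed=0)            # list D
    chosen = [u for u in sampled if u in expected]        # stand-in for user choice C
    p = precision_value(chosen, sampled)
    print(r, p, is_acceptable(r, p, min_recall=0.8, min_precision=0.5))
```

In the claim, sub-list C comes from a human reviewer applying a selection criterion. The sketch substitutes a simple membership check only so that the example runs end to end.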
