Method and apparatus for identifying webpage type

  • US 10,311,120 B2
  • Filed: 02/20/2015
  • Issued: 06/04/2019
  • Est. Priority Date: 08/22/2012
  • Status: Active Grant
First Claim

1. A method for identifying webpage type, comprising:

  • at a device having a processor and a screen, reading pre-stored web addresses of a webpage type, obtaining a collection of string components of the web addresses by parsing the web addresses;

    converging web addresses having at least one identical string component into one group according to a pre-defined converging method to generate multiple groups;

    determining that a coverage rate of a group meets a requirement in response to a determination that a total number of webpages in the group is smaller than or equal to a first threshold and determining that an identification accuracy of the group meets the requirement in response to a determination that an entropy is smaller than a second threshold;

    determining the coverage rate and the identification accuracy of the group do not meet the requirement in response to a determination that the total number of webpages in the group is larger than the first threshold or the entropy is larger than or equal to a second threshold;

    wherein the entropy satisfies E = sum(p_i * log(p_i)), i = 1, 2, ..., n, wherein n is the total number of webpages in the group, p_i is a probability of webpages of a same type occurring in the group;

    terminating converging in response to the determination that the coverage rate and the identification accuracy meet the requirement;

    generating a webpage classification rule using the multiple groups and the webpage type, and storing the webpage classification rule into a webpage classification rule base;

    judging whether a web address of a webpage to be classified matches a webpage classification rule;

    determining a type of the webpage to be a type corresponding to a webpage classification rule which matches the web address;

    in response to a judgment that the web address of the webpage to be classified does not match the webpage classification rule, using a classifier trained using a machine learning algorithm based on web addresses to determine the webpage type of the webpage to be classified;

    extracting a content of the webpage selectively according to the webpage type; and

    displaying, on the screen, the content to a user in a pre-defined manner corresponding to the webpage type.
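
The converging, threshold, and rule-matching steps recited above can be illustrated with a minimal Python sketch. It is not the patented implementation. All function names, the choice of host and path segments as "string components", the majority-vote rule, and the fallback callable standing in for the trained classifier are assumptions made for illustration. The entropy formula in the claim is written without a leading minus sign; the sketch assumes the conventional Shannon form (with the minus sign, summed over the distinct types in a group) so that a lower value means a purer group.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def string_components(url):
    """Parse a web address into its host and non-empty path segments (assumed split)."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]


def converge(urls):
    """Group web addresses that share an identical string component."""
    groups = defaultdict(set)
    for url in urls:
        for comp in string_components(url):
            groups[comp].add(url)
    return groups


def entropy(type_labels):
    """Shannon entropy of the type distribution in a group (assumed sign convention)."""
    counts = Counter(type_labels)
    total = len(type_labels)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def build_rules(labeled_urls, max_group_size, max_entropy):
    """Keep groups whose size is <= the first threshold and whose entropy is
    < the second threshold; map each kept component to its majority type."""
    rules = {}
    for comp, members in converge(labeled_urls).items():
        labels = [labeled_urls[u] for u in members]
        if len(members) <= max_group_size and entropy(labels) < max_entropy:
            rules[comp] = Counter(labels).most_common(1)[0][0]
    return rules


def classify(url, rules, fallback):
    """Return the type of the first matching rule; otherwise defer to the
    fallback, a hypothetical stand-in for the URL-based trained classifier."""
    for comp in string_components(url):
        if comp in rules:
            return rules[comp]
    return fallback(url)
```

With a small labeled set such as two "news" and two "forum" addresses on one host, the host group is rejected when the first threshold is 3 (it holds four pages), while the "news" and "forum" path groups each yield a rule. An address that matches no rule is passed to the fallback.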
