Method and apparatus for identifying webpage type
First Claim
1. A method for identifying webpage type, comprising:
at a device having a processor and a screen, reading pre-stored web addresses of a webpage type, obtaining a collection of string components of the web addresses by parsing the web addresses;
converging web addresses having at least one identical string component into one group according to a pre-defined converging method to generate multiple groups;
determining that a coverage rate of a group meets a requirement in response to a determination that a total number of webpages in the group is smaller than or equal to a first threshold and determining that an identification accuracy of the group meets the requirement in response to a determination that an entropy is smaller than a second threshold;
determining the coverage rate and the identification accuracy of the group do not meet the requirement in response to a determination that the total number of webpages in the group is larger than the first threshold or the entropy is larger than or equal to a second threshold;
wherein the entropy satisfies E=sum(pi*log(pi)), i=1, 2 . . . , n, wherein n is the total number of webpages in the group, pi is a probability of webpages of a same type occurring in the group;
terminating converging in response to the determination that the coverage rate and the identification accuracy meet the requirement;
generating a webpage classification rule using the multiple groups and the webpage type, and storing the webpage classification rule into a webpage classification rule base;
judging whether a web address of a webpage to be classified matches a webpage classification rule;
determining a type of the webpage to be a type corresponding to a webpage classification rule which matches the web address;
in response to a judgment that the web address of the webpage to be classified does not match the webpage classification rule, using a classifier trained using a machine learning algorithm based on web addresses to determine the webpage type of the webpage to be classified;
extracting a content of the webpage selectively according to the webpage type; and
displaying, on the screen, the content to a user in a pre-defined manner corresponding to the webpage type.
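The group test recited in claim 1 (a page-count threshold plus an entropy threshold) can be illustrated with a minimal sketch. Several points are assumptions rather than anything the claim states: the function names `group_entropy` and `group_meets_requirement` are hypothetical, the sum is taken over the distinct webpage types in the group, and the conventional Shannon form with a leading minus sign and natural logarithm is used so that a lower value means a purer group. The claim itself writes the sum without the minus sign.

```python
import math
from collections import Counter


def group_entropy(page_types):
    """Entropy of the type labels in one group (assumed Shannon form, natural log)."""
    total = len(page_types)
    counts = Counter(page_types)
    # p is the fraction of pages in the group that share one type.
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def group_meets_requirement(page_types, first_threshold, second_threshold):
    """True when the page count is <= first_threshold and entropy < second_threshold."""
    return (
        len(page_types) <= first_threshold
        and group_entropy(page_types) < second_threshold
    )
```

Under these assumptions, a group whose pages all share one type has entropy 0 and passes any positive entropy threshold, while an evenly mixed two-type group has entropy log 2 and fails a threshold of 0.5.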
Abstract
Various embodiments provide a method and an apparatus for identifying webpage type. The method includes: judging whether a web address to be classified matches one of at least two webpage classification rules; and determining the type of the webpage to be the type corresponding to the webpage classification rule that matches the web address.
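The matching step can be sketched as a lookup over a rule base, with the fallback to a trained classifier that claim 1 recites for web addresses matching no rule. This is only an illustration under stated assumptions: the rule base is modelled as a list of (regular expression, type) pairs, the patterns and type names in `RULES` are invented for the example, and `fallback` stands in for the machine-learned classifier, whose form the text does not specify.

```python
import re

# Hypothetical rule base: each rule pairs a URL pattern with a webpage type.
RULES = [
    (re.compile(r"^https?://[^/]+/thread-\d+"), "forum"),
    (re.compile(r"^https?://[^/]+/news/"), "news"),
]


def classify(url, rules, fallback):
    """Return the type of the first matching rule, else defer to the fallback."""
    for pattern, page_type in rules:
        if pattern.search(url):
            return page_type
    # No rule matched: hand the web address to the (assumed) trained classifier.
    return fallback(url)
```

A call such as `classify("http://example.com/thread-123.html", RULES, my_classifier)` returns `"forum"` without consulting the classifier, where `my_classifier` is any callable that maps a URL to a type.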
16 Claims
9. An apparatus for identifying webpage type, comprising:
at least one processor;
a display screen; and
memory for storing computer-readable instructions, wherein the at least one processor, when executing the computer-readable instructions, is configured to:
read pre-stored web addresses of a webpage type, obtaining a collection of string components of the web addresses by parsing the web addresses;
converge web addresses having at least one identical string component into one group according to a pre-defined converging method to generate multiple groups;
determine that a coverage rate of a group meets a requirement in response to a determination that a total number of webpages in the group is smaller than or equal to a first threshold and determining that an identification accuracy of the group meets the requirement in response to a determination that an entropy is smaller than a second threshold;
determine the coverage rate and the identification accuracy of the group do not meet the requirement in response to a determination that the total number of webpages in the group is larger than the first threshold or the entropy is larger than or equal to a second threshold;
wherein the entropy satisfies E=sum(pi*log(pi)), i=1, 2 . . . , n, wherein n is the total number of webpages in the group, pi is a probability of webpages of a same type occurring in the group;
terminate converging in response to the determination that the coverage rate and the identification accuracy meet the requirement;
generate a webpage classification rule using the multiple groups and the webpage type, and storing the webpage classification rule into a webpage classification rule base;
judge whether a web address of a webpage to be classified matches a webpage classification rule;
determine a type of the webpage to be a type corresponding to a webpage classification rule which matches the web address;
in response to a judgment that the web address of the webpage to be classified does not match the webpage classification rule, use a classifier trained using a machine learning algorithm based on web addresses to determine the webpage type of the webpage to be classified;
extract a content of the webpage selectively according to the webpage type; and
display, on the display screen, the content to a user in a pre-defined manner corresponding to the webpage type.
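The parsing and converging steps recited in the claims can be pictured with a small sketch. The "pre-defined converging method" is not detailed in this section, so the approach here is an assumption: each web address is split into its host and path segments, and addresses sharing a component are collected under that component. The helper names `string_components` and `group_by_component` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def string_components(url):
    """Split a web address into its host and its non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc] + segments


def group_by_component(urls):
    """Map each string component to the web addresses that contain it."""
    groups = defaultdict(list)
    for url in urls:
        # set() keeps a URL from being added twice for a repeated component.
        for component in set(string_components(url)):
            groups[component].append(url)
    return dict(groups)
```

Each resulting group could then be checked against the coverage and entropy thresholds before a classification rule is generated from it.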
Specification