IMAGE PROCESSING OF WEBPAGES
Abstract
A web detection system processes webpage information and performs automated feature extraction of webpages including machine processable information. In an embodiment, the web detection system determines a subset of webpages having a target characteristic by processing markup language. For a webpage of the subset, the web detection system determines that a first image overlaps at least a portion of a second image in the webpage. The web detection system generates an image of the webpage such that the portion of the second image is obscured by the first image. The web detection system determines a graphical feature of the webpage by processing the image, e.g., using optical character recognition. Responsive to determining that the graphical feature corresponds to graphical features of images of a different set of webpages associated with a target entity, the web detection system determines that the webpage is also associated with the target entity.
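The overlap determination described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that images are absolutely positioned with inline pixel styles and a `z-index`; the names `ImgBoxParser`, `overlap`, and `find_overlaps` are hypothetical and do not come from the patent, which does not specify how the markup is analyzed.

```python
import re
from html.parser import HTMLParser


def _px(value):
    """Parse a leading integer from a CSS value such as '80px'; default to 0."""
    m = re.match(r"\s*(-?\d+)", value or "")
    return int(m.group(1)) if m else 0


class ImgBoxParser(HTMLParser):
    """Collects boxes of <img> tags from inline styles (assumed layout model)."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        decls = (d.split(":", 1) for d in (a.get("style") or "").split(";") if ":" in d)
        style = {k.strip(): v.strip() for k, v in decls}
        self.boxes.append({
            "src": a.get("src") or "",
            "left": _px(style.get("left")), "top": _px(style.get("top")),
            "width": _px(style.get("width")), "height": _px(style.get("height")),
            "z": _px(style.get("z-index")),
        })


def overlap(a, b):
    """Return the intersection rectangle (left, top, right, bottom) or None."""
    left, top = max(a["left"], b["left"]), max(a["top"], b["top"])
    right = min(a["left"] + a["width"], b["left"] + b["width"])
    bottom = min(a["top"] + a["height"], b["top"] + b["height"])
    return (left, top, right, bottom) if left < right and top < bottom else None


def find_overlaps(html):
    """List (covering_src, covered_src, rect) for every overlapping image pair."""
    parser = ImgBoxParser()
    parser.feed(html)
    # Lower z-index paints first; ties fall back to document order.
    ordered = sorted(enumerate(parser.boxes), key=lambda p: (p[1]["z"], p[0]))
    results = []
    for i, (_, lower) in enumerate(ordered):
        for _, upper in ordered[i + 1:]:
            rect = overlap(lower, upper)
            if rect:
                results.append((upper["src"], lower["src"], rect))
    return results
```

Here the "first image" of the abstract corresponds to the covering element and the "second image" to the covered one. A real page would need a full layout engine to resolve stylesheets and flow positioning; this sketch only covers the inline-style case.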
20 Claims
1. A system for automated feature extraction of webpages including machine processable information, the system comprising:
    a markup language engine configured to:
        identify a plurality of webpages, and
        process markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic;
    a rendering engine configured to:
        determine, for a webpage of the subset, that a first image overlaps at least a portion of a second image in the webpage based at least on markup language of the webpage, and
        generate, for the webpage of the subset, an image of the webpage such that the portion of the second image is obscured by the first image; and
    a detection engine configured to:
        determine, for the webpage of the subset, at least one graphical feature of the webpage by processing the image of the webpage, the at least one graphical feature corresponding to the portion of the second image,
        determine, for the webpage of the subset, that the at least one graphical feature corresponds to graphical features of images of a different plurality of webpages associated with a target entity, and
        generate, responsive to the determination that the at least one graphical feature corresponds to the graphical features of images of the different plurality of webpages, an association between the webpage and the target entity for storage in a database.
(Dependent claims: 2, 3, 4, 5, 6)
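As a rough illustration of the rendering step, the sketch below composites layers in stacking order so that the covered portion of a lower image is replaced by the image on top. Images are modeled as rows of characters to keep the example dependency-free; the function `render` and the layer dictionary keys are hypothetical names chosen for this sketch, not terms from the claims.

```python
def render(canvas_w, canvas_h, layers, background="."):
    """Paint layers from lowest to highest z so later layers hide earlier ones."""
    canvas = [[background] * canvas_w for _ in range(canvas_h)]
    # sorted() is stable, so layers with equal z keep their document order.
    for layer in sorted(layers, key=lambda l: l["z"]):
        for dy, row in enumerate(layer["pixels"]):
            for dx, px in enumerate(row):
                x, y = layer["left"] + dx, layer["top"] + dy
                if 0 <= x < canvas_w and 0 <= y < canvas_h:
                    canvas[y][x] = px  # overwrite whatever was underneath
    return ["".join(row) for row in canvas]


# Hypothetical example: a 4x2 "second image" (B) partly covered by a 2x2 "first image" (A).
second_image = {"left": 0, "top": 0, "z": 1, "pixels": ["BBBB", "BBBB"]}
first_image = {"left": 2, "top": 0, "z": 2, "pixels": ["AA", "AA"]}
page_image = render(5, 2, [first_image, second_image])
```

In the resulting `page_image`, the two rightmost columns of the second image are no longer visible, which is the property the claim requires of the generated image. A production system would more plausibly capture a screenshot from a browser engine.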
7. A method for automated feature extraction of webpages including machine processable information, the method comprising:
    identifying, by a web detection system, a plurality of webpages;
    processing, by the web detection system, markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic;
    responsive to determining that the subset of the plurality of webpages includes the target characteristic, for a webpage of the subset:
        determining, by the web detection system, that a first object overlaps at least a portion of a second object in the webpage based at least on markup language of the webpage;
        generating, by the web detection system, an image of the webpage such that the portion of the second object is obscured or altered by the first object;
        determining, by the web detection system, at least one feature of the webpage by processing the image of the webpage, the at least one feature corresponding to the portion of the second object;
        determining, by the web detection system, that the at least one feature corresponds to features of images of a different plurality of webpages associated with a target entity; and
        responsive to determining that the at least one feature corresponds to the features of images of the different plurality of webpages, generating, by the web detection system, an association between the webpage and the target entity for storage in a database.
(Dependent claims: 8, 9, 10, 11, 12, 13, 14)
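The claim does not say how a feature is judged to "correspond" to features from other webpages. One plausible stand-in, shown here only as an assumption, is an average hash of a grayscale region compared by Hamming distance. The functions `average_hash`, `hamming`, and `matches_entity` and the threshold `max_distance` are hypothetical.

```python
def average_hash(gray):
    """Bit tuple marking which pixels of a grayscale grid are brighter than the mean."""
    flat = [v for row in gray for v in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if v > mean else 0 for v in flat)


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def matches_entity(feature, reference_features, max_distance=1):
    """True if the feature is within max_distance bits of any reference feature."""
    return any(hamming(feature, ref) <= max_distance for ref in reference_features)


# Hypothetical 2x2 grayscale region taken from the rendered page image.
region = [[10, 200], [220, 15]]
feature = average_hash(region)

# Hypothetical features previously extracted from pages tied to the target entity.
known_entity_features = [(0, 1, 1, 0), (1, 1, 1, 1)]
is_match = matches_entity(feature, known_entity_features)
```

The abstract names optical character recognition as one way to derive features; a text-based comparison would replace the hash with recognized strings but keep the same match-then-associate flow.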
15. A non-transitory computer-readable storage medium storing instructions for automated feature extraction of webpages including machine processable information, the instructions when executed by a processor causing the processor to:
    identify a plurality of webpages;
    process markup language of the plurality of webpages to determine that a subset of the plurality of webpages includes a target characteristic;
    responsive to determining that the subset of the plurality of webpages includes the target characteristic, for a webpage of the subset:
        determine that a first image overlaps at least a portion of a second image in the webpage based at least on markup language of the webpage;
        generate an image of the webpage such that the portion of the second image is obscured by the first image;
        determine at least one graphical feature of the webpage by processing the image of the webpage, the at least one graphical feature corresponding to the portion of the second image;
        determine that the at least one graphical feature corresponds to graphical features of images of a different plurality of webpages associated with a target entity; and
        generate, responsive to the determination that the at least one graphical feature corresponds to the graphical features of images of the different plurality of webpages, an association between the webpage and the target entity for storage in a database.
(Dependent claims: 16, 17, 18, 19, 20)
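The final step, generating an association for storage in a database, might look like the following sketch. The claims do not name a database or schema; SQLite, the table `associations`, its columns, and the helper names are all assumptions made for illustration.

```python
import sqlite3


def store_association(conn, webpage_url, target_entity):
    """Record that a webpage is associated with a target entity (idempotent)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS associations ("
        "webpage_url TEXT NOT NULL, target_entity TEXT NOT NULL, "
        "UNIQUE (webpage_url, target_entity))"
    )
    # INSERT OR IGNORE keeps repeated detections from creating duplicate rows.
    conn.execute(
        "INSERT OR IGNORE INTO associations (webpage_url, target_entity) VALUES (?, ?)",
        (webpage_url, target_entity),
    )
    conn.commit()


def entities_for(conn, webpage_url):
    """Return the target entities associated with a webpage, sorted by name."""
    rows = conn.execute(
        "SELECT target_entity FROM associations WHERE webpage_url = ? "
        "ORDER BY target_entity",
        (webpage_url,),
    )
    return [row[0] for row in rows]


# Hypothetical usage with an in-memory database and placeholder values.
conn = sqlite3.connect(":memory:")
store_association(conn, "http://example.test/login", "ExampleCorp")
store_association(conn, "http://example.test/login", "ExampleCorp")  # repeat detection
stored = entities_for(conn, "http://example.test/login")
```

The uniqueness constraint is a design choice for this sketch: the same page may be rendered and matched more than once, and a single row per webpage and entity pair keeps later lookups simple.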
Specification