Method, apparatus, and computer program product for classification of documents
First Claim
1. A computer-implemented method for identifying content to represent web pages and creating thumbnails from the content, the computer-implemented method comprising:
retrieving a web document using a uniform resource locator (URL) contained in a dequeued work item, the dequeued work item parsed using a markup language parser;
determining, from the web document, candidate images for thumbnail creation, wherein the determination of the candidate images for thumbnail creation comprises at least:
identifying a desired thumbnail size and aspect ratio;
extracting data content from the parsed markup to determine one or more candidate images for thumbnail creation; and
utilizing one or more heuristics to discard candidate images having predefined undesirable characteristics, including at least discarding, from among the extracted one or more images, any images failing to meet the desired thumbnail size and aspect ratio; and
creating a thumbnail image, wherein generation of the thumbnail image comprises at least:
cropping a chosen image, the chosen image selected from among the candidate images, to each of one or more predefined sizes and encoding the chosen image with predefined compression settings, each in accordance with an environment in which the thumbnails will be used.
Abstract
Provided herein are systems, methods, and computer-readable media for classification of documents using a location hierarchy. An example method may include receiving a feature vector r that represents occurrence counts of references in a document's text to each of a group of named entities, and determining whether the document is associated with a particular location by querying, using feature vector r, at least one location-specific classifier from a group of location-specific classifiers to determine a query result. The location-specific classifier is associated with the particular location and is configured to generate a positive output value in response to receiving an input feature vector representing an occurrence count of at least one reference to a particular named entity. The document is determined to be associated with the particular location in an instance in which the query result includes data indicating that the positive output value was generated by the location-specific classifier that is associated with the particular location.
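The querying step described in the abstract can be illustrated with a minimal Python sketch. It is not the classifier from the specification: the `LocationClassifier` class, its `entities` set, the `associated_locations` helper, and the sample locations are all hypothetical, and the location hierarchy itself is not modeled. The sketch only shows a feature vector r of occurrence counts being passed to each location-specific classifier, with a positive output marking the document as associated with that classifier's location.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Feature vector r: occurrence counts of references to named entities
# found in a document's text (entity name -> count).
FeatureVector = Dict[str, int]


@dataclass(frozen=True)
class LocationClassifier:
    """Hypothetical location-specific classifier (an assumption, not the patent's model)."""
    location: str
    entities: FrozenSet[str]  # named entities assumed to indicate this location

    def query(self, r: FeatureVector) -> bool:
        # Positive output when r records at least one reference
        # to a named entity tied to this location.
        return any(r.get(entity, 0) > 0 for entity in self.entities)


def associated_locations(r: FeatureVector,
                         classifiers: List[LocationClassifier]) -> List[str]:
    """Return the locations whose classifier produced a positive output for r."""
    return [c.location for c in classifiers if c.query(r)]


# Illustrative data only.
classifiers = [
    LocationClassifier("Chicago", frozenset({"Wrigley Field", "Navy Pier"})),
    LocationClassifier("Seattle", frozenset({"Space Needle", "Pike Place Market"})),
]
r = {"Wrigley Field": 2, "Space Needle": 0, "pizza": 5}
print(associated_locations(r, classifiers))  # ['Chicago']
```

A production classifier would more likely be a trained model than a set-membership test; the set is used here only to keep the positive-output condition easy to see.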
21 Claims
1. A computer-implemented method for identifying content to represent web pages and creating thumbnails from the content, the computer-implemented method comprising:
retrieving a web document using a uniform resource locator (URL) contained in a dequeued work item, the dequeued work item parsed using a markup language parser;
determining, from the web document, candidate images for thumbnail creation, wherein the determination of the candidate images for thumbnail creation comprises at least:
identifying a desired thumbnail size and aspect ratio;
extracting data content from the parsed markup to determine one or more candidate images for thumbnail creation; and
utilizing one or more heuristics to discard candidate images having predefined undesirable characteristics, including at least discarding, from among the extracted one or more images, any images failing to meet the desired thumbnail size and aspect ratio; and
creating a thumbnail image, wherein generation of the thumbnail image comprises at least:
cropping a chosen image, the chosen image selected from among the candidate images, to each of one or more predefined sizes and encoding the chosen image with predefined compression settings, each in accordance with an environment in which the thumbnails will be used.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
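The candidate-determination steps of claim 1 can be sketched in Python as follows. This is one possible reading, not the claimed implementation: the `Candidate` and `ImageCollector` names, the `tolerance` parameter, and the decision to trust the `width` and `height` attributes declared in the markup are all assumptions. Dequeuing the work item and fetching the URL are omitted, so the sketch starts from markup that has already been retrieved.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List


@dataclass
class Candidate:
    src: str
    width: int
    height: int


class ImageCollector(HTMLParser):
    """Collects <img> tags that declare a src and numeric width/height."""

    def __init__(self) -> None:
        super().__init__()
        self.candidates: List[Candidate] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        src = a.get("src")
        if not src:
            return
        try:
            self.candidates.append(Candidate(src, int(a["width"]), int(a["height"])))
        except (KeyError, TypeError, ValueError):
            return  # no usable dimensions in the markup: skip this image


def meets_target(c: Candidate, min_w: int, min_h: int,
                 target_ratio: float, tolerance: float = 0.5) -> bool:
    """Heuristic (assumed): large enough, and aspect ratio close to the target."""
    if c.height <= 0 or c.width < min_w or c.height < min_h:
        return False
    ratio = c.width / c.height
    return abs(ratio - target_ratio) <= tolerance * target_ratio


html = """
<img src="/hero.jpg" width="1200" height="800">
<img src="/pixel.gif" width="1" height="1">
<img src="/banner.png" width="1200" height="250">
<img src="/logo.png">
"""

collector = ImageCollector()
collector.feed(html)

# Desired thumbnail: 300x200, i.e. an aspect ratio of 1.5.
kept = [c.src for c in collector.candidates
        if meets_target(c, min_w=300, min_h=200, target_ratio=300 / 200)]
print(kept)  # ['/hero.jpg']
```

Here the tracking pixel is discarded for size, the banner for its aspect ratio, and the logo because the markup declares no dimensions. A fuller implementation would probably download each image to measure it rather than rely on declared attributes.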
11. An apparatus for identifying content to represent web pages and creating thumbnails from the content comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the processor, cause the apparatus to at least:
retrieve a web document using a uniform resource locator (URL) contained in a dequeued work item, the dequeued work item parsed using a markup language parser;
determine, from the web document, candidate images for thumbnail creation, wherein the computer program code configured to, with the processor, cause the apparatus to determine the candidate images for thumbnail creation, comprises computer program code configured to, with the processor, cause the apparatus to at least:
identify a desired thumbnail size and aspect ratio;
extract data content from the parsed markup to determine one or more candidate images for thumbnail creation;
utilize one or more heuristics to discard candidate images having predefined undesirable characteristics, including computer program code configured to, with the processor, cause the apparatus to at least discard, from among the extracted one or more images, any images failing to meet the desired thumbnail size and aspect ratio; and
create a thumbnail image, wherein the computer program code configured to, with the processor, cause the apparatus to generate the thumbnail image comprises computer program code configured to, with the processor, cause the apparatus to at least:
crop a chosen image, the chosen image selected from among the candidate images, to each of one or more predefined sizes and encoding the chosen image with predefined compression settings, each in accordance with an environment in which the thumbnails will be used.
View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A computer program product for identifying content to represent web pages and creating thumbnails from the content comprising at least one non-transitory computer-readable storage medium having computer-executable program code instructions stored therein, the computer-executable program code instructions comprising program code instructions for:
retrieving a web document using a uniform resource locator (URL) contained in a dequeued work item, the dequeued work item parsed using a markup language parser;
determining, from the web document, candidate images for thumbnail creation, wherein the determination of the candidate images for thumbnail creation comprises at least:
identifying a desired thumbnail size and aspect ratio;
extracting data content from the parsed markup to determine one or more candidate images for thumbnail creation;
utilizing one or more heuristics to discard candidate images having predefined undesirable characteristics, including at least discarding, from among the extracted one or more images, any images failing to meet the desired thumbnail size and aspect ratio; and
creating a thumbnail image, wherein generation of the thumbnail image comprises at least:
cropping a chosen image, the chosen image selected from among the candidate images, to each of one or more predefined sizes and encoding the chosen image with predefined compression settings, each in accordance with an environment in which the thumbnails will be used.
Specification