Matching text to images
First Claim
1. A method performed on at least one computer processor, said method comprising:
receiving an image to classify, said image being located within a text document;
identifying a plurality of said text documents comprising said image;
identifying a training set of examples from at least one of said text documents, said training set of examples comprising a subset of text within said text documents;
training a classifier using said training set;
classifying said text within said text document using said classifier to identify a group of text associated with said image; and
for each of said plurality of said text documents, analyzing text in said text documents to determine a list of topics within said text document, said topics being used to select at least a portion of said training set.
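The topic-analysis step in the final clause can be illustrated with a minimal sketch. The claim does not say how topics are determined, so the term-frequency approach below is only an assumption, and the names STOPWORDS, list_topics, and select_by_topics are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; the claim does not specify how topics are found.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "at", "was", "by"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def list_topics(document_text, k=3):
    """Return the k most frequent terms as a stand-in for a topic list."""
    counts = Counter(tokenize(document_text))
    return [word for word, _ in counts.most_common(k)]

def select_by_topics(sentences, topics):
    """Keep sentences that mention at least one topic term."""
    topic_set = set(topics)
    return [s for s in sentences if topic_set & set(tokenize(s))]

doc = ("The lighthouse stands on the cliff. The lighthouse was built in 1890. "
       "Ferry tickets are sold at the harbor office.")
topics = list_topics(doc, k=1)
sentences = [s.strip() for s in doc.split(".") if s.strip()]
candidates = select_by_topics(sentences, topics)
```

Here the sentences that mention the dominant term become candidates for the training set, while the unrelated sentence is left out.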
Abstract
Text in web pages or other text documents may be classified based on the images or other objects within the webpage. A system for identifying and classifying text related to an object may identify one or more web pages containing the image or similar images, determine topics from the text of the document, and develop a set of training phrases for a classifier. The classifier may be trained and then used to analyze the text in the documents. The training set may include both positive examples and negative examples of text taken from the set of documents. A positive example may include captions or other elements directly associated with the object, while negative examples may include text taken from the documents, but from a large distance from the object. In some cases, the system may iterate on the classification process to refine the results.
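The selection of positive and negative examples described in the abstract might look like the following sketch. It assumes each document has already been split into ordered text blocks; the TextBlock type, the position field, and the far_distance threshold are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    position: int            # order of the block within the document (assumed)
    is_caption: bool = False  # True when the block is directly tied to the image

def build_training_set(blocks, image_position, far_distance=5):
    """Label captions as positive examples and far-away text as negative examples."""
    positives, negatives = [], []
    for block in blocks:
        distance = abs(block.position - image_position)
        if block.is_caption:
            positives.append(block.text)
        elif distance >= far_distance:
            negatives.append(block.text)
        # Blocks in between stay unlabeled; the trained classifier judges them later.
    return positives, negatives

blocks = [
    TextBlock("Site navigation and login links", 0),
    TextBlock("The old lighthouse at dusk", 6, is_caption=True),
    TextBlock("The lighthouse was built in 1890.", 7),
    TextBlock("Copyright notice and contact details", 14),
]
positives, negatives = build_training_set(blocks, image_position=5)
```

Leaving the middle-distance blocks unlabeled reflects the abstract's division of labor: the training set comes only from clear-cut cases, and the classifier decides the rest.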
24 Citations
20 Claims
1. A method performed on at least one computer processor, said method comprising:
receiving an image to classify, said image being located within a text document;
identifying a plurality of said text documents comprising said image;
identifying a training set of examples from at least one of said text documents, said training set of examples comprising a subset of text within said text documents;
training a classifier using said training set;
classifying said text within said text document using said classifier to identify a group of text associated with said image; and
for each of said plurality of said text documents, analyzing text in said text documents to determine a list of topics within said text document, said topics being used to select at least a portion of said training set.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system comprising:
a computer processor;
an object classifier operable on said processor, said object classifier that;
receives a set of text documents and identifies a common object to classify, said common object being comprised in each of said text documents;
identifies a training set of examples from at least one of said text documents, said training set of examples comprising a subset of text within said text documents;
trains a classifier using said training set; and
classifies said text within said text document using said classifier to identify a group of text associated with said object; and
a document object model generator that processes at least one of said text documents to identify a document object model, said document object model defining a set of nodes based on objects within said text document, said document object model being used to select at least one member of said training set.
Dependent claims: 12, 13, 14, 15, 16
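The document object model generator in claim 11 can be approximated with a small node tree built by Python's standard html.parser module. This is only a sketch: the Node and DomBuilder classes, the VOID_TAGS set, and the rule that text sharing the image's parent is a candidate training example are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for this sketch).
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}

class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag, self.attrs, self.parent = tag, dict(attrs or []), parent
        self.children, self.text = [], ""

class DomBuilder(HTMLParser):
    """Build a simplified node tree as a stand-in for a document object model."""
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into elements that can hold children

    def handle_endtag(self, tag):
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data.strip()

def find_image(node, src):
    """Depth-first search for the img node with the given src attribute."""
    if node.tag == "img" and node.attrs.get("src") == src:
        return node
    for child in node.children:
        found = find_image(child, src)
        if found is not None:
            return found
    return None

def sibling_text(image_node):
    """Text of nodes sharing the image's parent: candidate positive examples."""
    return [n.text for n in image_node.parent.children
            if n is not image_node and n.text]

html = ('<div><p>Unrelated intro</p></div>'
        '<figure><img src="lighthouse.jpg">'
        '<figcaption>Lighthouse at dusk</figcaption></figure>')
builder = DomBuilder()
builder.feed(html)
builder.close()
image = find_image(builder.root, "lighthouse.jpg")
positive_candidates = sibling_text(image)
```

Tree distance between nodes is one plausible way for a model like this to guide training-set selection; the claim itself only requires that the model be used to select at least one member.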
17. A method performed on at least one computer processor, said method comprising:
receiving an image to classify, said image being located within a first web page;
identifying a plurality of web pages comprising said image by transmitting a search request to a search system and returning said plurality of said web pages, the first web page included in said plurality of web pages;
identifying a training set of examples from at least one of said web pages, said training set of examples comprising a subset of text within said plurality of web pages, said training set comprising at least one positive example and at least one negative example;
training a classifier using said training set, said classifier being a binary classifier that returns a binary result indicating either related or not related;
classifying said text within said plurality of web pages using said classifier to identify a group of text associated with said image;
creating a document object model; and
using said document object model to select at least one member of said training set.
Dependent claims: 18, 19, 20
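The binary classifier recited in claim 17 returns either related or not related. The claim does not name an algorithm, so the sketch below swaps in a tiny multinomial naive Bayes model purely as an assumption; the BinaryTextClassifier class and its method names are hypothetical. It expects at least one positive and one negative example, as the claim requires.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class BinaryTextClassifier:
    """Tiny multinomial naive Bayes; an assumed stand-in for the claimed classifier."""

    def train(self, positives, negatives):
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: len(positives), False: len(negatives)}
        for label, examples in ((True, positives), (False, negatives)):
            for text in examples:
                self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_score(self, tokens, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / (self.docs[True] + self.docs[False]))
        for tok in tokens:
            # Laplace smoothing keeps unseen words from zeroing out the score.
            score += math.log((self.counts[label][tok] + 1) / (total + len(self.vocab)))
        return score

    def is_related(self, text):
        """Return True for related, False for not related."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        return self._log_score(tokens, True) > self._log_score(tokens, False)

clf = BinaryTextClassifier()
clf.train(
    positives=["The old lighthouse at dusk", "Lighthouse keeper lights the lamp"],
    negatives=["Site navigation and login links", "Copyright notice and contact details"],
)
body_result = clf.is_related("The lighthouse was built in 1890")
footer_result = clf.is_related("Login and contact links")
```

Text with no words in the training vocabulary ties on both scores and is reported as not related, which is a conservative default chosen for this sketch.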
Specification