Search with joint image-audio queries
First Claim
1. A computer-implemented method performed by a data processing apparatus, the method comprising:
receiving, by a data processing apparatus, a joint image-audio query sent to the data processing apparatus from a client device separate from the data processing apparatus, the joint image-audio query including query image data defining a query image and query audio data defining query audio, wherein:
the query image data is an image file;
the query audio data is an audio recording file of speech; and
the query image data and the query audio data are paired as the joint image-audio query at the client device and then sent to the data processing apparatus;
determining, by the data processing apparatus, query image feature data from the query image data included in the received joint image-audio query, the query image feature data describing image features of the query image;
determining, by the data processing apparatus, query audio feature data from the query audio data included in the received joint image-audio query, the query audio feature data including text derived from the audio recording of speech;
providing, by the data processing apparatus, the query image feature data and the query audio feature data to a joint image-audio relevance model that i) receives, as input, image feature data and audio feature data, and ii) is trained to generate relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the audio recording of speech to text of the resource;
identifying, by the data processing apparatus, resources responsive to the joint image-audio query based, in part, on a corresponding relevance score that was determined by the joint image-audio relevance model, wherein each identified resource includes i) resource image data defining a resource image for the identified resource, and ii) text data defining resource text for the identified resource, and wherein each relevance score for each identified resource is a measure of the relevance of the corresponding resource image data and text data defining the resource text to the query image feature data and the text derived from the audio recording of speech;
ordering, by the data processing apparatus, the identified resources according to the corresponding relevance scores; and
providing, by the data processing apparatus, data defining search results indicating the order of the identified resources to the client device.
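The claimed pipeline — determine image features from the query image, derive text from the speech recording, score each resource on combined image and text relevance, order by score, and return the ordered results — can be sketched as follows. This is an illustrative sketch only, not the patented implementation: `extract_image_features`, `transcribe_speech`, and the fixed 0.5/0.5 scoring weights are toy stand-ins for the trained joint image-audio relevance model.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    image_features: list[float]  # precomputed descriptors for the resource image
    text: str                    # resource text (e.g., page title and body)

def extract_image_features(image: bytes) -> list[float]:
    # Toy stand-in for a visual feature extractor: a normalized
    # 4-bucket byte histogram instead of real image descriptors.
    hist = [0.0] * 4
    for b in image:
        hist[b % 4] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def transcribe_speech(audio: bytes) -> str:
    # Toy stand-in for a speech recognizer; treats the "audio" bytes
    # as UTF-8 text purely for illustration.
    return audio.decode("utf-8", errors="ignore")

def relevance(query_feats: list[float], query_text: str, res: Resource) -> float:
    # Combined relevance per the claim: image-feature similarity plus
    # relevance of the derived text to the resource text. The equal
    # weights are arbitrary; the claim leaves this to a trained model.
    img_sim = sum(a * b for a, b in zip(query_feats, res.image_features))
    q_terms = set(query_text.lower().split())
    r_terms = set(res.text.lower().split())
    txt_sim = len(q_terms & r_terms) / (len(q_terms) or 1)
    return 0.5 * img_sim + 0.5 * txt_sim

def handle_joint_query(query_image: bytes, query_audio: bytes,
                       resources: list[Resource]) -> list[int]:
    # Steps of claim 1: determine feature data, score each resource,
    # order by relevance score, and return the ordering.
    feats = extract_image_features(query_image)
    text = transcribe_speech(query_audio)
    scores = [relevance(feats, text, r) for r in resources]
    return sorted(range(len(resources)), key=lambda i: scores[i], reverse=True)
```

For example, given two resources with identical image features, the one whose text matches the spoken query ("red shoes") ranks first.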
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for processing joint image-audio queries. In one aspect, a method includes receiving, from a client device, a joint image-audio query including query image data and query audio data. Query image feature data is determined from the query image data. Query audio feature data is determined from the query audio data. The query image feature data and the query audio feature data are provided to a joint image-audio relevance model trained to generate relevance scores for a plurality of resources, each resource including resource image data defining a resource image for the resource and text data defining resource text for the resource. Each relevance score is a measure of the relevance of the corresponding resource to the joint image-audio query. The resources are ordered according to the relevance scores, and data defining search results indicating the order of the resources is provided to the client device.
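The abstract's client-side step — pairing an image file with a speech recording into a single joint image-audio query before sending it to the data processing apparatus — could be packaged as in the sketch below. The JSON field names and base64 transport are assumptions for illustration; the patent does not specify a wire format.

```python
import base64
import json

def build_joint_query(image_bytes: bytes, audio_bytes: bytes) -> str:
    # Client side: pair the image file and the audio recording file
    # into one joint image-audio query payload (field names are
    # illustrative, not from the patent).
    return json.dumps({
        "query_image": base64.b64encode(image_bytes).decode("ascii"),
        "query_audio": base64.b64encode(audio_bytes).decode("ascii"),
    })

def parse_joint_query(payload: str) -> tuple[bytes, bytes]:
    # Server side: recover the paired image and audio data from the
    # received joint query.
    data = json.loads(payload)
    return (base64.b64decode(data["query_image"]),
            base64.b64decode(data["query_audio"]))
```

A round trip preserves both parts of the pair, so the server can hand the image bytes to the feature extractor and the audio bytes to the speech recognizer.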
13 Claims
1. A computer-implemented method performed by a data processing apparatus, the method comprising:
receiving, by a data processing apparatus, a joint image-audio query sent to the data processing apparatus from a client device separate from the data processing apparatus, the joint image-audio query including query image data defining a query image and query audio data defining query audio, wherein:
the query image data is an image file;
the query audio data is an audio recording file of speech; and
the query image data and the query audio data are paired as the joint image-audio query at the client device and then sent to the data processing apparatus;
determining, by the data processing apparatus, query image feature data from the query image data included in the received joint image-audio query, the query image feature data describing image features of the query image;
determining, by the data processing apparatus, query audio feature data from the query audio data included in the received joint image-audio query, the query audio feature data including text derived from the audio recording of speech;
providing, by the data processing apparatus, the query image feature data and the query audio feature data to a joint image-audio relevance model that i) receives, as input, image feature data and audio feature data, and ii) is trained to generate relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the audio recording of speech to text of the resource;
identifying, by the data processing apparatus, resources responsive to the joint image-audio query based, in part, on a corresponding relevance score that was determined by the joint image-audio relevance model, wherein each identified resource includes i) resource image data defining a resource image for the identified resource, and ii) text data defining resource text for the identified resource, and wherein each relevance score for each identified resource is a measure of the relevance of the corresponding resource image data and text data defining the resource text to the query image feature data and the text derived from the audio recording of speech;
ordering, by the data processing apparatus, the identified resources according to the corresponding relevance scores; and
providing, by the data processing apparatus, data defining search results indicating the order of the identified resources to the client device.
View Dependent Claims (2, 3, 4, 5, 6)
7. A system, comprising:
a data processing apparatus; and
a computer storage medium encoded with a computer program, the program comprising instructions that when executed by the data processing apparatus cause the data processing apparatus to perform operations comprising:
receiving a joint image-audio query sent to the data processing apparatus from a client device separate from the data processing apparatus, the joint image-audio query including query image data defining a query image and query audio data defining query audio, wherein:
the query image data is an image file;
the query audio data is an audio recording file of speech; and
the query image data and the query audio data are paired as the joint image-audio query at the client device and then sent to the data processing apparatus;
determining query image feature data from the query image data included in the received joint image-audio query, the query image feature data describing image features of the query image;
determining query audio feature data from the query audio data included in the received joint image-audio query, the query audio feature data including text derived from the audio recording of speech;
providing the query image feature data and the query audio feature data to a joint image-audio relevance model that i) receives, as input, image feature data and audio feature data, and ii) is trained to generate relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the audio recording of speech to text of the resource;
identifying resources responsive to the joint image-audio query based, in part, on a corresponding relevance score that was determined by the joint image-audio relevance model, wherein each identified resource includes resource image data defining a resource image for the identified resource and text data defining resource text for the identified resource, and wherein each relevance score for each identified resource is a measure of the relevance of the corresponding resource image data and text data defining the resource text to the query image feature data and the text derived from the audio recording of speech;
ordering the identified resources according to the corresponding relevance scores; and
providing data defining search results indicating the order of the identified resources to the client device.
View Dependent Claims (8, 9, 10, 11, 12)
13. A computer storage device encoded with a computer program, the program comprising instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations comprising:
receiving a joint image-audio query sent to the data processing apparatus from a client device separate from the data processing apparatus, the joint image-audio query including query image data defining a query image and query audio data defining query audio, wherein:
the query image data is an image file;
the query audio data is an audio recording file of speech; and
the query image data and the query audio data are paired as the joint image-audio query at the client device and then sent to the data processing apparatus;
determining query image feature data from the query image data included in the received joint image-audio query, the query image feature data describing image features of the query image;
determining query audio feature data from the query audio data included in the received joint image-audio query, the query audio feature data including text derived from the audio recording of speech;
providing the query image feature data and the query audio feature data to a joint image-audio relevance model that i) receives, as input, image feature data and audio feature data, and ii) is trained to generate relevance scores for a plurality of resources based on a combined relevance of the query image feature data to image feature data of the resource and the text derived from the audio recording of speech to text of the resource;
identifying resources responsive to the joint image-audio query based, in part, on a corresponding relevance score that was determined by the joint image-audio relevance model, wherein each identified resource includes resource image data defining a resource image for the identified resource and text data defining resource text for the identified resource, and wherein each relevance score for each identified resource is a measure of the relevance of the corresponding resource image data and text data defining the resource text to the query image feature data and the text derived from the audio recording of speech;
ordering the identified resources according to the corresponding relevance scores; and
providing data defining search results indicating the order of the identified resources to the client device.
Specification