Query generation using structural similarity between documents
Abstract
Methods, systems, and apparatus, including computer program products, for generating synthetic queries using seed queries and structural similarity between documents are described. In one aspect, a method includes identifying embedded coding fragments (e.g., HTML tags) from a structured document and a seed query; generating one or more query templates, each query template corresponding to at least one coding fragment, the query template including a generative rule to be used in generating candidate synthetic queries; generating the candidate synthetic queries by applying the query templates to other documents that are hosted on the same web site as the document; identifying terms that match the structure of the query templates as candidate synthetic queries; measuring a performance for each of the candidate synthetic queries; and designating as synthetic queries the candidate synthetic queries that have performance measurements exceeding a performance threshold.
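The last steps of the abstract (restricting template application to documents on the same web site, then keeping only candidates whose measured performance exceeds a threshold) can be illustrated with a minimal Python sketch. The abstract does not say how "same web site" is decided or how performance is measured, so the host-name comparison, the function names `same_site` and `designate_synthetic_queries`, and the score values below are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def same_site(seed_url: str, other_url: str) -> bool:
    """Assumption: two documents are on the same site when their host names match."""
    seed_host = urlparse(seed_url).hostname
    return seed_host is not None and seed_host == urlparse(other_url).hostname


def designate_synthetic_queries(
    candidates: Iterable[str],
    performance: Dict[str, float],
    threshold: float,
) -> List[str]:
    """Keep candidates whose performance measurement strictly exceeds the threshold.

    `performance` is a hypothetical lookup of precomputed scores; candidates
    without a measurement are treated as scoring 0.0.
    """
    return [q for q in candidates if performance.get(q, 0.0) > threshold]


# Hypothetical URLs and scores, for illustration only.
seed_url = "https://shop.example.com/products/blender"
other_urls = [
    "https://shop.example.com/products/toaster",
    "https://other.example.org/products/kettle",
]
eligible = [u for u in other_urls if same_site(seed_url, u)]

scores = {"zenith toaster x": 0.8, "free shipping": 0.1}
kept = designate_synthetic_queries(scores.keys(), scores, threshold=0.5)
print(eligible, kept)
```

A strict comparison is used because the abstract speaks of measurements "exceeding" the threshold; what the performance measurement actually is remains unspecified in this excerpt.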
20 Claims
1. A method executed by a data processing apparatus, comprising:
- identifying a seed query for a structured document based on a performance of the seed query with respect to the structured document;
- identifying, by the one or more computers, one or more embedded coding fragments for the structured document using the seed query, each identified embedded coding fragment specifying a structure of a portion of the structured document that includes at least one term of the seed query;
- generating, by the one or more computers, one or more query templates, each query template corresponding to at least one of the identified embedded coding fragments, the query template including the structure of the corresponding at least one embedded coding fragment and a generative rule to be used in generating one or more synthetic queries;
- generating, by the one or more computers, the one or more synthetic queries using the one or more query templates and other structured documents, the generating comprising, for each query template:
  - identifying a portion of a particular structured document that includes the structure specified by the corresponding embedded coding fragment; and
  - generating a synthetic query using text contained in the portion of the particular structured document and specified by the generative rule; and
- storing, by the one or more computers, at least one of the one or more synthetic queries in a query store.
Dependent claims: 2, 3, 4, 5, 6, 7
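The steps of claim 1 can be illustrated with a small, hedged Python sketch. It assumes HTML documents, treats the tag path leading to a text node as the "embedded coding fragment", and uses a single generative rule (emit the whole matching text, lowercased). The names `QueryTemplate`, `build_templates`, `generate_synthetic_queries`, and `whole_text_rule` are hypothetical, the seed query is taken as given rather than selected by performance, and the claim itself does not prescribe any of these choices.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, List, Tuple

# Tags with no closing tag; they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathTextCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.items: List[Tuple[Tuple[str, ...], str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append((tuple(self.stack), text))


@dataclass(frozen=True)
class QueryTemplate:
    path: Tuple[str, ...]        # structure taken from the coding fragment
    rule: Callable[[str], str]   # generative rule applied to matching text


def text_nodes(html: str):
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def whole_text_rule(text: str) -> str:
    """Assumed rule: use the full text of the portion, lowercased."""
    return " ".join(text.lower().split())


def build_templates(seed_html: str, seed_query: str) -> List[QueryTemplate]:
    """One template per tag path whose text contains a seed query term."""
    terms = set(seed_query.lower().split())
    templates = {}
    for path, text in text_nodes(seed_html):
        if terms & set(text.lower().split()):
            templates.setdefault(path, QueryTemplate(path, whole_text_rule))
    return list(templates.values())


def generate_synthetic_queries(templates, other_html: str) -> List[str]:
    """Apply each template's rule to text found under the same structure."""
    rules = {t.path: t.rule for t in templates}
    return [rules[p](text) for p, text in text_nodes(other_html) if p in rules]


seed_html = "<html><body><h1>Acme Blender 3000</h1><p>A great kitchen tool.</p></body></html>"
other_html = "<html><body><h1>Zenith Toaster X</h1><p>Toasts bread quickly.</p></body></html>"

templates = build_templates(seed_html, "acme blender")
query_store = set(generate_synthetic_queries(templates, other_html))  # stand-in query store
print(query_store)
```

In this toy input only the `h1` text contains a seed term, so the single template captures the path `html > body > h1`, and applying it to the second page yields the query `zenith toaster x`. A fuller treatment would likely also compare attributes or allow rules that pick out only part of the text.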
8. A system, comprising:
- a data processing apparatus; and
- a memory storage apparatus in data communication with the data processing apparatus, the memory storage apparatus storing instructions executable by the data processing apparatus and that upon such execution cause the data processing apparatus to perform operations comprising:
  - identifying a seed query for a structured document based on a performance of the seed query with respect to the structured document;
  - identifying one or more embedded coding fragments for the structured document using the seed query, each identified embedded coding fragment specifying a structure of a portion of the structured document that includes at least one term of the seed query;
  - generating one or more query templates, each query template corresponding to at least one of the identified embedded coding fragments, the query template including the structure of the corresponding at least one embedded coding fragment and a generative rule to be used in generating one or more synthetic queries;
  - generating the one or more synthetic queries using the one or more query templates and other structured documents, the generating comprising, for each query template:
    - identifying a portion of a particular structured document that includes the structure specified by the corresponding embedded coding fragment; and
    - generating a synthetic query using text contained in the portion of the particular structured document and specified by the generative rule; and
  - storing at least one of the one or more synthetic queries in a query store.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by a data processing apparatus cause the data processing apparatus to perform operations comprising:
- identifying a seed query for a structured document based on a performance of the seed query with respect to the structured document;
- identifying one or more embedded coding fragments for the structured document using the seed query, each identified embedded coding fragment specifying a structure of a portion of the structured document that includes at least one term of the seed query;
- generating one or more query templates, each query template corresponding to at least one of the identified embedded coding fragments, the query template including the structure of the corresponding at least one embedded coding fragment and a generative rule to be used in generating one or more synthetic queries;
- generating the one or more synthetic queries using the one or more query templates and other structured documents, the generating comprising, for each query template:
  - identifying a portion of a particular structured document that includes the structure specified by the corresponding embedded coding fragment; and
  - generating a synthetic query using text contained in the portion of the particular structured document and specified by the generative rule; and
- storing at least one of the one or more synthetic queries in a query store.
Dependent claims: 16, 17, 18, 19, 20
Specification