Query generation using structural similarity between documents
First Claim
1. A method executed by a data processing apparatus, comprising:
receiving, by one or more computers, a set of seed queries associated with a structured document, the structured document including embedded coding and being hosted on a website, each seed query including one or more terms;
identifying, by the one or more computers, one or more embedded coding fragments from the structured document for each seed query, each identified embedded coding fragment for a seed query specifying a structure of a portion of the structured document that includes at least one term of the seed query;
generating, by the one or more computers, one or more query templates, each query template corresponding to at least one identified embedded coding fragment, the query template including the structure of the corresponding at least one embedded coding fragment and a generative rule to be used in generating one or more candidate synthetic queries;
generating, by the one or more computers, the one or more candidate synthetic queries using the one or more query templates and other structured documents hosted on the website, the generating comprising, for each query template:
identifying a portion of a particular structured document hosted on the website that includes the structure specified by the corresponding embedded coding fragment; and
generating a candidate synthetic query using text contained in the portion of the particular structured document and specified by the generative rule;
measuring, by the one or more computers, a performance in a search operation of each of the one or more candidate synthetic queries;
designating, by the one or more computers, as a synthetic query a candidate synthetic query that has a performance measurement that exceeds a performance threshold; and
storing, by the one or more computers, the designated synthetic query in a query store.
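The template steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that an "embedded coding fragment" can be approximated by an HTML tag path (tag names plus `class` attributes), and that the generative rule is simply "emit the text found at that path". The names `PathTextParser`, `path_text_pairs`, `make_templates`, and `candidate_queries` are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the path stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class PathTextParser(HTMLParser):
    """Collects (structural path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements, e.g. ["html", "body", "h1.title"]
        self.pairs = []   # (path, text) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1].split(".")[0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def path_text_pairs(html):
    parser = PathTextParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def make_templates(seed_queries, seed_html):
    """Assumed template form: the path of any fragment whose text shares
    at least one term with a seed query."""
    templates = set()
    for path, text in path_text_pairs(seed_html):
        words = set(text.lower().split())
        for query in seed_queries:
            if words & set(query.lower().split()):
                templates.add(path)
    return templates


def candidate_queries(templates, other_html):
    """Assumed generative rule: emit the lower-cased text at each template path."""
    return [text.lower()
            for path, text in path_text_pairs(other_html)
            if path in templates]
```

For example, with a seed query "blue widget price" and a seed page whose `<h1 class="title">` reads "Blue Widget", the only template is the path `html/body/h1.title`. Applying it to a sibling page whose heading reads "Red Gadget" yields the candidate query "red gadget".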
Abstract
Methods, systems, and apparatus, including computer program products, for generating synthetic queries using seed queries and structural similarity between documents are described. In one aspect, a method includes identifying embedded coding fragments (e.g., HTML tags) from a structured document and a seed query; generating one or more query templates, each query template corresponding to at least one coding fragment, the query template including a generative rule to be used in generating candidate synthetic queries; generating the candidate synthetic queries by applying the query templates to other documents that are hosted on the same website as the document; identifying terms that match the structure of the query templates as candidate synthetic queries; measuring a performance for each of the candidate synthetic queries; and designating as synthetic queries the candidate synthetic queries that have performance measurements exceeding a performance threshold.
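The measuring and designating steps can be sketched as follows. The text above does not say how performance is measured, so this sketch assumes one possible measure: the reciprocal rank of the candidate's source document in the search results. The `search` callable, the `(query, source_url)` pairing, and the list used as a query store are all hypothetical.

```python
def reciprocal_rank(results, target_url):
    """Return 1/rank of target_url in a ranked result list, or 0.0 if absent."""
    for rank, url in enumerate(results, start=1):
        if url == target_url:
            return 1.0 / rank
    return 0.0


def designate_synthetic_queries(candidates, search, threshold):
    """Keep candidates whose assumed performance measure exceeds the threshold.

    candidates: iterable of (query, source_url) pairs
    search:     hypothetical callable mapping a query to a ranked list of URLs
    """
    query_store = []
    for query, source_url in candidates:
        score = reciprocal_rank(search(query), source_url)
        if score > threshold:   # "exceeds" is read as a strict comparison
            query_store.append(query)
    return query_store
```

With a threshold of 0.5, a candidate whose source document is the top result (score 1.0) is stored, while one whose source document ranks third (score about 0.33) is discarded.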
28 Claims
1. A method executed by a data processing apparatus, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16.
17. A system, comprising:
one or more computers configured to perform operations comprising:
receiving a set of seed queries associated with a structured document, the structured document including embedded coding and being hosted on a website, each seed query including one or more terms;
identifying one or more embedded coding fragments from the structured document for each seed query, each identified embedded coding fragment for a seed query specifying a structure of a portion of the structured document that includes at least one term of the seed query;
generating one or more query templates, each query template corresponding to at least one identified embedded coding fragment, the query template including the structure of the corresponding at least one embedded coding fragment and a generative rule to be used in generating one or more candidate synthetic queries;
generating the one or more candidate synthetic queries using the one or more query templates and other structured documents hosted on the website, the generating comprising, for each query template:
identifying a portion of a particular structured document hosted on the website that includes the structure specified by the corresponding embedded coding fragment; and
generating a candidate synthetic query using text contained in the portion of the particular structured document and specified by the generative rule;
measuring a performance in a search operation of each of the one or more candidate synthetic queries;
designating as a synthetic query a candidate synthetic query that has a performance measurement that exceeds a performance threshold; and
storing the designated synthetic query in a query store.
Dependent claims: 18, 19, 20, 21, 22.
23. A computer program product, tangibly stored on a storage device, configured to cause one or more processors to perform operations comprising:
receiving a set of seed queries associated with a structured document, the structured document including embedded coding and being hosted on a website, each seed query including one or more terms;
identifying one or more embedded coding fragments from the structured document for each seed query, each identified embedded coding fragment for a seed query specifying a structure of a portion of the structured document that includes at least one term of the seed query;
generating one or more query templates, each query template corresponding to at least one identified embedded coding fragment, the query template including the structure of the corresponding embedded coding fragment and a generative rule to be used in generating one or more candidate synthetic queries;
generating the one or more candidate synthetic queries using the one or more query templates and other structured documents hosted on the website, the generating comprising, for each query template:
identifying a portion of a particular structured document hosted on the website that includes the structure specified by the corresponding coding fragment; and
generating a candidate synthetic query using text contained in the portion of the particular structured document and specified by the generative rule;
measuring a performance in a search operation of each of the one or more candidate synthetic queries;
designating as a synthetic query a candidate synthetic query that has a performance measurement that exceeds a performance threshold; and
storing the designated synthetic query in a query store.
Dependent claims: 24, 25, 26, 27, 28.
Specification