System for automatically generating queries
Abstract
A method, system, and article of manufacture therefor are disclosed for automatically generating a query from document content.
20 Claims
1. A method for automatically generating a query from selected document content, comprising:
receiving, using a processor, input indicating selection of a subset of a plurality of classes of entities;
each class of entities in the plurality of classes of entities including entities that characterize an enrichment theme;
identifying, using a processor, a set of entities from the selected subset of the plurality of classes of entities that appear in the selected document content for searching additional information related thereto using an information retrieval system;
analyzing, using the processor, a segment of the selected document content that surrounds each entity in the set of entities to extract a set of terms identifying one or more facets in the selected document content concerning each entity;
ranking, using the processor, terms in the set of terms corresponding to each entity in the set of entities by their frequency of occurrence in a reference corpus;
producing, using the processor, an aspect vector by selecting a subset of the set of ranked terms in accordance with a predefined frequency criteria;
formulating, using the processor, the query by augmenting the set of entities with the subset of the set of terms in the aspect vector to contextualize a search at the information retrieval system for information concerning the set of entities.
2. The method according to claim 1, wherein the number of facets used to contextualize the search at the information retrieval system is limited to a predefined number of facets.
3. The method according to claim 1, wherein entities of an enrichment theme are entities of a predefined type.
4. The method according to claim 1, wherein the set of terms identifying the one or more facets of document content include proper names.
5. The method according to claim 1, wherein the set of terms identifying the one or more facets of document content include phrases.
6. The method according to claim 1, wherein the subset of the set of terms identifying the one or more facets of document content include rare phrases, which appear below a defined frequency in the reference corpus.
7. The method according to claim 1, wherein the set of terms identifying the one or more facets of document content include dates.
8. The method according to claim 1, wherein the set of terms identifying the one or more facets of document content include numbers.
9. The method according to claim 1, wherein the set of terms identifying the one or more facets of document content include geographic locations.
10. The method according to claim 1, wherein the subset of the set of terms identifying the one or more facets of document content include rare words, which appear below a defined frequency in the reference corpus.
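The steps recited in method claims 1, 2, 6 and 10 can be illustrated with a minimal Python sketch. It is not the patented implementation: the entity lexicon `ENTITY_CLASSES`, the fixed token window, the "at or below `max_freq`" reading of the frequency criterion, the treatment of unseen terms as frequency 0, and every function name here are assumptions made purely for illustration.

```python
import re

# Hypothetical entity classes ("enrichment themes"); not taken from the patent.
ENTITY_CLASSES = {
    "companies": {"Acme", "Globex"},
    "cities": {"Paris", "Berlin"},
}

def tokenize(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def identify_entities(tokens, selected_classes):
    """Entities from the selected classes that appear in the tokens, in order of first appearance."""
    wanted = set().union(*(ENTITY_CLASSES[c] for c in selected_classes))
    found = []
    for tok in tokens:
        if tok in wanted and tok not in found:
            found.append(tok)
    return found

def context_terms(tokens, entity, window=3):
    """Lowercased terms within `window` tokens on either side of each occurrence of the entity."""
    terms = []
    for i, tok in enumerate(tokens):
        if tok == entity:
            segment = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            for t in segment:
                t = t.lower()
                if t not in terms:
                    terms.append(t)
    return terms

def aspect_vector(terms, corpus_freq, max_freq):
    """Rank terms by reference-corpus frequency (rarest first) and keep those at or below max_freq.

    Assumption: terms missing from corpus_freq count as frequency 0.
    """
    ranked = sorted(terms, key=lambda t: corpus_freq.get(t, 0))
    return [t for t in ranked if corpus_freq.get(t, 0) <= max_freq]

def formulate_query(text, selected_classes, corpus_freq, max_freq, max_facets=None):
    """Augment the identified entities with their aspect terms, optionally capped at max_facets."""
    tokens = tokenize(text)
    entities = identify_entities(tokens, selected_classes)
    entity_words = {e.lower() for e in entities}
    aspects = []
    for entity in entities:
        for term in aspect_vector(context_terms(tokens, entity), corpus_freq, max_freq):
            if term not in aspects and term not in entity_words:
                aspects.append(term)
    return " ".join(entities + aspects[:max_facets])

# Made-up reference-corpus counts for the example.
corpus_freq = {"a": 100000, "with": 50000, "in": 80000, "on": 60000,
               "announced": 900, "merger": 40, "paris": 70}
text = "Acme announced a merger with Globex in Paris on Monday"
print(formulate_query(text, ["companies"], corpus_freq, max_freq=100))
# prints "Acme Globex merger paris"
```

Passing `max_facets=1` would cap the added context at a single term, in the spirit of the facet limit in claim 2; a real system would presumably use a proper entity recognizer and corpus statistics rather than the toy lexicon and counts assumed here.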
11. A system for automatically generating a query from selected document content, comprising:
an entity extractor, comprising at least one processor, for (i) receiving input indicating selection of a subset of a plurality of classes of entities; each class of entities in the plurality of classes of entities including entities that characterize an enrichment theme, and (ii) identifying a set of entities from the selected subset of the plurality of classes of entities that appear in the selected document content for searching additional information related thereto using an information retrieval system;
an aspect vector generator, comprising at least one processor, for (i) analyzing a segment of the selected document content that surrounds each entity in the set of entities to extract a set of terms identifying one or more facets in the selected document content concerning each entity, (ii) ranking terms in the set of terms corresponding to each entity in the set of entities by their frequency of occurrence in a reference corpus, and (iii) producing an aspect vector by selecting a subset of the set of ranked terms in accordance with a predefined frequency criteria;
a query generator, comprising at least one processor, for formulating the query by augmenting the set of entities with the subset of the set of terms in the aspect vector to contextualize a search at the information retrieval system for information concerning the set of entities.
12. The system according to claim 11, wherein the number of facets used to contextualize the search at the information retrieval system is limited to a predefined number of facets.
13. The system according to claim 11, wherein entities of an enrichment theme are entities of a predefined type.
14. The system according to claim 11, wherein the set of terms identifying the one or more facets of document content include proper names.
15. The system according to claim 11, wherein the set of terms identifying the one or more facets of document content include phrases.
16. The system according to claim 11, wherein the subset of the set of terms identifying the one or more facets of document content include rare phrases, which appear below a defined frequency in the reference corpus.
17. The system according to claim 11, wherein the set of terms identifying the one or more facets of document content include dates.
18. The system according to claim 11, wherein the set of terms identifying the one or more facets of document content include numbers.
19. The system according to claim 11, wherein the set of terms identifying the one or more facets of document content include geographic locations.
20. The system according to claim 11, wherein the subset of the set of terms identifying the one or more facets of document content include rare words, which appear below a defined frequency in the reference corpus.
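The three components named in system claim 11 (entity extractor, aspect vector generator, query generator) might be wired together roughly as follows. This is a sketch under stated assumptions, not the disclosed system: the class and method names, the dictionary-based lexicon, the two-token window, and the "at or below `max_freq`" selection rule are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EntityExtractor:
    classes: dict  # hypothetical lexicon: class name -> set of entity strings

    def extract(self, tokens, selected):
        """Entities of the selected classes found in tokens, in order of first appearance."""
        wanted = set().union(*(self.classes[name] for name in selected))
        return [t for t in dict.fromkeys(tokens) if t in wanted]

@dataclass
class AspectVectorGenerator:
    corpus_freq: dict  # assumed reference-corpus counts per term
    max_freq: int      # assumed frequency criterion: keep terms at or below this count
    window: int = 2    # assumed size of the surrounding segment, in tokens

    def generate(self, tokens, entity):
        """Terms near the entity, ranked rarest first and filtered by max_freq."""
        nearby = []
        for i, tok in enumerate(tokens):
            if tok == entity:
                nearby += tokens[max(0, i - self.window):i] + tokens[i + 1:i + 1 + self.window]
        ranked = sorted(dict.fromkeys(nearby), key=lambda t: self.corpus_freq.get(t, 0))
        return [t for t in ranked if self.corpus_freq.get(t, 0) <= self.max_freq]

class QueryGenerator:
    def formulate(self, entities, aspect_vectors):
        """Augment the entities with the aspect terms, skipping duplicates."""
        terms = list(entities)
        for vector in aspect_vectors:
            terms += [t for t in vector if t not in terms]
        return " ".join(terms)

# Example wiring with made-up data.
tokens = "Globex signed merger deal in Berlin".split()
extractor = EntityExtractor({"companies": {"Globex"}, "cities": {"Berlin"}})
generator = AspectVectorGenerator({"signed": 500, "merger": 40, "deal": 300, "in": 80000}, max_freq=400)
entities = extractor.extract(tokens, ["companies", "cities"])
query = QueryGenerator().formulate(entities, [generator.generate(tokens, e) for e in entities])
print(query)  # prints "Globex Berlin merger deal"
```

Keeping the components separate mirrors the claim's structure: the entity lexicon, the corpus statistics, and the query syntax can each be swapped without touching the other two.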
Specification