Method and system for enhanced data searching
Abstract
Methods and systems for syntactically indexing and searching data sets to achieve more accurate search results and for indexing and searching data sets using entity tags alone or in combination therewith are provided. Example embodiments provide a Syntactic Query Engine (“SQE”) that parses, indexes, and stores a data set, as well as processes natural language queries subsequently submitted against the data set. The SQE comprises a Query Preprocessor, a Data Set Preprocessor, a Query Builder, a Data Set Indexer, an Enhanced Natural Language Parser (“ENLP”), a data set repository, and, in some embodiments, a user interface. After preprocessing the data set, the SQE parses the data set according to a variety of levels of parsing and determines as appropriate the entity tags and syntactic and grammatical roles of each term to generate enhanced data representations for each object in the data set. The SQE indexes and stores these enhanced data representations in the data set repository. Upon subsequently receiving a query, the SQE parses the query also using a variety of parsing levels and searches the indexed stored data set to locate data that contains similar terms used in similar grammatical roles and/or with similar entity tag types as indicated by the query. In this manner, the SQE is able to achieve more contextually accurate search results more frequently than using traditional search engines.
121 Claims
1. A method in a computer system for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, comprising:
receiving a designation of a plurality of entity tags; and
for each sentence, parsing the sentence to generate a parse structure having a plurality of syntactic elements;
determining from the parse structure a set of syntactic elements that correspond to the designated entity tags; and
storing in an enhanced data representation data structure a representation of each association between a syntactic element of the determined set of syntactic elements and the type of the entity tag that corresponds to the syntactic element, the syntactic element representing the value of the corresponding entity tag, such that the sentence is represented in the data structure by at least one entity tag. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
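The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the "parse structure" here is only a token list, and the names `EntityTag`, `DESIGNATED_TAGS`, `parse_sentence`, `tag_sentence`, and `transform_document`, along with the sample tag types, are hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityTag:
    type: str   # entity tag type, e.g. "person"
    value: str  # the syntactic element that supplies the tag's value


# Hypothetical designation of entity tags: tag type -> known surface forms.
DESIGNATED_TAGS = {
    "person": {"alice", "bob"},
    "country": {"france", "japan"},
}


def parse_sentence(sentence):
    """Toy stand-in for a parser: the 'parse structure' is just a token list."""
    return re.findall(r"[A-Za-z]+", sentence)


def tag_sentence(sentence, designated=DESIGNATED_TAGS):
    """Return the entity tags whose surface forms match tokens of the sentence."""
    tags = []
    for element in parse_sentence(sentence):
        for tag_type, forms in designated.items():
            if element.lower() in forms:
                tags.append(EntityTag(tag_type, element))
    return tags


def transform_document(document):
    """Map each sentence index to the entity tags that represent the sentence."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return {i: tag_sentence(s) for i, s in enumerate(sentences)}
```

For example, `transform_document("Alice visited France. Bob stayed home.")` associates the first sentence with a person tag and a country tag, and the second with a person tag only.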
25. A computer-readable memory medium containing instructions for controlling a computer processor to transform a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, by:
receiving a designation of a plurality of entity tags; and
for each sentence, parsing the sentence to generate a parse structure having a plurality of syntactic elements;
determining from the parse structure a set of syntactic elements that correspond to the designated entity tags; and
storing in an enhanced data representation data structure a representation of each association between a syntactic element of the determined set of syntactic elements and the type of the entity tag that corresponds to the syntactic element, the syntactic element representing the value of the corresponding entity tag, such that the sentence is represented in the data structure by at least one entity tag. - View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48)
49. A syntactic query engine for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, comprising:
a parser that is structured to receive a designation of a plurality of entity tags; and
decompose each sentence to generate a parse structure for the sentence having a plurality of syntactic elements;
determine from the structure of the parse structure a set of syntactic elements that correspond to the designated entity tags; and
store, in an enhanced data representation data structure, a representation of each association between a syntactic element of the determined set of syntactic elements and the corresponding entity tag type, such that the sentence is represented in the data structure by at least one entity tag. - View Dependent Claims (50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
73. A method in a computer system for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, comprising:
receiving a designation of a plurality of entity tags and a designation of at least one grammatical role; and
for each sentence, parsing the sentence to generate a parse structure having a plurality of syntactic elements;
determining a set of meaningful terms of the sentence from these syntactic elements;
determining from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term;
determining which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and
storing in an enhanced data representation data structure a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role. - View Dependent Claims (74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85)
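Claim 73 adds grammatical roles to the entity tags. The sketch below shows one way the two kinds of association could be stored side by side. The role assignment is a crude heuristic (terms before the first known verb are treated as subjects, terms after it as objects) standing in for a real parser, and every name and word list in it is an assumption made for illustration.

```python
import re

# Hypothetical lexicons; a real system would derive these from a parser.
STOPWORDS = {"the", "a", "an", "in", "of", "to"}
VERBS = {"visited", "bought", "founded"}
ENTITY_FORMS = {"person": {"alice"}, "country": {"france"}}


def meaningful_terms(sentence):
    """Keep the tokens that are not stopwords."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return [t for t in tokens if t.lower() not in STOPWORDS]


def assign_roles(terms):
    """Crude role heuristic: subject before the first verb, object after it."""
    roles = {}
    seen_verb = False
    for term in terms:
        if term.lower() in VERBS:
            roles[term] = "verb"
            seen_verb = True
        else:
            roles[term] = "object" if seen_verb else "subject"
    return roles


def represent_sentence(sentence, designated_role="subject"):
    """Store (term, role) rows for the designated role and (term, tag type) rows."""
    terms = meaningful_terms(sentence)
    roles = assign_roles(terms)
    role_rows = [(t, r) for t, r in roles.items() if r == designated_role]
    entity_rows = [
        (t, tag_type)
        for t in terms
        for tag_type, forms in ENTITY_FORMS.items()
        if t.lower() in forms
    ]
    return {"roles": role_rows, "entities": entity_rows}
```

With the sentence "Alice visited the museum in France." and the designated role "subject", the result holds one role row for "Alice" and entity rows for "Alice" and "France".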
86. A computer-readable memory medium containing instructions for controlling a computer processor to transform a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, by:
receiving a designation of a plurality of entity tags and a designation of at least one grammatical role; and
for each sentence, parsing the sentence to generate a parse structure having a plurality of syntactic elements;
determining a set of meaningful terms of the sentence from these syntactic elements;
determining from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term;
determining which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and
storing in an enhanced data representation data structure a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role. - View Dependent Claims (88)
87. A syntactic query engine for transforming a document into a canonical representation using entity tags, each entity tag having a type and an associated value, the document having at least one sentence, each sentence having a plurality of terms, comprising:
a parser that is structured to receive a designation of a plurality of entity tags and a designation of at least one grammatical role;
decompose each sentence to generate a parse structure for the sentence having a plurality of syntactic elements;
determine a set of meaningful terms of the sentence from the syntactic elements;
determine from the structure of the parse structure and the syntactic elements a grammatical role for each meaningful term;
determine which meaningful terms correspond to the designated entity tags and which meaningful terms correspond to the designated grammatical role; and
store, in an enhanced data representation data structure, a representation of an association between the meaningful term that corresponds to the designated grammatical role and an association between a meaningful term and the type of a corresponding designated entity tag, the meaningful term associated with the entity tag type representing the value of the entity tag, such that the sentence is represented by at least one entity tag and one meaningful term having a grammatical role.
89. A data processing system comprising a computer processor and a memory, the memory containing structured data that stores a normalized representation of sentence data, the structured data being manipulated by the computer processor under the control of program code and stored in the memory as:
an entity table having a set of entity tag pairs, each pair having a term that is a value of a corresponding entity tag and an indication of an entity tag type of the corresponding entity tag.
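The entity table of claim 89 is a set of (term, entity tag type) pairs. As a rough sketch only, such a table could be held in an in-memory SQLite database using Python's standard `sqlite3` module. The table name `entity`, the column names, and the helper functions are assumptions made here and are not taken from the patent.

```python
import sqlite3


def build_entity_table(pairs):
    """Create an in-memory table of (term, entity tag type) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity ("
        " term TEXT NOT NULL,"
        " tag_type TEXT NOT NULL,"
        " PRIMARY KEY (term, tag_type))"
    )
    # The composite primary key plus OR IGNORE keeps the table a set of pairs.
    conn.executemany(
        "INSERT OR IGNORE INTO entity (term, tag_type) VALUES (?, ?)", pairs
    )
    conn.commit()
    return conn


def terms_of_type(conn, tag_type):
    """Return every stored term whose entity tag type matches, sorted by term."""
    rows = conn.execute(
        "SELECT term FROM entity WHERE tag_type = ? ORDER BY term", (tag_type,)
    )
    return [row[0] for row in rows]
```

Keying the table on the pair rather than on the term alone lets one term carry more than one entity tag type.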
90. A computer-readable memory medium containing instructions for controlling a computer processor to store a normalized data structure representing a document of a data set, the document having a plurality of sentences, comprising:
for each sentence, determining a set of terms of the sentence that correspond to a designated set of entity tags; and
storing sets of relationships between each determined term and its corresponding entity tag type in the normalized data structure so as to represent the entire sentence as entity tags.
91. A computer system for storing a normalized data structure representing a document of a data set, the document having a plurality of sentences, each sentence having a plurality of terms, comprising:
an enhanced parsing mechanism that determines a set of terms of the sentence that correspond to a designated set of entity tags; and
a storage mechanism structured to store sets of relationships between each determined term and its corresponding entity tag type in the normalized data structure so as to represent the entire sentence as entity tags. - View Dependent Claims (92)
93. A method in a computer system for searching a corpus of documents, each document having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, comprising:
receiving an indication of a plurality of consecutive sentences;
parsing the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus;
determining a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and
returning indications of the determined result sentences. - View Dependent Claims (94, 95, 96, 97)
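The overall flow of claim 93 (consecutive sentences in, ranked result sentences out) can be sketched as follows. This section does not define latent semantic regression, so the sketch substitutes plain bag-of-words cosine similarity as a stand-in. It should not be read as the claimed technique. The function names and the `top_k` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def terms(text):
    """Lowercased word tokens of a piece of text."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query_sentences, corpus_sentences, top_k=2):
    """Return indices of the corpus sentences most similar to the query sentences."""
    # Pool the search terms from all of the consecutive query sentences.
    query = Counter(t for s in query_sentences for t in terms(s))
    scored = [
        (cosine(query, Counter(terms(s))), i) for i, s in enumerate(corpus_sentences)
    ]
    # Drop sentences with no overlap; rank by score, breaking ties by position.
    ranked = sorted((x for x in scored if x[0] > 0), key=lambda x: (-x[0], x[1]))
    return [i for _, i in ranked[:top_k]]
```

Returning sentence indices rather than sentence text mirrors the claim's "indications of the determined result sentences".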
98. A computer-readable memory medium containing instructions for controlling a computer processor to search a corpus of documents, each document having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, by:
receiving an indication of a plurality of consecutive sentences;
parsing the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus;
determining a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and
returning indications of the determined result sentences. - View Dependent Claims (99, 100, 101, 102)
103. A query engine for searching a corpus of documents, each having a plurality of sentences, the corpus having an index of the plurality of sentences for the documents, comprising:
a parser that is structured to receive an indication of a plurality of consecutive sentences; and
decompose the indicated plurality of consecutive sentences to generate a plurality of search terms for searching the document corpus; and
a postprocessor that is structured to determine a plurality of result sentences in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the sentences in the corpus of documents; and
return indications of the determined result sentences. - View Dependent Claims (104, 105, 106, 107)
108. A method in a networked computer environment for searching a corpus of documents, comprising:
receiving an indication of a plurality of consecutive sentences;
forwarding to a search engine the indicated plurality of consecutive sentences; and
receiving from the search engine indications of a plurality of result sentences from the document corpus that correlate to the indicated plurality of consecutive sentences based upon a latent semantic regression analysis used by the search engine to determine the similarity of terms in the consecutive sentences to terms in the sentences of documents in the corpus. - View Dependent Claims (109, 110, 111, 112)
113. A method in a computer system for searching a corpus of objects, each object having a plurality of units, the corpus having an index of the plurality of units for the objects, comprising:
receiving an indication of a plurality of consecutive units;
decomposing the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus;
determining a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and
returning indications of the determined result units. - View Dependent Claims (114, 115, 117)
116. A computer-readable memory medium containing instructions for controlling a computer processor to search a corpus of objects, each object having a plurality of units, the corpus having an index of the plurality of units for the objects, by:
receiving an indication of a plurality of consecutive units;
decomposing the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus;
determining a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and
returning indications of the determined result units. - View Dependent Claims (118)
119. A search engine for searching a corpus of objects, each having a plurality of units, the corpus having an index of the plurality of units for the objects, comprising:
a parser that is structured to receive an indication of a plurality of consecutive units; and
decompose the indicated plurality of consecutive units to generate a plurality of search terms for searching the object corpus; and
a postprocessor that is structured to determine a plurality of result units in the corpus that correlate to the search terms using latent semantic regression techniques to determine the similarity of the search terms to the units in the corpus of objects; and
return indications of the determined result units. - View Dependent Claims (120, 121)
Specification