Method and system for extending keyword searching to syntactically and semantically annotated data
Abstract
Methods and systems for extending keyword searching techniques to syntactically and semantically annotated data are provided. Example embodiments provide a Syntactic Query Engine (“SQE”) that parses, indexes, and stores a data set as an enhanced document index containing document terms together with the grammatical roles of those terms and ontological and other semantic information. In one embodiment, the enhanced document index is a form of term-clause index that indexes terms and syntactic and semantic annotations at the clause level. The enhanced document index permits a traditional keyword search engine to process relationship queries as well as standard document-level keyword searches. In one embodiment, the SQE comprises a Query Processor, a Data Set Preprocessor, a Keyword Search Engine, a Data Set Indexer, an Enhanced Natural Language Parser (“ENLP”), a data set repository, and, in some embodiments, a user interface or an application programming interface.
249 Claims
-
1. A method in a computer system for preparing a corpus of documents for performing electronic searches, each document having at least one sentence, each sentence having a plurality of terms, comprising:
for each sentence of each document, parsing the sentence to generate a parse structure having a plurality of syntactic elements that correspond to the terms of the sentence;
normalizing a plurality of the syntactic elements of the generated parse structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type;
transforming each sentence to an enhanced data structure of terms, wherein the plurality of the tagged terms are treated as additional terms of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
Dependent claims: 2-47
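The transforming step of claim 1 can be pictured with a small sketch. It is only an illustration under stated assumptions: the `tag:term` encoding, the tag names (`subject`, `verb`, `object`), and the function names are hypothetical and are not taken from the claims, and the parser output is written by hand rather than produced by a real parser. The point it shows is that once tagged terms are stored as extra terms, a single membership test answers both "is this term present?" and "is this term present with this tag?".

```python
# Hypothetical encoding and tag names; the claim does not prescribe either.
def encode_tagged_term(term, tag):
    """Encode a (term, tag) association as one searchable token."""
    return f"{tag}:{term.lower()}"


def enhance_sentence(terms, tagged_terms):
    """Return the plain terms of a sentence followed by its tagged terms."""
    plain = [t.lower() for t in terms]
    extra = [encode_tagged_term(term, tag) for term, tag in tagged_terms]
    return plain + extra


def contains(enhanced, token):
    """One lookup routine serves plain and tagged tokens alike."""
    return token in enhanced


# Hand-written stand-in for the output of the parsing and normalizing steps.
terms = ["Clinton", "visited", "China"]
tagged = [("Clinton", "subject"), ("visited", "verb"), ("China", "object")]

enhanced = enhance_sentence(terms, tagged)
print(enhanced)
print(contains(enhanced, "china"), contains(enhanced, "object:china"))
```

A plain keyword lookup for `china` and a tagged lookup for `object:china` go through the same `contains` call, which is the "similar manner" the claim describes.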
-
48. A computer-readable memory medium containing instructions that control a computer processor to index a corpus of documents for electronic searching, each document having at least one sentence, each sentence having a plurality of terms, by:
for each sentence of each document, parsing the sentence to generate a parse structure having a plurality of syntactic elements that correspond to the terms of the sentence;
normalizing a plurality of the syntactic elements of the generated parse structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type;
transforming each sentence to an enhanced data structure of terms, wherein the plurality of the tagged terms are treated as additional terms of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
Dependent claims: 49-81
-
82. A computer system that indexes a corpus of documents for electronic searching, each document having at least one sentence, each sentence having a plurality of terms, comprising:
a parser that parses each sentence of each document to generate a dependency structure that specifies a plurality of syntactic elements that correspond to the terms of the sentence and their relationship to each other;
a post processing module that is structured to normalize the dependency structure to a plurality of tagged terms, each tagged term indicating an association between the term that corresponds to the syntactic element and an associated tag type;
a sentence transformation module that is structured to transform the plurality of tagged terms to an enhanced data structure that stores and treats each tagged term as an encoded additional term of the sentence, thereby enabling a search engine to determine from the enhanced data structure whether a designated term having an associated tag type is present in the sentence in a similar manner to the manner the search engine uses to determine whether a designated term is present in the sentence.
Dependent claims: 83-107
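The three components of claim 82 (parser, post processing module, sentence transformation module) can be sketched as three functions chained together. This is a rough sketch, not the patented implementation: the `Dependency` record, the relation labels `nsubj` and `dobj`, the relation-to-tag mapping, and the rule that the head of an edge is tagged `verb` are all assumptions made for illustration, and `parse` returns a canned structure instead of calling a real dependency parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    head: str       # governing term of the edge
    relation: str   # edge label; "nsubj"/"dobj" are assumed labels
    dependent: str  # governed term


# Assumed mapping from dependency relations to tag types.
RELATION_TO_TAG = {"nsubj": "subject", "dobj": "object"}


def parse(sentence):
    """Stand-in parser: looks up a canned dependency structure."""
    canned = {
        "Clinton visited China": [
            Dependency("visited", "nsubj", "Clinton"),
            Dependency("visited", "dobj", "China"),
        ]
    }
    return canned[sentence]


def normalize(dependencies):
    """Post-processing: flatten dependency edges into (term, tag) pairs."""
    tagged, seen_heads = [], set()
    for dep in dependencies:
        if dep.head not in seen_heads:  # simplification: treat each head as a verb
            tagged.append((dep.head, "verb"))
            seen_heads.add(dep.head)
        tag = RELATION_TO_TAG.get(dep.relation)
        if tag is not None:
            tagged.append((dep.dependent, tag))
    return tagged


def transform(sentence, tagged):
    """Sentence transformation: store tagged terms as encoded extra terms."""
    plain = sentence.lower().split()
    return plain + [f"{tag}:{term.lower()}" for term, tag in tagged]


sentence = "Clinton visited China"
enhanced = transform(sentence, normalize(parse(sentence)))
print(enhanced)
```

Keeping the modules as separate functions mirrors the claim's structure: the parser could be swapped for a real one without touching normalization or encoding.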
-
108. A method in a computer system for performing a search of a corpus of documents, each document having at least one sentence, comprising:
receiving a search query that designates a desired grammatical relationship between a first entity and at least one of a second entity or an action;
transforming the search query into a Boolean expression;
determining a set of objects that match the Boolean expression using a keyword-style search of a data structure that indexes terms of the documents including grammatical relationship information; and
returning an indication of each matching object in the corpus that encompasses the desired relationship.
Dependent claims: 109-152
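The search method of claim 108 can be illustrated with a toy example. The sketch below assumes a sentence-level index keyed by `tag:term` tokens, a Boolean expression limited to a conjunction (AND), and hypothetical function names; none of these details come from the claims, and the index contents are invented for the example.

```python
# Toy index: encoded token -> ids of sentences containing it (made-up data).
INDEX = {
    "subject:clinton": {0},
    "verb:visited": {0, 1},
    "object:china": {0},
    "subject:nixon": {1},
    "object:moscow": {1},
}


def to_boolean(subject=None, verb=None, obj=None):
    """Turn a relationship query into a conjunction of tagged tokens."""
    parts = [("subject", subject), ("verb", verb), ("object", obj)]
    return [f"{tag}:{value.lower()}" for tag, value in parts if value]


def search(index, conjunction):
    """Evaluate the AND expression by intersecting posting sets."""
    if not conjunction:
        return set()
    postings = [index.get(token, set()) for token in conjunction]
    return set.intersection(*postings)


# "What did Clinton visit?": subject and action given, object left open.
expr = to_boolean(subject="Clinton", verb="visited")
matches = search(INDEX, expr)
print(expr, matches)
```

The search routine knows nothing about grammar: it only intersects posting sets, which is what makes it a keyword-style search over an index that happens to carry relationship information.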
-
153. A computer-readable memory medium containing instructions to control a computer processor to search a corpus of documents, each document having at least one sentence, by:
receiving a search query that designates a desired grammatical relationship between a first entity and at least one of a second entity or an action;
transforming the search query into a Boolean expression;
determining a set of objects that match the Boolean expression using a keyword-style search of a data structure that indexes terms of the documents including grammatical relationship information; and
returning an indication of each matching object in the corpus that encompasses the desired relationship.
Dependent claims: 154-186
-
187. A search engine that searches a corpus of documents, each document having at least one sentence, comprising:
a data structure that indexes and stores terms of the documents along with annotations that include relationship information, each annotation associated with at least one term;
a keyword search engine that pattern matches an input string against the data structure and returns an indication of each matching object of the corpus; and
a query processor that is structured to receive a relationship search query that is indicative of at least one syntactically or semantically annotated term;
transform the relationship search query into at least one Boolean expression;
and invoke the keyword search engine to determine a set of objects that match the at least one Boolean expression by pattern matching the at least one annotated term indicated by the search query to the data structure, such that each matching object encompasses the relationship specified by the relationship search.
Dependent claims: 188-213
-
214. A computer-readable memory medium containing structured data that stores a syntactic query, the query executed by a computer processor under the control of a search engine to search a corpus of objects for objects that match the query, comprising:
a base component that specifies values for desired relationship parameters;
a prepositional constraint component that specifies a desired value for a prepositional phrase;
a keyword constraint component that specifies desired keyword values; and
a metadata constraint component that specifies desired values of metadata associated with each matching object, whereby, when the search engine causes the search to be executed, objects that match the constraints specified by the base component, the prepositional constraint component, the keyword constraint component, and the metadata constraint component are determined to satisfy the query.
Dependent claims: 215-236, 246-249
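The four-part query of claim 214 maps naturally onto a small record type. The sketch below is one possible shape, assuming hypothetical field names, dictionary-based constraints, exact-match comparison, and a made-up layout for the candidate object; the claim itself fixes none of these.

```python
from dataclasses import dataclass, field


@dataclass
class SyntacticQuery:
    # Base component: desired relationship parameters (e.g. subject, verb).
    base: dict
    # Prepositional constraint: preposition -> desired phrase value.
    prepositional: dict = field(default_factory=dict)
    # Keyword constraint: terms that must appear in the object.
    keywords: list = field(default_factory=list)
    # Metadata constraint: metadata key -> desired value.
    metadata: dict = field(default_factory=dict)


def satisfies(query, obj):
    """An object satisfies the query only if all four components match."""
    base_ok = all(obj["relations"].get(k) == v for k, v in query.base.items())
    prep_ok = all(obj["prepositions"].get(k) == v
                  for k, v in query.prepositional.items())
    kw_ok = all(k in obj["terms"] for k in query.keywords)
    meta_ok = all(obj["metadata"].get(k) == v for k, v in query.metadata.items())
    return base_ok and prep_ok and kw_ok and meta_ok


query = SyntacticQuery(
    base={"subject": "clinton", "verb": "visited"},
    prepositional={"in": "1998"},
    keywords=["summit"],
    metadata={"source": "newswire"},
)

# Hypothetical annotated sentence with invented values.
candidate = {
    "relations": {"subject": "clinton", "verb": "visited", "object": "china"},
    "prepositions": {"in": "1998"},
    "terms": {"clinton", "visited", "china", "in", "1998", "summit"},
    "metadata": {"source": "newswire", "year": 1998},
}

print(satisfies(query, candidate))
```

Leaving a component empty makes it vacuously satisfied in this sketch, so a query with only a base component degrades to a pure relationship search.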
-
237. A computer-readable memory medium that contains a reverse index for storing a corpus of documents according to terms present in the documents, the index accessed by a computer processor that is controlled by a search engine to match a query against the corpus of documents, the index comprising:
a plurality of terms, each term indicating at least one sentence in which the term occurs; and
a plurality of tagged terms, each tagged term specifying a syntactic role that is associated with the term in the at least one sentence and each tagged term indicating the at least one sentence in which the associated term occurs;
such that the search engine can determine, by pattern matching query terms against the reverse index, a set of sentences that match a relationship indicated by the query.
Dependent claims: 238-245
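A reverse index of the kind recited in claim 237 can be sketched as a dictionary from tokens to sentence ids. The `role:term` encoding, the role names, and the two example sentences below are assumptions chosen for illustration. The example is built so that a plain keyword query cannot tell the two sentences apart, while a query on tagged terms can.

```python
from collections import defaultdict


def build_reverse_index(sentences):
    """Index plain terms and role-tagged terms by sentence id.

    `sentences` is a list of (plain_terms, tagged_terms) pairs, where
    tagged_terms holds (term, role) pairs. The "role:term" encoding is assumed.
    """
    index = defaultdict(set)
    for sid, (terms, tagged) in enumerate(sentences):
        for term in terms:
            index[term.lower()].add(sid)
        for term, role in tagged:
            index[f"{role}:{term.lower()}"].add(sid)
    return dict(index)


def match(index, query_tokens):
    """Return ids of sentences containing every query token."""
    postings = [index.get(tok, set()) for tok in query_tokens]
    return set.intersection(*postings) if postings else set()


sentences = [
    (["Clinton", "visited", "China"],
     [("Clinton", "subject"), ("visited", "verb"), ("China", "object")]),
    (["China", "welcomed", "Clinton"],
     [("China", "subject"), ("welcomed", "verb"), ("Clinton", "object")]),
]
index = build_reverse_index(sentences)

print(match(index, ["china", "clinton"]))                  # both sentences
print(match(index, ["subject:china", "object:clinton"]))   # only the second
```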
-
Specification