System, method and computer program product for performing unstructured information management and automatic text analysis, including a search operator functioning as a Weighted AND (WAND)
Abstract
Disclosed is a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators, and various adapters. The searching technique is a two-level technique. A search query includes a search operator containing a plurality of search sub-expressions, each having an associated weight value. The search engine returns a document or documents having a weight value sum that exceeds a threshold weight value sum. The search operator is implemented as a Boolean predicate that functions as a Weighted AND (WAND).
30 Claims
1. A data processing system for processing stored data, comprising:
data storage for storing a collection of data units; and
coupled to the data storage, a search engine responsive to a query for retrieving at least one data unit from said data storage;
where the query comprises a search operator comprised of a plurality of search sub-expressions each having an associated weight value, and where said search engine returns a data unit having a weight value sum that exceeds a threshold weight value sum; and
where said search operator comprises a weighted AND function, where varying the threshold weight value varies the operation of the weighted AND function from being substantially a logical OR function to being substantially a logical AND function.
(Dependent claims: 2, 3)
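As an illustration only, and not part of the claims, the weighted AND behaviour recited in claim 1 can be sketched in Python. The function name `wand`, the unit weights, and the use of "greater than or equal" for the threshold comparison are assumptions made for this sketch.

```python
from typing import Sequence


def wand(matches: Sequence[bool], weights: Sequence[float], threshold: float) -> bool:
    """Weighted AND: true when the weights of the satisfied
    sub-expressions sum to at least `threshold`."""
    if len(matches) != len(weights):
        raise ValueError("each sub-expression needs exactly one weight")
    # Only sub-expressions that matched contribute their weight.
    weight_sum = sum(w for matched, w in zip(matches, weights) if matched)
    # Assumption: ">=" is used here; a strict ">" would shift the thresholds below.
    return weight_sum >= threshold


weights = [1.0, 1.0, 1.0]          # hypothetical unit weights
one_hit = [True, False, False]
all_hit = [True, True, True]

# Threshold 1 with unit weights: any single match suffices (behaves like OR).
print(wand(one_hit, weights, threshold=1.0))   # True
# Threshold 3 with three unit weights: every sub-expression must match (behaves like AND).
print(wand(one_hit, weights, threshold=3.0))   # False
print(wand(all_hit, weights, threshold=3.0))   # True
```

Thresholds between these two extremes give intermediate behaviour, for example "at least two of three sub-expressions must match" with a threshold of 2.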
4. A data processing system for processing stored data, comprising:
data storage for storing a collection of data units; and
coupled to the data storage, a search engine responsive to a query for retrieving at least one data unit from said data storage;
where the query comprises a search operator comprised of a plurality of search sub-expressions each having an associated weight value, and where said search engine returns a data unit having a weight value sum that exceeds a threshold weight value sum, where said data units comprise documents; and
where said data processing system comprises an inverted file system for storing annotations derived from a tokenization of document data, a list comprising occurrences of respective annotations and, for each listed occurrence of a respective annotation, a set comprised of a plurality of token locations spanned by said respected annotation.
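One possible in-memory shape for the inverted file of claim 4 is sketched below in Python. The class names `Occurrence` and `AnnotationIndex`, the method names, and the sample annotations are hypothetical; the claim does not prescribe any particular layout.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Occurrence:
    """One occurrence of an annotation in a document."""
    doc_id: int
    token_locations: Tuple[int, ...]  # token positions spanned by the annotation


class AnnotationIndex:
    """Maps each annotation to the list of its occurrences (an inverted file)."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[Occurrence]] = defaultdict(list)

    def add(self, annotation: str, doc_id: int, first_token: int, last_token: int) -> None:
        # Record every token location covered by the annotation, inclusive.
        span = tuple(range(first_token, last_token + 1))
        self._postings[annotation].append(Occurrence(doc_id, span))

    def occurrences(self, annotation: str) -> List[Occurrence]:
        return list(self._postings.get(annotation, []))


# Hypothetical document 7, tokenized as: IBM(0) Research(1) in(2) New(3) York(4)
index = AnnotationIndex()
index.add("Organization", doc_id=7, first_token=0, last_token=1)
index.add("Location", doc_id=7, first_token=3, last_token=4)

print(index.occurrences("Location"))
```

Storing the spanned token locations with each occurrence lets a query test whether one annotation covers, or overlaps, another without going back to the document text.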
5. A data processing system for processing stored document data, comprising:
data storage for storing a collection of document data; and
coupled to the data storage, a search engine responsive to a query for retrieving at least one document from said data storage;
where the query comprises a Boolean predicate that functions as a Weighted AND (WAND), the WAND taking as arguments a list of Boolean variables $X_1, X_2, \ldots, X_k$, a list of associated positive weights $w_1, w_2, \ldots, w_k$, and a threshold $\theta$, where $\mathrm{WAND}(X_1, w_1, \ldots, X_k, w_k, \theta)$ is true if
$$\sum_{1 \le i \le k} x_i w_i \ge \theta,$$
where $x_i$ is the indicator variable for $X_i$, that is, $x_i = 1$ if $X_i$ is true and $x_i = 0$ otherwise; said search engine comprising an output for outputting a result of a search of the collection of document data using the query.
(Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
where the upper bounds of all query terms appearing in a document are summed to determine an upper bound on the document's query-dependent score as
$$UB(d, q) = \sum_{t \in q \cap d} UB_t \ge \mathrm{Score}(d, q);$$
and where preliminary scoring involves evaluating, for each document $d$,
$$\mathrm{WAND}(X_1, UB_1, X_2, UB_2, \ldots, X_k, UB_k, \theta),$$
where $X_i$ is an indicator variable for the presence of query term $i$ in document $d$, and the threshold $\theta$ is varied during operation based on a minimum score $m$ among the top $n$ results found by said search engine thus far, where $n$ is a number of requested documents.
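The preliminary scoring step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it scans every document rather than walking posting lists, the function name `top_n_two_level` and the sample data are hypothetical, the full score is taken to be the sum of per-term contributions, and the threshold is set directly to the minimum score in a heap of the best n results so far.

```python
import heapq
from typing import Dict, List, Tuple


def top_n_two_level(
    docs: Dict[int, Dict[str, float]],   # doc id -> {term: actual score contribution}
    upper_bounds: Dict[str, float],      # query term -> upper bound UB_t
    n: int,
) -> Tuple[List[Tuple[float, int]], int]:
    """Return the top-n (score, doc_id) pairs and the number of full evaluations."""
    heap: List[Tuple[float, int]] = []   # min-heap of the best n results so far
    fully_scored = 0
    for doc_id in sorted(docs):
        contributions = docs[doc_id]
        present = [t for t in upper_bounds if t in contributions]
        if not present:
            continue                     # document contains no query term
        # Threshold: minimum score m among the top n so far (0 until the heap is full).
        theta = heap[0][0] if len(heap) == n else 0.0
        # Level 1: cheap preliminary test on the summed upper bounds.
        if sum(upper_bounds[t] for t in present) < theta:
            continue                     # cannot enter the top n; skip full scoring
        # Level 2: full (assumed additive) scoring.
        fully_scored += 1
        score = sum(contributions[t] for t in present)
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True), fully_scored


upper_bounds = {"search": 2.0, "engine": 1.5, "index": 1.0}
docs = {
    1: {"search": 1.5, "engine": 1.5},
    2: {"index": 0.75},
    3: {"search": 2.0, "index": 0.5},
    4: {"engine": 1.0},
    5: {"search": 1.0, "engine": 1.5, "index": 1.0},
}
results, fully_scored = top_n_two_level(docs, upper_bounds, n=2)
print(results)        # [(3.5, 5), (3.0, 1)]
print(fully_scored)   # 4: document 4 is rejected by the upper-bound test alone
```

As the heap fills with better results the threshold rises, so more documents are rejected by the cheap upper-bound test and never reach full scoring.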
13. A data processing system as in claim 5, where the documents in the data storage are represented as inverted files with respect to a particular ordering of the documents in the data storage.
14. A data processing system as in claim 5, further comprising at least one iterator over occurrences of terms in documents.
15. A data processing system as in claim 5, further comprising at least one iterator for indicating which documents satisfy specific properties.
16. A data processing system as in claim 5, where the WAND employs at least one iterator for documents that satisfy the Boolean predicates $X_1, X_2, \ldots$, respectively, and where a WAND operator creates an iterator for indicating which documents satisfy the WAND predicate.
17. A data processing system as in claim 16, where the WAND operator maintains a current document variable that represents a first possible document not yet known to not satisfy the WAND predicate, and where a procedure indicates which iterator of a plurality of iterators is to advance if the WAND predicate is not satisfied at a current document variable.
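A minimal Python sketch of an iterator of the kind described in claims 16 and 17 follows. It is one way such an iterator could work, not the patented implementation: the class names `PostingIterator` and `WandIterator`, the sample posting lists, and the rule of always advancing the iterator with the smallest current document are assumptions chosen for simplicity.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PostingIterator:
    """Iterator over the sorted ids of documents that satisfy one predicate X_i."""
    doc_ids: List[int]
    upper_bound: float   # weight associated with this predicate
    pos: int = 0

    def current(self) -> float:
        # math.inf marks an exhausted iterator.
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else math.inf

    def advance_to(self, target: int) -> None:
        while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] < target:
            self.pos += 1


class WandIterator:
    def __init__(self, terms: List[PostingIterator]) -> None:
        self.terms = terms
        self.cur_doc = -1   # last document returned; later documents are still candidates

    def next(self, theta: float) -> Optional[int]:
        """Return the next document satisfying the WAND predicate, or None."""
        while True:
            self.terms.sort(key=PostingIterator.current)
            # Pivot: first iterator at which the accumulated weights reach theta.
            acc, pivot = 0.0, None
            for it in self.terms:
                acc += it.upper_bound
                if acc >= theta:
                    pivot = it.current()
                    break
            if pivot is None or pivot == math.inf:
                return None                       # no remaining document can qualify
            if pivot <= self.cur_doc:
                # Pivot document was already considered; move past it.
                self.terms[0].advance_to(self.cur_doc + 1)
            elif self.terms[0].current() == pivot:
                # All iterators up to the pivot sit on the pivot document.
                self.cur_doc = int(pivot)
                return self.cur_doc
            else:
                # Documents before the pivot cannot qualify; skip ahead.
                self.terms[0].advance_to(int(pivot))


def matching_docs(theta: float) -> List[int]:
    terms = [
        PostingIterator([1, 3, 5, 9], 2.0),
        PostingIterator([3, 4, 9], 1.5),
        PostingIterator([2, 3, 9, 10], 1.0),
    ]
    wand_iter, found = WandIterator(terms), []
    doc = wand_iter.next(theta)
    while doc is not None:
        found.append(doc)
        doc = wand_iter.next(theta)
    return found


print(matching_docs(3.0))   # [3, 9]
print(matching_docs(0.0))   # [1, 2, 3, 4, 5, 9, 10]
```

The pivot step is what allows whole runs of documents to be skipped: any document with an id below the pivot contains too little weight to reach the threshold, so the lagging iterator can jump straight to the pivot.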
18. A computer program product embodied on a computer-readable medium and comprising program code for directing operation of a text intelligence system in cooperation with at least one application, comprising:
a computer program segment for storing a collection of data units; and
a computer program segment implementing a search engine that is responsive to a query for retrieving at least stored one data unit;
where the query comprises a search operator comprised of a plurality of search sub-expressions each having an associated weight value, and where said search engine returns a data unit having a weight value sum that exceeds a threshold weight value sum; and
where said search operator comprises a weighted AND function, where varying the threshold weight value varies the operation of the weighted AND function from being substantially a logical OR function to being substantially a logical AND function.
(Dependent claims: 19, 20)
21. A computer program product embodied on a computer-readable medium and comprising program code for directing operation of a text intelligence system in cooperation with at least one application, comprising:
a computer program segment for storing a collection of data units; and
a computer program segment implementing a search engine that is responsive to a query for retrieving at least stored one data unit;
where the query comprises a search operator comprised of a plurality of search sub-expressions each having an associated weight value, and where said search engine returns a data unit having a weight value sum that exceeds a threshold weight value sum;
where said data units comprise documents; and
further comprising a computer program segment for implementing an inverted file system for storing annotations derived from a tokenization of document data, a list comprising occurrences of respective annotations and, for each listed occurrence of a respective annotation, a set comprised of a plurality of token locations spanned by said respected annotation.
22. A computer program product embodied on a computer-readable medium and comprising program code for directing operation of a text intelligence system in cooperation with at least one application, comprising:
a computer program segment for storing a collection of data units; and
a computer program segment implementing a search engine that is responsive to a query for retrieving at least stored one data unit;
where the query comprises a search operator comprised of a plurality of search sub-expressions each having an associated weight value, and where said search engine returns a data unit having a weight value sum that exceeds a threshold weight value sum;
where the query comprises a Boolean predicate that functions as a Weighted AND (WAND), the WAND taking as arguments a list of Boolean variables $X_1, X_2, \ldots, X_k$, a list of associated positive weights $w_1, w_2, \ldots, w_k$, and a threshold $\theta$, where $\mathrm{WAND}(X_1, w_1, \ldots, X_k, w_k, \theta)$ is true if
$$\sum_{1 \le i \le k} x_i w_i \ge \theta,$$
where $x_i$ is the indicator variable for $X_i$, that is, $x_i = 1$ if $X_i$ is true and $x_i = 0$ otherwise.
(Dependent claims: 23, 24, 25, 26, 27, 28)
where the upper bounds of all query terms appearing in a document data unit are summed to determine an upper bound on the document's query-dependent score as
$$UB(d, q) = \sum_{t \in q \cap d} UB_t \ge \mathrm{Score}(d, q);$$
and where preliminary scoring involves evaluating, for each document $d$,
$$\mathrm{WAND}(X_1, UB_1, X_2, UB_2, \ldots, X_k, UB_k, \theta),$$
where $X_i$ is an indicator variable for the presence of query term $i$ in document data unit $d$, and the threshold $\theta$ is varied during operation based on a minimum score $m$ among the top $n$ results found by said search engine thus far, where $n$ is a number of requested documents.
29. A method for processing document data, comprising:
receiving a query; and
responding to the query for retrieving at least one document from a data storage;
where the query comprises a Boolean predicate that functions as a Weighted AND (WAND), the WAND taking as arguments a list of Boolean variables $X_1, X_2, \ldots, X_k$, a list of associated positive weights $w_1, w_2, \ldots, w_k$, and a threshold $\theta$, where $\mathrm{WAND}(X_1, w_1, \ldots, X_k, w_k, \theta)$ is true if
$$\sum_{1 \le i \le k} x_i w_i \ge \theta,$$
where $x_i$ is the indicator variable for $X_i$, that is, $x_i = 1$ if $X_i$ is true and $x_i = 0$ otherwise;
further comprising outputting as a result of using the query a retrieved at least one document.
(Dependent claims: 30)
Specification