Analyzing the ability to find textual content
First Claim
1. A method for document analysis, comprising the steps of:
designating a subset of relevant documents from a document collection;
using a greedy algorithm to establish a query coverage set of words or terms, wherein at each stage thereof a single word or term from the subset of relevant documents is included in the query coverage set, wherein the single word or term minimizes a distance measurement between the document collection and the query coverage set, wherein the distance measurement is determined by constructing a difficulty model for a topic by computing a plurality of distances comprising a first distance between the query coverage set and the document collection (d(Q,C)), a second distance among the query coverage set (d(Q,Q)), a third distance between the subset of relevant documents and the document collection (d(R,C)), a fourth distance among the subset of relevant documents (d(R,R)), and a fifth distance between the query coverage set and the subset of relevant documents (d(Q,R));
storing the query coverage set in a database;
constructing a set of queries from the query coverage set, each of the queries having a number of terms;
executing the queries in a search engine to generate respective results;
responsively to the respective results determining an average precision for each of the queries by considering the subset of relevant documents as representing the document collection;
categorizing the queries by analyzing the average precision against the number of terms thereof; and
reporting respective abilities of the categorized queries to find information in the subset of relevant documents.
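The evaluation steps at the end of the claim (construct queries, execute them, compute average precision, categorize by number of terms) can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_search` is a hypothetical stand-in for the search engine, queries are assumed to be all term combinations up to `max_terms`, and the designated subset of relevant documents is assumed to serve as the ground truth for average precision. None of these choices is specified by the claim.

```python
from collections import defaultdict
from itertools import combinations


def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def toy_search(query, collection):
    """Hypothetical search engine: rank documents by how often they contain the query terms."""
    scored = [(sum(doc.count(t) for t in query), doc_id)
              for doc_id, doc in collection.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0]


def precision_by_query_length(coverage_set, collection, relevant_ids, max_terms=2):
    """Group queries by number of terms and report the mean average precision per group."""
    by_length = defaultdict(list)
    for n in range(1, max_terms + 1):
        # Assumption: every n-term combination of the coverage set is one query.
        for query in combinations(coverage_set, n):
            ranked = toy_search(query, collection)
            by_length[n].append(average_precision(ranked, relevant_ids))
    return {n: sum(aps) / len(aps) for n, aps in by_length.items()}


collection = {
    "d1": ["greedy", "query"],
    "d2": ["query", "cover"],
    "d3": ["cooking"],
}
report = precision_by_query_length(["greedy", "query"], collection, {"d1", "d2"})
```

In this toy collection, `report` maps each query length to the mean average precision of the queries of that length, which is one simple way to compare how well short and long queries find the relevant subset.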
2 Assignments
0 Petitions
Abstract
A method and system for analyzing a document set (202, 420) are provided. The method includes determining a set of terms (312) from the terms of the document set that minimizes a distance measurement (405) from the given set of documents (420). The method includes using a greedy algorithm to build the set of terms incrementally, at each stage finding a single word that is closest to the document set (202, 420). The set of terms is evaluated to assess the ability to find the document set (202, 420). The set of terms is compared with expected terms to evaluate the ability to find the document set (202, 420). A measure of the ability to find a document set (202, 420) is provided by computing a distance measure (403) between a document set and an entire collection.
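The incremental, greedy construction of the term set described in the abstract might look like the sketch below. The abstract does not name the distance measurement here, so Jensen-Shannon divergence between unigram term distributions is used purely as an assumption; the helper names (`distribution`, `jsd`, `greedy_term_set`) are hypothetical and not taken from the patent.

```python
import math
from collections import Counter


def distribution(docs):
    """Unigram term distribution over a list of tokenized documents."""
    counts = Counter(term for doc in docs for term in doc)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two term distributions."""
    mix = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in set(p) | set(q)}

    def kl(a):
        return sum(pa * math.log2(pa / mix[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def greedy_term_set(docs, size):
    """At each stage, add the single term that brings the term set closest to docs."""
    target = distribution(docs)
    chosen, candidates = [], set(target)

    def distance_with(term):
        # Assumed model: the term set is the document distribution
        # restricted to the chosen terms and renormalized.
        terms = chosen + [term]
        weight = sum(target[t] for t in terms)
        return jsd({t: target[t] / weight for t in terms}, target)

    while candidates and len(chosen) < size:
        best = min(sorted(candidates), key=distance_with)
        chosen.append(best)
        candidates.remove(best)
    return chosen


docs = [["apple", "apple", "apple", "pie"], ["apple", "tart"]]
terms = greedy_term_set(docs, 2)
```

With this toy input the most frequent term is selected first, because a one-term set concentrated on it lies closest to the document distribution under the assumed measure. A greedy choice of this kind is not guaranteed to find the globally optimal set of a given size.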
25 Citations
15 Claims
Specification