Analyzing the ability to find textual content

US 7,792,830 B2
Filed: 08/01/2006
Issued: 09/07/2010
Est. Priority Date: 08/01/2006
Status: Expired due to Fees

Abstract

A method and system for analyzing a document set (202, 420) are provided. The method includes determining a set of terms (312) from the terms of the document set that minimizes a distance measurement (405) from the given set of documents (420). The method includes using a greedy algorithm to build the set of terms incrementally, at each stage finding a single word that is closest to the document set (202, 420). The set of terms is evaluated to assess the ability to find the document set (202, 420). The set of terms are compared with expected terms to evaluate the ability to find the document set (202, 420). A measure of the ability to find a document set (202, 420) is provided by computing a distance measure (403) between a document set and an entire collection.
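The greedy construction described above can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`unigram`, `jsd`, `restrict`, `greedy_coverage`) are hypothetical, the Jensen-Shannon divergence is borrowed from claim 11 as the distance measurement, and modelling the term set as the document distribution restricted to the chosen terms and renormalized is an assumption made only for illustration. The sketch also uses one document set as both the source of candidate words and the distance target, whereas claim 1 draws candidates from the subset of relevant documents.

```python
import math
from collections import Counter


def unigram(docs):
    """Relative word frequencies over a list of tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}

    def kl(a):
        # KL divergence of a from the mixture m; zero-probability words add 0.
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def restrict(p, terms):
    """Assumed model of a term set: p restricted to `terms`, renormalized."""
    mass = sum(p[t] for t in terms)
    return {t: p[t] / mass for t in terms}


def greedy_coverage(docs, size):
    """Add one word per stage, each time the word that minimizes the distance."""
    target = unigram(docs)
    candidates = set(target)
    chosen = []
    while candidates and len(chosen) < size:
        best = min(
            sorted(candidates),
            key=lambda w: jsd(restrict(target, chosen + [w]), target),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen


docs = [["search", "engine", "query"], ["search", "query", "search"], ["index"]]
print(greedy_coverage(docs, 2))
```

Under this particular modelling assumption the distance depends only on how much probability mass the chosen terms cover, so the greedy step reduces to picking the most frequent remaining word. A different model of the term set would change the order in which terms are selected.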

15 Claims

1. A method for document analysis, comprising the steps of:
- designating a subset of relevant documents from a document collection;
  
  using a greedy algorithm to establish a query coverage set of words or terms, wherein at each stage thereof a single word or term from the subset of relevant documents is included in the query coverage set, wherein the single word or term minimizes a distance measurement between the document collection and the query coverage set, wherein the distance measurement is determined by constructing a difficulty model for a topic by computing a plurality of distances comprising a first distance between the query coverage set and the document collection (d(Q,C)), a second distance among the query coverage set (d(Q,Q)), a third distance between the subset of relevant documents and the document collection (d(R,C)), a fourth distance among the subset of relevant documents (d(R,R)), and a fifth distance between the query coverage set and the subset of relevant documents (d(Q,R));
  
  storing the query coverage set in a database;
  
  constructing a set of queries from the query coverage set, each of the queries having a number of terms;
  
  executing the queries in a search engine to generate respective results;
  
  responsively to the respective results determining an average precision for each of the queries by considering the subset of relevant documents as representing the document collection;
  
  categorizing the queries by analyzing the average precision against the number of terms thereof; and
  
  reporting respective abilities of the categorized queries to find information in the subset of relevant documents.
- Dependent claims (2-15):
  - 2. A method as claimed in claim 1, wherein the document collection and the query coverage set are modelled by probability distributions of the words or terms of the query coverage set and a distance measurement generated between the probability distributions.
  - 3. A method as claimed in claim 1, wherein the terms comprise parts of words or word combinations.
  - 4. A method as claimed in claim 1, including comparing the terms with expected terms to evaluate an ability to find the subset of relevant documents.
  - 5. A method as claimed in claim 1, wherein analyzing the average precision against the number of terms includes clustering results of the queries into categories of behaviour.
  - 6. A method as claimed in claim 5, wherein the categories of behaviour include:
    - easily findable document sets, document sets requiring long queries to be located, and document sets which are not findable.
  - 7. A method as claimed in claim 1, including simulating changes to text of the document collection to improve an ability to find the subset of relevant documents.
  - 8. A method as claimed in claim 1, including:
    - comparing a set of terms for a first document set with sets of terms for one or more other document sets; and
      
      determining similar document sets which may be confusable.
  - 9. A method as claimed in claim 8, wherein comparing is carried out by measuring an overlap of the sets of terms.
  - 10. The method according to claim 1, further comprising the steps of:
    - partitioning the document collection into a plurality of domains;
      
      for each of the domains performing the steps of designating a subset of relevant documents, using a greedy algorithm, storing the query coverage set, constructing a set of queries, and executing the queries;
      
      determining an average overlap among the terms in the queries at a plurality of cut-off points; and
      
      determining from the average overlap the most similar domains.
  - 11. The method according to claim 1, wherein the distance measurement comprises a Jensen-Shannon divergence.
  - 12. The method according to claim 1, wherein reporting respective abilities of the categorized queries to find information comprises estimating clarity of the topic according to the first distance.
  - 13. The method according to claim 12, wherein reporting respective abilities of the categorized queries to find information further comprises estimating clarity of the topic according to the second distance.
  - 14. The method according to claim 1, wherein reporting respective abilities of the categorized queries to find information further comprises the steps of:
    - ranking the words or terms in the query coverage set according to the distance measurement; and
      
      correlating a presence of the ranked words or terms with words or terms in user queries.
  - 15. The method according to claim 14, further comprising the steps of:
    - making a determination that the ranked words or terms have a ranking in the query coverage set that is lower than a predetermined value; and
      
      responsively to the determination expanding documents of the subset of relevant documents so as to improve the correlated presence of the ranked words or terms in the query coverage set.
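
The evaluation steps of claims 1, 5 and 6 (average precision per query, analyzed against the number of query terms) can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: the function names are hypothetical, queries are assumed to have already been executed so that each run is a pair of query terms and a ranked list of document identifiers, and the fixed `threshold` rule in `categorize` is a simple stand-in for the clustering into categories of behaviour that claim 5 recites.

```python
from collections import defaultdict
from statistics import mean


def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def precision_by_length(runs, relevant):
    """Average the per-query average precision for each query length.

    runs: iterable of (query_terms, ranked_doc_ids) pairs.
    """
    by_len = defaultdict(list)
    for terms, ranked in runs:
        by_len[len(terms)].append(average_precision(ranked, relevant))
    return {n: mean(values) for n, values in sorted(by_len.items())}


def categorize(curve, threshold=0.5):
    """Hypothetical threshold rule mapping a precision curve to a category."""
    lengths = sorted(curve)
    if curve[lengths[0]] >= threshold:
        return "easily findable"
    if any(curve[n] >= threshold for n in lengths):
        return "requires long queries"
    return "not findable"


relevant = {"d1", "d2"}
runs = [
    (["search"], ["x", "y", "d1"]),
    (["search", "query"], ["d1", "d2", "x"]),
]
curve = precision_by_length(runs, relevant)
print(curve, categorize(curve))
```

In this toy input the one-term query ranks a relevant document only in third place while the two-term query ranks both relevant documents first, so the document set falls into the category that needs longer queries.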

Current Assignee: LinkedIn Corporation (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Pelleg, Dan; Fine, Shai; Yom-Tov, Elad; Darlow, Adam; Carmel, David
Primary Examiner: Jalil, Neveen Abel
Assistant Examiner: Mincey, Jermaine A
Application Number: US11/461,464
Publication Number: US 20080033971A1
Time in Patent Office: 1,498 Days
Field of Search: 707/3
US Class Current: 707/728
CPC Class Codes: G06F 16/334 (Query execution), G06F 16/335 ...
