System and method for measuring the quality of document sets

US 8,024,327 B2
Filed: 06/25/2008
Issued: 09/20/2011
Est. Priority Date: 06/26/2007
Status: Active Grant

Abstract

Systems and methods are described that calculate the interestingness of a set of one or more records in a database, either absolutely (i.e., compared to an overall collection of records) or relative to some other set of records. In one embodiment, the measure is a relative entropy value that has been normalized. Various applications of the measure are described in the context of an information retrieval system. These applications include, for example, guiding query interpretation, guiding view selection and summarization, intelligent ranges, event detection, concept triggers and interpreting user actions, hierarchy discovery, and adaptive data mining.
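
The normalized relative entropy mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each document contributes a single categorical value, it normalizes with a z-score against randomly sampled sets of the same size (one reading of the mean, standard deviation, and comparison-set language in the dependent claims), and the function names, the smoothing constant `alpha`, and the `trials` count are all hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def value_distribution(values, support, alpha=0.5):
    """Additively smoothed distribution of `values` over a fixed `support`."""
    counts = Counter(values)
    total = len(values) + alpha * len(support)
    return {v: (counts[v] + alpha) / total for v in support}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q[v] > 0 wherever p[v] > 0, which the smoothing above guarantees.
    """
    return sum(p[v] * math.log2(p[v] / q[v]) for v in p if p[v] > 0)


def normalized_distinctiveness(subset, collection, trials=200, seed=0):
    """Z-score of the subset's divergence against same-size random subsets."""
    rng = random.Random(seed)
    support = sorted(set(collection))
    baseline = value_distribution(collection, support)
    observed = relative_entropy(value_distribution(subset, support), baseline)

    # Comparison sets: random draws whose size matches the measured set,
    # so the score accounts for the size of the set being measured.
    samples = []
    for _ in range(trials):
        comparison = rng.sample(collection, len(subset))
        samples.append(
            relative_entropy(value_distribution(comparison, support), baseline)
        )

    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return (observed - mean) / stdev if stdev > 0 else 0.0
```

Under these assumptions, a set drawn entirely from one rare value should score well above a set whose mix of values resembles the overall collection, because the raw divergence is compared with what random sets of the same size typically produce.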

52 Claims

1. In an information retrieval system, a computer-implemented method for information processing, comprising:
- accessing, by a computer system, a set of documents obtained from the information retrieval system;
  
  establishing, automatically by the computer system, at least one identifying characteristic within the set of documents;
  
  analyzing, by the computer system, the set of documents to obtain a statistical distribution based on values associated with the set of documents, the set of documents having a given size;
  
  computing a value of a function that measures distinctiveness of the obtained statistical distribution relative to a baseline statistical distribution of values associated with a baseline set of documents;
  
  normalizing the value relative to a distribution of values of the function that measures distinctiveness over a space of document sets, wherein a respective value of the function that measures distinctiveness corresponds to a respective document set within the space of document sets, wherein each document set in the space has a size that is comparable to the given size, and the act of normalizing the value includes an act of performing a computation on the value that accounts for the given size of the set of documents; and
  
  outputting a response derived from the normalized value.
  - 2. The method according to claim 1, wherein the set of documents comprises at least one document, wherein the at least one document further comprises a unit of storage of digital data.
  - 3. The method according to claim 2, wherein the at least one document further includes at least one of a data record within a database, textual information, non-textual information, audio files, video files, streaming data, a defined entity, and metadata.
  - 4. The method according to claim 1, wherein the act of normalizing further comprises an act of calculating a mean for an expected statistical distribution of the at least one identifying characteristic.
  - 5. The method according to claim 1, wherein the act of normalizing further comprises an act of calculating a standard deviation of an expected statistical distribution of the at least one identifying characteristic.
  - 6. The method according to claim 1, further comprising the acts of:
    - determining an expected statistical distribution of the at least one identifying characteristic;
      
      generating at least one comparison set; and
      
      determining a statistical distribution of at least one identifying characteristic for the comparison set.
  - 7. The method according to claim 6, wherein the act of generating at least one comparison set includes an act of generating a randomly selected set from a larger group of set members.
  - 8. The method according to claim 7, wherein the size of the at least one comparison set is determined based on the size of the measured set.
  - 9. The method according to claim 1, further comprising an act of calculating a percentile ranking, wherein the act of normalization occurs using a percentile ranking.
  - 10. The method according to claim 1, wherein the at least one identifying characteristic comprises at least one of at least a portion of:
    - textual information within a document;
      
      metadata associated with a document;
      
      contextual information associated with a document;
      
      non-textual information associated with a document;
      
      record information with a database;
      
      information associated with a composite entity; and
      
      information derivable from a document.
  - 11. The method according to claim 1, further comprising an act of calculating a statistical distribution for each one of at least one of the identifying characteristics.
  - 12. The method according to claim 1, wherein the statistical distribution is determined for multiple dimensions.
  - 13. The method according to claim 1, further comprising an act of determining at least one value associated with at least one set member.
  - 14. The method according to claim 13, wherein the statistical distribution of at least one identifying characteristic is based on a plurality of the at least one values associated with at least one set member, and wherein the plurality of the at least one values comprise a relation.
  - 15. The method according to claim 1, wherein the at least one identifying characteristic comprises at least one facet in a faceted information space.
  - 16. The method according to claim 1, further comprising an act of generating a representation of the set, wherein the representation of the set is adapted to statistical manipulation.
  - 17. The method according to claim 1, wherein the act of analyzing the set to obtain a statistical distribution further comprises an act of approximating the distribution.
  - 18. The method according to claim 17, wherein the act of approximating the distribution includes an act of employing sampling to calculate the statistical distribution for a set of documents.
  - 19. The method according to claim 17, wherein the act of approximating the distribution includes at least one of the acts of permitting modification of the set without recalculating the distribution, examining similar sets for similar distributions, and using previously analyzed sets to generate a statistical distribution, determining a maximal resolution, and determining a minimum threshold about zero.
  - 20. The method according to claim 1, further comprising an act of assigning a weight value associated with at least one set member.
  - 21. The method according to claim 20, wherein the act of computing the value of the function that measures distinctiveness includes an act of accounting for the weight value associated with at least one set member.
  - 22. The method according to claim 20, wherein the weight value comprises a relevance score and the method further comprises an act of determining if the relevance score exceeds a threshold.
  - 23. The method according to claim 20, wherein the weight value comprises a relevance score and the method further comprises acts of:
    - modeling a distribution of relevance scores for relevant documents and a distribution of scores for less relevant documents; and
      
      computing a separation between the modeled distributions.
  - 24. The method according to claim 1, further comprising an act of smoothing the statistical distribution within the set.
  - 25. The method according to claim 1, further comprising an act of calculating the measurement of distinctiveness with at least one function of relative entropy, Kullback-Leibler divergence, Euclidean distance, Manhattan distance, Hellinger distance, diversity difference, cosine difference, Jaccard distance, Jensen-Shannon divergence, and skew divergence.
  - 26. The method according to claim 1, wherein the act of computing the value of the function that measures distinctiveness further comprises acts of:
    - determining a similarity measure; and
      
      inverting the sense of the similarity measure.
  - 27. The method according to claim 26, wherein the similarity measure is calculated using at least one of Pearson correlation coefficient, Dice coefficient, overlap coefficient, and Lin similarity.
  - 28. The method as described in claim 1 wherein the set of documents is obtained as result of a query to the information retrieval system.
  - 29. The method as described in claim 1, wherein the information retrieval system implements a Boolean retrieval model.
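
Several dependent claims vary the distinctiveness function (claim 25), derive it by inverting a similarity measure (claim 26), or normalize by percentile ranking (claim 9). The sketch below illustrates a few of these options under stated assumptions: distributions are plain dictionaries over the same keys, cosine similarity stands in for the similarity measure being inverted, and all function names are hypothetical rather than taken from the patent.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits; p and q share the same keys."""
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k]) for k in a if a[k] > 0)

    # The midpoint distribution is positive wherever p or q is positive.
    m = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def hellinger(p, q):
    """Hellinger distance, bounded between 0 and 1."""
    squared = sum((math.sqrt(p[k]) - math.sqrt(q[k])) ** 2 for k in p)
    return math.sqrt(0.5 * squared)


def inverted_cosine(p, q):
    """Distinctiveness obtained by inverting the sense of cosine similarity."""
    dot = sum(p[k] * q[k] for k in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)


def percentile_rank(value, reference_values):
    """Percentage of reference values that are less than or equal to `value`."""
    at_or_below = sum(1 for r in reference_values if r <= value)
    return 100.0 * at_or_below / len(reference_values)
```

In this reading, `reference_values` would hold the distinctiveness scores of comparison sets of comparable size, so `percentile_rank` plays the role of the normalization step without assuming any particular shape for their distribution.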

30. A system for information processing, the system comprising:
- at least one processor operatively connected to a memory adapted to execute system components, and wherein the system further comprises:
  
  an access component adapted to access a set of documents obtained from an information retrieval system, wherein the access component is further configured to establish, automatically, at least one identifying characteristic within the set of documents;
  
  an analysis component adapted to obtain a statistical distribution based on values associated with the set of documents, the set of documents having a given size;
  
  a measurement component adapted to compute a value of a function that measures distinctiveness of the obtained statistical distribution relative to a baseline statistical distribution of values associated with a baseline set of documents;
  
  a normalization component adapted to normalize the value relative to a distribution of values of the function that measures distinctiveness over a space of document sets, wherein a respective value of the function that measures distinctiveness corresponds to a respective document set within the space of document sets, wherein each document set in the space has a size that is comparable to the given size, wherein the normalization component is further adapted to perform a computation on the value that accounts for the given size of the set of documents; and
  
  an output component adapted to generate a response derived from the normalized value.
  - 31. The system according to claim 30, wherein the normalization component is further adapted to calculate a mean for an expected statistical distribution of the at least one identifying characteristic.
  - 32. The system according to claim 30, wherein the normalization component is further adapted to calculate a standard deviation for an expected statistical distribution of the at least one identifying characteristic.
  - 33. The system according to claim 30, wherein the analysis component is further adapted to determine an expected statistical distribution of the at least one identifying characteristic for the set of documents.
  - 34. The system according to claim 33, further comprising a generation component adapted to generate at least one comparison set;
    - and wherein the analysis component is further adapted to determine a statistical distribution of at least one identifying characteristic for the comparison set.
  - 35. The system according to claim 34, wherein the measurement component is further adapted to generate a measure of distinctiveness for the at least one comparison set.
  - 36. The system according to claim 34, wherein the size of the at least one comparison set is determined based on the size of the measured set.
  - 37. The system according to claim 30, wherein the at least one identifying characteristic comprises at least one of at least a portion of:
    - textual information within a document;
      
      metadata associated with a document;
      
      contextual information associated with a document;
      
      non-textual information associated with a document;
      
      record information with a database;
      
      information associated with a composite entity; and
      
      information derivable from a document.
  - 38. The system according to claim 30, wherein the analysis component is further adapted to calculate a statistical distribution for each one of at least one of the identifying characteristics.
  - 39. The system according to claim 30, wherein the statistical distribution is determined for multiple dimensions.
  - 40. The system according to claim 30, further comprising a correlation component adapted to generate at least one value associated with at least one set member.
  - 41. The system according to claim 30, wherein the at least one identifying characteristic comprises at least one facet in a faceted information space.
  - 42. The system according to claim 30, further comprising an approximation component adapted to generate a representation of the set, wherein the representation of the set is adapted to statistical manipulation.
  - 43. The system according to claim 30, wherein the analysis component is further adapted to approximate the distribution.
  - 44. The system according to claim 43, wherein the analysis component is further adapted to sample a set of documents to calculate the statistical distribution for the set of documents.
  - 45. The system according to claim 30, further comprising a weighting component adapted to assign a weight value associated with at least one set member.
  - 46. The system according to claim 45, wherein the measurement component is further adapted to account for the weight value associated with at least one set member in the measurement of distinctiveness.
  - 47. The system according to claim 45, wherein the weight value comprises a relevance score, and the weighting component is further adapted to determine if the relevance score exceeds a threshold.
  - 48. The system according to claim 30, further comprising a smoothing component adapted to smooth the statistical distribution of the at least one identifying characteristic within the set.
  - 49. The system according to claim 30, wherein the measurement component is further adapted to calculate the measurement of distinctiveness with at least one function of relative entropy, Kullback-Leibler divergence, Euclidean distance, Manhattan distance, Hellinger distance, diversity difference, cosine difference, Jaccard distance, Jensen-Shannon divergence, and skew divergence.
  - 50. The system according to claim 30, wherein the measurement component is further adapted to determine a similarity measure, and invert a sense of the similarity measure.
  - 51. The system according to claim 30, wherein the set of documents comprises at least one document, wherein the at least one document further comprises a unit of storage of digital data.
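
The weighting and smoothing components of claims 45 through 48 can be illustrated with a relevance-weighted distribution. This is a rough sketch under assumptions the claims do not spell out: each set member is a hypothetical `(value, relevance_score)` pair, members whose score does not exceed the threshold are simply dropped, and additive smoothing with a constant `alpha` stands in for the smoothing component.

```python
from collections import defaultdict


def weighted_distribution(members, support, threshold=0.0, alpha=0.5):
    """Relevance-weighted, additively smoothed distribution over `support`.

    `members` is an iterable of (value, relevance_score) pairs. Each member
    adds its score, rather than a count of one, to the mass of its value.
    """
    mass = defaultdict(float)
    for value, score in members:
        # Assumption: a score at or below the threshold contributes nothing.
        if score > threshold:
            mass[value] += score

    # Smoothing gives every value in the support a small nonzero probability,
    # which keeps a later relative entropy computation finite.
    total = sum(mass.values()) + alpha * len(support)
    return {v: (mass.get(v, 0.0) + alpha) / total for v in support}
```

A distribution built this way can be passed to any distinctiveness function that expects probabilities over a shared support, so highly relevant members influence the measure more than marginal ones.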

52. A non-transitory computer-readable medium having computer-readable instructions stored thereon that define instructions that, as a result of being executed by a computer, instruct the computer to perform a method for information processing, the method comprising:
- accessing a set of documents obtained from the information retrieval system;
  
  establishing, automatically, at least one identifying characteristic within the set of documents;
  
  analyzing the set of documents to obtain a statistical distribution based on values associated with the set of documents, the set of documents having a given size;
  
  computing a value of a function that measures distinctiveness of the obtained statistical distribution relative to a baseline statistical distribution of values associated with a baseline set of documents;
  
  normalizing the value relative to a distribution of values of the function that measures distinctiveness over a space of document sets, wherein a respective value of the function that measures distinctiveness corresponds to a respective document set within the space of document sets, wherein each document set in the space has a size that is comparable to the given size, and the act of normalizing the value includes an act of performing a computation on the value that accounts for the given size of the set of documents; and
  
  outputting a response derived from the normalized value.

Current Assignee: Oracle OTC Subsidiary LLC (Oracle Corporation)
Original Assignee: Endeca Technologies Incorporated (Oracle Corporation)
Inventors: Zelevinsky, Vladimir; Wang, Joyce Jeanpin; Tunkelang, Daniel
Primary Examiner(s): Breene, John E
Assistant Examiner(s): Phillips, III, Albert M
Application Number: US12/146,185
Publication Number: US 20090006383A1
Time in Patent Office: 1,182 Days
Field of Search: 707/738, 707/722
US Class Current: 707/722
CPC Class Codes:
G06F 16/245 Query processing
G06F 16/3331 Query processing
