Method and apparatus for comparing scores in a vector space retrieval process
Abstract
The delivery ratio of r (which is a fraction between 0 and 1) partitions a stream of documents into a section of top scoring r-fraction of documents and the remainder. This way a set of successively bigger delivery ratios, r_1, r_2, r_3, ..., sections the stream into tiers. Any given document is assigned to a tier according to how many delivery ratio thresholds it matched or surpassed and how many it failed to reach. This creates a scoring structure which reflects the specificity of the document with respect to a profile in terms of density of relevant documents in the stream. In other words, a document in the kth tier is such that it failed to be classified in the top r_k ratio of the stream (thus r_k fraction of the stream is more relevant to the given profile than the document under consideration). At the same time this document was classified as being in the top r_{k-1} part of the stream. Thus this mechanism defines a score (let's call it σ) for a document depending on how it compares to other documents in the stream when scored against a given profile.
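The tiering described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names `ratio_cutoffs` and `sigma`, the sample scores, and the choice to define σ as the count of delivery ratio thresholds a document matched or surpassed are all assumptions made for illustration. The ratios are taken in increasing order, so the first cutoff is the most selective.

```python
def ratio_cutoffs(scores, ratios):
    """For each delivery ratio r, return the lowest raw score that still
    falls inside the top-scoring r-fraction of the stream (assumed helper)."""
    ranked = sorted(scores, reverse=True)
    cutoffs = []
    for r in ratios:
        # Number of documents delivered at ratio r; deliver at least one.
        n = max(1, round(r * len(ranked)))
        cutoffs.append(ranked[n - 1])
    return cutoffs


def sigma(score, cutoffs):
    """Assumed definition of the score sigma: how many delivery ratio
    thresholds the document matched or surpassed."""
    return sum(1 for cutoff in cutoffs if score >= cutoff)


# Hypothetical raw scores of ten documents against one profile.
stream_scores = [0.91, 0.15, 0.42, 0.77, 0.08, 0.63, 0.30, 0.55, 0.22, 0.49]
# Successively bigger delivery ratios r_1 < r_2 < r_3.
ratios = [0.1, 0.3, 0.5]

cutoffs = ratio_cutoffs(stream_scores, ratios)
for s in (0.91, 0.77, 0.55, 0.08):
    print(s, "->", sigma(s, cutoffs))
```

Under these assumptions, a document that passes every threshold receives the highest σ and a document that passes none receives 0, so σ depends only on where the document ranks in the stream, not on the scale of the raw scores.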
45 Claims
1. A method for analyzing documents from a data source, comprising:
analyzing a reference corpus using a profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the profile;
identifying a particular reference corpus document score that corresponds to a particular delivery ratio of documents of the reference corpus based on the analysis of the reference corpus;
assigning threshold scores to a multiplicity N of score threshold levels such that the particular reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
analyzing a data source using the profile and determining raw document scores for documents from the data source relative to the profile based upon the analysis of the data source;
comparing the raw document scores to the threshold scores of the N score threshold levels;
assigning normalized document scores to documents of the data source based on the comparison of the raw document scores to the threshold scores of the N score threshold levels as indicators of document relevancy to the profile; and
selecting a document based upon its normalized document score.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A method for analyzing documents from a data source, comprising:
analyzing a reference corpus using a first profile and a second profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the first and second profiles;
identifying a first reference corpus document score that corresponds to a first delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the first profile;
identifying a second reference corpus document score that corresponds to a second delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the second profile;
for the first profile, assigning first threshold scores to a multiplicity N of score threshold levels such that the first reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
for the second profile, assigning second threshold scores to the N score threshold levels such that the second reference corpus document score is assigned to be a threshold score for the given one of the N score threshold levels;
analyzing a data source using the first and second profiles and determining first and second raw document scores for documents from the data source relative to the first and second profiles, respectively, based upon the analysis of the data source;
comparing the first and second raw document scores generated from the analyses of the data source to the respective first and second threshold scores of the N score threshold levels;
assigning first normalized document scores to documents of the data source based on the comparison of the first raw document scores to the first threshold scores of the N score threshold levels as indicators of document relevancy to the first profile;
assigning second normalized document scores to the documents of the data source based on the comparison of the second raw document scores to the second threshold scores of the N score threshold levels as indicators of document relevancy to the second profile; and
classifying a given document of the data source as being relevant to at least one of the first profile and the second profile if at least one of the first normalized document score and the second normalized document score of the given document, respectively, satisfy a relevance threshold.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
16. A system for analyzing documents from a data source, comprising:
a memory; and
a processing system coupled to the memory, wherein the processing system is configured to:
analyze a reference corpus using a profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the profile;
identify a particular reference corpus document score that corresponds to a particular delivery ratio of documents of the reference corpus based on the analysis of the reference corpus;
assign threshold scores to a multiplicity N of score threshold levels such that the particular reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
analyze a data source using the profile and determine raw document scores for documents from the data source relative to the profile based upon the analysis of the data source;
compare the raw document scores to the threshold scores of the N score threshold levels;
assign normalized document scores to documents of the data source based on the comparison of the raw document scores to the threshold scores of the N score threshold levels as indicators of document relevancy to the profile; and
select a document based upon its normalized document score.
Dependent claims: 17, 18, 19, 20, 21, 22
23. A system for analyzing documents from a data source, comprising:
a memory; and
a processing system coupled to the memory, wherein the processing system is configured to:
analyze a reference corpus using a first profile and a second profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the first and second profiles;
identify a first reference corpus document score that corresponds to a first delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the first profile;
identify a second reference corpus document score that corresponds to a second delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the second profile;
for the first profile, assign first threshold scores to a multiplicity N of score threshold levels such that the first reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
for the second profile, assign second threshold scores to the N score threshold levels such that the second reference corpus document score is assigned to be a threshold score for the given one of the N score threshold levels;
analyze a data source using the first and second profiles and determine first and second raw document scores for documents from the data source relative to the first and second profiles, respectively, based upon the analysis of the data source;
compare the first and second raw document scores generated from the analyses of the data source to the respective first and second threshold scores of the N score threshold levels;
assign first normalized document scores to documents of the data source based on the comparison of the first raw document scores to the first threshold scores of the N score threshold levels as indicators of document relevancy to the first profile;
assign second normalized document scores to the documents of the data source based on the comparison of the second raw document scores to the second threshold scores of the N score threshold levels as indicators of document relevancy to the second profile; and
classify a given document of the data source as being relevant to at least one of the first profile and the second profile if at least one of the first normalized document score and the second normalized document score of the given document, respectively, satisfy a relevance threshold.
Dependent claims: 24, 25, 26, 27, 28, 29, 30
31. An article of manufacture comprising a computer readable medium having embodied therein computer readable program code for analyzing documents, the computer readable program code being adapted to cause a processing system to:
analyze a reference corpus using a profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the profile;
identify a particular reference corpus document score that corresponds to a particular delivery ratio of documents of the reference corpus based on the analysis of the reference corpus;
assign threshold scores to a multiplicity N of score threshold levels such that the particular reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
analyze a data source using the profile and determine raw document scores for documents from the data source relative to the profile based upon the analysis of the data source;
compare the raw document scores to the threshold scores of the N score threshold levels;
assign normalized document scores to documents of the data source based on the comparison of the raw document scores to the threshold scores of the N score threshold levels as indicators of document relevancy to the profile; and
select a document based upon its normalized document score.
Dependent claims: 32, 33, 34, 35, 36, 37
38. An article of manufacture comprising a computer readable medium having embodied therein computer readable program code for analyzing documents, the computer readable program code being adapted to cause a processing system to:
analyze a reference corpus using a first profile and a second profile to determine reference corpus document scores indicative of content of documents in the reference corpus relative to the first and second profiles;
identify a first reference corpus document score that corresponds to a first delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the first profile;
identify a second reference corpus document score that corresponds to a second delivery ratio of documents of the reference corpus based on the analysis of the reference corpus using the second profile;
for the first profile, assign first threshold scores to a multiplicity N of score threshold levels such that the first reference corpus document score is assigned to be a threshold score for a given one of the N score threshold levels;
for the second profile, assign second threshold scores to the N score threshold levels such that the second reference corpus document score is assigned to be a threshold score for the given one of the N score threshold levels;
analyze a data source using the first and second profiles and determine first and second raw document scores for documents from the data source relative to the first and second profiles, respectively, based upon the analysis of the data source;
compare the first and second raw document scores generated from the analyses of the data source to the respective first and second threshold scores of the N score threshold levels;
assign first normalized document scores to documents of the data source based on the comparison of the first raw document scores to the first threshold scores of the N score threshold levels as indicators of document relevancy to the first profile;
assign second normalized document scores to the documents of the data source based on the comparison of the second raw document scores to the second threshold scores of the N score threshold levels as indicators of document relevancy to the second profile; and
classify a given document of the data source as being relevant to at least one of the first profile and the second profile if at least one of the first normalized document score and the second normalized document score of the given document, respectively, satisfy a relevance threshold.
Dependent claims: 39, 40, 41, 42, 43, 44, 45
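The two-profile claims above can be illustrated with a minimal sketch, offered only as one possible reading and not as the claimed implementation. Several things here are assumptions: the toy `raw_score` function (a sum of term weights, standing in for whatever vector space score a real system would compute), the names `calibrate`, `normalize`, and `classify`, the sample corpus and profiles, and the choice to pin every one of the N threshold levels to a delivery ratio (the claims require this only for a given one of the levels).

```python
def raw_score(doc_terms, profile):
    """Toy raw score (assumption): sum of profile weights for terms in the document."""
    return sum(weight for term, weight in profile.items() if term in doc_terms)


def calibrate(reference_docs, profile, ratios):
    """Score the reference corpus against a profile and return one threshold
    score per delivery ratio (most selective ratio first)."""
    ranked = sorted((raw_score(d, profile) for d in reference_docs), reverse=True)
    return [ranked[max(1, round(r * len(ranked))) - 1] for r in ratios]


def normalize(raw, thresholds):
    """Normalized score (assumption): number of threshold levels met or exceeded."""
    return sum(1 for t in thresholds if raw >= t)


def classify(doc_terms, profiles, thresholds_by_profile, relevance_level):
    """Return {profile name: normalized score} for profiles whose normalized
    score satisfies the relevance threshold."""
    hits = {}
    for name, profile in profiles.items():
        norm = normalize(raw_score(doc_terms, profile), thresholds_by_profile[name])
        if norm >= relevance_level:
            hits[name] = norm
    return hits


# Hypothetical profiles whose raw scores live on different scales.
profiles = {
    "sports": {"goal": 1.0, "match": 0.5},
    "finance": {"stock": 3.0, "bond": 2.0, "market": 1.0},
}
# Hypothetical reference corpus: each document is a set of terms.
reference = [
    {"goal", "match"}, {"goal"}, {"match"},
    {"stock", "bond", "market"}, {"stock", "market"}, {"bond"}, {"market"},
    {"weather"}, {"recipe"}, {"travel"},
]
ratios = [0.1, 0.2, 0.3]  # N = 3 threshold levels
thresholds = {name: calibrate(reference, p, ratios) for name, p in profiles.items()}

incoming = {"goal", "stock"}
print(classify(incoming, profiles, thresholds, relevance_level=2))
```

In this example the incoming document has a higher raw score against the finance profile (3.0) than against the sports profile (1.0), yet only its sports score reaches normalized level 2. That is the point of calibrating each profile against the same reference corpus: normalized scores from different profiles become comparable even when their raw scores are not.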
Specification