Method and system for determining relevance of terms in text documents
First Claim
1. A computer implemented method comprising:
- receiving a list comprising an entity, the entity having been identified as being associated with an electronic document;
- based solely upon a set of characteristics of the document, determining a relevancy score associated with the entity with respect to the document, wherein the set of characteristics includes at least one characteristic from the group consisting of:
  - a first number representing a number of sentences occurring in the document prior to a first sentence in which the entity is named;
  - a second number representing a number of sentences between first and last occurrences of the entity within the document; and
  - a third number representing a uniformity with which the entity occurs within the document; and
- storing the relevancy score.
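As a rough illustration of the three document-level characteristics recited in the claim, the following Python sketch computes them for a single entity. The claim does not specify a sentence splitter, a matching rule, or a formula for "uniformity", so the naive splitter, the case-insensitive substring match, and the gap-based uniformity measure below are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def entity_characteristics(text, entity):
    """Compute three position-based characteristics for one entity.

    Returns None when the entity is never named in the text.
    """
    sentences = split_sentences(text)
    needle = entity.lower()
    # Indices of sentences naming the entity (plain substring match, an assumption).
    hits = [i for i, s in enumerate(sentences) if needle in s.lower()]
    if not hits:
        return None
    first, last = hits[0], hits[-1]
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    if gaps:
        # Hypothetical uniformity measure: 1 / (1 + coefficient of variation
        # of the gaps). Evenly spaced mentions give exactly 1.0.
        uniformity = 1.0 / (1.0 + pstdev(gaps) / mean(gaps))
    else:
        uniformity = None  # a single mention has no spacing to measure
    return {
        "sentences_before_first": first,
        "sentences_between": max(last - first - 1, 0),
        "uniformity": uniformity,
    }
```

Here "sentences between" is read as the count of sentences strictly between the first and last mentioning sentences; other readings (for example, an inclusive span) would be equally easy to substitute.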
Abstract
The present invention provides a corpus-independent method for determining relevancy of terms to content of text appearing in a document by analyzing the document itself. Conventional information extraction, or other methods, may be applied to a document to generate a list of terms. The invention analyzes the document using relevancy scoring algorithms to determine a term relevancy score representing the term's relevance to the text contained in the document. The scores, including an aggregate score, may be normalized in the process. Based on relevancy scoring, terms are then ranked and further processed. In this manner relevancy is determined based on the subject document itself and by analyzing the occurrences and locations of the terms within the document. Additional techniques may be applied to relate the relevancy scores generated by the present invention to a corpus or collection of documents.
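The abstract notes that scores may be normalized but does not say how. One common option, assumed here purely for illustration, is min-max scaling of the raw scores across the terms of a single document, so that every score falls in the range [0, 1]. The function name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(raw_scores):
    """Scale a {term: raw_score} mapping to the range [0, 1]."""
    if not raw_scores:
        return {}
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    if hi == lo:
        # All terms scored the same, so treat them as equally relevant.
        return {term: 1.0 for term in raw_scores}
    return {term: (score - lo) / (hi - lo) for term, score in raw_scores.items()}
```

Because the scaling uses only the scores from one document, it stays within the corpus-independent approach the abstract describes.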
49 Claims
1. A computer implemented method comprising:
- receiving a list comprising an entity, the entity having been identified as being associated with an electronic document;
- based solely upon a set of characteristics of the document, determining a relevancy score associated with the entity with respect to the document, wherein the set of characteristics includes at least one characteristic from the group consisting of:
  - a first number representing a number of sentences occurring in the document prior to a first sentence in which the entity is named;
  - a second number representing a number of sentences between first and last occurrences of the entity within the document; and
  - a third number representing a uniformity with which the entity occurs within the document; and
- storing the relevancy score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A computer-implemented method comprising:
- receiving terms extracted from a first electronic document;
- scoring the extracted terms using two or more term relevancy algorithms based solely upon the first electronic document, wherein the two or more relevancy algorithms are from the group consisting of:
  - determining a first number representing a number of times the term is mentioned in the document;
  - determining a second number representing a proximity of a first occurrence of the term to the beginning of the document;
  - determining a third number representing a proximity of a first occurrence of the term to a last occurrence of the term within the document; and
  - determining a fourth number representing overall changes in a rate of occurrences of the term throughout the document;
- aggregating for each of the extracted terms the relevancy scores generated by the two or more term relevancy algorithms to produce a term aggregate relevancy score for each of the extracted terms; and
- ranking each of the extracted terms based on the term aggregate relevancy score assigned to the extracted terms to determine a relevance ranking of the extracted terms to the first electronic document.
Dependent claims: 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26
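The score, aggregate, and rank sequence of claim 15 can be sketched as follows. This is a minimal illustration and not the patented implementation: the claim names what each number represents but gives no formulas, so every formula below is an assumption, including the handling of single-word terms only, the two-halves comparison used as a crude proxy for changes in occurrence rate, and the weighted mean used for aggregation. The names `term_scores` and `rank_terms` are hypothetical.

```python
import re


def term_scores(text, term):
    """Four per-term scores in [0, 1]; every formula is an illustrative assumption."""
    tokens = re.findall(r"\w+", text.lower())
    n = len(tokens)
    positions = [i for i, tok in enumerate(tokens) if tok == term.lower()]
    if not positions:
        return None
    first, last = positions[0], positions[-1]
    # Rate-change proxy: compare mention counts in the two halves of the document.
    front = sum(1 for p in positions if p < n / 2)
    back = len(positions) - front
    return {
        "frequency": len(positions) / n,   # share of tokens that are the term
        "early": 1.0 - first / n,          # nearer the start scores higher
        "spread": (last - first) / n,      # first-to-last distance as a share of the text
        "steadiness": 1.0 - abs(front - back) / len(positions),
    }


def rank_terms(text, terms, weights=None):
    """Aggregate each term's scores by weighted mean and sort in descending order."""
    ranked = []
    for term in terms:
        scores = term_scores(text, term)
        if scores is None:
            continue  # term does not occur in this document
        w = weights or {name: 1.0 for name in scores}
        total = sum(w[name] * value for name, value in scores.items()) / sum(w.values())
        ranked.append((term, total))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

A call such as `rank_terms(document_text, ["alpha", "beta"])` returns `(term, aggregate_score)` pairs with the highest-scoring term first. Passing a `weights` mapping lets one algorithm count for more than another, which is one possible way to tune the aggregate.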
27. An article comprising a non-transitory machine-readable medium, the medium having stored thereon instructions to be executed by a machine to perform operations, the article comprising instructions for:
- receiving a list comprising a term, the term having been identified as being associated with an electronic document;
- based solely upon a set of characteristics of the document, determining and assigning a relevancy score to the term with respect to the document, wherein the set of characteristics includes at least one characteristic from the group consisting of:
  - a first number representing a number of sentences occurring in the document prior to a first sentence in which the term appears;
  - a second number representing a number of sentences between first and last occurrences of the term within the document; and
  - a third number representing a uniformity with which the term occurs within the document; and
- storing the relevancy score.
Dependent claims: 28, 29, 30, 31, 32, 33, 34, 35, 36
37. A computer-based system comprising memory and a processor for executing instructions to perform operations, the system comprising:
- input adapted to receive a list comprising a term, the term having been identified as being associated with an electronic document;
- relevancy scoring module adapted to determine a relevancy score associated with the term with respect to the document, the relevancy score being based solely upon a set of characteristics of the document, wherein the set of characteristics includes at least one characteristic from the group consisting of:
  - a first number representing a number of sentences occurring in the document prior to a first sentence in which the term appears;
  - a second number representing a number of sentences between first and last occurrences of the term within the document; and
  - a third number representing a uniformity with which the term occurs within the document; and
- memory for storing the relevancy score.
Dependent claims: 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49
Specification