Text representation and method
Abstract
A computer method for representing a natural-language document in a vector form suitable for text manipulation operations is disclosed. The method involves determining, for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the document, a selectivity value of the term related to the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively. The document is then represented as a vector of terms, where the coefficient assigned to each term includes a function of the selectivity value determined for that term and is optionally also related to the inverse document frequency of that term in one or more libraries of texts. Also disclosed are computer-readable code for carrying out the method, a computer system that employs the code, and a vector produced by the method.
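The method summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation. The stop list `GENERIC_WORDS`, the use of adjacent word pairs as "proximately arranged word groups", the `floor` guard against division by zero, the square root as the coefficient function, and the comparison against a single other library are all assumptions made for illustration.

```python
import math
import re

# Hypothetical stop list standing in for the "generic words" excluded from terms.
GENERIC_WORDS = {"a", "an", "the", "of", "in", "to", "and", "is", "for"}


def extract_terms(text):
    """Return non-generic words plus adjacent word pairs (assumed form of word groups)."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in GENERIC_WORDS]
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs


def text_rate(term, term_sets):
    """Fraction of library texts that contain the term."""
    return sum(term in terms for terms in term_sets) / len(term_sets)


def selectivity(term, field_sets, other_sets, floor=0.01):
    """Rate in the field library relative to the rate in one other library.

    `floor` is an assumption that avoids dividing by zero for terms
    that never occur in the other library.
    """
    return text_rate(term, field_sets) / max(text_rate(term, other_sets), floor)


def idf(term, term_sets):
    """Standard inverse document frequency, log(N / n); 0.0 if the term is absent."""
    n = sum(term in terms for terms in term_sets)
    return math.log(len(term_sets) / n) if n else 0.0


def document_vector(document, field_texts, other_texts, use_idf=False):
    """Map each term of the document to a coefficient derived from its selectivity."""
    field_sets = [set(extract_terms(t)) for t in field_texts]
    other_sets = [set(extract_terms(t)) for t in other_texts]
    vector = {}
    for term in sorted(set(extract_terms(document))):
        # Square root is an arbitrary choice of "function of the selectivity value".
        coefficient = math.sqrt(selectivity(term, field_sets, other_sets))
        if use_idf:
            coefficient *= idf(term, field_sets + other_sets)
        vector[term] = coefficient
    return vector


# Toy libraries for two fields (invented data, for illustration only).
field_texts = [
    "the enzyme binds the substrate",
    "enzyme kinetics of the reaction",
    "a reaction in the cell",
]
other_texts = [
    "the circuit drives the motor",
    "a reaction to the voltage",
    "motor control circuit",
]

vector = document_vector("The enzyme speeds the reaction", field_texts, other_texts)
vector_idf = document_vector(
    "The enzyme speeds the reaction", field_texts, other_texts, use_idf=True
)
```

In this toy data, "enzyme" occurs only in the first library and so receives a much larger coefficient than "reaction", which occurs in both; "speeds" occurs in neither library and receives a coefficient of zero.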
125 Citations
25 Claims
-
1. A computer-executed method for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the document, determining a selectivity value calculated as the frequency of occurrence of the term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(b) representing the document as a vector of terms, where a coefficient assigned to each term is a function of the selectivity value determined for the term.
-
11. An automated system for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(1) a computer,
(2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain the word, associated library identifiers for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively,
(3) a computer readable code which is operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the document, a selectivity value of the term, and
(b) representing the document as a vector of terms, where a coefficient assigned to each term is a function of the selectivity value determined for the term.
-
15. Computer readable code for use with an electronic computer and a database of word records for representing a natural-language document in a vector form suitable for text manipulation operations, where each record in the word records database includes text identifiers of the library texts that contain the word, an associated library identifier for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the document, a selectivity value of the term, and
(b) representing the document as a vector of terms, where a coefficient assigned to each term is related to the selectivity value determined for the term.
-
19. A vector representation of a natural-language document comprising
a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the document, where each term has an assigned coefficient which includes a function of the selectivity value of the term, where the selectivity value of the term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively.
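The database of word records recited in claims 11 and 15 might be sketched as follows. This is only one possible layout, not the structure disclosed in the specification. The `WordRecord` class, the `build_database` and `selectivity_from_record` helpers, the `floor` guard, and the choice to compare against the highest rate among the other libraries are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WordRecord:
    """One record per word (hypothetical layout)."""
    word: str
    # (text_id, library_id) for every library text that contains the word.
    occurrences: list = field(default_factory=list)
    # Optional cached selectivity values, keyed by library_id.
    selectivity: dict = field(default_factory=dict)


def build_database(texts):
    """Build word records from an iterable of (text_id, library_id, words)."""
    database = {}
    for text_id, library_id, words in texts:
        for word in set(words):
            record = database.setdefault(word, WordRecord(word))
            record.occurrences.append((text_id, library_id))
    return database


def selectivity_from_record(record, library_sizes, field_id, floor=0.01):
    """Rate in the chosen field relative to the highest rate in any other library.

    Taking the maximum over other libraries and clamping it with `floor`
    are assumptions made to obtain a single, finite value.
    """
    rates = {}
    for library_id, size in library_sizes.items():
        hits = sum(1 for _, lib in record.occurrences if lib == library_id)
        rates[library_id] = hits / size
    other_rate = max(rate for lib, rate in rates.items() if lib != field_id)
    value = rates[field_id] / max(other_rate, floor)
    record.selectivity[field_id] = value  # cache the value on the record
    return value


# Toy data (invented): two texts in each of two libraries.
texts = [
    ("T1", "bio", ["enzyme", "reaction"]),
    ("T2", "bio", ["enzyme", "cell"]),
    ("T3", "eng", ["motor", "reaction"]),
    ("T4", "eng", ["motor", "circuit"]),
]
library_sizes = {"bio": 2, "eng": 2}

database = build_database(texts)
reaction_value = selectivity_from_record(database["reaction"], library_sizes, "bio")
enzyme_value = selectivity_from_record(database["enzyme"], library_sizes, "bio")
```

Storing the library identifier alongside each text identifier lets the selectivity value be derived from the record alone, given only the library sizes, and the cached value can then be read back when a document vector is assembled.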
Specification