Text-representation code, system, and method
Abstract
A computer method for representing a natural-language document in a vector form suitable for text manipulation operations is disclosed. The method involves determining, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term related to the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively. The document is represented as a vector of terms, where the coefficient assigned to each term includes a function of the selectivity value determined for that term, and is optionally related to the inverse document frequency of that term in one or more libraries of texts. Also disclosed are a computer-readable code for carrying out the method, a computer system that employs the code, and a vector generated by the method.
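As an illustrative sketch only, the quantities named in the abstract can be written out as follows. The symbols are assumptions introduced here, not notation from the patent: f_{t,F} is the frequency of term t in the library for field F, f_{t,F'} its frequency in a library for another field F', g is an unspecified function of the selectivity value, N is the number of texts in a library, and n_t the number of those texts containing t. The product form of the coefficient is likewise an assumption; the abstract says only that the coefficient includes a function of the selectivity value and is optionally related to the inverse document frequency.

```latex
% Selectivity of term t for field F, relative to another field F'
s_t = \frac{f_{t,F}}{f_{t,F'}}

% Standard inverse document frequency over a library of N texts
\mathrm{idf}_t = \log\frac{N}{n_t}

% One possible coefficient for term t in the document vector (assumed form)
c_t = g(s_t)\cdot \mathrm{idf}_t
```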
26 Claims
1. A computer-executed method for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(a) for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), determining a selectivity value calculated as the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for that term.
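Steps (a) and (b) of claim 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: a library is modeled as a list of tokenized texts, "frequency of occurrence" is taken to be the fraction of texts containing the term, the other-field libraries are averaged, and the names `term_frequency`, `selectivity`, `document_vector`, and the `floor` parameter are hypothetical.

```python
def term_frequency(term, library):
    """Fraction of texts in `library` (a list of token lists) that contain `term`."""
    if not library:
        return 0.0
    return sum(1 for text in library if term in text) / len(library)


def selectivity(term, field_library, other_libraries, floor=1e-6):
    """Frequency of `term` in one field's library relative to other fields.

    Assumption: the other libraries are averaged, and `floor` guards
    against division by zero when the term never occurs elsewhere.
    """
    in_field = term_frequency(term, field_library)
    elsewhere = sum(term_frequency(term, lib) for lib in other_libraries)
    elsewhere /= len(other_libraries)
    return in_field / max(elsewhere, floor)


def document_vector(doc_terms, field_library, other_libraries,
                    coefficient=lambda s: s):
    """Map each distinct document term to a coefficient derived from its selectivity.

    `coefficient` stands in for the unspecified "function of the
    selectivity value"; the identity function is used by default.
    """
    return {
        term: coefficient(selectivity(term, field_library, other_libraries))
        for term in set(doc_terms)
    }
```

A term that is common in the target field but rare elsewhere receives a large coefficient, while a term that is equally common in every field receives a coefficient near 1.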
11. An automated system for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(1) a computer,
(2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain that word, associated library identifiers for each text, and, optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively,
(3) a computer readable code which is operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for that term.
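The word-records database recited in claim 11 might be modeled as follows. This is a hedged sketch: the `WordRecord` class, the `lookup_selectivity` function, the `library_sizes` mapping, and the fallback calculation from text counts are all assumptions made for illustration, since the claim specifies only what each record contains.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class WordRecord:
    # Text identifier -> identifier of the library that text belongs to.
    text_libraries: Dict[str, str]
    # Optional precomputed selectivity values, keyed by library identifier.
    selectivity: Dict[str, float] = field(default_factory=dict)


def lookup_selectivity(db, word, library_id, library_sizes,
                       floor=1e-6) -> Optional[float]:
    """Return the selectivity of `word` for `library_id`.

    Uses the stored value if the record has one; otherwise derives it
    from the text and library identifiers in the record (assumed method).
    `library_sizes` maps each library identifier to its number of texts.
    """
    record = db.get(word)
    if record is None:
        return None
    if library_id in record.selectivity:
        return record.selectivity[library_id]
    others = [lib for lib in library_sizes if lib != library_id]
    if not others:
        return None
    counts = Counter(record.text_libraries.values())
    in_field = counts[library_id] / library_sizes[library_id]
    elsewhere = sum(counts[lib] / library_sizes[lib] for lib in others)
    elsewhere /= len(others)
    return in_field / max(elsewhere, floor)
```

Storing the selectivity value in the record trades database size for lookup speed; computing it on demand from the identifiers keeps the records smaller.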
15. Computer readable code for use with an electronic computer and a database of word records for representing a natural-language document in a vector form suitable for text manipulation operations, where each record in the word records database includes text identifiers of the library texts that contain that word, an associated library identifier for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is related to the selectivity value determined for that term.
19. A vector representation of a natural-language document comprising
a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), where each term has an assigned coefficient which includes a function of the selectivity value of that term, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively.
26. A computer-executed method for generating a set of proximately arranged word pairs in a natural-language document, comprising
(a) generating a list of proximately arranged word pairs in the document,
(b) determining, for each word pair, a selectivity value calculated as the frequency of occurrence of that word pair in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(c) retaining the word pair in the set if the determined selectivity value is above a selected threshold value.
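The three steps of claim 26 can be illustrated with the short Python sketch below. Several details are assumptions, because the claim does not fix them: "proximately arranged" is read as two non-generic words within `window` positions of each other, pairs are ordered, and the selectivity values are supplied through a hypothetical precomputed mapping `pair_selectivity` rather than calculated from libraries here.

```python
def proximate_word_pairs(tokens, window=2, generic=frozenset()):
    """Step (a): list ordered pairs of non-generic words lying within
    `window` positions of each other, without duplicates."""
    words = [t for t in tokens if t not in generic]
    pairs = []
    for i, first in enumerate(words):
        for second in words[i + 1 : i + 1 + window]:
            pair = (first, second)
            if pair not in pairs:
                pairs.append(pair)
    return pairs


def retain_selective_pairs(pairs, pair_selectivity, threshold):
    """Steps (b) and (c): look up each pair's selectivity value and keep
    the pair only if that value is above `threshold`.

    Pairs missing from `pair_selectivity` are treated as having a
    selectivity of 0.0 and are therefore dropped (an assumption).
    """
    return [p for p in pairs if pair_selectivity.get(p, 0.0) > threshold]
```

For example, with the generic words "the" and "is" filtered out and `window=1`, the phrase "the balloon catheter is inserted" yields the pairs ("balloon", "catheter") and ("catheter", "inserted"); only pairs whose selectivity exceeds the chosen threshold are retained.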
Specification