Code, system and method for representing a natural-language text in a form suitable for text manipulation
Abstract
A computer method, system, and code for representing a natural-language document in a vector form suitable for text manipulation operations are disclosed. The method involves determining, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term related to the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively. The document is then represented as a vector of terms, where the coefficient assigned to each term includes a function of the selectivity value determined for that term.
26 Claims
-
1. A computer-executed method for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(a) for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), determining a selectivity value calculated as the frequency of occurrence of said each term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for said each term.
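Steps (a) and (b) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. It makes several assumptions the claim does not state: each library text is modeled as a set of words, "frequency of occurrence" is taken as the fraction of texts in a library that contain the term, the comparison against several other libraries uses the highest of their frequencies, and the coefficient is the selectivity value itself. The names `GENERIC_WORDS`, `term_frequency`, `selectivity`, and `document_vector` are hypothetical.

```python
# Hypothetical stop list standing in for "generic" words.
GENERIC_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}


def term_frequency(term, library):
    """Fraction of texts in the library (each a set of words) containing the term."""
    if not library:
        return 0.0
    return sum(1 for text in library if term in text) / len(library)


def selectivity(term, target_library, other_libraries, floor=1e-6):
    """Frequency in the target field relative to the other fields.

    Assumption: the other libraries are summarized by their highest
    frequency, and `floor` avoids division by zero.
    """
    target = term_frequency(term, target_library)
    baseline = max((term_frequency(term, lib) for lib in other_libraries),
                   default=0.0)
    return target / max(baseline, floor)


def document_vector(words, target_library, other_libraries):
    """Map each non-generic word of the document to a coefficient.

    Assumption: the coefficient is the selectivity value itself; the
    claim only requires some function of it.
    """
    vector = {}
    for word in words:
        if word in GENERIC_WORDS or word in vector:
            continue
        vector[word] = selectivity(word, target_library, other_libraries)
    return vector
```

In this sketch a term that is common in the target field and rare elsewhere receives a large coefficient, while a term absent from the target field receives zero.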
-
11. An automated system for representing a natural-language document in a vector form suitable for text manipulation operations, comprising
(1) a computer,
(2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain that word, associated library identifiers for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of said term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(3) a computer readable code which is operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for said each term.
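The database of word records in element (2) resembles an inverted index. The following Python sketch shows one possible in-memory layout, offered only as an illustration. The `WordRecord` class, the function names, and the input shape `(text_id, library_id, words)` are assumptions, as is the choice to compute selectivity as the ratio of per-library text fractions, compared against the highest fraction among the other libraries.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordRecord:
    """One record per word (hypothetical layout)."""
    word: str
    # text identifier -> library identifier of the text containing the word
    text_library: dict = field(default_factory=dict)
    # library identifier -> optional precomputed selectivity value
    selectivity: dict = field(default_factory=dict)


def build_word_records(texts):
    """Build records from an iterable of (text_id, library_id, words)."""
    records = {}
    for text_id, library_id, words in texts:
        for word in set(words):
            record = records.setdefault(word, WordRecord(word))
            record.text_library[text_id] = library_id
    return records


def fill_selectivity(records, library_sizes, floor=1e-6):
    """Store a selectivity value per library in every record.

    library_sizes maps library identifier -> number of texts in it.
    """
    for record in records.values():
        counts = Counter(record.text_library.values())
        for lib, size in library_sizes.items():
            own = counts.get(lib, 0) / size
            others = [counts.get(other, 0) / n
                      for other, n in library_sizes.items() if other != lib]
            record.selectivity[lib] = own / max(max(others, default=0.0), floor)
```

Storing the selectivity values in the record corresponds to the "optionally" clause of the claim: a lookup can then read the value directly instead of recomputing it from the text and library identifiers.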
-
15. Computer readable code for use with an electronic computer and a database of word records for representing a natural-language document in a vector form suitable for text manipulation operations, where each record in the word records database includes text identifiers of the library texts that contain that word, an associated library identifier for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of said term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of
(a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and
(b) representing the document as a vector of terms, where the coefficient assigned to each term is related to the selectivity value determined for said each term.
-
19. A vector representation of a natural-language document comprising
a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), where each term has an assigned coefficient which includes a function of the selectivity value of said each term, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of said each term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively.
-
26. A computer-executed method for generating a set of proximately arranged word pairs in a natural-language document, comprising
(a) generating a list of proximately arranged word pairs in the document,
(b) determining, for each word pair, a selectivity value calculated as the frequency of occurrence of said each word pair in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(c) retaining the word pair in the set if the determined selectivity value is above a selected threshold value.
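Steps (a) through (c) of claim 26 can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: "proximately arranged" is taken to mean two non-generic words at most `window` non-generic words apart, each text is a list of words, frequency is the fraction of texts in a library containing the pair, and the other libraries are summarized by their highest frequency. The names `GENERIC_WORDS`, `proximate_pairs`, and `select_pairs` are hypothetical.

```python
# Hypothetical stop list standing in for "generic" words.
GENERIC_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}


def proximate_pairs(words, window=1):
    """Ordered pairs of non-generic words at most `window` positions apart."""
    content = [w for w in words if w not in GENERIC_WORDS]
    pairs = []
    for i, first in enumerate(content):
        for second in content[i + 1:i + 1 + window]:
            if (first, second) not in pairs:
                pairs.append((first, second))
    return pairs


def select_pairs(words, target_library, other_libraries,
                 threshold, window=1, floor=1e-6):
    """Keep the document's word pairs whose selectivity exceeds the threshold."""
    def pair_sets(library):
        return [set(proximate_pairs(text, window)) for text in library]

    def frequency(pair, sets):
        return sum(1 for s in sets if pair in s) / len(sets) if sets else 0.0

    target_sets = pair_sets(target_library)
    other_sets = [pair_sets(lib) for lib in other_libraries]

    retained = []
    for pair in proximate_pairs(words, window):           # step (a)
        own = frequency(pair, target_sets)
        baseline = max((frequency(pair, s) for s in other_sets), default=0.0)
        value = own / max(baseline, floor)                # step (b)
        if value > threshold:                             # step (c)
            retained.append(pair)
    return retained
```

With a threshold above 1, only pairs that occur proportionally more often in the target field than in the other fields survive the filter.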
Specification