Text-representation code, system, and method

US 20040059565A1
Filed: 07/01/2003
Published: 03/25/2004
Est. Priority Date: 07/03/2002
Status: Active Grant

Abstract

A computer method for representing a natural-language document in a vector form suitable for text manipulation operations is disclosed. The method involves determining (a) for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term related to the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively. The document is represented as a vector of terms, where the coefficient assigned to each term includes a function of the selectivity value determined for that term, and optionally related to the inverse document frequency of that word in one or more libraries of texts. Also disclosed are a computer-readable code for carrying out the method, a computer system that employs the code, and a vector generated by the method.
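
The selectivity value described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not fix: "frequency of occurrence" is taken here as the fraction of texts in a library that contain the term, each text is reduced to a set of terms, and a hypothetical `floor` parameter guards against division by zero when the term is absent from the comparison library. The names `term_frequency`, `selectivity`, and the two toy libraries are illustrative only.

```python
from typing import List, Set

Library = List[Set[str]]  # assumption: each library text is reduced to its set of terms


def term_frequency(term: str, library: Library) -> float:
    """Fraction of texts in the library that contain the term."""
    return sum(term in text for text in library) / len(library)


def selectivity(term: str, target: Library, other: Library, floor: float = 1e-3) -> float:
    """Frequency in the target-field library relative to another field's library.

    `floor` is an illustrative guard, not taken from the source.
    """
    return term_frequency(term, target) / max(term_frequency(term, other), floor)


# Toy libraries for two fields (made-up data, for illustration only).
surgery = [{"suture", "needle", "tissue"}, {"suture", "clamp"},
           {"needle", "device"}, {"suture", "device"}]
software = [{"device", "driver"}, {"kernel", "driver"},
            {"device", "suture"}, {"thread", "kernel"}]

print(selectivity("suture", surgery, software))  # 0.75 / 0.25 = 3.0
print(selectivity("device", surgery, software))  # 0.50 / 0.50 = 1.0
```

A term such as "device", which is equally common in both fields, gets a selectivity near 1, while a field-specific term such as "suture" scores higher.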

26 Claims

1. A computer-executed method for representing a natural-language document in a vector form suitable for text manipulation operations, comprising (a) for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), determining a selectivity value calculated as the frequency of occurrence of that term in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and (b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for that term.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The method of claim 1, wherein the selectivity value associated with a term is the greatest selectivity value determined with respect to each of a plurality N≥2 of libraries of texts in different fields.
  - 3. The method of claim 1, wherein the selectivity value function is a root function.
  - 4. The method of claim 3, wherein the root function is between 2, the square root function, and 3, the cube root function.
  - 5. The method of claim 1, wherein only terms having a selectivity value above a predetermined threshold are included in the vector.
  - 6. The method of claim 1, wherein the terms include words in the document, and the coefficient assigned to each word in the vector is also related to the inverse document frequency of that word in one or more of said libraries of texts.
  - 7. The method of claim 6, wherein the coefficient assigned to each word in the vector is the product of a function of the selectivity value and the inverse document frequency of that word.
  - 8. The method of claim 1, wherein the terms include words in the document, and step (a) includes accessing a database of word records, where each record includes text identifiers of the library texts that contain that word, and associated library identifiers for each text.
  - 9. The method of claim 8, wherein step (a) includes (i) accessing the database to identify text and library identifiers for each non-generic word in the target text, and (ii) using the identified text and library identifiers to calculate one or more selectivity values for that word.
  - 10. The method of claim 9, wherein the terms include word groups in the document, and said database further includes, for each word record, word-position identifiers, and wherein step (a) as applied to word groups includes (i) accessing said database to identify texts and associated library and word-position identifiers associated with that word group, and (ii) from the identified texts, library identifiers, and word-position identifiers recorded in step (i), determining one or more selectivity values for that word group.
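
Claims 2 to 7 together describe how a term's coefficient can be derived from its selectivity values. The following is a sketch under stated assumptions, not the patent's implementation: the selectivity values and the inverse document frequency are supplied by the caller, the default root of 2.5 is simply one value between 2 and 3, and the default threshold of 1.0 is hypothetical. The function names `coefficient` and `document_vector` are illustrative.

```python
from typing import Dict, Optional


def coefficient(selectivity_by_library: Dict[str, float],
                idf: float = 1.0,
                root: float = 2.5,
                threshold: float = 1.0) -> Optional[float]:
    """Combine the dependent-claim options into one coefficient.

    - take the greatest selectivity value over the libraries (claim 2)
    - drop terms whose selectivity is not above the threshold (claim 5)
    - apply a root function with an exponent between 2 and 3 (claims 3 and 4)
    - multiply by the inverse document frequency (claims 6 and 7)
    """
    best = max(selectivity_by_library.values())
    if best <= threshold:
        return None  # term is left out of the vector
    return best ** (1.0 / root) * idf


def document_vector(terms: Dict[str, Dict[str, float]],
                    idf: Dict[str, float]) -> Dict[str, float]:
    """Map each retained term to its coefficient; idf defaults to 1.0 if missing."""
    vector = {}
    for term, values in terms.items():
        c = coefficient(values, idf=idf.get(term, 1.0))
        if c is not None:
            vector[term] = c
    return vector
```

The root function compresses very large selectivity values, so that a handful of extremely field-specific terms does not dominate the vector.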

11. An automated system for representing a natural-language document in a vector form suitable for text manipulation operations, comprising (1) a computer, (2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain that word, associated library identifiers for each text, and, optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, (3) a computer readable code which is operable, under the control of said computer, to perform the steps of (a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and (b) representing the document as a vector of terms, where the coefficient assigned to each term is a function of the selectivity value determined for that term.
- Dependent claims (12, 13, 14)
  - 12. The system of claim 11, wherein the terms include words in the document, and said computer-readable code is further operable to access the database to determine, for each of a plurality of non-generic words, an inverse document frequency for that word in one or more of said libraries of texts.
  - 13. The system of claim 11, wherein the terms include words in the document, and step (a) includes (i) accessing the database to identify text and library identifiers for each non-generic word in the target text, (ii) using the identified text and library identifiers to calculate one or more selectivity values for that word.
  - 14. The system of claim 11, wherein the terms include word groups in the document, and said database further includes, for each word record, word-position identifiers, and wherein step (a) as applied to word groups includes (i) accessing said database to identify texts and associated library and word-position identifiers associated with that word group, and (ii) from the identified texts, library identifiers, and word-position identifiers recorded in step (i), determining one or more selectivity values for that word group.
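
The database of word records recited in claims 11 to 14 can be pictured as an inverted index. The sketch below is one possible layout, assumed for illustration: each word maps to a list of `(library_id, text_id, word_position)` tuples, text identifiers are numbered per library, and a word group is treated as an ordered pair whose second word follows the first within a hypothetical `window` of positions. None of these names or choices come from the source.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# word -> [(library_id, text_id, word_position), ...]
WordRecords = Dict[str, List[Tuple[str, int, int]]]


def build_records(libraries: Dict[str, List[List[str]]]) -> WordRecords:
    """Index every word occurrence by library, text, and position."""
    records: WordRecords = defaultdict(list)
    for lib_id, texts in libraries.items():
        for text_id, words in enumerate(texts):
            for pos, word in enumerate(words):
                records[word].append((lib_id, text_id, pos))
    return records


def texts_with_pair(records: WordRecords, first: str, second: str,
                    window: int = 2) -> Set[Tuple[str, int]]:
    """Texts in which `second` follows `first` within `window` positions."""
    first_positions = defaultdict(list)
    for lib_id, text_id, pos in records.get(first, []):
        first_positions[(lib_id, text_id)].append(pos)
    hits = set()
    for lib_id, text_id, pos in records.get(second, []):
        starts = first_positions.get((lib_id, text_id), [])
        if any(0 < pos - start <= window for start in starts):
            hits.add((lib_id, text_id))
    return hits


# Toy data (made up for illustration).
libraries = {
    "surgery": [["surgical", "suture", "needle"],
                ["suture", "needle", "holder"],
                ["clamp", "device"]],
    "software": [["needle", "in", "haystack"],
                 ["suture", "the", "old", "needle"]],
}
records = build_records(libraries)
pair_hits = texts_with_pair(records, "suture", "needle")
```

Counting the hits per library identifier and dividing by the library size gives the per-field frequencies from which a selectivity value for the word group can be computed.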

15. Computer readable code for use with an electronic computer and a database of word records for representing a natural-language document in a vector form suitable for text manipulation operations, where each record in the word records database includes text identifiers of the library texts that contain that word, an associated library identifier for each text, and optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of (a) accessing said database to determine, for each of a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), a selectivity value of the term, and (b) representing the document as a vector of terms, where the coefficient assigned to each term is related to the selectivity value determined for that term.
- Dependent claims (16, 17, 18)
  - 16. The code of claim 15, wherein the terms include words in the document, and which is further operable to access the database to determine, for each of a plurality of non-generic words, an inverse document frequency for that word in one or more of said libraries of texts.
  - 17. The code of claim 15, wherein the terms include words in the document, and which is operable, under the control of the computer to perform step (a) by (i) accessing the database to identify text and library identifiers for each non-generic word in the target text, (ii) using the identified text and library identifiers to calculate one or more selectivity values for that word.
  - 18. The code of claim 15, wherein the terms include word groups in the document, and said database further includes, for each word record, word-position identifiers, and which code is operable, under the control of the computer, to perform step (a) as applied to word groups by (i) accessing said database to identify texts and associated library and word-position identifiers associated with that word group, and (ii) from the identified texts, library identifiers, and word-position identifiers recorded in step (i), determining one or more selectivity values for that word group.

19. A vector representation of a natural-language document comprising a plurality of terms selected from one of (i) non-generic words in the document, (ii) proximately arranged word groups in the document, and (iii) a combination of (i) and (ii), where each term has an assigned coefficient which includes a function of the selectivity value of that term, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of that term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively.
- Dependent claims (20, 21, 22, 23, 24, 25)
  - 20. The vector representation of claim 19, wherein the coefficient assigned to a term is related to the greatest selectivity value determined with respect to each of a plurality N≥2 of libraries of texts in different fields.
  - 21. The vector representation of claim 20, wherein the selectivity value function assigned to a term is a root function.
  - 22. The vector representation of claim 21, wherein the root function is between 2, the square root function, and 3, the cube root function.
  - 23. The vector representation of claim 20, wherein only terms having a selectivity value above a predetermined threshold are included in the vector.
  - 24. The vector representation of claim 20, wherein the terms include words in the document, and the coefficient assigned to each word in the vector is also related to the inverse document frequency of that word in one or more of said libraries of texts.
  - 25. The vector representation of claim 24, wherein the coefficient assigned to each word in the vector is the product of the inverse document frequency of that word in one or more of said libraries of texts and a function of the selectivity value of that word.

26. A computer-executed method for generating a set of proximately arranged word pairs in a natural-language document, comprising (a) generating a list of proximately arranged word pairs in the document, (b) determining, for each word pair, a selectivity value calculated as the frequency of occurrence of that word pair in a library of texts in one field, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and (c) retaining the word pair in the set if the determined selectivity value is above a selected threshold value.
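
The three steps of claim 26 can be sketched as follows. This is a hedged illustration rather than the claimed implementation: the stop list `GENERIC`, the `window` size used to decide what counts as "proximately arranged", and the caller-supplied `pair_selectivity` function are all assumptions introduced here.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

# Illustrative stop list of generic words (an assumption, not from the source).
GENERIC = {"a", "an", "the", "of", "and", "in", "for", "to"}


def word_pairs(words: List[str], window: int = 2) -> List[Pair]:
    """Step (a): list ordered pairs of non-generic words at most `window` apart."""
    content = [w for w in words if w not in GENERIC]
    pairs: List[Pair] = []
    for i, first in enumerate(content):
        for second in content[i + 1: i + 1 + window]:
            if (first, second) not in pairs:  # keep each pair once, in order
                pairs.append((first, second))
    return pairs


def selective_pairs(words: List[str],
                    pair_selectivity: Callable[[Pair], float],
                    threshold: float,
                    window: int = 2) -> List[Pair]:
    """Steps (b) and (c): keep pairs whose selectivity is above the threshold."""
    return [pair for pair in word_pairs(words, window)
            if pair_selectivity(pair) > threshold]


# Usage with made-up selectivity values standing in for a library lookup.
scores = {("suture", "needle"): 5.0, ("needle", "holder"): 1.2}
kept = selective_pairs("the suture of the needle holder".split(),
                       lambda pair: scores.get(pair, 0.0),
                       threshold=2.0, window=1)
```

Passing the selectivity computation in as a function keeps the pair generator independent of how the libraries are stored.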

Current Assignee: Word Data Corporation
Original Assignee: Word Data Corporation
Inventors: Chin, Shao; Dehlinger, Peter J.

Granted Patent: US 7,386,442 B2
Field of Search
US Class Current

704/5
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 40/216   using statistical methods

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access
