Method, system or memory storing a computer program for document processing

US 20050240394A1
Filed: 04/12/2005
Published: 10/27/2005
Est. Priority Date: 04/22/2004
Status: Active Grant

Abstract

Terms (e.g., words) used in an expert domain that correspond to terms in a naive domain are detected when there are no vocabulary pairs or document pairs available for the expert and naive domains. Documents known to be descriptions of identical topics and written in the expert and naive domains are collected by searching the Internet. The frequencies of terms that occur in these documents are counted. The counts are used to calculate correspondences between the vocabularies of the expert and naive language expressions.

27 Claims

1. A method of retrieving documents having a common topic and of classifying the documents into a first document set having a first set of feature values and a second document set having a second set of feature values, the method comprising:
- retrieving a related third document set on the basis of a predetermined term list;
  
  constructing a third set of feature values by calculating feature values for each document in the third document set; and
  
  classifying documents in the third document set into the first document set and the second document set according to:
  
  (a) a discriminant using the first set of feature values and the third set of feature values, and (b) a discriminant using the second set of feature values and the third set of feature values.
- View Dependent Claims (2, 3, 4, 10, 11, 12, 13, 19, 20, 21, 22)
- - 2. The method of claim 1, further including selecting an arbitrary set of items from the following items as the feature value set:
    - the number of content words, the ratio of naive words, the ratio of proper nouns, the ratio of additional proper nouns, the ratio of particles/auxiliary words, a Spearman's Correlation Coefficient/Significance calculated from the frequencies of n-gram patterns concerning content words and particles/auxiliary words.
  - 3. The method of claim 2, wherein the retrieval of the third document set further comprises removing documents that belong to at least one of:
    - garbage-type documents, list-type documents, and diary-type documents.
  - 4. The method of claim 1, wherein the retrieval of the third document set further comprises removing documents that belong to at least one of:
    - garbage-type documents, list-type documents, and diary-type documents.
  - 10. A document retrieval and classifying system for performing the method of claim 1.
  - 11. A document retrieval and classifying system for performing the method of claim 2.
  - 12. A document retrieval and classifying system for performing the method of claim 3.
  - 13. A document retrieval and classifying system for performing the method of claim 4.
  - 19. A memory or computer readable storage medium for causing a computer to perform the method of claim 1.
  - 20. A memory or computer readable storage medium for causing a computer to perform the method of claim 2.
  - 21. A memory or computer readable storage medium for causing a computer to perform the method of claim 3.
  - 22. A memory or computer readable storage medium for causing a computer to perform the method of claim 4.
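
The classification step of claim 1 can be illustrated with a minimal sketch. The claim does not say which discriminant is used, so the sketch assumes Euclidean distance to each set's centroid; the function names `centroid` and `classify`, and the two-component feature vectors, are hypothetical and chosen only for illustration.

```python
from math import dist
from statistics import fmean

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [fmean(column) for column in zip(*vectors)]

def classify(third_set, first_features, second_features):
    """Label each document vector in third_set as 'first' or 'second'.

    Assumption of this sketch: the discriminant is the Euclidean distance
    to each set's centroid. The claim itself does not specify one.
    """
    c1 = centroid(first_features)
    c2 = centroid(second_features)
    labels = []
    for vec in third_set:
        d1 = dist(vec, c1)  # discriminant (a): first set vs. this document
        d2 = dist(vec, c2)  # discriminant (b): second set vs. this document
        labels.append("first" if d1 <= d2 else "second")
    return labels

# Hypothetical feature vectors: [ratio of naive words, ratio of proper nouns]
expert = [[0.10, 0.40], [0.15, 0.35]]
naive = [[0.60, 0.05], [0.70, 0.10]]
retrieved = [[0.12, 0.38], [0.65, 0.08]]
labels = classify(retrieved, expert, naive)
```

Any discriminant that compares a retrieved document's feature values against each reference set could be substituted for the distance computation without changing the overall flow.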

5. A method of detecting (from a first document set having a first set of feature values and a second document set having a second set of feature values) that the first and second document sets have at least one of (a) a common topic, (b) terms in the second document set that correspond to specific terms in the first document set, or (c) terms in the first document set that correspond to specific terms in the second document set, the method comprising:
- retrieving a related third document set on the basis of a predetermined term list;
  
  constructing a third set of feature values by calculating feature values for each document in the third document set;
  
  classifying documents in the third document set into the first document set or the second document set according to a discriminant using the first set of feature values and the third set of feature values, and a discriminant using the second set of feature values and the third set of feature values;
  
  calculating the frequency of each term listed in a first term list compiled from documents that were classified into the first document set, and the frequency of each term listed in a second term list compiled from documents that were classified into the second document set;
  
  detecting terms in the second document set that correspond to specific terms in the first document set on the basis of the frequencies of the terms listed in the first and second term lists; and
  
  detecting terms in the first document set that correspond to specific terms in the second document set on the basis of the first and second term frequencies.
- View Dependent Claims (14, 23)
- - 14. A document processing system for performing the method of claim 5.
  - 23. A memory or computer readable storage medium for causing a computer to perform the method of claim 5.

6. A method of detecting (from a first document set and a second document set having a common topic) (a) terms in the second document set that correspond to specific terms in the first document set, or (b) terms in the first document set that correspond to specific terms in the second document set, comprising:
- calculating the frequency of each term listed in the first term list compiled from the first document set, and the frequency of each term listed in the second term list compiled from the second document set;
  
  detecting terms in the second document set that correspond to specific terms in the first document set on the basis of the frequencies of the terms listed in the first and second term lists; and
  
  detecting terms in the first document set that correspond to specific terms in the second document set on the basis of the frequencies of the terms listed in the first and second term lists.
- View Dependent Claims (15, 24)
- - 15. A document processing system for performing the method of claim 6.
  - 24. A memory or computer readable storage medium for causing a computer to perform the method of claim 6.
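
The frequency-calculation step shared by claims 5 and 6 might look like the following sketch. Whitespace tokenization, lowercasing, the function name `term_frequencies`, and the sample documents are all assumptions made for illustration; the claims do not prescribe how documents are tokenized.

```python
from collections import Counter

def term_frequencies(documents, term_list):
    """Return one Counter per document, restricted to terms in term_list.

    Tokenization by lowercasing and splitting on whitespace is an
    assumption of this sketch.
    """
    wanted = set(term_list)
    return [
        Counter(tok for tok in doc.lower().split() if tok in wanted)
        for doc in documents
    ]

# Hypothetical first (expert) document set and its term list
first_docs = ["Tannin gives the wine astringency", "high tannin and acidity"]
first_terms = ["tannin", "astringency", "acidity"]
freqs = term_frequencies(first_docs, first_terms)
```

Running the same function over the second document set with its own term list yields the second set of frequencies that the detection steps compare against.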

7. A method of detecting (from a first document set and a second document set having a common topic, the document sets having been retrieved on the basis of a term list) (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in the first document set that correspond to specific terms in the second document set, comprising:
- calculating the probability P(A) of the co-occurrence of a specific term pair, which includes a term from the first document set and a term from the second document set;
  
  calculating the probability P(B) of the lack of co-occurrences of the first term of a term pair in question occurring in the first document set and the second term of said term pair not occurring in the second document set;
  
  calculating a maximum likelihood ratio on the basis of P(A) and P(B);
  
  extracting all term pair combinations having a maximum likelihood ratio that exceeds a predetermined threshold value;
  
  selecting a predetermined number of terms in a descending order of the values of maximum likelihood ratios from the terms in the first document set that correspond to a specific term in the second document set, and adopting the selected terms as the candidate terms of the first document set that correspond to specific terms in the second document set; and
  
  selecting a predetermined number of terms in a descending order of maximum likelihood ratios from the terms in the second document set that correspond to a specific term in the first document set, and adopting the selected terms as the candidate terms of the second document set that correspond to the specific terms in the first document set.
- View Dependent Claims (16, 25)
- - 16. A document processing system for performing the method of claim 7.
  - 25. A memory or computer readable storage medium for causing a computer to perform the method of claim 7.
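
One possible reading of the steps in claim 7 is sketched below. The claim does not give a formula for the "maximum likelihood ratio", so the sketch assumes the plain ratio P(A)/P(B) with additive smoothing as a stand-in. It also assumes the documents are grouped by topic, with each topic represented as a pair of term sets. The names `likelihood_ratios`, `top_candidates`, `smoothing`, and the sample data are hypothetical.

```python
from itertools import product

def likelihood_ratios(topics, smoothing=0.5):
    """Score every (first-set term, second-set term) pair.

    topics: list of (first_terms, second_terms) pairs of sets, one per topic.
    P(A): share of topics in which both terms occur.
    P(B): share of topics in which the first term occurs but the second
          does not.
    The score P(A) / P(B) with additive smoothing is an assumed stand-in
    for the claim's "maximum likelihood ratio".
    """
    total = len(topics)
    first_vocab = set().union(*(f for f, _ in topics))
    second_vocab = set().union(*(s for _, s in topics))
    scores = {}
    for t1, t2 in product(first_vocab, second_vocab):
        both = sum(1 for f, s in topics if t1 in f and t2 in s)
        only_first = sum(1 for f, s in topics if t1 in f and t2 not in s)
        p_a = (both + smoothing) / (total + 2 * smoothing)
        p_b = (only_first + smoothing) / (total + 2 * smoothing)
        scores[(t1, t2)] = p_a / p_b
    return scores

def top_candidates(scores, term, k, threshold, side=0):
    """Top-k partner terms for `term`, in descending order of score.

    side=0: `term` comes from the first set; side=1: from the second set.
    Only pairs whose score exceeds `threshold` are considered.
    """
    hits = [
        (pair[1 - side], score)
        for pair, score in scores.items()
        if pair[side] == term and score > threshold
    ]
    hits.sort(key=lambda item: (-item[1], item[0]))
    return [partner for partner, _ in hits[:k]]

# Hypothetical topics: (terms seen in expert docs, terms seen in naive docs)
topics = [
    ({"tannin", "acidity"}, {"bitter", "sour"}),
    ({"tannin"}, {"bitter"}),
    ({"acidity"}, {"sour"}),
]
scores = likelihood_ratios(topics)
naive_for_tannin = top_candidates(scores, "tannin", k=1, threshold=1.0)
expert_for_sour = top_candidates(scores, "sour", k=1, threshold=1.0, side=1)
```

The threshold filter corresponds to the extraction step, and the two calls to `top_candidates` correspond to the two selection steps, one in each direction.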

8. A method of detecting (from a first document set and a second document set having a common topic) (a) terms in the second document set that correspond to specific terms in the first document set, and/or (b) terms in the first document set that correspond to specific terms in the second document set, the first and second document sets having been retrieved on the basis of a term list, the method comprising:
- creating a first term matrix from the first document set on the basis of the frequency of each term listed in a first term list;
  
  creating a second term matrix from the second document set on the basis of the frequency of each term listed in a second term list;
  
  calculating a lexical mapping matrix from a product of the first term matrix and the second term matrix;
  
  selecting a predetermined number of terms in a specific row in the lexical mapping matrix in a descending order of the values of elements to adopt the selected terms as terms in the first document set that correspond to the specific terms in the second document set; and
  
  selecting a predetermined number of terms in a specific column in the lexical mapping matrix in a descending order of the values of elements to adopt the selected terms as terms in the second document set that correspond to the specific terms in the first document set.
- View Dependent Claims (9, 17, 18, 26, 27)
- - 9. The method according to claim 8, wherein:
    - (a) the number of terms in the term list is s, (b) the number of terms selected from the first document set is n, (c) the first term matrix is represented by an s-by-n matrix P, (d) the frequency of the i-th term in the k-th document of the first document set is Exp(k,i), (e) the overall frequency of the i-th term is Etf(i), and (f) the total number of terms in the k-th document is Ewf(k), elements of the matrix P are given by
      
      $\begin{matrix} We(k, i) = \frac{Exp(k, i)}{Etf(i) * Ewf(k)} & [Equation 1] \end{matrix}$
      
      (g) the number of terms selected from the second document set is m, (h) the second term matrix is represented by an s-by-m matrix Q, and (i) the frequency of the r-th term appearing in the k-th document of the second document set is Naive(k,r), (j) the overall frequency of the r-th term is Ntf(r), and the total number of terms in the k-th document is Nwf(k), elements of the matrix Q are given by
      
      $\begin{matrix} Wn(k, r) = \frac{Naive(k, r)}{Ntf(r) * Nwf(k)} & [Equation 2] \end{matrix}$
  - 17. A document processing system for performing the method of claim 8.
  - 18. A document processing system for performing the method of claim 9.
  - 26. A memory or computer readable storage medium for causing a computer to perform the method of claim 8.
  - 27. A memory or computer readable storage medium for causing a computer to perform the method of claim 9.
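
The weighting of claim 9 and the matrix product of claim 8 can be sketched as follows. Claim 8 says only "a product" of the two term matrices; since P is s-by-n and Q is s-by-m, the sketch assumes the product is P transposed times Q, giving an n-by-m matrix. It also approximates the total number of terms in a document by the sum over the selected terms. The function names and sample counts are hypothetical.

```python
def weight_matrix(counts):
    """Apply the weighting of Equations 1 and 2 to a count matrix.

    counts[k][i] is the frequency of term i in document k. Each element
    becomes counts[k][i] / (tf[i] * wf[k]), where tf[i] is the overall
    frequency of term i and wf[k] is the term total of document k.
    Assumption: wf[k] is taken as the row sum over the selected terms.
    """
    tf = [sum(column) for column in zip(*counts)]
    wf = [sum(row) for row in counts]
    return [
        [c / (tf[i] * wf[k]) if tf[i] and wf[k] else 0.0
         for i, c in enumerate(row)]
        for k, row in enumerate(counts)
    ]

def lexical_mapping(P, Q):
    """Return T = P^T Q (assumed reading of "a product").

    T[i][r] scores term i of the first set against term r of the second.
    """
    return [
        [sum(P[k][i] * Q[k][r] for k in range(len(P)))
         for r in range(len(Q[0]))]
        for i in range(len(P[0]))
    ]

def top_terms(scores, terms, k):
    """Pick k terms in descending order of their scores."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return [terms[j] for j in order[:k]]

# Hypothetical term lists and per-document counts (s = 2 documents)
expert_terms = ["tannin", "acidity"]
naive_terms = ["bitter", "sour"]
P = weight_matrix([[3, 1], [0, 2]])   # s-by-n
Q = weight_matrix([[4, 1], [0, 3]])   # s-by-m
T = lexical_mapping(P, Q)             # n-by-m

naive_for_tannin = top_terms(T[0], naive_terms, 1)                    # row 0
expert_for_sour = top_terms([row[1] for row in T], expert_terms, 1)   # column 1
```

Reading along a row ranks second-set terms for one first-set term, and reading down a column ranks first-set terms for one second-set term, which mirrors the two selection steps of claim 8.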

Current Assignee
Uber Technologies, Inc.
Original Assignee
Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors
Oda, Hiromi

Granted Patent

US 7,565,361 B2
Field of Search
US Class Current

704/9
CPC Class Codes

G06F 16/353   into predefined classes

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99945   Object-oriented database st...
