Method and system for lexical mapping between document sets having a common topic

US 7,565,361 B2
Filed: 04/12/2005
Issued: 07/21/2009
Est. Priority Date: 04/22/2004
Status: Expired due to Fees



Abstract

Terms (e.g., words) used in an expert domain that correspond to terms in a naive domain are detected when there are no vocabulary pairs or document pairs available for the expert and naive domains. Documents known to be descriptions of identical topics and written in the expert and naive domains are collected by searching the Internet. The frequencies of terms that occur in these documents are counted. The counts are used to calculate correspondences between the vocabularies of the expert and naive language expressions.
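The counting step described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the helper name `count_terms`, the lowercase whitespace tokenizer, and the sample documents are all hypothetical and do not come from the patent.

```python
from collections import Counter


def count_terms(documents, term_list):
    """Return one row per document holding the count of each listed term.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, which the source does not specify.
    """
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        # Counter returns 0 for terms that never occur in the document.
        rows.append([counts[term] for term in term_list])
    return rows


# Made-up sample data, for illustration only.
expert_docs = [
    "myocardial infarction causes myocardial damage",
    "infarction risk rises with hypertension",
]
expert_terms = ["myocardial", "infarction", "hypertension"]

expert_counts = count_terms(expert_docs, expert_terms)
print(expert_counts)
```

Each row of the result corresponds to one document and each column to one term in the list, which is the layout the claims use for Exp(k,i) and Naive(k,r).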


10 Claims

1. A computer readable storage medium comprising instructions recorded thereon for causing a computer to perform a method of detecting, from a first document set and a second document set having a common topic, at least one of (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in the first document set that correspond to specific terms in the second document set, the first and second document sets having been retrieved on the basis of a term list, said instructions comprising:
- creating a first term matrix from the first document set on the basis of the frequency of each term listed in a first term list;
  
  creating a second term matrix from the second document set on the basis of the frequency of each term listed in a second term list;
  
  calculating a lexical mapping matrix from a product of the first term matrix and the second term matrix;
  
  selecting a predetermined number of terms in a specific row in the lexical mapping matrix in a descending order of values of elements to adopt the selected terms in the specific row as terms in the first document set that correspond to the specific terms in the second document set; and
  
  selecting a predetermined number of terms in a specific column in the lexical mapping matrix in the descending order of elements to adopt the selected terms in the specific column as terms in the second document set that correspond to the specific terms in the first document set;
  
  wherein:

  (a) the number of terms in the term list is s, (b) the number of terms selected from the first document set is n, (c) the first term matrix is represented by an s-by-n matrix P, (d) the frequency of the i-th term in the k-th document of the first document set is Exp(k,i), (e) the overall frequency of the i-th term is Etf(i), and (f) the total number of terms in the k-th document is Ewf(k), each of the elements (We(k,i)) of the matrix P is given by:

  $$\mathrm{We}(k, i) = \frac{\mathrm{Exp}(k, i)}{\mathrm{Etf}(i) \times \mathrm{Ewf}(k)} \qquad \text{[Equation 1]}$$

  (g) the number of terms selected from the second document set is m, (h) the second term matrix is represented by an s-by-m matrix Q, and (i) the frequency of the r-th term appearing in the k-th document of the second document set is Naive(k,r), (j) the overall frequency of the r-th term is Ntf(r), and (k) the total number of terms in the k-th document is Nwf(k), each of the elements (Wn(k,r)) of the matrix Q is given by:

  $$\mathrm{Wn}(k, r) = \frac{\mathrm{Naive}(k, r)}{\mathrm{Ntf}(r) \times \mathrm{Nwf}(k)} \qquad \text{[Equation 2]}$$

2. A computerized document processing system comprising a processor for performing the instructions of claim 1.

3. The computer readable storage medium according to claim 1, further comprising instructions for:
- calculating weight values indicating strengths of combinations each term to obtain an m-by-n lexical mapping matrix T, which is defined as:

  $T = Q^{t}P$,

4. A computerized document processing system comprising a processor for performing the instructions of claim 3.
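
Equations 1 and 2 and the product T = Q^tP recited in claims 1 and 3 can be illustrated with a short pure-Python sketch. It rests on assumptions the claims do not state: the overall frequency of a term is taken as its column sum, the total number of terms in a document is taken as the row sum over the listed terms only, and a zero denominator yields a weight of 0.0. The function names and the sample counts are hypothetical.

```python
def weight_matrix(counts):
    """Apply the form of Equations 1 and 2 to a table of raw counts.

    counts[k][i] is the frequency of term i in document k.
    Assumptions (not from the claims): tf(i) is the column sum, wf(k) is
    the row sum over listed terms, and a zero denominator gives 0.0.
    """
    n_terms = len(counts[0])
    tf = [sum(row[i] for row in counts) for i in range(n_terms)]
    wf = [sum(row) for row in counts]
    return [
        [row[i] / (tf[i] * wf[k]) if tf[i] and wf[k] else 0.0
         for i in range(n_terms)]
        for k, row in enumerate(counts)
    ]


def mapping_matrix(P, Q):
    """Return T = Q^t P (m-by-n) for an s-by-n P and an s-by-m Q."""
    s, n, m = len(P), len(P[0]), len(Q[0])
    return [
        [sum(Q[k][r] * P[k][i] for k in range(s)) for i in range(n)]
        for r in range(m)
    ]


# Made-up counts: s = 2 rows, n = 3 first-set terms, m = 2 second-set terms.
exp_counts = [[2, 1, 0], [0, 1, 1]]
naive_counts = [[1, 1], [0, 2]]

P = weight_matrix(exp_counts)
Q = weight_matrix(naive_counts)
T = mapping_matrix(P, Q)
print(len(T), len(T[0]))
```

Because P and Q share the row index k from 1 to s, the product is defined, and T has one row per second-set term and one column per first-set term.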

5. A method of detecting, from a first document set and a second document set having a common topic, at least one of (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in the first document set that correspond to specific terms in the second document set, the first and second document sets having been retrieved on the basis of a term list, said method comprising steps executed by a processor, the steps comprising:
- creating a first term matrix from the first document set on the basis of the frequency of each term listed in a first term list;
  
  creating a second term matrix from the second document set on the basis of the frequency of each term listed in a second term list;
  
  calculating a lexical mapping matrix from a product of the first term matrix and the second term matrix;
  
  selecting a predetermined number of terms in a specific row in the lexical mapping matrix in a descending order of values of elements to adopt the selected terms in the specific row as terms in the first document set that correspond to the specific terms in the second document set;
  
  selecting a predetermined number of terms in a specific column in the lexical mapping matrix in the descending order of elements to adopt the selected terms in the specific column as terms in the second document set that correspond to the specific terms in the first document set; and
  
  outputting the selected predetermined number of terms in the specific row and the selected predetermined number of terms in the specific column in the lexical mapping matrix, wherein:

  (a) the number of terms in the term list is s, (b) the number of terms selected from the first document set is n, (c) the first term matrix is represented by an s-by-n matrix P, (d) the frequency of the i-th term in the k-th document of the first document set is Exp(k,i), (e) the overall frequency of the i-th term is Etf(i), and (f) the total number of terms in the k-th document is Ewf(k), each of the elements (We(k,i)) of the matrix P is given by:

  $$\mathrm{We}(k, i) = \frac{\mathrm{Exp}(k, i)}{\mathrm{Etf}(i) \times \mathrm{Ewf}(k)} \qquad \text{[Equation 1]}$$

  (g) the number of terms selected from the second document set is m, (h) the second term matrix is represented by an s-by-m matrix Q, and (i) the frequency of the r-th term appearing in the k-th document of the second document set is Naive(k,r), (j) the overall frequency of the r-th term is Ntf(r), and (k) the total number of terms in the k-th document is Nwf(k), each of the elements (Wn(k,r)) of the matrix Q is given by:

  $$\mathrm{Wn}(k, r) = \frac{\mathrm{Naive}(k, r)}{\mathrm{Ntf}(r) \times \mathrm{Nwf}(k)} \qquad \text{[Equation 2]}$$

6. The method according to claim 5, said steps further comprising:
- calculating weight values indicating strengths of combinations each term to obtain an m-by-n lexical mapping matrix T, which is defined as:

  $T = Q^{t}P$

7. A computerized document processing system comprising a processor for performing the method of claim 5.

8. A computerized document processing system comprising a processor for performing the method of claim 6.

9. A computer readable storage medium comprising a set of instructions stored thereon for causing a computer to perform the method of claim 5.

10. A computer readable storage medium comprising a set of instructions stored thereon for causing a computer to perform the method of claim 6.
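
The row and column selection steps of claims 1 and 5 can be sketched as below. The sketch assumes the m-by-n layout of claims 3 and 6, in which rows of the mapping matrix index second-set terms and columns index first-set terms. The helper names, the sample matrix, and the sample vocabularies are hypothetical, and ties are left in their original order, which the claims do not address.

```python
def top_terms(scores, terms, count):
    """Return `count` terms ordered by descending score."""
    ranked = sorted(zip(scores, terms), key=lambda pair: pair[0], reverse=True)
    return [term for _, term in ranked[:count]]


def map_row(T, r, first_terms, count):
    """Row r of T: first-set terms that correspond to second-set term r."""
    return top_terms(T[r], first_terms, count)


def map_column(T, i, second_terms, count):
    """Column i of T: second-set terms that correspond to first-set term i."""
    return top_terms([row[i] for row in T], second_terms, count)


# Made-up 2-by-3 mapping matrix and vocabularies, for illustration only.
T = [
    [0.17, 0.08, 0.00],
    [0.06, 0.11, 0.17],
]
first_terms = ["myocardial", "infarction", "hypertension"]
second_terms = ["heart", "pressure"]

row_result = map_row(T, 1, first_terms, 2)
col_result = map_column(T, 0, second_terms, 1)
print(row_result, col_result)
```

Here `count` plays the role of the "predetermined number of terms" in the claims; printing the two lists stands in for the outputting step of claim 5.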


Current Assignee: Uber Technologies, Inc.
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Oda, Hiromi
Primary Examiner(s): Mofiz; Apu M
Assistant Examiner(s): LU, CHARLES EDWARD
Application Number: US11/103,567
Publication Number: US 20050240394A1
Time in Patent Office: 1,561 Days
Field of Search: 704/2, 704/9, 707/3, 707/5, 707/100, 707/104.1
US Class Current: 1/1
CPC Class Codes:
G06F 16/353   into predefined classes
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99945   Object-oriented database st...
