Method and system for lexical mapping between document sets having a common topic

US 8,065,306 B2
Filed: 05/26/2009
Issued: 11/22/2011
Est. Priority Date: 04/22/2004
Status: Active Grant

Abstract

Terms (e.g., words) used in an expert domain that correspond to terms in a naive domain are detected when there are no vocabulary pairs or document pairs available for the expert and naive domains. Documents known to be descriptions of identical topics and written in the expert and naive domains are collected by searching the Internet. The frequencies of terms that occur in these documents are counted. The counts are used to calculate correspondences between the vocabularies of the expert and naive language expressions.
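
The counting step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `count_cooccurrences`, the input shape (a dict mapping a topic key to a list of tokens for each domain), and the choice to count a term once per topic rather than by raw frequency are all hypothetical.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(expert_docs, naive_docs):
    """Count per-topic term occurrences and cross-domain co-occurrences.

    expert_docs / naive_docs: dict mapping a topic key to a list of tokens
    (assumed input shape; the source does not prescribe one).
    """
    pair_counts = Counter()    # (expert_term, naive_term) -> number of shared topics
    expert_counts = Counter()  # expert_term -> number of topics containing it
    naive_counts = Counter()   # naive_term -> number of topics containing it

    # Only topics described in both domains can yield co-occurrences.
    topics = sorted(set(expert_docs) & set(naive_docs))
    for topic in topics:
        e_terms = set(expert_docs[topic])
        n_terms = set(naive_docs[topic])
        expert_counts.update(e_terms)
        naive_counts.update(n_terms)
        pair_counts.update(product(e_terms, n_terms))

    return pair_counts, expert_counts, naive_counts, len(topics)


if __name__ == "__main__":
    expert = {"wine1": ["tannin", "astringent"], "wine2": ["tannin"]}
    naive = {"wine1": ["bitter"], "wine2": ["bitter", "dry"]}
    pairs, e_cnt, n_cnt, n_topics = count_cooccurrences(expert, naive)
    print(pairs[("tannin", "bitter")], e_cnt["tannin"], n_cnt["dry"], n_topics)
```

Counting at the topic level keeps the later contingency table simple; a variant that weights by raw term frequency would be an equally plausible reading of the abstract.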

3 Claims

1. A computer-implemented method of detecting, from a first document set and a second document set having a common topic, the document sets having been retrieved on the basis of a term list (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in first document set that correspond to specific terms in the second document set, comprising:
- calculating the probability P(A) of the co-occurrence of a specific term pair, which includes a term from the first document set and a term from the second document set;
- calculating the probability P(B) of the lack of co-occurrences of the first term of a term pair in question occurring in the first document set and the second term of said term pair not occurring in the second document set;
- calculating a maximum likelihood ratio on the basis of P(A) and P(B);
- extracting all term pair combinations having a maximum likelihood ratio that exceeds a predetermined threshold value;
- selecting, using a processor, a predetermined number of terms in a descending order of values of maximum likelihood ratios from the terms in first document set that correspond to a specific term in the second document set, and adopting the selected terms as the candidate terms of the first document set that correspond to specific terms in the second document set; and
- selecting, using the processor, a predetermined number of terms in a descending order of maximum likelihood ratios from the terms in the second document set that correspond to a specific term in the first document set, and adopting the selected terms as the candidate terms of the second document set that correspond to the specific terms in the first document set.

2. A document processing system comprising:
- a memory comprising instructions for detecting, from a first document set and a second document set having a common topic, the document sets having been retrieved on the basis of a term list (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in first document set that correspond to specific terms in the second document set, said memory comprising instructions to;
  - calculate the probability P(A) of the co-occurrence of a specific term pair, which includes a term from the first document set and a term from the second document set;
  - calculate the probability P(B) of the lack of co-occurrences of the first term of a term pair in question occurring in the first document set and the second term of said term pair not occurring in the second document set;
  - calculate a maximum likelihood ratio on the basis of P(A) and P(B);
  - extract all term pair combinations having a maximum likelihood ratio that exceeds a predetermined threshold value;
  - select a predetermined number of terms in a descending order of values of maximum likelihood ratios from the terms in first document set that correspond to a specific term in the second document set, and adopt the selected terms as the candidate terms of the first document set that correspond to specific terms in the second document set; and
  - select a predetermined number of terms in a descending order of maximum likelihood ratios from the terms in the second document set that correspond to a specific term in the first document set, and adopt the selected terms as the candidate terms of the second document set that correspond to the specific terms in the first document set; and
- a processor for executing the instructions.

3. A non-transitory computer readable storage medium on which is stored a computer program for implementing a method of detecting, from a first document set and a second document set having a common topic, the document sets having been retrieved on the basis of a term list (a) terms in the second document set that correspond to specific terms in the first document set, and (b) terms in first document set that correspond to specific terms in the second document set, said computer program comprising a set of instructions to:
- calculate the probability P(A) of the co-occurrence of a specific term pair, which includes a term from the first document set and a term from the second document set;
- calculate the probability P(B) of the lack of co-occurrences of the first term of a term pair in question occurring in the first document set and the second term of said term pair not occurring in the second document set;
- calculate a maximum likelihood ratio on the basis of P(A) and P(B);
- extract all term pair combinations having a maximum likelihood ratio that exceeds a predetermined threshold value;
- select a predetermined number of terms in a descending order of values of maximum likelihood ratios from the terms in first document set that correspond to a specific term in the second document set, and adopt the selected terms as the candidate terms of the first document set that correspond to specific terms in the second document set; and
- select a predetermined number of terms in a descending order of maximum likelihood ratios from the terms in the second document set that correspond to a specific term in the first document set, and adopt the selected terms as the candidate terms of the second document set that correspond to the specific terms in the first document set.
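
The scoring and selection steps recited in the claims can be sketched as below. The claims do not give the formula for the "maximum likelihood ratio", so this sketch substitutes Dunning's log-likelihood ratio (the G² statistic over a 2x2 contingency table), which is an assumption and not necessarily the patent's formula. In that table, `k11 / n` plays the role of P(A) (both terms occur for a topic) and `k12 / n` the role of P(B) (the first term occurs, the second does not). The function names, the default threshold of 3.84, and `top_n=5` are hypothetical illustration values.

```python
import math
from collections import defaultdict


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 table (assumed stand-in for the claimed ratio)."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    # Empty cells contribute nothing (limit of k * log(k) as k -> 0).
    return 2.0 * sum(k * math.log(k * n / (r * c)) for k, r, c in cells if k > 0)


def select_candidates(pair_counts, first_counts, second_counts, n_topics,
                      threshold=3.84, top_n=5):
    """Score term pairs, keep those above threshold, pick top_n in each direction."""
    scored = []
    for (t1, t2), k11 in pair_counts.items():
        k12 = first_counts[t1] - k11    # first term present, second absent
        k21 = second_counts[t2] - k11   # second term present, first absent
        k22 = n_topics - k11 - k12 - k21
        score = log_likelihood_ratio(k11, k12, k21, k22)
        if score > threshold:
            scored.append((score, t1, t2))

    # Descending score; ties broken alphabetically for reproducibility.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))

    first_to_second = defaultdict(list)
    second_to_first = defaultdict(list)
    for score, t1, t2 in scored:
        if len(first_to_second[t1]) < top_n:
            first_to_second[t1].append((t2, score))
        if len(second_to_first[t2]) < top_n:
            second_to_first[t2].append((t1, score))
    return dict(first_to_second), dict(second_to_first)


if __name__ == "__main__":
    pairs = {("tannin", "bitter"): 5, ("tannin", "wine"): 5}
    to_second, to_first = select_candidates(
        pairs, {"tannin": 5}, {"bitter": 5, "wine": 10}, n_topics=10)
    print(to_second)
    print(to_first)
```

In this toy input, "wine" appears for every topic and so carries no information about "tannin"; its score is zero and it falls below the threshold. Note that G² is symmetric and does not by itself distinguish positive from negative association, so a practical variant might also require that the observed co-occurrence exceed the expected one.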

Current Assignee: Uber Technologies, Inc.
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Oda, Hiromi
Primary Examiner(s): Lu, Charles Edward
Application Number: US12/472,203
Publication Number: US 20090292697A1
Time in Patent Office: 910 Days
Field of Search: 707/737, 707/748
US Class Current: 707/737
CPC Class Codes:
- G06F 16/353   into predefined classes
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99935   Query augmenting and refini...
- Y10S 707/99945   Object-oriented database st...
