Probabilistic information retrieval based on differential latent semantic space

US 6,654,740 B2
Filed: 05/08/2001
Issued: 11/25/2003
Est. Priority Date: 05/08/2001
Status: Expired due to Term

Abstract

A computer-based information search and retrieval system and method for retrieving textual digital objects that makes full use of the projections of the documents onto both the reduced document space characterized by the singular value decomposition-based latent semantic structure and its orthogonal space. The resulting system and method has increased robustness, reducing the instability that synonymy and polysemy in natural language cause in traditional keyword search engines, and is therefore particularly suitable for web document searching over a distributed computer network such as the Internet.
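The two projections named in the abstract can be illustrated with a short NumPy sketch. It is not taken from the patent text: the random matrix `D`, the rank `k`, and the helper name `split_projection` are assumptions chosen only for illustration. The sketch splits a vector `x` into its coordinates `y` in the reduced space spanned by the leading left singular vectors and the squared length `eps2` of the part that lies in the orthogonal space.

```python
import numpy as np

def split_projection(U_k, x):
    """Return (y, eps2) for a vector x and an orthonormal basis U_k (m x k).

    y    : coordinates of x in the reduced space spanned by the columns of U_k
    eps2 : squared norm of the component of x orthogonal to that space
    """
    y = U_k.T @ x
    # Clip at zero to guard against tiny negative values from round-off.
    eps2 = max(float(x @ x - y @ y), 0.0)
    return y, eps2

# Toy data (hypothetical): 6 terms, 4 differential document vectors.
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(D, full_matrices=False)

k = 2                      # assumed truncation rank
U_k = U[:, :k]
x = rng.standard_normal(6)

y, eps2 = split_projection(U_k, x)
residual = x - U_k @ y     # explicit orthogonal component, for comparison
print(np.isclose(eps2, residual @ residual))
```

Both quantities feed the likelihood functions in the claims: `y` enters the sum over the first k singular directions, and `eps2` enters the term for the orthogonal space.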

11 Claims

1. A method for setting up an information retrieval system and retrieving text information, comprising the steps of:
- preprocessing text including word, noun phrase and stop word identification;

  constructing system terms including setting up a term list and global weights;

  setting up and normalizing document vectors of all collected documents;

  constructing an interior differential term-document matrix $D_I^{m \times n_1}$ such that each column in said interior differential term-document matrix is an interior differential document vector;

  decomposing, using SVD algorithm, $D_I$, such that $D_I = USV^T$, then with a proper $k_1$, defining the $D_{I,k_1} = U_{k_1} S_{k_1} V_{k_1}^T$ to approximate $D_I$;

  defining an interior document likelihood function, P(x|D_I);

  constructing an exterior differential term-document matrix $D_E^{m \times n_2}$, such that each column in said exterior differential term-document matrix is an exterior differential document vector;

  decomposing, using SVD algorithm, $D_E$, such that $D_E = USV^T$, then with a proper value of $k_2$, defining the $D_{E,k_2} = U_{k_2} S_{k_2} V_{k_2}^T$ to approximate $D_E$;

  defining an exterior document likelihood function, P(x|D_E); and

  defining a posteriori function $P(D_I \mid x) = \frac{P(x \mid D_I) P(D_I)}{P(x \mid D_I) P(D_I) + P(x \mid D_E) P(D_E)},$ where P(D_I) is set to be an average number of recalls divided by the number of documents in the data base and P(D_E) is set to be 1 − P(D_I).
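The set-up steps of claim 1 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the claim does not define here how a differential document vector is formed, so the code assumes it is the difference of two normalized document vectors, interior when both documents share a cluster and exterior otherwise. The cluster `labels`, the toy counts, and all function names are hypothetical.

```python
import itertools
import numpy as np

def normalize_columns(A):
    """Scale each document vector (column) to unit Euclidean length."""
    norms = np.linalg.norm(A, axis=0)
    norms[norms == 0] = 1.0          # leave all-zero columns unchanged
    return A / norms

def differential_matrices(docs, labels):
    """Build (D_I, D_E) from normalized document vectors.

    Assumption: a differential vector is docs[:, i] - docs[:, j];
    it is 'interior' if i and j share a cluster label, else 'exterior'.
    """
    interior, exterior = [], []
    for i, j in itertools.combinations(range(docs.shape[1]), 2):
        diff = docs[:, i] - docs[:, j]
        (interior if labels[i] == labels[j] else exterior).append(diff)
    return np.column_stack(interior), np.column_stack(exterior)

def truncated_svd(D, k):
    """Return U_k, the first k singular values, V_k^T, and all singular values."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :], s

# Toy collection (hypothetical): 5 terms, 4 documents in 2 clusters.
rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(5, 4)).astype(float)
docs = normalize_columns(counts)
labels = [0, 0, 1, 1]

D_I, D_E = differential_matrices(docs, labels)
U_k1, s_k1, Vt_k1, s_I = truncated_svd(D_I, k=2)
D_I_k1 = U_k1 @ np.diag(s_k1) @ Vt_k1   # rank-k1 approximation of D_I
```

With four documents in two clusters of two, the sketch yields two interior and four exterior differential vectors, so `D_I` has n₁ = 2 columns and `D_E` has n₂ = 4.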
2. The method as set forth in claim 1, wherein the interior document likelihood function, P(x|D_I) is, $P(x \mid D_I) = \frac{n_1^{1/2} \exp\left(-\frac{n_1}{2} \sum_{i=1}^{k_1} \frac{y_i^2}{\delta_i^2}\right) \cdot \exp\left(-\frac{n_1 \varepsilon^2(x)}{2 \rho_1}\right)}{(2\pi)^{n_1/2} \prod_{i=1}^{k_1} \delta_i \cdot \rho_1^{(r_1 - k_1)/2}},$
3. The method as set forth in claim 2, wherein ρ₁ is chosen as $\delta_{k_1+1}^2/2$, and r₁ is n₁.
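The likelihood of claims 2 and 3 can be evaluated numerically as sketched below. The sketch works in log space to avoid underflow, which is a design choice of this example and not something the claims state. It also assumes that the δ_i are the singular values of D_I in descending order and that y and ε²(x) are defined for D_I by analogy with the definitions given for D_E; the function name `log_likelihood` and the toy matrix are hypothetical.

```python
import numpy as np

def log_likelihood(x, U_k, delta_k, n, r, rho):
    """Log of P(x | D) for the Gaussian-style form in claims 2 and 4.

    U_k     : m x k leading left singular vectors
    delta_k : first k singular values (assumed to be the delta_i)
    n, r    : number of differential vectors and the rank parameter
    rho     : variance-like parameter for the orthogonal space
    """
    k = U_k.shape[1]
    y = U_k.T @ x
    eps2 = max(float(x @ x - y @ y), 0.0)
    return (0.5 * np.log(n)
            - 0.5 * n * np.sum(y**2 / delta_k**2)
            - n * eps2 / (2.0 * rho)
            - 0.5 * n * np.log(2.0 * np.pi)
            - np.sum(np.log(delta_k))
            - 0.5 * (r - k) * np.log(rho))

# Toy interior matrix (hypothetical): 6 terms, n1 = 5 differential vectors.
rng = np.random.default_rng(2)
D = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(D, full_matrices=False)

k = 2
n = D.shape[1]
r = n                        # claim 3: r1 is n1
rho = s[k] ** 2 / 2.0        # claim 3: rho1 = delta_{k1+1}^2 / 2 (s is 0-indexed)
U_k, delta_k = U[:, :k], s[:k]

x = D[:, 0]                  # reuse one column as a test vector
log_p = log_likelihood(x, U_k, delta_k, n, r, rho)
```

The same function serves the exterior likelihood when it is called with the factors of D_E and with n₂, k₂, r₂ and ρ₂.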
4. The method as set forth in claim 1, wherein the exterior document likelihood function, P(x|D_E) is, $P(x \mid D_E) = \frac{n_2^{1/2} \exp\left(-\frac{n_2}{2} \sum_{i=1}^{k_2} \frac{y_i^2}{\delta_i^2}\right) \cdot \exp\left(-\frac{n_2 \varepsilon^2(x)}{2 \rho_2}\right)}{(2\pi)^{n_2/2} \prod_{i=1}^{k_2} \delta_i \cdot \rho_2^{(r_2 - k_2)/2}},$ where $y = U_{k_2}^T x$, $\varepsilon^2(x) = \|x\|^2 - \sum_{i=1}^{k_2} y_i^2$, $\rho_2 = \frac{1}{r_2 - k_2} \sum_{i=k_2+1}^{r_2} \delta_i^2$, and r₂ is the rank of matrix D_E.
5. The method as set forth in claim 4, wherein ρ₂ is chosen as $\delta_{k_2+1}^2/2$, and r₂ is n₂.
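As a worked substitution, the choices in claim 5 can be inserted into the exterior likelihood. The derivation below assumes the claim 4 formula is read as a single fraction, with $n_2^{1/2}$ and the two exponentials in the numerator and $(2\pi)^{n_2/2} \prod \delta_i \cdot \rho_2^{(r_2-k_2)/2}$ in the denominator; it is an illustration of the algebra, not additional claim language.

```latex
% Claim 5: \rho_2 = \delta_{k_2+1}^2 / 2 and r_2 = n_2.
\begin{align*}
\exp\!\left(-\frac{n_2\,\varepsilon^2(x)}{2\rho_2}\right)
  &= \exp\!\left(-\frac{n_2\,\varepsilon^2(x)}{\delta_{k_2+1}^2}\right), \\
\rho_2^{(r_2-k_2)/2}
  &= \left(\frac{\delta_{k_2+1}^2}{2}\right)^{(n_2-k_2)/2}
   = \frac{\delta_{k_2+1}^{\,n_2-k_2}}{2^{(n_2-k_2)/2}}, \\
P(x \mid D_E)
  &= \frac{n_2^{1/2}\, 2^{(n_2-k_2)/2}
      \exp\!\left(-\frac{n_2}{2}\sum_{i=1}^{k_2}\frac{y_i^2}{\delta_i^2}\right)
      \exp\!\left(-\frac{n_2\,\varepsilon^2(x)}{\delta_{k_2+1}^2}\right)}
     {(2\pi)^{n_2/2}\,\prod_{i=1}^{k_2}\delta_i \cdot \delta_{k_2+1}^{\,n_2-k_2}}.
\end{align*}
```

Under these choices the likelihood depends only on the first k₂ + 1 values δ_i, so the remaining δ_i need not be used to evaluate it.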
6. The method as set forth in claim 1, further comprising the steps of:
- setting up a document vector for a query by generating terms as well as frequency of term occurrence, and thereby obtaining a normalized document vector for the query;
  
  given the query, constructing a differential document vector x;
  
  calculating the interior document likelihood function P(x|D_I) and the exterior document likelihood function P(x|D_E) for the document;
  
  calculating the posteriori probability function P(D_I|x); and
  
  selecting documents according to one of P(D_I|x) exceeding a given threshold or N best documents with largest P(D_I|x), those values of P(D_I|x) being shown as scores to rank a match.
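The scoring and selection steps of claim 6 can be sketched as follows. The sketch assumes that log-likelihoods for each candidate document are already available and combines them with Bayes' rule as in claim 1; working in log space and the function names `posterior_interior` and `select_documents` are choices of this example. The prior value and the log-likelihoods are made-up numbers, with the prior standing in for the average number of recalls divided by the number of documents.

```python
import numpy as np

def posterior_interior(log_p_int, log_p_ext, prior_int):
    """P(D_I | x) from log P(x|D_I), log P(x|D_E) and the prior P(D_I)."""
    a = log_p_int + np.log(prior_int)        # log of P(x|D_I) P(D_I)
    b = log_p_ext + np.log1p(-prior_int)     # log of P(x|D_E) (1 - P(D_I))
    m = np.maximum(a, b)                     # subtract the max for stability
    return np.exp(a - m) / (np.exp(a - m) + np.exp(b - m))

def select_documents(scores, threshold=None, top_n=None):
    """Rank documents by score; keep those above a threshold or the N best."""
    order = np.argsort(scores)[::-1]         # indices in descending score order
    if threshold is not None:
        order = [i for i in order if scores[i] > threshold]
    if top_n is not None:
        order = order[:top_n]
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical values: 3 candidate documents, 10 average recalls in 100 documents.
prior_int = 10 / 100
log_p_int = np.array([-1.0, -5.0, -2.0])
log_p_ext = np.array([-3.0, -1.0, -2.0])

scores = posterior_interior(log_p_int, log_p_ext, prior_int)
best_two = select_documents(scores, top_n=2)
above = select_documents(scores, threshold=0.05)
```

When the two likelihoods are equal, as for the third toy document, the posterior reduces to the prior P(D_I).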

7. A method for setting up an information retrieval system and retrieving text information, comprising the steps of:
- preprocessing text;

  constructing system terms;

  setting up and normalizing document vectors of all collected documents;

  constructing an interior differential term-document matrix $D_I^{m \times n_1}$ such that each column in said interior differential term-document matrix is an interior differential document vector;

  decomposing $D_I$, such that $D_I = USV^T$, then with a proper $k_1$, defining the $D_{I,k_1} = U_{k_1} S_{k_1} V_{k_1}^T$ to approximate $D_I$;

  defining an interior document likelihood function, P(x|D_I);

  constructing an exterior differential term-document matrix $D_E^{m \times n_2}$, such that each column in said exterior differential term-document matrix is an exterior differential document vector;

  decomposing $D_E$, such that $D_E = USV^T$, then with a proper value of $k_2$, defining the $D_{E,k_2} = U_{k_2} S_{k_2} V_{k_2}^T$ to approximate $D_E$;

  defining an exterior document likelihood function, P(x|D_E);

  defining a posteriori function $P(D_I \mid x) = \frac{P(x \mid D_I) P(D_I)}{P(x \mid D_I) P(D_I) + P(x \mid D_E) P(D_E)},$ where P(D_I) is set to be an average number of recalls divided by the number of documents in the data base and P(D_E) is set to be 1 − P(D_I);

  setting up a document vector for a query by generating terms as well as frequency of term occurrence, and thereby obtaining a normalized document vector for the query;

  given the query, constructing a differential document vector x;

  calculating the interior document likelihood function P(x|D_I) and the exterior document likelihood function P(x|D_E) for the document;

  calculating the posteriori probability function P(D_I|x); and

  selecting documents according to one of P(D_I|x) exceeding a given threshold or N best documents with largest P(D_I|x), those values of P(D_I|x) being shown as scores to rank a match.
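8. The method as set forth in claim 7, wherein the interior document likelihood function, P(x|D_I) is, $P(x \mid D_I) = \frac{n_1^{1/2} \exp\left(-\frac{n_1}{2} \sum_{i=1}^{k_1} \frac{y_i^2}{\delta_i^2}\right) \cdot \exp\left(-\frac{n_1 \varepsilon^2(x)}{2 \rho_1}\right)}{(2\pi)^{n_1/2} \prod_{i=1}^{k_1} \delta_i \cdot \rho_1^{(r_1 - k_1)/2}},$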
9. The method as set forth in claim 8, wherein ρ₁ is chosen as $\delta_{k_1+1}^2/2$, and r₁ is n₁.
10. The method as set forth in claim 7, wherein the exterior document likelihood function, P(x|D_E) is, $P(x \mid D_E) = \frac{n_2^{1/2} \exp\left(-\frac{n_2}{2} \sum_{i=1}^{k_2} \frac{y_i^2}{\delta_i^2}\right) \cdot \exp\left(-\frac{n_2 \varepsilon^2(x)}{2 \rho_2}\right)}{(2\pi)^{n_2/2} \prod_{i=1}^{k_2} \delta_i \cdot \rho_2^{(r_2 - k_2)/2}},$ where $y = U_{k_2}^T x$, $\varepsilon^2(x) = \|x\|^2 - \sum_{i=1}^{k_2} y_i^2$, $\rho_2 = \frac{1}{r_2 - k_2} \sum_{i=k_2+1}^{r_2} \delta_i^2$, and r₂ is the rank of matrix D_E.
11. The method as set forth in claim 10, wherein ρ₂ is chosen as $\delta_{k_2+1}^2/2$, and r₂ is n₂.

Current Assignee: SunFlare Co., Ltd.
Original Assignee: SunFlare Co., Ltd.
Inventors: Sasai, Hiroyuki; Chen, Liang; Tokuda, Naoyuki
Primary Examiner(s): CHOULES, JACK M

Application Number: US09/849,986
Publication Number: US 20030050921A1
Time in Patent Office: 931 Days
Field of Search: 707/3, 707/4, 707/5, 707/6, 707/10, 709/9
US Class Current: 707/769
CPC Class Codes:

G06F 16/30   of unstructured textual dat...

G06F 18/2135   based on approximation crit...

Y10S 707/917   Text

Y10S 707/955   Object-oriented

Y10S 707/99935   Query augmenting and refini...

Y10S 707/99936   Pattern matching access
