Apparatus for retrieving similar documents and apparatus for extracting relevant keywords

US 6,671,683 B2
Filed: 06/28/2001
Issued: 12/30/2003
Est. Priority Date: 06/28/2000
Status: Expired due to Term

Abstract

Three kinds of data, i.e., a keyword frequency-of-appearance, a document length, and a keyword weight, are produced. Then, a document profile vector and a keyword profile vector are calculated. Then, by independently performing the weighted principal component analysis considering the document length and the keyword weight, a document feature vector and a keyword feature vector are obtained. Then, documents and keywords having higher similarity to the feature vectors calculated with reference to the retrieval and extracting conditions are obtained and displayed.

52 Claims

1. A similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition consisting of a document group including at least one document x₁, ..., x_r selected from said document database D and for retrieving documents similar to said document group of said retrieval condition from said document database D, said similar document retrieving apparatus comprising:
- keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored in said document database D;
- document length calculating means for calculating a document length data L which represents a length l_d of said each document d;
- keyword weight calculating means for calculating a keyword weight data W which represents a weight w_t of each keyword t of said M kinds of keywords appearing in said document database D;
- document profile vector producing means for producing a M-dimensional document profile vector P_d having components respectively representing a relative frequency-of-occurrence p_dt of each keyword t in the concerned document d;
- document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector U_d corresponding to said document profile vector P_d for said each document d; and
- similar document retrieving means for receiving said retrieval condition consisting of the document group including at least one document x₁, ..., x_r selected from said document database D, calculating a similarity between each document d and said retrieval condition based on a document feature vector of said received document group and the document feature vector of each document d in said document database D, and outputting a designated number of similar documents in order of the calculated similarity.
- 2. The similar document retrieving apparatus in accordance with claim 1, wherein said similar document retrieving means calculates the similarity between each document d and said retrieval condition based on an inner product between the document feature vector of said received document group and said document feature vector of each document d.
- 3. The similar document retrieving apparatus in accordance with claim 1, wherein said similar document retrieving means calculates the similarity between each document d and said retrieval condition based on a distance between the document feature vector of said received document group and said document feature vector of each document d.
- 4. The similar document retrieving apparatus in accordance with claim 1, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by using a product-sum of weighted components reflecting said keyword weight data W and a degree of dispersion (i.e., an evaluation value of standard deviation) of components p_at and p_bt of said document profile vectors P_a and P_b, and performs said principal component analysis on the assumption that the document profile vectors of said document d of the document length l_d are contained in the document profile vector group by the number proportional to a ratio g_d/l_d, where g_d represents the total number of all keywords appearing in the document d and l_d represents the document length of the document d.
- 5. The similar document retrieving apparatus in accordance with claim 4, wherein said document principal component analyzing means obtains said document feature vector on the assumption that the degree of dispersion of the component p_dt corresponding to the keyword t, of the document profile vector P_d of each document d in the document database D, is expressed by a square root of h_t/f, where h_t represents an overall frequency-of-occurrence value of the keyword t and f represents a sum of frequency-of-occurrence values of all keywords.
- 6. The similar document retrieving apparatus in accordance with claim 4, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by dividing each of the components p_at and p_bt corresponding to the keyword t of the document profile vectors P_a and P_b by the degree of dispersion of respective components and then multiplying the divisions thus obtained each other, and then multiplying the resultant value with the keyword weight data w_t, and then obtaining a sum of the thus weighted value for all of the keywords t.
- 7. The similar document retrieving apparatus in accordance with claim 1, wherein said document length calculating means compares a character number of the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the character number of the concerned document d is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said character number when the character number of the concerned document d is equal to or larger than l₀.
- 8. The similar document retrieving apparatus in accordance with claim 1, wherein said document length calculating means compares a total number of keywords appearing in the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the total number of keywords is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said total number of keywords when the character total number of keywords is equal to or larger than l₀.
- 9. The similar document retrieving apparatus in accordance with claim 1, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 10. The similar document retrieving apparatus in accordance with claim 1, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 11. The similar document retrieving apparatus in accordance with claim 1, wherein said document profile vector producing means calculates the relevant frequency-of-occurrence p_dt of each keyword t in the concerned document d by dividing the frequency-of-occurrence f_dt of each keyword t in the concerned document d by a sum Σ_j f_dj of the frequency-of-occurrence value of all keywords j appearing in the concerned document d.
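
The document-side quantities recited in claims 5, 6 and 11 can be illustrated with a small sketch. It is only one reading of the claim language, not the patented implementation: the function names, the toy frequency matrix `F` and the uniform weights `w` are assumptions made for illustration, and the keyword weight formula of claims 9 and 10 is not reproduced here, so the weights are simply taken as given.

```python
import math

def document_profile(freqs):
    """Relative frequencies p_dt = f_dt / sum_j f_dj for one document (claim 11)."""
    total = sum(freqs)
    return [f / total for f in freqs] if total else [0.0] * len(freqs)

def keyword_dispersion(F):
    """Dispersion sqrt(h_t / f) per keyword t (claim 5).

    h_t is the overall frequency of keyword t (column sum of F) and
    f is the sum of all frequencies in F.
    """
    f = sum(sum(row) for row in F)
    n_keywords = len(F[0])
    return [math.sqrt(sum(row[t] for row in F) / f) for t in range(n_keywords)]

def weighted_inner_product(p_a, p_b, sigma, w):
    """sum_t w_t * (p_at / sigma_t) * (p_bt / sigma_t) (claim 6).

    Keywords with zero dispersion never occur, so they are skipped
    (an assumption; the claims do not discuss this case).
    """
    return sum(
        wt * (a / s) * (b / s)
        for a, b, s, wt in zip(p_a, p_b, sigma, w)
        if s > 0
    )

# Hypothetical toy data: 3 documents (rows) x 3 keywords (columns).
F = [[2, 0, 2],
     [0, 4, 0],
     [1, 1, 2]]
w = [1.0, 1.0, 1.0]  # assumed keyword weights

P = [document_profile(row) for row in F]
sigma = keyword_dispersion(F)

print(weighted_inner_product(P[0], P[1], sigma, w))  # no shared keywords
print(weighted_inner_product(P[0], P[2], sigma, w))  # two shared keywords
```

Dividing by the dispersion before multiplying gives rare keywords more influence on the inner product than frequent ones, which is the effect claims 5 and 6 appear to aim for.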

12. A similar document retrieving apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating a retrieval condition consisting of a keyword group including at least one keyword y₁, ..., y_s selected from said document database D and for retrieving documents relevant to said retrieval condition from said document database D, said similar document retrieving apparatus comprising:
- keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored in said document database D;
- document length calculating means for calculating a document length data L which represents a length l_d of said each document d;
- keyword weight calculating means for calculating a keyword weight data W which represents a weight w_t of each keyword t of said M kinds of keywords appearing in said document database D;
- document profile vector producing means for producing a M-dimensional document profile vector P_d having components respectively representing a relative frequency-of-occurrence p_dt of each keyword t in the concerned document d;
- keyword profile vector producing means for producing a N-dimensional keyword profile vector Q_t having components respectively representing a relative frequency-of-occurrence q_dt of the concerned keyword t in each document d;
- document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector U_d corresponding to said document profile vector P_d for said each document d;
- keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword t, said keyword feature vector having the same dimension as that of said document feature vector, as well as for obtaining a keyword contribution factor (i.e., eigenvalue of a correlation matrix) θ_j of each dimension j;
- retrieval condition feature vector calculating means for receiving said retrieval condition consisting of keyword group including at least one keyword y₁, ..., y_s, and for calculating a retrieval condition feature vector corresponding to said retrieval condition based on said keyword weight data of the received keyword group, said keyword feature vector and said keyword contribution factor; and
- similar document retrieving means for calculating a similarity between each document d and said retrieval condition based on the calculated retrieval condition feature vector and a document feature vector of said each document d, and outputting a designated number of similar documents in order of the calculated similarity.
- 13. The similar document retrieving apparatus in accordance with claim 12, wherein said similar document retrieving means calculates the similarity between each document d and said retrieval condition based on an inner product between said retrieval condition feature vector and said document feature vector of each document d.
- 14. The similar document retrieving apparatus in accordance with claim 12, wherein said similar document retrieving means calculates the similarity between each document d and said retrieval condition based on a distance between the retrieval condition feature vector and said document feature vector of each document d.
- 15. The similar document retrieving apparatus in accordance with claim 12, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by using a product-sum of weighted components reflecting said keyword weight data W and a degree of dispersion (i.e., an evaluation value of standard deviation) of components p_at and p_bt of said document profile vectors P_a and P_b, and performs said principal component analysis on the assumption that the document profile vectors of said document d of the document length l_d are contained in the document profile vector group by the number proportional to a ratio g_d/l_d, where g_d represents the total number of all keywords appearing in the document d and l_d represents the document length of the document d.
- 16. The similar document retrieving apparatus in accordance with claim 15, wherein said document principal component analyzing means obtains said document feature vector on the assumption that the degree of dispersion of the component p_dt corresponding to the keyword t, of the document profile vector P_d of each document d in the document database D, is expressed by a square root of h_t/f, where h_t represents an overall frequency-of-occurrence value of the keyword t and f represents a sum of frequency-of-occurrence values of all keywords.
- 17. The similar document retrieving apparatus or the relevant keyword extracting apparatus in accordance with claim 15, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by dividing each of the components p_at and p_bt corresponding to the keyword t of the document profile vectors P_a and P_b by the degree of dispersion of respective components and then multiplying the divisions thus obtained each other, and then multiplying the resultant value with the keyword weight data w_t, and then obtaining a sum of the thus weighted value for all of the keywords t.
- 18. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by using a product-sum of weighted components reflecting said document length data L and a degree of dispersion (i.e., an evaluation value of standard deviation) of components q_ad and q_bd of said keyword profile vectors Q_a and Q_b, and performs said principal component analysis on the assumption that the keyword profile vectors of said keyword t of the keyword weight w_t are contained in the keyword profile vector group by the number proportional to h_t*w_t, where h_t represents an overall frequency-of-occurrence value of keyword t and w_t represents the keyword weight of the keyword t.
- 19. The similar document retrieving apparatus in accordance with claim 18, wherein said keyword principal component analyzing means obtains said keyword feature vector on the assumption that the degree of dispersion of the component q_td corresponding to the document d, of the keyword profile vector Q_t of each keyword t in the document database D, is expressed by a square root of g_d/f, where g_d represents the total number of all keywords appearing in the document d and f represents a sum of frequency-of-occurrence values of all keywords.
- 20. The similar document retrieving apparatus in accordance with claim 18, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by dividing each of the components q_ad and q_bd corresponding to the document d of the keyword profile vectors Q_a and Q_b by the degree of dispersion of respective components and then multiplying the divisions thus obtained each other, and then dividing the resultant value by the document length l_d, and then obtaining a sum of the thus weighted value for all of the documents d.
- 21. The similar document retrieving apparatus in accordance with claim 12, wherein said document length calculating means compares a character number of the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the character number of the concerned document d is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said character number when the character number of the concerned document d is equal to or larger than l₀.
- 22. The similar document retrieving apparatus in accordance with claim 12, wherein said document length calculating means compares a total number of keywords appearing in the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the total number of keywords is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said total number of keywords when the character total number of keywords is equal to or larger than l₀.
- 23. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 24. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 25. The similar document retrieving apparatus in accordance with claim 12, wherein said document profile vector producing means calculates the relevant frequency-of-occurrence p_dt of each keyword t in the concerned document d by dividing the frequency-of-occurrence f_dt of each keyword t in the concerned document d by a sum Σ_j f_dj of the frequency-of-occurrence value of all keywords j appearing in the concerned document d.
- 26. The similar document retrieving apparatus in accordance with claim 12, wherein said keyword profile vector producing means calculates the relevant frequency-of-occurrence q_dt of the concerned keyword t in each document d by dividing the frequency-of-occurrence f_dt of the concerned keyword t in said each document d by a sum Σ_j f_jt of the frequency-of-occurrence value of the concerned keywords t in all documents j containing the concerned keyword t.
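
The keyword-side counterparts in claims 19, 20 and 26 mirror the document-side ones, with the roles of documents and keywords exchanged. The following is a hedged sketch under the same caveats: all names, the toy matrix `F` and the `lengths` values are hypothetical, and the document lengths are supplied directly rather than derived from the thresholding rule of claims 21 and 22.

```python
import math

def keyword_profile(F, t):
    """Relative frequencies q_dt = f_dt / sum_j f_jt over all documents j (claim 26)."""
    h_t = sum(row[t] for row in F)
    return [row[t] / h_t for row in F] if h_t else [0.0] * len(F)

def document_dispersion(F):
    """Dispersion sqrt(g_d / f) per document d (claim 19).

    g_d is the total keyword count of document d (row sum of F) and
    f is the sum of all frequencies in F.
    """
    f = sum(sum(row) for row in F)
    return [math.sqrt(sum(row) / f) for row in F]

def keyword_inner_product(q_a, q_b, sigma, lengths):
    """sum_d (q_ad / sigma_d) * (q_bd / sigma_d) / l_d (claim 20).

    Empty documents (zero dispersion) are skipped, which is an
    assumption not stated in the claims.
    """
    return sum(
        (a / s) * (b / s) / l
        for a, b, s, l in zip(q_a, q_b, sigma, lengths)
        if s > 0 and l > 0
    )

# Hypothetical toy data: 3 documents (rows) x 3 keywords (columns).
F = [[2, 0, 2],
     [0, 4, 0],
     [1, 1, 2]]
lengths = [1.0, 2.0, 4.0]  # assumed document lengths l_d

Q = [keyword_profile(F, t) for t in range(len(F[0]))]
sigma = document_dispersion(F)

print(keyword_inner_product(Q[0], Q[2], sigma, lengths))  # keywords co-occurring in two documents
print(keyword_inner_product(Q[0], Q[1], sigma, lengths))  # keywords co-occurring in one document
```

Dividing each term by l_d means that co-occurrence inside a long document contributes less to keyword similarity than co-occurrence inside a short one.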

27. A relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a keyword group including at least one keyword y₁, ..., y_s selected from said document database D and for extracting keywords relevant to said keyword group of said extracting condition from said document database D, said relevant keyword extracting apparatus comprising:
- keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored in said document database D;
- document length calculating means for calculating a document length data L which represents a length l_d of said each document d;
- keyword weight calculating means for calculating a keyword weight data W which represents a weight w_t of each keyword t of said M kinds of keywords appearing in said document database D;
- keyword profile vector producing means for producing a N-dimensional keyword profile vector Q_t having components respectively representing a relative frequency-of-occurrence q_dt of the concerned keyword t in each document d;
- keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword t; and
- relevant keyword extracting means for receiving said extracting condition consisting of the keyword group including at least one keyword y₁, ..., y_s selected from said document database D, calculating a relevancy between each keyword t and said extracting condition based on a keyword feature vector of said received keyword group and the keyword feature vector of each keyword t in said document database D, and outputting a designated number of relevant keywords in order of the calculated relevancy.
- 28. The relevant keyword extracting apparatus in accordance with claim 27, wherein said relevant keyword extracting means calculates the relevancy between each keyword t and said extracting condition based on an inner product between the keyword feature vector of said received keyword group and said keyword feature vector of each keyword t.
- 29. The relevant keyword extracting apparatus in accordance with claim 27, wherein said relevant keyword extracting means calculates the relevancy between each keyword t and said extracting condition based on a distance between the keyword feature vector of said received keyword group and said keyword feature vector of each keyword t.
- 30. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by using a product-sum of weighted components reflecting said document length data L and a degree of dispersion (i.e., an evaluation value of standard deviation) of components q_ad and q_bd of said keyword profile vectors Q_a and Q_b, and performs said principal component analysis on the assumption that the keyword profile vectors of said keyword t of the keyword weight w_t are contained in the keyword profile vector group by the number proportional to h_t*w_t, where h_t represents an overall frequency-of-occurrence value of keyword t and w_t represents the keyword weight of the keyword t.
- 31. The relevant keyword extracting apparatus in accordance with claim 30, wherein said keyword principal component analyzing means obtains said keyword feature vector on the assumption that the degree of dispersion of the component q_td corresponding to the document d, of the keyword profile vector Q_t of each keyword t in the document database D, is expressed by a square root of g_d/f, where g_d represents the total number of all keywords appearing in the document d and f represents a sum of frequency-of-occurrence values of all keywords.
- 32. The relevant keyword extracting apparatus in accordance with claim 30, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by dividing each of the components q_ad and q_bd corresponding to the document d of the keyword profile vectors Q_a and Q_b by the degree of dispersion of respective components and then multiplying the divisions thus obtained each other, and then dividing the resultant value by the document length l_d, and then obtaining a sum of the thus weighted value for all of the documents d.
- 33. The relevant keyword extracting apparatus in accordance with claim 27, wherein said document length calculating means compares a character number of the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the character number of the concerned document d is less than l₀ and stores δ-th (δ is a nonnegative integer) root of said character number when the character number of the concerned document d is equal to or larger than l₀.
- 34. The relevant keyword extracting apparatus in accordance with claim 27, wherein said document length calculating means compares a total number of keywords appearing in the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the total number of keywords is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said total number of keywords when the character total number of keywords is equal to or larger than l₀.
- 35. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 36. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula [formula image not reproduced]
- 37. The relevant keyword extracting apparatus in accordance with claim 27, wherein said keyword profile vector producing means calculates the relevant frequency-of-occurrence q_dt of the concerned keyword t in each document d by dividing the frequency-of-occurrence f_dt of the concerned keyword t in said each document d by a sum Σ_j f_jt of the frequency-of-occurrence value of the concerned keywords t in all documents j containing the concerned keyword t.
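
Claims 28 and 29 leave two options for scoring relevancy once K-dimensional feature vectors exist: an inner product or a distance. The sketch below ranks keywords both ways. It makes several assumptions that the claims do not state: the feature vectors in `V` are made-up values rather than the output of a principal component analysis, the feature vector of a keyword group is taken to be the mean of its members' vectors, and distance is Euclidean.

```python
import math

def group_vector(vectors):
    """Feature vector of a keyword group, assumed here to be the component-wise mean."""
    k = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(k)]

def rank(features, condition, top_n, use_distance=False):
    """Return the top_n names ordered by relevancy to the condition vector.

    Inner product (claim 28): larger is more relevant.
    Distance (claim 29): smaller is more relevant, so it is negated.
    """
    def score(vec):
        if use_distance:
            return -math.dist(vec, condition)
        return sum(a * b for a, b in zip(vec, condition))

    ordered = sorted(features, key=lambda name: score(features[name]), reverse=True)
    return ordered[:top_n]

# Hypothetical 2-dimensional keyword feature vectors V_t.
V = {
    "engine": [0.9, 0.1],
    "motor": [0.8, 0.2],
    "piano": [0.1, 0.9],
    "violin": [0.0, 1.0],
}
condition = group_vector([V["engine"]])  # extracting condition: the single keyword "engine"

print(rank(V, condition, 2))                     # inner-product ranking
print(rank(V, condition, 3, use_distance=True))  # distance ranking
```

The same ranking step would apply to claims 2, 3, 13 and 14 with document feature vectors in place of keyword feature vectors.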

38. A relevant keyword extracting apparatus applicable to a document database D which stores N document data containing a total of M kinds of keywords and is machine processible, for designating an extracting condition consisting of a document group including at least one document x₁, ..., x_r selected from said document database D and for extracting keywords relevant to the document group of said extracting condition from said document database D, said relevant keyword extracting apparatus comprising:
- keyword frequency-of-occurrence calculating means for calculating a keyword frequency-of-occurrence data F which represents a frequency-of-occurrence f_dt of each keyword t appearing in each document d stored in said document database D;
- document length calculating means for calculating a document length data L which represents a length l_d of said each document d;
- keyword weight calculating means for calculating a keyword weight data W which represents a weight w_t of each keyword t of said M kinds of keywords appearing in said document database D;
- document profile vector producing means for producing a M-dimensional document profile vector P_d having components respectively representing a relative frequency-of-occurrence p_dt of each keyword t in the concerned document d;
- keyword profile vector producing means for producing a N-dimensional keyword profile vector Q_t having components respectively representing a relative frequency-of-occurrence q_dt of the concerned keyword t in each document d;
- document principal component analyzing means for performing a principal component analysis on a document profile vector group of a document group in said document database D and for obtaining a predefined (K)-dimensional document feature vector U_d corresponding to said document profile vector P_d for said each document d as well as for obtaining a document contribution factor (i.e., eigenvalue of a correlation matrix) λ_j of each dimension j;
- keyword principal component analyzing means for performing a principal component analysis on a keyword profile vector group of a keyword group in said document database D and for obtaining a predefined (K)-dimensional keyword feature vector V_t corresponding to said keyword profile vector Q_t for said each keyword t, said keyword feature vector having the same dimension as that of said document feature vector;
- extracting condition feature vector calculating means for receiving said extracting condition consisting of the document group including at least one document x₁, ..., x_r, and for calculating an extracting condition feature vector corresponding to said extracting condition based on said document length data of the received document group, said document feature vector and said document contribution factor; and
- relevant keyword extracting means for calculating a relevancy between each keyword t and said extracting condition based on the calculated extracting condition feature vector and a keyword feature vector of each keyword t, and outputting a designated number of relevant keywords in order of the calculated relevancy.
- View Dependent Claims (39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52)
- - 39. The relevant keyword extracting apparatus in accordance with claim 38, wherein said relevant keyword extracting means calculates the relevancy between each keyword t and said extracting condition based on an inner product between said extracting condition feature vector and said keyword feature vector of each keyword t.
  - 40. The relevant keyword extracting apparatus in accordance with claim 38, wherein said relevant keyword extracting means calculates the relevancy between each keyword t and said extracting condition based on a distance between said extracting condition feature vector and said keyword feature vector of each keyword t.
  - 41. The relevant keyword extracting apparatus in accordance with claim 38, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by using a product-sum of weighted components reflecting said keyword weight data W and a degree of dispersion (i.e., an evaluation value of standard deviation) of components p_at and p_bt of said document profile vectors P_a and P_b, and performs said principal component analysis on the assumption that the document profile vectors of said document d of the document length l_d are contained in the document profile vector group by the number proportional to a ratio g_d/l_d, where g_d represents the total number of all keywords appearing in the document d and l_d represents the document length of the document d.
  - 42. The relevant keyword extracting apparatus in accordance with claim 41, wherein said document principal component analyzing means obtains said document feature vector on the assumption that the degree of dispersion of the component p_dt corresponding to the keyword t, of the document profile vector P_d of each document d in the document database D, is expressed by a square root of h_t/f, where h_t represents an overall frequency-of-occurrence value of the keyword t and f represents a sum of frequency-of-occurrence values of all keywords.
  - 43. The relevant keyword extracting apparatus in accordance with claim 41, wherein said document principal component analyzing means calculates the inner product between two document profile vectors P_a and P_b of two documents a and b contained in the document database D by dividing each of the components p_at and p_bt corresponding to the keyword t of the document profile vectors P_a and P_b by the degree of dispersion of respective components and then multiplying the quotients thus obtained by each other, and then multiplying the resultant value by the keyword weight data w_t, and then obtaining a sum of the thus weighted values for all of the keywords t.
  - 44. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by using a product-sum of weighted components reflecting said document length data L and a degree of dispersion (i.e., an evaluation value of standard deviation) of components q_ad and q_bd of said keyword profile vectors Q_a and Q_b, and performs said principal component analysis on the assumption that the keyword profile vectors of said keyword t of the keyword weight w_t are contained in the keyword profile vector group by the number proportional to h_t*w_t, where h_t represents an overall frequency-of-occurrence value of keyword t and w_t represents the keyword weight of the keyword t.
  - 45. The relevant keyword extracting apparatus in accordance with claim 44, wherein said keyword principal component analyzing means obtains said keyword feature vector on the assumption that the degree of dispersion of the component q_td corresponding to the document d, of the keyword profile vector Q_t of each keyword t in the document database D, is expressed by a square root of g_d/f, where g_d represents the total number of all keywords appearing in the document d and f represents a sum of frequency-of-occurrence values of all keywords.
  - 46. The relevant keyword extracting apparatus in accordance with claim 44, wherein said keyword principal component analyzing means calculates the inner product between two keyword profile vectors Q_a and Q_b of two keywords Ka and Kb contained in the document database D by dividing each of the components q_ad and q_bd corresponding to the document d of the keyword profile vectors Q_a and Q_b by the degree of dispersion of respective components and then multiplying the quotients thus obtained by each other, and then dividing the resultant value by the document length l_d, and then obtaining a sum of the thus weighted values for all of the documents d.
  - 47. The relevant keyword extracting apparatus in accordance with claim 38, wherein said document length calculating means compares a character number of the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the character number of the concerned document d is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said character number when the character number of the concerned document d is equal to or larger than l₀.
  - 48. The relevant keyword extracting apparatus in accordance with claim 38, wherein said document length calculating means compares a total number of keywords appearing in the concerned document d with a predetermined threshold l₀ and stores l₀ as the length of said concerned document d when the total number of keywords is less than l₀ and stores a δ-th (δ is a nonnegative integer) root of said total number of keywords when the total number of keywords is equal to or larger than l₀.
  - 49. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula
  - 50. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword weight calculating means calculates the weight w_t of the concerned keyword t according to the following formula
  - 51. The relevant keyword extracting apparatus in accordance with claim 38, wherein said document profile vector producing means calculates the relative frequency-of-occurrence p_dt of each keyword t in the concerned document d by dividing the frequency-of-occurrence f_dt of each keyword t in the concerned document d by a sum Σ_j f_dj of the frequency-of-occurrence values of all keywords j appearing in the concerned document d.
  - 52. The relevant keyword extracting apparatus in accordance with claim 38, wherein said keyword profile vector producing means calculates the relative frequency-of-occurrence q_dt of the concerned keyword t in each document d by dividing the frequency-of-occurrence f_dt of the concerned keyword t in said each document d by a sum Σ_j f_jt of the frequency-of-occurrence values of the concerned keyword t in all documents j containing the concerned keyword t.
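
Several of the dependent claims above describe concrete arithmetic: the thresholded document length (claims 47 and 48), the relative frequencies p_dt and q_dt (claims 51 and 52), and the dispersion-normalized inner products (claims 42, 43, 45 and 46). The following Python sketch shows one way those quantities could be computed from a frequency matrix F with entries f_dt. It is a hedged illustration only: all function names are hypothetical, NumPy is an assumed dependency, δ is assumed to be at least 1 (a 0-th root is undefined), and every document and keyword is assumed to have a nonzero total frequency.

```python
import numpy as np


def document_length(count, l0, delta):
    """l0 if count < l0, otherwise the delta-th root of count (claims 47, 48)."""
    if delta < 1:
        raise ValueError("delta must be >= 1 in this sketch")
    return float(l0) if count < l0 else float(count) ** (1.0 / delta)


def document_profiles(F):
    """Row d holds p_dt = f_dt / g_d, where g_d = sum over keywords j of f_dj."""
    F = np.asarray(F, dtype=float)
    return F / F.sum(axis=1, keepdims=True)


def keyword_profiles(F):
    """Column t holds q_dt = f_dt / h_t, where h_t = sum over documents j of f_jt."""
    F = np.asarray(F, dtype=float)
    return F / F.sum(axis=0, keepdims=True)


def document_inner_product(p_a, p_b, w, h):
    """Sum over t of w_t * (p_at / s_t) * (p_bt / s_t), with s_t = sqrt(h_t / f)."""
    h = np.asarray(h, dtype=float)
    s = np.sqrt(h / h.sum())  # assumed dispersion of component t
    return float(np.sum(np.asarray(w) * (p_a / s) * (p_b / s)))


def keyword_inner_product(q_a, q_b, g, l):
    """Sum over d of (q_ad / s_d) * (q_bd / s_d) / l_d, with s_d = sqrt(g_d / f)."""
    g = np.asarray(g, dtype=float)
    s = np.sqrt(g / g.sum())  # assumed dispersion of component d
    return float(np.sum((q_a / s) * (q_b / s) / np.asarray(l, dtype=float)))


if __name__ == "__main__":
    F = np.array([[2, 0, 2], [1, 1, 0]])  # 2 documents, 3 keywords (toy data)
    P, Q = document_profiles(F), keyword_profiles(F)
    h, g = F.sum(axis=0), F.sum(axis=1)
    print(document_inner_product(P[0], P[1], np.ones(3), h))
    print(keyword_inner_product(Q[:, 0], Q[:, 2], g, np.ones(2)))
```

Dividing each component by its assumed standard deviation before multiplying puts frequent and rare keywords (or long and short documents) on a comparable scale, which is the stated purpose of the dispersion terms in the claims.
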

Current Assignee: Rakuten Group, Inc.
Original Assignee: Matsushita Electric Industrial Company Limited (Panasonic Holdings Corporation)
Inventors: Kanno, Yuji
Primary Examiner(s): Amsbury, Wayne
Assistant Examiner(s): AL HASHEMI, SANA A
Application Number: US09/892,700
Publication Number: US 20020016787A1
Time in Patent Office: 915 Days
Field of Search: 707/5, 707/3, 707/2, 707/6, 717/1, 704/5
US Class Current: 1/1
CPC Class Codes:
G06F 16/30 of unstructured textual dat...
Y10S 707/99935 Query augmenting and refini...