Method for calculating relevance between words based on document set and system for executing the method
Abstract
A method and system for calculating a relevance between words using a document set are provided. The method of calculating the relevance between words based on a document set includes: obtaining statistical information about the words based on at least one of the words, documents, a word classification of the words, and a document classification of the documents, wherein the words and the documents are included in the document set; standardizing the statistical information; and calculating the relevance between the words based on the standardized statistical information.
23 Claims
1. A method, using a processor, of calculating relevance among words based on a relevance of each word in a document, the method comprising:
generating statistical information associated with relevance among words by calculating a crossing frequency of words associated with a number of times of each of cross-word being appeared in a document, an appearance frequency of a word, or a word-word combination frequency associated with an appearance and a non-appearance of a combination of a first word and a second word, wherein the appearance frequency is a number of times that a word appears and frequency information is generated based on one of the appearance frequency or the crossing frequency, or the word-word combination frequency to provide the statistical information, the calculation being performed by the processor according to word-word or word-document classification;
standardizing the statistical information by applying a parameter to the calculated statistical information, wherein the standardizing the statistical information comprises generating a combination probability distribution of a random variable corresponding to a pair of words and standardizing the statistical information based on the word-word combination frequency, wherein the word-word combination frequency associated with the pair of words is a number of documents that include all words in the pair, a number of documents that do not include any word in the pair, and a number of documents that include one of the words in the pair, and wherein the random variable is defined in a point space of columns and rows that comprise appearance or non-appearance points of the word;
determining, by the processor, the relevance among the words as a numerical value based on the standardization; and
providing the numerical value associated with the relevance among words to a search system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

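The word-word combination frequency and combination probability distribution recited in claim 1 can be pictured with a small sketch. The following Python is illustrative only and is not taken from the patent: the function names, the whitespace tokenization, and the choice of mutual information as the numerical relevance value are all assumptions, since the claim does not name a specific measure. The sketch keeps the two "one word only" cases as separate cells, whereas the claim groups them as documents that include one of the words in the pair.

```python
import math
from collections import Counter


def combination_counts(docs, w1, w2):
    """Count documents by appearance (1) or non-appearance (0) of w1 and w2.

    Keys are (w1_appears, w2_appears) pairs, i.e. the four cells of a
    2x2 appearance/non-appearance table. Tokenization by whitespace is
    an assumption made for this sketch.
    """
    counts = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        counts[(int(w1 in words), int(w2 in words))] += 1
    return counts


def joint_distribution(counts):
    """Turn the document counts into a joint probability table."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document set is empty")
    return {(a, b): counts[(a, b)] / total for a in (0, 1) for b in (0, 1)}


def mutual_information(joint):
    """Mutual information in bits (an assumed relevance measure)."""
    p1 = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}  # marginal of w1
    p2 = {b: joint[(0, b)] + joint[(1, b)] for b in (0, 1)}  # marginal of w2
    score = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # zero-probability cells contribute nothing
            score += p * math.log2(p / (p1[a] * p2[b]))
    return score


docs = ["apple pie recipe", "apple pie", "car engine", "car repair"]
joint = joint_distribution(combination_counts(docs, "apple", "pie"))
relevance = mutual_information(joint)
```

In this toy document set, "apple" and "pie" always appear together or not at all, so the score is at its maximum for a balanced pair. Note that mutual information is symmetric in co-occurrence and mutual exclusion, so a production system would likely need a signed or otherwise adjusted measure.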
14. A method, using a processor, of calculating relevance among words based on a relevance of each word in a document, the method comprising:
generating statistical information associated with relevance among words by calculating a crossing frequency of words associated with a number of times of each of cross-word being appeared in a document, an appearance frequency of a word, or a word-word combination frequency associated with an appearance and a non-appearance of a combination of a first word and a second word, wherein the appearance frequency is a number of times that a word appears and frequency information is generated based on one of the appearance frequency or the crossing frequency, or the word-word combination frequency to provide the statistical information, the calculation is performed by the processor according to word-word and word-document classification;
standardizing the statistical information by applying a multi-dimensional vector set to the statistical information, wherein the standardizing the statistical information comprises generating a combination probability distribution of a random variable corresponding to a pair of words and standardizing the statistical information based on the word-word combination frequency, wherein the word-word combination frequency associated with the pair of words is a number of documents that include all words in the pair, a number of documents that do not include any word in the pair, and a number of documents that include one of the words in the pair, and wherein the random variable is defined in a point space of columns and rows that comprise appearance or non-appearance points of the word;
determining, by the processor, the relevance among the words as a numerical value based on the standardization; and
providing the numerical value associated with the relevance among words to a search system.
Dependent claims: 15

16. A method, using a processor, of calculating a relevance among words based on a relevance of each word in a document, the method comprising:
generating statistical information associated with relevance among words by calculating a crossing frequency of words associated with a number of times of each of cross-word being appeared in a document, an appearance frequency of a word, or a word-word combination frequency associated with an appearance and a non-appearance of a combination of a first word and a second word, wherein the appearance frequency is a number of times that a word appears and frequency information is generated based on one of the appearance frequency or the crossing frequency, or the word-word combination frequency to provide the statistical information, the calculation is performed by the processor according to word-word and word-document classification;
standardizing the statistical information by applying a parameter to the calculated statistical information, wherein the standardizing the statistical information comprises generating a combination probability distribution of a random variable corresponding to a pair of words and standardizing the statistical information based on the word-word combination frequency, wherein the word-word combination frequency associated with the pair of words is a number of documents that include all words in the pair, a number of documents that do not include any word in the pair, and a number of documents that include one of the words in the pair, and wherein the random variable is defined in a point space of columns and rows that comprise appearance or non-appearance points of the word;
determining, by the processor, the relevance among the words as a numerical value based on the standardization; and
providing the numerical value associated with the relevance among words to a search system.
Dependent claims: 17

18. A system for calculating relevance among words based on relevance of each word in a document, the system comprising:
a statistical information unit, coupled to a processor, to generate statistical information associated with relevance among words according to word-word and word-document classification by calculating a crossing frequency of words associated with a number of times of each of cross-word being appeared in a document, an appearance frequency of a word, or a word-word combination frequency associated with an appearance and a non-appearance of a combination of a first word and a second word, wherein the appearance frequency is a number of times that a word appears and frequency information is generated based on one of the appearance frequency or the crossing frequency, or the word-word combination frequency to provide the statistical information, the calculation is performed by the processor according to the classification; and
a standardization unit to standardize the statistical information by applying a parameter to the statistical information, wherein to standardize the statistical information comprises to generate a combination probability distribution of a random variable corresponding to a pair of words and to standardize the statistical information based on the word-word combination frequency, wherein the word-word combination frequency associated with the pair of words is a number of documents that include all words in the pair, a number of documents that do not include any word in the pair, and a number of documents that include one of the words in the pair, and wherein the random variable is defined in a point space of columns and rows that comprise appearance or non-appearance points of the word, and wherein the relevance among the words is determined by the processor based on the standardized statistical information.
Dependent claims: 19, 20, 21, 22, 23

Specification