System and methods for determining semantic similarity of sentences
Abstract
A system and associated methods determine the semantic similarity of different sentences to one another. A particularly appropriate application of the present invention is to automatic processing of Chinese-language text, for example, for document retrieval. A method for computing the similarity between a first and a second set of words comprises identifying a word of the second set of words as being most similar to a word of the first set of words, wherein the word of the second set of words need not be identical to the word of the first set of words; and computing a score of the similarity between the first and second set of words based at least in part on the word of the second set of words.
33 Claims
1. A method for determining a similarity between a first word and a second word, the method comprising:
- based on at least one definition of the first word, associating the first word with at least one definition class and at least one corresponding definition weight;
- based on at least one definition of the second word, associating the second word with at least one definition class and at least one corresponding definition weight;
- identifying common definition classes, the common definition classes including one or more of the at least one definition class associated with both the first and second words; and
- determining the similarity based on the definition weights corresponding to: the first word and the common definition classes, and the second word and the common definition classes.
2. The method of claim 1, wherein determining includes:
- determining a first number representing one or more of the at least one definition class associated with the first word, and a second number representing one or more of the at least one definition class associated with the second word, and wherein determining the similarity includes determining the similarity based on the first number and the second number.
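As a minimal sketch of the method of claim 1 together with the normalization by class counts in the claim that follows it, the Python below represents each word as a mapping from definition class to definition weight, finds the common classes, and scores the pair from the weights on those classes. The claims do not fix a formula: the Dice-style ratio, the assumption that weights lie in [0, 1], the name `semantic_similarity`, and the sample classes and weights are all hypothetical choices made for illustration.

```python
def semantic_similarity(first_defs, second_defs):
    """Score two words from their {definition_class: weight} mappings.

    Assumed formula (not taken from the claims): for every common
    definition class, multiply the two words' weights, sum the products,
    and normalize by the two class counts.
    """
    n_first = len(first_defs)    # "first number": classes of the first word
    n_second = len(second_defs)  # "second number": classes of the second word
    if n_first + n_second == 0:
        return 0.0
    # Common definition classes: classes associated with both words.
    common = first_defs.keys() & second_defs.keys()
    shared = sum(first_defs[c] * second_defs[c] for c in common)
    return 2 * shared / (n_first + n_second)


# Hypothetical definition classes and weights for two words.
doctor = {"human": 1.0, "medical": 0.8, "occupation": 0.6}
nurse = {"human": 1.0, "medical": 0.8}
score = semantic_similarity(doctor, nurse)
```

With weights in [0, 1], this ratio stays between 0 (no common classes) and 1 (identical classes, all weighted 1), which makes scores comparable across words with different numbers of definitions.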
3. The method of claim 1, further comprising:
- associating the first word with at least one speech class based on at least one part of speech of the first word; and
- associating the second word with at least one speech class based on at least one part of speech of the second word;
- and wherein determining includes determining the similarity based on the at least one speech class associated with the first word and the at least one speech class associated with the second word.
4. The method of claim 3, wherein the at least one part of speech includes at least one of an adjective, an adverb, a conjunction, a noun, a preposition, a pronoun, and a verb.
5. The method of claim 3, wherein determining the similarity includes:
- determining a first number representing one or more of the at least one speech class associated with the first word, a second number representing one or more of the at least one speech class associated with the second word, and a third number representing zero or more of the at least one speech class associated with both the first and second words; and
- determining the similarity based on the first number, the second number, and the third number.
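The three numbers recited in claim 5 can be sketched as set sizes. The Python below is only an illustration: the claim does not say how the numbers are combined, so the Dice coefficient used here, the name `syntactic_similarity`, and the sample speech classes are assumptions.

```python
def syntactic_similarity(first_classes, second_classes):
    """Compare two words by the speech classes (parts of speech) they can take.

    Assumed combination (not from the claims): Dice coefficient,
    2 * common / (first + second).
    """
    first, second = set(first_classes), set(second_classes)
    n_first = len(first)             # first number
    n_second = len(second)           # second number
    n_common = len(first & second)   # third number (may be zero)
    if n_first + n_second == 0:
        return 0.0
    return 2 * n_common / (n_first + n_second)


# Hypothetical speech classes: one word usable as noun or verb, one as noun only.
score = syntactic_similarity({"noun", "verb"}, {"noun"})
```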
6. The method of claim 1, wherein determining the similarity includes:
- determining a number representing zero or more common characters included in both the first and second words; and
- determining the similarity based on the number of common characters.
7. The method of claim 1, wherein determining the similarity includes:
- determining a first number representing one or more characters included in the first word, a second number representing one or more characters included in the second word, and a third number representing zero or more common characters included in both the first and second words; and
- determining the similarity based on the first number, the second number, and the third number.
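The character counts in claims 6 and 7 can be illustrated with a short Python sketch. The way the three numbers are combined is an assumption (a Dice-style ratio over character multisets), as are the name `lexical_similarity` and the example words.

```python
from collections import Counter


def lexical_similarity(first_word, second_word):
    """Compare two words by the characters they share.

    Assumed combination (not from the claims):
    2 * common_characters / (len(first_word) + len(second_word)).
    """
    n_first = len(first_word)    # first number: characters in the first word
    n_second = len(second_word)  # second number: characters in the second word
    if n_first + n_second == 0:
        return 0.0
    # Multiset intersection counts a repeated character once per shared occurrence.
    common = Counter(first_word) & Counter(second_word)
    n_common = sum(common.values())  # third number (may be zero)
    return 2 * n_common / (n_first + n_second)


# "医生" (doctor) and "医院" (hospital) share the single character "医".
score = lexical_similarity("医生", "医院")
```

Character overlap is most informative for ideographic scripts, where a shared character often signals related meaning.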
8. The method of claim 6, wherein the characters include characters of at least one ideographic language.
9. The method of claim 6, wherein the characters include characters of at least one of Chinese and Japanese.
10. The method of claim 1, wherein each of the definition weights is based on at least one of a type and a degree of relationship between a word and a definition class.
11. A method for determining a similarity between a first word and a second word, the method comprising:
- based on at least one definition of the first word, associating the first word with at least one definition class and at least one corresponding definition weight;
- based on at least one definition of the second word, associating the second word with at least one definition class and at least one corresponding definition weight;
- identifying common definition classes, the common definition classes including one or more of the at least one definition class associated with both the first and second words; and
- determining the similarity based on the definition weights corresponding to: the first word and the common definition classes, the second word and the common definition classes, and a number representing zero or more common characters included in both the first and second words.
12. The method of claim 11, wherein determining the similarity includes:
- determining the similarity based on a first number representing one or more characters included in the first word, and a second number representing one or more characters included in the second word.
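Claim 11 folds the common-character count into the same similarity that uses the definition weights. One way to picture this, offered only as a sketch, is a weighted blend of a semantic score and a lexical score. The blend, the parameter `alpha`, the Dice-style ratios, and all names and sample data below are assumptions rather than anything the claims specify.

```python
def combined_similarity(first_word, first_defs, second_word, second_defs, alpha=0.7):
    """Blend a definition-class score with a common-character score.

    alpha is a hypothetical mixing weight: 1.0 uses only the semantic
    part, 0.0 only the lexical part.
    """
    # Semantic part: weights on the common definition classes.
    common_classes = first_defs.keys() & second_defs.keys()
    n_classes = len(first_defs) + len(second_defs)
    weight_sum = sum(first_defs[c] * second_defs[c] for c in common_classes)
    semantic = 2 * weight_sum / n_classes if n_classes else 0.0

    # Lexical part: distinct characters that appear in both words.
    first_chars, second_chars = set(first_word), set(second_word)
    n_chars = len(first_chars) + len(second_chars)
    lexical = 2 * len(first_chars & second_chars) / n_chars if n_chars else 0.0

    return alpha * semantic + (1 - alpha) * lexical


# Hypothetical definition classes for "医生" (doctor) and "医院" (hospital).
score = combined_similarity(
    "医生", {"human": 1.0, "medical": 1.0},
    "医院", {"medical": 1.0, "place": 1.0},
)
```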
13. The method of claim 11, wherein the characters include characters of at least one ideographic language.
14. The method of claim 11, wherein the characters include characters of at least one of Chinese and Japanese.
15. The method of claim 11, wherein determining includes:
- determining a first number representing one or more of the at least one definition class associated with the first word, and a second number representing one or more of the at least one definition class associated with the second word, and wherein determining the similarity includes determining the similarity based on the first number and the second number.
16. The method of claim 11, wherein each of the definition weights is based on at least one of a type and a degree of relationship between a word and a definition class.
17. A method for determining a similarity between a first set of at least one first word and a second set of at least one second word, the method comprising:
- determining word-word similarities between one of the at least one first word and one of the at least one second word, the word-word similarities based on a semantic similarity between the one of the at least one first word and the one of the at least one second word, the semantic similarity between the one of the at least one first word and the one of the at least one second word based on definition weights corresponding to: the one of the at least one first word and common definition classes, and the one of the at least one second word and common definition classes, in which the one of the at least one first word is associated with at least one definition class and at least one corresponding definition weight based on at least one definition of the one of the at least one first word, the one of the at least one second word is associated with at least one definition class and at least one corresponding definition weight based on at least one definition of the one of the at least one second word, and the common definition classes include one or more of the at least one definition class associated with both the one of the at least one first word and the one of the at least one second word; and
- determining the similarity between the first set and the second set based on the word-word similarities between each of the at least one first word and each of the at least one second word.
18. The method of claim 17, wherein determining the word-word similarities includes:
- accessing pre-determined word-word similarities.
19. The method of claim 17, wherein determining the word-word similarities further includes:
- determining the word-word similarities based on a lexical similarity between the one of the at least one first word and the one of the at least one second word, and wherein the lexical similarity between the one of the at least one first word and the one of the at least one second word is based on a number representing zero or more common characters included in both the one of the at least one first word and the one of the at least one second word.
20. The method of claim 19, wherein the characters are characters of at least one of Chinese and Japanese.
21. The method of claim 17, wherein determining the word-word similarities further includes:
- determining the word-word similarities based on a syntactic similarity between each of the at least one first word and each of the at least one second word, and wherein the syntactic similarity between one of the at least one first word and one of the at least one second word is based on: at least one speech class associated with the one of the at least one first word based on at least one part of speech of the one of the at least one first word, and at least one speech class associated with the one of the at least one second word based on at least one part of speech of the one of the at least one second word.
22. The method of claim 21, wherein the syntactic similarity between the one of the at least one first word and the one of the at least one second word is based on a number representing zero or more of the at least one speech class associated with both the one of the at least one first word and the one of the at least one second word.
23. The method of claim 17, wherein determining the similarity includes:
- for each word of the at least one first word, determining one of the at least one second word having the greatest word-word similarity for the each word of the at least one first word; and
- based on the greatest word-word similarity for each word of the at least one first word, determining the similarity between the first set and the second set.
24. The method of claim 17, wherein each of the definition weights is based on at least one of a type and a degree of relationship between a word and a definition class.
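The best-match step of claim 23 can be sketched as follows: each word of the first set is paired with its most similar word in the second set, and the best scores are then combined. Averaging the best scores is an assumption (the claim only says the set similarity is based on them), and the lookup table of pre-determined word-word similarities, the function names, and the sample words are hypothetical.

```python
# Hypothetical table of pre-determined word-word similarities.
PRECOMPUTED = {
    ("buy", "purchase"): 0.9,
    ("car", "automobile"): 0.95,
    ("buy", "automobile"): 0.1,
    ("car", "purchase"): 0.1,
}


def lookup_similarity(a, b):
    """Return a stored similarity; identical words score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return PRECOMPUTED.get((a, b), PRECOMPUTED.get((b, a), 0.0))


def set_similarity(first_words, second_words, word_similarity):
    """Average, over the first set, of each word's greatest word-word similarity."""
    if not first_words or not second_words:
        return 0.0
    best_scores = [
        max(word_similarity(f, s) for s in second_words) for f in first_words
    ]
    return sum(best_scores) / len(best_scores)


# "buy" pairs best with "purchase" and "car" with "automobile",
# even though no word is identical across the two sets.
score = set_similarity(["buy", "car"], ["purchase", "an", "automobile"], lookup_similarity)
```

Note that this measure is asymmetric: swapping the two sets can change the score, because only words of the first set look for a best match.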
25. A method for searching a database, the method comprising:
- receiving search data including a first set of at least one first word from a client;
- associating data in the database with at least one second set of at least one second word;
- for each of the at least one second set, determining word-word similarities between each of the at least one first word and each of the at least one second word, the word-word similarities based on a semantic similarity between each of the at least one first word and each of the at least one second word, the semantic similarity between one of the at least one first word and one of the at least one second word based on definition weights corresponding to: the one of the at least one first word and common definition classes, and the one of the at least one second word and common definition classes, in which the one of the at least one first word is associated with at least one definition class and at least one corresponding definition weight based on at least one definition of the one of the at least one first word, the one of the at least one second word is associated with at least one definition class and at least one corresponding definition weight based on at least one definition of the one of the at least one second word, and the common definition classes include one or more of the at least one definition class associated with both the one of the at least one first word and the one of the at least one second word;
- for each of the at least one second set, based on the word-word similarities between each of the at least one first word and each of the at least one second word, determining set-set similarities between the first set and the at least one second set; and
- based on the set-set similarities between the first set and the at least one second set, providing data associated with one of the at least one second set from the database to the client.
26. The method of claim 25, wherein determining the word-word similarities includes:
- accessing predetermined word-word similarities.
27. The method of claim 25, wherein determining the word-word similarities further includes:
- determining the word-word similarities based on a lexical similarity between one of the at least one first word and one of the at least one second word, and wherein the lexical similarity between the one of the at least one first word and the one of the at least one second word is based on a number representing zero or more common characters included in both the one of the at least one first word and the one of the at least one second word.
28. The method of claim 27, wherein the characters are characters of at least one of Chinese and Japanese.
29. The method of claim 25, wherein determining the word-word similarities further includes:
- determining the word-word similarities based on a syntactic similarity between each of the at least one first word and each of the at least one second word, and wherein the syntactic similarity between one of the at least one first word and one of the at least one second word is based on: at least one speech class associated with the one of the at least one first word based on at least one part of speech of the one of the at least one first word, and at least one speech class associated with the one of the at least one second word based on at least one part of speech of the one of the at least one second word.
30. The method of claim 29, wherein the syntactic similarity between the one of the at least one first word and the one of the at least one second word is based on a number representing zero or more of the at least one speech class associated with both the one of the at least one first word and the one of the at least one second word.
31. The method of claim 25, wherein determining the set-set similarities includes:
- for each word of the at least one first word, for each of the at least one second set, determining one of the at least one second word having the greatest word-word similarity for the each word of the at least one first word; and
- for each of the at least one second set, based on the greatest word-word similarity for each word of the at least one first word, determining the set-set similarity between the first set and each of the at least one second set.
32. The method of claim 25, wherein providing includes:
- providing data associated with the second set having the greatest set-set similarity for the first set from the database to the client.
33. The method of claim 25, wherein each of the definition weights is based on at least one of a type and a degree of relationship between a word and a definition class.
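The search flow of claims 25, 31, and 32 can be sketched end to end: score the query's word set against the word set attached to each database record, then return the data of the best-scoring record. Everything concrete below is hypothetical: the in-memory `DATABASE`, the `LEXICON` of definition classes and weights, the Dice-style word score, and the averaging of best matches are assumptions chosen to keep the example small, not details given in the claims.

```python
# Hypothetical lexicon: word -> {definition class: definition weight}.
LEXICON = {
    "doctor": {"human": 1.0, "medical": 1.0},
    "nurse": {"human": 1.0, "medical": 1.0},
    "hospital": {"place": 1.0, "medical": 1.0},
    "train": {"vehicle": 1.0},
    "station": {"place": 1.0},
}


def word_similarity(a, b):
    """Semantic score from weights on the common definition classes (assumed formula)."""
    defs_a, defs_b = LEXICON.get(a, {}), LEXICON.get(b, {})
    n = len(defs_a) + len(defs_b)
    if n == 0:
        return 0.0
    common = defs_a.keys() & defs_b.keys()
    return 2 * sum(defs_a[c] * defs_b[c] for c in common) / n


def set_similarity(query_words, record_words):
    """Average of each query word's greatest word-word similarity in the record."""
    if not query_words or not record_words:
        return 0.0
    best = [max(word_similarity(q, r) for r in record_words) for q in query_words]
    return sum(best) / len(best)


def search(query_words, database):
    """Return the data of the record with the greatest set-set similarity."""
    if not database:
        return None
    _, data = max(database, key=lambda rec: set_similarity(query_words, rec[0]))
    return data


# Each record pairs a word set with the data it describes.
DATABASE = [
    (["doctor", "hospital"], "doc-1: clinic directory"),
    (["train", "station"], "doc-2: rail timetable"),
]
result = search(["nurse"], DATABASE)
```

Here the query word "nurse" appears in no record, yet the first record is returned because "nurse" and "doctor" share definition classes. This is the non-identical matching the abstract describes.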
Specification