Bootstrapping sense characterizations of occurrences of polysemous words

US 6,078,878 A
Filed: 07/31/1997
Issued: 06/20/2000
Est. Priority Date: 07/31/1997
Status: Expired due to Term

Abstract

The present invention is directed to characterizing the sense of an occurrence of a polysemous word in a representation of a dictionary. In a preferred embodiment, the representation of the dictionary is made up of a plurality of text segments containing word occurrences having a word sense characterization and word occurrences not having a word sense characterization. The embodiment first selects a plurality of the dictionary text segments that each contain a first word. The embodiment then identifies from among the selected text segments a first and a second occurrence of a second word. The identified second occurrence of the second word has a word sense characterization. The embodiment then attributes to the first occurrence of the second word the word sense characterization of the second occurrence of the second word.
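
As an illustration only, the attribution described in the abstract can be sketched in Python. The `Occurrence` class, the `attribute_senses` function, the sample words, and the sense labels are all hypothetical names invented for this sketch; the patent does not prescribe a data layout, and the "first characterized occurrence wins" tie-break is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occurrence:
    """One word occurrence inside a dictionary text segment (hypothetical layout)."""
    word: str
    sense: Optional[str] = None  # None means "no word sense characterization"


def attribute_senses(segments: List[List[Occurrence]], first_word: str) -> int:
    """Copy senses between occurrences found in segments that contain first_word."""
    # Select the text segments that each contain the first word.
    selected = [seg for seg in segments if any(o.word == first_word for o in seg)]

    # Record one characterized occurrence per word. Keeping the first one
    # seen is an assumption of this sketch, not something the abstract states.
    known = {}
    for seg in selected:
        for occ in seg:
            if occ.word != first_word and occ.sense is not None:
                known.setdefault(occ.word, occ.sense)

    # Attribute the recorded sense to uncharacterized occurrences of the same word.
    updated = 0
    for seg in selected:
        for occ in seg:
            if occ.sense is None and occ.word in known:
                occ.sense = known[occ.word]
                updated += 1
    return updated


# Made-up sample data: three text segments, two of which contain "flower".
segments = [
    [Occurrence("flower"), Occurrence("plant", "plant#vegetation")],
    [Occurrence("flower"), Occurrence("plant")],
    [Occurrence("factory"), Occurrence("plant")],
]
count = attribute_senses(segments, "flower")
```

In this sample, only the uncharacterized "plant" that shares a selected segment set with "flower" receives a sense; the occurrence in the "factory" segment is left alone because that segment was never selected.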

41 Claims

1. A method in a computer system, the method performed in a lexical knowledge base derived from one or more corpora, the lexical knowledge base comprising a network of nodes each representing a word occurrence in the corpora, the lexical knowledge base having word subgraphs each corresponding to one word and containing text segment subgraphs derived from text segments containing the word, the method characterizing the sense of an occurrence of a polysemous word represented as a node of the lexical knowledge base and comprising the steps of:
- selecting a word subgraph of the lexical knowledge base corresponding to a first word;
  
  identifying within the selected word subgraph a first node representing a first occurrence of a second word, the first node having no word sense characterization;
  
  identifying within the selected word subgraph a second node representing a second occurrence of the second word, the second node having a word sense characterization; and
  
  copying the word sense characterization of the second node to the first node.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
  - 2. The method of claim 1, further comprising the step of deriving the lexical knowledge base from one or more dictionaries.
  - 3. The method of claim 1 wherein the copying step is only performed where a condition relating to the first and second nodes is satisfied.
  - 4. The method of claim 1 wherein the copying step is only performed where the first and second nodes share the same part of speech.
  - 5. The method of claim 1 wherein the first and second nodes both represent word occurrences having the verb part of speech, and wherein the copying step is only performed where the first and second nodes either both represent transitive verbs or both represent intransitive verbs.
  - 6. The method of claim 1 wherein the copying step is only performed where the second node has no positive register features.
  - 7. The method of claim 1, further comprising the steps of:
    - identifying within the selected word subgraph a third node representing a third occurrence of the third word, the third node having a word sense characterization; and
      
      determining to copy to the first node the word sense characterization of the second node rather than the word sense characterization of the third node based upon a characteristic of the second node.
  - 8. The method of claim 7 wherein the determining step determines to copy to the first node the word sense characterization of the second node rather than the word sense characterization of the third node based upon a determination that the first and second nodes are derived from the same corpus, while the first and third nodes are not derived from the same corpus.
  - 9. The method of claim 7 wherein the determining step determines to copy to the first node the word sense characterization of the second node rather than the word sense characterization of the third node based upon a determination that the weight of the path from the head of the word subgraph to the second node exceeds the weight of the path from the head of the word subgraph to the third node.
  - 10. The method of claim 1 wherein the identifying steps identify first and second nodes within a proper subset of the paths comprising the selected word subgraph.
  - 11. The method of claim 10, further comprising selecting the subset of paths from the selected word subgraph by selecting from the selected word subgraph a predetermined number of paths having the highest weights.
  - 12. The method of claim 10, further comprising selecting the subset of paths from the selected word subgraph by selecting all paths contained within the selected word subgraph having a path weight at least as large as a minimum path weight.
  - 13. The method of claim 10 wherein the identifying steps and the copying step are repeated for each node in the subset of paths having no word sense characterization.
  - 14. The method of claim 1 wherein the identifying steps and the copying step are repeated for each of a plurality of nodes in the selected word subgraph having no word sense characterization.
  - 15. The method of claim 1 wherein the identifying steps and the copying step are repeated for each of a plurality of nodes having no word sense characterization, in each word subgraph of the lexical knowledge base.
  - 16. The method of claim 15, further comprising the steps of, for each first node for which the identifying steps and the copying step are repeated:
    - identifying each occurrence of the text segment subgraph of the selected word subgraph containing the first node in the lexical knowledge base;
      
      in each identified occurrence of the text segment subgraph, determine whether the node corresponding to the first node has a word sense characterization; and
      
      copying the word sense characterization copied to the first node to each identified corresponding node having no word sense characterization.
  - 17. The method of claim 1, further comprising the steps of:
    - identifying an occurrence of the text segment subgraph within the selected word subgraph containing the first node in another word subgraph of the lexical knowledge base;
      
      in the identified occurrence of the text segment subgraph, identify the node corresponding to the first node; and
      
      copying the word sense characterization copied to the first node to the identified corresponding node.
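
The node-level method of claims 1 to 17 could be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `Node` fields, the function names, the corpus labels, and the sense labels are hypothetical, and combining the same-corpus preference (claim 8) with the path-weight preference (claim 9) into one ordered key is a choice made here for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A word occurrence in a word subgraph (field names are assumptions)."""
    word: str
    pos: str                      # part of speech, e.g. "noun" or "verb"
    corpus: str                   # corpus the node was derived from
    path_weight: float            # weight of the path from the subgraph head
    sense: Optional[str] = None   # None means no word sense characterization
    transitive: Optional[bool] = None  # only meaningful for verbs
    register_features: List[str] = field(default_factory=list)


def eligible(target: Node, donor: Node) -> bool:
    """Copy conditions in the spirit of claims 4 to 6."""
    if donor.word != target.word or donor.sense is None:
        return False
    if donor.pos != target.pos:
        return False  # claim 4: same part of speech
    if target.pos == "verb" and donor.transitive != target.transitive:
        return False  # claim 5: matching transitivity
    return not donor.register_features  # claim 6: no positive register features


def choose_donor(target: Node, donors: List[Node]) -> Optional[Node]:
    """Prefer a donor from the same corpus, then the heaviest path."""
    candidates = [d for d in donors if eligible(target, d)]
    if not candidates:
        return None
    return max(candidates, key=lambda d: (d.corpus == target.corpus, d.path_weight))


def characterize(nodes: List[Node]) -> int:
    """Repeat the identify-and-copy steps for each uncharacterized node (claim 14)."""
    donors = [n for n in nodes if n.sense is not None]  # snapshot before copying
    copied = 0
    for target in [n for n in nodes if n.sense is None]:
        donor = choose_donor(target, donors)
        if donor is not None:
            target.sense = donor.sense
            copied += 1
    return copied


# Made-up word subgraph contents; corpus and sense labels are placeholders.
nodes = [
    Node("bank", "noun", "corpus_a", 0.2),
    Node("bank", "noun", "corpus_b", 0.9, sense="bank#river"),
    Node("bank", "noun", "corpus_a", 0.4, sense="bank#finance"),
    Node("bank", "verb", "corpus_a", 0.8, sense="bank#tilt", transitive=False),
]
copied = characterize(nodes)
```

Here the uncharacterized noun takes the sense of the donor from the same corpus even though another donor sits on a heavier path, and the verb occurrence is never considered because its part of speech differs.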

18. A computer-readable medium whose contents cause a computer system to characterize the sense of an occurrence of a polysemous word in a lexical knowledge base derived from one or more corpora each comprising a plurality of text segments, by performing the steps of:
- selecting a plurality of text segments each containing a first word;
  
  identifying among the selected text segments a first and second occurrence of a second word, the second occurrence of the second word having a word sense characterization; and
  
  attributing to the first occurrence of the second word the word sense characterization of the second occurrence of the second word.
- Dependent Claims (19, 20, 21, 22)
  - 19. The computer-readable medium of claim 18 wherein the second occurrence of the second word is in a text segment for defining or exemplifying the usage of the second word.
  - 20. The computer-readable medium of claim 18 wherein the contents of the computer-readable medium further cause the computer system to perform the steps of:
    - identifying among the selected text segments a third occurrence of the second word, the third occurrence of the second word, like the second occurrence of the second word, having a word sense characterization; and
      
      determining to attribute to the first occurrence of the second word the word sense characterization of the second occurrence of the second word rather than the word sense characterization of the third occurrence of the second word based upon a characteristic of the second occurrence of the second word.
  - 21. The computer-readable medium of claim 20 wherein the determining step determines to attribute to the first occurrence of the second word the word sense characterization of the second occurrence of the second word rather than the word sense characterization of the third occurrence of the second word based upon a determination that the first and second occurrences of the second word are from the same corpus, while the first and third occurrences of the second word are not from the same corpus.
  - 22. The computer-readable medium of claim 20 wherein the determining step determines to attribute to the first occurrence of the second word the word sense characterization of the second occurrence of the second word rather than the word sense characterization of the third occurrence of the second word based upon a determination that the second occurrence of the second word is more closely related to the first word than is the third occurrence of the second word.

23. A method in a computer system, the method performed in a lexical knowledge base derived from one or more dictionaries, the lexical knowledge base comprising a network of nodes each representing a word occurrence in the dictionaries, the lexical knowledge base containing text segment subgraphs each comprising a plurality of nodes and derived from dictionary text segments, the method characterizing the sense of an occurrence of a polysemous word represented as a node of the lexical knowledge base and comprising the steps of:
- (a) selecting a pair of words having a high level of semantic coherency;
  
  (b) identifying in the lexical knowledge base a plurality of text segment subgraphs between the words of the pair;
  
  (c) identifying within the identified plurality of text segment subgraphs a first node having no word sense characterization and representing a first occurrence of a first word;
  
  (d) identifying within the identified plurality of text segment subgraphs a second node having a word sense characterization and representing a second occurrence of the second word; and
  
  (e) copying the word sense characterization of the second node to the first node.
- Dependent Claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
  - 24. The method of claim 23 wherein the selecting step selects a pair of words that are synonyms.
  - 25. The method of claim 23 wherein the selecting step selects a pair of words having a hypernym/hyponym relationship.
  - 26. The method of claim 23 wherein the selecting step selects a pair of words having a verb/typical object relationship.
  - 27. The method of claim 23, further comprising the step of: (f) repeating steps (c)-(e) for each node within the identified plurality of text segment subgraphs having no word sense characterization.
  - 28. The method of claim 23 wherein step (b) identifies a proper subset of the text segment subgraphs between the words of the pair having the highest weights.
  - 29. The method of claim 23 wherein steps (a)-(e) are repeated for each of a plurality of word pairs having a high level of semantic coherency, and wherein steps (c)-(e) are repeated for each node within the identified plurality of text segment subgraphs having no word sense characterization for each of the plurality of word pairs.
  - 30. The method of claim 23 wherein the copying step is only performed where a condition relating to the first and second nodes is satisfied.
  - 31. The method of claim 23 wherein the copying step is only performed where the first and second nodes share the same part of speech.
  - 32. The method of claim 23 wherein the first and second nodes both represent word occurrences having the verb part of speech, and wherein the copying step is only performed where the first and second nodes either both represent transitive verbs or both represent intransitive verbs.
  - 33. The method of claim 23 wherein the copying step is only performed where the second node has no positive register features.
  - 34. The method of claim 23, further comprising the steps of:
    - identifying within the identified plurality of lexical paths a third node representing a third occurrence of the third word, the third node having a word sense characterization; and
      
      determining to copy to the first node the word sense characterization of the second node rather than the word sense characterization of the third node based upon a characteristic of the second node.
  - 35. The method of claim 34 wherein the determining step determines to copy to the first node the word sense characterization of the second node rather than the word sense characterization of the third node based upon a determination that the first and second nodes are derived from the same dictionary, while the first and third nodes are not derived from the same dictionary.
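
A rough sketch of the word-pair variant in claims 23 to 28 follows. It assumes that the text segment subgraphs between a pair of words are already available as weighted paths; the `Path` class, the `paths_between` mapping, the `top_n` parameter, and the sample words and senses are hypothetical. How a pair with "a high level of semantic coherency" is found is outside the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    word: str
    sense: Optional[str] = None  # None means no word sense characterization


@dataclass
class Path:
    """A text segment subgraph linking two words (hypothetical representation)."""
    weight: float
    nodes: List[Node]


def bootstrap_pair(paths_between: Dict[Tuple[str, str], List[Path]],
                   pair: Tuple[str, str], top_n: int) -> int:
    # Step (b): keep only the top_n heaviest paths between the pair (claim 28).
    ranked = sorted(paths_between.get(pair, []), key=lambda p: p.weight, reverse=True)
    nodes = [n for p in ranked[:top_n] for n in p.nodes]

    # Step (d): index characterized nodes by word. Nodes on heavier paths are
    # visited first, so they win ties (an assumption of this sketch).
    senses: Dict[str, str] = {}
    for n in nodes:
        if n.sense is not None:
            senses.setdefault(n.word, n.sense)

    # Steps (c), (e) and (f): copy to every uncharacterized node of the same word.
    copied = 0
    for n in nodes:
        if n.sense is None and n.word in senses:
            n.sense = senses[n.word]
            copied += 1
    return copied


# Made-up data: a synonym pair (claim 24) linked by three weighted paths.
pair = ("car", "automobile")
paths_between = {
    pair: [
        Path(0.9, [Node("car"), Node("vehicle", "vehicle#conveyance"), Node("automobile")]),
        Path(0.7, [Node("car"), Node("vehicle"), Node("automobile")]),
        Path(0.1, [Node("car"), Node("vehicle"), Node("automobile")]),
    ]
}
copied = bootstrap_pair(paths_between, pair, top_n=2)
```

With `top_n=2`, the lowest-weight path is excluded, so its "vehicle" node stays uncharacterized while the one on the second path inherits the sense from the first.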

36. A method in a computer system for bootstrapping the sense characterization of some nodes of a lexical knowledge base to additional nodes of the lexical knowledge base, the lexical knowledge base comprising a network of nodes each representing a word occurrence in the dictionaries, the lexical knowledge base having word subgraphs each corresponding to one word and containing text segment subgraphs derived from dictionary text segments containing the word, the method comprising the steps of:
- (a) for each of the word subgraphs:
  
  (1) selecting a proper subset of the text segment subgraphs of the word subgraph having the highest weights;
  
  (2) selecting within the selected subset of text segment subgraphs each node, other than nodes representing the word to which the word subgraph corresponds, not having a sense characterization;
  
  (3) for each selected node;
  
  (A) identifying within the selected subset of text segment subgraphs each node that represents the same word as the selected node and has a sense characterization;
  
  (B) rejecting any identified nodes having distinguishing features;
  
  (C) choosing one node from the unrejected identified nodes; and
  
  (D) copying to the selected node the sense characterization of the chosen node; and
  
  (b) for each selected node;
  
  (1) copying the new sense characterization of the selected node to a node corresponding to the selected node in each reoccurrence within the lexical knowledge base of the text segment subgraph containing the selected node.
- Dependent Claims (37, 38, 39, 40, 41)
  - 37. The method of claim 36 wherein steps (a) and (b) are repeated a plurality of times.
  - 38. The method of claim 37 wherein step (a)(1) selects a number of text segment subgraphs that increases each time steps (a) and (b) are repeated.
  - 39. The method of claim 37, further comprising the step of:
    - after steps (a) and (b) are repeated a plurality of times, for each node not having a sense characterization;
      
      assigning to the node a default characterization for the word represented by the node, such that each word in the lexical knowledge base has a sense characterization.
  - 40. The method of claim 36 wherein the choosing step chooses an unrejected identified node derived from the same dictionary as the selected node.
  - 41. The method of claim 36 wherein the choosing step chooses the unrejected identified node connected to the head of the word subgraph by the path of the highest weight.
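
The iterative procedure of claims 36 to 39 might look roughly like the sketch below. Every name in it is hypothetical: `SegmentOccurrence`, `bootstrap`, the `features` field standing in for "distinguishing features", the `defaults` table, and the sample data. It further assumes that reoccurrences of a text segment subgraph share a `segment_id` and keep their nodes in the same order, that a donor is rejected when its features differ from the target's, and that the segment weight can stand in for the path weight when choosing a donor. The sketch does not enforce that the selected subset is a proper subset.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    word: str
    sense: Optional[str] = None
    features: frozenset = frozenset()  # stand-in for "distinguishing features"


@dataclass
class SegmentOccurrence:
    segment_id: str   # same id means the same text segment subgraph reoccurring
    weight: float
    nodes: List[Node]


def bootstrap(kb: Dict[str, List[SegmentOccurrence]], defaults: Dict[str, str],
              rounds: int = 2, start_k: int = 1) -> None:
    for r in range(rounds):                      # claim 37: repeat (a) and (b)
        k = start_k + r                          # claim 38: subset grows each round
        newly = []                               # (segment_id, position, sense)
        for head, occurrences in kb.items():     # step (a)
            top = sorted(occurrences, key=lambda s: s.weight, reverse=True)[:k]
            donors = [(s.weight, n) for s in top for n in s.nodes if n.sense is not None]
            for seg in top:
                for pos, node in enumerate(seg.nodes):
                    if node.word == head or node.sense is not None:
                        continue                 # (a)(2): skip head word and known senses
                    ok = [(w, d) for w, d in donors
                          if d.word == node.word and d.features == node.features]
                    if not ok:
                        continue                 # (a)(3)(A)-(B): nothing left after rejection
                    _, chosen = max(ok, key=lambda wd: wd[0])  # (a)(3)(C)
                    node.sense = chosen.sense                  # (a)(3)(D)
                    newly.append((seg.segment_id, pos, node.sense))
        for seg_id, pos, sense in newly:         # step (b): propagate to reoccurrences
            for occurrences in kb.values():
                for seg in occurrences:
                    if seg.segment_id == seg_id and seg.nodes[pos].sense is None:
                        seg.nodes[pos].sense = sense
    for occurrences in kb.values():              # claim 39: default for the remainder
        for seg in occurrences:
            for node in seg.nodes:
                if node.sense is None:
                    node.sense = defaults[node.word]


# Made-up knowledge base: segment "s2" occurs in two word subgraphs.
kb = {
    "flower": [
        SegmentOccurrence("s1", 0.9, [Node("flower"), Node("plant", "plant#1")]),
        SegmentOccurrence("s2", 0.8, [Node("flower"), Node("plant"), Node("seed")]),
    ],
    "seed": [
        SegmentOccurrence("s2", 0.5, [Node("flower"), Node("plant"), Node("seed")]),
    ],
}
defaults = {"flower": "flower#default", "plant": "plant#default", "seed": "seed#default"}
bootstrap(kb, defaults)
```

In this sample nothing is copied in the first round because only one segment per subgraph is considered. The second round widens the subset, copies the "plant" sense inside the "flower" subgraph, and then propagates it to the reoccurrence of segment "s2" in the "seed" subgraph before defaults fill the rest.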

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Dolan, William B.
Primary Examiner(s): Thomas, Joseph
Application Number: US08/904,422
Time in Patent Office: 1,055 Days
Field of Search: 704/9, 704/10, 704/1, 704/257, 707/3, 707/4, 707/5, 707/6, 707/104, 707/530, 707/531, 707/532, 707/533, 707/534, 434/167, 434/169, 434/156, 706/934, 706/927
US Class Current: 704/9
CPC Class Codes: G06F 40/247 Thesauruses; Synonyms
