CONCEPTUAL WORLD REPRESENTATION NATURAL LANGUAGE UNDERSTANDING SYSTEM AND METHOD

US 20110179032A1
Filed: 03/01/2011
Published: 07/21/2011
Est. Priority Date: 07/12/2002
Status: Active Grant

Abstract

A Natural Language Understanding system is provided for indexing of free text documents. The system according to the invention utilizes typographical and functional segmentation of text to identify those portions of free text that carry meaning. The system then uses words and multi-word terms and phrases identified in the free text to identify concepts in the free text. The system uses a lexicon of terms linked to a formal ontology that is independent of a specific language to extract concepts from the free text based on the words and multi-word terms in the free text. The formal ontology contains both language independent domain knowledge concepts and language dependent linguistic concepts that govern the relationships between concepts and contain the rules about how language works. The system according to the current invention may preferably be used to index medical documents and assign codes from independent coding systems, such as SNOMED, ICD-9 and ICD-10. The system according to the current invention may also preferably make use of syntactic parsing to improve the efficiency of the method.

35 Claims

1. A method for indexing a free text document, the method comprising:
- typographically and functionally segmenting said free text document;
  
  identifying words and multi-word terms in said free text document,
  
  matching said words and multi-word terms to a first plurality of concepts, said first plurality of concepts being contained in a formal ontology,
  
  adding said first plurality of concepts to a conceptual graph,
  
  identifying a second plurality of concepts, said second plurality of concepts being related to said first plurality of concepts, said second plurality of concepts being contained in said formal ontology,
  
  adding said second plurality of concepts to said conceptual graph,
  
  ranking the relevance of said first and second plurality of concepts to a meaning contained in said free text to create a list of relevant concepts, said list of relevant concepts representing said meaning contained in said free text, and
  
  adding said list of relevant concepts to an index for said free text document.
- View Dependent Claims (2)
- - 2. The method according to claim 1, wherein:
    - said typographically segmenting said free text document comprises;
      
      delimiting said free text document into words, sentences, titles, list items and paragraph based character patterns in said free text document, and
      
      said functionally segmenting said free text document comprises;
      
      grouping words into multi-word terms, segmenting said sentences into clause-phrase segments, and grouping words into noun phrases.
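
The typographic segmentation and the multi-word term grouping recited in claims 1 and 2 can be illustrated with a minimal Python sketch. The regular expressions, the function names and the two-entry `LEXICON` are hypothetical and chosen only for illustration; the claims do not prescribe a particular implementation. Titles, list items, clause-phrase segments and noun phrases are left out.

```python
import re

# Hypothetical lexicon of known multi-word terms (not taken from the patent).
LEXICON = {"heart failure", "blood pressure"}

def split_paragraphs(text):
    """Delimit paragraphs, assuming a blank line acts as the paragraph marker."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Delimit sentences at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def split_words(sentence):
    """Delimit words by treating spaces and punctuation as separators."""
    return re.findall(r"[A-Za-z0-9']+", sentence)

def group_multiword_terms(words, lexicon=LEXICON):
    """Pair adjacent words; keep a pair as one term if the lexicon lists it."""
    terms, i = [], 0
    while i < len(words):
        pair = " ".join(words[i:i + 2]).lower()
        if i + 1 < len(words) and pair in lexicon:
            terms.append(pair)          # tagged as a multi-word term
            i += 2
        else:
            terms.append(words[i].lower())
            i += 1
    return terms

text = "Patient has heart failure. Blood pressure is stable.\n\nNo fever."
for paragraph in split_paragraphs(text):
    for sentence in split_sentences(paragraph):
        print(group_multiword_terms(split_words(sentence)))
```

Real clinical text would need a more careful sentence splitter, since abbreviations such as "Dr." also end in a full stop.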

3. A method of processing free text documents for indexing, said method comprising:
- typographically segmenting a free text document, said typographically segmenting comprising;
  
  delimiting said free text document into words, sentences, titles, list items and paragraph based character patterns in said free text document, and
  
  functionally segmenting said free text document, said functionally segmenting comprising;
  
  grouping words into multi-word terms, segmenting said sentences into clause-phrase segments, and grouping words into noun phrases.
- View Dependent Claims (4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
- - 4. The method according to claim 3, wherein
    - said delimiting of said free text document into words is accomplished by recognizing spaces and punctuation marks between characters in said free text,
      
      said delimiting of said free text document into sentences is accomplished by recognizing punctuations following a series of words, wherein said punctuations are defined as ending sentences, and
      
      said delimiting of said free text document into paragraphs is accomplished by recognizing a paragraph marker in said free text document.
  - 5. The method according to claim 3, wherein said grouping words into multi-word terms is accomplished by:
    - identifying at least two adjacent words,
      
      pairing said at least two adjacent words,
      
      searching a lexicon of terms for said pairing of at least two adjacent words, and
      
      if said pairing is found on said lexicon of terms, tagging said pairing as a multi-word term.
  - 6. The method according to claim 5, further comprising:
    - re-writing at least one of said at least two adjacent words to generate a pairing of at least two adjacent words containing at least one re-written word;
      
      searching said lexicon for said pairing of at least two adjacent words containing at least one re-written word; and
      
      if said pairing of at least two adjacent words containing at least one re-written word is found in said lexicon, replacing said pairing of at least two adjacent words with said pairing of at least two adjacent words containing at least one re-written word; and
      
      tagging said pairing of at least two adjacent words containing at least one re-written word as a multi-word term.
  - 7. The method according to claim 3, wherein said segmenting said sentences into clause-phrase segments comprises:
    - identifying a first segment and a second segment in a sentence, wherein said first segment and said second segment are split by a marker that signals the start of a new clause or phrase, and
      
      tagging said first segment as a first clause or phrase and tagging said second segment as a second clause or phrase.
  - 8. The method according to claim 7, wherein said marker that signals the start of a new clause or phrase is selected from the group consisting of:
    - and, but, or, “,”, “;”, although, however, therefore, because, since, during, until, which, if, except, who, while, when, where, with, without, “to avoid”, and “to the point”, with the following proviso;
      
      if said first segment and said second segment are split by “and” or “or” and said first segment ends in a noun phrase and said second segment begins in a noun phrase, said first segment and said second segment are tagged as a single clause or phrase;
      
      if said first segment and said second segment are split by “,” and said first segment ends in with nominal word and said second segment begins in with nominal word, said first segment and said second segment are tagged as a single clause or phrase;
      
      if said first segment and said second segment are split by “,” and said first segment is an adverb, said first segment and said second segment are tagged as a single clause or phrase; and
      
      if said second segment comprises “etc.”, said first segment and said second segment are tagged as a single clause or phrase.
  - 9. The method according to claim 3, further comprising:
    - identifying negating words in said free text.
  - 10. The method according to claim 9, wherein said negating words are selected from the group consisting of:
    - not, no, without, zero, non, nor, avoid, absence, denies, deny, denied, never, won't, shouldn't, wouldn't, couldn't, can't, “with no” and “ruled out”.
  - 11. The method according to claim 10, wherein clauses or phrases containing negating words are tagged as negating text and ignored in further processing.
  - 12. The method according to claim 3, further comprising:
    - identifying modalizing words in said free text.
  - 13. The method according to claim 12, wherein said modalizing words are selected from the group consisting of:
    - might, may, would, could, should, possibly, probably, can, presumed, prefers, prefer, preferred, preferably, wants, wanted, wanting, desires, desired, desire, desiring, likely, unlikely, encourage, encouraged, if, maybe, questionable and suggestive.
  - 14. The method according to claim 12, wherein clauses or phrases containing modalizing words are tagged as modalised text.
  - 15. The method according to claim 14, further comprising identifying modalizing words adjacent to negating words, wherein clauses or phrases containing modalizing words adjacent to negating words are tagged as modalised text.
  - 16. The method according to claim 3, further comprising:
    - grouping said paragraphs into functional sections.
  - 17. The method according to claim 16, further comprising:
    - labeling said functional sections by topic.
  - 18. The method according to claim 3, further comprising:
    - syntactically parsing said free text document.
  - 19. The method according to claim 18, wherein said syntactic parsing is performed using dependency grammar.
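
The clause-phrase segmentation of claims 7 and 8 and the negation and modality tagging of claims 9 to 15 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: only subsets of the recited word lists are used, the provisos of claim 8 are omitted, and the names `split_clauses` and `tag_clause` are hypothetical. The sketch also assumes that a clause containing a negating word is tagged as negated unless a modalizing word sits directly next to the negating word.

```python
import re

# Subsets of the marker and cue lists recited in claims 8, 10 and 13.
CLAUSE_MARKERS = {"but", "although", "however", "because", "while", "when", "which"}
NEGATING = {"not", "no", "without", "denies", "never", "nor"}
MODALIZING = {"might", "may", "could", "possibly", "probably", "likely"}

def split_clauses(sentence):
    """Split at ',' or ';' and before each marker word (claim 8 provisos omitted)."""
    clauses, current = [], []
    for token in re.findall(r"[A-Za-z']+|[,;]", sentence.lower()):
        if token in {",", ";"} or token in CLAUSE_MARKERS:
            if current:
                clauses.append(current)
            # A marker word opens the new clause; punctuation is dropped.
            current = [] if token in {",", ";"} else [token]
        else:
            current.append(token)
    if current:
        clauses.append(current)
    return clauses

def tag_clause(words):
    """Return 'modalised', 'negated' or 'plain' for one clause."""
    for i, word in enumerate(words):
        if word in MODALIZING:
            neighbours = words[max(i - 1, 0):i + 2]
            if any(n in NEGATING for n in neighbours):
                return "modalised"   # modal adjacent to a negation (claim 15)
    if any(w in NEGATING for w in words):
        return "negated"             # ignored in further processing (claim 11)
    if any(w in MODALIZING for w in words):
        return "modalised"           # claim 14
    return "plain"

sentence = "Patient denies chest pain, but may have mild dyspnea; cough is present."
for clause in split_clauses(sentence):
    print(tag_clause(clause), clause)
```

Multi-word cues such as “with no” and “ruled out” would need phrase matching rather than the single-token lookup used here.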

20. A method of deriving the degree of association between words and human-applied labels for a body of text, the method comprising:
- a) collecting a set of documents representative of the kind needed for an application,
  
  b) providing for each paragraph and title in the said documents a label which is considered appropriate for that paragraph or title,
  
  c) counting the number of occurrences of a first word within a first paragraph of text designated with a first label,
  
  d) counting the number of occurrences of said first word within paragraphs of text designated with a label other than first said label,
  
  e) computing the ratio of the occurrences in acts (c) and (d), this ratio being taken as the degree of association between said first word and said section, a ratio greater than 1 signifying a greater than normal association, a ratio less than 1 signifying a weaker than normal association,
  
  f) repeating acts (c) through (e) for each word within said first paragraph of text.
- View Dependent Claims (21, 22, 23, 24)
- - 21. A method of deriving the probability that a given paragraph or other unit of text should be labeled with a particular label, the method comprising:
    - a) deriving the degree of association between words and human-applied labels by the method according to claim 20,
      
      b) limiting said degree of association to fall within the ranges 0.1 and 100.0,
      
      c) collecting a list of words which appear in a section of text to be labeled, deleting any repeats of a word,
      
      d) for each section label, multiplying together the levels of association between said label and words collected in act (c), producing a level of association between the text and the label, and
      
      e) normalizing said levels of association derived in act (d), by dividing each said level of association by the sum of all levels of association, to produce a list of probabilities for each section label, the said probabilities summing to 1.0.
  - 22. A method of segmenting a free text document into functional sections, wherein said document comprises a plurality of functional sections, each of said plurality of functional sections representing a sub-topic, said free text document further being delimited into a plurality of paragraphs, the method comprising:
    - a) dividing the document into paragraphs,
      
      b) using the method according to claim 20, deriving for each paragraph the probability for each label being appropriate for the said paragraph,
      
      c) assigning each paragraph the label with highest probability,
      
      d) grouping any sequence of one or more sequential paragraphs with the same label as a single functional section, and
      
      e) either assigning or not assigning said paragraph to said functional section based on said first probability,
      
      each of acts (a) through (e) being performed on each of said plurality of paragraphs for each of said plurality of functional sections.
  - 23. The method according to claim 22, wherein said paragraph is preceded by a title, the method further comprising:
    - calculating a second probability that said paragraph belongs to said functional section based on said title, and
      
      either assigning or not assigning said paragraph to said functional section based on a combination of said first probability and said second probability.
  - 24. The method according to claim 20, further comprising:
    - calculating the probability that said paragraph belongs to said functional section based on the location of said paragraph in said free text document.
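
The association ratio of claim 20 and the label probabilities of claim 21 can be worked through in a short Python sketch. Several details are assumptions made for illustration only: counts are pooled over all paragraphs sharing a label, a word never seen under any other label gets an infinite ratio that the clamp reduces to 100.0, and a word absent from the training set is treated as neutral (ratio 1.0). The function names and the toy training data are hypothetical.

```python
from collections import Counter, defaultdict

def association_ratios(labelled_paragraphs):
    """Claim 20 sketch: a word's count under a label divided by its count elsewhere."""
    per_label = defaultdict(Counter)
    total = Counter()
    for label, text in labelled_paragraphs:
        words = text.lower().split()
        per_label[label].update(words)
        total.update(words)
    ratios = {}
    for label, counts in per_label.items():
        for word in total:
            inside = counts[word]
            outside = total[word] - inside
            # Assumption: no occurrences elsewhere gives an infinite ratio,
            # which the clamp in label_probabilities reduces to 100.0.
            ratios[(word, label)] = inside / outside if outside else float("inf")
    return ratios

def label_probabilities(text, ratios, labels, low=0.1, high=100.0):
    """Claim 21 sketch: multiply clamped ratios of distinct words, then normalise."""
    words = set(text.lower().split())              # repeats deleted
    scores = {}
    for label in labels:
        score = 1.0
        for word in words:
            ratio = ratios.get((word, label), 1.0)  # assumption: unseen word is neutral
            score *= min(max(ratio, low), high)     # limit to the range 0.1 to 100.0
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}

training = [
    ("history", "patient smoked for years"),
    ("medication", "patient takes aspirin daily"),
    ("history", "smoked cigarettes daily"),
]
ratios = association_ratios(training)
probs = label_probabilities("smoked daily", ratios, ["history", "medication"])
print(probs)
```

Clamping keeps a single rare word from driving a product to zero or to an unbounded value, which is presumably why claim 21 limits the ratio before multiplying.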

25. A method for indexing a free text document, comprising:
- typographically segmenting, by a computing device, the free text document;
  
  functionally segmenting, by the computing device, the free text document;
  
  extracting, by the computing device, concepts from the segmented free text document by matching words and multi-word terms in the segmented free text document to a plurality of concepts contained in a formal ontology; and
  
  indexing, by the computing device, the free text document based on the extracted concepts.
- View Dependent Claims (26, 27, 28, 29, 30, 31, 32, 33, 34, 35)
- - 26. A method as defined in claim 25, further comprising syntactic parsing, by the computing device, of the free text document.
  - 27. A method as defined in claim 25, wherein the plurality of concepts contained in the formal ontology include concepts that are independent of a specific language and concepts that explain the relationships between the language-independent concepts and language.
  - 28. A method as defined in claim 25, wherein the formal ontology comprises:
    - a plurality of concepts arranged in a hierarchy, the hierarchy having a primary node, wherein a primary concept occupies the primary node, the primary concept being the most general concept in the formal ontology, wherein the concepts become more specific at lower levels of the hierarchy;
      
      the plurality of concepts representing real world objects;
      
      each of the plurality of concepts having at least one definition;
      
      wherein a definition of a first concept comprises a first link to the first concept from a second concept, the link representing a relationship between the first concept and the second concept.
  - 29. A method as defined in claim 28, wherein each of the plurality of concepts is independently selected from the group consisting of domain concept, linguistic concept and domain/linguistic concept.
  - 30. A method as defined in claim 25, wherein extracting concepts from the segmented free text document comprises:
    - identifying words and multi-word terms in the free text document;
      
      matching the words and multi-word terms to a first plurality of concepts, the first plurality of concepts being contained in the formal ontology;
      
      adding the first plurality of concepts to a conceptual graph;
      
      identifying a second plurality of concepts, the second plurality of concepts being related to the first plurality of concepts, the second plurality of concepts being contained in the formal ontology;
      
      adding the second plurality of concepts to the conceptual graph;
      
      ranking the relevance of the first and second plurality of concepts to a meaning contained in the free text document to create a list of relevant concepts, the list of relevant concepts representing the meaning contained in the free text document; and
      
      adding the list of relevant concepts to an index for the free text document.
  - 31. A method as defined in claim 25, wherein typographically segmenting the free text document comprises delimiting the free text document into words, sentences, titles, list items and paragraph based character patterns in the free text document.
  - 32. A method as defined in claim 31, wherein functionally segmenting the free text document comprises grouping words into multi-word terms, segmenting the sentences into clause-phrase segments, and grouping words into noun phrases.
  - 33. A method as defined in claim 30, wherein the second plurality of concepts are related to the first plurality of concepts by parent/child relationships, the second plurality of concepts being parent concepts.
  - 34. A method as defined in claim 30, wherein the second plurality of concepts are related to the first plurality of concepts by a plurality of link types, wherein a link type defines a relationship between a first concept and a second concept.
  - 35. A method as defined in claim 30, wherein the words and multi-word terms are matched to the plurality of concepts by first matching the words and multi-word terms to a lexicon of terms, the lexicon of terms containing terms in a plurality of languages, the terms being linked to the concepts in the formal ontology.
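
The concept extraction, graph expansion, ranking and indexing steps of claims 30, 33 and 35 can be sketched in Python as below. Everything concrete here is an assumption for illustration: the toy `LEXICON` and `PARENTS` tables, the concept names, the function names and in particular the scoring rule, since the claims do not state how relevance is ranked. Only direct parent links are followed.

```python
from collections import Counter

# Hypothetical toy data; none of these entries come from the patent.
LEXICON = {                      # term -> concept (terms may be in several languages)
    "heart attack": "MyocardialInfarction",
    "infarctus": "MyocardialInfarction",
    "aspirin": "Aspirin",
}
PARENTS = {                      # child concept -> parent concept
    "MyocardialInfarction": "HeartDisease",
    "HeartDisease": "Disease",
    "Aspirin": "Drug",
}

def extract_concepts(terms):
    """Match words and multi-word terms to a first set of concepts via the lexicon."""
    return [LEXICON[t] for t in terms if t in LEXICON]

def build_graph(first_concepts):
    """Link each matched concept to its direct parent, giving (child, parent) edges."""
    edges = []
    for concept in first_concepts:
        parent = PARENTS.get(concept)
        if parent is not None:
            edges.append((concept, parent))
    return edges

def rank_concepts(first_concepts, edges, parent_weight=0.5):
    """Score concepts by mentions; the weighting is an illustrative assumption."""
    scores = Counter()
    for concept in first_concepts:
        scores[concept] += 1.0
    for _child, parent in edges:
        scores[parent] += parent_weight
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _score in ordered]

def index_document(doc_id, terms, index):
    """Add the ranked list of relevant concepts to the index entry for the document."""
    first = extract_concepts(terms)
    index[doc_id] = rank_concepts(first, build_graph(first))
    return index

index = index_document("doc1", ["patient", "heart attack", "aspirin", "infarctus"], {})
print(index)
```

Because the English and French terms map to the same concept, both mentions raise the score of a single index entry, which reflects the language-independent ontology described in the abstract.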


Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
O'Donnell, Mick; Ceusters, Werner; Montyne, Frank; Van Mol, Maarten; Coppens, Frederik

Granted Patent

US 8,442,814 B2
Field of Search
US Class Current

707/737
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 40/289   Phrasal analysis, e.g. fini...

G06F 40/30   Semantic analysis

G06F 40/40   Processing or translation o...

G06N 5/02   Knowledge representation; S...
