System and method for language extraction and encoding

US 10,275,424 B2
Filed: 01/28/2014
Issued: 04/30/2019
Est. Priority Date: 07/29/2011
Status: Active Grant

Abstract

Improved systems and methods for extracting information from medical and natural-language text data.

9 Claims

1. A method for extracting information from medical or natural-language input text, comprising:
- receiving, by a computing system, medical or natural-language input text, wherein one or more words or portions of said medical or natural-language input text includes an identification tag;
- selecting, by the computing system, the input text using the identification tag to determine a relevant text input and an irrelevant text input, wherein the identification tag includes a string value and/or a nested structure value, wherein the identification tag is configured to be customized and recognized by a processor;
- utilizing, by the computing system, a lexicon knowledge base to identify and categorize multi-word and single word phrases within sentences of the relevant text input, wherein said lexicon knowledge base is configured to be dynamically customized by a user;
- receiving, from a user, filenames having new lexical entries, and modifying the lexicon knowledge based on the filenames;
- disambiguating, by the computing system, one or more ambiguous words in the relevant text input using a contextual disambiguation rule, wherein the contextual disambiguation rule is configured to analyze words following or preceding each ambiguous word, words in the same sentence, words in a certain section, and/or words in a certain domain, and wherein the contextual disambiguation rule is configured to be dynamically loaded without compiling the entire computing system;
- parsing, by the computing system, said relevant text input to determine a grammatical structure of the relevant text input, said parsing step comprising the step of referring to a domain parameter having a value indicative of a domain from which the text data originated, the domain parameter corresponding to one or more rules of grammar within a knowledge base related to the domain to be applied for parsing the relevant text input;
- regularizing, by the computing system, the parsed text data to form a canonical output form;
- converting, by the computing system, the canonical output form into controlled vocabulary terms using a table of codes, wherein the table of codes is configured to be dynamically customized without compiling the entire computing system;
- tagging, by the computing system, the input text with a structured data component derived from the controlled vocabulary terms; and
- outputting the tagged text data to be stored in a database.

2. The method of claim 1, wherein the identification tag is selected from the group consisting of dates, names, phone numbers, addresses, and locations.

3. The method of claim 1, wherein the tagged text data is stored in a form compatible with a standard spreadsheet application or relational database.

4. The method of claim 1, wherein the structure of the identification tag is configured to be updated with an adaptation of the processor.
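
The steps recited in claim 1 can be read as a pipeline. The following is a minimal Python sketch of that reading, not the patentee's implementation. Every name, lexicon entry, rule, and code in it (`RELEVANT_TAGS`, `LEXICON`, `CANONICAL`, `CODE_TABLE`, the `DEMO-` codes) is an invented placeholder, and the grammar-based parsing step with its domain parameter is omitted for brevity.

```python
import re

# Everything below is invented for illustration, including the codes.
RELEVANT_TAGS = {"history", "findings"}   # identification tags treated as relevant

LEXICON = {                               # phrase (as a token tuple) -> category
    ("chest", "pain"): "finding",
    ("fever",): "finding",
    ("cold",): "ambiguous",
}

CANONICAL = {"chest pain": "pain in chest"}   # regularization to a canonical form

CODE_TABLE = {                            # canonical form -> controlled-vocabulary code
    "pain in chest": "DEMO-001",
    "common cold": "DEMO-002",
    "fever": "DEMO-003",
}

MAX_PHRASE_LEN = max(len(key) for key in LEXICON)


def select_relevant(sections):
    """Keep only text whose identification tag marks it as relevant."""
    return [text for tag, text in sections if tag in RELEVANT_TAGS]


def lookup_phrases(tokens):
    """Greedy longest-match lookup of multi-word and single-word phrases."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in LEXICON:
                found.append((i, " ".join(key), LEXICON[key]))
                i += n
                break
        else:
            i += 1   # no lexicon entry starts at this token
    return found


def disambiguate(phrase, index, tokens):
    """Toy contextual rule: look at the two words preceding the ambiguous word."""
    if phrase == "cold":
        preceding = tokens[max(0, index - 2):index]
        return "common cold" if "a" in preceding else "low temperature"
    return phrase


def extract(sections):
    """Select, look up, disambiguate, regularize, and encode."""
    results = []
    for text in select_relevant(sections):
        tokens = re.findall(r"[a-z]+", text.lower())
        for index, phrase, category in lookup_phrases(tokens):
            term = disambiguate(phrase, index, tokens) if category == "ambiguous" else phrase
            term = CANONICAL.get(term, term)
            results.append({"text": phrase, "term": term, "code": CODE_TABLE.get(term)})
    return results


report = [
    ("header", "Cold storage room 4"),                    # irrelevant tag, dropped
    ("history", "Patient has chest pain and a cold."),
]
for row in extract(report):
    print(row)
```

Keeping the lexicon, the canonical forms, and the code table as plain data rather than logic is what would let a user customize them without touching the pipeline code, which is the property the claim emphasizes.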

5. A system for extracting information from medical or natural-language input text, comprising:
- a lexicon knowledge base to identify and categorize multi-word and single word phrases within sentences of a relevant input text, wherein said lexicon knowledge base is configured to be dynamically customized by a user by receiving, from the user, filenames having new lexical entries, and the lexicon knowledge is modified based on the filenames;
- a processor, coupled to said lexicon knowledge base and receiving said medical or natural-language input text, tagging one or more words or portions of said medical or natural-language input text with an identification tag, and selecting the input text using the identification tag to determine the relevant text input and an irrelevant text input, wherein the identification tag is configured to be customized by a user;
- a boundary identifier, coupled to said processor and said lexicon knowledge base and receiving said medical or natural-language input text and dropping the irrelevant text input;
- a parser, coupled to said boundary identifier and receiving said relevant input text to determine the grammatical structure of the relevant text input and generating a parsed text wherein one or more ambiguous words in the parsed data are disambiguated using a contextual disambiguation rule, wherein the disambiguation rule is configured to be dynamically loaded without compiling the entire computing system;
- a phrase regulator, coupled to said parser and replacing the parsed text with a canonical output form; and
- an encoder, coupled to said phrase regulator and receiving the canonical output form, converting the canonical output form into a controlled vocabulary term using a table of codes, tagging the input text with a structured data component derived from controlled vocabulary terms, and outputting the tagged text data to be stored in a database, wherein the table of codes is configured to be dynamically customized without compiling the entire computing system.

6. The system of claim 5, wherein the identification tag is selected from the group consisting of dates, names, phone numbers, addresses, and locations.

7. The system of claim 5, wherein the tagged text data is stored in a form compatible with a standard spreadsheet application, an XML format, or relational database.

8. The system of claim 5, wherein the disambiguation rules are configured to be compiled by specifying the filename option, wherein the filename comprises the disambiguation rules.

9. The system of claim 5, wherein the structure of the identification tag is configured to be updated with an adaptation of the processor.
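
Claims 5 and 8 describe a table of codes and disambiguation rules that can be customized from a named file without recompiling the whole system. One way to picture that behavior, as a hedged sketch only, is an encoder that reads its table from a file at run time and can re-read it on demand. The JSON file format, the `Encoder` class, the `reload` method, and the `DEMO-` codes are all assumptions made for this example and do not come from the patent.

```python
import json
import tempfile
from pathlib import Path


def load_table(filename):
    """Read a code table (canonical term -> code) from a JSON file at run time."""
    with open(filename, encoding="utf-8") as handle:
        return json.load(handle)


class Encoder:
    """Hypothetical encoder whose code table lives in an external file."""

    def __init__(self, table_file):
        self.table_file = table_file
        self.table = load_table(table_file)

    def reload(self):
        """Pick up edits to the table file without rebuilding the program."""
        self.table = load_table(self.table_file)

    def encode(self, term):
        """Return the code for a canonical term, or None if it is not listed."""
        return self.table.get(term)


# Usage: the table file is edited while the encoder object stays alive.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "codes.json"
    path.write_text(json.dumps({"fever": "DEMO-003"}), encoding="utf-8")

    encoder = Encoder(path)
    before = encoder.encode("common cold")   # term not yet in the table

    path.write_text(
        json.dumps({"fever": "DEMO-003", "common cold": "DEMO-002"}),
        encoding="utf-8",
    )
    encoder.reload()
    after = encoder.encode("common cold")    # term now resolves to a code

print(before, after)
```

The same pattern would apply to a file of disambiguation rules: the filename is the only thing the running system needs to be told.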

Current Assignee: Trustees Of Columbia University In The City Of New York (Columbia University)
Original Assignee: Trustees Of Columbia University In The City Of New York (Columbia University)
Inventors: Friedman, Carol
Primary Examiner(s): Armstrong, Angela A
Application Number: US14/166,160
Publication Number: US 20140142924A1
Time in Patent Office: 1,918 Days
CPC Class Codes:
G06F 40/10 Text processing natural lan...
G06F 40/279 Recognition of textual enti...
