System and method for normalization of a string of words

US 7,822,598 B2
Filed: 02/28/2005
Issued: 10/26/2010
Est. Priority Date: 02/27/2004
Status: Active Grant

Abstract

The present invention relates generally to a system and method for categorization of strings of words. More specifically, the present invention relates to a system and method for normalizing a string of words for use in a system for categorization of words in a predetermined categorization scheme. A method for adaptive categorization of words in a predetermined categorization scheme may include receiving a string of text, tagging the string of text, and normalizing the string of text. Normalization may be performed with a three-stage algorithm including a literal match processing stage, an approximation match processing stage, and a nearest neighbor match processing stage. The normalized string of text can be compared to a number of sequences of text in the predetermined categorization scheme.
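The three-stage flow described in the abstract can be pictured as a cascade in which each stage runs only if the previous one found nothing. The following is a minimal sketch under stated assumptions, not the patented implementation: the names `cascade`, `literal_stage`, and the `(label, score)` hit format are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]             # (matched category label, score); hypothetical format
Stage = Callable[[str], List[Hit]]  # a stage returns an empty list when it finds nothing


def literal_stage(index: Dict[str, str]) -> Stage:
    """Build a literal-match stage over a toy index of exact text sequences."""
    def run(text: str) -> List[Hit]:
        key = text.strip().lower()
        return [(index[key], 1.0)] if key in index else []
    return run


def cascade(text: str, stages: List[Tuple[str, Stage]]) -> Tuple[str, List[Hit]]:
    """Try each stage in order and stop at the first one that yields hits."""
    for name, stage in stages:
        hits = stage(text)
        if hits:
            return name, hits
    # No stage matched: report that nothing was found in the scheme.
    return "none", []


literal_index = {"chest pain": "C-001"}  # illustrative entry, not from the patent
stages = [("literal", literal_stage(literal_index))]
print(cascade("Chest Pain", stages))
print(cascade("unknown phrase", stages))
```

An approximation stage and a nearest neighbor stage would be appended to `stages` in that order, so the cheaper and stricter match is always attempted first.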

2 Claims

1. In a computer system, a method for use in a predetermined categorization scheme, comprising:
- normalizing a string of words utilizing a computer configured to perform the steps of;
  
  receiving an input string of text;
  
  tagging the string of text by annotating a string of words with labels marking the start and end of relevant portions of text;
  
  comparing said tagged strings of text to a literal index, the literal index including a plurality of predetermined text sequences;
  
  determining if the string of text matches at least one of the plurality of predetermined text sequences within the literal index;
  
  if the string of words does not match at least one of the plurality of predetermined text sequences;
  
  determining a baseform transform of the input string, said baseform transform derived by removing of noise words and stemming the remaining words using de-derivation and uninflection, said baseform transform including at least one baseform associated with the input string;
  
  preparing a sorted version of the baseform transform;
  
  comparing the at least one baseform to a baseform index, the baseform index including a plurality of predetermined baseform sequences;
  
  determining a score for each of the plurality of predetermined baseform sequences that substantially match the at least one baseform and outputting feedback for any baseforms that exceed a predetermined threshold score;
  
  if no baseforms exceed the predetermined threshold score;
  
  computing a feature transformation of the input string, the feature transform including at least one feature associated with the input string;
  
  comparing the at least one feature to a feature index, the feature index including a plurality of predetermined feature sequences;
  
  determining a score for each of the plurality of predetermined feature sequences that substantially match the at least one feature; and
  
  outputting a hit list of candidate sequence matches based on the input string, and if no feature sequences are found based on the input string, outputting an indication that no predetermined text sequences were found within the predetermined categorization scheme wherein the method is performed by a computer executing stored instructions.
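The baseform steps recited in claim 1 (noise-word removal, stemming, sorting, then scoring against a baseform index with a threshold) can be illustrated roughly as follows. This is a sketch only: the noise-word list, the suffix-stripping `stem` function (a crude stand-in for de-derivation and uninflection), the Jaccard overlap score, and the 0.5 threshold are all assumptions made for the example and are not taken from the patent.

```python
from typing import Dict, List, Tuple

NOISE_WORDS = {"the", "of", "a", "an", "in", "with"}  # illustrative list (assumption)
SUFFIXES = ("ing", "ed", "s")  # toy stand-in for de-derivation and uninflection


def stem(word: str) -> str:
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def baseform_transform(text: str) -> Tuple[str, ...]:
    """Drop noise words, stem the rest, and return the sorted baseforms."""
    words = [w for w in text.lower().split() if w not in NOISE_WORDS]
    return tuple(sorted(stem(w) for w in words))


def score_baseforms(query: Tuple[str, ...],
                    index: Dict[Tuple[str, ...], str],
                    threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Score index entries by Jaccard overlap; keep those above the threshold."""
    q = set(query)
    hits = []
    for entry, label in index.items():
        e = set(entry)
        union = q | e
        score = len(q & e) / len(union) if union else 0.0
        if score > threshold:
            hits.append((label, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


baseform_index = {baseform_transform("rib fracture"): "rib fracture"}
query = baseform_transform("Fractures of the ribs")
print(query, score_baseforms(query, baseform_index))
```

Sorting the baseforms makes the comparison insensitive to word order, which is why "Fractures of the ribs" and "rib fracture" reduce to the same sequence here. The sketch does not handle punctuation.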

2. An apparatus for normalizing a string of words for use in a predetermined categorization scheme, comprising:
- a processor and a memory encoded with instructions, for execution by the processor, to receive an input string of text, to tag relevant portions of the input string by marking the beginning and the end of said relevant portions of the input string and by marking said relevant portions of the input string with semantic labels based on the predetermined categorization scheme, to compare the tagged portions of said input string to a literal index, where the literal index includes a plurality of predetermined text sequences, and to determine if the string of text matches at least one of the plurality of predetermined text sequences within the literal index;
  
  wherein if the string of words does not match at least one of the plurality of predetermined text sequences;
  
  determining a baseform transform of the input string, said baseform transform derived by removing of noise words and stemming the remaining words using de-derivation and uninflection, said baseform transform including at least one baseform associated with the input string;
  
  preparing a sorted version of the baseform transform;
  
  comparing the at least one baseform to a baseform index, the baseform index including a plurality of predetermined baseform sequences;
  
  determining a score for each of the plurality of predetermined baseform sequences that substantially match the at least one baseform and outputting feedback for any baseforms that exceed a predetermined threshold score;
  
  if no baseforms exceed the predetermined threshold score;
  
  computing a feature transformation of the input string, the feature transform including at least one feature associated with the input string;
  
  comparing the at least one feature to a feature index, the feature index including a plurality of predetermined feature sequences;
  
  determining a score for each of the plurality of predetermined feature sequences that substantially match the at least one feature; and
  
  outputting a hit list of candidate sequence matches based on the input string, and if no feature sequences are found based on the input string, outputting an indication that no predetermined text sequences were found within the predetermined categorization scheme.
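The claims do not say which features the feature transformation computes. As one possible reading, the sketch below assumes character trigrams as features and a Dice coefficient as the score; both choices, along with the names `trigram_features` and `nearest_neighbors`, are assumptions for illustration rather than the claimed method.

```python
from typing import List, Set, Tuple


def trigram_features(text: str) -> Set[str]:
    """Return the set of character trigrams of the padded, lowercased text."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def nearest_neighbors(text: str, index_entries: List[str],
                      top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank index entries by Dice similarity of trigram sets (assumed scoring)."""
    query = trigram_features(text)
    scored = []
    for entry in index_entries:
        feats = trigram_features(entry)
        dice = 2 * len(query & feats) / (len(query) + len(feats))
        if dice > 0:
            scored.append((entry, dice))
    scored.sort(key=lambda h: h[1], reverse=True)
    # An empty list signals that no sequences were found in the scheme.
    return scored[:top_k]


feature_index = ["pneumonia", "fracture"]  # illustrative entries
print(nearest_neighbors("pnemonia", feature_index))
print(nearest_neighbors("xyz", feature_index))
```

Because trigram overlap tolerates small spelling differences, a misspelled input such as "pnemonia" can still surface "pneumonia" in the hit list, while an input sharing no trigrams with any entry yields an empty list.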

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Dictaphone Corporation (Microsoft Corporation)
Inventors: DePlonty, Thomas J. III; Carus, Alwin B.
Primary Examiner(s): GODBOLD, DOUGLAS
Application Number: US11/068,493
Publication Number: US 20050192792A1
Time in Patent Office: 2,066 Days
Field of Search: 704/9, 704/10, 704/1, 707/4-6
US Class Current: 704/9
CPC Class Codes:
G06F 40/242   Dictionaries
G06F 40/30   Semantic analysis
G16H 70/00   ICT specially adapted for t...
