System and method for document section segmentation

US 20050144184A1
Filed: 09/30/2004
Published: 06/30/2005
Est. Priority Date: 10/01/2003
Status: Abandoned Application

Abstract

A system and method for facilitating the processing and the use of documents by providing a system for categorizing document section headings under a set of canonical section headings. In the method for categorizing section headings, there may be a process of training a database and matching methods to categorize different but equivalent document section headings under canonical headings and categories. Once trained, the system may match and categorize the document sections with little to no supervision of the categorization for large sets of documents.
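
The train-then-match loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the applicants' implementation: the headings, the codes (`HPI`, `PMH`, `MEDS`), the `ACCEPT_AT` cutoff, and the use of Python's `difflib` as the dissimilarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical canonical codes and exemplar headings; not taken from the patent.
TARGETS = {
    "history of present illness": "HPI",
    "past medical history": "PMH",
    "medications": "MEDS",
}

ACCEPT_AT = 0.25   # assumed cutoff: dissimilarity at or below this counts as a match
review_queue = []  # headings held back for human review


def dissimilarity(a: str, b: str) -> float:
    """Return 1 - difflib similarity ratio, so 0.0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def categorize(heading: str):
    """Return the code of the closest known heading, or None if queued for review."""
    candidate = heading.strip().lower()
    best = min(TARGETS, key=lambda t: dissimilarity(candidate, t))
    if dissimilarity(candidate, best) <= ACCEPT_AT:
        return TARGETS[best]
    review_queue.append(candidate)
    return None


def approve(heading: str, code: str) -> None:
    """Record a reviewer's decision so the heading matches automatically next time."""
    TARGETS[heading] = code
    if heading in review_queue:
        review_queue.remove(heading)
```

In this sketch a near-variant such as `"Past Medical History:"` resolves to an existing code, while an unfamiliar heading is queued, and a later call to `approve` adds it to the target set so that supervision is needed less often as the set grows.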

9 Claims

1. A system and method for document heading categorization, comprising the steps of:
  - constructing a first data set consisting of exemplars having at least one pair of expressions and corresponding codes;
  - constructing a second data set having a structural hierarchy, where the second data set contains at least one corresponding code mapped to at least one expression;
  - transforming at least one of the expressions into a first representation, where the first representation includes sequential word features;
  - constructing a target data set consisting of at least one first representation and at least one corresponding code;
  - comparing a candidate string to the target data set;
  - identifying a least dissimilar target representation in the target data set having a dissimilarity score exceeding a first pre-determined value;
  - providing the corresponding code of the least dissimilar target in the target data set;
  - selectively saving a candidate string having a dissimilarity score not exceeding a second pre-determined value; and
  - selectively reviewing the saved candidate string and assigning its representation and corresponding code to the target data set.

2. The method according to claim 1, further comprising the step of selectively transforming at least one of expressions into a second representation, where the second representation includes a plurality of sequences of word stems.

3. The method according to claim 2, further comprising the step of transforming at least one of the first and second representations into a third representation, where the third representation includes a plurality of n-grams.

4. The method according to claim 1, where the set of exemplars includes empirical data consisting of headings taken from existing documents.

5. The method according to claim 2, where the first representation includes words that are normalized to the word stems.

6. The method according to claim 5, where the stemmed forms are filtered for non-content or stop words.

7. The method according to claim 5, where the stemmed forms include synonyms or hypernyms.

8. The method according to claim 3, where the third representation includes stemmed forms based upon at least one sequence of word stems or n-grams from the second representation.

9. The method according to claim 2, where second representation further includes filtering of stop words.
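
The three representations named in the claims (word sequences, word stems with stop words filtered, and n-grams) might look like the sketch below. Several things here are assumptions rather than anything the claims specify: the stop-word list, the toy suffix stripper used in place of a real stemmer, the choice of character trigrams (the claims do not say whether n-grams are over characters or words), and Jaccard distance as the dissimilarity score.

```python
from __future__ import annotations

import re

STOP_WORDS = {"of", "the", "and", "a", "an", "for"}  # assumed, illustrative list


def words(expression: str) -> list[str]:
    """First representation: the lowercased word sequence, order preserved."""
    return re.findall(r"[a-z0-9]+", expression.lower())


def naive_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stems(expression: str) -> list[str]:
    """Second representation: stemmed words with stop words filtered out."""
    return [naive_stem(w) for w in words(expression) if w not in STOP_WORDS]


def ngrams(tokens: list[str], n: int = 3) -> set[str]:
    """Third representation: character n-grams over the joined token sequence."""
    text = " ".join(tokens)
    if len(text) < n:
        return {text}
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def dissimilarity(a: str, b: str) -> float:
    """Jaccard distance between the n-gram sets of two headings (0.0 = same)."""
    x, y = ngrams(stems(a)), ngrams(stems(b))
    union = x | y
    return 1.0 - len(x & y) / len(union) if union else 0.0
```

Under these assumptions, `"Medications"` and `"Medication:"` reduce to the same stem sequence and therefore score a dissimilarity of zero, which is the kind of "different but equivalent" heading the claims aim to group under one code. A production system would likely swap in an established stemmer and a tuned stop-word list.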

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Dictaphone Corporation (Microsoft Corporation)
Inventors: Carus, Alwin B.; Heyvaert, Stefaan; MacPherson, Melissa; Parkes, Cornelia
Application Number: US10/953,448
Publication Number: US 20050144184A1
US Class Current: 1/1
CPC Class Codes: G06F 40/258 Heading extraction; Automat...