Region-Matching Transducers for Natural Language Processing

US 20100161313A1
Filed: 12/18/2008
Published: 06/24/2010
Est. Priority Date: 12/18/2008
Status: Active Grant

Abstract

Computer methods, apparatus and articles of manufacture therefor, are disclosed for developing a region-matching transducer for marking language data having delimited strings. The region-matching transducer defines one or more patterns of one or more sequences of delimited strings, with at least one of the patterns defined in the region-matching transducer having an arrangement of a plurality of class-matching networks. The plurality of class-matching networks defines a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes. The region-matching transducer has, for each of the one or more patterns, an arc that leads from a penultimate state with a transition label that identifies the entity class of the pattern, and shares states between patterns leading to a penultimate state when segments of delimited strings making up two or more patterns overlap.
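The two structural properties named in the abstract (a class-label arc leaving each pattern's penultimate state, and states shared between overlapping patterns) can be illustrated with a small sketch. It is not the patented implementation: it assumes patterns are plain token sequences, models only prefix overlap as the shared segment, and uses hypothetical names (`build_network`, the `<Class>` label convention, and the sample patterns).

```python
from typing import Dict, List, Tuple

State = int

def build_network(patterns: List[Tuple[List[str], str]]):
    """Build arcs[state][label] -> next state for a list of (tokens, entity_class).

    State 0 is the start state and state 1 is the single final state.
    Patterns with a common prefix reuse the same states (assumed form of sharing).
    """
    arcs: Dict[State, Dict[str, State]] = {0: {}, 1: {}}
    final = 1
    next_id = 2
    for tokens, entity_class in patterns:
        state = 0
        for tok in tokens:
            if tok not in arcs[state]:        # create a state only when no shared one exists
                arcs[state][tok] = next_id
                arcs[next_id] = {}
                next_id += 1
            state = arcs[state][tok]
        # 'state' is now the penultimate state of this pattern; the arc leaving it
        # carries a label naming the entity class and leads to the final state.
        arcs[state]["<" + entity_class + ">"] = final
    return arcs, final

# Hypothetical sample patterns, chosen only to show overlap.
patterns = [
    (["New", "York"], "City"),
    (["New", "York", "Times"], "Newspaper"),
    (["New", "Jersey"], "State"),
]
arcs, final = build_network(patterns)
unshared = 2 + sum(len(tokens) for tokens, _ in patterns)
print(f"{len(arcs)} states with sharing, {unshared} without")
# prints: 6 states with sharing, 9 without
```

In this sketch the state reached after "New York" is both the penultimate state of the City pattern and an intermediate state of the Newspaper pattern, which is what lets a single left-to-right pass consider both.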

303 Citations

20 Claims

1. A computer implemented method, comprising:
- recording in a memory input data having delimited strings;
  
  recording in the memory a region-matching transducer defining one or more patterns of one or more sequences of delimited strings, with at least one of the patterns defined in the region-matching transducer having an arrangement of a plurality of class-matching networks;
  
  the plurality of class-matching networks defining a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes;
  
  the region-matching transducer (i) having, for each of the one or more patterns, an arc that leads from a penultimate state with a transition label that identifies the entity class of the pattern, and (ii) sharing states between patterns leading to a penultimate state when segments of delimited strings making up two or more patterns overlap;
  
  applying the region-matching transducer recorded in the memory to the input data with an apply-stage replacement method, which apply-stage replacement method follows a longest match principle for identifying one or more patterns in the region-matching transducer that match one or more sequences of delimited strings in the input data;
  
  at least one of the matching sequences of delimited strings satisfying at least one pattern in the region-matching transducer defined by an arrangement of a plurality of class-matching networks; and
  
  recording in the memory, in response to said applying, the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14):
- - 2. The method according to claim 1, further comprising processing the one or more sequences of delimited strings in the input data with one or more linguistic applications.
  - 3. The method according to claim 2, wherein the one or more linguistic applications comprises one or more of categorization, summarization, and translation.
  - 4. The method according to claim 1, further comprising producing one of the plurality of class-matching networks by compiling a regular expression that includes a list of words.
  - 5. The method according to claim 4, wherein the regular expression includes a penultimate state with a transition label that identifies the entity class of the class-matching network mapped to an epsilon, the entity class identifying one of a part-of-speech class and an application specific class.
  - 6. The method according to claim 1, further comprising producing one of the plurality of class-matching networks by retaining one part-of-speech for each word forming part of a morphological-analyzing network.
  - 7. The method according to claim 6, wherein the morphological-analyzing network is a transducer, describing lexical forms of words of a language and their associated inflected forms.
  - 8. The method according to claim 1, wherein the input data recorded in the memory, in response to said applying, is recorded in the memory with one or more labels identifying the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
  - 9. The method according to claim 1, wherein the region-matching transducer recognizes a pattern defined by a plurality of parts-of-speech classes.
  - 10. The method according to claim 9, wherein the plurality of parts-of-speech classes is a noun phrase defined by a determiner part-of-speech class, an adjective part-of-speech class, and a noun part-of-speech class.
  - 11. The method according to claim 1, wherein the region-matching transducer recognizes a pattern defined by a word beginning with a capital letter and at least one subsequent word followed by a sentence-final punctuation character.
  - 12. The method according to claim 1, wherein the region-matching transducer recognizes a pattern defined by one or more of a paragraph, a sentence, and a phrase.
  - 13. The method according to claim 1, wherein the input data is language data.
  - 14. The method according to claim 1, wherein the region-matching transducer identifies intelligible information in the input data.
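Claims 4, 8, and 10 together describe class-matching networks built from word lists, a noun-phrase pattern arranged from part-of-speech classes, and labeled output. The following is a rough sketch of that combination, not the claimed method itself: it assumes each class-matching network can be approximated by a Python set of words, that input is whitespace-delimited, and that labels are written as XML-style tags. `CLASSES`, `PATTERNS`, `match_at`, and `mark_regions` are hypothetical names.

```python
# Each "class-matching network" is approximated here by a word list (cf. claim 4).
CLASSES = {
    "Det": {"the", "a"},
    "Adj": {"quick", "brown", "lazy"},
    "Noun": {"fox", "dog"},
}

# A pattern is an arrangement of classes plus the entity class it identifies.
PATTERNS = [
    ("NP", ["Det", "Adj", "Noun"]),
    ("NP", ["Det", "Noun"]),
]

def match_at(tokens, start, class_seq):
    """Return the number of tokens matched by class_seq at start, or 0."""
    end = start + len(class_seq)
    if end > len(tokens):
        return 0
    for tok, cls in zip(tokens[start:end], class_seq):
        if tok.lower() not in CLASSES[cls]:
            return 0
    return len(class_seq)

def mark_regions(text):
    """Scan left to right, wrapping the longest match at each position in a label."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        best_len, best_label = 0, None
        for label, class_seq in PATTERNS:
            n = match_at(tokens, i, class_seq)
            if n > best_len:                  # keep only the longest match
                best_len, best_label = n, label
        if best_len:
            region = " ".join(tokens[i:i + best_len])
            out.append(f"<{best_label}>{region}</{best_label}>")
            i += best_len                     # resume after the matched region
        else:
            out.append(tokens[i])             # unmatched tokens pass through unchanged
            i += 1
    return " ".join(out)

print(mark_regions("The lazy dog saw a fox"))
# prints: <NP>The lazy dog</NP> saw <NP>a fox</NP>
```

A real system would compile the word lists and the pattern into one network rather than looping over patterns; the loop is used here only to keep the longest-match behavior easy to follow.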

15. A computer apparatus, comprising:
- a memory for storing processing instructions of the apparatus; and
  
  a processor coupled to the memory for executing the processing instructions of the apparatus;
  
  the processor in executing the processing instructions;
  
  recording in the memory input data having delimited strings;
  
  recording in the memory a region-matching transducer defining one or more patterns of one or more sequences of delimited strings, with at least one of the patterns defined in the region-matching transducer having an arrangement of a plurality of class-matching networks;
  
  the plurality of class-matching networks defining a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes;
  
  the region-matching transducer (i) having, for each of the one or more patterns, an arc that leads from a penultimate state with a transition label that identifies the entity class of the pattern, and (ii) sharing states between patterns leading to a penultimate state when segments of delimited strings making up two or more patterns overlap;
  
  applying the region-matching transducer recorded in the memory to the input data with an apply-stage replacement method, which apply-stage replacement method follows a longest match principle for identifying one or more patterns in the region-matching transducer that match one or more sequences of delimited strings in the input data;
  
  at least one of the matching sequences of delimited strings satisfying at least one pattern in the region-matching transducer defined by an arrangement of a plurality of class-matching networks; and
  
  recording in the memory, in response to said applying, the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
- Dependent claims (16):
- - 16. The apparatus according to claim 15, wherein the input data recorded in the memory, in response to said applying, is recorded in the memory with labels identifying the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.

17. An article of manufacture comprising computer usable media including computer readable instructions embedded therein that causes a computer to perform a method, wherein the method comprises:
- recording in a memory input data having delimited strings;
  
  recording in the memory a region-matching transducer defining one or more patterns of one or more sequences of delimited strings, with at least one of the patterns defined in the region-matching transducer having an arrangement of a plurality of class-matching networks;
  
  the plurality of class-matching networks defining a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes;
  
  the region-matching transducer (i) having, for each of the one or more patterns, an arc that leads from a penultimate state with a transition label that identifies the entity class of the pattern, and (ii) sharing states between patterns leading to a penultimate state when segments of delimited strings making up two or more patterns overlap;
  
  applying the region-matching transducer recorded in the memory to the input data with an apply-stage replacement method, which apply-stage replacement method follows a longest match principle for identifying one or more patterns in the region-matching transducer that match one or more sequences of delimited strings in the input data;
  
  at least one of the matching sequences of delimited strings satisfying at least one pattern in the region-matching transducer defined by an arrangement of a plurality of class-matching networks; and
  
  recording in the memory, in response to said applying, the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
- Dependent claims (18):
- - 18. The article of manufacture according to claim 17, wherein the input data recorded in the memory, in response to said applying, is recorded in the memory with one or more labels identifying the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.

19. A computer apparatus, comprising:
- a memory for recording input data having delimited strings;
  
  a region-matching transducer defining one or more patterns of one or more sequences of delimited strings, with at least one of the patterns defined in the region-matching transducer having an arrangement of a plurality of class-matching networks;
  
  the plurality of class-matching networks defining a combination of two or more entity classes from one or both of part-of-speech classes and application-specific classes;
  
  the region-matching transducer (i) having, for each of the one or more patterns, an arc that leads from a penultimate state with a transition label that identifies the entity class of the pattern, and (ii) sharing states between patterns leading to a penultimate state when segments of delimited strings making up two or more patterns overlap;
  
  an FST engine for applying the region-matching transducer recorded in the memory to the input data with an apply-stage replacement method, which apply-stage replacement method follows a longest match principle for identifying one or more patterns in the region-matching transducer that match one or more sequences of delimited strings in the input data;
  
  at least one of the matching sequences of delimited strings satisfying at least one pattern in the region-matching transducer defined by an arrangement of a plurality of class-matching networks; and
  
  wherein the memory records the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
- Dependent claims (20):
- - 20. The apparatus according to claim 19, wherein the input data recorded in the memory, in response to the FST engine applying the region-matching transducer to the input data, is recorded in the memory with one or more labels identifying the one or more sequences of delimited strings in the input data matching the one or more patterns in the region-matching transducer.
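The role of the FST engine in claims 19 and 20 (applying the network to the input under a longest match principle and recording the matched sequences) might be sketched as below. This is an illustrative stand-in, not the engine described in the patent: the network is assumed to be a nested dictionary keyed by token, the key `None` is an assumed convention for the class-label arc, and `build` and `apply_longest_match` are hypothetical names.

```python
def build(patterns):
    """Compile (tokens, entity_class) pairs into a nested-dict network."""
    root = {}
    for tokens, entity_class in patterns:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})   # overlapping prefixes share nodes
        node[None] = entity_class             # label arc from the penultimate node
    return root

def apply_longest_match(root, tokens):
    """Return (entity_class, matched_tokens) pairs found scanning left to right."""
    results = []
    i = 0
    while i < len(tokens):
        node, j, best = root, i, None
        # Follow arcs as far as the input allows, remembering the last
        # position at which a class-label arc was available.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if None in node:
                best = (j, node[None])
        if best:
            end, entity_class = best
            results.append((entity_class, tokens[i:end]))
            i = end                           # skip past the recorded region
        else:
            i += 1
    return results

# Hypothetical patterns and input, for illustration only.
network = build([
    (["New", "York"], "City"),
    (["New", "York", "Times"], "Newspaper"),
])
matches = apply_longest_match(network, "I read the New York Times in New York".split())
print(matches)
# prints: [('Newspaper', ['New', 'York', 'Times']), ('City', ['New', 'York'])]
```

The first occurrence is recorded as a Newspaper rather than a City because the scan keeps extending the match while arcs remain, which is the effect the longest match principle is meant to produce.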

Current Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Original Assignee: Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Inventors: Karttunen, Lauri J.
Granted Patent: US 8,447,588 B2

Field of Search
US Class Current: 704/9
CPC Class Codes: G06F 40/289 Phrasal analysis, e.g. fini...
