System and method for adaptive sentence boundary disambiguation
Abstract
Embodiments disclosed herein provide a system and method for pre-processing non-sentence text extracted from business documents (e.g., malformed bulleted lists, runaway sentence identification, spatially separated data, etc.). One embodiment includes two heuristic algorithms: one searches for sentences in a document and the other looks for non-sentences (e.g., lists, tables, tabs, names of people, addresses, etc.) in the same document. In one embodiment, when malformed text is encountered, a particular character (e.g., “?”) is inserted to signal to a natural language processing layer that this set of “words” represents a logical construct and should be evaluated independently of other sentences. Embodiments disclosed herein allow non-sentence text, which is linguistically dry but contextually rich, to be included in the natural language processing. Embodiments disclosed herein also help reduce false positive concept extraction assertions by the natural language processing layer.
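For illustration only, the two heuristic algorithms described in the abstract might be sketched as follows. The specific rules (capitalized start, terminal punctuation, a four-word minimum, bullet markers, tabs or wide gaps) and every name in the block are assumptions made for this sketch; the source does not disclose the actual heuristics.

```python
import re

# Hypothetical rule: sentence text starts with a capital letter and ends
# with terminal punctuation. These patterns are assumptions, not the
# heuristics of the disclosed embodiments.
SENTENCE_RE = re.compile(r"^[A-Z].*[.!?]$")
# Hypothetical rule: a leading bullet or list number marks list text.
BULLET_RE = re.compile(r"^(?:[-*\u2022]|\d+[.)])\s+")
# Hypothetical rule: a tab or a run of 3+ spaces marks spatially separated data.
COLUMN_GAP_RE = re.compile(r"\t|\s{3,}")


def looks_like_sentence(line: str) -> bool:
    """First heuristic (assumed): does the line read as sentence text?"""
    text = line.strip()
    return bool(SENTENCE_RE.match(text)) and len(text.split()) >= 4


def looks_like_non_sentence(line: str) -> bool:
    """Second heuristic (assumed): lists, column data, or unstructured text."""
    text = line.strip()
    if not text:
        return False  # blank lines carry no text to classify
    if BULLET_RE.match(text) or COLUMN_GAP_RE.search(text):
        return True
    return not looks_like_sentence(text)
```

Under these assumed rules, a line such as "John Smith    123 Main St    Austin TX" is classified as non-sentence text because of its wide gaps, while an ordinary prose sentence is not.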
18 Claims
1. A method for adaptive sentence boundary disambiguation, comprising:
receiving, from a natural language processing system, a document containing text;
identifying, by a first heuristic algorithm, sentence text in the document;
identifying, by a second heuristic algorithm, non-sentence text in the document, wherein the second heuristic algorithm is operable to identify one of non-sentence texts in a group consisting of lists, tables, names of people, addresses, text without a sentence structure, text included as a list and spatially separated data as non-sentence text;
parsing said non-sentence text into one or more logical constructs, wherein each logical construct comprises a set of words;
inserting a disambiguator after each of said one or more logical constructs to define a sentence boundary for the logical construct based on one or more natural language structures; and
sending the disambiguated document to the natural language processing system, wherein the disambiguated document consists of disambiguated sentences, each disambiguated sentence having a defined boundary and including related contextual information, and the disambiguator is used to signal the natural language processing system the presence of a logical construct to be evaluated independently of other logical constructs.
Dependent claims: 2, 3, 4, 5, 6, 7
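As a rough illustration of the parsing and inserting steps recited in claim 1, the sketch below splits non-sentence lines into logical constructs and appends a disambiguator to each. It is a minimal sketch under stated assumptions: the splitting rules, the sentence test, and all names (`DISAMBIGUATOR`, `is_sentence`, `parse_constructs`, `disambiguate`) are hypothetical, and "?" is used only because the abstract gives it as an example character.

```python
import re
from typing import List

DISAMBIGUATOR = "?"  # example character from the abstract; assumed configurable

# All three patterns are illustrative assumptions, not the claimed heuristics.
_SENTENCE = re.compile(r"^[A-Z].*[.!?]$")
_BULLET = re.compile(r"^\s*(?:[-*\u2022]|\d+[.)])\s+")
_GAP = re.compile(r"\t+|\s{2,}")  # tabs or wide gaps between columns


def is_sentence(line: str) -> bool:
    """Assumed sentence test: capitalized, punctuated, at least four words."""
    text = line.strip()
    return bool(_SENTENCE.match(text)) and len(text.split()) >= 4


def parse_constructs(line: str) -> List[str]:
    """Split one non-sentence line into logical constructs (sets of words)."""
    text = _BULLET.sub("", line).strip()
    return [part.strip() for part in _GAP.split(text) if part.strip()]


def disambiguate(document: str) -> str:
    """Return the document with a disambiguator after each logical construct."""
    out: List[str] = []
    for line in document.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if is_sentence(line):
            out.append(line.strip())  # sentence text passes through unchanged
        else:
            out.extend(c + DISAMBIGUATOR for c in parse_constructs(line))
    return "\n".join(out)
```

With these assumed rules, the tab-separated line "John Smith", "123 Main St", "Austin TX" yields three constructs, each on its own line and each terminated by the disambiguator, so a downstream natural language processing system could evaluate them independently.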
8. A non-transitory computer-readable storage medium carrying computer-executable program instructions translatable to implement a method for adaptive sentence boundary disambiguation, comprising:
receiving, from a natural language processing system, a document containing text;
identifying, by a first heuristic algorithm, sentence text in the document;
identifying, by a second heuristic algorithm, non-sentence text in the document, wherein the second heuristic algorithm is operable to identify one of non-sentence texts in a group consisting of lists, tables, names of people, addresses, text without a sentence structure, text included as a list and spatially separated data as non-sentence text;
parsing said non-sentence text into one or more logical constructs, wherein each logical construct comprises a set of words;
inserting a disambiguator after each of said one or more logical constructs to define a sentence boundary for the logical construct based on one or more natural language structures, and
sending the disambiguated document to the natural language processing system, wherein the disambiguated document consists of disambiguated sentences, each disambiguated sentence having a defined boundary and including related contextual information, and the disambiguator is used to signal the natural language processing system the presence of a logical construct to be evaluated independently of other logical constructs.
Dependent claims: 9, 10, 11, 12, 13
14. A system for adaptive sentence boundary disambiguation, comprising:
a processor;
a computer-readable storage medium accessible by said processor and carrying program instructions executable by said processor to:
receive, from a natural language processing system, a document containing text;
identify, by a first heuristic algorithm, sentence text in the document;
identify, by a second heuristic algorithm, non-sentence text in the document, wherein the second heuristic algorithm is operable to identify one of non-sentence texts in a group consisting of lists, tables, names of people, addresses, text without a sentence structure, text included as a list and spatially separated data as non-sentence text;
parse said non-sentence text into one or more logical constructs, wherein each logical construct comprises a set of words; and
insert a disambiguator after each of said one or more logical constructs to define a sentence boundary for the logical construct based on one or more natural language structures and send the disambiguated document to the natural language processing system, wherein the disambiguated document consists of disambiguated sentences, each disambiguated sentence having a defined boundary and including related contextual information, and the disambiguator is used to signal the natural language processing system the presence of a logical construct to be evaluated independently of other logical constructs.
Dependent claims: 15, 16, 17, 18
Specification