System and method for document filtering
Abstract
A system and method for filtering documents to determine section boundaries between dictated and non-dictated text. The system and method identify the portions of a text report that correspond to an original dictation and, conversely, the portions that are not part of the original dictation. The system and method include comparing tokenized and normalized forms of the original dictation and the final report, determining mismatches between the two forms, and applying machine-learning techniques to identify document headers, footers, page turns, macros, and lists automatically and accurately.
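The abstract's comparison works on tokenized and normalized forms of the dictation and the report. The following is a minimal sketch of one way such a normalization step could look. The `SPOKEN_FORMS` table, the `normalize` function, and the token pattern are all illustrative assumptions and are not taken from the patent.

```python
import re

# Hypothetical written-to-spoken mapping; a real system would need a much
# larger table matched to the conventions of its speech recognizer.
SPOKEN_FORMS = {"%": "percent", "&": "and", "mg": "milligrams", "dr.": "doctor"}


def normalize(text: str) -> list[str]:
    """Lowercase the text, split it into tokens, and map written forms
    (abbreviations, symbols) to the words a speaker would have said."""
    raw_tokens = re.findall(r"[a-z]+\.?|\d+|[%&]", text.lower())
    tokens = []
    for tok in raw_tokens:
        if tok in SPOKEN_FORMS:
            tokens.append(SPOKEN_FORMS[tok])
        else:
            bare = tok.rstrip(".")  # drop sentence-final periods
            tokens.append(SPOKEN_FORMS.get(bare, bare))
    return tokens


if __name__ == "__main__":
    print(normalize("Dr. Smith gave 5 mg & rest."))
```

Once both the recognizer output and the report are reduced to the same token vocabulary, differences between them are more likely to reflect text that was never dictated than differences in formatting.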
33 Claims
1. A method for filtering dictated and non-dictated sections of documents, comprising the steps of:
gathering speech recognition output and a first set of corresponding documents;
conforming at least one associated document from the first set of corresponding documents to a selected speech recognition format;
comparing the speech recognition output and the at least one associated document;
determining long homogeneous sequences of misaligned tokens from the speech recognition output and the at least one associated document;
detecting boundaries between dictated and non-dictated sections in the at least one associated document; and
annotating the at least one associated document with the boundaries.
(Dependent claims: 2-13)
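The comparing, determining, detecting, and annotating steps of claim 1 can be illustrated with a short sketch. It assumes both inputs are already token lists in the same format, uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment the patent actually employs, and treats any misaligned run of at least `min_run` document tokens as non-dictated. The function names, the `min_run` threshold, and the `<non-dictated>` markers are hypothetical.

```python
from difflib import SequenceMatcher


def find_non_dictated_spans(asr_tokens, doc_tokens, min_run=4):
    """Return (start, end) index pairs into doc_tokens for long runs of
    document tokens that have no counterpart in the recognizer output."""
    matcher = SequenceMatcher(a=asr_tokens, b=doc_tokens, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" mark document tokens that failed to align.
        if tag in ("insert", "replace") and (j2 - j1) >= min_run:
            spans.append((j1, j2))
    return spans


def annotate(doc_tokens, spans):
    """Insert boundary markers (hypothetical tag names) around each span."""
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    out = []
    for i, tok in enumerate(doc_tokens):
        if i in ends:
            out.append("</non-dictated>")
        if i in starts:
            out.append("<non-dictated>")
        out.append(tok)
    if len(doc_tokens) in ends:  # span that runs to the end of the document
        out.append("</non-dictated>")
    return " ".join(out)


if __name__ == "__main__":
    asr = "the patient reports chest pain for two days".split()
    doc = ("general hospital radiology department report "
           "the patient reports chest pain for two days").split()
    spans = find_non_dictated_spans(asr, doc)
    print(annotate(doc, spans))
```

The length threshold is what separates a header or footer from an ordinary recognition error: a single misrecognized word produces a short mismatch, while a block of text that was never spoken produces a long one.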
14. A method for filtering dictated and non-dictated sections of documents, comprising the steps of:
gathering a set of documents having dictated and non-dictated section boundaries;
featurizing text in at least one document from the set of documents;
differentiating dictated and non-dictated sections of text in the at least one document;
categorizing text of a second set of documents to identify dictated and non-dictated sections of text within at least one of the second set of documents; and
outputting dictated sections of the at least one document from the second set of documents to an automatic speech recognition process.
(Dependent claims: 15-20)
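Claim 14 describes a train-then-apply flow: featurize labeled text, learn to differentiate dictated from non-dictated sections, categorize new documents, and pass on the dictated parts. The sketch below illustrates that flow with a small naive Bayes classifier over lines of text. The claim does not name a learning algorithm, so naive Bayes, the line-level granularity, the feature set, and the training examples are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def featurize(line: str) -> list[str]:
    """Turn a line into simple lexical and layout features (illustrative)."""
    feats = [f"w={w.lower().strip('.,:;')}" for w in line.split()]
    if line.isupper():
        feats.append("ALL_CAPS")
    if line.rstrip().endswith(":"):
        feats.append("ENDS_COLON")
    if any(ch.isdigit() for ch in line):
        feats.append("HAS_DIGIT")
    return feats


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, lines, labels):
        for line, label in zip(lines, labels):
            self.label_counts[label] += 1
            for feat in featurize(line):
                self.feat_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, line):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for feat in featurize(line):
                score += math.log((self.feat_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# First set: lines with known labels (made-up examples).
TRAIN = [
    ("MEMORIAL CLINIC RADIOLOGY REPORT", "non-dictated"),
    ("Page 2 of 3", "non-dictated"),
    ("Electronically signed by:", "non-dictated"),
    ("the patient reports chest pain", "dictated"),
    ("lungs are clear to auscultation", "dictated"),
    ("the patient denies fever or chills", "dictated"),
]
model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])

# Second set: keep only the lines categorized as dictated.
new_doc = ["Page 1 of 3", "the patient reports no fever"]
dictated = [ln for ln in new_doc if model.predict(ln) == "dictated"]
```

In this sketch the `dictated` list is what would be handed to the speech recognition process named in the final step of the claim; how that hand-off happens is outside the scope of the example.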
21. A system for filtering dictated and non-dictated sections of electronic documents to determine dictated and non-dictated text in the documents, the system comprising:
a central processing unit;
a computer code operatively associated with the central processing unit, the computer code including:
a first set of instructions configured to gather speech recognition output and a first set of documents corresponding to the speech recognition output;
a second set of instructions configured to conform at least one associated document from the first set of corresponding documents to a selected speech recognition format;
a third set of instructions configured to compare the speech recognition output and the at least one associated document;
a fourth set of instructions configured to determine long homogeneous sequences of misaligned tokens from the speech recognition output and the at least one associated document;
a fifth set of instructions configured to detect boundaries between dictated and non-dictated sections in the at least one associated document; and
a sixth set of instructions configured to annotate the at least one associated document with the boundaries.
(Dependent claims: 22-33)
Specification