Systems and methods for filtering dictated and non-dictated sections of documents
Abstract
A system and method for filtering documents to determine section boundaries between dictated and non-dictated text. The system and method identify portions of a text report that correspond to an original dictation and, conversely, those portions that are not part of the original dictation. The system and method include comparing tokenized and normalized forms of the original dictation and the final report, determining mismatches between the two forms, and applying machine-learning techniques to identify document headers, footers, page turns, macros, and lists automatically and accurately.
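The abstract refers to "tokenized and normalized forms" of the dictation and the report without defining them here. The following is a minimal sketch of one plausible normalization step, assuming lowercasing, punctuation stripping, and a small table of spoken-form replacements. The function name `normalize` and the `SPOKEN_FORMS` table are hypothetical and do not come from the patent.

```python
import re

# Hypothetical symbol-to-spoken-form table; the source does not list any.
SPOKEN_FORMS = {"%": "percent", "&": "and"}


def normalize(text: str) -> list[str]:
    """Lowercase the text, spell out a few symbols, and split into tokens."""
    text = text.lower()
    for symbol, word in SPOKEN_FORMS.items():
        text = text.replace(symbol, f" {word} ")
    # Keep runs of letters, digits, and apostrophes; all else is a separator.
    return re.findall(r"[a-z0-9']+", text)


if __name__ == "__main__":
    print(normalize("Patient's BP: 120/80, O2 98%."))
```

Applying the same function to both the recognizer output and the final report puts the two token streams in a comparable form before any alignment is attempted.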
30 Claims
1. A method for filtering dictated and non-dictated sections of documents, the method comprising steps of:
gathering speech recognition output and a first set of corresponding documents;
conforming at least one associated document from the first set of corresponding documents to a selected speech recognition format;
comparing the speech recognition output and the at least one associated document;
determining, using a processing unit, long homogeneous sequences of misaligned tokens from the speech recognition output and the at least one associated document;
detecting boundaries between dictated and non-dictated sections in the at least one associated document; and
annotating the at least one associated document with the boundaries.
(Dependent claims: 2-14)
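The comparing step of claim 1 can be illustrated with a generic sequence alignment. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in; the claim does not name an alignment algorithm, so this choice, the function name `label_misaligned`, and the sample tokens are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def label_misaligned(recognized: list[str], document: list[str]) -> list[bool]:
    """Return one flag per document token: True if the token has no aligned
    counterpart in the speech recognition output."""
    flags = [True] * len(document)
    # autojunk=False disables difflib's popular-element heuristic, which
    # would otherwise drop frequent tokens in long sequences.
    matcher = SequenceMatcher(a=recognized, b=document, autojunk=False)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            flags[j] = False
    return flags


if __name__ == "__main__":
    recognized = "the lungs are clear".split()
    document = "chest x ray report the lungs are clear signed dr smith".split()
    for token, misaligned in zip(document, label_misaligned(recognized, document)):
        print(f"{token:8s} {'MISALIGNED' if misaligned else 'aligned'}")
```

In this toy input the header ("chest x ray report") and the signature ("signed dr smith") have no counterpart in the recognizer output, so they are the tokens flagged as misaligned.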
15. A system for filtering dictated and non-dictated sections of electronic documents to determine dictated and non-dictated text in the documents, the system comprising:
a central processing unit;
a computer code operatively associated with the central processing unit, the computer code including:
a first set of instructions configured to gather speech recognition output and a first set of documents corresponding to the speech recognition output;
a second set of instructions configured to conform at least one associated document from the first set of corresponding documents to a selected speech recognition format;
a third set of instructions configured to compare the speech recognition output and the at least one associated document;
a fourth set of instructions configured to determine long homogeneous sequences of misaligned tokens from the speech recognition output and the at least one associated document;
a fifth set of instructions configured to detect boundaries between dictated and non-dictated sections in the at least one associated document; and
a sixth set of instructions configured to annotate the at least one associated document with the boundaries.
(Dependent claims: 16-28)
29. A method for identifying dictated and non-dictated sections of at least one document, the method comprising:
comparing, using a processing unit, speech recognition output and at least one associated document to label at least some tokens in the at least one associated document as misaligned tokens;
identifying at least one sequence of a predetermined number or more of consecutive tokens in the at least one associated document that are labeled as misaligned tokens;
based at least in part on the at least one identified sequence, identifying at least one boundary between at least one dictated section and at least one non-dictated section in the at least one associated document; and
annotating the at least one associated document with the at least one boundary.
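Claim 29 turns on finding sequences of "a predetermined number or more" of consecutive misaligned tokens and marking boundaries around them. A minimal sketch of that idea follows, assuming the per-token misalignment flags are already available. The threshold `min_run=3`, the `<nd>` and `</nd>` markers, and both function names are hypothetical; the claim does not fix a threshold or an annotation format.

```python
from itertools import groupby


def misaligned_runs(flags: list[bool], min_run: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for every run of at least
    min_run consecutive True (misaligned) flags."""
    runs, pos = [], 0
    for is_misaligned, group in groupby(flags):
        length = len(list(group))
        if is_misaligned and length >= min_run:
            runs.append((pos, pos + length))
        pos += length
    return runs


def annotate(tokens: list[str], runs: list[tuple[int, int]]) -> str:
    """Insert hypothetical <nd>...</nd> markers at the run boundaries."""
    starts = {start for start, _ in runs}
    ends = {end for _, end in runs}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            out.append("<nd>")
        out.append(token)
        if i + 1 in ends:
            out.append("</nd>")
    return " ".join(out)


if __name__ == "__main__":
    tokens = "chest x ray report the lungs were clear signed dr smith".split()
    # "were" was dictated differently, so it is an isolated misalignment.
    flags = [True, True, True, True, False, False, True, False, True, True, True]
    print(annotate(tokens, misaligned_runs(flags)))
```

The length threshold is what separates an isolated recognition error (the single misaligned "were") from a block of text that was never dictated, such as a header or signature.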
30. A system for identifying dictated and non-dictated sections of at least one document, the system comprising:
a central processing unit; and
a computer code operatively associated with the central processing unit, the computer code including instructions to cause the central processing unit to:
compare speech recognition output and at least one associated document to label at least some tokens in the at least one associated document as misaligned tokens;
identify at least one sequence of a predetermined number or more of consecutive tokens in the at least one associated document that are labeled as misaligned tokens;
based at least in part on the at least one identified sequence, identify at least one boundary between at least one dictated section and at least one non-dictated section in the at least one associated document; and
annotate the at least one associated document with the at least one boundary.
Specification