System and method for providing speech recognition using personal vocabulary in a network environment
First Claim
1. A method, comprising:
receiving data propagating in a network environment;
ignoring Joint Photographic Experts Group (JPEG) documents in the data;
identifying an audio and video media file in the data, wherein the audio and video media file is associated with a plurality of individuals;
generating a text file based on the audio and video media file;
comparing the text file to a plurality of blacklisted words;
dropping the text file if a blacklisted word is found in the text file;
identifying, using a processor, selected words within the text file based on a whitelist to create a first word list, wherein the first word list includes fewer words than the text file;
comparing the selected words in the first word list to a personal vocabulary database associated with an individual from the plurality of individuals, wherein the personal vocabulary database associated with the individual includes one or more words that the individual added to the personal vocabulary database, and wherein words in the personal vocabulary database associated with the individual may be marked as private; and
removing from the first word list, one or more of the selected words to create a second word list based on the selected words not being found in the personal vocabulary database associated with the individual, wherein the second word list includes fewer words than the first word list, wherein at least one of the selected words that is removed is associated with a false positive from two words that phonetically sound similar.
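The filtering steps of the claim can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `tokenize` and `build_word_lists`, the use of plain sets for the blacklist, whitelist, and personal vocabulary, and the sample transcript are all assumptions made for the example. The transcription step and the "private" marking on vocabulary entries are not modeled.

```python
import string
from typing import Optional


def tokenize(text: str) -> list[str]:
    """Lowercase the transcript and strip surrounding punctuation from each word."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def build_word_lists(
    transcript: str,
    blacklist: set[str],
    whitelist: set[str],
    personal_vocab: set[str],
) -> Optional[tuple[list[str], list[str]]]:
    """Return (first_word_list, second_word_list), or None if the text is dropped."""
    words = tokenize(transcript)

    # Drop the whole text file if any blacklisted word appears in it.
    if any(w in blacklist for w in words):
        return None

    # First word list: only the words that appear on the whitelist.
    first = [w for w in words if w in whitelist]

    # Second word list: remove words not found in the individual's
    # personal vocabulary (hypothetical in-memory stand-in for the database).
    second = [w for w in first if w in personal_vocab]
    return first, second


# Assumed scenario: the recognizer heard "meet" as "meat", a phonetic false positive.
transcript = "The team will meat to discuss router firmware."
result = build_word_lists(
    transcript,
    blacklist=set(),
    whitelist={"meat", "router", "firmware"},
    personal_vocab={"router", "firmware", "switch"},
)
print(result)  # (['meat', 'router', 'firmware'], ['router', 'firmware'])
```

In this sketch the word "meat" passes the whitelist but is removed in the second pass because it is absent from the speaker's personal vocabulary, which is how a phonetically similar false positive would be filtered out.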
Abstract
A method is provided in one example and includes receiving a media file and generating a text file based on the media file. The method includes identifying selected words within the text file based on a whitelist that includes a plurality of designated words to be tagged. The selected words are compared to a group of words associated with an individual. One or more of the selected words are removed based on the selected words not being found in the group of words associated with the individual. In more specific embodiments, the method includes generating a resultant after removing one or more of the selected words; the resultant can be separated into fields that identify a title and an author associated with the resultant. At least one of the selected words that is removed is associated with a false positive arising from two words that sound phonetically similar.
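One way to picture the "resultant" with its separate fields is a small record type. The sketch below is only an assumption about its shape: the abstract names a title field and an author field, while the `Resultant` class, the `tags` field, and the `make_resultant` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resultant:
    """Record produced after word removal; title and author come from the abstract."""
    title: str
    author: str
    # Hypothetical field holding the words that survived filtering.
    tags: list[str] = field(default_factory=list)


def make_resultant(title: str, author: str, remaining_words: list[str]) -> Resultant:
    """Build a resultant, keeping each remaining word once in first-seen order."""
    unique_words = list(dict.fromkeys(remaining_words))
    return Resultant(title=title, author=author, tags=unique_words)


record = make_resultant("Q3 network review", "jsmith", ["router", "firmware", "router"])
print(record.title, record.author, record.tags)
```

Keeping title and author as distinct fields means a downstream consumer can search or group by either one without parsing free text.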
192 Citations
18 Claims
1. A method, as recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6
7. Logic encoded in one or more non-transitory media that includes code for execution and when executed by a processor is operable to perform operations comprising:
receiving data propagating in a network environment;
ignoring Joint Photographic Experts Group (JPEG) documents in the data;
identifying an audio and video media file in the data, wherein the audio and video media file is associated with a plurality of individuals;
generating a text file based on the audio and video media file;
comparing the text file to a plurality of blacklisted words;
dropping the text file if a blacklisted word is found in the text file;
identifying selected words within the text file based on a whitelist to create a first word list, wherein the first word list includes fewer words than the text file;
comparing the selected words in the first word list to a personal vocabulary database associated with an individual from the plurality of individuals, wherein the personal vocabulary database associated with the individual includes one or more words that the individual added to the personal vocabulary database, and wherein words in the personal vocabulary database associated with the individual may be marked as private; and
removing from the first word list, one or more of the selected words to create a second word list based on the selected words not being found in the personal vocabulary database associated with the individual, wherein the second word list includes fewer words than the first word list, wherein at least one of the selected words that is removed is associated with a false positive from two words that phonetically sound similar.
Dependent claims: 8, 9, 10, 11, 12
13. An apparatus, comprising:
a memory element configured to store data;
a processor operable to execute instructions associated with the data;
a network sensor configured to interface with the memory element and the processor, the network sensor being configured to:
receive data propagating in a network environment;
ignore Joint Photographic Experts Group (JPEG) documents in the data;
identify an audio and video media file in the data, wherein the audio and video media file is associated with a plurality of individuals;
generate a text file based on the audio and video media file;
compare the text file to a plurality of blacklisted words;
drop the text file if a blacklisted word is found in the text file;
identify selected words within the text file based on a whitelist to create a first word list;
compare the selected words in the first word list to a personal vocabulary database associated with an individual from the plurality of individuals, wherein the personal vocabulary database associated with the individual includes one or more words that the individual added to the personal vocabulary database, and wherein words in the personal vocabulary database associated with the individual may be marked as private; and
remove from the first word list one or more of the selected words to create a second word list based on the selected words not being found in the personal vocabulary database associated with the individual, wherein the second word list includes fewer words than the first word list, wherein at least one of the selected words that is removed is associated with a false positive from two words that phonetically sound similar.
Dependent claims: 14, 15, 16, 17, 18
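The first two tasks of the network sensor, ignoring JPEG documents and identifying media files, could be approximated as below. This is a hedged sketch only: the claims do not say how either is detected. The function name `classify_payload`, the three return labels, and the reliance on a declared content type for audio and video are assumptions. The JPEG check uses the standard three-byte signature `FF D8 FF` that begins JPEG files.

```python
from typing import Optional

# JPEG files begin with the bytes FF D8 FF.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def classify_payload(payload: bytes, content_type: Optional[str] = None) -> str:
    """Return 'ignore' for JPEG data, 'media' for audio/video, else 'other'."""
    # Ignore JPEG documents, detected by their leading signature bytes.
    if payload.startswith(JPEG_SIGNATURE):
        return "ignore"

    # Assumption: the capture layer supplies a MIME type for the payload.
    if content_type and content_type.lower().startswith(("audio/", "video/")):
        return "media"

    return "other"


print(classify_payload(b"\xff\xd8\xff\xe0" + b"\x00" * 4, "image/jpeg"))  # ignore
print(classify_payload(b"\x00\x00\x00\x18ftypmp42", "video/mp4"))         # media
print(classify_payload(b"hello", "text/plain"))                           # other
```

Only payloads labeled `media` would go on to the text-generation step; everything else is skipped before any transcription cost is incurred.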
Specification