System and method for phonetic searching of data
Abstract
A method of phonetically searching media information comprises receiving a plurality of search queries from one or more client systems and providing a phonetic representation of each search query. One or more search jobs are instantiated, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file. The archive file is stored within a distributed filing system (DFS) in which sequential blocks of data comprising the archive file are replicated to be locally available to one or more processors from a cluster of processors for executing the tasks. Each block stores index files corresponding to a plurality of source media files, each index file containing a phonetic stream corresponding to audio information for a given source media file. Each task obtains phonetic representations of outstanding search queries for a block and sequentially searches the block for each outstanding search query.
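The per-block search loop described in the abstract can be sketched as follows. This is an illustration under simplifying assumptions, not the patented implementation: a block is modelled as an in-memory list, each index file holds a plain tuple of phoneme symbols in place of a probabilistic phonetic stream, and matching is exact rather than scored. The names `IndexFile`, `find_all` and `search_block` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Phonemes = Tuple[str, ...]


@dataclass(frozen=True)
class IndexFile:
    media_id: str       # identifier of the source media file
    phonemes: Phonemes  # simplified stand-in for the probabilistic phonetic stream


def find_all(stream: Phonemes, query: Phonemes) -> List[int]:
    """Return every offset at which `query` occurs exactly in `stream`."""
    m = len(query)
    if m == 0:
        return []
    return [i for i in range(len(stream) - m + 1) if stream[i:i + m] == query]


def search_block(block: List[IndexFile],
                 queries: Dict[str, Phonemes]) -> List[Tuple[str, str, int]]:
    """One task: walk the block once, testing every outstanding query
    against each index file, and report (query id, media id, offset)."""
    hits = []
    for index_file in block:  # sequential pass over the block
        for query_id, phonemes in queries.items():
            for offset in find_all(index_file.phonemes, phonemes):
                hits.append((query_id, index_file.media_id, offset))
    return hits


block = [
    IndexFile("call_001.wav", ("HH", "AH", "L", "OW")),
    IndexFile("call_002.wav", ("L", "OW", "L", "OW")),
]
queries = {"q1": ("L", "OW"), "q2": ("W", "ER")}
hits = search_block(block, queries)
```

Each hit carries the location within the stream and the identifier of the source media file, which is the pair the method returns in response to a matched query.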
16 Claims
1. A multiprocessor-implemented method of indexing media information within a Hadoop framework for phonetic searching, the method comprising:
providing, within a Hadoop framework of processors, pointers to respective locations of source media files including audio information which is to be made searchable;
wherein each pointer corresponds to a respective source media file;
providing, within the Hadoop framework of processors, a respective set of one or more of the pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task:
processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and
reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
appending, within the Hadoop framework of processors, each of the respective binary index files to a respective associated one of a plurality of different archive files;
each respective archive file comprising a searchable phonetic representation of the audio information appended thereto; and
appending, within the Hadoop framework of processors, the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, said plurality of different archive files stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
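The indexing flow recited in claim 1 can be illustrated with a small local sketch. It is not part of the claim language and makes several assumptions: threads and a local directory stand in for Hadoop Map tasks and the DFS, `make_index` is a placeholder that hashes the pointer instead of producing a phonetic index, and the length-prefixed record layout is invented for the example. All function names are hypothetical.

```python
import hashlib
import struct
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Sequence, Tuple


def make_index(pointer: str) -> bytes:
    # Placeholder only: a real Map task would read the media file at
    # `pointer` and emit a binary phonetic index for it.
    return hashlib.sha256(pointer.encode("utf-8")).digest()


def map_task(task_id: int, pointers: Sequence[str], out_dir: Path) -> Path:
    """Index one subset of pointers, appending every index file to an
    archive owned by this task so concurrent tasks never share a file."""
    archive = out_dir / f"archive-{task_id:05d}.bin"
    with archive.open("ab") as f:
        for pointer in pointers:
            key = pointer.encode("utf-8")
            payload = make_index(pointer)
            # Assumed record layout: two big-endian lengths, key, payload.
            f.write(struct.pack(">II", len(key), len(payload)))
            f.write(key)
            f.write(payload)
    return archive


def run_job(pointers: Sequence[str], n_tasks: int, out_dir: Path) -> List[Path]:
    """Split the pointers into subsets and run one task per subset."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subsets = [pointers[i::n_tasks] for i in range(n_tasks)]
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        return list(pool.map(map_task, range(n_tasks), subsets,
                             [out_dir] * n_tasks))


def read_archive(path: Path) -> List[Tuple[str, bytes]]:
    """Read an archive back sequentially as (pointer, index bytes) pairs."""
    data = path.read_bytes()
    records, pos = [], 0
    while pos < len(data):
        key_len, payload_len = struct.unpack_from(">II", data, pos)
        pos += 8
        key = data[pos:pos + key_len].decode("utf-8")
        pos += key_len
        records.append((key, data[pos:pos + payload_len]))
        pos += payload_len
    return records
```

Giving each task its own archive file is what lets the tasks append without coordinating with one another, and storing many small index records in one archive is what makes a later sequential read of a block worthwhile.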
10. A computer program product for execution on processors of a distributed multi-processor system, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to provide pointers to respective locations of source media files including audio information which is to be made searchable, wherein each pointer corresponds to a respective source media file;
computer readable program code configured to provide a respective set of the one or more pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task:
processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and
reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
computer readable program code configured to append each of the respective binary index files to a respective associated one of a plurality of different archive files;
each archive file comprising a searchable phonetic representation of the audio information appended thereto; and
computer readable program code configured to append the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine, and wherein each of the plurality of different archive files is stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files.
11. A system comprising:
a distributed multi-processor framework;
a computer readable storage medium accessible by one or more of the processors of the distributed multi-processor framework;
computer executable instructions stored on the computer readable storage media which when executed causes the distributed multi-processor framework to perform:
providing, within a Hadoop framework of processors, pointers to respective locations of source media files including audio information which is to be made searchable;
wherein each pointer corresponds to a respective source media file;
providing, within the Hadoop framework of processors, a respective set of one or more of the pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task:
processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and
reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
appending, within the Hadoop framework of processors, each of the respective binary index files to a respective associated one of a plurality of different archive files;
each respective archive file comprising a searchable phonetic representation of the audio information appended thereto; and
appending, within the Hadoop framework of processors, the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, said plurality of different archive files stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine.
12. A method of phonetically searching media information within a Hadoop framework of a cluster of processors, the method comprising:
receiving, within a Hadoop framework of processors, a plurality of search queries from one or more client systems;
providing, within the Hadoop framework of processors, a phonetic representation of each search query;
instantiating, within the Hadoop framework of processors, one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the cluster of processors for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs;
storing, within the Hadoop framework of processors, the index files of concurrently executing tasks in different archive files in order for the concurrently executing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information, wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
for each task, obtaining phonetic representations of outstanding search queries for a block and sequentially searching said block for each outstanding search query; and
responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, returning, within a Hadoop framework of processors, said location and an identifier of said source media file for responding to said one of the outstanding search queries.
Dependent claims: 13, 14
15. A computer program product for execution on a cluster of processors, the computer program product comprising:
a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to receive a plurality of search queries from one or more client systems;
computer readable program code configured to provide a phonetic representation of each search query;
computer readable program code configured to instantiate one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the cluster of processors for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs;
computer readable program code configured to store the index files of concurrently executing tasks in different archive files in order for the concurrently executing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information, and wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
computer readable program code configured to, for each task, obtain phonetic representations of outstanding search queries for a block and sequentially search said block for each outstanding search query; and
computer readable program code configured to, responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, return said location and an identifier of said source media file for responding to said one of the outstanding search queries.
16. A system comprising:
a distributed multi-processor framework;
a computer readable storage medium accessible by one or more of the processors of the distributed multi-processor framework;
computer executable instructions stored on the computer readable storage media which when executed causes the distributed multi-processor framework to perform:
receiving a plurality of search queries from one or more client systems;
providing a phonetic representation of each search query;
instantiating one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the distributed multi-processor framework for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs, wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
storing the index files of concurrently executing tasks in different archive files in order for the concurrently executing indexing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information;
for each task, obtaining phonetic representations of outstanding search queries for a block and sequentially searching said block for each outstanding search query; and
responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, returning said location and an identifier of said source media file for responding to said one of the outstanding search queries.
Specification