System and method for phonetic searching of data

US 9,405,828 B2
Filed: 09/06/2012
Issued: 08/02/2016
Est. Priority Date: 09/06/2012
Status: Active Grant

Abstract

A method of phonetically searching media information comprises receiving a plurality of search queries from one or more client systems and providing a phonetic representation of each search query. One or more search jobs are instantiated, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file. The archive file is stored within a distributed filing system (DFS) in which sequential blocks of data comprising the archive file are replicated to be locally available to one or more processors from a cluster of processors for executing the tasks. Each block stores index files corresponding to a plurality of source media files, each index file containing a phonetic stream corresponding to audio information for a given source media file. Each task obtains phonetic representations of outstanding search queries for a block and sequentially searches the block for each outstanding search query.

16 Claims

1. A multiprocessor-implemented method of indexing media information within a Hadoop framework for phonetic searching, the method comprising:
- providing, within a Hadoop framework of processors, pointers to respective locations of source media files including audio information which is to be made searchable;
  
  wherein each pointer corresponds to a respective source media file;
  
  providing, within the Hadoop framework of processors, a respective set of one or more of the pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
  
  wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task;
  
  processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
  
  appending, within the Hadoop framework of processors, each of the respective binary index files to a respective associated one of a plurality of different archive files;
  
  each respective archive file comprising a searchable phonetic representation of the audio information appended thereto; and
  
  appending, within the Hadoop framework of processors, the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, said plurality of different archive files stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method according to claim 1 wherein each respective binary index file comprises a header indicating the start of said respective binary index file, an identifier linking said respective binary index file to the corresponding source media file, an indicator of the length of said respective binary index file and its corresponding probabilistic phonetic stream.
  - 3. The method according to claim 2 wherein the header of each respective binary index file further comprises offset indicators indicating start and end locations within the associated one archive file of index information for the corresponding source media file.
  - 4. The method according to claim 2 wherein each respective binary index file further comprises one or more of:
    - an indicator of number of audio channels or a speech type of the corresponding source media file.
  - 5. The method according to claim 1 wherein each of the sequential blocks stores information for at least 10 binary index files.
  - 6. The method according to claim 1 wherein block boundaries within said archive files do not correspond with index file boundaries.
  - 7. The method according to claim 1 wherein said appending comprises appending respective binary index files to respective different archive files in parallel.
  - 8. The method according to claim 1 wherein said source media files comprise recordings of contacts processed by a contact center.
  - 9. The method according to claim 1 wherein said source media files comprise one of television or radio broadcast programmes.
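
The record layout recited in claim 2 (a header marking the start of the index file, an identifier linking it to the source media file, a length indicator, and the probabilistic phonetic stream) can be illustrated with a minimal Python sketch. The `PIDX` start marker, the field widths, and the names `pack_index` and `iter_indexes` are assumptions made for illustration only; the claims do not specify a byte layout.

```python
import io
import struct

# Hypothetical layout (not taken from the patent): 4-byte start marker,
# 16-byte media identifier, 4-byte big-endian payload length, then payload.
MAGIC = b"PIDX"
HEADER = struct.Struct(">4s16sI")


def pack_index(media_id: bytes, phonetic_stream: bytes) -> bytes:
    """Serialize one binary index file as header plus phonetic stream."""
    if len(media_id) > 16:
        raise ValueError("media_id must fit in 16 bytes")
    header = HEADER.pack(MAGIC, media_id.ljust(16, b"\0"), len(phonetic_stream))
    return header + phonetic_stream


def iter_indexes(archive):
    """Yield (media_id, phonetic_stream) for each record, reading sequentially."""
    while True:
        head = archive.read(HEADER.size)
        if not head:
            return  # clean end of archive
        if len(head) < HEADER.size:
            raise ValueError("truncated header")
        magic, media_id, length = HEADER.unpack(head)
        if magic != MAGIC:
            raise ValueError("archive is not positioned at a record start")
        payload = archive.read(length)
        if len(payload) != length:
            raise ValueError("truncated phonetic stream")
        yield media_id.rstrip(b"\0"), payload


# Append two index files to one in-memory archive, then read them back.
archive = io.BytesIO()
archive.write(pack_index(b"call-0001", b"\x01\x02\x03"))
archive.write(pack_index(b"call-0002", b"\x04\x05"))
archive.seek(0)
records = list(iter_indexes(archive))
```

Because every record carries its own length, a reader can walk an archive front to back without a separate directory, which suits the sequential block reads described in claim 1.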

10. A computer program product for execution on processors of a distributed multi-processor system, the computer program product comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising;
  
  computer readable program code configured to provide pointers to respective locations of source media files including audio information which is to be made searchable, wherein each pointer corresponds to a respective source media file;
  
  computer readable program code configured to provide a respective set of the one or more pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
  
  wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task;
  
  processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
  
  computer readable program code configured to append each of the respective binary index files to a respective associated one of a plurality of different archive files;
  
  each archive file comprising a searchable phonetic representation of the audio information appended thereto; and
  
  computer readable program code configured to append the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine, and wherein each of the plurality of different archive files is stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files.
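
The indexing flow recited above (pointers grouped into sets for jobs, each set split into subsets for Map tasks, and each concurrently running task appending to a different archive) can be sketched in plain Python. This is a toy model that uses threads instead of Hadoop; `chunk`, `make_index`, `map_task`, `run_job`, and the `hdfs:///media/...` paths are hypothetical names chosen for illustration, and the indexer is a placeholder rather than a phonetic engine.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_index(pointer):
    # Stand-in for the phonetic indexer. A real Map task would read the media
    # file at `pointer` and emit its probabilistic phonetic stream.
    return f"<index of {pointer}>".encode()


def map_task(archive, subset):
    # Each task appends only to its own archive, so tasks running at the
    # same time never contend for one file.
    for pointer in subset:
        archive.extend(make_index(pointer))


def run_job(job_id, pointer_set, pointers_per_task):
    """Run one job: one Map task and one archive per subset of pointers."""
    subsets = chunk(pointer_set, pointers_per_task)
    archives = {f"{job_id}-task{n}": bytearray() for n in range(len(subsets))}
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(map_task, archive, subset)
                   for archive, subset in zip(archives.values(), subsets)]
        for future in futures:
            future.result()  # re-raise any exception from a task
    return archives


pointers = [f"hdfs:///media/rec{n}.wav" for n in range(6)]  # hypothetical paths
all_archives = {}
for job_number, pointer_set in enumerate(chunk(pointers, 4)):
    all_archives.update(run_job(f"job{job_number}", pointer_set, 2))
```

With six pointers, sets of four, and subsets of two, the sketch produces three archives. Giving every task its own archive is what removes the need for write locking between parallel tasks.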

11. A system comprising:
- a distributed multi-processor framework;
  
  a computer readable storage medium accessible by one or more of the processors of the distributed multi-processor framework;
  
  computer executable instructions stored on the computer readable storage media which when executed causes the distributed multi-processor framework to perform;
  
  providing, within a Hadoop framework of processors, pointers to respective locations of source media files including audio information which is to be made searchable;
  
  wherein each pointer corresponds to a respective source media file;
  
  providing, within the Hadoop framework of processors, a respective set of one or more of the pointers to respective ones of a plurality of Hadoop Map Reduce Framework (MR) jobs, wherein each respective set comprises one or more subsets of the one or more of the pointers;
  
  wherein each MR job instantiates concurrently executing Map tasks, each Map task associated with one of the subsets of the one or more pointers and wherein each Map task;
  
  processes each of the corresponding source media files corresponding to the associated one of the subsets of the one or more pointers, and reads each of the corresponding source media files and generates a respective binary index file corresponding to a probabilistic phonetic stream of audio information for that corresponding source media file;
  
  appending, within the Hadoop framework of processors, each of the respective binary index files to a respective associated one of a plurality of different archive files;
  
  each respective archive file comprising a searchable phonetic representation of the audio information appended thereto; and
  
  appending, within the Hadoop framework of processors, the respective binary index file of the concurrently executing Map tasks to different ones of the plurality of different archive files in order for the concurrently executing Map tasks to run in parallel using separate processors, said plurality of different archive files stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising each respective archive file are replicated to be locally available to one or more processors from a cluster of processors for sequential reading of said sequential blocks, each block storing a plurality of the respective binary index files, wherein each respective binary index file is formatted to be compatible with search tasks running a phonetic speech search engine.
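
Since an archive is cut into fixed-size DFS blocks while index files vary in length, an index file can straddle two blocks (claim 6 states that block boundaries need not match index file boundaries). The following Python sketch works through the arithmetic. It assumes, as an illustration only, that a record belongs to the block containing its first byte; the claims do not say how a straddling record is assigned, and the function names are hypothetical.

```python
from itertools import accumulate


def record_spans(lengths):
    """Turn a list of record lengths into (start, end) byte offsets, end exclusive."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def owning_block(start, block_size):
    """Block number that contains the first byte of a record."""
    return start // block_size


def straddles(start, end, block_size):
    """True when a record's last byte lies in a later block than its first."""
    return (end - 1) // block_size != start // block_size


BLOCK_SIZE = 64  # toy value; real DFS blocks are far larger
spans = record_spans([40, 50, 30, 60])
owners = [owning_block(start, BLOCK_SIZE) for start, _ in spans]
crossing = [straddles(start, end, BLOCK_SIZE) for start, end in spans]
```

Here the second and fourth records cross a block boundary, so a task reading their owning block would need to read a little past the end of that block to finish them.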

12. A method of phonetically searching media information within a Hadoop framework of a cluster of processors, the method comprising:
- receiving, within a Hadoop framework of processors, a plurality of search queries from one or more client systems;
  
  providing, within the Hadoop framework of processors, a phonetic representation of each search query;
  
  instantiating, within the Hadoop framework of processors, one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the cluster of processors for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs;
  
  storing, within the Hadoop framework of processors, the index files of concurrently executing tasks in different archive files in order for the concurrently executing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information, wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
  
  for each task, obtaining phonetic representations of outstanding search queries for a block and sequentially searching said block for each outstanding search query; and
  
  responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, returning, within a Hadoop framework of processors, said location and an identifier of said source media file for responding to said one of the outstanding search queries.
- Dependent claims (13, 14)
  - 13. A method according to claim 12 wherein said returning comprises writing said location and said identifier to a distributed database.
  - 14. The method according to claim 12 wherein said source media files comprise at least one of:
    - recordings of contacts processed by a contact center;
    - one of television or radio broadcast programmes;
    - recordings of video calls; or
    - video recorded events.
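
The per-block search step of claim 12 (obtain the outstanding queries for a block, search the block sequentially for each, and return the match location with the source media identifier) can be sketched as follows. This is a simplified illustration: it uses exact matching of phoneme symbols in place of the probabilistic scoring a phonetic speech search engine would perform, and `Block`, `find_all`, `search_task`, and the phoneme strings are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    indexes: list  # list of (media_id, list of phoneme symbols)
    searched: set = field(default_factory=set)  # ids of queries already run here


def find_all(stream, pattern):
    """Return every offset where `pattern` occurs in `stream` (exact match)."""
    n = len(pattern)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == pattern]


def search_task(block, queries):
    """Run each outstanding query over the block, visiting its indexes in order."""
    outstanding = {qid: pattern for qid, pattern in queries.items()
                   if qid not in block.searched}
    hits = []
    for media_id, stream in block.indexes:
        for qid, pattern in outstanding.items():
            hits.extend((qid, media_id, offset) for offset in find_all(stream, pattern))
    block.searched.update(outstanding)  # these queries are no longer outstanding
    return hits


block = Block(0, [
    ("call-0001", ["S", "EY", "HH", "EH", "L", "OW"]),
    ("call-0002", ["G", "UH", "D", "B", "AY"]),
])
queries = {"q1": ["HH", "EH", "L", "OW"], "q2": ["B", "AY"]}
first_pass = search_task(block, queries)
second_pass = search_task(block, queries)  # nothing outstanding any more
```

Batching all outstanding queries into one pass means each block is read once per batch rather than once per query, which is the point of searching "for each outstanding search query" within a task.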

15. A computer program product for execution on a cluster of processors, the computer program product comprising:
- a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising;
  
  computer readable program code configured to receive a plurality of search queries from one or more client systems;
  
  computer readable program code configured to provide a phonetic representation of each search query;
  
  computer readable program code configured to instantiate one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the cluster of processors for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs;
  
  computer readable program code configured to store the index files of concurrently executing tasks in different archive files in order for the concurrently executing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information, and wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
  
  computer readable program code configured to, for each task, obtain phonetic representations of outstanding search queries for a block and sequentially search said block for each outstanding search query; and
  
  computer readable program code configured to, responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, return said location and an identifier of said source media file for responding to said one of the outstanding search queries.

16. A system comprising:
- a distributed multi-processor framework;
  
  a computer readable storage medium accessible by one or more of the processors of the distributed multiprocessor framework;
  
  computer executable instructions stored on the computer readable storage media which when executed causes the distributed multi-processor framework to perform;
  
  receiving a plurality of search queries from one or more client systems;
  
  providing a phonetic representation of each search query;
  
  instantiating one or more search jobs, each search job comprising a plurality of tasks, each task being arranged to sequentially read a block from an archive file, said archive file stored within a Hadoop distributed filing system (DFS) in which sequential blocks of data comprising said archive file are replicated to be locally available to one or more processors from the distributed multi-processor framework for executing said tasks, each block storing an aggregation of index files corresponding to a plurality of source media files, the index files being derived from Hadoop Map Reduce Framework (MR) jobs, wherein the index files are formatted to be compatible with search tasks running a phonetic speech search engine;
  
  storing the index files of concurrently executing tasks in different archive files in order for the concurrently executing indexing tasks to run in parallel using separate processors, each index file containing a probabilistic phonetic stream corresponding to audio information for a given source media file, wherein the aggregation of index files in each block provides a searchable phonetic representation of the audio information;
  
  for each task, obtaining phonetic representations of outstanding search queries for a block and sequentially searching said block for each outstanding search query; and
  
  responsive to matching one of the outstanding search queries with a location within said phonetic stream for an index file, returning said location and an identifier of said source media file for responding to said one of the outstanding search queries.

Current Assignee: Arlington Technologies, LLC (Dominion Harbor Enterprises, LLC)
Original Assignee: Avaya Incorporated
Inventors: Wilkins, Malcolm Fintan; Wynn, Gareth Alan
Primary Examiner(s): SYED, FARHAN M
Application Number: US13/605,055
Publication Number: US 20140067820A1
Time in Patent Office: 1,426 Days

Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/182 Distributed file systems
G06F 16/61 Indexing; Data structures t...
