Information processing method, information processing device, and recording medium for determining registered speakers as target speakers in speaker recognition

US 11,417,344 B2
Filed: 10/21/2019
Issued: 08/16/2022
Est. Priority Date: 10/24/2018
Status: Active Grant

Abstract

The information processing method in the present disclosure is performed as below. At least one speech segment is detected from speech input to a speech input unit. A first feature quantity is extracted from each speech segment detected, the first feature quantity identifying a speaker whose voice is contained in the speech segment. The first feature quantity extracted is compared with each of second feature quantities stored in storage and identifying the respective voices of registered speakers who are target speakers in speaker recognition. The comparison is performed for each of consecutive speech segments, and under a predetermined condition, among the second feature quantities stored in the storage, at least one second feature quantity whose similarity with the first feature quantity is less than or equal to a threshold is deleted, thereby removing the at least one registered speaker identified by the at least one second feature quantity.
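
As a rough illustration of the pruning step the abstract describes, the Python sketch below compares the feature vectors of recent speech segments against each registered speaker and drops the speakers that never rise above the threshold. The cosine similarity measure, the threshold of 0.7, the segment count `m`, and the function names are all assumptions made for this example; the source does not specify them.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def prune_speakers(registered, segment_features, threshold=0.7, m=3):
    """Return a copy of `registered` without speakers unmatched in the last m segments.

    registered: dict of speaker id -> stored feature vector
                (the "second feature quantities")
    segment_features: feature vectors of consecutive speech segments
                      (the "first feature quantities"), oldest first
    threshold, m: hypothetical values, not taken from the patent
    """
    if len(segment_features) < m:
        # Assumed "predetermined condition": m comparisons must have been made.
        return dict(registered)
    recent = segment_features[-m:]
    kept = {}
    for speaker_id, ref in registered.items():
        # A speaker stays only if some recent segment is similar enough.
        if any(cosine_similarity(f, ref) > threshold for f in recent):
            kept[speaker_id] = ref
    return kept
```

With two registered speakers and three consecutive segments that all resemble the first speaker, only the first speaker would remain a target for recognition under these assumptions.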

11 Claims

1. An information processing method performed by a computer, the information processing method comprising:
- detecting at least one speech segment from speech utterances that are sequentially input to a speech input unit;
  
  extracting, from each of the at least one speech segment, a first feature quantity identifying a speaker whose voice is contained in the speech segment;
  
  performing a comparison between the first feature quantity extracted and each of second feature quantities stored in a second storage as targets in speaker recognition for identifying respective voices of registered speakers, the second feature quantities being among second feature quantities pre-stored in a first storage and identifying respective voices of registered speakers;
  
  performing a parsing and management of the registered speakers in the second storage, based on results of the comparison, which is performed for each consecutive speech segment of the at least one speech segment, of:

  deleting, from the second storage, at least one second feature quantity among the second feature quantities when a degree of similarity between the first feature quantity in the consecutive speech segments, which is present for a fixed period of time or for a fixed number of times, and the at least one second feature quantity stored in the second storage is less than or equal to a threshold and a predetermined condition is satisfied, to remove at least one registered speaker identified by the at least one second feature quantity from the registered speakers stored in the second storage and reduce a total number of registered speakers as target speakers for speaker recognition; and

  when a first feature quantity having a degree of similarity between the first feature quantity and each of the second feature quantities stored in the second storage, which is less than or equal to a threshold, appears among first feature quantities in speech segments that follow the consecutive speech segments, storing, in the second storage, a second feature quantity having a degree of similarity between the first feature quantity that appeared among the first feature quantities and the second feature quantities stored in the first storage that is greater than a threshold, based on comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage; and

  adding, to the second storage, the first feature quantity that appeared among the first feature quantities as a feature quantity identifying a voice of a new registered speaker when a degree of similarity between the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage is less than or equal to a threshold based on a result of comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage, to increase the total number of registered speakers who are target speakers for speaker recognition.

2. The information processing method according to claim 1, wherein when the second feature quantities stored in the second storage include a second feature quantity having a degree of similarity higher than the threshold, the second feature quantity having a degree of similarity higher than the threshold is updated to a feature quantity including the first feature quantity and the second feature quantity having a degree of similarity higher than the threshold, to update information on a registered speaker identified by the second feature quantity having a degree of similarity higher than the threshold and stored in the second storage, the degree of similarity being a degree of similarity with the first feature quantity.

3. The information processing method according to claim 1, wherein the second storage pre-stores the second feature quantities.

4. The information processing method according to claim 1, further comprising:
- registering target speakers by (i) instructing each of the target speakers to utter first speech and inputting the respective first speech to the speech input unit, (ii) detecting first speech segments from the respective first speech, (iii) extracting, from the first speech segments, feature quantities in speech identifying the respective target speakers, and (iv) storing the feature quantities in the second storage as the second feature quantities.

5. The information processing method according to claim 1, wherein, as the predetermined condition, the comparison is performed a total of m times for the consecutive speech segments, where m is an integer greater than or equal to 2, and as a result of the comparison performed m times, when at least one second feature quantity having a degree of similarity less than or equal to the threshold is included, at least one registered speaker identified by the at least one second feature quantity is removed, the degree of similarity being a degree of similarity with the first feature quantity extracted in each of the consecutive speech segments.

6. The information processing method according to claim 1, wherein, as the predetermined condition, the comparison is performed for a predetermined period, and as a result of the comparison performed for the predetermined period, when at least one second feature quantity having a degree of similarity less than or equal to the threshold is included, at least one registered speaker identified by the at least one second feature quantity is removed, the degree of similarity being a degree of similarity with the first feature quantity.

7. The information processing method according to claim 1, wherein, when the second storage stores, as the second feature quantities, second feature quantities identifying two or more respective registered speakers who are target speakers in speaker recognition, at least one registered speaker identified by the at least one second feature quantity is removed.

8. The information processing method according to claim 1, wherein in the detecting, speech segments are detected consecutively in a time sequence from speech input to the speech input unit.

9. The information processing method according to claim 1, wherein in the detecting, speech segments are detected at predetermined intervals from speech input to the speech input unit.
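
Two of the dependent claims lend themselves to a short sketch: the update of a matched speaker's stored feature (claim 2) and the period-based removal condition (claim 6). The Python below is one possible reading, not the patented implementation. The running-mean update rule, the use of last-match timestamps to model the predetermined period, and both function names are assumptions introduced here.

```python
def update_feature(stored, count, new):
    """Fold a newly matched feature into the stored one.

    Uses a running mean over `count` earlier observations (assumed rule;
    the claim only says the updated feature "includes" both quantities).
    Returns the merged vector and the new observation count.
    """
    merged = [(s * count + n) / (count + 1) for s, n in zip(stored, new)]
    return merged, count + 1

def stale_speakers(last_match_time, now, period_s):
    """Return ids of speakers with no above-threshold match for period_s seconds.

    last_match_time: dict of speaker id -> time (seconds) of the most recent
                     segment whose similarity exceeded the threshold
    period_s: hypothetical stand-in for the "predetermined period"
    """
    return [sid for sid, t in last_match_time.items() if now - t >= period_s]
```

A caller would apply `update_feature` whenever a segment matches a registered speaker, and would delete the ids returned by `stale_speakers` from the set of target speakers.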

10. An information processing device comprising:
- a non-transitory computer-readable recording medium configured to store a program thereon; and
  
  a hardware processor configured to execute the program and cause the information processing device to:
  
  detect at least one speech segment from speech utterances that are sequentially input;
  
  extract, from each of the at least one speech segment, a first feature quantity identifying a speaker whose voice is contained in the speech segment;
  
  perform a comparison between the first feature quantity extracted and each of second feature quantities stored in a second storage as targets in speaker recognition for identifying respective registered speakers, the second feature quantities being among second feature quantities pre-stored in a first storage and identifying respective voices of registered speakers;
  
  perform a parsing and management of the registered speakers in the second storage, based on results of the comparison, which is performed for each consecutive speech segment of the at least one speech segment, and which includes to:
  
  remove at least one registered speaker identified by at least one second feature quantity from the registered speakers stored in the second storage when a degree of similarity between the first feature quantity in the consecutive speech segments, which is present for a fixed period of time or for a fixed number of times, and the at least one second feature quantity stored in the second storage is less than or equal to a threshold and a predetermined condition is satisfied, to reduce a total number of registered speakers as target speakers for speaker recognition; and
  
  when a first feature quantity having a degree of similarity between the first feature quantity and each of the second feature quantities stored in the second storage, which is less than or equal to a threshold, appears among first feature quantities in speech segments that follow the consecutive speech segments, storing, in the second storage, a second feature quantity having a degree of similarity between the first feature quantity that appeared among the first feature quantities and the second feature quantities stored in the first storage that is greater than a threshold, based on comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage; and

  add, to the second storage, the first feature quantity that appeared among the first feature quantities as a feature quantity identifying a voice of a new registered speaker when a degree of similarity between the first feature quantity and each of the second feature quantities stored in the storage is less than or equal to a threshold based on a result of comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage, to increase a total number of registered speakers who are target speakers for speaker recognition.

11. A non-transitory computer-readable recording medium for use in a computer, the recording medium having a program recorded thereon for causing the computer to perform an information processing method, the information processing method comprising:
- detecting at least one speech segment from speech utterances that are sequentially input to a speech input unit;
  
  extracting, from each of the at least one speech segment, a first feature quantity identifying a speaker whose voice is contained in the speech segment;
  
  performing a comparison between the first feature quantity extracted and each of second feature quantities stored in a second storage as targets in speaker recognition for identifying respective voices of registered speakers, the second feature quantities being among second feature quantities pre-stored in a first storage and identifying respective voices of registered speakers;
  
  performing a parsing and management of the registered speakers in the second storage, based on results of the comparison, which is performed for each consecutive speech segment of the at least one speech segment, of:

  deleting, from the second storage, at least one second feature quantity among the second feature quantities when a degree of similarity between the first feature quantity in the consecutive speech segments, which is present for a fixed period of time or for a fixed number of times, and the at least one second feature quantity stored in the second storage is less than or equal to a threshold and a predetermined condition is satisfied, to remove at least one registered speaker identified by the at least one second feature quantity from the registered speakers stored in the second storage and reduce a total number of registered speakers as target speakers for speaker recognition; and

  when a first feature quantity having a degree of similarity between the first feature quantity and each of the second feature quantities stored in the second storage, which is less than or equal to a threshold, appears among first feature quantities in speech segments that follow the consecutive speech segments, storing, in the second storage, a second feature quantity having a degree of similarity between the first feature quantity that appeared among the first feature quantities and the second feature quantities stored in the first storage that is greater than a threshold, based on comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage; and

  adding, to the second storage, the first feature quantity that appeared among the first feature quantities as a feature quantity identifying a voice of a new registered speaker when a degree of similarity between the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage is less than or equal to a threshold based on a result of comparing the first feature quantity that appeared among the first feature quantities and each of the second feature quantities stored in the first storage, to increase the total number of registered speakers who are target speakers for speaker recognition.
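
The restore-or-add steps recited in the independent claims can be sketched as follows. This is a minimal Python illustration under stated assumptions: cosine similarity as the degree of similarity, a single threshold of 0.7 for every comparison, and the hypothetical names `admit_speaker`, `first_storage`, and `second_storage`. The claims themselves do not fix a similarity measure, threshold values, or identifiers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def admit_speaker(feature, first_storage, second_storage, threshold=0.7):
    """Process one segment feature and return (speaker_id, action).

    first_storage: dict id -> vector, every pre-stored registered speaker
    second_storage: dict id -> vector, current target speakers (mutated)
    action is "matched", "restored", or "added".
    """
    # Case 1: the feature matches a current target speaker.
    for sid, ref in second_storage.items():
        if cosine(feature, ref) > threshold:
            return sid, "matched"

    # Case 2: no target matches, so search the full pre-stored set and
    # copy the best entry above the threshold into the second storage.
    best_id, best_sim = None, threshold
    for sid, ref in first_storage.items():
        sim = cosine(feature, ref)
        if sim > best_sim:
            best_id, best_sim = sid, sim
    if best_id is not None:
        second_storage[best_id] = first_storage[best_id]
        return best_id, "restored"

    # Case 3: unknown everywhere, so register the feature as a new speaker.
    n = 0
    while f"new-{n}" in second_storage or f"new-{n}" in first_storage:
        n += 1
    new_id = f"new-{n}"
    second_storage[new_id] = list(feature)
    return new_id, "added"
```

Keeping the small second storage as the only set searched on every segment is what keeps the per-segment comparison cost low; the larger first storage is consulted only when no current target matches.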

Current Assignee: Panasonic Intellectual Property Corporation of America (Panasonic Holdings Corporation)
Original Assignee: Panasonic Intellectual Property Corporation of America (Panasonic Holdings Corporation)
Inventors: Doi, Misaki
Primary Examiner(s): Leland, III, Edwin S
Application Number: US16/658,769
Publication Number: US 20200135211A1
Time in Patent Office: 1,030 Days
Field of Search

704246
US Class Current
CPC Class Codes:
- G10L 17/02   Preprocessing operations, e...
- G10L 17/04   Training, enrolment or mode...
- G10L 17/06   Decision making techniques;...
- G10L 17/22   Interactive procedures; Man...
