Sound packet reproducing method, sound packet reproducing apparatus, sound packet reproducing program, and recording medium

US 7,710,982 B2
Filed: 05/25/2005
Issued: 05/04/2010
Est. Priority Date: 05/26/2004
Status: Expired due to Fees

Abstract

The present invention prevents a receiving buffer from becoming empty by: storing received packets in the receiving buffer; detecting, by a state detecting part, the largest arrival delay jitter of the packets and the buffer level of the receiving buffer; obtaining, by a control part, an optimum buffer level for the largest delay jitter using a predetermined table; determining, based on the detected buffer level and the optimum buffer level, the level of urgency about the need to adjust the buffer level; and expanding or reducing, by a consumption adjusting part, the waveform of a decoded audio data stream of the current frame decoded from a packet read out of the receiving buffer, to adjust the consumption of reproduction frames on the basis of the urgency level, the detected buffer level, and the optimum buffer level.
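The detection, table lookup, and urgency steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the table values, the 20 ms frame interval, the two-packet threshold for "high" urgency, and all function names are assumptions introduced here.

```python
from bisect import bisect_left

# Hypothetical table: (upper bound of largest delay jitter in ms, optimum buffer level in packets).
# The source only says a "predetermined table" exists; these values are made up for illustration.
JITTER_TO_OPTIMUM = [(20, 1), (40, 2), (80, 3), (160, 5)]


def largest_jitter(arrival_ms, frame_ms=20):
    """Largest deviation of the inter-arrival gaps from the nominal frame interval."""
    gaps = [later - earlier for earlier, later in zip(arrival_ms, arrival_ms[1:])]
    return max((abs(gap - frame_ms) for gap in gaps), default=0)


def optimum_level(jitter_ms):
    """Look up the optimum buffer level; jitter beyond the table uses the last row."""
    bounds = [bound for bound, _ in JITTER_TO_OPTIMUM]
    row = min(bisect_left(bounds, jitter_ms), len(JITTER_TO_OPTIMUM) - 1)
    return JITTER_TO_OPTIMUM[row][1]


def urgency(buffered, optimum, high_threshold=2):
    """Classify the gap between the detected and the optimum buffer level.

    Returns (level, diff). A negative diff means the buffer is below the optimum,
    so playback should be expanded; a positive diff means it should be reduced.
    """
    diff = buffered - optimum
    if diff == 0:
        return "none", 0
    return ("high" if abs(diff) >= high_threshold else "low"), diff
```

For example, arrival times of 0, 20, 45, 60, and 110 ms give a largest jitter of 30 ms, which this illustrative table maps to an optimum level of 2 packets; an empty buffer would then be classified as high urgency.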

12 Claims

1. A reproducing method for receiving a stream of sent audio packets containing an audio code generated by encoding an input audio data stream frame by frame and reproducing an audio signal, comprising:
- (a) storing received packets in a receiving buffer;
  
  (b) detecting, in a control part, a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  (c) obtaining, in a control part and based on the largest delay jitter, an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being an optimum number of packets to be stored in the receiving buffer;
  
  (d) determining, in the control part, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets;
  
  (e) retrieving, by the control part, a packet corresponding to a current frame from the receiving buffer and decoding an audio code in the packet to obtain a decoded audio data stream in the current frame; and
  
  (f) performing, in a consumption adjusting part, any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference, and outputting a result as audio data of the current frame, wherein step (f) includes obtaining a pitch length of the decoded audio data stream, analyzing the audio data stream to determine whether the audio data stream is in a voice segment or a non-voice segment, and performing any of expansion, reduction, and preservation by inserting or removing a waveform corresponding to the pitch length in the decoded audio string or by not changing the decoded audio signal string, on the basis of a result of the determination of voice/non-voice segment and a result of the determination of the difference.

2. The reproducing method according to claim 1, wherein:
- step (d) comprises determining whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and
  
  step (f) further comprises, if the level represents the high urgency level, expanding or reducing the waveform of the decoded audio data stream regardless of whether the data stream is in a voice segment or a non-voice segment;
  
  if the level represents the low urgency level, expanding or reducing the waveform of the decoded audio data stream, on condition that the decoded audio data stream is in a non-voice segment.

3. The reproducing method according to claim 1, wherein:
- step (d) comprises determining whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and
  
  step (f) further comprises, if the level represents the high urgency level, expanding or reducing the waveform of the decoded audio data stream regardless of whether the decoded audio data stream is in a voice segment or a non-voice segment, if the level represents the low urgency level, expanding or reducing the waveform of the decoded audio data stream once every predetermined number N1 of frames when the decoded audio data stream is in a voice segment, or expanding or reducing the waveform of the decoded audio data stream once every predetermined number N2 of frames when the decoded audio data stream is in a non-voice period, where N1 and N2 being integers greater than or equal to 1 and N2 is smaller than N1.
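The rule recited in claim 3 might be sketched as below. The values of N1 and N2, the use of a frame counter with a modulo test, and the function name are assumptions for illustration; the claim only requires integers of at least 1 with N2 smaller than N1.

```python
def decide(urgency, diff, is_voice, frame_index, n1=4, n2=1):
    """Return 'expand', 'reduce', or 'preserve' for the current frame.

    urgency     -- 'high', 'low', or 'none' (hypothetical labels)
    diff        -- buffered packets minus optimum packets
    is_voice    -- True if the decoded frame is in a voice segment
    frame_index -- running frame counter (an assumption of this sketch)
    n1, n2      -- illustrative intervals for voice and non-voice frames, n2 < n1
    """
    if urgency == "none" or diff == 0:
        return "preserve"
    # Too few packets buffered: stretch playback. Too many: shorten it.
    action = "expand" if diff < 0 else "reduce"
    if urgency == "high":
        return action  # adjust regardless of voice/non-voice
    interval = n1 if is_voice else n2
    return action if frame_index % interval == 0 else "preserve"
```

Under low urgency this adjusts voice frames only once every `n1` frames, so audible changes in speech are rarer than in non-voice frames. The claim 2 variant would instead always return `"preserve"` for voice frames at low urgency.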

4. A reproducing method for receiving a stream of sent audio packets containing an audio code generated by encoding an input audio data stream frame by frame and reproducing an audio signal, comprising:
- (a) storing received packets in a receiving buffer;
  
  (b) detecting, in a control part, a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  (c) obtaining, in a control part and based on the largest delay jitter, an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being an optimum number of packets to be stored in the receiving buffer;
  
  (d) determining, in the control part, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets;
  
  (e) retrieving, by the control part, a packet corresponding to a current frame from the receiving buffer and decoding an audio code in the packet to obtain a decoded audio data stream in the current frame; and
  
  (f) performing, in a consumption adjusting part, any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference, and outputting a result as audio data of the current frame, wherein step (f) includes obtaining the pitch length of the decoded audio data stream, analyzing the decoded audio data stream to determine which of a voiced sound segment, an unvoiced sound segment, a background noise segment, and a silence segment the decoded audio data stream is in, and performing any of expansion, reduction, and preservation of the decoded audio data stream by inserting or removing a waveform corresponding to the pitch length in the decoded audio data stream or by not changing the decoded audio data stream, on the basis of the result of the segment determination and the result of the determination of the difference level.

5. The reproducing method according to claim 4, wherein:
- step (d) comprises determining whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and
  
  step (f) further comprises, if the level represents the high urgency level, expanding or reducing the waveform of the decoded audio data stream regardless of a result of the segment determination;
  
  if the level represents a low urgency level, expanding or reducing the waveform of the decoded audio data stream once every predetermined number N1, N2, N3, N4 of frames, the predetermined number being predetermined for each of a voiced sound segment, an unvoiced sound segment, a background noise segment, and a silence segment, where N1, N2, N3, and N4 are positive integers and at least one of the integers is greater than or equal to 2 and differs from the other three integers.
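One way to picture the four-segment rule of claim 5 is a per-segment interval table. The interval values, segment labels, and the "frames since the last adjustment" counter below are assumptions chosen only to satisfy the stated constraint (positive integers, at least one of them 2 or more and different from the rest).

```python
# Hypothetical intervals N1..N4: how many frames must pass between adjustments
# under low urgency, per segment type. The values are illustrative only.
LOW_URGENCY_INTERVAL = {
    "voiced": 8,    # N1: adjust rarely, changes are most audible here
    "unvoiced": 4,  # N2
    "noise": 2,     # N3
    "silence": 1,   # N4: adjust every frame
}


def should_adjust(urgency, segment, frames_since_last):
    """Decide whether the current frame may be expanded or reduced."""
    if urgency == "high":
        return True  # segment type is ignored under high urgency
    if urgency == "low":
        return frames_since_last >= LOW_URGENCY_INTERVAL[segment]
    return False  # buffer already at the optimum level
```

Whether the adjustment is an expansion or a reduction would still come from the sign of the buffer-level difference; this sketch covers only the timing.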

6. A reproducing apparatus for audio packets which receives a stream of sent audio packets containing an audio code generated by encoding an input audio data stream frame by frame and reproduces an audio signal, comprising:
- a packet receiving part configured to receive audio packets from a packet communication network;
  
  a receiving buffer configured to temporarily store the received packets and configured to read out packets in response to a request;
  
  a state detecting part configured to detect a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  a control part configured to obtain based on the largest delay jitter an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being an optimum number of packets to be stored in the receiving buffer, determine, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets, and generate a control signal to perform any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference;
  
  an audio packet decoding part configured to decode an audio code in a packet corresponding to a current frame extracted from the receiving buffer to obtain a decoded audio data stream in the current frame;
  
  a consumption adjusting part configured to perform any of expansion, reduction, and preservation of the waveform of the decoded audio data stream in accordance with the control signal and configured to output a result as sound data of the current frame; and
  
  an audio analyzing part configured to analyze the decoded audio data stream to determine whether the decoded audio data stream is in a voice segment or a non-voice segment, the audio analyzing part providing a result of the determination to the control part, the audio control part obtaining a pitch length of the decoded audio data stream and providing the pitch length to the consumption adjusting part, wherein the control part provides control to cause the consumption adjusting part to perform any of expansion, reduction, and preservation of the decoded audio data stream of the current frame, on the basis of a result of the segment determination and a result of the difference level determination, and the consumption adjusting part inserts or removes a waveform corresponding to the pitch length in the decoded audio data stream or does not change the decoded audio data stream, in accordance with the control.

7. The reproducing apparatus according to claim 6, wherein the control part determines whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and,
- if the level represents the high urgency level provides control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream regardless of whether the data stream is in a voice segment or a non-voice segment;
  
  if the level represents the low urgency level, provides control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream, when the decoded audio data stream is in a non-voice segment.

8. The reproducing apparatus according to claim 6, wherein the control part determines whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and,
- if the level represents the high urgency level, provides a control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream regardless of whether the decoded audio data stream is in a voice segment or a non-voice segment;
  
  if the level represents the low urgency level, provides a control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream once every predetermined number N1 of frames on the condition that the decoded audio data stream is in a voice segment, or to expand or reduce the waveform of the decoded audio data stream once every predetermined number N2 of frames when the decoded audio data stream is in a non-voice period, where N1 and N2 being integers greater than or equal to 1 and N2 is smaller than N1.

9. A reproducing apparatus for audio packets which receives a stream of sent audio packets containing an audio code generated by encoding an input audio data stream frame by frame and reproduces an audio signal, comprising:
- a packet receiving part configured to receive audio packets from a packet communication network;
  
  a receiving buffer configured to temporarily store the received packets and reading out packets in response to a request;
  
  a state detecting part configured to detect a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  a control part configured to obtain based on the largest delay jitter an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being the optimum number of packets to be stored in the receiving buffer, determine, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets, and generate a control signal for instructing to perform any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference;
  
  an audio packet decoding part configured to decode an audio code in a packet corresponding to a current frame extracted from the receiving buffer to obtain a decoded audio data stream in the current frame; and
  
  a consumption adjusting configured to perform any of expansion, reduction, and preservation of the waveform of the decoded audio data stream in accordance with the control signal and outputs a result as sound data of the current frame, wherein the audio analyzing part analyzes the decoded audio data stream to determine whether the decoded audio data stream includes a voiced sound segment, an unvoiced sound segment, a background noise segment, and a silence segment, provides a result of the determination to the control part, obtains a pitch length of the decoded audio data stream, and provides the pitch length to the consumption adjusting part;
  
  the control part provides a control based on a result of the segment determination and a result of the difference level determination to the consumption adjusting part to perform any of expansion, reduction, and preservation of the decoded audio data stream of a current frame; and
  
  the consumption adjusting part, in accordance with the control, inserts or removes a waveform corresponding to the pitch length in the decoded audio data stream or does not change the decoded audio data stream.

10. The reproducing apparatus according to claim 9, wherein the control part determines whether a level of the difference represents a high urgency level indicating that the number of buffered packets should be urgently increased or decreased or a low urgency level indicating that the number of buffered packets should be slowly increased or decreased; and,
- if the level represents the high urgency level, provides a control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream regardless of the result of the segment determination;
  
  if the level represents a low urgency level, provides a control to cause the consumption adjusting part to expand or reduce the waveform of the decoded audio data stream once every predetermined number N1, N2, N3, N4 of frames, the predetermined number being predetermined for each of the voiced sound segment, the unvoiced sound segment, the background noise segment, and the silence segment, where N1, N2, N3, and N4 are positive integers and at least one of the integers is greater than or equal to 2 and differs from the other three integers.
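The consumption adjusting part's insertion or removal of "a waveform corresponding to the pitch length" could look roughly like the sketch below. The linear crossfade, the `start` offset, and the fallback to an unchanged frame when fewer than two pitch periods are available are assumptions of this illustration; the claims do not specify how the waveform is joined.

```python
def _crossfade(first, second):
    """Fade linearly from `first` to `second`; both hold one pitch period of samples."""
    p = len(first)
    return [first[i] * (1 - i / p) + second[i] * (i / p) for i in range(p)]


def adjust_frame(samples, pitch, action, start=0):
    """Expand, reduce, or preserve one decoded frame by one pitch period.

    samples -- decoded samples of the current frame (list of numbers)
    pitch   -- pitch length in samples, as supplied by the analysis stage
    action  -- 'expand', 'reduce', or 'preserve'
    start   -- hypothetical offset of the first pitch period to work on
    """
    if action == "preserve":
        return list(samples)
    if pitch <= 0 or start + 2 * pitch > len(samples):
        return list(samples)  # not enough material; leave the frame unchanged
    a = list(samples[start:start + pitch])              # first pitch period
    b = list(samples[start + pitch:start + 2 * pitch])  # second pitch period
    head = list(samples[:start])
    tail = list(samples[start + 2 * pitch:])
    if action == "reduce":
        # Merge two periods into one: the frame becomes `pitch` samples shorter.
        return head + _crossfade(a, b) + tail
    if action == "expand":
        # Insert one blended period between the two: `pitch` samples longer.
        return head + a + _crossfade(b, a) + b + tail
    raise ValueError("unknown action: " + action)
```

For a perfectly periodic input such as `[0, 1, 2, 3] * 3` with `pitch=4`, reducing yields two periods and expanding yields four, with the periodic shape intact. Because the output frame is shorter or longer than the input, packets are consumed faster or slower, which is what moves the buffer level toward the optimum.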

11. A computer-readable recording medium storing computer-readable instructions thereon, the computer-readable instructions when executed by a computer cause the computer to perform the method comprising:
- storing received packets in a receiving buffer;
  
  detecting a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  obtaining, based on the largest delay jitter, an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being an optimum number of packets to be stored in the receiving buffer;
  
  determining, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets;
  
  retrieving a packet corresponding to a current frame from the receiving buffer and decoding an audio code in the packet to obtain a decoded audio data stream in the current frame; and
  
  performing any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference, and outputting a result as audio data of the current frame, wherein the performing includes obtaining a pitch length of the decoded audio data stream, analyzing the audio data stream to determine whether the audio data stream is in a voice segment or a non-voice segment, and performing any of expansion, reduction, and preservation by inserting or removing a waveform corresponding to the pitch length in the decoded audio string or by not changing the decoded audio signal string, on the basis of a result of the determination of voice/non-voice segment and a result of the determination of the difference level.

12. A computer-readable medium storing computer-readable instructions thereon, the computer readable instructions when executed by a computer cause the computer to perform the method comprising:
- storing received packets in a receiving buffer;
  
  detecting a largest delay jitter and a number of buffered packets, the largest jitter being any of a largest value or statistical value of jitter obtained by observing arrival jitter of the received packets over a predetermined period of time and the number of buffered packets being a number of packets stored in the receiving buffer;
  
  obtaining, based on the largest delay jitter, an optimum number of buffered packets by using a predetermined relation between the largest delay jitter and the optimum number of buffered packets, the optimum number of buffered packets being an optimum number of packets to be stored in the receiving buffer;
  
  determining, on a scale of a plurality of levels, a difference between the detected number of buffered packets and the optimum number of buffered packets;
  
  retrieving a packet corresponding to a current frame from the receiving buffer and decoding an audio code in the packet to obtain a decoded audio data stream in the current frame; and
  
  performing any of expansion, reduction, and preservation of a waveform of the decoded audio data stream in accordance with a rule to make the number of buffered packets close to the optimum number of buffered packets, the rule being established for each level of the difference, and outputting a result as audio data of the current frame, wherein the performing includes obtaining the pitch length of the decoded audio data stream, analyzing the decoded audio data stream to determine which of a voiced sound segment, an unvoiced sound segment, a background noise segment, and a silence segment the decoded audio data stream is in, and performing any of expansion, reduction, and preservation of the decoded audio data stream by inserting or removing a waveform corresponding to the pitch length in the decoded audio data stream or by not changing the decoded audio data stream, on the basis of a result of the segment determination and a result of the determination of the difference level.

Current Assignee: Nippon Telegraph and Telephone Corporation
Original Assignee: Nippon Telegraph and Telephone Corporation
Inventors: Mori, Takeshi; Ohmuro, Hitoshi; Hiwasaki, Yusuke; Kataoka, Akitoshi
Primary Examiner(s): Shah, Chirag G
Assistant Examiner(s): Rivas, Salvador E
Application Number: US10/591,183
Publication Number: US 20070177620A1
Time in Patent Office: 1,805 Days
Field of Search: None
US Class Current: 370/395.64
CPC Class Codes:
G10L 19/005   Correction of errors induce...
G10L 21/04   Time compression or expansion
H04J 3/0632   Synchronisation of packets ...
