Deep neural network partitioning on servers
Abstract
A method is provided for implementing a deep neural network on a server component that includes a host component including a CPU and a hardware acceleration component coupled to the host component. The deep neural network includes a plurality of layers. The method includes partitioning the deep neural network into a first segment and a second segment, the first segment including a first subset of the plurality of layers, the second segment including a second subset of the plurality of layers, configuring the host component to implement the first segment, and configuring the hardware acceleration component to implement the second segment.
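The partitioning step the abstract describes can be pictured with a short sketch. The Python below is illustrative only, assuming the DNN is modeled as an ordered list of layers; the names (`partition_dnn`, `Layer`) are invented for this example, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kind: str  # e.g. "conv", "pool", "linear"

def partition_dnn(layers, split_index):
    """Split an ordered list of DNN layers into two contiguous segments.

    Per the abstract, the first segment is implemented by the host
    component (CPU) and the second by the hardware acceleration component.
    """
    return layers[:split_index], layers[split_index:]

# Example: a small convolutional network split after its feature layers.
dnn = [
    Layer("conv1", "conv"),
    Layer("conv2", "conv"),
    Layer("pool1", "pool"),
    Layer("fc1", "linear"),
    Layer("fc2", "linear"),
]
host_segment, accel_segment = partition_dnn(dnn, split_index=3)
print([layer.name for layer in host_segment])   # ['conv1', 'conv2', 'pool1']
print([layer.name for layer in accel_segment])  # ['fc1', 'fc2']
```

Where the split falls is a deployment choice the abstract leaves open; claim 14 below, for instance, uses per-layer memory bandwidth requirements as the criterion.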
19 Claims
1. A system comprising:
multiple server units, each server unit comprising:
a plurality of central processing units;
a hardware acceleration processing unit coupled to a top-of-rack switch, wherein the hardware acceleration processing unit performs processing on packets received from, or sent to, the top-of-rack switch without burdening operations performed by one of the plurality of central processing units;
a local link communicationally coupling the central processing unit to the hardware acceleration processing unit;
a first network interface communicationally coupled to at least one of the plurality of central processing units; and
a second network interface different from, and independent of, the first network interface, the second network interface communicationally coupled to the hardware acceleration processing unit independently of one of the plurality of central processing units, such that a second hardware acceleration processing unit, of a second server unit of the multiple server units, communicates directly with the hardware acceleration processing unit through the second network interface, to the exclusion of communicating with one of the plurality of central processing units;
wherein a first set of the hardware acceleration processing units are head components that calculate feature values that will be used as input for subsequent processing to be performed by a second set of the hardware acceleration processing units;
wherein the second set of the hardware acceleration processing units are free form expression executing processors that receive the feature values from the head components and perform the subsequent processing; and
wherein one or more of the multiple server units execute computer-executable instructions which, when executed, cause the one or more of the multiple server units to provide a service mapping component that performs steps comprising:
assigning, to multiple head components, convolution processing associated with one or more convolution layers of a deep neural network (DNN), the multiple head components then executing the assigned convolution processing utilizing one or more free form expression executing processors to perform at least a portion of the convolution processing; and
assigning, to one or more central processing units of one or more of the multiple server units that comprise the multiple head components to which the convolution processing was assigned, linear processing of output of the convolution processing, the one or more central processing units then executing the assigned linear processing;
wherein the service mapping component associates a first central processing unit of a first server unit with a second hardware acceleration component of a second server unit, differing from the first server unit, in response to a failure of a first hardware acceleration component of the first server unit.
Dependent claims: 2, 3, 4, 5, 6, 7.
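Claim 1's service mapping component makes two assignments (convolution to head accelerators, linear processing to co-located CPUs) and one failover decision. The sketch below models that logic in plain Python; the class and attribute names (`ServiceMapper`, `accel_healthy`, and so on) are invented for illustration, and layers are represented as simple `(name, kind)` pairs.

```python
from dataclasses import dataclass

@dataclass
class ServerUnit:
    """One server unit of claim 1: several CPUs plus one acceleration unit."""
    name: str
    cpu_ids: list
    accel_healthy: bool = True

class ServiceMapper:
    """Toy service mapping component (all names are assumptions)."""

    def __init__(self, servers):
        self.servers = {s.name: s for s in servers}
        self.cpu_to_accel = {}  # CPU id -> server whose accelerator now serves it

    def assign(self, layers):
        """Map conv layers to head accelerators, the rest to co-located CPUs."""
        heads = [s.name for s in self.servers.values() if s.accel_healthy]
        cpus = [c for s in self.servers.values() if s.accel_healthy
                for c in s.cpu_ids]
        return {name: ("accelerator", heads) if kind == "conv" else ("cpu", cpus)
                for name, kind in layers}

    def handle_accel_failure(self, failed):
        """Re-associate the failed server's CPUs with a healthy remote
        accelerator (the cross-server association in the final wherein clause)."""
        self.servers[failed].accel_healthy = False
        backup = next(s.name for s in self.servers.values() if s.accel_healthy)
        for cpu in self.servers[failed].cpu_ids:
            self.cpu_to_accel[cpu] = backup
        return backup

servers = [ServerUnit("srv0", ["srv0-cpu0", "srv0-cpu1"]),
           ServerUnit("srv1", ["srv1-cpu0", "srv1-cpu1"])]
mapper = ServiceMapper(servers)
print(mapper.assign([("conv1", "conv"), ("fc1", "linear")]))
print(mapper.handle_accel_failure("srv0"))  # srv0's CPUs now pair with srv1
```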
8. A server unit computing device comprising:
a plurality of central processing units;
a hardware acceleration processing unit coupled to a top-of-rack switch, wherein the hardware acceleration processing unit performs processing on packets received from, or sent to, the top-of-rack switch without burdening operations performed by one of the plurality of central processing units;
a local link communicationally coupling the central processing unit to the hardware acceleration processing unit;
a first network interface communicationally coupled to at least one of the plurality of central processing units; and
a second network interface different from, and independent of, the first network interface, the second network interface communicationally coupled to the hardware acceleration processing unit independently of one of the plurality of central processing units, such that a second hardware acceleration processing unit, of a second server unit of the multiple server units, communicates directly with the hardware acceleration processing unit through the second network interface, to the exclusion of communicating with one of the plurality of central processing units;
wherein the hardware acceleration processing unit is configured as either:
(1) a head component that calculates feature values that will be used as input for subsequent processing to be performed by other hardware acceleration processing units of other server unit computing devices; or
(2) a free form expression executing processor that receives the feature values and performs the subsequent processing;
wherein the hardware acceleration processing unit is assigned, by a service mapping component, convolution processing associated with one or more convolution layers of a deep neural network (DNN), the hardware acceleration processing unit performing at least a portion of the assigned convolution processing;
wherein the central processing unit is assigned, by the service mapping component, linear processing of output of the convolution processing performed by the hardware acceleration processing unit, the central processing unit performing at least a portion of the assigned linear processing; and
wherein the service mapping component is provided by one or more server units executing computer-executable instructions which, when executed, cause the one or more server units to provide the service mapping component performing steps comprising:
(1) the assigning of the convolution processing to the hardware acceleration processing unit,
(2) the assigning of the linear processing to the central processing unit, and
(3) associating a first central processing unit of a first server unit with a second hardware acceleration component of a second server unit, differing from the first server unit, in response to a failure of a first hardware acceleration component of the first server unit.
Dependent claims: 9, 10, 11, 12, 13.
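Claim 8 configures each acceleration unit in one of two roles, with feature values moving accelerator-to-accelerator over the dedicated second network interface rather than through any CPU. The toy simulation below illustrates that split; `send_direct` stands in for the second NIC, and every name and arithmetic placeholder is an assumption, not the patent's method.

```python
from enum import Enum

class Role(Enum):
    HEAD = "head"                  # computes feature values (claim 8, option 1)
    FFE = "free_form_expression"   # consumes them (claim 8, option 2)

class AccelUnit:
    """Toy model of the hardware acceleration processing unit in claim 8."""

    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.inbox = []

    def send_direct(self, peer, feature_values):
        # Accelerator-to-accelerator transfer over the dedicated second NIC;
        # no CPU on either server unit touches the payload.
        peer.inbox.append(feature_values)

    def compute_features(self, inputs):
        assert self.role is Role.HEAD
        # Placeholder for the convolution work a head component performs.
        return [x * 2 for x in inputs]

    def run_expression(self):
        assert self.role is Role.FFE
        # Placeholder for free form expression evaluation over received features.
        return [sum(batch) for batch in self.inbox]

head = AccelUnit("srv0-accel", Role.HEAD)
ffe = AccelUnit("srv1-accel", Role.FFE)
head.send_direct(ffe, head.compute_features([1, 2, 3]))
print(ffe.run_expression())  # [12]
```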
14. A system comprising:
multiple server units, each server unit comprising:
a plurality of central processing units;
a hardware acceleration processing unit coupled to a top-of-rack switch, wherein the hardware acceleration processing unit performs processing on packets received from, or sent to, the top-of-rack switch without burdening operations performed by one of the plurality of central processing units;
a local link communicationally coupling the central processing unit to the hardware acceleration processing unit;
a first network interface communicationally coupled to at least one of the plurality of central processing units and coupling the at least one of the plurality of central processing units to others of the multiple server units; and
a second network interface different from, and independent of, the first network interface, the second network interface communicationally coupled to the hardware acceleration processing unit independently of one of the plurality of central processing units, such that a second hardware acceleration processing unit, of a second server unit of the multiple server units, communicates directly with the hardware acceleration processing unit through the second network interface, to the exclusion of communicating with one of the plurality of central processing units;
wherein a first set of the hardware acceleration processing units are head components that calculate feature values that will be used as input for subsequent processing to be performed by a second set of the hardware acceleration processing units;
wherein the second set of the hardware acceleration processing units are free form expression executing processors that receive the feature values from the head components and perform the subsequent processing; and
wherein one or more of the multiple server units execute computer-executable instructions which, when executed, cause the one or more of the multiple server units to provide a service mapping component that performs steps comprising:
assigning, to multiple head components, a first layer of processing associated with a deep neural network (DNN) based on the first layer having a lower memory bandwidth requirement than a second layer, the multiple head components then executing the assigned first layer of processing utilizing one or more free form expression executing processors to perform at least a portion of the first layer of processing; and
assigning, to one or more central processing units of one or more of the multiple server units that comprise the multiple head components to which the first layer of processing was assigned, a second layer of processing associated with the DNN based on the second layer having a higher memory bandwidth requirement than the first layer, the one or more central processing units then executing the assigned second layer of processing;
wherein the service mapping component associates a first central processing unit of a first server unit with a second hardware acceleration component of a second server unit, differing from the first server unit, in response to a failure of a first hardware acceleration component of the first server unit.
Dependent claims: 15, 16, 17, 18, 19.
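Claim 14 swaps the convolution/linear criterion of claim 1 for a memory-bandwidth one: layers with lower bandwidth requirements go to the head accelerators, and layers with higher requirements go to the CPUs. A minimal sketch, assuming per-layer bandwidth estimates are available (the figures and threshold below are invented):

```python
def assign_by_bandwidth(layers, threshold_gbps):
    """Route each DNN layer by its estimated memory-bandwidth requirement.

    Per claim 14: lower-bandwidth layers go to the head accelerators,
    higher-bandwidth layers go to the central processing units.
    """
    plan = {"accelerator": [], "cpu": []}
    for name, gbps in layers:
        target = "accelerator" if gbps < threshold_gbps else "cpu"
        plan[target].append(name)
    return plan

# Hypothetical per-layer bandwidth estimates (GB/s), not from the patent.
layers = [("conv1", 8), ("conv2", 12), ("fc1", 40), ("fc2", 35)]
print(assign_by_bandwidth(layers, threshold_gbps=20))
# {'accelerator': ['conv1', 'conv2'], 'cpu': ['fc1', 'fc2']}
```

This heuristic is consistent with claim 1's split, since convolution layers tend to be compute-bound while fully connected (linear) layers tend to be memory-bandwidth-bound.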
Specification