Speeding up the exploration of the NN topology space

Speeding up the exploration of the NN topology space requires both time and money. The demand for faster, higher-memory-capacity, application-specific GPUs is not going to abate any time soon. Nvidia's one-size-fits-all solution for deep learning is going to be attacked by companies developing different solutions for different NN applications (word2vec, neural nets, convolutional neural nets, and LSTMs/RNNs). Developing semi-automated systems for labeling data would also give a huge boost in productivity.

1. Connectivity. Training times can be reduced by parallelizing the NN operations. Training the NN across different chips requires chip-to-chip communication. Sending signals via I/O pads through copper traces on PCBs is an order of magnitude or more slower than on-chip communication. Furthermore, the bisection bandwidth between chips is significantly lower than the on-chip bisection bandwidth. The chip design industry still has not developed chip-to-chip optical connections, which would significantly increase the communication bandwidth between chips running a parallel process.
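To get a feel for why link bandwidth matters, here is a back-of-the-envelope sketch of the time spent exchanging gradients between chips on every training step. It assumes a ring all-reduce, in which each of n chips sends about 2(n-1)/n times the gradient size, and it ignores latency entirely. The function name, the model size, and the two bandwidth figures are illustrative placeholders, not measurements of any real system.

```python
def ring_allreduce_seconds(num_params, bytes_per_param, num_chips, link_bytes_per_sec):
    """Bandwidth-only estimate of one ring all-reduce (latency is ignored)."""
    if num_chips < 2:
        return 0.0  # a single chip has nothing to exchange
    payload = num_params * bytes_per_param
    # In a ring all-reduce each chip sends roughly 2*(n-1)/n times the payload.
    traffic_per_chip = 2 * (num_chips - 1) / num_chips * payload
    return traffic_per_chip / link_bytes_per_sec


if __name__ == "__main__":
    params = 100_000_000  # hypothetical model size
    links = [
        ("slower link (hypothetical 10 GB/s)", 10e9),
        ("faster link (hypothetical 100 GB/s)", 100e9),
    ]
    for label, bandwidth in links:
        seconds = ring_allreduce_seconds(params, 4, 8, bandwidth)
        print(f"{label}: {seconds * 1e3:.1f} ms per step")
```

In this simple model the exchange time scales inversely with the link bandwidth, so the step time of a communication-bound job follows the chip-to-chip bandwidth almost directly.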

2. Best practices. MTBF. Building systems with 8 connected GPU cards is going to lead to significant reductions in the time to failure (one reason is that the increased number of components pushes the design limits in terms of frequency and temperature). Being able to swap out a component and restart from the current point requires that software teams checkpoint their trained weights and biases…
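The drop in time to failure can be roughed out with a simple series-reliability model. The sketch below assumes that components fail independently, each at a constant rate, and that the system goes down when any one component fails; under those assumptions the failure rates add. The per-card MTBF and the run length are hypothetical numbers chosen only for illustration.

```python
import math


def system_mtbf_hours(component_mtbfs):
    """MTBF of a system that fails when any one component fails.

    Assumes independent components with constant failure rates,
    so the system failure rate is the sum of the component rates.
    """
    return 1.0 / sum(1.0 / mtbf for mtbf in component_mtbfs)


def survival_probability(run_hours, mtbf_hours):
    """Chance of finishing a run of the given length with no failure."""
    return math.exp(-run_hours / mtbf_hours)


if __name__ == "__main__":
    card_mtbf = 50_000.0  # hypothetical per-card MTBF in hours
    system = system_mtbf_hours([card_mtbf] * 8)
    two_weeks = 14 * 24
    print(f"8-card system MTBF: {system:.0f} hours")
    print(f"P(no failure in two weeks): {survival_probability(two_weeks, system):.3f}")
```

With identical cards the system MTBF is the per-card figure divided by the number of cards, which is why long training runs on multi-card systems depend on checkpointing.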

3. Reliability. This next idea might be a bridge too far, and unnecessary. Neural nets that train for weeks need resilient parallel processing systems with checkpointing support in the hardware (out-of-band communication) to save the state of the weights/biases/… The hardware should have built-in checkpointing and recovery circuits. The hardware/software.
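Until hardware support of that kind exists, the save-and-recover cycle has to be done in software. The following is a minimal sketch of it, not a production implementation: the state is written to a temporary file and then renamed over the previous checkpoint, so a crash in the middle of a write does not corrupt the last good copy. The training update is a placeholder, and the file layout and all function names are assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights, biases):
    """Write the state to a temp file, then atomically replace the old checkpoint."""
    state = {"step": step, "weights": weights, "biases": biases}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reaches the disk before the rename
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=100):
    """Resume from the last checkpoint if there is one, then train to total_steps."""
    state = load_checkpoint(path) or {"step": 0, "weights": [0.0], "biases": [0.0]}
    step, weights, biases = state["step"], state["weights"], state["biases"]
    while step < total_steps:
        # Placeholder update standing in for a real training step.
        weights = [w + 0.01 for w in weights]
        biases = [b - 0.01 for b in biases]
        step += 1
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(path, step, weights, biases)
    return step, weights, biases
```

After a component is swapped out, calling `train` again with the same path picks up from the last saved step instead of starting over.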

4. Cooling. Stacking 8 GPU cards next to each other is going to require liquid cooling solutions as part of the OEM package from the chip company. Refrigerators have been built for over a hundred years, and refrigeration technology is reliable. Yet chip companies continue to rely on air cooling to remove heat. Water and liquid nitrogen cooling are going to be necessary to cool future chip designs and will be part of the component delivered to board manufacturers.

As always, corrections and comments are appreciated.
