NVIDIA may have a powerful mining card in the works. Word is that a CMP HX card based on the GA100 is in preparation at the Greens, which would make it the first Ampere-based Crypto Mining Processor.
Derived from the A100, the GA100 is a far more compute-oriented chip, which could make such a card much more efficient and interesting: it delivers 19.5 TFLOPS in single precision and carries 40 GB of HBM2.
If NVIDIA released such a monster for mining, one would expect it to carry much less video memory, which would likely bring the price down, since an A100 still sells for around $10,000.
So alongside the 30HX, 40HX, 50HX, and 90HX, we might also see a 100HX.
GA100 Specifications
- 8 GPCs, 8 TPCs per GPC, 2 SMs per TPC, 16 SMs per GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores per SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores per SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 memory stacks, twelve 512-bit memory controllers
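As a sanity check, the 19.5 TFLOPS single-precision figure quoted above can be reproduced from the core count and clock. A minimal sketch, assuming the A100's shipping configuration (108 of the 128 SMs enabled, i.e. 6912 FP32 cores) and its roughly 1.41 GHz boost clock, both taken from NVIDIA's published A100 specs rather than from this article:

```python
# Theoretical FP32 throughput = cores x 2 ops per FMA x clock.
# Assumes the A100 product config (108 SMs enabled, ~1.41 GHz boost),
# not the full 128-SM GA100 die.
FP32_CORES_PER_SM = 64
ENABLED_SMS = 108            # full GA100 die has 128
BOOST_CLOCK_HZ = 1.41e9

cores = FP32_CORES_PER_SM * ENABLED_SMS       # 6912
tflops = cores * 2 * BOOST_CLOCK_HZ / 1e12
print(f"{tflops:.1f} TFLOPS FP32")            # ~19.5
```

A hypothetical full-die 100HX with all 128 SMs active would land around 23 TFLOPS by the same arithmetic.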
GA100 SM architecture
- Third Generation Tensor Cores
- Acceleration for all types of data, including FP16, BF16, TF32, FP64, INT8, INT4, and Binary
- TF32 Tensor Core operations provide an easy way to accelerate FP32 input/output data in deep learning and HPC frameworks, running up to 10x faster than Tesla V100 FP32 FMA operations, or up to 20x faster with sparsity.
- FP16/FP32 mixed-precision Tensor Cores deliver unprecedented processing power for deep learning, running up to 2.5x faster than Volta's Tensor Cores, and up to 5x faster with sparsity.
- FP64 operations on Tensor Cores run up to 2.5x faster than DFMA FP64 operations on the Tesla V100.
- INT8 operations with sparse matrices offer unprecedented processing power for deep learning inference, running up to 20x faster than INT8 operations on the Tesla V100.
- 192 KB of combined shared memory and L1 cache, 1.5x larger than in a Tesla V100 SM
- New asynchronous copy instruction that loads data directly from global memory into shared memory, optionally bypassing the L1 cache and eliminating the need for a round-trip through an intermediate register file.
- New barrier unit for shared memory (asynchronous barrier) for use in conjunction with the new asynchronous copy instruction.
- New instructions for L2 cache management and residency controls.
- New enhancements in scheduling to reduce software complexity.
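To make the TF32 point above concrete: TF32 keeps FP32's 8-bit exponent (so the same dynamic range) but retains only 10 of the 23 mantissa bits. A minimal sketch simulating that truncation on the CPU (the `to_tf32` helper is hypothetical, purely for illustration; real TF32 rounding happens inside the Tensor Cores):

```python
import struct

def to_tf32(x: float) -> float:
    """Truncate a float32 value's mantissa from 23 bits to the 10
    bits TF32 keeps; sign and 8-bit exponent are unchanged."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFFE000          # zero the low 13 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_tf32(1.0))         # 1.0 (exactly representable)
print(to_tf32(3.14159265))  # 3.140625
```

This is why frameworks can feed FP32 tensors straight through TF32 Tensor Cores: the exponent range matches FP32, and only low-order mantissa precision is traded for throughput.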