2021-08-06, 20:02 #23
Jun 2012
Boulder, CO
2^{4}×3×7 Posts 

2021-08-06, 21:20 #24
Jun 2012
Boulder, CO
2^{4}×3×7 Posts 
On that note, though, are there any plans to support multiple GPUs? If a single A100 is this fast, 16 A100s with a fully interconnected fabric could probably tear through big matrices.

2021-08-06, 21:41 #25
Jul 2003
So Cal
2^{4}·139 Posts 
The current version supports multiple GPUs using MPI (compile with CUDA=1 MPI=1 CUDAAWARE=1) but relies on a good MPI implementation. OpenMPI's collectives transfer the data off the card and do the reduction on the CPU. MVAPICH2-GDR, I think, keeps the reductions on the card, but SDSC doesn't have it working on Expanse GPU yet, so I haven't been able to test it. I hope to have time on NCSA Delta later this fall to try it out.
Edit 2: I've got a draft version working now that passes vectors between GPUs using MPI CUDA-aware point-to-point comms (which use NVLink or GPUDirect when available), then does the reduction on the GPU manually. In a quick test on a 43M matrix using two V100s connected with NVLink, this reduces LA time from nearly 90 hours when passing vectors through CPU memory to
Edit 3: It's now in GitHub. Just compile with a CUDA-aware MPI such as OpenMPI using CUDA=XX MPI=1 CUDAAWARE=1, where XX is replaced by the compute capability of your GPU.
Last fiddled with by frmky on 2021-08-12 at 08:20
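To make "point-to-point exchange, then reduce on the GPU" a bit more concrete, here is a CPU-only Python simulation of one common scheme, a recursive-doubling allreduce. It assumes the reduction is a bitwise XOR (addition over GF(2), as in block Lanczos) and a power-of-two number of ranks; the function name and data layout are hypothetical, and the post doesn't say which exchange pattern msieve actually uses. On real hardware each exchange would be a send/receive of a device buffer through a CUDA-aware MPI, with the XOR done in a GPU kernel.

```python
def xor_allreduce(rank_vectors):
    """Simulate a recursive-doubling allreduce over GF(2).

    rank_vectors[r] is the list of 64-bit words held by (hypothetical) rank r.
    In the round with distance `step`, rank r exchanges its whole vector with
    partner r ^ step (a point-to-point send/receive) and XORs the received
    words into its own. After log2(P) rounds every rank holds the XOR of all
    P input vectors.
    """
    p = len(rank_vectors)
    if p == 0 or p & (p - 1):
        raise ValueError("this sketch assumes a power-of-two number of ranks")
    vecs = [list(v) for v in rank_vectors]  # copy so the inputs are untouched
    step = 1
    while step < p:
        # Snapshot the pre-round values so every exchange in this round
        # sees what its partner held before the round, as with real messages.
        snapshot = [v[:] for v in vecs]
        for r in range(p):
            partner = r ^ step
            vecs[r] = [a ^ b for a, b in zip(snapshot[r], snapshot[partner])]
        step <<= 1
    return vecs


if __name__ == "__main__":
    ranks = [[0x1, 0xF0], [0x2, 0x0F], [0x4, 0xFF], [0x8, 0x00]]
    for r, v in enumerate(xor_allreduce(ranks)):
        print(r, [hex(w) for w in v])
```

With P ranks this takes log2(P) exchange rounds, and each round moves a full vector over whatever link joins the pair, which is why having NVLink between the cards matters.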
2021-08-07, 06:44 #26
Jul 2003
So Cal
2^{4}×139 Posts 
Code:
linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m) 
2021-08-07, 11:10 #27
(loop (#_fork))
Feb 2006
Cambridge, England
2^{2}×3^{2}×179 Posts 
Interesting! That's about a p3.8xlarge instance, for which the spot price is $4/hr, so that's $84 = £60 to solve the matrix.
I'm paying 19p/kWh here, and my Skylake machine uses about 250 W and takes 820 hours for a 44M matrix, so that's £40 of electricity (but probably £60 in depreciation, assuming the £3360 machine will last five years). On the other hand, it takes a month rather than a day; on a third hand, that's still enough to keep up with my sieving resources.
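The back-of-the-envelope comparison above is easy to check. The inputs are the figures from these posts (21 h from the ETA just above, $4/hr spot price, 250 W, 820 h, 19p/kWh, £3360 over five years); straight-line depreciation over round-the-clock use is an assumption, and the function names are just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes the machine runs around the clock


def cloud_cost(hours, price_per_hour):
    """Cost of renting an instance for the whole run."""
    return hours * price_per_hour


def local_cost(watts, hours, price_per_kwh, machine_price, lifetime_years):
    """Return (electricity, depreciation) for running the job at home.

    Depreciation is straight-line: the job's share of the machine's
    lifetime hours times its purchase price.
    """
    electricity = watts / 1000 * hours * price_per_kwh
    depreciation = machine_price * hours / (lifetime_years * HOURS_PER_YEAR)
    return electricity, depreciation


if __name__ == "__main__":
    print(f"cloud: ${cloud_cost(21, 4.0):.0f}")
    elec, dep = local_cost(250, 820, 0.19, 3360, 5)
    print(f"electricity: £{elec:.0f}, depreciation: £{dep:.0f}")
```

This comes out at $84 for the cloud run, and roughly £39 of electricity plus £63 of depreciation locally, in line with the rounded £40 and £60 quoted.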
2021-08-07, 16:13 #28
Jul 2003
So Cal
2^{4}·139 Posts 
Code:
linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m) 
2021-08-07, 16:20 #29
Jun 2012
Boulder, CO
2^{4}×3×7 Posts 
Quote:
Code:
linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m) 

2021-08-07, 16:40 #30
Jul 2003
So Cal
2^{4}·139 Posts 
Yes, that would have been on 6 Sandy Bridge nodes with two 10-core CPUs each.
Here's the companion 2,2162L matrix, also 84.2M, running on 8 Fujitsu A64FX nodes.
Code:
Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m 
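Since the runs in this thread are compared through their msieve progress lines, a small parser helps put the ETAs side by side. The regular expressions are written only against the lines quoted here, so treat them as assumptions about msieve's output format rather than a specification; the helper names are hypothetical.

```python
import re

# Assumed formats, taken only from the lines quoted in this thread.
ETA_RE = re.compile(r"ETA\s*(\d+)h\s*(\d+)m")
DIMS_RE = re.compile(r"completed (\d+) of (\d+) dimensions")


def eta_hours(line):
    """Return the ETA in a progress line as fractional hours, or None."""
    m = ETA_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) + int(m.group(2)) / 60


def dims(line):
    """Return (completed, total) dimensions, or None if the line lacks them."""
    m = DIMS_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


if __name__ == "__main__":
    lines = [
        "linear algebra completed 45452 of 42101088 dimensions (0.1%, ETA 21h 4m)",
        "linear algebra completed 49005 of 84248506 dimensions (0.1%, ETA 94h30m)",
        "linear algebra completed 20216008 of 109441779 dimensions (18.5%, ETA 854h19m)",
        "Fri Jul 2 01:59:19 2021 linear algebra at 0.0%, ETA 337h 2m",
    ]
    for line in lines:
        print(dims(line), round(eta_hours(line), 1))
```

Note that an ETA printed at 0.0 to 0.1% is close to the length of the whole run, while the ETA on the 18.5% line covers only the remaining work, so those two kinds of numbers aren't directly comparable without the elapsed time.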
2021-08-08, 00:00 #31
I moo ablest echo power!
May 2013
3×5×7×17 Posts 
Would something like this work on my 3090? It has 24 GB of RAM, though I would need some help with compilation, as I use WSL2, which doesn't support CUDA applications (yet).

2021-08-08, 00:57 #32
Jul 2003
So Cal
2^{4}·139 Posts 
Yes, you could solve a matrix of up to about 15M on the card. If you have at least 32 GB of system memory, you could go a bit larger by transferring the matrix from system memory as needed using CUDA managed memory. But I have no experience compiling msieve for Windows.

2021-08-11, 22:09 #33
Jul 2003
So Cal
2^{4}·139 Posts 
The LA for 2,2162M, an 84.2M matrix, successfully completed on four NVLink-connected V100s in a total of 95.5 hours of runtime. There was a restart due to the 48-hour queue time limit on SDSC Expanse GPU. This run used just over 26 GB of GPU memory on each of the four V100s.
Attached is a snapshot of the timeline for two block Lanczos iterations on three of the four GPUs. Per the time scale at the top, it takes just over 1 second per iteration. Over 80% of the time is spent in the SpMV routine. The transfer of vectors directly between GPUs takes relatively little time when NVLink is used.
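As a rough sanity check on "about 1 second per iteration" versus the total runtime: the commonly quoted estimate from Montgomery's analysis of block Lanczos is that each iteration gains about B - 0.76 dimensions, where B is the block width in bits. The post doesn't say which block width this run used, so the sketch below just tabulates a few candidates; the constant and function names are illustrative assumptions, not msieve's own accounting.

```python
# Rough block Lanczos cost model (an assumption for illustration): each
# iteration is expected to gain about B - 0.76 dimensions, where B is the
# block width in bits.
LANCZOS_LOSS = 0.76


def expected_iterations(n_dims, block_bits):
    """Approximate number of block Lanczos iterations for an n_dims matrix."""
    return n_dims / (block_bits - LANCZOS_LOSS)


def estimated_hours(n_dims, block_bits, seconds_per_iteration):
    """Convert the iteration estimate into wall-clock hours."""
    return expected_iterations(n_dims, block_bits) * seconds_per_iteration / 3600


if __name__ == "__main__":
    n = 84_248_506  # dimension of the 84.2M matrix from the progress line in #28
    for b in (64, 128, 256, 512):
        its = expected_iterations(n, b)
        print(b, round(its), round(estimated_hours(n, b, 1.0), 1))
```

With B = 256 this gives roughly 330,000 iterations, or about 92 hours at 1 s each, which is in the same range as the 95.5 hours reported; B = 64 would need about four times as many iterations. Treat it only as a check on the scaling, not as a statement of how this run was configured.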