Cuda_error_out_of_memory

I am receiving this error message:

terminate called after throwing an instance of ‘std::runtime_error’
what(): cuMemAlloc : CUDA error: CUDA_ERROR_OUT_OF_MEMORY

/work/qchem6p1p1/bin/qchem: line 129: 3424860 Aborted (core dumped) ${QCPROG_S} ${inp} ${scr}
Error in Q-Chem run part 1

Command line is:

qchem -gpu -nt 60 baseline_11240_HF_Notrunc_qc_job_DFT.inp baseline_11240_HF_Notrunc_qc_job_DFT.out baseline_11240_HF_Notrunc_qc_job_DFT

This calculation works fine for regular HF.

What is the memory limit in Q-Chem for the GPUs? Isn’t 24 GB per GPU sufficient?

Pertinent info:

Q-Chem version 6.1.0, Linux (Red Hat 8.6), CUDA 11.8, four NVIDIA GeForce RTX 3090 GPUs, each with 24576 MiB of RAM.

There are 1046 atoms and 7436 external point charges.

The input file looks like this (abbreviated for sanity):

$comment
BASELINE md1 11240-bin0 all Na and DNA DFT
$end

$molecule
0 1
H         7.1776000000000044      16.2891640000000031      18.0170574499999994
O         7.7101413000000054      16.5243172000000058      17.2537059499999970
C         8.7841912000000075      15.6469254000000042      17.2647457499999959
H         9.3712455000000023      15.6766647000000034      18.1826686499999965
H         9.4123488000000020      15.8436340000000015      16.3959331499999976
C         8.2067262000000056      14.2460273999999991      17.0260829499999957
.
.
.
Na        3.3425744000000002      16.3165315000000071      27.5854167499999932
Na       -8.8795172999999927      14.2412972000000018       7.9384593499999960
Na       -4.2830044999999934      -2.6798703999999978      -7.7863656500000014
Na       -1.3536274999999929      21.3721456000000067     -25.0405722500000074
$end

$external_charges
-35.0406969	-10.6890833	-7.89040995	0.447585
-34.5123626	-19.7969214	-12.92697885	0.447585
-34.3356882	-14.6226479	-7.30517725	-0.89517
-34.3003434	-13.9265665	-15.38567975	-0.89517
-34.2419116	-17.6615407	-6.98659905	0.447585
.
.
.
38.050571	-1.8716919	-21.53536495	0.447585
38.0834691	16.9182408	5.56798585	0.447585
38.2569417	0.0537589	-24.93606715	0.447585
$end

$rem
	METHOD	  wB97M-V
	SYMMETRY false
	SYM_IGNORE true
	BASIS	def2-SVP
	MEM_STATIC 10000
	GUI	2
	MAX_SCF_CYCLES 100
$end

Output looks like (also abbreviated for sanity):

Running Job 1 of 1 baseline_11240_HF_Notrunc_qc_job_DFT.inp
qchem baseline_11240_HF_Notrunc_qc_job_DFT.inp_3424847.0 /work/qcscratch/baseline_11240_HF_Notrunc_qc_job_DFT_2/ 1
/work/qchem6p1p1/exe/qcprog.exe_s baseline_11240_HF_Notrunc_qc_job_DFT.inp_3424847.0 /work/qcscratch/baseline_11240_HF_Notrunc_qc_job_DFT_2/
.
.
.
tid: 0x0 [2024.04.20-12:06:58.831] [INFO] Detected device: CUDA NVIDIA GeForce RTX 3090#0000:25:00.0
 tid: 0x0 [2024.04.20-12:06:58.831] [INFO] Detected device: CUDA NVIDIA GeForce RTX 3090#0000:41:00.0
 tid: 0x0 [2024.04.20-12:06:58.831] [INFO] Detected device: CUDA NVIDIA GeForce RTX 3090#0000:81:00.0
 tid: 0x0 [2024.04.20-12:06:58.831] [INFO] Detected device: CUDA NVIDIA GeForce RTX 3090#0000:E1:00.0
 tid: 0x0 [2024.04.20-12:06:58.833] [INFO] Allocating 3 threads for GPU: CUDA NVIDIA GeForce RTX 3090#0000:25:00.0 on node 0
 tid: 0x0 [2024.04.20-12:06:58.834] [INFO] Allocating 3 threads for GPU: CUDA NVIDIA GeForce RTX 3090#0000:41:00.0 on node 0
 tid: 0x0 [2024.04.20-12:06:58.836] [INFO] Allocating 3 threads for GPU: CUDA NVIDIA GeForce RTX 3090#0000:81:00.0 on node 0
 tid: 0x0 [2024.04.20-12:06:58.838] [INFO] Allocating 3 threads for GPU: CUDA NVIDIA GeForce RTX 3090#0000:E1:00.0 on node 0
 tid: 0x0 [2024.04.20-12:07:11.534] [INFO] Used GPU count: 4
 tid: 0x0 [2024.04.20-12:07:12.454] [INFO] End BrianQC module initialization
 tid: 0x0 [2024.04.20-12:07:12.457] [INFO] End brianAPIInit
 tid: 0x0 [2024.04.20-12:07:12.457] [INFO] Running preliminary BrianQC license check...
 tid: 0x0 [2024.04.20-12:07:12.458] [INFO] Preliminary BrianQC license check passed
.
.
.
Nuclear Repulsion Energy =      501801.73993400 hartrees
 There are     2680 alpha and     2680 beta electrons
 Requested basis set is def2-SVP
 There are 5268 shells and 11500 basis functions

 Q-Chem warning in module libmdc/get_mega.C, line 167:

 Allowed MEM_STATIC has been capped at MEM_STATIC=8191.
 Reduce the value of MEM_STATIC to silence this warning.


 Total QAlloc Memory Limit 1000000 MB
 Mega-Array Size      8007 MB
 MEM_STATIC part      8191 MB

 A cutoff of  1.0D-09 yielded 626466 shell pairs
 There are   2558682 function pairs (   2787687 Cartesian)
 Smallest overlap matrix eigenvalue = 3.05E-04

 Scale SEOQF with 1.000000e-03/1.000000e-03/1.000000e-03

 Standard Electronic Orientation quadrupole field applied
 Nucleus-field energy     =    -0.0000005877 hartrees
 Adding 7436 external point charges to one-electron Hamiltonian
 Nucleus-charge energy    =  1045.4415667537 hartrees
 Charge-charge energy     =  -909.4589444089 hartrees
 Guess from superposition of atomic densities
 Warning:  Energy on first SCF cycle will be non-variational
 SAD guess density has 5360.000004 electrons

 -----------------------------------------------------------------------
  General SCF calculation program by
  Eric Jon Sundstrom, Paul Horn, Yuezhi Mao, Dmitri Zuev, Alec White,
  David Stuck, Shaama M.S., Shane Yost, Joonho Lee, David Small,
  Daniel Levine, Susi Lehtola, Hugh Burton, Evgeny Epifanovsky,
  Bang C. Huynh
 -----------------------------------------------------------------------
 Exchange:     0.1500 Hartree-Fock + 1.0000 wB97M-V + LR-HF
 Correlation:  1.0000 wB97M-V
 Using SG-2 standard quadrature grid
 Nonlocal Correlation:  VV10 with C = 0.0100 and b = 6.00 and scale = 1.00000
 Grid used for NLC:  SG-1 standard quadrature
 using 60 threads for integral computing
 -------------------------------------------------------
 OpenMP Integral computing Module                
 Release: version 1.0, May 2013, Q-Chem Inc. Pittsburgh 
 -------------------------------------------------------
 ---- BrianQC J/K successfully initialized ---- 
 ---- BrianQC K successfully initialized ---- 
 ---- BrianQC XC successfully initialized ---- 
 Brianqc Computing J only
BrianQC JK build time 824.0000000000 (s)
 Brianqc Computing K only
BrianQC JK build time 821.0000000000 (s)

That’s the end of the output file.

I don’t run on GPUs so I can’t test this, but the fact that Hartree-Fock works fine makes me suspect the quadrature grid. (Even on CPUs, there have been some recent bug fixes related to memory-greedy quadrature code in the new Kohn-Sham library.) The default grid for that functional is SG-2, so what happens if you turn it down to SG-1 or (just for testing) to SG-0, using XC_GRID = 1 or 0?

Thanks, John. XC_GRID = 1 solved the problem.
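
For anyone who lands here with the same error, the change amounts to one extra line in the $rem section. The fragment below is a sketch: it is the $rem block from the input posted above with only the XC_GRID line added, and it assumes nothing else in the job needs to change. Since SG-1 is a coarser grid than the SG-2 default for this functional, expect small differences in the computed energy compared with an SG-2 run.

```
$rem
	METHOD	  wB97M-V
	SYMMETRY false
	SYM_IGNORE true
	BASIS	def2-SVP
	XC_GRID	1             ! added line: SG-1 instead of the default SG-2
	MEM_STATIC 10000
	GUI	2
	MAX_SCF_CYCLES 100
$end
```

Setting XC_GRID to 0 (SG-0) instead is the "just for testing" variant suggested above.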

How big is this molecule (number of atoms, number of basis functions), and are you willing to provide a complete input file? (You could omit the external charges; I don’t think they are relevant to the problem, and the GPUs may be incidental as well.) If so, I can create a bug ticket. It seems we are still working out what the memory limits for DFT quadrature integral batching should be with the new DFT library.

1046 atoms, 11500 basis functions.

How do you want me to send the input file?

Will it fit here without the external charges?

No. It doesn’t fit by a factor of 3 or so.

Okay, my email address is easy to find (but I won’t post it here).

Thanks, John. The file has been sent.

Received. I will look into it and post an update if anything changes.
