Question: In CUDA, atomicAdd for double can be implemented using a while loop and an atomicCAS operation. But how could I implement an atomic add for type int3 efficiently?
Answer: After further consideration, I'm not sure how an atomicAdd on an int3 would be any different than 3 separate atomicAdd operations, each on an int location. Why not do that?
(An int3 cannot be loaded as a single quantity anyway in CUDA at the machine level. The compiler is guaranteed to split that into multiple loads, so although there would be a hazard in asynchronously reading the int3, that hazard would be there anyway, with or without atomics.)
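The component-wise approach could look roughly like the sketch below. The names atomicAddInt3, accumulate and runAccumulate are made up for illustration and are not CUDA built-ins. Each component update is atomic by itself, but the three together are not one atomic transaction.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): adds `val` to `*addr` one
// component at a time. Each atomicAdd is atomic on its own int, but a
// concurrent reader may still observe a partially updated int3.
__device__ void atomicAddInt3(int3 *addr, int3 val)
{
    atomicAdd(&addr->x, val.x);
    atomicAdd(&addr->y, val.y);
    atomicAdd(&addr->z, val.z);
}

// Illustrative kernel: every thread adds (1, 2, 3) to the same location.
__global__ void accumulate(int3 *sum)
{
    atomicAddInt3(sum, make_int3(1, 2, 3));
}

// Hypothetical host driver: launches blocks * threads threads in total
// and returns the accumulated int3.
int3 runAccumulate(int blocks, int threads)
{
    int3 h_sum = make_int3(0, 0, 0);
    int3 *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(int3));
    cudaMemcpy(d_sum, &h_sum, sizeof(int3), cudaMemcpyHostToDevice);

    accumulate<<<blocks, threads>>>(d_sum);

    // The blocking copy waits for the kernel and also reports its errors.
    cudaError_t err = cudaMemcpy(&h_sum, d_sum, sizeof(int3), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_sum);
    return h_sum;
}
```

Calling runAccumulate(4, 256) launches 1024 threads, so every component of the result is 1024 times the per-thread increment once the kernel has finished.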
But to answer the specific question you asked: a single atomic add on an int3 is not possible using atomics. int3 is a 96-bit type, and CUDA atomics support operations up to 64 bits only. Here is an atomic add example for float2 (a 64-bit type), and you could do something similar for other types of up to 64 bits.
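The linked float2 example is not reproduced in this text. As a rough sketch of the same idea (not the linked code), a 64-bit atomicCAS retry loop over a float2 might look like this. The names atomicAddFloat2, accumulateFloat2 and runFloat2Accumulate are assumptions made for illustration.

```cuda
#include <cassert>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): atomically adds `val` to `*addr`
// by treating the 8-byte float2 as one 64-bit word and retrying atomicCAS
// until no other thread has modified the word in between.
__device__ float2 atomicAddFloat2(float2 *addr, float2 val)
{
    // float2 is 8-byte aligned, so the location can be viewed as a 64-bit word.
    unsigned long long *word = reinterpret_cast<unsigned long long *>(addr);
    unsigned long long old = *word;
    unsigned long long assumed;
    do {
        assumed = old;
        float2 cur;
        memcpy(&cur, &assumed, sizeof(cur));      // unpack the value we saw
        cur.x += val.x;
        cur.y += val.y;
        unsigned long long desired;
        memcpy(&desired, &cur, sizeof(desired));  // repack the updated value
        old = atomicCAS(word, assumed, desired);  // stores only if unchanged
    } while (old != assumed);

    float2 previous;
    memcpy(&previous, &old, sizeof(previous));
    return previous;  // value before this thread's add
}

// Illustrative kernel: every thread adds (1.0f, 2.0f) to the same location.
__global__ void accumulateFloat2(float2 *sum)
{
    atomicAddFloat2(sum, make_float2(1.0f, 2.0f));
}

// Hypothetical host driver: returns the accumulated float2.
float2 runFloat2Accumulate(int blocks, int threads)
{
    float2 h_sum = make_float2(0.0f, 0.0f);
    float2 *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(float2));
    cudaMemcpy(d_sum, &h_sum, sizeof(float2), cudaMemcpyHostToDevice);

    accumulateFloat2<<<blocks, threads>>>(d_sum);

    cudaError_t err = cudaMemcpy(&h_sum, d_sum, sizeof(float2), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_sum);
    return h_sum;
}
```

Unlike three separate atomicAdd calls, both components here change in one indivisible step, which is exactly what cannot be done for a 96-bit int3 with a 64-bit compare-and-swap.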
You could alternatively use a reduction method or else a critical section. There are plenty of questions here on the SO cuda tag that discuss reductions and critical sections.
A reduction method could be implemented as follows:
Each thread that wants to make an atomic update to a particular int3 location uses this method to create a queue or list of the atomic update quantities.
Once the list generation is complete, launch a kernel to do a parallel reduction on the list, so as to produce the final reduced quantity that belongs in that location.
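The two steps above could be sketched as follows. This is only one possible arrangement under stated assumptions: the list is built by claiming slots with an atomic counter, a single block performs the reduction, and the rule "only even-numbered threads contribute (1, 2, 3)" is invented purely for the demonstration. All names (enqueueUpdates, reduceList, runQueuedReduction, kBlock) are hypothetical.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // threads per block; must be a power of two

// Step 1: each thread that has an update claims a slot in the list.
__global__ void enqueueUpdates(int3 *list, unsigned int *count, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && tid % 2 == 0) {                 // demo rule: even threads only
        unsigned int slot = atomicAdd(count, 1u);  // claim a unique slot
        list[slot] = make_int3(1, 2, 3);           // the queued update quantity
    }
}

// Step 2: one block reduces the list and applies the total to the target.
__global__ void reduceList(const int3 *list, const unsigned int *count, int3 *target)
{
    __shared__ int3 partial[kBlock];
    int3 acc = make_int3(0, 0, 0);
    for (unsigned int i = threadIdx.x; i < *count; i += blockDim.x) {
        acc.x += list[i].x;
        acc.y += list[i].y;
        acc.z += list[i].z;
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Shared-memory tree reduction over the per-thread partial sums.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x].x += partial[threadIdx.x + stride].x;
            partial[threadIdx.x].y += partial[threadIdx.x + stride].y;
            partial[threadIdx.x].z += partial[threadIdx.x + stride].z;
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {  // single writer, so no atomics are needed here
        target->x += partial[0].x;
        target->y += partial[0].y;
        target->z += partial[0].z;
    }
}

// Hypothetical host driver: runs both steps for n threads, returns the target.
int3 runQueuedReduction(int n)
{
    int3 *d_list = nullptr, *d_target = nullptr;
    unsigned int *d_count = nullptr;
    cudaMalloc(&d_list, n * sizeof(int3));
    cudaMalloc(&d_target, sizeof(int3));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_target, 0, sizeof(int3));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    enqueueUpdates<<<(n + kBlock - 1) / kBlock, kBlock>>>(d_list, d_count, n);
    reduceList<<<1, kBlock>>>(d_list, d_count, d_target);  // same stream: runs after step 1

    int3 result = make_int3(0, 0, 0);
    cudaError_t err = cudaMemcpy(&result, d_target, sizeof(int3), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(d_list);
    cudaFree(d_target);
    cudaFree(d_count);
    return result;
}
```

A single reduction block keeps the sketch short; for long lists a multi-block reduction would likely be faster. The int3 itself is only ever written by one thread, which is why no 96-bit atomic is required.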