Question:
In CUDA, atomicAdd for double can be implemented using a while loop and an atomicCAS operation. But how could I implement an atomic add for type int3 efficiently?

Answer:
After further consideration, I'm not sure how an atomicAdd on an int3 would be any different than 3 separate atomicAdd operations, each on an int location. Why not do that?

(An int3 cannot be loaded as a single quantity anyway in CUDA at the machine level. The compiler is guaranteed to split that into multiple loads, so although there would be a hazard to asynchronously read the int3, that hazard would be there anyway, with or without atomics.)

But to answer the specific question you asked, it's not possible using atomics.
int3 is a 96-bit type. CUDA atomics support operations up to 64 bits only. Here is an atomic add example for float2 (a 64-bit type) and you could do something similar for up to e.g. short3 or short4.

You could alternatively use a reduction method or else a critical section. There are plenty of questions here on the SO cuda tag that discuss reductions and critical sections.

A reduction method could be implemented as follows:
1. Each thread that wants to make an atomic update to a particular int3 location uses this method to create a queue or list of the atomic update quantities.
2. Once the list generation is complete, launch a kernel to do a parallel reduction on the list, so as to produce the final reduced quantity that belongs in that location.
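
As a rough illustration of the two reduction steps above, here is a minimal sketch. The kernel names recordUpdates and reduceInt3, the host helper demoReduceInt3, the block size kThreads, and the placeholder contribution (1, 2, 3) are all assumptions made for this example, not part of any CUDA API. It also assumes every thread contributes to the same int3 location, that n is greater than zero, and it omits error checking.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Step 1: each thread records its update in a list instead of applying it atomically.
__global__ void recordUpdates(int3 *updates, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        updates[i] = make_int3(1, 2, 3);  // placeholder contribution
}

// Step 2: shared-memory tree reduction; each block writes one partial sum to out.
__global__ void reduceInt3(const int3 *in, int3 *out, int n)
{
    __shared__ int3 s[kThreads];
    int3 acc = make_int3(0, 0, 0);
    // Grid-stride loop so any n is covered by the launched threads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        acc.x += in[i].x;
        acc.y += in[i].y;
        acc.z += in[i].z;
    }
    s[threadIdx.x] = acc;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            s[threadIdx.x].x += s[threadIdx.x + stride].x;
            s[threadIdx.x].y += s[threadIdx.x + stride].y;
            s[threadIdx.x].z += s[threadIdx.x + stride].z;
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];
}

// Hypothetical host helper: builds the list, reduces it in two passes,
// and returns the final value for the single int3 location.
int3 demoReduceInt3(int n)
{
    int blocks = (n + kThreads - 1) / kThreads;
    int3 *d_updates = nullptr, *d_partial = nullptr, *d_result = nullptr;
    cudaMalloc(&d_updates, n * sizeof(int3));
    cudaMalloc(&d_partial, blocks * sizeof(int3));
    cudaMalloc(&d_result, sizeof(int3));

    recordUpdates<<<blocks, kThreads>>>(d_updates, n);
    reduceInt3<<<blocks, kThreads>>>(d_updates, d_partial, n);   // per-block partials
    reduceInt3<<<1, kThreads>>>(d_partial, d_result, blocks);    // final sum

    int3 h_result = make_int3(0, 0, 0);
    cudaMemcpy(&h_result, d_result, sizeof(int3), cudaMemcpyDeviceToHost);
    cudaFree(d_updates);
    cudaFree(d_partial);
    cudaFree(d_result);
    return h_result;
}
```

Calling demoReduceInt3(1000) from host code sums 1000 copies of the placeholder contribution. No atomics are involved, so the result does not depend on the order in which threads run.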
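
For the first suggestion in this answer (three separate atomicAdd operations, one per int component), a minimal sketch might look like the following. The names atomicAddInt3, accumulateKernel and demoAtomicAddInt3 are hypothetical and chosen only for illustration. The sketch assumes n is greater than zero and skips error checking.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): adds each component of val to *addr
// with an independent 32-bit atomicAdd. Each component update is atomic, but the
// three updates together do not form a single atomic transaction.
__device__ void atomicAddInt3(int3 *addr, int3 val)
{
    atomicAdd(&addr->x, val.x);
    atomicAdd(&addr->y, val.y);
    atomicAdd(&addr->z, val.z);
}

// Every thread adds the same placeholder value to one shared int3 location.
__global__ void accumulateKernel(int3 *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAddInt3(sum, make_int3(1, 2, 3));
}

// Hypothetical host helper: launches n threads and returns the accumulated value.
int3 demoAtomicAddInt3(int n)
{
    int3 h_sum = make_int3(0, 0, 0);
    int3 *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(int3));
    cudaMemcpy(d_sum, &h_sum, sizeof(int3), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    accumulateKernel<<<blocks, threads>>>(d_sum, n);

    cudaMemcpy(&h_sum, d_sum, sizeof(int3), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}
```

Because integer addition is commutative, the final sum is the same regardless of how the per-component updates interleave. A concurrent reader could still observe a partially updated int3, which is the hazard described in the answer.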
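
For vector types that do fit in 64 bits, the atomicCAS loop mentioned in the question can be applied to the whole value at once. The sketch below does this for short4. The names atomicAddShort4, accumulateShort4 and demoAtomicAddShort4 are hypothetical, and this is only one way to write the loop, not necessarily how the float2 example referred to in the answer does it. It relies on short4 being 8 bytes in size and 8-byte aligned, and it omits error checking.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: atomically adds val to *addr by treating the 8-byte
// short4 as one unsigned long long and retrying with atomicCAS until no other
// thread has modified the location between the read and the swap.
__device__ void atomicAddShort4(short4 *addr, short4 val)
{
    unsigned long long *p = reinterpret_cast<unsigned long long *>(addr);
    unsigned long long old = *p;
    unsigned long long assumed;
    do {
        assumed = old;
        short4 cur;
        memcpy(&cur, &assumed, sizeof(cur));        // unpack the 64-bit word
        cur.x += val.x;
        cur.y += val.y;
        cur.z += val.z;
        cur.w += val.w;
        unsigned long long desired;
        memcpy(&desired, &cur, sizeof(desired));    // repack the updated value
        old = atomicCAS(p, assumed, desired);
    } while (old != assumed);
}

// Every thread adds the same placeholder value to one shared short4 location.
__global__ void accumulateShort4(short4 *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAddShort4(sum, make_short4(1, 2, 3, 4));
}

// Hypothetical host helper: launches n threads and returns the accumulated value.
// n must be small enough that the sums fit in a 16-bit short.
short4 demoAtomicAddShort4(int n)
{
    short4 h_sum = make_short4(0, 0, 0, 0);
    short4 *d_sum = nullptr;
    cudaMalloc(&d_sum, sizeof(short4));
    cudaMemcpy(d_sum, &h_sum, sizeof(short4), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    accumulateShort4<<<blocks, threads>>>(d_sum, n);

    cudaMemcpy(&h_sum, d_sum, sizeof(short4), cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    return h_sum;
}
```

Unlike the component-wise approach, this updates all four components as one unit, at the cost of retries under contention. The same trick does not extend to int3, since 96 bits exceed the 64-bit limit stated in the answer.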