"A different solution to overcome startup delay by JIT while still allowing execution on newer GPUs is to specify multiple
code instances, as in
nvcc x.cu -arch=compute_10 -code=compute_10,sm_10,sm_13
This command generates exact code for two Tesla variants, plus ptx code for use by JIT in case a next-generation GPU is encountered.
nvcc organizes its device code in fatbinaries, which are able to hold multiple translations of the same GPU source code. At runtime,
the CUDA driver will select the most appropriate translation when the device function is launched."
"...the virtual compute architecture should always be chosen as low as possible, thereby maximizing the actual GPUs to run on. The real sm architecture should be chosen as high as possible (assuming that this always generates better code), but this is only possible with knowledge of the actual GPUs on which the application is expected to run. As we will see later, in the situation of just in time compilation, where the driver has this exact knowledge: the runtime GPU is the one on which the program is about to be launched/executed."
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc
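
To make the runtime selection described in the quotes more concrete, here is a minimal Python sketch of how an entry in a fatbinary might be chosen for a given device. It is only a simplified model, not the driver's actual algorithm. The function names `parse` and `select_translation` are hypothetical. The sketch assumes that code built for sm_XY runs on devices with the same major version and an equal or higher minor version, and that ptx built for compute_XY can be JIT-compiled for any device of capability X.Y or newer.

```python
# Simplified model of fatbinary selection; NOT the real driver logic.
# Assumptions (hypothetical, for illustration only):
#   - sm_XY binary code runs on devices with major X and minor >= Y
#   - compute_XY ptx can be JIT-compiled for any device >= X.Y


def parse(target):
    """Split 'sm_13' or 'compute_10' into (kind, (major, minor))."""
    kind, digits = target.split("_")
    return kind, (int(digits[:-1]), int(digits[-1]))


def select_translation(code_targets, device):
    """Return (target, needs_jit) for a device capability, or None."""
    binaries, ptx = [], []
    for target in code_targets:
        kind, version = parse(target)
        (binaries if kind == "sm" else ptx).append((version, target))

    # Prefer precompiled code: same major, highest minor not above the device.
    usable = [(v, t) for v, t in binaries
              if v[0] == device[0] and v[1] <= device[1]]
    if usable:
        return max(usable)[1], False

    # Otherwise fall back to the newest ptx the device can JIT-compile.
    usable = [(v, t) for v, t in ptx if v <= device]
    if usable:
        return max(usable)[1], True
    return None


# The -code list from the nvcc command quoted above.
targets = ["compute_10", "sm_10", "sm_13"]
print(select_translation(targets, (1, 3)))  # ('sm_13', False)
print(select_translation(targets, (1, 1)))  # ('sm_10', False)
print(select_translation(targets, (2, 0)))  # ('compute_10', True)
```

In this model only the last case, a GPU from a newer generation, pays the JIT startup cost. That is the trade-off the quoted command is meant to achieve.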