[Rtk-users] GPUs testing

Fri Jul 20 15:58:04 CEST 2018

Hi,
Chao's observations are correct.
- Yes, it's the time per execution. The backprojection filter was called 43
times with an average of 0.0275691 s, so 1.185 s in total. The 0.025 second
difference with the info displayed on your command line is not completely
surprising because the timing done automatically for each filter
starts/stops the timing a bit after/before. Note that we have removed any
other timing than RTK_TIME_EACH_FILTER in the current version.
- CudaFDKConeBeamReconstructionFilter is a "mini-pipeline" (see ITK doc)
of a few filters, as documented here
<http://www.openrtk.org/Doxygen/classrtk_1_1FDKConeBeamReconstructionFilter.html>.
The ExtractImageFilter is missing from the drawing but the sum of the 4
(ExtractImageFilter, FDKWeightProjectionFilter, FFTRampImageFilter and
FDKBackProjectionImageFilter) gives
43*(0.0130324+0.0275+0.0389145+0.0275691) = 4.6 s.
It's different from the 9.6 s you observe and I'm afraid I don't know
why. But I did not see FDKWeightProjectionFilter in your list, did you
remove some filters from the list? If some filters are missing from the
report, that might explain the few seconds that are unaccounted for.
- I'm not sure CUDAWeighting is the longest... Cuda computation is
asynchronous so the backprojection, which should be the longest, might be
finishing in the weighting filter. If you want a more accurate timing, you
probably need to force synchronous computation.
I hope this helps.
Simon

On Fri, Jul 20, 2018 at 1:03 PM, Chao Wu <wuchao04 at gmail.com> wrote:

> Hi, At a quick look, the time reported with RTK_TIME_EACH_FILTER seems to
> be the time per execution of each filter. I didn't look into the code so I
> have no idea whether it is an average time or the time of the last execution.
> In addition (not shown in your example), if one filter has more than one
> instance in the pipeline, the report only lists the total number of
> executions of all instances.
> Regards, Chao
>
> On Thu, Jul 19, 2018 at 11:44 AM, Elena Padovani <elenapadovani.lk at gmail.com> wrote:
>
>> Hi Simon,
>> Thank you for the fast reply. I changed
>> RTK_CUDA_PROJECTIONS_SLAB_SIZE but unfortunately nothing has changed. I
>> also compiled with the RTK_TIME_EACH_FILTER flag on, and I do not
>> understand why it tells me that CudaFDKBackProjectionImageFilter took
>> 0.0275 s, CudaFDKConeBeamReconstructionFilter took 9.58 s
>> and CudaFFTRampImageFilter took 0.0389 s, while the PrintTiming method
>> tells me that prefilter operations took 6.65 s, the ramp filter 1.71 s and
>> backprojection 1.21 s. So my question is: where is the remaining time
>> spent? For instance, is
>> (Backprojection=1.21 - CudaFDKBackProjectionImageFilter=0.0275) the
>> time needed to copy the memory from CPU to GPU? The same holds for the ramp
>> filter.
>> Moreover, it seems to me that what is taking long is the CUDAWeighting
>> filter, so do you think that increasing the number of threads per block,
>> which is now { 16, 16, 2 }, could help?
>>
>> Here is what the application shows me with the -v option:
>>
>> Reconstructing and writing... It took 11.8574 s
>> FDKConeBeamReconstructionFilter timing:
>>   Prefilter operations: 6.65107 s
>>   Ramp filter: 1.71472 s
>>   Backprojection: 1.21037 s
>>
>>
>> *********************************************************************************
>> Probe Tag                                     Starts    Stops     Time (s)
>> *********************************************************************************
>> ConstantImageSource                           1         1         0.0962241
>> CudaCropImageFilter                           43        43        0.00230094
>> CudaFDKBackProjectionImageFilter              43        43        0.0275691
>> CudaFDKConeBeamReconstructionFilter           1         1         9.58291
>> CudaFFTRampImageFilter                        43        43        0.0389145
>> ExtractImageFilter                            43        43        0.0130324
>> FFTWRealToHalfHermitianForwardFFTImageFilter  12        12        0.00128049
>> ImageFileReader                               686       686       0.0481416
>> ImageFileWriter                               1         1         11.8383
>> ImageSeriesReader                             686       686       0.0484766
>> ProjectionsReader                             1         1         44.7685
>> Self                                          129       129       0.0506474
>> StreamingImageFilter                          2         2         27.713
>> VarianObiRawImageFilter                       686       686       0.0135297
>>
>> At the beginning I was using my own application with my own data. I have
>> now switched back to the wiki VarianReconstruction test (with a 512^3
>> reconstructed volume).
>>
>> Thank you again,
>> Kind Regards
>>
>> Elena
>>
>> 2018-07-18 22:00 GMT+02:00 Simon Rit <simon.rit at creatis.insa-lyon.fr>:
>>
>>> Hi,
>>> Thanks for sharing your results.
>>> RTK uses CUFFT for the ramp filtering which does its own blocks/grid
>>> management. For backprojection, it's pretty simple, see
>>> https://github.com/SimonRit/RTK/blob/master/src/rtkCudaFDKBackProjectionImageFilter.cu#L198
>>> It is mostly hardcoded, independent of the number of CUDA cores and could
>>> be optimized. There is one compilation parameter that you can try to
>>> change to see if that speeds up the computation: the cmake variable
>>> RTK_CUDA_PROJECTIONS_SLAB_SIZE, which controls how many projections are
>>> backprojected simultaneously.
>>> We currently don't propose any way to use multiple GPUs.
>>> Please keep us posted if you continue to do some tests. In particular, I
>>> advise turning on RTK_TIME_EACH_FILTER in cmake so that you get a report
>>> with -v option in applications on how much time your program spent in each
>>> filter.
>>> Best regards,
>>> Simon
>>>
>>> On Wed, Jul 18, 2018 at 6:48 PM, Elena Padovani <
>>> elenapadovani.lk at gmail.com> wrote:
>>>
>>>> Hi RTK-users,
>>>>
>>>> I compiled RTK with CUDA and tried to set up a benchmark to analyze the
>>>> performance trend of the GPUs when using the CUDA-FDK reconstruction
>>>> filter. Precisely, when reconstructing the same volume from the same
>>>> data-set on NVS510, GTX860M and GTX970M, I got results consistent with
>>>> the number of CUDA cores of each GPU. Indeed, when setting up this
>>>> benchmark I was expecting a reduction in the reconstruction time with the
>>>> increase of CUDA cores (at least as long as the dimension of the
>>>> reconstructed volume was not the actual bottleneck). However, when
>>>> testing it on a Tesla P100 I got performance comparable to the GTX860M.
>>>> Would you expect such a result?
>>>>
>>>> Unfortunately I am new to CUDA and I was wondering if any of you could
>>>> help me figure this out.
>>>> How does RTK with CUDA manage the number of blocks/grid dimension?
>>>> Is the number of blocks/grid dimension dependent on the GPU CUDA cores?
>>>> Is there a way to use multiple GPUs?
>>>>
>>>> The test was carried out with the following data:
>>>> - 360 projections
>>>> - reconstructed volume 600x700x800 px
>>>>
>>>> Thank you in advance
>>>> Kind regards
>>>>
>>>> Elena
>>>>
>>>>
>>>> _______________________________________________
>>>> Rtk-users mailing list
>>>> Rtk-users at public.kitware.com
>>>> https://public.kitware.com/mailman/listinfo/rtk-users
>>>>
>>>>
>>>
>>
>