Differences

heart_wall [2018/10/03 17:37]
root
heart_wall [2018/10/03 17:38] (current)
root
Line 17: Line 17:
  
 Partitioning the working set between caches and avoiding cache thrashing both contribute to performance. The CUDA implementation of this code is a classic example of exploiting braided parallelism. Processing of sample points is assigned to multiprocessors (TLP), while processing of individual pixels in each sample image is assigned to the processors inside each multiprocessor. However, each GPU multiprocessor is usually underutilized because of the limited amount of computation at each computation step. The large size of the processed images and the lack of temporal locality did not allow fast shared memory to be used. The GPU overhead (data transfer and kernel launch) is also significant. To provide better speedup, more drastic GPU optimization techniques were used that sacrificed modularity in order to include the code in one kernel call. These techniques also combined unrelated functions and data transfers in single kernels.
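 The two-level mapping described above can be illustrated with a small CPU-side emulation in Python. This is only a sketch, not the Rodinia kernel: the sizes, the names run_grid, run_block and process_pixel, and the per-pixel work (squaring a value) are all hypothetical, chosen to show how one "block" per sample point and a strided set of "threads" per pixel divide the work.

```python
# Hypothetical sizes; the source does not state the real ones.
NUM_POINTS = 4          # sample points, one per emulated block (task-level parallelism)
THREADS_PER_BLOCK = 8   # emulated threads inside a block (data-level parallelism)
PIXELS_PER_POINT = 20   # pixels in each sample image


def process_pixel(value):
    """Placeholder per-pixel work; the real computation is not shown in the source."""
    return value * value


def run_block(block_idx, images, out):
    """Emulate one block: its threads stride over the pixels of one sample image."""
    image = images[block_idx]
    for thread_idx in range(THREADS_PER_BLOCK):
        # Thread t handles pixels t, t + T, t + 2T, ... where T is the block size.
        for p in range(thread_idx, len(image), THREADS_PER_BLOCK):
            out[block_idx][p] = process_pixel(image[p])


def run_grid(images):
    """Emulate the grid: one independent block per sample point."""
    out = [[0] * len(img) for img in images]
    for block_idx in range(len(images)):
        run_block(block_idx, images, out)
    return out


# Synthetic input: each sample point gets its own small "image".
images = [[point * 100 + p for p in range(PIXELS_PER_POINT)]
          for point in range(NUM_POINTS)]
result = run_grid(images)
```

 In this emulation every pixel is visited exactly once. If a sample image held fewer pixels than THREADS_PER_BLOCK, some emulated threads would have no work, which mirrors the underutilization noted in the text.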
 +
 +
 +Retrieved from "https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php?title=Heart_Wall&oldid=669"
  • heart_wall.txt
  • Last modified: 2018/10/03 17:38
  • by root