
question about Makeconf and nvcc/CUDA

Hi Erin

See the code below.

Basically, I have created a new routine that you will
call from R as

 .C("cuda4", 1L, "5")

where 1L is the number of arguments you are passing and "5" is the character vector of arguments.

We are using .C() here for simplicity. For other cases involving data, .Call() would be better.

The cuda4 routine is not name-mangled, and it has the parameter types and return type
needed to be called via .C().
The good thing is that this has nothing to do with CUDA; it is just a matter of calling C++ routines from R.
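
As a minimal sketch of that point, with no CUDA involved: the routine names demo_count
and parse_count below are hypothetical, chosen only for illustration. The wrapper is
declared extern "C" so its symbol is not mangled, it returns void, and every argument
is a pointer, which is what .C() expects (an R integer vector arrives as int *, an R
character vector as char **).

```cpp
#include <cstdlib>   // std::atoi

// Ordinary C++ function: its symbol is name-mangled, so R cannot find it by name.
static int parse_count(int argc, char **argv)
{
    // Called via .C() with 1L and "5": argc == 1 and argv[0] == "5".
    return (argc > 0) ? std::atoi(argv[0]) : 0;
}

// extern "C" wrapper (hypothetical name): unmangled symbol, void return,
// and pointer arguments only. The result is written back through a pointer.
extern "C" void demo_count(int *nels, char **els, int *result)
{
    *result = parse_count(*nels, els);
}
```

Assuming this is compiled into a shared library and loaded with dyn.load(), the call
from R would look like .C("demo_count", 1L, "5", result = 0L)$result, since .C()
returns a list of its (possibly modified) arguments.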


This illustrates that there are complexities here with different devices, languages, etc.
This is one of the reasons a high-level interface to calling kernels is simpler and
more flexible.

The following R-only code invokes the kernel on actual data we have in R (x).

# Put only the square_array routine in a file named erinHodgess.cu
# add extern "C" before the square_array routine

  # generate the PTX code
f = nvcc("erinHodgess.cu")

  # load the PTX code
mod = loadModule(f)

  # Invoke the PTX code
x = rnorm(100000)
ans = .gpu(mod$square_array, x, length(x), gridBy = x)

all.equal(x*x, ans)


  D.

////////////////////////////////


#include <stdio.h>
#include <stdlib.h>   // atoi, malloc, free
#include <cuda.h>

// Kernel that executes on the CUDA device
extern "C"
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
void stuff(int argc, char **argv)
{
  float *a_h, *a_d;  // Pointer to host & device arrays
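  // Called via .C("cuda4", 1L, "5"), argv holds a single string, so the
  // element count is argv[0]; there is no program name as in a real main().
  int N = atoi(argv[0]);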
//  const int N = 10;   Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);        // Allocate array on host
  cudaMalloc((void **) &a_d, size);   // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i=0; i<N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
  square_array <<< n_blocks, block_size >>> (a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h); cudaFree(a_d);
}

extern "C"
void
cuda4(int *nels, char **els)
{
   stuff(*nels, els);
}
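
The block-count arithmetic and the idx < N guard in the code above can be checked on
the host without a GPU. The following is only a sketch: blocks_for and square_on_host
are hypothetical helper names, and the loops merely emulate the (block, thread)
indexing of the kernel launch sequentially.

```cpp
#include <vector>

// Number of blocks so that n_blocks * block_size >= N (ceiling division).
// For block_size > 0 this equals N/block_size + (N%block_size == 0 ? 0 : 1).
int blocks_for(int N, int block_size)
{
    return (N + block_size - 1) / block_size;
}

// Host-side emulation of the launch: visit every (block, thread) pair,
// compute the same global index as the kernel, and apply the same guard.
std::vector<float> square_on_host(const std::vector<float> &in, int block_size)
{
    std::vector<float> out(in);
    const int N = static_cast<int>(out.size());
    const int n_blocks = blocks_for(N, block_size);
    for (int block = 0; block < n_blocks; ++block) {
        for (int thread = 0; thread < block_size; ++thread) {
            // Mirrors blockIdx.x * blockDim.x + threadIdx.x
            int idx = block * block_size + thread;
            // The last block may have threads past the end of the array.
            if (idx < N) out[idx] = out[idx] * out[idx];
        }
    }
    return out;
}
```

With N = 10 and block_size = 4, three blocks are launched (12 threads), and the guard
keeps the last two threads from touching memory past the end of the array.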
On 7/18/13 12:46 PM, Hodgess, Erin wrote: