Matrix multiplication algorithm in CUDA C