Why pure computation time in parallel is longer than the serial version?
Hi,

2014-02-22 19:30 GMT+09:00 Xuening Zhu <puddingnnn529 at gmail.com>:
My CPU is *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz.* There are 2 physical cores and 2 additional logical cores. The memory size is 8 GB. And my
Theoretical peak performance of your CPU:
  with SIMD: 4 FLOPs per clock x 2.5 GHz x 2 physical cores = 20 GFLOPS
  with AVX:  8 FLOPs per clock x 2.5 GHz x 2 physical cores = 40 GFLOPS
  # Because there are two physical cores, there are two computing units.
The operation count of DGEMM is O(N^3).
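The peak figures above can be reproduced with a few lines of R. This is only a sketch of the arithmetic; the helper name peak_gflops and the variable names are hypothetical, and the FLOPs-per-clock values are simply the ones quoted above.

```r
# Theoretical peak = FLOPs per clock x clock rate (GHz) x physical cores.
# peak_gflops is a hypothetical helper used only for this sketch.
peak_gflops <- function(flops_per_clock, clock_ghz, physical_cores) {
  flops_per_clock * clock_ghz * physical_cores
}

clock_ghz <- 2.5        # nominal clock of the i5-3210M
physical_cores <- 2     # logical (hyper-threaded) cores add no computing units

peak_simd <- peak_gflops(4, clock_ghz, physical_cores)  # 4 FLOPs per clock
peak_avx  <- peak_gflops(8, clock_ghz, physical_cores)  # 8 FLOPs per clock

cat("SIMD peak:", peak_simd, "GFLOPS\n")
cat("AVX peak: ", peak_avx, "GFLOPS\n")
```

Only the physical cores enter the product, which is why the two extra logical cores do not raise the peak.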
I chose a 10^3 * 10^4 matrix and want to evaluate the time of its multiplication (t(m) %*% m). I don't consider tcrossprod() because I just want to make the computation take longer. Maybe more cases can be compared later.
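A measurement of this kind could be sketched in R as follows. The dimensions here are an assumption, deliberately much smaller than the ones discussed in this thread so that the sketch finishes quickly, and the variable names are hypothetical.

```r
# Small, hypothetical dimensions so the sketch runs in well under a second.
set.seed(1)
m <- matrix(rnorm(300 * 400), nrow = 300, ncol = 400)

# Time the explicit transpose-and-multiply form discussed in the thread.
timing <- system.time(p <- t(m) %*% m)
print(timing)

# crossprod(m) computes the same product without forming t(m) explicitly,
# so it serves as a correctness check here.
same_result <- isTRUE(all.equal(p, crossprod(m)))
cat("matches crossprod(m):", same_result, "\n")
```

When "user" time is clearly larger than "elapsed" time in such output, the work was usually spread over more than one thread.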
The operation count of this computation is O(N^3).
Required calculation ...
  2*(3e3 * 4e3 * (3e3+4e3)/2) = 84 GFLOP
  with SIMD: 84/20 = 4.2 sec
  with AVX:  84/40 = 2.1 sec
So it takes 2.1 seconds at the theoretical peak performance of your CPU. Because the effective efficiency is usually in the range of about 80% to 90%, it is likely to take about 2.6 seconds in practice.
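The estimate above can be checked numerically. The following is a minimal sketch that reuses the operation-count formula and the 20 and 40 GFLOPS peaks exactly as stated in this message; the variable names are hypothetical.

```r
# Matrix dimensions used in the estimate above.
n1 <- 3e3
n2 <- 4e3

# Operation count from the formula in this message, converted to GFLOP.
gflop <- 2 * (n1 * n2 * (n1 + n2) / 2) / 1e9

# Theoretical peaks (GFLOPS) as computed earlier in this message.
peak <- c(SIMD = 20, AVX = 40)

# Time at 100% of peak, in seconds.
ideal_sec <- gflop / peak

# Time with AVX at 80% and 90% effective efficiency.
efficiency <- c(0.8, 0.9)
realistic_avx_sec <- ideal_sec[["AVX"]] / efficiency

cat("operations:", gflop, "GFLOP\n")
print(ideal_sec)
print(round(realistic_avx_sec, 2))
```

At 80% efficiency the AVX case comes out at roughly 2.6 seconds, which is the figure quoted above; at 90% it is somewhat lower.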
   user  system elapsed
 10.164   0.512   5.549
Maybe this is the performance of a Nehalem core. Also, hyper-threading decreases cache efficiency during the computation.

Best Regards,
EI-JI Nakama <nakama (a) ki.rim.or.jp>
"中間栄治" <nakama (a) ki.rim.or.jp>