The first step in performance tuning scientific code is to rewrite it so the
flow of control, especially the loop structure, is *crystal clear* and
obvious to the casual observer. Once you've done that, focus on the
innermost loops -- those sections whose execution count grows as the cube of
the problem size or faster. It is rare for scientific code to be of higher
order than the cube of the problem size, although I've seen it in
computational chemistry.
Once you've isolated the spots that are being executed most often, try
replacing scalar operations with vector operations and vector operations
with matrix operations. Modern compilers usually translate these fairly
efficiently, and hand-tuned assembler-level libraries are available for
things like the Basic Linear Algebra Subprograms (BLAS).
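
As a rough sketch of what this restructuring looks like, here is the same matrix product written first with scalar operations and then with a vector operation in the inner position. The function names (`matmul_scalar`, `dot`, `matmul_vector`) and the tiny test matrices are my own illustrative choices, and plain Python lists stand in for whatever array type your language provides.

```python
# Hypothetical illustration: names and data are assumptions, not a library API.

def matmul_scalar(a, b):
    """Scalar level: the innermost statement runs n * m * p times."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def dot(x, y):
    """Vector operation: inner product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def matmul_vector(a, b):
    """Vector level: the k loop becomes one dot product per output entry."""
    cols = list(zip(*b))  # columns of b, so each one can be passed to dot()
    return [[dot(row, col) for col in cols] for row in a]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

c_scalar = matmul_scalar(a, b)
c_vector = matmul_vector(a, b)
print(c_scalar)
print(c_vector)
```

The last step, replacing the vector operations with a single matrix operation, would collapse `matmul_vector` into one call. In Fortran that could be the `dot_product` and `matmul` intrinsics, and with BLAS the corresponding routines are `DDOT` and `DGEMM`. Which of these is fastest depends on your compiler and library, so it is worth timing each version.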