Dear all,

I am currently having a weird problem with a large-scale optimization routine. It would be nice to know whether any of you have run into something similar, and how you solved it.

I apologize in advance for not providing an example, but I think the non-reproducibility of the error may itself be a key point of this problem.

Simplest possible description of the problem: I have two functions, g(X) and f(v).

g(X):
i) takes a large matrix X as input;
ii) derives four other matrices from X (I'll call them A, B, C and D), then saves them to disk for debugging purposes.

Then f(v):
iii) loads A, B, C and D from disk;
iv) calculates the log-likelihood, which varies according to a vector of parameters, v.

My target application is quite big (X is a 40000x40000 matrix), so I created the following versions to test the code, the math and the parallelization:

#1) A simulated example with X being 100x100
#2) A scaled-down version of the target application, with X being 4000x4000
#3) The target application, with X being 40000x40000

When I use qsub to submit the job, using exactly the same code and processing cluster, #1 and #2 run flawlessly, so no problem there. These results tell me that the code, math and parallelization are fine.

For application #3, the optimizer converges to a vector v*. However, when I manually load A, B, C and D from disk and calculate f(v*), the value I get is completely different. For example:
- the qsub job says v* = c(0, 1, 2, 3) is a minimum with f(v*) = 1;
- when I manually load A, B, C and D from disk and calculate f(v*) on the exact same machine, with the same libraries and environment variables, I get f(v*) = 1000.

This is very confusing behavior. In theory the size of X should not matter, but things seem to become unstable as the dimension grows. The main obstacle to debugging is that g(X) for case #3 takes two hours to run, and I am completely lost on how to find the cause of the problem. Do you have any general advice?
Thank you very much in advance for any suggestions you might have!

Best regards,
Arthur
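A minimal sketch of the two-step structure described above; the derivations of A, B, C and D are placeholders (the thread never shows the real computations), and only the save/load pattern mirrors the description:

```r
# Sketch of the described pipeline. The derivations of A, B, C, D are
# hypothetical stand-ins; only the save/load structure follows the thread.
g <- function(X, path = "abcd.rds") {
  mats <- list(A = crossprod(X),     # placeholder derivations for the
               B = X + t(X),         # four matrices obtained from X
               C = X %*% X,
               D = diag(diag(X)))
  saveRDS(mats, path)                # saved to disk for debugging
}

f <- function(v, path = "abcd.rds") {
  m <- readRDS(path)                 # A, B, C, D reloaded from disk
  # log-likelihood as a function of v; toy placeholder expression:
  sum(diag(m$A)) - sum(v^2)
}
```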
Possible causes of unexpected behavior
7 messages · Eric Berger, Arthur Fendrich
Please confirm that when you do the manual load and check whether f(v*) matches the result from qsub, the check succeeds for cases #1 and #2 but fails only for #3.
On Fri, Mar 4, 2022 at 10:06 AM Arthur Fendrich <arthfen at gmail.com> wrote:
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Dear Eric,

Thank you for the response. Yes, I can confirm that; please see the behavior below. For #1, the results are identical. For #2, they are not identical but very close. For #3, they are completely different.

Best regards,
Arthur

--
For #1:
- qsub execution:
[1] "ll: 565.7251"
[1] "norm gr @ minimum: 2.96967368608131e-08"
- manual check:
f(v*): 565.7251
gradient norm at v*: 2.969674e-08

For #2:
- qsub execution:
[1] "ll: 14380.8308"
[1] "norm gr @ minimum: 0.0140857561408041"
- manual check:
f(v*): 14380.84
gradient norm at v*: 0.01404779

For #3:
- qsub execution:
[1] "ll: 14310.6812"
[1] "norm gr @ minimum: 6232158.38877002"
- manual check:
f(v*): 97604.69
gradient norm at v*: 6266696595

On Fri, Mar 4, 2022 at 09:48, Eric Berger <ericjberger at gmail.com> wrote:
Can you confirm that you have a distributed calculation running in parallel? Have you determined that it is thread-safe? How? Your check on the smaller examples may not have ruled out such possibilities.
On Fri, Mar 4, 2022 at 11:21 AM Arthur Fendrich <arthfen at gmail.com> wrote:
Dear Eric,

Yes, I can confirm that I have distributed calculations running in parallel. I am not sure this precisely answers the thread-safety question, since I am not familiar with that definition, but what I do is:

i) First, chunks of A, B, C and D are calculated from X in parallel by the worker nodes.
ii) Second, all the chunks are combined on my master node, and the final A, B, C and D are saved to disk.
iii) Then, still on the master node, I optimize f(v) using the final A, B, C and D.

When I debug, I skip steps i) and ii) and check only step iii) manually, by loading A, B, C and D from disk and evaluating f(v*). Does that seem correct?

Best regards,
Arthur

On Fri, Mar 4, 2022 at 10:33, Eric Berger <ericjberger at gmail.com> wrote:
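Steps i) and ii) as described could be sketched with the parallel package (the names, chunking and chunk contents are hypothetical; the real code runs across cluster nodes rather than local workers):

```r
# Sketch of steps i)-ii): per-chunk contributions computed in parallel on
# workers, then combined and saved on the master. All names hypothetical.
library(parallel)

cl <- makeCluster(4)                   # local stand-in for the worker nodes
chunks <- parLapply(cl, 1:4, function(i) {
  Xi <- matrix(rnorm(100), 10, 10)     # stand-in for the i-th block of X
  list(A = crossprod(Xi))              # partial contribution to A
})
stopCluster(cl)

A <- Reduce(`+`, lapply(chunks, `[[`, "A"))  # combine chunks on the master
saveRDS(A, "A.rds")                          # final matrix saved to disk
```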
If I understand correctly, steps i) and ii) can be ignored, i.e. we just focus on step iii) with A, B, C and D fixed. You do the optimization of f(v) to calculate, say, v* = argmin f(v). This optimization is single-threaded.

(A) In that case, I suggest you add some logging so that for each call to f() you record its input and output. Then you can re-confirm your validation test, i.e. that the "manual" calculation of f(v*) gives a different result from what is found in the log file.

(B) If (A) doesn't lead you anywhere: re-reading your original description of the process, it seems that the time-consuming part is creating A, B, C and D. If the evaluation of f(v) is not overly time-consuming, then run the optimization under valgrind. It is possible that you are depending on some uninitialized variables, or trashing memory somewhere.
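Suggestion (A) could be implemented with a small wrapper around the objective, so that every evaluation is recorded (a sketch only; the objective and starting point below are toy placeholders, not the real f):

```r
# Sketch of suggestion (A): log the input and output of every call to f().
make_logged <- function(f, logfile = "f_calls.log") {
  function(v) {
    val <- f(v)
    cat(sprintf("v = [%s]  f(v) = %.10g\n",
                paste(signif(v, 8), collapse = ", "), val),
        file = logfile, append = TRUE)
    val
  }
}

f_toy <- function(v) sum(v^2)               # stand-in for the real f(v)
res <- optim(c(1, 1), make_logged(f_toy))   # every evaluation is now logged
```

Comparing the last logged line against the manual f(v*) then pinpoints exactly where the two computations diverge.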
On Fri, Mar 4, 2022 at 11:54 AM Arthur Fendrich <arthfen at gmail.com> wrote:
Dear Eric,

I followed your suggestion (A) and I believe I finally got to the cause of the problem. It turns out that I was not exporting two environment variables for step iii). Because that part of the code does not run in parallel, I had simply been ignoring them:
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
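Spelled out as a shell fragment (the R script name is hypothetical): the manual check only matches the batch run when both variables are exported in the session before R starts.

```shell
# Pin OpenMP and OpenBLAS to one thread before launching R, so the manual
# check runs under the same BLAS threading as the qsub job did.
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
echo "OMP=$OMP_NUM_THREADS OPENBLAS=$OPENBLAS_NUM_THREADS"
# prints: OMP=1 OPENBLAS=1
# Rscript check_fvstar.R   # hypothetical script doing the manual f(v*) check
```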
When I do that, the results change, for some reason that I still have to investigate further. What I get now seems coherent (below). Thank you again for the help.

Best regards,
Arthur

## -- Results for optim(f) --

Case: qsub, with or without the two variables (same result for both):
- initial guess: v = [0 0 0 0 0 0 0 0 0], f(v) = 599765.9
- solution: v = [0.3529 -6.4176 -0.0271 -0.0066 0.0013 -0.0172 -0.0198 -0.0034 -0.0171], f(v) = 14310.68

Case: manual, without the two variables:
- initial guess: v = [0 0 0 0 0 0 0 0 0], f(v) = 643417.1
- solution: v = [1.5669 -6.2815 -0.0091 -0.0022 0.0004 -0.0059 -0.0066 -0.0014 -0.005], f(v) = 19712.85

Case: manual, with the two variables:
- initial guess: v = [0 0 0 0 0 0 0 0 0], f(v) = 599765.9
- solution: v = [0.3529 -6.4176 -0.0271 -0.0066 0.0013 -0.0172 -0.0198 -0.0034 -0.0171], f(v) = 14310.68

On Fri, Mar 4, 2022 at 11:13, Eric Berger <ericjberger at gmail.com> wrote: