Taking advantage of parallel computing in estimating structural models

Topics: IRIS in General
Oct 17, 2014 at 10:11 PM
Edited Oct 17, 2014 at 10:16 PM
Dear all,

I have come to the estimation of a fairly large model (more than 50 constrained parameters) which takes a long time either in the regularized likelihood (estimate) or the posterior Bayesian simulation (arwm) steps.

More specifically, the first step runs an 'estimate' command

options = optimoptions('fmincon','UseParallel',true);
penalty = 0.5;
[est,pos,cov,hess,mstar,v,~,~,delta] =...
    'penalty=',penalty,'tolx=',1e-6,'tolfun=',1e-6,'MaxFunEvals=',100000,'nosolution=', ... 

This step takes more than two days and above 100 iterations and the running time does not improve by using a parallel pool, setting parallel options for 'fmincon' or both. Therefore, if the 'fmincon' is running, it seems that options are overridden inside function handler, which may explain why parpool and options statements do not improve execution time.

In the second step I run a parallelized 'arwm' as follows

[theta,logpost,ar] = arwm(pos,NDraws, ...
'adaptScale=',1,'adaptProposalCov=',0.5,'burnin=',0.10, ...
'estTime=',true,'firstPrefetch=',100, 'nStep=',100, ...

However, this step stops at the start because some parallel object 's' is not found, but if the parallelization options are removed it runs well, but slow.

Do you have any ideas on how to improve estimation speed by taking advantage of parallel computing?

Best regards,

Juan Manuel
Oct 17, 2014 at 10:33 PM
Edited Oct 17, 2014 at 10:50 PM
IRIS correctly passes options to the optimization toolbox. However, fmincon is not an algorithm that has much potential to benefit from parallel processing. Genetic algorithms (like the one in the Global Optimization Toolbox) and particle swarms (like the one in IRIS) work much better for solving large optimization problems in parallel. These algorithms scale efficiently in parallel because they allow distribution of work with minimal (and infrequent) communication overhead. The same cannot be said about gradient-based algorithms like fmincon, which can easily be slower in parallel than in serial.

I will look into the issue you're having with arwm(). However, I will warn you that in our testing with this we weren't able to obtain good performance improvements except on extremely large models (>400 equations) on large clusters (>64 workers), and even then the improvements were meager (50% reduction in run time). This is because MATLAB implements parallel processing using MPI, and this means that even if you're running code on a shared memory machine, all of the messages are passing through your network interface. In most cases the latency associated with passing messages through TCP/IP outweigh the benefits of sharing work across processors. If your model is large enough and it accordingly takes long enough to evaluate the likelihood then you may be able to see some benefit. But it's pretty doubtful until The Mathworks decides to include an OpenMP-style option for implementing parallel algorithms in MATLAB. And that's pretty doubtful.

Running N chains K times is NOT the same as running one chain N*K times. This is what Dynare does. You can do this pretty easily in IRIS by calling arwm() inside a parfor loop, but it's a poor way to do parallel sampling.

I have been testing some other algorithms but so far nothing works that great.
Oct 20, 2014 at 4:57 PM
Dear Mike,

Thanks for your prompt response.

I will try to implement PSO in IRIS directly.

Regarding arwm, I am using a single multi-core PC and thus the number of workers is restricted to 4. Therefore, I will have to make do with arwm. However, from the time it takes to compute and display the likelihood around the optimum, I gather that the calculation of the likelihood is time consuming and thus paralellized arwm might improve a little.

I would really appreciate if you point me out to a tutorial to get me started with PSO in IRIS.

Best regards,

Juan Manuel
Oct 20, 2014 at 5:30 PM
Edited Oct 22, 2014 at 6:58 PM
Dear Juan Manuel,

The tutorial I wrote at the time we included the PSO in IRIS is available at the old Google Code site. I will update this and move it to Codeplex, but until then you can still get it from the old site:


In my experience having the most efficient serial algorithm is almost always the best way to go. That is, if you have a well-behaved optimization problem without a lot of local minima, gradient-based methods in serial will typically outperform stochastic optimization methods in parallel. On the other hand, if you have an optimization problem with many local minima then the stochastic algorithms (like GA and PSO) will typically outperform grid searches or multi-start gradient-based methods.

The issue in arwm() is pretty minor and the fix will be out in the next release. If you want a working copy now then go to the Source tab and click Download. Given your hardware description, I would not expect you to get good returns unless your model is sufficiently complex that the likelihood takes a very long time to evaluate. If you look at the Brockwell (2005) and Strid (2009) references in the arwm documentation, then you will see that even the theoretical maximum returns to using pre-fetching (i.e., assuming zero communication overhead) aren't that great. When you add the latency associated with MPI passing messages through TCP/IP then you are probably going to get negative returns. We have attempted to minimize the amount of communication overhead by making sure that the model is sent only once, and only the parameter vectors are sent at the beginning of each parfor loop, but this only helps a limited amount since the issue is really latency and not bandwidth.

Bottom line: short parfor loops are not efficient. The cost of distributing the work outweighs the benefits of having multiple workers. To illustrate,
N = 1000 ;
K = 4 ;
tic; for ii = 1:N; parfor jj = 1:K; rand(); end; end; toc
Elapsed time is 39.720156 seconds.
tic; parfor ii = 1:N; for jj = 1:K; rand(); end; end; toc
Elapsed time is 0.058583 seconds.
In the prefetching case, having longer parfor loops means taking more steps through the lattice. Doing this efficiently requires having a large number of workers, which you do not. If you have only four workers then it's not efficient to take more than two steps through the lattice.

Prefetching is attractive only in that it is the only algorithm (to my knowledge) which allows one to do RWMH in parallel without fundamentally altering the original algorithm (it does technically force you to do adaptation slightly less often but this isn't a big deal). There are some other algorithms I have experimented with but I have not been able to find anything that reliably works well on anything other than simple test cases. But it's good to hear that this is of interest to people... it motivates me to continue trying!

Oct 20, 2014 at 7:42 PM
The PSO tutorial is now posted in the Documentation section here. It is updated to reflect changes to class structure (under the hood) but has no substantive changes relative to the previous version.