Computational Cluster Design
This comes from the NWChem users list. Someone asked for suggestions for an "optimal" cluster configuration for running NWChem (this configuration would probably be appropriate for Gaussian as well). This reply is useful for its attempt to quantify the problem. -FL
-----Original Message-----
From: owner-nwchem-users@emsl.pnl.gov
To: pjstimac@umich.edu
Cc: nwchem-users@emsl.pnl.gov
Sent: 12/17/2004 4:43 AM
Subject: Re: Linux cluster design

On Thu, 16 Dec 2004 17:34:58 -0500 pjstimac@umich.edu wrote:
> Greetings,
> We are in the process of designing a Linux computer cluster. We anticipate a major function of the cluster will be to perform rather large quantum chemistry calculations (CCSD(T) and multi-reference methods on small molecules). We plan on using NWChem to do some of these calculations. Our proposed cluster will have about 7 nodes consisting of dual 2.4 GHz Opteron processors. The motherboard we have selected has 8 slots for RAM. I am looking for recommendations for how much RAM we should get. Does anybody have any suggestions?
And the answer is... (long) same as ever: it depends on what you want to do. Your best bet is to make a sample run and profile its resource usage. And then I'll explain:
Your needs will be a function of usage pattern, usually a composite of
- single run needs
    - memory
    - cpu
    - communications
    - disk space
- number of runs
- pattern of runs (simultaneous vs serialized)
The formula might be something like
R_N = Njobs * R_1
i.e. the need for a resource is the need for a single job multiplied by the number of simultaneous jobs.
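As a minimal sketch of the formula above, the snippet below scales one job's needs by the number of simultaneous jobs. The class name, field names and sample figures are hypothetical placeholders, not NWChem measurements; the 14 simultaneous jobs simply assume one job per CPU on the proposed 7 dual-processor nodes.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Resource needs of a single run (hypothetical names and units)."""
    mem_gb: float
    disk_gb: float


def total_need(per_job: float, n_jobs: int) -> float:
    """R_N = Njobs * R_1: need for one resource across simultaneous jobs."""
    if n_jobs < 0:
        raise ValueError("n_jobs must be non-negative")
    return n_jobs * per_job


# Placeholder figures; replace them with numbers from a profiled sample run.
sample = JobProfile(mem_gb=2.0, disk_gb=50.0)
n_simultaneous = 14  # assumption: one job per CPU on 7 dual-CPU nodes

print("memory (GB):", total_need(sample.mem_gb, n_simultaneous))
print("disk (GB):", total_need(sample.disk_gb, n_simultaneous))
```

The per-job figures are the part that matters: they should come from profiling a representative run, as suggested above.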
Now, depending on your usage patterns: you say you plan to run jobs on small molecules and that CPU resources are already decided. So, let's concentrate on usage:
Job dispersion: are the jobs of more or less similar length? Probably yes, I guess.
Response time: do you need results ASAP, or can you wait a little bit?
Number of jobs: many? few? If you expect
Njobs (simultaneous jobs) >= Ncpus,
then
- if dispersion is small (all jobs of similar run length), there is no point in parallelizing single jobs: the CPUs will be busy anyway and you save the communication overhead, so you can ignore communication speed as a resource:
    R_comm = negligible
    R_mem = Njobs * R_mem_1
    R_disk = Njobs * R_disk_1
    R_cpu = already bound
- if dispersion is big
    - if response time is not critical, the same as above applies
    - if response time is critical, then you may want to run some long jobs with parallelization to speed them up. Your communication needs will be
        R_comm = N_long_jobs * R_comm_1
      i.e. the number of simultaneous long jobs you expect multiplied by the communication needs of an isolated, parallel job. The other needs are calculated with the total number of jobs:
        R_mem = Njobs * R_mem_1
        R_disk = Njobs * R_disk_1
If you expect
Njobs < Ncpus
then you want to run parallel versions of the jobs, otherwise CPUs will sit idle. It is hard to determine network needs in advance, but there are not that many choices. Your best bet: ask people for figures. For short messages 100 Mb Ethernet may give better results than gigabit; for long data exchanges gigabit is better. YMMV. The Beowulf mailing lists have a wealth of data on this issue; check them out at www.beowulf.org
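The case analysis above can be sketched as a small function. This is only an illustration: the function name, parameter names and returned keys are hypothetical, and the communication estimate for the Njobs < Ncpus case (Njobs * R_comm_1) is an assumption made by analogy with the long-jobs formula, since no formula is given for that case.

```python
def estimate_needs(n_jobs, n_cpus, mem_1, disk_1, comm_1,
                   similar_lengths=True, response_critical=False,
                   n_long_jobs=0):
    """Aggregate resource needs following the case analysis in the text.

    mem_1, disk_1 and comm_1 are the needs of one isolated job, in
    whatever units you profiled them in.
    """
    # Memory and disk always scale with the total number of jobs.
    needs = {"mem": n_jobs * mem_1, "disk": n_jobs * disk_1}

    if n_jobs >= n_cpus:
        if similar_lengths or not response_critical:
            # CPUs stay busy with serial jobs; communication is negligible.
            needs["comm"] = 0.0
            needs["parallelize"] = False
        else:
            # Only the long jobs are run in parallel to speed them up.
            needs["comm"] = n_long_jobs * comm_1
            needs["parallelize"] = n_long_jobs > 0
    else:
        # Fewer jobs than CPUs: run parallel jobs so CPUs do not sit idle.
        # ASSUMPTION: every job communicates like an isolated parallel job.
        needs["comm"] = n_jobs * comm_1
        needs["parallelize"] = True
    return needs


# Placeholder numbers, not measurements.
print(estimate_needs(n_jobs=20, n_cpus=14, mem_1=2.0, disk_1=50.0, comm_1=1.0))
print(estimate_needs(n_jobs=4, n_cpus=14, mem_1=2.0, disk_1=50.0, comm_1=1.0))
```

The units of `comm_1` are left open on purpose; the figures gathered from other users' measurements would go there.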
- Memory vs Disk
Since you say your jobs will be small, you need to consider whether it is worth spending more money on RAM than on disk. If your jobs will fit in memory, better go for more RAM, as much as needed for Njobs.
If your jobs won't fit in memory, then you want to spend money on the fastest disks you can get. This depends on Njobs: it also means that you must consider the time spent waiting for I/O vs. the time spent on the CPU. As the ratio of memory per job decreases, your jobs will spend more time waiting for I/O and you will want to run more jobs to keep the CPUs busy. There is a tipping point though: when you have too many jobs, the time spent doing I/O to secondary memory becomes excessive and the system starts thrashing and stalls. If you have too many jobs, then you should consider trading disk space for distributed computing, i.e. disk I/O for network I/O. Depending on the amount of network I/O needed, this may or may not be better.
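A rough way to check which side of the memory-vs-disk question you are on is sketched below. The function name, the 1 GB reserve for the operating system and the 2 GB DIMM size are assumptions for illustration only; the 8 slots and 2 CPUs per node come from the proposed configuration.

```python
def fits_in_ram(ram_per_node_gb, jobs_per_node, mem_per_job_gb,
                os_reserve_gb=1.0):
    """True if all jobs on a node fit in RAM after an assumed OS reserve."""
    usable_gb = ram_per_node_gb - os_reserve_gb
    return jobs_per_node * mem_per_job_gb <= usable_gb


slots = 8            # RAM slots on the selected motherboard
dimm_gb = 2.0        # hypothetical DIMM size
jobs_per_node = 2    # one job per CPU on a dual-processor node

ram_gb = slots * dimm_gb
for mem_per_job in (2.0, 6.0, 8.0):  # placeholder per-job memory needs
    verdict = "fits in RAM" if fits_in_ram(ram_gb, jobs_per_node, mem_per_job) \
        else "will spill to disk"
    print(f"{mem_per_job} GB per job on {ram_gb} GB per node: {verdict}")
```

If the check fails for your profiled jobs, that is the case where fast disks, or fewer jobs per node, start to matter.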
So, to sum it up:
It is a complex equation with interrelated variables. There is no single answer that fits all needs: it all depends on your usage pattern and needs.
As an example:
I have an OpenMosix cluster here: it is fine for running many simultaneous small jobs with no parallelization, it is easy to set up and maintain, and it has few requirements.
For a few big jobs, I'd advise spending the money on memory and network bandwidth.
For many big jobs, disk is more important, as parallelization will yield less benefit and there won't be enough memory.
You get the idea.
YMMV
j
