Multi-CPU Theory Question

MacMasta · Posted: Thu Sep 19, 2002 10:51 pm Post subject: Multi-CPU Theory Question

I've been reading up on dual-CPU benchmarks (I've always wanted a dual-athlon box), and I have a question about linux >1 cpu theory vs. windows >1 cpu theory.

My understanding is that windows requires the software to be dual-CPU aware. That is, kernel32 itself certainly knows about both CPUs (at least on NT-based systems that aren't xp home) but if software doesn't know about them, then it will only use one. Software like 3DSMax or FlasK that is really cpu-intensive is written to take advantage of both CPUs in order to do things faster. The idea of software needing to know about CPUs reminds me of back in the old DOS days when every piece of software needed its own sound card driver, causing no end of headaches.

Does this apply to linux, as well? Does transcode (for making divx) or seti need to know about having dual CPUs, or does it just say "system, do the following math" and let the kernel figure out which CPU should be doing what?

I know the 2.4.x has had some smp issues, too - presumably 2.6.x will fix that and we'll all be going even faster, if we ever have the money to get smp machines. (And I don't)

So: anybody have a real firm grip on the linux smp implementation?

~Mac~

rac · Posted: Thu Sep 19, 2002 10:57 pm Post subject:

As long as the software separates itself into different threads and/or processes, it can be scheduled on separate CPUs. There are lots of details in the SMP HOWTO.
_________________
For every higher wall, there is a taller ladder

anatolli · n00b Joined: 07 Sep 2002 Posts: 19

not entierly sure about the linux SMP implementation, but I am a convert from win2k on my dually athlon box (yes... I do love it). Win2k didn't require that the programs be threaded for smp, the process scheduler in win nt based machines would happily split processes up between processors. I am assuming that linux does something similar to this. IE: if one processor is getting hammered by a render or something, then the kernel should drop the other processes onto the less loaded processor. This leads to what I and many people call the "Silky smoothness of SMP".

anatolli

phong · Posted: Fri Sep 20, 2002 3:20 am Post subject:

To split up a non-multithreaded or multi-process program on to multiple processors is a Hard Problem(tm). Parallelization of software is an incredibly fascinating and complicated subject, so forgive me if this post runs really long and goes way beyond the scope of the original poster's question. If you don't care about all the science, skip to the end.

Ignoring the fact that different processors don't share many things (like registers) that would be necessary to split up a single thread of execution onto two processors, there are more complicated problems. Programs expect their instructions to execute in a specific order, and if their instructions were distributed among multiple processors, this wouldn't be guaranteed to be the case. Some insturctions COULD be executed in parallel because they don't depend on the results of others, but determining when this is the case, and when it's not is Very Complicated (tm).

That's one of the reasons why your compiler takes so long to compile software. It's spending a lot of that time reordering tweaking and rearranging instructions. Why? So that all those millions of transistors on your processor don't go to waste. Modern CPUs have more than one execution unit and they devote a large section of hardware to determining which instructions depend on one another and then sending them to separate execution units. Your compiler spends a lot of time reordering instructions in such a way that your processor can more effectively execute instructions in parallel. Of course, this is a duplication of effort, but it's necessary because of our old instruction sets. That was the motivation behind the VLIW (or as Intel calls it, EPIC) design of Intel's IA-64 processors.

So how does this apply to multiple CPUs? While it is possible to sometimes execute a single stream of instructions in parallel on a single CPU, it requires lots of work by the compiler and the CPU to make it happen. Not to mention all the comunication and delegation of execution units with an CPU. The overhead of doing it between two whole CPUs is much MUCH more than the benefit of parallel execution. It would be like running an emulator of your own computer inside your kernel, only worse. Not to mention that communication between the processors would be way too slow.

Of course, multiple CPUs *are* useful because you're often (usually in fact) not dealing with a single stream of instructions in a single thread of execution. Many problems can be broken down into smaller problems. A 3d rendering program might break down an image into two halves and render each in a separate thread. The OS can issue each thread to separate processors and nearly double the rendering speed. Of course, a smarter program would break it down into smaller pieces to take advantage of more processors. It would also be careful to break it up into pieces of the same size (in terms of execution time). If one half of the image is more complex than the other, it would be sub-optimal to just "break it in half". Some problems can't be broken up this way, and the program can't be written in such a way as to take advantage of multiple CPUs. This would happen, for example, in a series of calculations where each calculation depends on the results of previous operations.

Since your OS is the one responsible for deciding which CPUs get which processes, it is important that it does a good job too. Just doing it "randomly" and giving half the processes to each CPU isn't optimal. Some processes (like rendering software) spends most of its time in computation and spend most of their time using the CPU. Other programs are IO bound - they spend most of their time waiting for information to come into or send out of the CPU (to the hard drive, video card, in from the user, etc.) They don't need to use the CPU while they're waiting. Several of these can be executing on a single CPU without much of a penalty. Put two CPU bound processes on one CPU and they slow down by half or more. OTOH, having too many IO bound processes on one CPU could fill up it's IO channels. Distributing processes optimally onto two or more processors is also a Hard Problem(tm).

This is the sort of thing people devote their entire research careers to and write PhDs about. Off hand, I can recall at least five college CS courses that I took that spent a very large portion of their time on this very problem from the point of view of the software, OS, compiler and CPU.

So it comes down to which OS is better at choosing which processes are assigned to which CPUs. Any OS depends on multiple threads of execution to take advantage of multiple CPUs. Which is better depends mostly on which one does a better job of distributing processes/programs/threads among CPUs without too much overhead.
_________________
"An empty head is not really empty; it is stuffed with rubbish. Hence the difficulty of forcing anything into an empty head."
-- Eric Hoffer

rac · Posted: Fri Sep 20, 2002 4:28 am Post subject:

I would like to expand a bit on a couple of points that phong touched on - there are certain types of software that lend themselves more or less to parallelization. Anything where the problem can be broken into independent chunks is easy to parallelize. Factoring, rendering, HTTP servers. Most event-driven software is very hard to parallelize, because you spend a lot of time sleeping in an event loop, and then you have to do one task as fast as you can.

The hardest program I have written so far in my career was designed to get news stories from a mainframe into a database. It was split logically into two parts: one that got headlines and noticed which stories had been updated, and another that retrieved the contents of the stories and fed them to the database. The whole mainframe link went over a modem line and terminal interface that was designed to be used by journalists in the field. Every step of the connection was very unreliable.

Another design goal was to make the program run on whatever computers happened to be around. There could be no single point of failure, and the system had to take advantage of other CPUs that became available automatically, and learn to route around ones that were temporarily or permanently down.

The hardest part of the entire system was the communications between the various nodes. During normal operation, there was one headline thread and N article threads. If an article thread died, the headline thread would just drop it from its list of available article fetchers, and reschedule whatever was in its queue. If the headline thread died, whichever article thread noticed it first would try to spawn a headline thread on its host, and if it passed a collision detector, the other article threads would notice where it is and start reporting to it.

So another thing that can affect the performance of distributed software is the complexity of the interaction required among the various threads.
_________________
For every higher wall, there is a taller ladder