In this document the term ``threading'' is equivalent to ``parallelisation''.
Broadly speaking, there are two main approaches to making use of parallelism:
1) Divide the application into a number of smaller "jobs" and run each job on a separate CPU
2) Let the compiler parallelise the code for you
The former approach is the more traditional one, since until recently compiler technology did not generally support parallelism. This is the approach of ``hand tuning''.
There is an increasing push to use compilers to parallelise code. However, while appealing to the application programmer, current attempts at automated parallelisation of code generally fall far behind the performance of ``hand tuned'' parallel code.
The Karma library supports the first approach, and aims to minimise the time the application programmer must spend ``hand tuning'' code. It does this in two ways:
1) Provides a simple, efficient and uniform interface to threading libraries by implementing a ``job launching'' facility
2) Internal parallelisation of CPU intensive routines in the library
Clearly, if an application makes heavy use of CPU intensive routines in the Karma library, the benefits of a multi-CPU platform are gained automatically, without any intervention by the application programmer.
Of greater interest is the support provided for the application programmer to parallelise code. This is all provided by the mt package. The mt package allows the application programmer to create a pool of threads onto which jobs may be launched. Even if a platform does not support multi-threading, the same interface may still be used (without any significant performance penalty).
To make use of multi-threading, you first need to create a pool of threads onto which you can launch jobs. This is done by calling mt_create_pool. You may then launch jobs onto the thread pool by calling mt_launch_job. It is important to understand the distinction between threads and jobs: a thread is a flow of control (typically one per CPU) which lives for the lifetime of the pool, whereas a job is a unit of work which is launched onto the pool and executed by whichever thread becomes available. Many jobs may therefore run, one after another, on the same thread.
You may require that each thread has a private working space (other than the stack) which is available to the jobs that run on that thread. This private data is ``remembered'' for all the jobs run on the thread, and is a convenient way to merge the results of different jobs together.
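As a sketch of how this might look, consider summing a set of numbers using a per-thread accumulator. Note that the calling conventions assumed here for mt_new_thread_info (giving every thread a private copy of some initial data) and mt_get_thread_info (returning the block of all threads' data for reaping) are assumptions for illustration only; check the mt package reference for the actual signatures.

    /*  Hedged sketch: thread-private working space. The calling
        conventions of mt_new_thread_info and mt_get_thread_info are
        assumed here: check the mt package reference before using.  */
    #include <karma_mt.h>

    #define NUM_JOBS 64

    EXTERN_FUNCTION (void sum_job, (void *pool_info, void *call_info1,
                                    void *call_info2, void *call_info3,
                                    void *call_info4, void *thread_info) );

    main ()
    {
        int count;
        unsigned int thread_count;
        double init = 0.0;
        double total = 0.0;
        double *partials;
        KThreadPool pool = mt_create_pool (NULL);

        /*  ASSUMPTION: this gives every thread a private copy of the
            initial data (here a single double accumulator)  */
        mt_new_thread_info (pool, &init, sizeof init);
        for (count = 0; count < NUM_JOBS; ++count)
            mt_launch_job (pool, sum_job, (void *) count, NULL, NULL, NULL);
        mt_wait_for_all_jobs (pool);
        /*  Reap. ASSUMPTION: mt_get_thread_info returns the base of a
            contiguous block with one slot per thread  */
        partials = (double *) mt_get_thread_info (pool);
        for (thread_count = 0; thread_count < mt_num_threads (pool);
             ++thread_count) total += partials[thread_count];
    }

    void sum_job (void *pool_info, void *call_info1, void *call_info2,
                  void *call_info3, void *call_info4, void *thread_info)
    {
        /*  thread_info points at this thread's private accumulator: no
            locking is needed since no other thread ever touches it  */
        double *accumulator = (double *) thread_info;

        *accumulator += (double) (int) call_info1;  /*  toy workload  */
    }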
One of the more common traps in threads programming is the incorrect use of global data. In particular, where more than one thread (or job) may update data which is not private to that thread. Examples of this kind of data are global variables and static variables inside functions. Because a function may be running more than once at the same time, you cannot update static variables in these functions if they are called from within a job function. Of course, reading global or function static data is not a problem. The problem occurs when updates are made. Instead use automatic variables (which go on the stack) and the thread-private working space discussed above.
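The illustrative fragment below (the names are hypothetical) shows the trap: the static counter is shared by every concurrent invocation of the job function, whereas the automatic variable is private to each call.

    /*  Hedged sketch of the static-data trap (names are illustrative)  */
    static int bad_counter = 0;          /*  shared by ALL threads  */

    void job_func (void *pool_info, void *call_info1, void *call_info2,
                   void *call_info3, void *call_info4, void *thread_info)
    {
        int good_counter = 0;   /*  automatic: on this thread's stack  */

        /*  WRONG: two jobs may execute this read-modify-write at the
            same time, so updates can be silently lost  */
        ++bad_counter;

        /*  SAFE: each invocation has its own copy  */
        ++good_counter;
    }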
The simplest class of threaded algorithm is where the next iteration of the algorithm does not depend on any previous results. An example of this is where two arrays are added together element-wise and stored in a third array. This algorithm is the simplest to thread: simply replace the outermost iteration loop with a loop which launches each iteration as a job. Each job will be run on any available thread, making maximum use of the system. Absolutely no synchronisation is required between the threads. The first example program below shows exactly this transformation.
Another class of algorithm does require the previous result before the current value can be computed. At first glance, this algorithm cannot be threaded. However, if the algorithm is actually run many times with each run being independent of the other runs, the algorithm can be threaded. There is no global dependence on previous results, only a local dependence. An example of this is computing the Mandelbrot set. While the inner loop is iterative and cannot be threaded, computing the image values is a separable process. The Karma module <mandel> does this. Once again no synchronisation between threads is required.
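A sketch of this pattern is given below. This is not the actual <mandel> module: the escape-time formula, image layout and names are illustrative. One job is launched per image row, while the iterative inner loop stays serial inside each job.

    /*  Hedged sketch: one job per image row of an escape-time fractal  */
    #include <karma_mt.h>

    #define XLEN 512
    #define YLEN 512
    #define MAX_ITER 256

    static float image[YLEN][XLEN];

    EXTERN_FUNCTION (void row_job, (void *pool_info, void *call_info1,
                                    void *call_info2, void *call_info3,
                                    void *call_info4, void *thread_info) );

    main ()
    {
        int y;
        KThreadPool pool = mt_create_pool (NULL);

        /*  Rows are independent of each other, so no synchronisation  */
        for (y = 0; y < YLEN; ++y)
            mt_launch_job (pool, row_job, (void *) y, NULL, NULL, NULL);
        mt_wait_for_all_jobs (pool);
    }

    void row_job (void *pool_info, void *call_info1, void *call_info2,
                  void *call_info3, void *call_info4, void *thread_info)
    {
        int y = (int) call_info1;
        int x, iter;

        for (x = 0; x < XLEN; ++x)
        {
            /*  The inner loop is iterative (local dependence on the
                previous value of z) and must stay serial  */
            double cr = 3.5 * x / XLEN - 2.5;
            double ci = 2.0 * y / YLEN - 1.0;
            double zr = 0.0, zi = 0.0, tmp;

            for (iter = 0; iter < MAX_ITER && zr * zr + zi * zi < 4.0;
                 ++iter)
            {
                tmp = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = tmp;
            }
            /*  Each job writes only its own row: safe without locks  */
            image[y][x] = (float) iter;
        }
    }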
Yet another class of algorithm has both local and global dependence on previous results. An example of this is computing the maximum value in an array. It would appear this algorithm cannot be threaded, but this is not the case. The array can be divided into smaller chunks and the maximum of each chunk can be computed independently of the other chunks. Once all chunks have been processed, one simply has to determine the maximum of all the chunk maxima. This too requires no synchronisation between threads, but does require the result of each thread to be collated (reaped). Supporting this requires each thread to have a private working space which is initialised and then reaped once all threads have finished. The mt_new_thread_info and mt_get_thread_info routines provide this support. Alternatively, once the maximum is computed for a chunk (each job works on one chunk), this value can be merged into the global value. However, this requires some synchronisation between threads. When a thread wants to update the global value it needs to acquire exclusive access to it, modify it and then release it. This is termed ``locking'' a resource (here, the global maximum value). Provided the period the global variable is locked is small, contention between threads is kept to a minimum. The mt_setlock routine provides locking functionality. In general it is better to use the previous method of thread-private data and reaping the results; a sketch of the locking alternative is given below.
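In this sketch the calling convention of mt_setlock (taking TRUE to acquire the pool lock and FALSE to release it) is an assumption; check the mt package reference. Each job computes its chunk maximum on the stack with no locking at all, and only locks for the brief final merge.

    /*  Hedged sketch: chunked maximum with a short critical section.
        ASSUMPTION: mt_setlock (pool, TRUE) acquires the pool lock and
        mt_setlock (pool, FALSE) releases it.  */
    #include <karma_mt.h>

    #define ALEN 1000000

    static float array[ALEN];
    static float global_max;             /*  protected by the pool lock  */
    static KThreadPool pool = NULL;

    EXTERN_FUNCTION (void max_job, (void *pool_info, void *call_info1,
                                    void *call_info2, void *call_info3,
                                    void *call_info4, void *thread_info) );

    main ()
    {
        int start, chunk;

        /*  ... fill array[] here ...  */
        pool = mt_create_pool (NULL);
        global_max = array[0];
        /*  One chunk per thread, as in the second example program  */
        chunk = ALEN / mt_num_threads (pool);
        for (start = 0; start < ALEN; start += chunk)
        {
            if (start + chunk > ALEN) chunk = ALEN - start;
            mt_launch_job (pool, max_job, (void *) start, (void *) chunk,
                           NULL, NULL);
        }
        mt_wait_for_all_jobs (pool);
    }

    void max_job (void *pool_info, void *call_info1, void *call_info2,
                  void *call_info3, void *call_info4, void *thread_info)
    {
        int start = (int) call_info1;
        int chunk = (int) call_info2;
        int count;
        float chunk_max = array[start];

        /*  Compute the chunk maximum with no locking at all  */
        for (count = start + 1; count < start + chunk; ++count)
            if (array[count] > chunk_max) chunk_max = array[count];
        /*  Lock only for the brief merge into the global value  */
        mt_setlock (pool, TRUE);
        if (chunk_max > global_max) global_max = chunk_max;
        mt_setlock (pool, FALSE);
    }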
A simple example is given below. It shows how an operation may be easily parallelised. Note that while many modern parallelising compilers are able to parallelise simple nested loops, their efficiency is usually far less than this simple "hand tuned" code. The example creates a 2D ``Intelligent Array'' (1000x1000 floats) and writes the value 3.0 into every element. Each line of the array is filled in a separate job.
In real life one would, of course, use the iarray_fill routine instead.
    /*------------------------------------------------------*/
    /*  Multi-threading sample program                      */
    /*------------------------------------------------------*/

    #include <karma_iarray.h>
    #include <karma_mt.h>

    EXTERN_FUNCTION (void job_func, (void *pool_info, void *call_info1,
                                     void *call_info2, void *call_info3,
                                     void *call_info4, void *thread_info) );

    static iarray a;

    main ()
    {
        /*  Declare variables  */
        int y;
        static KThreadPool pool = NULL;

        /*  Create a thread pool for the application  */
        pool = mt_create_pool (NULL);
        /*  Create a 1000x1000 float 2D iarray  */
        a = iarray_create_2D (1000, 1000, K_FLOAT);
        /*  Fill array: one job per line  */
        for (y = 0; y < 1000; ++y)
            mt_launch_job (pool, job_func, (void *) y, NULL, NULL, NULL);
        mt_wait_for_all_jobs (pool);
    }

    void job_func (void *pool_info, void *call_info1, void *call_info2,
                   void *call_info3, void *call_info4, void *thread_info)
    {
        int y = (int) call_info1;
        int x;
        float val = 3.0;

        for (x = 0; x < 1000; ++x) F2 (a, y, x) = val;
    }
This example is similar to the one above, except that instead of processing one line per job, the work is divided by the number of concurrent threads available (i.e. the number of CPUs), so each job processes a block of lines. The reason for this is to reduce the (already small) overhead of launching jobs. Note that it assumes each job takes the same amount of time: if this is not the case, some threads will be idle part of the time and the work will take longer because of the poor load balancing. This can be improved by increasing the number of jobs (perhaps by a factor of 4), thus reducing the maximum time any thread spends idle. If the number of jobs is increased too much, the launching overheads become significant again, so a compromise is needed: hence the suggested factor of 4.
    /*------------------------------------------------------*/
    /*  Multi-threading sample program: more efficient      */
    /*------------------------------------------------------*/

    #include <karma_iarray.h>
    #include <karma_mt.h>

    EXTERN_FUNCTION (void job_func, (void *pool_info, void *call_info1,
                                     void *call_info2, void *call_info3,
                                     void *call_info4, void *thread_info) );

    static iarray a;

    main ()
    {
        /*  Declare variables  */
        int y, ystep;
        int ylen = 1000;
        static KThreadPool pool = NULL;

        /*  Create a thread pool for the application  */
        pool = mt_create_pool (NULL);
        /*  Create a 1000x1000 float 2D iarray  */
        a = iarray_create_2D (ylen, 1000, K_FLOAT);
        /*  Fill array: divide the work into one chunk per thread  */
        ystep = ylen / mt_num_threads (pool);
        for (y = 0; y < ylen; y += ystep)
        {
            if (y + ystep > ylen) ystep = ylen - y;
            mt_launch_job (pool, job_func, (void *) y, (void *) ystep,
                           NULL, NULL);
        }
        mt_wait_for_all_jobs (pool);
    }

    void job_func (void *pool_info, void *call_info1, void *call_info2,
                   void *call_info3, void *call_info4, void *thread_info)
    {
        int y = (int) call_info1;
        int ystep = (int) call_info2;
        int ycount;
        int x;
        float val = 3.0;

        for (ycount = 0; ycount < ystep; ++ycount, ++y)
        {
            for (x = 0; x < 1000; ++x) F2 (a, y, x) = val;
        }
    }