This file is intended as a "getting-started" guide to multi-threading (parallelism) support in the Karma library. In this document the term "threading" is equivalent to "parallelisation". Purpose: -------- Broadly speaking, there are two main approaches to making use of parallelism: 1) Divide the application into a number of smaller "jobs" and run each job on a separate CPU 2) Let the compiler parallelise the code for you The former approach is the more traditional, since, until recently, compiler technology did not generally support parallelism. It is the approach of "hand tuning". There is an increasing push to use compilers to parallelise code. However, while appealing to the application programmer, current attempts at automated parallelisation of code generally fall far behind the performance of "hand tuned" parallel code. The Karma library attempts to solve the first approach so that the application programmer spends the least time necessary in "hand tuning" code. It does this in two ways: 1) Provides a simple, efficient and uniform interface to threading libraries by implementing a "job launching" facility 2) Internal parallelisation of CPU intensive routines in the library Clearly, if an application makes heavy use of CPU intensive routines in the Karma library, the benefits gained from a multi-CPU platform are automatically gained without the need for any intervention by the application programmer. Of greater interest is the support provided for the application programmer to parallelise code. This is all provided by the package. The package allows the application programmer to create a pool of threads onto which jobs may be launched. Even if a platform does not support multi-threading, the same interface may still be used (without any significant performance penalty). Overview: --------- To make use of multi-threading, you first need to create a pool of threads onto which you can launch jobs. This is done by calling . You may then launch jobs onto the thread pool by calling . It is important to understand the distinction between threads and jobs: - Threads are not destroyed (unless their pool is explicitely destroyed by ). Threads sit idle until they are given a job. - Jobs are just that: a job that must be done. It corresponds to a function that is called by the thread it runs on. They last a short time and are run on the first available idle thread. The actual execution order of jobs is unpredictable. Typically an algorithm divides up the work to be done and has one job function which is launched many times, once for each part of the work. - You may launch many more (or less) jobs than there are threads. Often it is not convenient to divide the work into N pieces (where N is the number of threads), and may actually be less efficient (the last thread may end up doing more work than others). It is usually better to divide the work into far more jobs than there are threads. This ensures that all threads are kept busy (this is termed "load balancing"). You may require that each thread has a private working space (other than the stack) which is available to the jobs that run on that thread. This private data is "remembered" for all the jobs run on the thread, and is a convenient way to merge the results of different jobs together. One of the more common traps in threads programming is the incorrect use of global data. In particular, where more than one thread (or job) may update data which is not private to that thread. Examples of this kind of data are global variables and static variables inside functions. Because a function may be running more than once at the same time, you cannot update static variables in these functions if they are called from within a job function. Of course, reading global or function static data is not a problem. The problem occurs when updates are made. Instead use automatic variables (which go on the stack) and the thread-private working space discussed above. Classes of threaded algorithms: ------------------------------- The simplest class of threaded algorithm is where the next iteration of the algorithm does not depend on any previous results. An example of this is where two arrays are added together element-wise and stored in a third array. This algorithm is the simplest to thread: simply replace the outermost iteration loop with a loop which launches each iteration as a job. Each job will be run on any available thread, making maximum use of the system. Absolutely no synchonisation is required between the threads. Another class of algorithm does require the previous result before the current value can be computed. At first glance, this algorithm cannot be threaded. However, if the algorithm is actually run many times with each run being independent of the other runs, the algorithm can be threaded. There is no global dependence on previous results, there is only local dependence on previous results. An example of this is computing the Mandebrot set. While the inner loop is iterative and cannot be threaded, computing the image values is a seperable process. The Karma module does this. Once again no synchronisation between threads is required. Yet another class of algorithm has both local and global dependence on previous results. An example of this is computing the maximum value in an array. It would appear this algorithm cannot be threaded. This is not the case. The array can be divided into smaller chunks and the maximum of each chunk can be computed independently from other chunks. Once all chunks have been processed one simply has to determine the maximum of all the chunk maxima. This too requires no synchonisation between threads, but does require the result of each thread to be collated (reaped). To support this requires each thread to have a private working space which is initialised and then reaped once all threads have finished. The and routines provide this support. Alternatively, once the maximum is computed for a chunk (each job works on one chunk), this value can be merged into the global value. However this requires some synchronisation between threads. When a thread wants to update the global value it needs to aquire exclusive access the it, modify it and then release it. This is termed "locking" a resource (the global maximum value). Provided the period the global variable is locked is small, contention between threads is kept to a minumum. The routine provides locking functionality. In general it is better to use the previous method of thread-private data and reaping the results. A simple example is given below: Example 1: ---------- The following is a simple example program showing how an operation may be easily parallelised. Note that while many modern parallelising compiliers are able to parallelise simple nested loops, their efficiency is usually far less than this simple "hand tuned" code. The example creates a 2D "Intelligent Array" (1000x1000 floats) and writes the value 3.0 in the entire array. Each line of the array is filled in a separate job. In real life one would actually use the routine instead, of course. /*------------------------------------------------------*/ /* Multi-threading sample program */ /*------------------------------------------------------*/ #include #include EXTERN_FUNCTION (void job_func, (void *pool_info, void *call_info1, void *call_info2, void *call_info3, void *call_info4, void *thread_info) ); static iarray a; main () { /* Declare variables */ int y; static KThreadPool pool = NULL; /* Create a thread pool for the application */ pool = mt_create_pool (NULL); /* Create a 1000x1000 float 2D iarray */ a = iarray_create_2D (1000, 1000, K_FLOAT); /* Fill array */ for (y = 0; y < 1000; ++y) mt_launch_job (pool, job_func, (void *) y, NULL, NULL, NULL); mt_wait_for_all_jobs (pool); } void job_func (void *pool_info, void *call_info1, void *call_info2, void *call_info3, void *call_info4, void *thread_info) { int y = (int) call_info1; int x; float val = 3.0; for (x = 0; x < 1000; ++x) F2 (a, y, x) = val; } Example 2: ---------- This example is similar to the one above, except that instead of dividing the array into lines and processing one line per job, the job is divided by the number of concurrent threads available (i.e. the number of CPUs). Hence each job processes a number of lines. The reason behind this is to reduce the (already small) overhead of launching jobs. Note that this assumes that each job takes the same amount of time. If this is not the case, some threads will be idle part of the time, and the work will take longer because of the bad load balancing. This can be improved by increasing the number of jobs (perhaps by 4), hence reducing the maximum time any thread spends idle. If the number of jobs is increased too much the overheads may become significant, so one needs to make a compromise. Hence I suggest 4. /*------------------------------------------------------*/ /* Multi-threading sample program: more efficient */ /*------------------------------------------------------*/ #include #include EXTERN_FUNCTION (void job_func, (void *pool_info, void *call_info1, void *call_info2, void *call_info3, void *call_info4) ); static iarray a; main () { /* Declare variables */ int y, ystep; int ylen = 1000; static KThreadPool pool = NULL; /* Create a thread pool for the application */ pool = mt_create_pool (NULL); /* Create a 1000x1000 float 2D iarray */ a = iarray_create_2D (ylen, 1000, K_FLOAT); /* Fill array */ ystep = ylen / mt_num_threads (pool); for (y = 0; y < ylen; y += ystep) { if (y + ystep > ylen) ystep = ylen - y; mt_launch_job (pool, job_func, (void *) y, (void *) ystep, NULL, NULL); } mt_wait_for_all_jobs (pool); } void job_func (void *pool_info, void *call_info1, void *call_info2, void *call_info3, void *call_info4, void *thread_info) { int y = (int) call_info1; int ystep = (int) call_info2; int ycount; int x; float val = 3.0; for (ycount = 0; y_count < ystep; ++ycount, ++y) { for (x = 0; x < 1000; ++x) F2 (a, y, x) = val; } }