
Multilayered & Multithreading

Description

Parallel architectures with hardware support for thread-level parallelism, such as fine-grain multithreaded processors, simultaneous multithreaded (SMT) processors, multi-core designs, and layered designs (e.g., multi-SMT-core designs), are the current trend across the spectrum from personal computers and high-end servers to layered large-scale systems (supercomputers). On all of these platforms, thread-level parallel architectures are the natural building blocks. Unfortunately, software tools have been slow to catch up with the hardware, and programmers must exploit more than one level of parallelism simultaneously using primitive support exposed through very low-level system calls. In this project we focus on parallelization and optimization strategies for mesh generation methods on layered parallel architectures.
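As a concrete illustration of what exploiting two levels of parallelism looks like in practice, the following is a minimal, hypothetical skeleton (not PCDM source code) that combines message passing across nodes with threads within each node; the function refine_subdomain and the thread count are placeholders, and the coarse grain maps onto MPI processes while the finer grains map onto the threads spawned inside each process.

```cpp
// Hypothetical two-level skeleton: one MPI process per node (coarse grain)
// spawning worker threads within the node (medium/fine grain).
#include <mpi.h>
#include <thread>
#include <vector>
#include <cstdio>

// Placeholder for per-thread meshing work on a sub-domain.
static void refine_subdomain(int rank, int tid) {
    std::printf("rank %d, thread %d refining its sub-domain\n", rank, tid);
}

int main(int argc, char** argv) {
    int provided = 0;
    // Request full thread support so threads may also call MPI if needed.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int threads_per_process = 4;  // e.g., 2 cores x 2 SMT contexts (assumed)
    std::vector<std::thread> workers;
    for (int tid = 0; tid < threads_per_process; ++tid)
        workers.emplace_back(refine_subdomain, rank, tid);
    for (auto& t : workers) t.join();

    MPI_Finalize();
    return 0;
}
```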

Parallel mesh generation procedures decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The sub-problems can be formulated to be tightly coupled, partially coupled, or even decoupled. The coupling of the sub-problems (i.e., the degree of dependency) determines the intensity of the communication and synchronization between processing elements working on different sub-problems. The three most widely used techniques for parallel mesh generation are Delaunay, Advancing Front, and Edge Subdivision.
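To make the notion of coupling more tangible, here is a hedged sketch of a sub-problem representation in which the degree of coupling corresponds to the fraction of edges shared with neighboring sub-domains; the names SubDomain and coupling_ratio are illustrative assumptions, not part of PCDM.

```cpp
// Illustrative sub-problem layout: interface edges are the edges a
// sub-domain shares with its neighbors and therefore the source of
// communication/synchronization during parallel meshing.
#include <cstddef>
#include <vector>

struct Edge { int v0, v1; };

struct SubDomain {
    std::vector<Edge> interior_edges;   // touched only by this sub-problem
    std::vector<Edge> interface_edges;  // shared with neighboring sub-problems
};

// A rough coupling measure: the larger the share of interface edges,
// the more communication meshing this sub-domain implies.
double coupling_ratio(const SubDomain& s) {
    std::size_t total = s.interior_edges.size() + s.interface_edges.size();
    return total == 0 ? 0.0
                      : static_cast<double>(s.interface_edges.size()) / total;
}
```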

In this project, we use the Delaunay technique because it can mathematically guarantee the quality of the mesh. More specifically, we focus on Constrained Delaunay meshing. The sequential execution time of our parallel implementation (PCDM) is comparable to that of the best known sequential implementation. At the same time, PCDM exploits concurrency at three levels of granularity: (i) coarse grain at the sub-domain level, (ii) medium grain at the cavity level, and (iii) fine grain at the element level. The figure below depicts the three levels of granularity, one per level of parallelization.

|Coarse grain

In the figure above, in the coarse-grain (left) parallel implementation, the domain is decomposed into N sub-domains, with N much larger than the number of processors P, and N/P sub-domains are mapped to each processor using METIS in such a way that the ratio of interfaces to area is minimized, i.e., affinity is improved by assigning neighboring sub-domains to the same thread, core, or processor. In the medium-grain parallel implementation (center), multiple cavities are expanded concurrently by multiple threads, one thread per cavity. Each thread expands the cavity of a bad-quality triangle. As soon as a cavity has been computed, its triangles are deleted and the cavity is re-triangulated. To preserve the conformity of the mesh, the algorithm has to ensure that there are no conflicts between concurrently expanded cavities; in other words, concurrently expanded cavities are not allowed to share triangles. Finally, in the fine-grain (right) parallel implementation, multiple threads work together on a single cavity expansion, so the degree of parallelism is limited to two in two dimensions and four in three dimensions.
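The following is a minimal, self-contained sketch of a single cavity expansion in two dimensions, which is the unit of work assigned to one thread at the medium-grain level and split among threads at the fine-grain level. It uses a toy mesh representation and the standard in-circumcircle determinant; none of the type or function names come from the PCDM code.

```cpp
// Sketch of one cavity expansion: collect every triangle whose
// circumcircle contains the new point p, starting from a seed triangle.
#include <array>
#include <queue>
#include <unordered_set>
#include <vector>

struct Point { double x, y; };

// Toy mesh: each triangle stores three vertex indices and up to three
// neighboring triangle indices (-1 on the boundary).
struct Triangle {
    std::array<int, 3> v;
    std::array<int, 3> nbr;
};
struct Mesh {
    std::vector<Point> pts;
    std::vector<Triangle> tris;
};

// In-circumcircle test via the standard 3x3 determinant
// (assumes counter-clockwise triangle orientation).
bool in_circumcircle(const Mesh& m, int t, const Point& p) {
    const Triangle& tr = m.tris[t];
    double mat[3][3];
    for (int i = 0; i < 3; ++i) {
        double dx = m.pts[tr.v[i]].x - p.x;
        double dy = m.pts[tr.v[i]].y - p.y;
        mat[i][0] = dx; mat[i][1] = dy; mat[i][2] = dx * dx + dy * dy;
    }
    double det = mat[0][0] * (mat[1][1] * mat[2][2] - mat[1][2] * mat[2][1])
               - mat[0][1] * (mat[1][0] * mat[2][2] - mat[1][2] * mat[2][0])
               + mat[0][2] * (mat[1][0] * mat[2][1] - mat[1][1] * mat[2][0]);
    return det > 0.0;
}

// Breadth-first expansion of the cavity of point p, seeded at seed_tri.
// The returned triangles are the ones to be deleted and re-triangulated.
std::vector<int> expand_cavity(const Mesh& m, int seed_tri, const Point& p) {
    std::vector<int> cavity;
    std::unordered_set<int> visited{seed_tri};
    std::queue<int> frontier;
    frontier.push(seed_tri);
    while (!frontier.empty()) {
        int t = frontier.front(); frontier.pop();
        if (!in_circumcircle(m, t, p)) continue;  // outside: stop the walk here
        cavity.push_back(t);
        for (int nb : m.tris[t].nbr)
            if (nb >= 0 && visited.insert(nb).second) frontier.push(nb);
    }
    return cavity;
}
```

In the medium-grain level each thread runs one such expansion on its own bad-quality triangle; in the fine-grain level the neighbor traversal itself is split among two (2D) or four (3D) threads.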

Results

In our experimental evaluation of the multi-layered and multithreaded approaches we used two representative architectures: (1) a Dell PowerEdge 6650 server with four Hyper-Threaded Intel processors, each providing two execution contexts; and (2) an IBM OpenPower 720 node with two IBM Power5 processors, each a dual-core processor with 2-way SMT cores. Although our experimental platforms are small-scale shared-memory machines, they still provide valuable insight into scalability trends on systems with SMT and multicore components, as well as into the implications of multiprocessor memory allocators for both multithreaded codes and multi-process MPI codes. Our results are also relevant for clusters of small shared-memory systems, which remain a popular hardware platform for high-end computing.

|Speedups

The figure above shows the fixed and scaled speedups of the coarse-grain and the coarse+medium-grain PCDM+PODM on a cluster of 4 IBM OpenPower 720 nodes. The speedups have been calculated using as a reference either the single-thread execution time of PCDM or the execution time of the best known sequential mesher (Triangle). We present coarse-grain PCDM results using either one MPI process per core (Coarse) or one MPI process per SMT execution context (Coarse (2/core)). 60M triangles are created in the fixed-problem-size experiments; 15M triangles correspond to each processor core in the scaled-problem-size experiments. The table below summarizes the corresponding execution times (in sec).

|Executiontimes1
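For reference, these are the usual speedup definitions underlying the plots, written out as a hedged sketch: the fixed speedup keeps the problem size constant, while the scaled speedup grows the problem with the core count (here, 15M triangles per core). The exact scaled-speedup formulation used in the publications may differ.

```latex
\[
  S_{\mathrm{fixed}}(P) = \frac{T_{\mathrm{ref}}(N)}{T_{P}(N)},
  \qquad
  S_{\mathrm{scaled}}(P) \approx \frac{P \cdot T_{\mathrm{ref}}(n)}{T_{P}(P \cdot n)}
\]
% N   = fixed problem size (60M triangles)
% n   = per-core problem size (15M triangles)
% T_ref = reference sequential time (single-thread PCDM or Triangle)
% T_P   = execution time on P cores
```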

The main implementation challenge of the medium-grain method is the detection of and recovery from conflicts (in the case of PODM), which requires synchronization among threads working on the same problem (domain). Although one can employ low-overhead, non-blocking atomic operations, the frequency of these operations is extremely high (approximately once every 500 nsec). Atomic operations impede memory accesses from other threads running on the same node, which results in performance penalties when more than four threads execute on each node (two cores, two SMT contexts per core); a sketch of such a claim-and-rollback scheme appears after the table below. The figure below shows the fixed and scaled speedups of the coarse+fine-grain PCDM on a cluster of 4 IBM OpenPower 720 nodes. We again report the speedups of single-level, coarse-grain PCDM with 1 MPI process per core (Coarse) as a basis for comparison. The speedups have been calculated using as a reference either the single-thread execution time of PCDM or the execution time of the best known sequential mesher (Triangle). 60M triangles are created in the fixed-problem-size experiments; 15M triangles correspond to each processor core in the scaled-problem-size experiments. The table summarizes the corresponding execution times (in sec).

|Executiontimes2
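The conflict detection discussed above can be sketched as follows, assuming a per-triangle atomic owner tag: a thread claims every triangle of its cavity with a compare-and-swap, and a failed claim signals a conflict and triggers a rollback of the claims already taken. This is an illustrative sketch under that assumption; the names (TriangleTag, claim_cavity) are not from PCDM or PODM.

```cpp
// Claim-and-rollback conflict detection with non-blocking atomics.
#include <atomic>
#include <cstddef>
#include <vector>

constexpr int kFree = -1;

struct TriangleTag {
    std::atomic<int> owner{kFree};   // id of the thread holding this triangle
};

// Try to claim all triangles in `cavity` for thread `tid`.
// Returns true on success; on conflict, releases what was already taken.
bool claim_cavity(std::vector<TriangleTag>& tags,
                  const std::vector<int>& cavity, int tid) {
    for (std::size_t i = 0; i < cavity.size(); ++i) {
        int expected = kFree;
        if (!tags[cavity[i]].owner.compare_exchange_strong(expected, tid)) {
            // Conflict: another thread owns this triangle. Roll back.
            for (std::size_t j = 0; j < i; ++j)
                tags[cavity[j]].owner.store(kFree);
            return false;
        }
    }
    return true;
}

// After a successful re-triangulation the thread releases its claims.
void release_cavity(std::vector<TriangleTag>& tags,
                    const std::vector<int>& cavity) {
    for (int t : cavity) tags[t].owner.store(kFree);
}
```

This per-triangle compare-and-swap is exactly the kind of frequent atomic traffic that, as noted above, becomes a bottleneck once more than four threads share a node.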

Related Publications

  1. Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures. Andrey Chernikov, Christos Antonopoulos, Nikos Chrisochoides, Scott Schneider, and Dimitrios Nikolopoulos. In Proceedings of the 10th ISGG Conference on Numerical Grid Generation, October 2007. [PDF] [BibTex]
  2. A Multigrain Delaunay Mesh Generation Method for Multicore SMT-based Architectures. C. Antonopoulos, F. Blagojevic, A. Chernikov, N. Chrisochoides, and D. Nikolopoulos. Journal of Parallel and Distributed Computing, August 2009 (in press). [PDF] [BibTex]
  3. Algorithm, Software and Hardware Optimizations for Delaunay Mesh Generation on Simultaneous Multithreaded Architectures. C. Antonopoulos, F. Blagojevic, A. Chernikov, N. Chrisochoides, and D. Nikolopoulos. Journal of Parallel and Distributed Computing, August 2009 (in press). [PDF] [BibTex]
  4. Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures. Christos Antonopoulos, Xiaoning Ding, Andrey Chernikov, Filip Blagojevic, Nikos Chrisochoides, and Dimitrios Nikolopoulos. In Proceedings of the 19th ACM International Conference on Supercomputing (ICS 2005), pp. 367-376, Cambridge, Massachusetts, June 2005. [PDF] [BibTex]