Runtime Systems

Description

Distributed and parallel computing is becoming ubiquitous in solving real-world problems of high complexity and computational demand. Our goal is to facilitate the use of parallel and distributed computing, and to apply the enormous computational power readily available to life-critical, time-constrained applications.

We work toward this goal by designing new parallel algorithms, developing efficient implementations, testing them on a variety of modern platforms, applying these implementations in real applications, and using application feedback to improve our design and code.

Results

  • A software environment, PREMA (Parallel Runtime Environment for Multicomputer Applications), for the easy development of distributed applications has been designed and implemented by Dr. Kevin Barker. PREMA makes the development of a certain class of distributed applications easy: with its approach to distributing work among computing nodes, parallel programming becomes intuitive. While greatly reducing the programming effort and introducing only small computational and communication overheads, PREMA transparently solves the problem of balancing the computational load! (A toy sketch of this message-driven style appears first after this list.)
  • The extremely challenging problem of parallelizing Delaunay mesh refinement has been solved by Dr. Andrey Chernikov. By exploring the properties of point insertion during the refinement process and carefully scheduling the insertion of new points, it is now possible to distribute the data and refine mesh sub-areas simultaneously. This result makes it possible to generate guaranteed-quality Delaunay meshes in two and three dimensions with exceptional parallel speed, and to employ the distributed memory of computational clusters to create meshes with billions of elements in-core! (The second sketch after this list illustrates the scheduling idea.)
  • Yet another unique approach to parallelizing Delaunay meshing, with zero communication between computing nodes, has recently been introduced by Dr. Leonidas Linardakis. By approaching the problem from a different angle, Dr. Linardakis identified the necessary conditions and designed a robust software package that partitions a two-dimensional geometric domain of arbitrary complexity into disjoint subdomains, which can then be meshed independently. This approach shows superlinear speedups and takes advantage of existing code by reusing state-of-the-art sequential mesh generators! (See the third sketch after this list.)
  • Applications like intraoperative non-rigid registration during neurosurgery demand near-real-time delivery of results to the surgeon. Recently, our group designed and implemented a software framework for the easy and efficient use of distributed computing resources, located remotely from the operating room in different administrative domains, for the timely computation of non-rigid registration.
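
As a concrete illustration of the programming style PREMA encourages, here is a toy, single-process emulation of a message-driven scheduler: work is expressed as handler invocations that are queued rather than called directly, and idle workers pick up whatever is pending, so load balancing falls out of the model. The API shape and all names here are illustrative assumptions, not PREMA's actual interface.

```cpp
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class Scheduler {
public:
    explicit Scheduler(int nworkers) {
        for (int i = 0; i < nworkers; ++i)
            workers_.emplace_back([this] { run(); });
    }
    // "Remote" handler invocation: enqueue work instead of calling directly.
    void invoke(std::function<void()> handler) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(handler)); }
        cv_.notify_one();
    }
    void shutdown() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (std::thread& w : workers_) w.join();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;        // done and fully drained
                task = std::move(q_.front()); q_.pop();
            }
            task();                            // execute outside the lock
        }
    }
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::vector<std::thread> workers_;
};

int main() {
    Scheduler sched(4);                  // 4 workers stand in for 4 nodes
    for (int sub = 0; sub < 16; ++sub)   // over-decomposed work units
        sched.invoke([sub] { std::printf("refining sub-domain %d\n", sub); });
    sched.shutdown();
}
```

A real PREMA application would replace the in-process queue with messages to objects distributed across cluster nodes; the point of the sketch is only that the application posts handlers and never decides where or when they run.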
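
The scheduling idea behind the parallel Delaunay refinement result can be illustrated with a second sketch: candidate Steiner points whose re-triangulation cavities provably cannot overlap may be inserted concurrently without synchronization. Following the spirit of the publications, independence is established with a sufficient-distance test; the constant K below is a placeholder assumption, not the bound derived in the actual papers.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Greedily pick candidates that are pairwise "far enough" apart, so that
// their re-triangulation cavities cannot overlap and the points can be
// inserted by different processors with no synchronization. K * rmax is a
// stand-in for the provable cavity-extent bound from the publications.
std::vector<Point> independent_batch(const std::vector<Point>& candidates,
                                     double rmax /* circumradius bound */) {
    const double K = 4.0;   // illustrative safety factor, NOT the real bound
    std::vector<Point> batch;
    for (const Point& p : candidates) {
        bool clear = true;
        for (const Point& q : batch)
            if (dist(p, q) < K * rmax) { clear = false; break; }
        if (clear) batch.push_back(p);
    }
    return batch;
}

int main() {
    std::vector<Point> cand = {{0, 0}, {0.1, 0}, {5, 5}, {9, 1}, {5.2, 5.1}};
    std::vector<Point> batch = independent_batch(cand, /*rmax=*/1.0);
    std::printf("%zu of %zu candidates can be inserted concurrently\n",
                batch.size(), cand.size());
}
```

In the real algorithm the independence guarantee comes from a proven relation between point distance and cavity extent, and candidate batches are generated systematically (e.g., over a quadtree/octree decomposition) rather than greedily.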
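
Finally, the decoupled approach reduces, after the decomposition step, to an embarrassingly parallel loop: each subdomain is handed to an unmodified sequential mesher and no messages are exchanged. The third sketch below is a minimal illustration of that driver; `mesh_subdomain` is a hypothetical stand-in for a call into a state-of-the-art sequential mesher such as Triangle.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Subdomain { int id; /* geometry omitted */ };

// Placeholder for handing one subdomain to an unmodified sequential mesher
// (e.g., Triangle). No messages, no locks, no shared state: the decoupling
// was established up front by the domain decomposer.
void mesh_subdomain(Subdomain s) {
    std::printf("meshing subdomain %d independently\n", s.id);
}

int main() {
    // Subdomains as produced by the decomposition step (toy stand-ins).
    std::vector<Subdomain> parts = {{0}, {1}, {2}, {3}};
    std::vector<std::thread> pool;
    for (const Subdomain& s : parts)
        pool.emplace_back(mesh_subdomain, s);   // zero communication
    for (std::thread& t : pool) t.join();
}
```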

|Distributed computing approach

|PREMA approach to parallelizing mesh generation

Software

The following software is used in this project.

  • PREMA
  • Clam
  • PDR
  • meDDec

Multilayered & Multithreaded

Description

Parallel architectures with hardware support for thread-level parallelism, such as fine-grain multithreaded processors, simultaneous multithreaded (SMT) processors, multi-core designs, and layered designs (e.g., multi-SMT-core designs), are the current trend everywhere from personal computers and high-end servers to large-scale systems (supercomputers). In all of these platforms, thread-level parallel architectures are the natural building blocks. Unfortunately, software tools have been very slow to catch up with the hardware, and the programmer has to exploit more than one level of parallelism simultaneously using primitive support made available through very low-level system calls. In this project we focus on parallelization and optimization strategies for mesh generation methods on layered parallel architectures.

Parallel mesh generation procedures decompose the original mesh generation problem into smaller sub-problems that can be solved (meshed) in parallel. The sub-problems can be formulated to be tightly coupled, partially coupled, or even decoupled. The coupling of the sub-problems (i.e., the degree of dependency) determines the intensity of the communication and synchronization between processing elements working on different sub-problems. The three most widely used techniques for parallel mesh generation are Delaunay, Advancing Front, and Edge Subdivision.

In this project, we use the Delaunay technique because it can mathematically guarantee the quality of the mesh. More specifically, we focus on Constrained Delaunay meshing. The single-threaded execution time of our parallel implementation (PCDM) is comparable to that of the best known sequential implementation. At the same time, PCDM exploits concurrency at three levels of granularity: (i) coarse-grain at the sub-domain level, (ii) medium-grain at the cavity level, and (iii) fine-grain at the element level. The figure below depicts all three levels of granularity.

|Coarse, medium, and fine grain parallelism (left to right)

In the figure above, in the coarse-grain parallel implementation (left), the domain is decomposed into N sub-domains, with N typically much larger than the number of processors P. Using METIS, N/P sub-domains are mapped to each processor in a way that minimizes the ratio of interfaces to area, i.e., improves affinity by assigning neighboring sub-domains to a single thread, core, or processor.

In the medium-grain parallel implementation (center), multiple cavities are expanded concurrently by multiple threads, one thread per cavity. Each thread expands the cavity of a bad-quality triangle. As soon as a cavity has been calculated, its triangles are deleted and the cavity is re-triangulated. In order to preserve the conformity of the mesh, the algorithm has to ensure that there are no conflicts between concurrently expanded cavities; in other words, concurrently expanded cavities are not allowed to share triangles.

Finally, in the fine-grain parallel implementation (right), multiple threads work together on a single cavity expansion, and thus the degree of parallelism is limited to two in two dimensions and four in three dimensions.
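
To make the coarse-grain mapping step concrete, the sketch below shows how one might drive METIS for this purpose, assuming METIS 5's `METIS_PartGraphKway`. Sub-domains are graph vertices and shared interfaces are edges; weighting edges by interface length and vertices by area approximates the interface-to-area objective described above. The toy graph and weights are made-up inputs, and the project's actual weighting scheme may differ.

```cpp
#include <metis.h>
#include <cstdio>

int main() {
    // Toy example: 4 sub-domains in a row, 0-1-2-3, in CSR adjacency form.
    idx_t nvtxs = 4, ncon = 1;
    idx_t xadj[]   = {0, 1, 3, 5, 6};
    idx_t adjncy[] = {1, 0, 2, 1, 3, 2};
    idx_t vwgt[]   = {10, 10, 10, 10};     // sub-domain "area"
    idx_t adjwgt[] = {5, 5, 7, 7, 3, 3};   // interface "length" per edge
    idx_t nparts = 2, objval = 0;
    idx_t part[4];

    int rc = METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                                 vwgt, /*vsize=*/nullptr, adjwgt,
                                 &nparts, /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                                 /*options=*/nullptr, &objval, part);
    if (rc != METIS_OK) return 1;
    for (int i = 0; i < 4; ++i)
        std::printf("sub-domain %d -> processor %d\n", i, (int)part[i]);
    // objval is the weighted edge-cut: total interface crossing processors.
    std::printf("interface crossing processors: %d\n", (int)objval);
}
```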

Results

In our experimental evaluation of the multi-layered and multithreaded approaches, we used two representative architectures: (1) a Dell PowerEdge 6650 server with 4 Hyperthreaded Intel processors, each with two execution contexts; and (2) an IBM OpenPower 720 node with 2 IBM Power5 processors, each a dual-core processor with 2-way SMT cores. Although our experimental platforms are small-scale shared-memory machines, they still provide valuable insight into scalability trends on systems with SMT and multicore components, as well as into the implications of multiprocessor memory allocators for both multithreaded codes and multi-process MPI codes. Our results are also relevant to clusters of small shared-memory systems, still quite a popular hardware platform for high-end computing.

|Speedups

The figure above shows the fixed and scaled speedups of the coarse-grain and the coarse+medium-grain PCDM+PODM on a cluster of 4 IBM OpenPower 720 nodes. The speedups have been calculated using as a reference either the single-thread execution time of PCDM or the execution time of the best known sequential mesher (Triangle). We present coarse-grain PCDM results using either one MPI process per core (Coarse) or one MPI process per SMT execution context (Coarse (2/core)). In the fixed problem size experiments, 60M triangles are created; in the scaled problem size experiments, 15M triangles correspond to each processor core. The table below summarizes the corresponding execution times (in seconds).
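
For clarity, the speedups reported here and below follow the usual definition, sketched in this trivial helper; the times are placeholders for illustration, not the measured values from the tables.

```cpp
#include <cstdio>

// Speedup as used in this section: reference execution time divided by
// parallel execution time. The reference is either single-thread PCDM or
// the best sequential mesher (Triangle) on the same problem size.
double speedup(double t_reference, double t_parallel) {
    return t_reference / t_parallel;
}

int main() {
    // Hypothetical numbers for illustration only.
    double t_triangle = 100.0;  // sequential (Triangle) time, sec
    double t_pcdm16   = 8.0;    // PCDM time on 16 cores, sec
    std::printf("speedup vs Triangle: %.1fx\n", speedup(t_triangle, t_pcdm16));
}
```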

|Execution times 1

The main implementation challenge of the medium-grain method is the detection of, and recovery from, conflicts (in the case of PODM), which requires synchronization among threads working on the same problem (domain). Although one can employ low-overhead, non-blocking atomic operations, the frequency of these operations is extremely high: approximately once every 500 nsec. Atomic operations impede memory accesses from other threads running on the same node, which results in performance penalties when more than four threads execute on each node (two cores, two SMT contexts/core). The figure below shows the fixed and scaled speedups of the coarse+fine-grain PCDM on a cluster of 4 IBM OpenPower 720 nodes. We again report the speedups of single-level, coarse-grain PCDM with 1 MPI process per core (Coarse) as a basis for comparison. The speedups have been calculated using as a reference either the single-thread execution time of PCDM or the execution time of the best known sequential mesher (Triangle). In the fixed problem size experiments, 60M triangles are created; in the scaled problem size experiments, 15M triangles correspond to each processor core. The table summarizes the corresponding execution times (in seconds).
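
A minimal sketch of this non-blocking conflict detection, using C++11 atomics as a stand-in for the actual PCDM/PODM machinery: each triangle carries an atomic owner tag, a thread must claim every triangle in its cavity via compare-and-swap before re-triangulating, and on a conflict it rolls back its claims and retries. The data structures are simplified assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Simplified stand-in for a mesh triangle: only the atomic owner tag that
// the conflict-detection protocol needs. -1 means "unclaimed".
struct Triangle {
    std::atomic<int> owner{-1};
};

// Try to claim every triangle of `cavity` for thread `tid`, one CAS per
// triangle. On the first failed CAS (another thread got there first),
// roll back all claims made so far and report the conflict.
bool claim_cavity(const std::vector<Triangle*>& cavity, int tid) {
    for (std::size_t i = 0; i < cavity.size(); ++i) {
        int expected = -1;
        if (!cavity[i]->owner.compare_exchange_strong(expected, tid)) {
            for (std::size_t j = 0; j < i; ++j)
                cavity[j]->owner.store(-1);   // conflict: undo and retry later
            return false;
        }
    }
    return true;   // cavity is exclusively ours; safe to re-triangulate
}

void release_cavity(const std::vector<Triangle*>& cavity) {
    for (Triangle* t : cavity) t->owner.store(-1);
}

int main() {
    Triangle a, b;
    std::vector<Triangle*> cavity = {&a, &b};
    if (claim_cavity(cavity, /*tid=*/0))
        release_cavity(cavity);   // re-triangulation would happen in between
    return 0;
}
```

Each CAS is cheap in isolation, but at one cavity expansion roughly every 500 nsec the aggregate atomic traffic throttles the memory system, consistent with the penalties observed beyond four threads per node. A production version would also need to order or randomize claims to avoid livelock between threads contending for overlapping cavities.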

|Execution times 2

Related Publications

  1. Experience with Memory Allocators for Parallel Mesh Generation on Multicore Architectures. Andrey Chernikov, Christos Antonopoulos, Nikos Chrisochoides, Scott Schneider, and Dimitrios Nikolopoulos. In Proceedings of the 10th ISGG Conference on Numerical Grid Generation, October 2007. [PDF] [BibTex]
  2. A Multigrain Delaunay Mesh Generation Method for Multicore SMT-based Architectures. C. Antonopoulos, F. Blagojevic, A. Chernikov, N. Chrisochoides, and D. Nikolopoulos. Journal of Parallel and Distributed Computing, August 2009 (in press). [PDF] [BibTex]
  3. Algorithm, Software, and Hardware Optimizations for Delaunay Mesh Generation on Simultaneous Multithreaded Architectures. C. Antonopoulos, F. Blagojevic, A. Chernikov, N. Chrisochoides, and D. Nikolopoulos. Journal of Parallel and Distributed Computing, August 2009 (in press). [PDF] [BibTex]
  4. Multigrain Parallel Delaunay Mesh Generation: Challenges and Opportunities for Multithreaded Architectures. Christos Antonopoulos, Xiaoning Ding, Andrey Chernikov, Filip Blagojevic, Nikos Chrisochoides, and Dimitrios Nikolopoulos. In Proceedings of the 19th ACM International Conference on Supercomputing (ICS 2005), pp. 367-376, Cambridge, Massachusetts, June 2005. [PDF] [BibTex]