PREMA SW4
In high-performance computing (HPC), proxy applications ("proxy apps") are small, simplified codes that allow application developers to share important features of large applications without forcing collaborators to assimilate large and complex code bases. The Exascale Proxy Applications Project, part of the Exascale Computing Project (ECP), gives developers a place to explore and share proxy apps.
Proxy apps are often used as models for performance-critical computations, but they can do more than represent the algorithms or computational characteristics of applications: they also capture programming methods and styles that drive requirements for compilers and other elements of the toolchain. Within ECP, application teams, co-design centers, software technology projects, and vendors all plan to use proxy apps as a major mechanism for driving collaboration and co-designing solutions to exascale challenges.
A major goal of the Exascale Proxy Applications Project is to improve the quality of the proxies ECP creates and to maximize the benefit gained from their use. To that end, the project has assembled an ECP proxy app suite: proxies developed by ECP projects that capture the most important features (especially performance characteristics) of exascale applications. To keep the quality of these proxies high, the project has also defined standards for documentation, build and test systems, performance models and evaluations, and more.
One such proxy application is SW4lite, a bare-bones version of SW4 intended for testing performance optimizations in a few of SW4's key numerical kernels. SW4 implements substantial capabilities for 3-D seismic modeling, with a free-surface condition on the top boundary, absorbing super-grid conditions on the far-field boundaries, and an arbitrary number of point-force and/or point-moment-tensor source terms. Each source time function can have one of many predefined analytical time dependencies or interpolate a user-defined discrete time series.
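As a concrete illustration of such an analytical time dependency, the sketch below implements the classic Ricker wavelet in plain Python. This is only an example of the kind of predefined time function a seismic code provides; the function name and the default center frequency f0 and offset t0 are ours, chosen for illustration, so consult the SW4 user guide for the actual set of predefined time functions and their parameters.

```python
import math

def ricker(t, f0=1.0, t0=1.0):
    """Ricker wavelet, a standard analytical source time function.

    f0 is the center frequency and t0 the time shift; both values
    here are illustrative defaults, not SW4's.
    """
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

# Sample the wavelet on a coarse time grid and print the first few values.
for i in range(5):
    t = 0.1 * i
    print(f"t={t:.1f}  g(t)={ricker(t):+.4f}")
```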
== MPI Times ==
== PREMA Times ==
Allocation: 300 cores in different configurations
Pure MPI Time: 182.32 s
Cluster: Wahab
Nodes:
d4-w6420b-[07-12], e1-w6420b-20, e2-w6420b-[02-04,06,08,17], e3-w6420b-[09-12,17,20]
(Times are in seconds; the decimal message and handler counts appear to be averages.)

#Nodes | #Cores | Total Time | Recv | Send | #Msgs Sent | #Msgs Recvd | App Handlers | Creating Work | P2P-MP | Handlers Executed | LB | ILB | Yieldables | Blocked | Steal | Steal_Succ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
300 | 1 | 170.65 | 0.66 | 0.65 | 12184.75 | 20892.91 | 135.64 | 147.99 | 2.99 | 12126.90 | 11.27 | 0.0 | 35.43 | 103.84 | 1.54 | 0.0 |
150 | 2 | 175.58 | 0.87 | 0.94 | 17895.50 | 30110.82 | 139.69 | 140.11 | 34.10 | 12126.90 | 0.68 | 0.0 | 66.44 | 73.16 | 0.32 | 0.25 |
100 | 3 | 176.58 | 1.08 | 1.28 | 24305.74 | 40606.23 | 143.09 | 143.49 | 31.79 | 12126.90 | 0.68 | 0.0 | 76.11 | 66.93 | 0.27 | 0.20 |
75 | 4 | 176.22 | 1.16 | 1.48 | 29436.79 | 48996.64 | 143.51 | 143.92 | 30.98 | 12126.90 | 0.69 | 0.0 | 75.51 | 67.91 | 0.32 | 0.24 |
60 | 5 | 178.04 | 1.29 | 1.71 | 35129.87 | 58484.55 | 138.08 | 138.57 | 37.84 | 12126.90 | 0.85 | 0.0 | 68.08 | 69.91 | 0.37 | 0.27 |
50 | 6 | 179.93 | 1.49 | 2.10 | 42224.16 | 70137.46 | 141.62 | 142.09 | 36.24 | 12126.90 | 0.83 | 0.0 | 71.45 | 70.12 | 0.32 | 0.21 |
30 | 10 | 187.78 | 1.67 | 3.07 | 64217.67 | 106014.10 | 145.65 | 146.21 | 39.85 | 12126.90 | 0.91 | 0.0 | 73.14 | 72.40 | 0.47 | 0.31 |
25 | 12 | 192.23 | 1.93 | 3.91 | 78517.24 | 129349.92 | 146.19 | 146.82 | 43.58 | 12126.90 | 1.03 | 0.0 | 71.80 | 74.22 | 0.56 | 0.37 |
20 | 15 | 193.54 | 1.93 | 4.53 | 96706.40 | 158971.15 | 143.63 | 144.33 | 47.32 | 12126.90 | 1.12 | 0.0 | 69.72 | 73.74 | 0.60 | 0.38 |
15 | 20 | 195.95 | 2.02 | 5.75 | 123265.27 | 201163.20 | 149.65 | 150.40 | 43.71 | 12126.90 | 1.08 | 0.0 | 72.42 | 76.95 | 0.79 | 0.50 |
12 | 25 | 210.77 | 1.89 | 6.07 | 126917.50 | 205646.75 | 155.94 | 156.87 | 51.74 | 12126.90 | 1.30 | 0.0 | 73.67 | 81.81 | 1.13 | 0.73 |
10 | 30 | 222.58 | 1.74 | 6.01 | 123972.0 | 199357.30 | 160.44 | 161.57 | 58.68 | 12126.90 | 1.48 | 0.0 | 76.66 | 83.16 | 1.45 | 0.91 |
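For a quick read on what this table says, here is a minimal sketch (plain Python, values copied from this page) comparing the fastest PREMA configuration against the pure MPI baseline:

```python
# Values from the Wahab 300-core results above.
mpi_time = 182.32    # seconds, pure MPI baseline
prema_time = 170.65  # seconds, best PREMA run (300 x 1)

speedup = mpi_time / prema_time
print(f"speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster than MPI)")
# -> speedup: 1.068x (6.8% faster than MPI)
```

Note also that total time degrades steadily as cores per rank increase, from 170.65 s at 300x1 up to 222.58 s at 10x30.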
Allocation: 1000 cores, without a dedicated thread for MPI; 1 MPI rank per socket (4 MPI ranks per node)
Pure MPI Time: 69.8 s
#Nodes | #Cores | Total Time | Recv | Send | #Msgs Sent | #Msgs Recvd | App Handlers | Creating Work | P2P-MP | Handlers Executed | LB | ILB | Yieldables | Blocked | Steal | Steal_Succ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100 | 10 | 65.69 | 0.75 | 2.47 | 68044.18 | 111721.23 | 49.68 | 52.72 | 6.27 | 12459.53 | 4.55 | 0.0 | 21.10 | 26.85 | 3.58 | 2.45 |
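By the same calculation as above, the 1000-core PREMA run (65.69 s) comes in about 6% faster than its pure MPI baseline of 69.8 s.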
Cluster: Turing
Allocation: 310 cores (due to issues with the tools on Turing)
Pure MPI Time: 268.56 s
#Nodes | #Cores | Total Time | Recv | Send | #Msgs Sent | #Msgs Recvd | App Handlers | Creating Work | P2P-MP | Handlers Executed | LB | ILB | Yieldables | Blocked | Steal | Steal_Succ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 31 | 276.08 | 34.63 | 51.60 | 186837.90 | 300642.30 | 208.39 | 212.43 | 59.56 | 12026.52 | 2.06 | 0.0 | 106.82 | 100.66 | 3.34 | 2.39 |
Allocation: 496 cores
#Nodes | #Cores | Total Time | Recv | Send | #Msgs Sent | #Msgs Recvd | App Handlers | Creating Work | P2P-MP | Handlers Executed | LB | ILB | Yieldables | Blocked | Steal | Steal_Succ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | 31 | 182.90 | 33.29 | 47.52 | 192765.06 | 313180.31 | 135.00 | 137.86 | 40.99 | 12267.95 | 1.44 | 0.0 | 66.71 | 67.84 | 2.19 | 1.53 |
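For comparison with the Wahab results: on Turing, the 310-core PREMA run (276.08 s) is roughly 3% slower than the pure MPI baseline (268.56 s), while widening the allocation to 496 cores brings PREMA's total time down to 182.90 s.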