Difference between revisions of "PREMA SW4"

Latest revision as of 19:23, 27 January 2020

SW4lite

Figure 1. The communication pattern of SW4lite proxy app.

Figure 2. The modified communication pattern of SW4lite proxy app to work with PREMA. The blue arrows represent the data requests, while the red ones show the actual data transfers.

In high-performance computing (HPC), proxy applications (“proxy apps”) are small, simplified codes that allow application developers to share important features of large applications without forcing collaborators to assimilate large and complex code bases. The Exascale Proxy Applications Project, part of the Exascale Computing Project (ECP), is a project where developers can explore and share proxy apps.

Proxy apps are often used as models for performance-critical computations, but proxy apps can do more than just represent algorithms or computational characteristics of apps. They also capture programming methods and styles that drive requirements for compilers and other elements of the toolchain. Within ECP, application teams, co-design centers, software technology projects, and vendors all plan to use proxy apps as a major mechanism to drive collaborations and co-design solutions for exascale challenges.

A major goal of the Exascale Proxy Applications Project is to improve the quality of proxies created by ECP and maximize the benefit received from their use. To accomplish this goal, an ECP proxy app suite composed of proxies developed by ECP projects that represent the most important features (especially performance) of exascale applications has been created. To ensure high quality of ECP proxy apps, the Exascale Proxy Applications Project has defined standards for documentation, build and test systems, performance models and evaluations, etc.

One such proxy application is the SW4lite app. SW4lite is a bare-bones version of SW4 intended for testing performance optimizations in a few important numerical kernels of SW4. SW4 implements substantial capabilities for 3-D seismic modeling, with a free surface condition on the top boundary, absorbing super-grid conditions on the far-field boundaries, and an arbitrary number of point force and/or point moment tensor source terms. Each source time function can have one of many predefined analytical time dependencies, or interpolate a user-defined discrete time series.

The proxy starts by decomposing the original 2D grid into as many partitions as there are available processes. Each processor is assigned one equally sized partition, so the processors can be thought of as logically arranged in a 2D grid. Next, a preprocessing step runs on these partitions and involves some data sharing between them. Once this step completes, an iterative process starts in which computation is interleaved with two cycles of neighbor-to-neighbor communication per iteration. Each cycle follows the same pattern: each processor sends a portion of its partition to its left neighbor and waits to receive the corresponding data from its right neighbor. Once the data are received, the same pairs of processors exchange data in the opposite direction. The same process then takes place along the y-axis: each processor sends another portion of its data to its bottom neighbor, waits to receive the corresponding data from its top neighbor, and then the pairs communicate in the opposite direction. Processors located on the edges of the grid send and receive data only to and from the neighbors they have. Figure 1 shows the communication pattern schematically.
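The four-step cycle described above can be sketched in a few lines of Python. This is only an illustration, not code from SW4lite: the row-major rank layout and the names `neighbors` and `exchange_schedule` are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def neighbors(rank: int, px: int, py: int) -> Dict[str, Optional[int]]:
    """Neighbors of `rank` in a px-by-py process grid (row-major layout assumed).

    A missing neighbor on the edge of the grid is reported as None.
    """
    x, y = rank % px, rank // px
    return {
        "left": rank - 1 if x > 0 else None,
        "right": rank + 1 if x < px - 1 else None,
        "bottom": rank - px if y > 0 else None,
        "top": rank + px if y < py - 1 else None,
    }


def exchange_schedule(
    rank: int, px: int, py: int
) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (step, send_to, recv_from) for the four steps of one cycle."""
    n = neighbors(rank, px, py)
    return [
        (1, n["left"], n["right"]),    # x-axis: send left, receive from right
        (2, n["right"], n["left"]),    # x-axis: opposite direction
        (3, n["bottom"], n["top"]),    # y-axis: send down, receive from top
        (4, n["top"], n["bottom"]),    # y-axis: opposite direction
    ]


if __name__ == "__main__":
    # Print the schedule for every rank of a hypothetical 3x2 process grid.
    for r in range(6):
        print(r, exchange_schedule(r, px=3, py=2))
```

In every step, a rank that sends to some neighbor is also the rank that neighbor expects to receive from, which is why the sends and receives pair up.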

To port the application to PREMA, we followed these steps:

Each of the partitions of the decomposed 2D grid is registered as a mobile object.
Each MPI rank holds partitions equal to the number of cores it utilizes.
Preprocessing computations are performed by invoking remote handlers on each of the partitions.
Two-sided communications are replaced with one-sided asynchronous remote method invocations.

The four-step communication pattern of the application's main iterative process has been modified to ensure correctness. In two-sided MPI the receiver explicitly posts a receive for the data, whereas message reception in PREMA is implicit. We therefore need to make sure, before the data are sent, that the receiver is ready to accept them without corrupting its state. To achieve this we have added two more steps to the communication pattern, one for each direction. Communication in each direction begins with the receiving neighbor requesting the data once it is ready to receive them; the sender then sends the data when it is ready to do so. In this way, we guarantee that both neighbors' data are consistent after the first step of communication in each direction. The requests for the second and fourth steps are implicit, as they arrive as part of the actual data sent in the first and third steps. Figure 2 shows the modified communication pattern.
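The request-then-send idea can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not PREMA code: the `Partition` class, the handler names, and the `mailbox` queue are hypothetical stand-ins for mobile objects and remote handler invocations.

```python
from collections import deque


class Partition:
    """Hypothetical stand-in for a mobile object holding one grid partition."""

    def __init__(self, name, boundary):
        self.name = name
        self.boundary = boundary  # data this partition shares with its neighbor
        self.halo = None          # copy of the neighbor's boundary data

    def on_request(self, peer, payload, mailbox):
        # Added step: the neighbor has asked for data, so it is ready to receive.
        mailbox.append((peer, "on_data_with_request", self, self.boundary))

    def on_data_with_request(self, peer, payload, mailbox):
        # Data arrive; the message also acts as the implicit request
        # for the transfer in the opposite direction.
        self.halo = payload
        mailbox.append((peer, "on_data", self, self.boundary))

    def on_data(self, peer, payload, mailbox):
        # Final transfer of the pair: store the data, nothing more to send.
        self.halo = payload


def exchange(a, b):
    """Run one axis of the pattern: b requests, a sends, b replies."""
    mailbox = deque([(a, "on_request", b, None)])
    trace = []
    while mailbox:
        target, handler, peer, payload = mailbox.popleft()
        trace.append((handler, peer.name, target.name))
        getattr(target, handler)(peer, payload, mailbox)
    return trace


if __name__ == "__main__":
    left, right = Partition("A", [1, 2]), Partition("B", [3, 4])
    for handler, src, dst in exchange(left, right):
        print(f"{src} -> {dst}: {handler}")
```

Three messages carry two data transfers: one explicit request, one data message that doubles as a request, and one plain data message.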

MPI Times

PREMA Times

Allocation: 300 cores in different configurations

Pure MPI Time: 182.32

Cluster: Wahab

Nodes:

     d4-w6420b-[07-12],
     e1-w6420b-20,
     e2-w6420b-[02-04,06,08,17],
     e3-w6420b-[09-12,17,20]

#Nodes	#Cores	Total Time	Recv	Send	#m Sent	#m Recvd	App Handlers	Creating Work	P2P-MP	Handlers Executed	LB	Yieldables	Blocked	Steal	Steal_Succ
300	1	170.65	0.66	0.65	12184.75	20892.91	135.64	147.99	2.99	12126.90	11.27	35.43	103.84	1.54	0.0
150	2	175.58	0.87	0.94	17895.50	30110.82	139.69	140.11	34.10	12126.90	0.68	66.44	73.16	0.32	0.25
100	3	176.58	1.08	1.28	24305.74	40606.23	143.09	143.49	31.79	12126.90	0.68	76.11	66.93	0.27	0.20
75	4	176.22	1.16	1.48	29436.79	48996.64	143.51	143.92	30.98	12126.90	0.69	75.51	67.91	0.32	0.24
60	5	178.04	1.29	1.71	35129.87	58484.55	138.08	138.57	37.84	12126.90	0.85	68.08	69.91	0.37	0.27
50	6	179.93	1.49	2.10	42224.16	70137.46	141.62	142.09	36.24	12126.90	0.83	71.45	70.12	0.32	0.21
30	10	187.78	1.67	3.07	64217.67	106014.10	145.65	146.21	39.85	12126.90	0.91	73.14	72.40	0.47	0.31
25	12	192.23	1.93	3.91	78517.24	129349.92	146.19	146.82	43.58	12126.90	1.03	71.80	74.22	0.56	0.37
20	15	193.54	1.93	4.53	96706.40	158971.15	143.63	144.33	47.32	12126.90	1.12	69.72	73.74	0.60	0.38
15	20	195.95	2.02	5.75	123265.27	201163.20	149.65	150.40	43.71	12126.90	1.08	72.42	76.95	0.79	0.50
12	25	210.77	1.89	6.07	126917.50	205646.75	155.94	156.87	51.74	12126.90	1.30	73.67	81.81	1.13	0.73
10	30	222.58	1.74	6.01	123972.0	199357.30	160.44	161.57	58.68	12126.90	1.48	76.66	83.16	1.45	0.91

Allocation: 1000 cores without dedicated thread for MPI. 1 MPI rank per socket -> 4 MPI ranks/node. Pure MPI Time: 69.8s

#Nodes	#Cores	Total Time	Recv	Send	#m Sent	#m Recvd	App Handlers	Creating Work	P2P-MP	Handlers Executed	LB	ILB	Yieldables	Blocked	Steal	Steal_Succ
100	10	65.69	0.75	2.47	68044.18	111721.23	49.68	52.72	6.27	12459.53	4.55	0.0	21.10	26.85	3.58	2.45

No. Cores	MPI Time (s)	PREMA Time (s)
120	566.615	450.271
250	231.538	225.804
500	124.86	120.256
750	107.271	84.8418
1000	69.442	64.5786
1250	60.2811	54.3832
1500	47.7293	45.8788
1750	42.6359	40.794
2000	38.9251	37.0859
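To read the table above as relative gains, the reduction in run time can be computed per core count. The snippet below is a small sketch that copies the timings from the table; the function names are made up for the example.

```python
# (cores, MPI time in s, PREMA time in s), copied from the table above.
timings = [
    (120, 566.615, 450.271),
    (250, 231.538, 225.804),
    (500, 124.86, 120.256),
    (750, 107.271, 84.8418),
    (1000, 69.442, 64.5786),
    (1250, 60.2811, 54.3832),
    (1500, 47.7293, 45.8788),
    (1750, 42.6359, 40.794),
    (2000, 38.9251, 37.0859),
]


def relative_reduction(mpi: float, prema: float) -> float:
    """Fraction of the MPI run time saved by the PREMA version."""
    return (mpi - prema) / mpi


def summarize(rows):
    """Map each core count to its relative run-time reduction."""
    return {cores: relative_reduction(mpi, prema) for cores, mpi, prema in rows}


if __name__ == "__main__":
    for cores, gain in summarize(timings).items():
        print(f"{cores:>5} cores: run time reduced by {gain:.1%}")
```

With these numbers the PREMA version is faster at every core count, by roughly 20% at 120 cores and under 5% at 2000 cores.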

Cluster: Turing

Pure MPI Time: 268.56

Allocation: 310 cores (due to issues with the tools on Turing)

#Nodes	#Cores	Total Time	Recv	Send	#m Sent	#m Recvd	App Handlers	Creating Work	P2P-MP	Handlers Executed	LB	ILB	Yieldables	Blocked	Steal	Steal_Succ
10	31	276.08	34.63	51.60	186837.90	300642.30	208.39	212.43	59.56	12026.52	2.06	0.0	106.82	100.66	3.34	2.39

Allocation: 496 cores

#Nodes	#Cores	Total Time	Recv	Send	#m Sent	#m Recvd	App Handlers	Creating Work	P2P-MP	Handlers Executed	LB	ILB	Yieldables	Blocked	Steal	Steal_Succ
16	31	182.90	33.29	47.52	192765.06	313180.31	135.00	137.86	40.99	12267.95	1.44	0.0	66.71	67.84	2.19	1.53


@@ Line 1: / Line 1: @@
-=PREMA SW4 Results=
+== SW4lite ==
+[[File:SW4-Comm.png| thumb| 300px| right|Figure 1. The communication pattern of SW4lite proxy app.]]
+[[File:PREMA SW4.png| thumb| 400px| right|Figure 2. The modified communication pattern of SW4lite proxy app to work with PREMA. The blue arrows represent the data requests, while the red ones show the actual data transfers.]]
+In high-performance computing (HPC), proxy applications (“proxy apps”) are small, simplified codes that allow application developers to share important features of large applications without forcing collaborators to assimilate large and complex code bases. The Exascale Proxy Applications Project, part of the Exascale Computing Project (ECP), is a project where developers can explore and share proxy apps.
+Proxy apps are often used as models for performance-critical computations, but proxy apps can do more than just represent algorithms or computational characteristics of apps. They also capture programming methods and styles that drive requirements for compilers and other elements of the toolchain. Within ECP, application teams, co-design centers, software technology projects, and vendors all plan to use proxy apps as a major mechanism to drive collaborations and co-design solutions for exascale challenges.
+A major goal of the Exascale Proxy Applications Project is to improve the quality of proxies created by ECP and maximize the benefit received from their use. To accomplish this goal, an ECP proxy app suite composed of proxies developed by ECP projects that represent the most important features (especially performance) of exascale applications has been created. To ensure high quality of ECP proxy apps, the Exascale Proxy Applications Project has defined standards for documentation, build and test systems, performance models and evaluations, etc.
+One such proxy application is the [https://github.com/geodynamics/sw4lite| SW4lite app]. SW4lite is a bare-bones version of [https://geodynamics.org/cig/software/sw4/| SW4] intended for testing performance optimizations in a few important numerical kernels of SW4. SW4 implements substantial capabilities for 3-D seismic modeling, with a free surface condition on the top boundary, absorbing super-grid conditions on the far-field boundaries, and an arbitrary number of point force and/or point moment tensor source terms. Each source time function can have one of many predefined analytical time dependencies, or interpolate a user-defined discrete time series.
+The proxy starts by decomposing the original 2D grid into as many partitions as there are available processes. Each processor is assigned one equally sized partition, so
+the processors can be thought of as logically arranged in a 2D grid. Next, a preprocessing step runs on these partitions and involves some
+data sharing between them. Once this step completes, an iterative process starts in which computation is interleaved with two cycles of neighbor-to-neighbor communication per iteration. Each cycle follows the same pattern: each processor sends a portion of its partition to its left neighbor and waits to receive the corresponding data from its right neighbor. Once the data are received, the same pairs of processors exchange data in the opposite direction. The same process then takes place along the y-axis: each processor sends another portion of its data to its bottom neighbor, waits to receive the corresponding data from its top neighbor, and then the pairs communicate in the opposite direction. Processors located on the edges of the grid send and receive data only to and from the neighbors they have. Figure 1 shows the communication pattern schematically.
+To port the application to PREMA, we followed these steps:
+* Each of the partitions of the decomposed 2D grid is registered as a mobile object.
+* Each MPI rank holds partitions equal to the number of cores it utilizes.
+* Preprocessing computations are performed by invoking remote handlers on each of the partitions.
+* Two-sided communications are replaced with one-sided asynchronous remote method invocations.
+The four-step communication pattern of the application's main iterative process has been modified to ensure correctness.
+In two-sided MPI the receiver explicitly posts a receive for the data,
+whereas message reception in PREMA is implicit. We therefore need to make sure, before the data are sent, that the receiver is ready to accept them without corrupting its state.
+To achieve this we have added two more steps to the communication pattern, one for each direction. Communication in each direction begins with the receiving neighbor requesting
+the data once it is ready to receive them; the sender then sends the data when it is ready to do so. In this way, we guarantee that both neighbors' data are consistent after the first step
+of communication in each direction. The requests for the second and fourth steps are implicit, as they arrive as part of the actual data sent in the first and third steps. Figure 2 shows the
+modified communication pattern.
+== MPI Times ==
+== PREMA Times ==
+Allocation: 300 cores in different configurations
+Pure MPI Time: 182.32
+=== Cluster: Wahab ===
+Nodes:
+      d4-w6420b-[07-12],
+      e1-w6420b-20,
+      e2-w6420b-[02-04,06,08,17],
+      e3-w6420b-[09-12,17,20]
+{| class="wikitable"
+!#Nodes !! #Cores !! Total Time !! Recv !! Send !! #m Sent !! #m Recvd !! App Handlers !! Creating Work !! P2P-MP !! Handlers Executed !! LB !! ILB !! Yieldables !! Blocked !! Steal !! Steal_Succ
+|-
+|300
+|1
+|170.65
+|0.66
+|0.65
+|12184.75
+|20892.91
+|135.64
+|147.99
+|2.99
+|12126.90
+|11.27
+|0.0
+|35.43
+|103.84
+|1.54
+|0.0
+|-
+|150
+|2
+|175.58
+|0.87
+|0.94
+|17895.50
+|30110.82
+|139.69
+|140.11
+|34.10
+|12126.90
+|0.68
+|0.0
+|66.44
+|73.16
+|0.32
+|0.25
+|-
+|100
+|3
+|176.58
+|1.08
+|1.28
+|24305.74
+|40606.23
+|143.09
+|143.49
+|31.79
+|12126.90
+|0.68
+|0.0
+|76.11
+|66.93
+|0.27
+|0.20
+|-
+|75
+|4
+|176.22
+|1.16
+|1.48
+|29436.79
+|48996.64
+|143.51
+|143.92
+|30.98
+|12126.90
+|0.69
+|0.0
+|75.51
+|67.91
+|0.32
+|0.24
+|-
+|60
+|5
+|178.04
+|1.29
+|1.71
+|35129.87
+|58484.55
+|138.08
+|138.57
+|37.84
+|12126.90
+|0.85
+|0.0
+|68.08
+|69.91
+|0.37
+|0.27
+|-
+|50
+|6
+|179.93
+|1.49
+|2.10
+|42224.16
+|70137.46
+|141.62
+|142.09
+|36.24
+|12126.90
+|0.83
+|0.0
+|71.45
+|70.12
+|0.32
+|0.21
+|-
+|30
+|10
+|187.78
+|1.67
+|3.07
+|64217.67
+|106014.10
+|145.65
+|146.21
+|39.85
+|12126.90
+|0.91
+|0.0
+|73.14
+|72.40
+|0.47
+|0.31
+|-
+|25
+|12
+|192.23
+|1.93
+|3.91
+|78517.24
+|129349.92
+|146.19
+|146.82
+|43.58
+|12126.90
+|1.03
+|0.0
+|71.80
+|74.22
+|0.56
+|0.37
+|-
+|20
+|15
+|193.54
+|1.93
+|4.53
+|96706.40
+|158971.15
+|143.63
+|144.33
+|47.32
+|12126.90
+|1.12
+|0.0
+|69.72
+|73.74
+|0.60
+|0.38
+|-
+|15
+|20
+|195.95
+|2.02
+|5.75
+|123265.27
+|201163.20
+|149.65
+|150.40
+|43.71
+|12126.90
+|1.08
+|0.0
+|72.42
+|76.95
+|0.79
+|0.50
+|-
+|12
+|25
+|210.77
+|1.89
+|6.07
+|126917.50
+|205646.75
+|155.94
+|156.87
+|51.74
+|12126.90
+|1.30
+|0.0
+|73.67
+|81.81
+|1.13
+|0.73
+|-
+|10
+|30
+|222.58
+|1.74
+|6.01
+|123972.0
+|199357.30
+|160.44
+|161.57
+|58.68
+|12126.90
+|1.48
+|0.0
+|76.66
+|83.16
+|1.45
+|0.91
+|}
+Allocation: 1000 cores '''without dedicated thread for MPI'''. 1 MPI rank per socket -> 4 MPI ranks/node.
+Pure MPI Time: 69.8s
+{| class="wikitable"
+!#Nodes !! #Cores !! Total Time !! Recv !! Send !! #m Sent !! #m Recvd !! App Handlers !! Creating Work !! P2P-MP !! Handlers Executed !! LB !! ILB !! Yieldables !! Blocked !! Steal !! Steal_Succ
+|-
+|100
+|10
+|65.69
+|0.75
+|2.47
+|68044.18
+|111721.23
+|49.68
+|52.72
+|6.27
+|12459.53
+|4.55
+|0.0
+|21.10
+|26.85
+|3.58
+|2.45
+|}
+{| class="wikitable"
+|-
+! rowspan="2" | No. Cores
+! colspan="2" style="text-align: center;"| Time (s)
+|-
+! MPI !! PREMA
+|-
+|120  || 566.615 ||450.271
+|-
+|250  || 231.538 ||225.804
+|-
+|500  || 124.86 ||120.256
+|-
+|750  || 107.271  ||84.8418
+|-
+|1000 || 69.442 ||64.5786
+|-
+|1250 || 60.2811  ||54.3832
+|-
+|1500 || 47.7293 ||45.8788
+|-
+|1750 || 42.6359 || 40.794
+|-
+|2000 || 38.9251 ||37.0859
+|}
+=== Cluster: Turing ===
+Pure MPI Time: 268.56
+Allocation: 310 cores (due to issues with the tools on Turing)
+{| class="wikitable"
+!#Nodes !! #Cores !! Total Time !! Recv !! Send !! #m Sent !! #m Recvd !! App Handlers !! Creating Work !! P2P-MP !! Handlers Executed !! LB !! ILB !! Yieldables !! Blocked !! Steal !! Steal_Succ
+|-
+|10
+|31
+|276.08
+|34.63
+|51.60
+|186837.90
+|300642.30
+|208.39
+|212.43
+|59.56
+|12026.52
+|2.06
+|0.0
+|106.82
+|100.66
+|3.34
+|2.39
+|}
+Allocation: 496 cores
+{| class="wikitable"
+!#Nodes !! #Cores !! Total Time !! Recv !! Send !! #m Sent !! #m Recvd !! App Handlers !! Creating Work !! P2P-MP !! Handlers Executed !! LB !! ILB !! Yieldables !! Blocked !! Steal !! Steal_Succ
+|-
+|16
+|31
+|182.90
+|33.29
+|47.52
+|192765.06
+|313180.31
+|135.00
+|137.86
+|40.99
+|12267.95
+|1.44
+|0.0
+|66.71
+|67.84
+|2.19
+|1.53
+|}