Date: Tue, 25 Aug 2015 14:42:27 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

magnum, all -

On Tue, Aug 25, 2015 at 09:06:55AM +0300, Solar Designer wrote:
> We ought to do something about the auto-tuning.  Here are some ideas:
> 
> Maybe have a table of per card type likely optimal LWS (or multipliers
> for powers of 2).

Actually, this info can typically be queried, and we already had code to
do that - but it appeared mostly (or totally?) unused.

Specifically, there are opencl_find_best_workgroup() and
opencl_find_best_lws() functions in common-opencl.c.  The attached patch
#if 0's opencl_find_best_workgroup() (perhaps we need to drop it
completely, and remove it from common-opencl.h too), and revises and
makes use of opencl_find_best_lws().

The new logic is, when neither GWS nor LWS env vars are specified:
pre-tune GWS (with a lower than usual maximum), tune LWS, and finally
tune GWS with the tuned LWS and considering the queried number of
compute units.  Obviously, this is far from perfect - we're trying to
find a maximum of a function of two variables, but are adjusting only
one at a time.  Yet it appears to work much better than the current
approach of tuning GWS only.
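
The three-pass idea above can be sketched in isolation as follows.  This is
only an illustration under assumptions, not the code from the patch: the
speed_fn callback, toy_speed(), and all other names here are hypothetical,
the synthetic model simply peaks at LWS=96 and grows with GWS, and the
GWS passes have no time limit (the real ones do).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical benchmark callback: throughput for a (GWS, LWS) pair */
typedef double (*speed_fn)(size_t gws, size_t lws);

struct worksizes { size_t lws, gws; };

/* Double GWS from start up to max, keeping the fastest value seen */
static size_t tune_gws(speed_fn speed, size_t lws, size_t start, size_t max)
{
	size_t gws, best = start;
	double best_speed = 0.0;

	assert(start > 0);
	for (gws = start; gws <= max; gws *= 2) {
		double s = speed(gws, lws);
		if (s > best_speed) {
			best_speed = s;
			best = gws;
		}
	}
	return best;
}

/* Try LWS = step, 2*step, ... up to max_lws at a fixed target GWS */
static size_t tune_lws(speed_fn speed, size_t gws, size_t step, size_t max_lws)
{
	size_t lws, best = step;
	double best_speed = 0.0;

	assert(step > 0);
	for (lws = step; lws <= max_lws; lws += step) {
		/* Round GWS down to a multiple of the candidate LWS */
		size_t g = gws >= lws ? gws / lws * lws : lws;
		double s = speed(g, lws);
		if (s > best_speed) {
			best_speed = s;
			best = lws;
		}
	}
	return best;
}

/* Pre-tune GWS, tune LWS, then tune GWS again seeded with LWS * CUs */
static struct worksizes autotune_sketch(speed_fn speed, size_t default_lws,
    size_t wg_multiple, size_t max_lws, size_t compute_units,
    size_t pre_max, size_t full_max)
{
	struct worksizes w;
	size_t gws = tune_gws(speed, default_lws, default_lws, pre_max);

	w.lws = tune_lws(speed, gws, wg_multiple, max_lws);
	w.gws = tune_gws(speed, w.lws, w.lws * compute_units, full_max);
	return w;
}

/* Synthetic model (an assumption): best at LWS=96, saturating in GWS */
static double toy_speed(size_t gws, size_t lws)
{
	double d = lws > 96 ? (double)(lws - 96) : (double)(96 - lws);
	double eff = 1.0 / (1.0 + d / 96.0);

	return eff * (double)gws / ((double)gws + 20000.0);
}
```

Adjusting one variable at a time like this can miss the true maximum
when the best LWS depends strongly on GWS, which is the imperfection
noted above.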

When either LWS or GWS is specified, then only the other is auto-tuned
(once).  When both are specified, nothing is auto-tuned.
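
As a compact summary of these cases, here is a hedged sketch of the
decision table.  The struct and function names are made up for
illustration, and treating any non-empty LWS or GWS environment variable
as "specified" is a simplification; the real code also reads john.conf
preferences via opencl_get_user_preferences().

```c
#include <assert.h>
#include <stdlib.h>

struct tune_plan {
	int pretune_gws; /* coarse GWS pass with a lower maximum */
	int tune_lws;    /* LWS pass at the current GWS */
	int tune_gws;    /* GWS pass with the LWS already known */
};

/* Decide which passes to run from what the user specified */
static struct tune_plan plan_tuning(int lws_given, int gws_given)
{
	struct tune_plan p;

	p.tune_lws = !lws_given;
	p.tune_gws = !gws_given;
	/* The pre-tune pass only makes sense when both are unknown */
	p.pretune_gws = p.tune_lws && p.tune_gws;
	assert(!p.pretune_gws || (p.tune_lws && p.tune_gws));
	return p;
}

/* Simplified (assumption): only the LWS and GWS env vars are consulted */
static struct tune_plan plan_from_env(void)
{
	const char *lws = getenv("LWS"), *gws = getenv("GWS");

	return plan_tuning(lws && *lws, gws && *gws);
}
```

With neither variable set this yields all three passes, with exactly one
set it yields a single pass over the other, and with both set it yields
none.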

For example, with md5crypt-opencl on GTX TITAN, where the previous
approach worked poorly:

[solar@...er run]$ ./john -test -form=md5crypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw:    1984K c/s real, 1984K c/s virtual

[solar@...er run]$ time ./john -test -form=md5crypt-opencl -dev=5 -v=4
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws:      1024      160637 c/s   160637000 rounds/s   6.374ms per crypt_all()!
gws:      2048      306444 c/s   306444000 rounds/s   6.683ms per crypt_all()+
gws:      4096      572829 c/s   572829000 rounds/s   7.150ms per crypt_all()+
gws:      8192      957582 c/s   957582000 rounds/s   8.554ms per crypt_all()+
gws:     16384      989299 c/s   989299000 rounds/s  16.561ms per crypt_all()+
gws:     32768     1225015 c/s  1225015000 rounds/s  26.749ms per crypt_all()+
gws:     65536     1402179 c/s  1402179000 rounds/s  46.738ms per crypt_all()+
Calculating best local worksize (LWS)
Testing GWS=65536 LWS=32 ... 190469952ns
Testing GWS=65536 LWS=64 ... 107994464ns
Testing GWS=65472 LWS=96 ... 93050272ns
Testing GWS=65536 LWS=128 ... 92955840ns
Testing GWS=65440 LWS=160 ... 94382368ns
Testing GWS=65472 LWS=192 ... 93250048ns
Testing GWS=65408 LWS=224 ... 95941952ns
Testing GWS=65536 LWS=256 ... 93266272ns
Testing GWS=65536 LWS=512 ... 93425312ns
Testing GWS=65536 LWS=1024 ... 106644352ns
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws:      1344      247774 c/s   247774000 rounds/s   5.424ms per crypt_all()!
gws:      2688      465121 c/s   465121000 rounds/s   5.779ms per crypt_all()+
gws:      5376      811578 c/s   811578000 rounds/s   6.624ms per crypt_all()+
gws:     10752     1335447 c/s  1335447000 rounds/s   8.051ms per crypt_all()+
gws:     21504     1963838 c/s  1963838000 rounds/s  10.949ms per crypt_all()+
gws:     43008     1978725 c/s  1978725000 rounds/s  21.735ms per crypt_all()
gws:     86016     1985954 c/s  1985954000 rounds/s  43.312ms per crypt_all()+
gws:    172032     1993503 c/s  1993503000 rounds/s  86.296ms per crypt_all()
gws:    344064     1996328 c/s  1996328000 rounds/s 172.348ms per crypt_all()
gws:    688128     2002809 c/s  2002809000 rounds/s 343.581ms per crypt_all()
Local worksize (LWS) 96, global worksize (GWS) 86016
DONE
Raw:    1978K c/s real, 1978K c/s virtual


real    0m5.642s
user    0m3.445s
sys     0m2.111s
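
To see why LWS=96 was picked even though LWS=128 measured slightly
faster, the selection rule from the attached patch (accept a candidate
only if it takes less than 0.997 of the best time so far, and jump to the
next power of 2 after a non-improving candidate at LWS >= 256) can be
replayed against the timings logged above.  This is a standalone sketch:
the helper names are hypothetical, the starting LWS and step of 32 are
assumed from the log, and exact_multiple() is only assumed to round down
the way GET_EXACT_MULTIPLE appears to in the "Testing GWS=..." lines.

```c
#include <assert.h>
#include <stddef.h>

struct sample { size_t lws; unsigned long long ns; };

/* Total times copied from the -v=4 log above */
static const struct sample log_samples[] = {
	{32, 190469952ULL}, {64, 107994464ULL}, {96, 93050272ULL},
	{128, 92955840ULL}, {160, 94382368ULL}, {192, 93250048ULL},
	{224, 95941952ULL}, {256, 93266272ULL}, {512, 93425312ULL},
	{1024, 106644352ULL}
};

/* Look up a logged timing; returns 0 if that LWS was never benchmarked */
static unsigned long long logged_ns(size_t lws)
{
	size_t i;

	for (i = 0; i < sizeof(log_samples) / sizeof(log_samples[0]); i++)
		if (log_samples[i].lws == lws)
			return log_samples[i].ns;
	return 0;
}

/* Round gws down to a multiple of lws (assumed macro behavior) */
static size_t exact_multiple(size_t gws, size_t lws)
{
	assert(lws > 0);
	return gws > lws ? gws / lws * lws : lws;
}

/* Replay the candidate loop; stores the number of candidates in *tested */
static size_t replay_lws(size_t wg_multiple, size_t max_lws, size_t *tested)
{
	size_t lws, best = 0, count = 0;
	unsigned long long best_ns = ~0ULL;

	assert(wg_multiple > 0);
	for (lws = wg_multiple; lws <= max_lws; lws += wg_multiple) {
		unsigned long long ns = logged_ns(lws);

		if (!ns)
			break; /* no measurement for this candidate */
		count++;
		if ((double)ns / best_ns < 0.997) {
			best_ns = ns;
			best = lws;
		} else if (lws >= 256 || (lws >= 8 && wg_multiple < 8)) {
			size_t x = lws, y;

			while ((y = x & (x - 1)))
				x = y; /* keep only the highest set bit */
			x *= 2;
			/* Round up to a multiple, minus the loop's increment */
			lws = (x + wg_multiple - 1) / wg_multiple * wg_multiple
			    - wg_multiple;
		}
	}
	*tested = count;
	return best;
}
```

Replaying with a step of 32 and a maximum of 1024 visits the same ten
candidates as the log.  LWS=128 is about 0.1% faster than LWS=96, which is
inside the 0.3% tolerance, so 96 stays selected.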

Some other formats show speedups as well.  I didn't test all, though.
There might be regressions.

One known issue is that the LWS tuning probably needs a time limit, in
case the device supports a very high maximum LWS.  This may be
implemented similarly to the time limit that GWS tuning already has.
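
One possible shape for such a limit, as a rough sketch only: stop trying
further LWS candidates once the accumulated benchmark time exceeds a
budget.  Note this is a cumulative budget, which is an assumption of
mine and differs from the per-invocation limit the GWS tuning reports;
bench_fn, toy_bench(), and the budget values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: measured kernel time in ns for one LWS */
typedef unsigned long long (*bench_fn)(size_t lws);

/* Scan LWS candidates, giving up once budget_ns of runtime is spent */
static size_t tune_lws_with_budget(bench_fn bench, size_t wg_multiple,
    size_t max_lws, unsigned long long budget_ns)
{
	size_t lws, best = wg_multiple;
	unsigned long long best_ns = ~0ULL, spent = 0;

	assert(wg_multiple > 0);
	for (lws = wg_multiple; lws <= max_lws; lws += wg_multiple) {
		unsigned long long ns;

		if (spent >= budget_ns)
			break; /* out of time: keep the best found so far */
		ns = bench(lws);
		spent += ns;
		if (ns < best_ns) {
			best_ns = ns;
			best = lws;
		}
	}
	return best;
}

/* Synthetic timings (an assumption): fastest at LWS=96 */
static unsigned long long toy_bench(size_t lws)
{
	size_t d = lws > 96 ? lws - 96 : 96 - lws;

	return 1000000ULL + 10000ULL * d;
}
```

With a tight budget the scan may stop before reaching the best candidate,
so a real implementation would probably want to combine this with the
power-of-2 jumps to cover the range faster.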

Also, this code needs a cleanup.  My patch is a hack on top of other hacks.

Many formats provide their own idea of their desired LWS and GWS; maybe
we should drop most of this, as I suspect they are often less optimal
than the new auto-tuning.  Even md5crypt-opencl benchmarked above has a
boilerplate get_default_workgroup() in it, and the new auto-tuning
actually respects this initially (for the initial GWS tuning).  Maybe we
should instead start right with a device query to determine initial LWS
from that.  The copies of get_default_workgroup() pasted into multiple
format files look ridiculous.

Alexander

diff --git a/src/common-opencl.c b/src/common-opencl.c
index 8da586f..7f0440c 100644
--- a/src/common-opencl.c
+++ b/src/common-opencl.c
@@ -1069,6 +1069,7 @@ void opencl_build_from_binary(int sequential_id)
 		fprintf(stderr, "Binary Build log: %s\n", opencl_log);
 }
 
+#if 0
 /*
  *   NOTE: Requirements for using this function:
  *
@@ -1278,6 +1279,7 @@ void opencl_find_best_workgroup_limit(struct fmt_main *self,
 	profilingEvent = firstEvent = lastEvent = NULL;
 	dyna_salt_remove(salt);
 }
+#endif
 
 // Do the proper test using different global work sizes.
 static void clear_profiling_events()
@@ -1480,12 +1482,9 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
 		benchEvent[i] = NULL;
 
 	if (options.verbosity > 3)
-		fprintf(stderr, "Max local worksize "Zu", ", group_size_limit);
+		fprintf(stderr, "Calculating best local worksize (LWS)\n");
 
-	/* Formats supporting vectorizing should have a default max keys per
-	   crypt that is a multiple of 2 and of 3 */
-	gws = global_work_size ? global_work_size :
-	      self->params.max_keys_per_crypt / opencl_v_width;
+	gws = global_work_size;
 
 	if (get_device_version(sequential_id) < 110) {
 		if (get_device_type(sequential_id) == CL_DEVICE_TYPE_GPU)
@@ -1584,8 +1583,13 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
 	        (int)my_work_group <= (int)max_group_size;
 	        my_work_group += wg_multiple) {
 
+		global_work_size = gws;
 		if (gws % my_work_group != 0)
-			continue;
+			global_work_size = GET_EXACT_MULTIPLE(gws, my_work_group);
+
+		if (options.verbosity > 3)
+			fprintf(stderr, "Testing GWS=" Zu " LWS=" Zu " ...",
+			    global_work_size, my_work_group);
 
 		sumStartTime = 0;
 		sumEndTime = 0;
@@ -1603,7 +1607,7 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
 				startTime = endTime = 0;
 
 				if (options.verbosity > 3)
-					fprintf(stderr, " Error occurred\n");
+					fprintf(stderr, " crypt_all() error\n");
 				break;
 			}
 
@@ -1626,9 +1630,25 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
 		}
 		if (!endTime)
 			break;
-		if ((sumEndTime - sumStartTime) < kernelExecTimeNs) {
+		if (options.verbosity > 3)
+			fprintf(stderr, " " Zu "ns\n", sumEndTime - sumStartTime);
+		if ((double)(sumEndTime - sumStartTime) / kernelExecTimeNs < 0.997) {
 			kernelExecTimeNs = sumEndTime - sumStartTime;
 			optimal_work_group = my_work_group;
+		} else {
+			if (my_work_group >= 256 ||
+			    (my_work_group >= 8 && wg_multiple < 8)) {
+				/* Jump to next power of 2 */
+				size_t x, y;
+				x = my_work_group;
+				while ((y = x & (x - 1)))
+					x = y;
+				x *= 2;
+				my_work_group =
+				    GET_MULTIPLE_OR_BIGGER(x, wg_multiple);
+				/* The loop logic will re-add wg_multiple */
+				my_work_group -= wg_multiple;
+			}
 		}
 	}
 	// Release profiling queue and create new with profiling disabled
@@ -1639,17 +1659,28 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
 	                         devices[sequential_id], 0, &ret_code);
 	HANDLE_CLERROR(ret_code, "Error creating command queue");
 	local_work_size = optimal_work_group;
+	global_work_size = GET_EXACT_MULTIPLE(gws, local_work_size);
 
 	dyna_salt_remove(salt);
 }
 
 void opencl_find_best_gws(int step, unsigned long long int max_run_time,
-                          int sequential_id, unsigned int rounds)
+                          int sequential_id, unsigned int rounds, int have_lws)
 {
 	size_t num = 0;
-	size_t optimal_gws = local_work_size;
+	size_t optimal_gws = local_work_size, soft_limit = 0;
 	unsigned long long speed, best_speed = 0, raw_speed;
 	cl_ulong run_time, min_time = CL_ULONG_MAX;
+	unsigned long long int save_duration_time = duration_time;
+	cl_uint core_count = get_max_compute_units(sequential_id);
+
+	if (have_lws) {
+		if (core_count > 2)
+			optimal_gws *= core_count;
+		default_value = optimal_gws;
+	} else {
+		soft_limit = local_work_size * core_count * 128;
+	}
 
 	/*
 	 * max_run_time is either:
@@ -1692,8 +1723,12 @@ void opencl_find_best_gws(int step, unsigned long long int max_run_time,
 
 		// Check if hardware can handle the size we are going
 		// to try now.
-		if ((gws_limit && (num > gws_limit)) || ((gws_limit == 0) &&
-		        (buffer_size * kpc * 1.1 > get_max_mem_alloc_size(gpu_id)))) {
+		if ((soft_limit && (num > soft_limit)) ||
+		    (gws_limit && (num > gws_limit)) || ((gws_limit == 0) &&
+		    (buffer_size * kpc * 1.1 > get_max_mem_alloc_size(gpu_id)))) {
+			if (!optimal_gws)
+				optimal_gws = num;
+
 			if (options.verbosity > 4)
 				fprintf(stderr, "Hardware resources exhausted\n");
 			break;
@@ -1743,6 +1778,8 @@ void opencl_find_best_gws(int step, unsigned long long int max_run_time,
 	                         devices[sequential_id], 0, &ret_code);
 	HANDLE_CLERROR(ret_code, "Error creating command queue");
 	global_work_size = optimal_gws;
+
+	duration_time = save_duration_time;
 }
 
 static void opencl_get_dev_info(int sequential_id)
diff --git a/src/common-opencl.h b/src/common-opencl.h
index 66969d5..1e76b1d 100644
--- a/src/common-opencl.h
+++ b/src/common-opencl.h
@@ -270,7 +270,7 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
  *   For raw formats it should be 1. For sha512crypt it is 5000.
  */
 void opencl_find_best_gws(int step, unsigned long long int max_run_time,
-                          int sequential_id, unsigned int rounds);
+                          int sequential_id, unsigned int rounds, int have_lws);
 
 /*
  * Shared function to initialize variables necessary by shared find(lws/gws) functions.
diff --git a/src/opencl-autotune.h b/src/opencl-autotune.h
index 6cfa22e..2f3ee95 100644
--- a/src/opencl-autotune.h
+++ b/src/opencl-autotune.h
@@ -49,7 +49,7 @@ size_t autotune_get_task_max_work_group_size(int use_local_memory,
   of keys per crypt for the given format
 -- */
 void autotune_find_best_gws(int sequential_id, unsigned int rounds, int step,
-	unsigned long long int max_run_time);
+	unsigned long long int max_run_time, int have_lws);
 
 /* --
   This function could be used to calculated the best local
@@ -78,11 +78,11 @@ static void find_best_lws(struct fmt_main * self, int sequential_id)
   of keys per crypt for the given format
 -- */
 static void find_best_gws(struct fmt_main * self, int sequential_id, unsigned int rounds,
-	unsigned long long int max_run_time)
+	unsigned long long int max_run_time, int have_lws)
 {
 	//Call the common function.
 	autotune_find_best_gws(
-		sequential_id, rounds, STEP, max_run_time
+		sequential_id, rounds, STEP, max_run_time, have_lws
 	);
 
 	create_clobj(global_work_size, self);
@@ -108,13 +108,16 @@ static void find_best_gws(struct fmt_main * self, int sequential_id, unsigned in
 static void autotune_run_extra(struct fmt_main * self, unsigned int rounds,
 	size_t gws_limit, unsigned long long int max_run_time, cl_uint lws_is_power_of_two)
 {
+	int need_best_lws, need_best_gws;
+
 	/* Read LWS/GWS prefs from config or environment */
 	opencl_get_user_preferences(FORMAT_LABEL);
 
 	if (!global_work_size && !getenv("GWS"))
 		global_work_size = get_task_max_size();
 
-	if (!local_work_size && !getenv("LWS"))
+	need_best_lws = !local_work_size && !getenv("LWS");
+	if (need_best_lws)
 		local_work_size = get_default_workgroup();
 
 	if (gws_limit && (global_work_size > gws_limit))
@@ -134,14 +137,27 @@ static void autotune_run_extra(struct fmt_main * self, unsigned int rounds,
 		local_work_size = get_task_max_work_group_size();
 
 	/* Enumerate GWS using *LWS=NULL (unless it was set explicitly) */
-	if (!global_work_size)
-		find_best_gws(self, gpu_id, rounds, max_run_time);
-	else
+	need_best_gws = !global_work_size;
+	if (need_best_gws) {
+		unsigned long long int max_run_time1;
+		int have_lws = !(!local_work_size || need_best_lws);
+		if (have_lws) {
+			max_run_time1 = max_run_time;
+			need_best_gws = 0;
+		} else {
+			max_run_time1 = (max_run_time + 1) / 2;
+		}
+		find_best_gws(self, gpu_id, rounds, max_run_time1, have_lws);
+	} else {
 		create_clobj(global_work_size, self);
+	}
 
-	if (!local_work_size)
+	if (!local_work_size || need_best_lws)
 		find_best_lws(self, gpu_id);
 
+	if (need_best_gws)
+		find_best_gws(self, gpu_id, rounds, max_run_time, 1);
+
 	/* Adjust to the final configuration */
 	release_clobj();
 	global_work_size = GET_EXACT_MULTIPLE(global_work_size, local_work_size);
diff --git a/src/opencl_autotune.c b/src/opencl_autotune.c
index f42303b..c03c21a 100644
--- a/src/opencl_autotune.c
+++ b/src/opencl_autotune.c
@@ -83,7 +83,7 @@ void autotune_find_best_lws(size_t group_size_limit,
    of keys per crypt for the given format
    -- */
 void autotune_find_best_gws(int sequential_id, unsigned int rounds, int step,
-                          unsigned long long int max_run_time)
+                          unsigned long long int max_run_time, int have_lws)
 {
 	char *tmp_value;
 
@@ -93,7 +93,7 @@ void autotune_find_best_gws(int sequential_id, unsigned int rounds, int step,
 	step = GET_MULTIPLE_OR_ZERO(step, local_work_size);
 
 	//Call the default function.
-	opencl_find_best_gws(step, max_run_time, sequential_id, rounds);
+	opencl_find_best_gws(step, max_run_time, sequential_id, rounds, have_lws);
 }
 
 #endif
