opencl - Do global_work_size and local_work_size have any effect on application logic?


I am trying to understand how all of the different parameters for dimensions fit together in OpenCL. If my question isn't clear, that's partly because a well-formed question requires bits of the answer that I don't have.

How do work_dim, global_work_size, and local_work_size work together to create the execution space that you use in a kernel? For example, if I make work_dim 2, then I can call

get_global_id(0); get_global_id(1); 

I can divide those 2 dimensions into n work groups using global_work_size, right? So if I make global_work_size like so

size_t global_work_size[] = { 4 }; 

then each dimension would have 4 work groups, for a total of 8? But, as a beginner, I am only using global_id for my indices, so only the global ids matter anyway. As you can tell, I am pretty confused about all of this, so any help you can offer would ...help.


[Image: a diagram I made to try to understand this question]

[Image: a diagram describing work groups that I found on Google]

Since you stated that you are a bit confused about the concepts involved in the execution space, I'm going to try to summarize them before answering your question, and give some examples.

The threads (work-items) are organized in an NDRange, which can be viewed as a grid of 1, 2, or 3 dimensions. The NDRange is mainly used to map each thread to the piece of data it has to manipulate. Therefore each thread must be uniquely identified, and a thread must know which one it is and where it stands in the NDRange. This is where the work-item built-in functions come in. These functions can be called by all threads to give them information about themselves and about the NDRange in which they stand.

The dimensions:

As already stated, an NDRange can have up to 3 dimensions. So if you set your dimensions this way:

size_t global_work_size[2] = { 4, 4 }; 

It doesn't mean that each dimension has 4 work groups for a total of 8. It means that you'll have 4 * 4, i.e. 16, threads in your NDRange. These threads are arranged in a "square" with sides of 4 units. The work-items can find out how many dimensions the NDRange is made of by using the uint get_work_dim() function.

The global size:

Threads can also query how big the NDRange is in a specific dimension with size_t get_global_size(uint d). Therefore they can know how big the "line/square/rectangle/cube" NDRange is.

The global unique identifiers:

Thanks to this organization, each thread can be uniquely identified by indexes corresponding to the specific dimensions. Hence the thread (2, 1) refers to the thread in the 3rd column and the 2nd row of a 2D range. The function size_t get_global_id(uint d) is used in the kernel to query the id of a thread.
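To make the (column, row) convention concrete, here is a minimal C sketch of how a 2D global id is commonly flattened into a row-major array index. The helper name linear_global_index is hypothetical and not part of OpenCL; inside a kernel the arguments would come from get_global_id(0), get_global_id(1), and get_global_size(0).

```c
#include <stddef.h>

/* Hypothetical helper (not an OpenCL function): row-major linear index of
   the work-item with global id (gx, gy) in a 2D NDRange whose size in
   dimension 0 is global_size_x. */
size_t linear_global_index(size_t gx, size_t gy, size_t global_size_x)
{
    return gx + gy * global_size_x;
}
```

With a 4 x 4 NDRange, thread (2, 1) maps to index 2 + 1 * 4 = 6.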

The workgroup (or local) size:

The NDRange can be split into smaller groups called workgroups. This is the local_work_size you were referring to, which also (and logically) has up to 3 dimensions. Note that for OpenCL versions below 2.0, the NDRange size in a given dimension must be a multiple of the workgroup size in that dimension. To stay with your example, since in dimension 0 we have 4 threads, the workgroup size in dimension 0 can be 1, 2, or 4 but not 3. As with the global size, threads can query the local size with size_t get_local_size(uint d).
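The divisibility rule can be checked on the host before launching a kernel. The following is only a sketch: local_size_divides_global is a hypothetical helper, and it does not check device limits such as the maximum workgroup size.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when local_size is a valid workgroup size for
   global_size in one dimension under the pre-2.0 rule, i.e. the global
   size is a multiple of the local size. Device limits are NOT checked. */
bool local_size_divides_global(size_t global_size, size_t local_size)
{
    return local_size != 0 && global_size % local_size == 0;
}
```

For a global size of 4 this accepts 1, 2, and 4 and rejects 3.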

The local unique identifiers:

Sometimes it is important that a thread can be uniquely identified within a workgroup. Hence the function size_t get_local_id(uint d). Note the "within" in the previous sentence. A thread with local id (1, 0) is the only one to have this id in its (2D) workgroup. But there are as many threads with local id (1, 0) as there are workgroups in the NDRange.

The number of groups:

Speaking of groups, sometimes a thread might need to know how many groups there are. That's why the function size_t get_num_groups(uint d) exists. Note that again you have to pass the dimension you are interested in as the parameter.

Each group also has an id:

...that you can query within a kernel with the function size_t get_group_id(uint d). Note that the format of the group ids is similar to that of the threads: tuples of up to 3 elements.
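The three kinds of id are tied together per dimension: assuming a zero global work offset, the global id equals the group id times the local size, plus the local id. The C sketch below illustrates that relation; the type split_id and both function names are hypothetical and only serve the illustration.

```c
#include <stddef.h>

/* Hypothetical illustration type: one dimension of a global id, split
   into the id of its workgroup and its id inside that workgroup. */
typedef struct {
    size_t group_id;
    size_t local_id;
} split_id;

/* Split a global id, assuming a zero global work offset. */
split_id split_global_id(size_t global_id, size_t local_size)
{
    split_id s = { global_id / local_size, global_id % local_size };
    return s;
}

/* Inverse operation: rebuild the global id from its two parts. */
size_t join_global_id(size_t group_id, size_t local_id, size_t local_size)
{
    return group_id * local_size + local_id;
}
```

For example, with a local size of 2, global id 5 belongs to group 2 and has local id 1.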

Summary:

To wrap things up a bit, if you have a 2D NDRange with a global work size of (4, 6) and a local work size of (2, 2), it means that:

  • the global size in dimension 0 is 4
  • the global size in dimension 1 is 6
  • the local size (or workgroup size) in dimension 0 is 2
  • the local size (or workgroup size) in dimension 1 is 2
  • the thread global ids in dimension 0 range from 0 to 3
  • the thread global ids in dimension 1 range from 0 to 5
  • the thread local ids in dimension 0 range from 0 to 1
  • the thread local ids in dimension 1 range from 0 to 1
  • the total number of threads in the NDRange is 4 * 6 = 24
  • the total number of threads in a workgroup is 2 * 2 = 4
  • the total number of workgroups is (4/2) * (6/2) = 6
  • the group ids in dimension 0 range from 0 to 1
  • the group ids in dimension 1 range from 0 to 2
  • there is only one thread with global id (0, 0), but there are 6 threads with local id (0, 0) because there are 6 groups.
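The arithmetic in this list can be reproduced with a few lines of host-side C. This is a sketch under the assumption that each global size is a multiple of the matching local size; the struct ndrange_summary and the function summarize_2d are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical summary of a 2D NDRange. */
typedef struct {
    size_t work_items;       /* total threads in the NDRange    */
    size_t items_per_group;  /* total threads in one workgroup  */
    size_t groups_x;         /* number of groups in dimension 0 */
    size_t groups_y;         /* number of groups in dimension 1 */
    size_t groups;           /* total number of workgroups      */
} ndrange_summary;

/* Assumes gx is a multiple of lx and gy is a multiple of ly. */
ndrange_summary summarize_2d(size_t gx, size_t gy, size_t lx, size_t ly)
{
    ndrange_summary s;
    s.work_items = gx * gy;
    s.items_per_group = lx * ly;
    s.groups_x = gx / lx;
    s.groups_y = gy / ly;
    s.groups = s.groups_x * s.groups_y;
    return s;
}
```

Calling summarize_2d(4, 6, 2, 2) gives 24 work-items, 4 per group, and 2 * 3 = 6 groups.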

Example:

Here is a dummy example that uses all these concepts together (note that the performance would be terrible; it's just a stupid example).

Let's say you have a 2D array of 6 rows and 4 columns of int. You want to group these elements into squares of 2 by 2 elements and sum them up in such a way that, for instance, the elements (0, 0), (0, 1), (1, 0), (1, 1) are in one group (I hope it's clear enough). Because you'll have 6 "squares" you'll have 6 results for the sums, so you'll need an array of 6 elements to store these results.

To solve this, you use the 2D NDRange detailed just above. Each thread fetches one element from global memory and stores it in local memory. Then, after a synchronization, only one thread per workgroup, let's say each local (0, 0) thread, sums the elements up (in local memory) and stores the result at a specific place in the 6-element array (in global memory).

//in is a 24 int array, result is a 6 int array, temp is a 4 int array
kernel void foo(global int *in, global int *result, local int *temp){
    //use vectors for conciseness
    int2 globalid = (int2)(get_global_id(0), get_global_id(1));
    int2 localid = (int2)(get_local_id(0), get_local_id(1));
    int2 groupid = (int2)(get_group_id(0), get_group_id(1));
    int2 globalsize = (int2)(get_global_size(0), get_global_size(1));
    int2 localsize = (int2)(get_local_size(0), get_local_size(1));
    int2 numberofgrp = (int2)(get_num_groups(0), get_num_groups(1));

    //read from global and store to local
    temp[localid.x + localid.y * localsize.x] = in[globalid.x + globalid.y * globalsize.x];
    //sync: wait until every thread of the workgroup has written to temp
    barrier(CLK_LOCAL_MEM_FENCE);

    //only the threads with local id (0, 0) sum the elements up
    if(localid.x == 0 && localid.y == 0){
        int sum = 0;
        for(int i = 0; i < localsize.x * localsize.y; i++){
            sum += temp[i];
        }
        //store the result in global
        result[groupid.x + numberofgrp.x * groupid.y] = sum;
    }
}
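To check what the kernel is expected to produce, a plain C reference can walk the same groups sequentially on the CPU. This is only a sketch, not OpenCL code: sum_tiles_reference is a hypothetical helper, and it assumes that width and height are multiples of the tile sizes.

```c
#include <stddef.h>

/* Hypothetical CPU reference for the kernel above: for every tile of
   tile_w x tile_h elements in a row-major width x height array, store the
   sum of the tile in result[gx + groups_x * gy], the same layout the
   kernel uses. Assumes width % tile_w == 0 and height % tile_h == 0. */
void sum_tiles_reference(const int *in, int *result,
                         size_t width, size_t height,
                         size_t tile_w, size_t tile_h)
{
    size_t groups_x = width / tile_w;
    size_t groups_y = height / tile_h;

    for (size_t gy = 0; gy < groups_y; gy++) {
        for (size_t gx = 0; gx < groups_x; gx++) {
            int sum = 0;
            /* (lx, ly) plays the role of the local id inside the group */
            for (size_t ly = 0; ly < tile_h; ly++) {
                for (size_t lx = 0; lx < tile_w; lx++) {
                    size_t x = gx * tile_w + lx; /* global id, dim 0 */
                    size_t y = gy * tile_h + ly; /* global id, dim 1 */
                    sum += in[x + y * width];
                }
            }
            result[gx + groups_x * gy] = sum;
        }
    }
}
```

If in holds the values 0 to 23 with width 4, height 6, and 2 by 2 tiles, the first group sums elements 0, 1, 4, and 5, which gives 10.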

And now to answer your question: do global_work_size and local_work_size have any effect on application logic?

Usually yes, because they are part of the way you design your algorithm. Note that the size of the workgroup is not chosen at random but matches my need (here a 2 by 2 square).

Note also that if you decide to use an NDRange of 1 dimension with a size of 24 and a local size of 4 in 1 dimension, it'll screw things up too, because the kernel was designed to use 2 dimensions.
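To see why the 1D launch gives different results, recall that for a dimension beyond work_dim the id queries return 0 and the size queries return 1. Under that launch the kernel therefore sums 4 consecutive elements per group instead of a 2 by 2 square. The C sketch below models this behaviour on the CPU; sum_1d_groups_reference is a hypothetical helper, and it assumes that global_size is a multiple of local_size.

```c
#include <stddef.h>

/* Hypothetical CPU model of the same kernel launched with a 1D NDRange:
   group g holds the consecutive elements g * local_size up to
   g * local_size + local_size - 1, and result[g] receives their sum.
   Assumes global_size % local_size == 0. */
void sum_1d_groups_reference(const int *in, int *result,
                             size_t global_size, size_t local_size)
{
    size_t groups = global_size / local_size;

    for (size_t g = 0; g < groups; g++) {
        int sum = 0;
        for (size_t l = 0; l < local_size; l++) {
            sum += in[g * local_size + l];
        }
        result[g] = sum;
    }
}
```

With in holding 0 to 23, the first result is 0 + 1 + 2 + 3 = 6, i.e. the sum of the first row of the array, whereas the 2D launch yields 10 for the first 2 by 2 square.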

