Vectorization issue Intel compiler #308

Open
EnricoDeg opened this issue Jan 8, 2024 · 5 comments


EnricoDeg commented Jan 8, 2024

Hello, I'm having a problem vectorizing this very simple loop, where dolic_c and tke_Lmix are mdspan objects:

for (int jc = start_index; jc <= end_index; jc++)
    tke_Lmix(blockNo, dolic_c(blockNo, jc), jc) = 0.0;

I'm using the Intel compiler, and the optimization report gives the following output:

   remark #15389: vectorization support: reference dolic_c.__members.__members[blockNo*dolic_c.__members.__members.__members.__members.__members.__members.__members[1]+jc] has unaligned access   [ /home/k/k202136/YAOP/src/backends/CPU/cpu_kernels.hpp(68,38) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #15329: vectorization support: irregularly indexed store was emulated for the variable <tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe>, part of index is read from memory   [ /home/k/k202136/YAOP/src/backends/CPU/cpu_kernels.hpp(68,21) ]
   remark #15305: vectorization support: vector length 4
   remark #15309: vectorization support: normalized vectorization overhead 0.049
   remark #15450: unmasked unaligned unit stride loads: 1
   remark #15463: unmasked indexed (or scatter) stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 11
   remark #15477: vector cost: 25.750
   remark #15478: estimated potential speedup: 0.420
   remark #15488: --- end vector cost summary ---
   remark #25439: unrolled with remainder by 2

I think this is simple enough that it should be handled by the compiler, and I also don't understand the "irregularly indexed store" issue mentioned in the report.

EnricoDeg changed the title from "Vectorization issue Intel" to "Vectorization issue Intel compiler" on Jan 8, 2024

crtrott (Member) commented Jan 10, 2024

So it looks like you are using a 2D view to index into a 3D view: that is where it goes off the rails. If we break it down into something simpler (not using mdspan), you get this:

int* idx = ...;            // plays the role of dolic_c
int N = ...;               // extent used to scale the middle index
int block_offset_idx = ...;
int block_offset_a = ...;

for (int jc = start; jc <= end; jc++)
    a[block_offset_a + N * idx[block_offset_idx + jc] + jc] = 0.0;

So idx[block_offset_idx + jc] is read from memory, and that is what they consider an irregular index access. It is not clear that this loop can be vectorized without a scatter store.
The compiler actually identified everything correctly: it sees a unit-stride load for dolic_c(blockNo, jc) and then correctly says that you have an indexed (scatter) store into tke_Lmix.
Using scatter stores here is (depending on the architecture) more expensive than simply executing the loop in scalar, and that is what the compiler thinks would happen, so it will not vectorize the code for you unless you tell it to do so even in cases where it estimates a slowdown (with -vec-threshold0).

EnricoDeg (Author)

Okay, thanks for the explanation, but in principle the loop is over jc and the iteration is over a contiguous part of memory, so shouldn't it vectorize anyway? And shouldn't vectorization be better than scalar execution?

EnricoDeg (Author)

Using the option you suggested, the loop is vectorized, but with

estimated potential speedup: 0.490

which is bad. However, when I try something like this:

for (int jc = start_index; jc <= end_index; jc++) {
    tke_Lmix(blockNo, 0, jc) = 0.0;
    tke_Lmix(blockNo, dolic_c(blockNo, jc), jc) = 0.0;
}

I get the following output from the compiler report:

   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed OUTPUT dependence between tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (66:21) and tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (67:21)
   remark #15346: vector dependence: assumed OUTPUT dependence between tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (67:21) and tke_Lmix.__members.__members[(blockNo*tke_Lmix.__members.__membe (66:21)
   remark #25439: unrolled with remainder by 2

Do you have any other suggestions?


crtrott (Member) commented Jan 16, 2024

> Okay, thanks for the explanation, but in principle the loop is over jc and the iteration is over a contiguous part of memory, so shouldn't it vectorize anyway? And shouldn't vectorization be better than scalar execution?

It is not over a contiguous part of memory, because you don't just change the last index in every iteration; you also change the middle index of tke_Lmix in every iteration.

Consider this:

blockNo = 0;

dolic_c(0, 0) = 0;
dolic_c(0, 1) = 1;
// ... dolic_c(0, jc) = jc in general

// then:

&tke_Lmix(0, dolic_c(0, jc), jc) == tke_Lmix.data() + jc * tke_Lmix.extent(2) + jc;

So if you increment jc by 1, the store address jumps by tke_Lmix.extent(2) + 1 elements.

Adding a second statement in there also adds a potential OUTPUT dependency, because dolic_c(blockNo, jc) could be zero, in which case your second statement overwrites the first one within an iteration. If the loop gets vectorized, there is a potential ordering issue. The compiler is not clever enough to realize that you are writing 0.0 in both cases, so that it wouldn't matter which way you overwrite stuff.

EnricoDeg (Author)

Okay, this makes sense, thanks for the explanation.
