
Move MaterializeEncodingIntoNop pass to a later stage and enable it for all backends #17817

Merged: lialan merged 1 commit into iree-org:main on Jul 13, 2024

Conversation

@lialan (Contributor) commented Jul 7, 2024

This is a prerequisite for disabling early materialization pass.

Issue: #17719

@lialan lialan force-pushed the lialan/nop branch 4 times, most recently from 6768f78 to 3bbf88f Compare July 8, 2024 03:52
@lialan lialan marked this pull request as ready for review July 8, 2024 04:02
@hanhanW hanhanW added benchmarks:cuda Run default CUDA benchmarks benchmarks:x86_64 Run default x86_64 benchmarks benchmarks:comp-stats Run default compilation statistics benchmarks benchmarks:android-cpu Run default Android CPU benchmarks benchmarks:android-gpu Run default Android GPU benchmarks benchmarks:vulkan-nvidia Run default Vulkan benchmarks on NVIDIA GPU labels Jul 8, 2024
@hanhanW (Contributor) left a comment


Interesting, I didn't expect changes in MaterializeHomogeneousEncodings.cpp, because it could introduce a new graph for other backends on the default path. E.g., the padding and encoding ops are formed into a single dispatch when there are multiple targets. We will delete the data-tiling passes from GlobalOpt (or move them to the preprocessing phase) once we finish #17722. I'd suggest not changing this file, for the stability of the multi-device project; we should be able to develop data-tiling under flags. In terms of test coverage, I think we can add a few e2e matmul suites to other backends, like:

# LLVMCPU, data-tiling, data-tiling + ukernels + late materialization.
[iree_generated_e2e_runner_test(
    name = "e2e_matmul_cpu_experimental_dt%s_%s_%s_%s" % (
        ("_uk" if use_uk else ""),
        lhs_rhs_type,
        acc_type,
        size,
    ),
    compiler_flags = [
        "--iree-opt-data-tiling",
        "--iree-global-opt-enable-early-materialization=false",

I.e., we can create one suite for each backend and add the two additional compilation flags below to each suite.

        "--iree-opt-data-tiling",
        "--iree-global-opt-enable-early-materialization=false",

@lialan lialan force-pushed the lialan/nop branch 12 times, most recently from a9a52ee to d43c1f5 Compare July 10, 2024 15:55
@lialan (Contributor, Author) commented Jul 10, 2024

@hanhanW Can you review again?

I added specific tests for the other backends, but hit problems with SPIR-V: some data types are unsupported in the legalization pass. It took me a while, but I haven't figured out a simple data type choice that passes SPIR-V.

I think this is a SPIR-V-specific issue (some data types are not supported) that is unrelated to the scope of this patch, so I think we should just skip this test.

@lialan lialan requested a review from hanhanW July 10, 2024 17:51
@hanhanW (Contributor) left a comment


Nice! Can you also add tests for amdgpu? Unlike other targets, the amdgpu tests are only available in CMakeLists.txt. That's why you don't see any of them in the BUILD.bazel. I think we can add the test suite for both CDNA and RDNA3.

# To distinguish between CDNA(gfx9) and RDNA3(gfx11)
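To make the suggestion concrete, here is a hedged CMake sketch of two per-architecture suites. The suite names, labels, and argument spellings are assumptions for illustration only; the actual generator, runner, and target-chip arguments would follow the existing ROCm suites in tests/e2e/matmul/CMakeLists.txt:

```cmake
# Sketch only: separate data-tiling suites per AMDGPU architecture.
# To distinguish between CDNA(gfx9) and RDNA3(gfx11):
iree_generated_e2e_runner_test(
  NAME
    e2e_matmul_rocm_experimental_dt_f32_cdna   # gfx9-class targets
  COMPILER_FLAGS
    "--iree-opt-data-tiling"
    "--iree-global-opt-enable-early-materialization=false"
  LABELS
    "requires-gpu-cdna"
  # Generator, runner, and target-chip arguments as in the existing
  # ROCm suites.
)

iree_generated_e2e_runner_test(
  NAME
    e2e_matmul_rocm_experimental_dt_f32_rdna3  # gfx11-class targets
  COMPILER_FLAGS
    "--iree-opt-data-tiling"
    "--iree-global-opt-enable-early-materialization=false"
  LABELS
    "requires-gpu-rdna3"
)
```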

compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp (review thread, outdated and resolved)
Comment on lines 61 to 64
{
FunctionLikeNest newFuncPassManager(funcPassManager);
addEncodingToNopPasses(newFuncPassManager);
}

I don't think we need the change on the VMVX side either; it should already be handled by the CPUMaterializeEncoding pass.

    addCommonTargetExecutablePreprocessingPasses(funcPassManager);
  }
  modulePassManager.addPass(createMaterializeUserConfigsPass());
  FunctionLikeNest(modulePassManager)
      .addPass([&]() { return createCPUMaterializeEncodingPass(); })
      // TODO: Remove the following pass and plumb support for
      // #hal.descriptor_type memory space through the stack.
      .addPass(createEraseHALDescriptorTypeFromMemRefPass);
  modulePassManager.addPass(createVMVXSelectLoweringStrategyPass());
}

Note: The VMVX pipeline is quite different from other backends, so you don't find the pass in this file.

tests/e2e/matmul/BUILD.bazel (two review threads, outdated and resolved)
@ScottTodd (Collaborator)

> Nice! Can you also add tests for amdgpu? Unlike other targets, the amdgpu tests are only available in CMakeLists.txt. That's why you don't see any of them in the BUILD.bazel. I think we can add the test suite for both CDNA and RDNA3.
>
> # To distinguish between CDNA(gfx9) and RDNA3(gfx11)

I'm currently refactoring tests/e2e/ so those tests will be definable in Bazel too, FYI. I'll probably look at tests/e2e/matmul/ last though, since it's more complicated.

@hanhanW previously approved these changes Jul 11, 2024

@hanhanW (Contributor) left a comment


LGTM, thanks!

@hanhanW hanhanW dismissed their stale review July 11, 2024 03:07

I want to take a look at the SPIR-V failure; dismissing my approval for now

tests/e2e/matmul/CMakeLists.txt (two review threads, outdated and resolved)
@lialan (Contributor, Author) commented Jul 11, 2024

@ScottTodd perhaps you can also help a bit here? https://github.com/iree-org/iree/actions/runs/9898353931/job/27345667940?pr=17817#step:15:144

It has hit the binary size limit.

@ScottTodd (Collaborator)

> @ScottTodd perhaps you can also help a bit here? https://github.com/iree-org/iree/actions/runs/9898353931/job/27345667940?pr=17817#step:15:144
>
> It has hit the binary size limit.

Working on it: #17873

@lialan (Contributor, Author) commented Jul 11, 2024

@hanhanW CI is green now, please review again.

@lialan lialan force-pushed the lialan/nop branch 2 times, most recently from e446f54 to 11be982 Compare July 12, 2024 18:19
@lialan lialan removed benchmarks:android-cpu Run default Android CPU benchmarks benchmarks:android-gpu Run default Android GPU benchmarks labels Jul 13, 2024

Abbreviated Benchmark Summary

@ commit d196a9b084d5d5aaf4266bbec9e7c706ecac834e (vs. base be461bd0c17d9e607a316b8312bdc0f62298f581)

Data-Tiling Comparison Table

Name No-DT (baseline) DT-Only DT-UK
BertLargeTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 756.300 (1.0X) N/A 223.320 (3.4X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 7.000 (1.0X) N/A 8.524 (0.8X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 36.158 (1.0X) N/A 34.587 (1.0X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.780 (1.0X) N/A 5.018 (1.2X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 9.259 (1.0X) N/A 8.561 (1.1X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.040 (1.0X) N/A 9.005 (1.2X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 12.121 (1.0X) N/A 13.947 (0.9X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 33.796 (1.0X) N/A 61.777 (0.5X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.655 (1.0X) N/A 62.163 (0.6X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[15-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 68.602 (1.0X) N/A 65.147 (1.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.561 (1.0X) N/A 4.624 (1.0X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 3.699 (1.0X) N/A 4.903 (0.8X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 5.890 (1.0X) N/A 5.451 (1.1X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 2.906 (1.0X) N/A 2.847 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 8.429 (1.0X) N/A 9.936 (0.8X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 0.779 (1.0X) N/A 0.611 (1.3X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[8-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 4.191 (1.0X) N/A 5.213 (0.8X)
matmul_256x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 7.562 (1.0X) N/A 7.594 (1.0X)
matmul_256x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 6.675 (1.0X) N/A 1.806 (3.7X)
BertForMaskedLMTF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[30-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 219.923 (1.0X) N/A 109.339 (2.0X)
DeepLabV3_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 32.396 (1.0X) N/A 29.873 (1.1X)
EfficientNetV2STF(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 275.271 (1.0X) N/A 229.977 (1.2X)
EfficientNet_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 27.031 (1.0X) N/A 13.054 (2.1X)
GPT2_117M_TF_1X1XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 71.003 (1.0X) N/A 40.126 (1.8X)
GPT2_117M_TF_1X4XI32(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 87.679 (1.0X) N/A 42.062 (2.1X)
MiniLML12H384Uncased(stablehlo) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 79.812 (1.0X) N/A 56.773 (1.4X)
MobileBertSquad_fp16(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 180.675 (1.0X) N/A 186.919 (1.0X)
MobileBertSquad_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 182.686 (1.0X) N/A 192.263 (1.0X)
MobileBertSquad_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 516.662 (1.0X) N/A 240.640 (2.1X)
MobileNetV1_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 25.220 (1.0X) N/A 17.978 (1.4X)
MobileNetV2_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 11.622 (1.0X) N/A 11.525 (1.0X)
MobileNetV2_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 21.505 (1.0X) N/A 11.910 (1.8X)
MobileNetV3Small_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 2.849 (1.0X) N/A 2.717 (1.0X)
MobileSSD_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 34.206 (1.0X) N/A 31.411 (1.1X)
PersonDetect_int8(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.710 (1.0X) N/A 0.549 (1.3X)
PoseNet_fp32(tflite) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_task(embedded_elf)[1-thread,full-inference,default-flags] with default @ c2-standard-60[cpu] 17.597 (1.0X) N/A 19.056 (0.9X)
matmul_1x256x2048_i8_i4_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.054 (1.0X) N/A 0.054 (1.0X)
matmul_1x256x2048_i8_i8_i32_tile_config_default(linalg) [x86_64-cascadelake-linux_gnu-llvm_cpu] local_sync(embedded_elf)[full-inference,default-flags] with default @ c2-standard-60[cpu] 0.042 (1.0X) N/A 0.022 (2.0X)

No improved or regressed benchmarks 🏖️

No improved or regressed compilation metrics 🏖️

For more information:

Source Workflow Run

@lialan lialan merged commit 2ed3f92 into iree-org:main Jul 13, 2024
62 checks passed
@lialan lialan deleted the lialan/nop branch July 13, 2024 05:00
@ScottTodd (Collaborator)

When merging pull requests, please keep the PR title and description in the merged commit. That should be the default behavior.

https://iree.dev/developers/general/contributing/#merging-approved-changes

This was merged as 2ed3f92, with a confusing commit title and no commit body:
[screenshot of the merged commit]

(btw, during code reviews please also prefer to push new commits instead of force-pushing; separate commits make it easier for reviewers to see what changed between review rounds)
