Intel compiler setup and usage tips
Background
Intel's proprietary compiler (ICC) generates optimized code for IA-32 and Intel 64 architectures, and non-optimized code for non-Intel but compatible processors, such as certain AMD processors [Wikipedia Intel C++ Compiler].
Context to DOSBox Staging
Of notable value is that ICC generates helpful reports explaining why it could (or couldn't) optimize or vectorize a particular chunk of code or loop. These reports, combined with its Advisor tool, can provide insight into small adjustments that allow GCC, Clang, and Visual Studio (the compilers we do care about) to better optimize the code.
Case study
ICC's reports led to a handful of small changes in the GUS feature branch that resulted in all but one data-processing loop being vectorized (the remaining loop is inter-dependent and cannot be unrolled). Before these changes, the GUS feature branch was roughly three-fold more CPU-intensive than the original bit-shifting and truncating code, due to its extensive use of floating-point math to retain dynamic range in the signal. After vectorizing, the GUS feature branch is now 12% faster than the bit-shifting code, simply by leveraging the SIMD registers to perform between 4 and 8 floating-point calculations per iteration.
Acquiring ICC for Linux
Intel allows its compiler to be used for non-commercial purposes. Fill out the application here: https://registrationcenter.intel.com/en/forms/?programid=OPNSRC&productid=2908 and Intel will respond within a week with your license.
Offline installers
Installing the compiler
After receiving your license number:
- Download the full installer
- Untar it and launch its installer using sudo bash install-gui.sh
- Paste in your license when asked
- Disable features to save space, if desired. For example, I typically disable IA32, Threading Building Blocks, MKL, deep learning, cluster, Fortran support, GDB, and Python. I enable the x86-64 C++ compiler, Advisor, and Tuning Profiler.
- Allow it to install to its default location, /opt/intel.
Using the compiler
Once installed in /opt/intel, you can build dosbox-staging using ICC:
- Debug build: ./script/build.sh -c icc -t debug
- Release build: ./script/build.sh -c icc -t release
- Optimization info build: ./script/build.sh -c icc -t optinfo
In the last case, each .c and .cpp file will have an .optrpt file sitting alongside it.
Using the optimization reports
The report is divided by function, so it's easy to search by function name.
Here's an example of a failed vectorization attempt:
===========================================================================
Begin optimization report for: Gus::SoftLimit(Gus *, const float *, int16_t *)
Report from: Loop nest & Vector optimizations [loop, vec]
LOOP BEGIN at gus.cpp(755,2) inlined into gus.cpp(763,2)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed ANTI dependence between in[i] (756:3) and this->right (757:3)
remark #15346: vector dependence: assumed FLOW dependence between this->right (757:3) and in[i] (756:3)
remark #25438: unrolled without remainder by 2
LOOP END
The report says it couldn't vectorize the loop because of ANTI and FLOW "dependence" issues, which gives us a hint of what's happening. The for loop in question is pretty darn straightforward, though, so what's going on?
void Gus::UpdatePeakAmplitudes(const float *stream)
{
    for (int i = 0; i < BUFFER_SAMPLES - 1; i += 2) {
        peak.left = std::max(peak.left, fabsf(stream[i]));
        peak.right = std::max(peak.right, fabsf(stream[i + 1]));
    }
}
The problem is that the result from one pass becomes the input to the next pass, which prevents the compiler from unrolling the loop and performing many passes simultaneously by filling the SIMD registers with the work from, say, 4 passes through the loop. Unfortunately, each of those passes requires content from the prior one, so they can't be done at once.
That said, we could help the compiler vectorize this, but the code would get quite ugly:
Frame peaks[4] = {}; // zero-initialized, one accumulator per stereo frame
// Each pass consumes 8 interleaved samples (4 stereo frames)
for (int i = 0; i < BUFFER_SAMPLES - 7; i += 8) {
    peaks[0].left = std::max(peaks[0].left, fabsf(stream[i]));
    peaks[0].right = std::max(peaks[0].right, fabsf(stream[i + 1]));
    peaks[1].left = std::max(peaks[1].left, fabsf(stream[i + 2]));
    peaks[1].right = std::max(peaks[1].right, fabsf(stream[i + 3]));
    peaks[2].left = std::max(peaks[2].left, fabsf(stream[i + 4]));
    peaks[2].right = std::max(peaks[2].right, fabsf(stream[i + 5]));
    peaks[3].left = std::max(peaks[3].left, fabsf(stream[i + 6]));
    peaks[3].right = std::max(peaks[3].right, fabsf(stream[i + 7]));
}
This would let the compiler vectorize the loop, allowing for (perhaps) a 4x speed-up, and would require 12 loops. We would then need a final serialized for loop to find the peak amplitudes among the four peaks entries, which would "cost" us the efficiency gained from one of the 12 loops, so the speedup might be 4 * (11 / 12) ~= 3.7x.
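For reference, here is a minimal, self-contained sketch of that idea, including the final serialized reduction. The Frame struct, the BUFFER_SAMPLES value of 96, and the find_peaks name are assumptions made up for illustration; they are not taken from the GUS source. Whether a given compiler actually vectorizes it still needs to be confirmed with an optinfo build.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the stereo peak pair used in the GUS code.
struct Frame {
    float left = 0.0f;
    float right = 0.0f;
};

// Assumed buffer size (interleaved L/R samples); must be a multiple of 8 here.
constexpr std::size_t BUFFER_SAMPLES = 96;

// Hypothetical helper: returns the peak absolute amplitude per channel.
Frame find_peaks(const float *stream)
{
    // Four independent accumulators, so one pass no longer feeds the next.
    Frame peaks[4] = {};
    for (std::size_t i = 0; i + 7 < BUFFER_SAMPLES; i += 8) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            const std::size_t s = i + 2 * lane; // start of this lane's frame
            peaks[lane].left = std::max(peaks[lane].left, std::fabs(stream[s]));
            peaks[lane].right = std::max(peaks[lane].right, std::fabs(stream[s + 1]));
        }
    }
    // Final serialized pass: fold the four partial peaks into one.
    Frame peak;
    for (const Frame &p : peaks) {
        peak.left = std::max(peak.left, p.left);
        peak.right = std::max(peak.right, p.right);
    }
    return peak;
}
```

Calling find_peaks on a buffer of BUFFER_SAMPLES floats gives the same per-channel maxima as the simple two-samples-per-pass loop, because taking a maximum doesn't depend on the order in which the samples are visited.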
But that's ugly, and we don't want code like that. Here's an example when things go right:
LOOP BEGIN at gus.cpp(767,3)
<Peeled loop for vectorization>
remark #25015: Estimate of max trip count of loop=7
LOOP END
LOOP BEGIN at gus.cpp(767,3)
remark #25264: Loop rerolled by 2
remark #15388: vectorization support: reference out[i] has aligned access [ gus.cpp(768,4) ]
remark #15388: vectorization support: reference in[i] has aligned access [ gus.cpp(768,34) ]
remark #15305: vectorization support: vector length 8
remark #15309: vectorization support: normalized vectorization overhead 2.333
remark #15300: LOOP WAS VECTORIZED
remark #15442: entire loop may be executed in remainder
remark #15448: unmasked aligned unit stride loads: 1
remark #15449: unmasked aligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 7
remark #15477: vector cost: 0.750
remark #15478: estimated potential speedup: 4.940
remark #15487: type converts: 1
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=12
LOOP END
The code in question is about as straightforward as it gets:
// If our peaks are under the max, then there's no need to limit
if (peak.left < AUDIO_SAMPLE_MAX && peak.right < AUDIO_SAMPLE_MAX) {
    for (int i = 0; i < BUFFER_SAMPLES - 1; i += 2) { // vectorized
        out[i] = static_cast<int16_t>(in[i]);
        out[i + 1] = static_cast<int16_t>(in[i + 1]);
    }
    return;
}
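To experiment with this loop outside of the GUS code, something like the following standalone sketch could be used. The convert_samples name and the BUFFER_SAMPLES value are assumptions for illustration only. The __restrict qualifier is a compiler extension (accepted by GCC, Clang, ICC, and MSVC) that promises the two buffers don't overlap; it's one possible way to rule out the kind of "assumed" dependence seen in the failed report, not necessarily what the GUS branch does.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed buffer size (interleaved L/R samples), for illustration only.
constexpr std::size_t BUFFER_SAMPLES = 96;

// Hypothetical helper: converts in-range float samples to 16-bit integers.
// The caller must guarantee every sample fits in int16_t (as the peak
// check does in the original code); out-of-range values are undefined.
void convert_samples(const float *__restrict in, int16_t *__restrict out)
{
    // Each iteration is independent of the others, so the compiler is
    // free to convert several samples per SIMD instruction.
    for (std::size_t i = 0; i < BUFFER_SAMPLES; ++i) {
        out[i] = static_cast<int16_t>(in[i]); // truncates toward zero
    }
}
```

Stepping one sample at a time is equivalent to the two-at-a-time loop above; the report's "Loop rerolled by 2" remark suggests ICC performs the same simplification itself.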
Other reasons why the compiler might not be able to vectorize:
- branches within the loop
- exceptions inside the loop
- the estimated cost is greater than the serial loop. Costs can include:
- unaligned accesses (which are less efficient and need more pre-load time)
- a small loop count, making the SIMD preparation cost carry a greater portion of the overall cost
- not being able to figure out the loop count to find a safe unroll size, such as with complex or unrelated while conditions
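As a rough illustration of the "branches within the loop" item, the sketch below shows a clamping loop written with if/else next to an equivalent version written with std::min and std::max. All names and limits here (SAMPLE_MIN, SAMPLE_MAX, clamp_branchy, clamp_branchless) are hypothetical and not from the GUS source. Compilers can often convert simple branches like these on their own, so treat this as something to check against the optimization report rather than a guaranteed win.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical limits matching the int16_t range.
constexpr float SAMPLE_MIN = -32768.0f;
constexpr float SAMPLE_MAX = 32767.0f;

// Branchy form: control flow inside the loop body.
void clamp_branchy(const float *in, int16_t *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float s = in[i];
        if (s > SAMPLE_MAX) {
            s = SAMPLE_MAX;
        } else if (s < SAMPLE_MIN) {
            s = SAMPLE_MIN;
        }
        out[i] = static_cast<int16_t>(s);
    }
}

// Branch-free form: min/max map naturally onto SIMD min/max instructions.
void clamp_branchless(const float *in, int16_t *out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const float s = std::min(std::max(in[i], SAMPLE_MIN), SAMPLE_MAX);
        out[i] = static_cast<int16_t>(s);
    }
}
```

Both functions produce identical output; the difference is only in how easy the loop body is for the vectorizer to reason about.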