Intel compiler setup and usage tips

Background

Intel's proprietary compiler (ICC) generates optimized code for IA-32 and Intel 64 architectures, and non-optimized code for non-Intel but compatible processors, such as certain AMD processors [Wikipedia Intel C++ Compiler].

Context to DOSBox Staging

Of notable value is that ICC generates helpful reports explaining why it could (or couldn't) optimize or vectorize a particular chunk of code or loop. These reports, combined with its Advisor tool, can provide insight into small adjustments that allow GCC, Clang, and Visual Studio (the compilers we do care about) to better optimize the code.

Case study

ICC's reports led to a handful of small changes in the GUS feature branch that resulted in all but one data-processing loop being vectorized (the remaining loop is inter-dependent and cannot be unrolled). Before these changes, the GUS feature branch was roughly three-fold more CPU-intensive than the original bit-shifting and truncating code, due to its extensive use of floating-point math to retain dynamic range in the signal. After vectorizing, the GUS feature branch is now 12% faster than the bit-shifting code, simply by leveraging the SIMD registers to perform between 4 and 8 floating-point calculations per iteration.

Acquiring ICC for Linux

Intel allows its compiler to be used for non-commercial purposes. Fill out the application here: https://registrationcenter.intel.com/en/forms/?programid=OPNSRC&productid=2908 and Intel will respond within a week with your license.

Offline installers

Linux: Parallel Studio XE Professional 2020 Update 4
Linux: vTune and Collector 2021
MacOS: vTune and Collector 2021
Windows: vTune and Collector 2021

Installing the compiler

After receiving your license number:

Download the full installer
Untar it and launch its installer using sudo bash install-gui.sh
Paste in your license when asked
Disable features to save space, if desired. For example, I typically disable IA32, Threading Building Blocks, MKL, deep learning, cluster, Fortran support, GDB, and Python. I enable the x86-64 C++ compiler, Advisor, and Tuning Profiler.
Allow it to install to its default location, /opt/intel.

Using the compiler

Once ICC is installed in /opt/intel, you can build dosbox-staging with it:

Debug build: ./script/build.sh -c icc -t debug
Release build: ./script/build.sh -c icc -t release
Optimization info build: ./script/build.sh -c icc -t optinfo

In the last case, each .c and .cpp file will have a matching .optrpt file sitting alongside it.

Using the optimization reports

The report is divided by function, so it's easy to search by function name.

Here's an example of a failed vectorization attempt:

===========================================================================

Begin optimization report for: Gus::SoftLimit(Gus *, const float *, int16_t *)

    Report from: Loop nest & Vector optimizations [loop, vec]


LOOP BEGIN at gus.cpp(755,2) inlined into gus.cpp(763,2)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization
   remark #15346: vector dependence: assumed ANTI dependence between in[i] (756:3) and this->right (757:3)
   remark #15346: vector dependence: assumed FLOW dependence between this->right (757:3) and in[i] (756:3)
   remark #25438: unrolled without remainder by 2
LOOP END

The report says it couldn't vectorize the loop because of assumed ANTI and FLOW dependences, which gives us a hint of what's happening. The for loop in question (from UpdatePeakAmplitudes, which the report shows as inlined into SoftLimit) is pretty straightforward though, so what's going on?

void Gus::UpdatePeakAmplitudes(const float *stream)
{
	for (int i = 0; i < BUFFER_SAMPLES - 1; i += 2) {
		peak.left = std::max(peak.left, fabsf(stream[i]));
		peak.right = std::max(peak.right, fabsf(stream[i + 1]));
	}
}

The problem is that the result from one pass becomes the input to the next pass, which prevents the compiler from unrolling the loop and performing many passes simultaneously by filling the SIMD registers with the work from, say, 4 passes through the loop. Each of those passes requires content from the prior one, so they can't be done at once.

That said, we could help the compiler vectorize this, but the code would get quite ugly:

	// Four independent accumulators; each pass consumes 4 frames (8 samples)
	Frame peaks[4] = {};
	for (int i = 0; i + 7 < BUFFER_SAMPLES; i += 8) {
		peaks[0].left = std::max(peaks[0].left, fabsf(stream[i]));
		peaks[0].right = std::max(peaks[0].right, fabsf(stream[i + 1]));
		peaks[1].left = std::max(peaks[1].left, fabsf(stream[i + 2]));
		peaks[1].right = std::max(peaks[1].right, fabsf(stream[i + 3]));
		peaks[2].left = std::max(peaks[2].left, fabsf(stream[i + 4]));
		peaks[2].right = std::max(peaks[2].right, fabsf(stream[i + 5]));
		peaks[3].left = std::max(peaks[3].left, fabsf(stream[i + 6]));
		peaks[3].right = std::max(peaks[3].right, fabsf(stream[i + 7]));
	}

This would let the compiler vectorize the loop, allowing for (perhaps) a 4x speed-up and requiring only 12 iterations. We would then need a final serial for loop to find the peak amplitudes among the four peaks entries, which would "cost" us the efficiency gained from one of the 12 iterations, so the speedup might be 4 * (11 / 12) ~= 3.7x.
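The multi-accumulator idea above, including the final serial reduction, can be sketched as a standalone comparison. This is only an illustration under assumptions: the Frame struct, the function names, and the sample-count parameter are hypothetical stand-ins rather than the actual GUS code, and whether a given compiler vectorizes the second function still has to be checked in its optimization report.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in for the stereo peak pair used in the GUS code.
struct Frame {
	float left = 0.0f;
	float right = 0.0f;
};

// Serial reference: every pass reads the peak written by the previous pass.
Frame FindPeaksSerial(const float *stream, std::size_t num_samples)
{
	assert(num_samples % 2 == 0); // interleaved left/right samples
	Frame peak;
	for (std::size_t i = 0; i < num_samples; i += 2) {
		peak.left = std::max(peak.left, std::fabs(stream[i]));
		peak.right = std::max(peak.right, std::fabs(stream[i + 1]));
	}
	return peak;
}

// Four independent accumulators, so no accumulator is shared between the
// four frames handled in a single pass of the outer loop.
Frame FindPeaksLanes(const float *stream, std::size_t num_samples)
{
	assert(num_samples % 2 == 0);
	constexpr std::size_t lanes = 4;
	constexpr std::size_t step = lanes * 2; // 4 frames = 8 samples per pass
	Frame peaks[lanes];

	std::size_t i = 0;
	for (; i + step <= num_samples; i += step) {
		for (std::size_t lane = 0; lane < lanes; ++lane) {
			const float *frame = stream + i + lane * 2;
			peaks[lane].left = std::max(peaks[lane].left, std::fabs(frame[0]));
			peaks[lane].right = std::max(peaks[lane].right, std::fabs(frame[1]));
		}
	}

	// Final serial reduction across the four accumulators.
	Frame peak;
	for (const Frame &p : peaks) {
		peak.left = std::max(peak.left, p.left);
		peak.right = std::max(peak.right, p.right);
	}

	// Any frames left over when the count is not a multiple of 8 samples.
	for (; i < num_samples; i += 2) {
		peak.left = std::max(peak.left, std::fabs(stream[i]));
		peak.right = std::max(peak.right, std::fabs(stream[i + 1]));
	}
	return peak;
}
```

Both functions return the same peaks for the same input; the second only reorders the comparisons, which is safe here because taking a maximum does not depend on evaluation order.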

But that's ugly, and we don't want code like that. Here's an example when things go right:

LOOP BEGIN at gus.cpp(767,3)
<Peeled loop for vectorization>
   remark #25015: Estimate of max trip count of loop=7
LOOP END

LOOP BEGIN at gus.cpp(767,3)
   remark #25264: Loop rerolled by 2
   remark #15388: vectorization support: reference out[i] has aligned access   [ gus.cpp(768,4) ]
   remark #15388: vectorization support: reference in[i] has aligned access   [ gus.cpp(768,34) ]
   remark #15305: vectorization support: vector length 8
   remark #15309: vectorization support: normalized vectorization overhead 2.333
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15449: unmasked aligned unit stride stores: 1
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 7
   remark #15477: vector cost: 0.750
   remark #15478: estimated potential speedup: 4.940
   remark #15487: type converts: 1
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=12
LOOP END

The code in question is about as straightforward as it gets:

	// If our peaks are under the max, then there's no need to limit
	if (peak.left < AUDIO_SAMPLE_MAX && peak.right < AUDIO_SAMPLE_MAX) {
		for (int i = 0; i < BUFFER_SAMPLES - 1; i += 2) { // vectorized
			out[i] = static_cast<int16_t>(in[i]);
			out[i + 1] = static_cast<int16_t>(in[i + 1]);
		}
		return;
	}
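What makes this loop easy for the compiler is that each output element depends only on the matching input element. A minimal standalone sketch of the same pattern follows; the function name and the count parameter are hypothetical, and it assumes every input value already fits in the int16_t range (which is what the peak check above guarantees in the original code).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts floats to 16-bit samples. Every out[i] depends only on in[i],
// so no iteration needs a result produced by an earlier iteration.
// Assumed precondition: each value fits in int16_t, because converting an
// out-of-range float to an integer type is undefined behavior.
void ConvertSamples(const float *in, std::int16_t *out, std::size_t count)
{
	assert(in != nullptr && out != nullptr);
	for (std::size_t i = 0; i < count; ++i) {
		out[i] = static_cast<std::int16_t>(in[i]);
	}
}
```

The cast truncates toward zero, so 1.9f becomes 1 and -1.9f becomes -1.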

Other reasons why the compiler might not be able to vectorize:

- branches within the loop
- exceptions inside the loop
- the estimated cost is greater than the serial loop. Costs can include:
  - unaligned accesses (which are less efficient and need more pre-load time)
  - a small loop count, making the SIMD preparation cost carry a greater portion of the overall cost
- not being able to figure out the loop count to find a safe unroll size, such as with complex or unrelated while conditions
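As an illustration of the "branches within the loop" case, here is a sketch of a hard clamp written two ways. Everything in it is an assumption made for the example: the function names, the limit constants, and the clamp itself (which is not the GUS soft limiter). Compilers can often convert simple branches like these on their own, so the optimization report remains the way to confirm whether either version was vectorized.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical limits for the example; both are exactly representable as floats.
constexpr float sample_min = -32768.0f;
constexpr float sample_max = 32767.0f;

// Branchy version: control flow inside the loop body depends on the data.
void ClampBranchy(const float *in, std::int16_t *out, std::size_t count)
{
	assert(in != nullptr && out != nullptr);
	for (std::size_t i = 0; i < count; ++i) {
		if (in[i] > sample_max) {
			out[i] = 32767;
		} else if (in[i] < sample_min) {
			out[i] = -32768;
		} else {
			out[i] = static_cast<std::int16_t>(in[i]);
		}
	}
}

// Branch-free version: the same result expressed with min/max, so every
// iteration performs the same sequence of operations.
void ClampBranchFree(const float *in, std::int16_t *out, std::size_t count)
{
	assert(in != nullptr && out != nullptr);
	for (std::size_t i = 0; i < count; ++i) {
		const float limited = std::min(std::max(in[i], sample_min), sample_max);
		out[i] = static_cast<std::int16_t>(limited);
	}
}
```

Both functions produce identical output for finite inputs, so the branch-free form can be swapped in and then verified against the report.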
