Investigating a RHEL7 vs. RHEL10 speed difference

TL;DR

In this article I will look into the details of a special case where an older Linux distro performs better than a new one. I will use RHEL7 and RHEL10 here, but the difference can also be seen with other distros. This is limited to the not very common case of the system running fully emulated, for example with QEMU; most environments run virtualized guests instead.

Terms

I will use these terms as follows:

  • Virtualization in our context means that when a guest operating system does computations, these are directly offloaded to the hypervisor’s CPUs. One of the benefits is performance; one of the downsides is that the guest’s CPUs need to exactly match the hypervisor’s CPUs, or be a subset of them.
  • Emulation means that the computations are done in software. With this, host and guest architectures can differ. As a downside, this is slower than virtualization. (A sketch of how the two modes look in QEMU follows after this list.)
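To make the distinction concrete, here is a minimal sketch of how the two modes are selected in QEMU (flags abbreviated, disk.qcow2 is a placeholder):

# virtualization: guest instructions run directly on the host CPU (architectures must match)
qemu-system-x86_64 -accel kvm -cpu host -m 4G disk.qcow2

# emulation: instructions are translated in software (TCG); the host can e.g. be aarch64
qemu-system-x86_64 -accel tcg -cpu Broadwell-v1 -m 4G disk.qcow2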

The initial observation

Why would one run emulation instead of virtualization? I use an aarch64 based laptop for work, and to replicate issues on x86 systems I use full system emulation. This research helped me understand the speed differences between the various x86 CPUs/chipsets which can be emulated, and I made an observation:

Let’s consider two emulated guests, both emulating high end x86-64 chipsets. After installing RHEL7 in one and RHEL10 in the other, we run a CPU-bound workload in the guests, and see that the RHEL7 guest performs better.

# create a 512MB file of random data and compress it
dd if=/dev/urandom of=randfile bs=1M count=512
bzip2 randfile
# the CPU-bound part: decompress the file again
time bzcat randfile.bz2 >/dev/null

The last command, the decompression of the 512MB file, took 342sec on RHEL10 and 223sec on RHEL7.9. These values are averaged over 5 runs.
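A loop like the following (the same one used in the container test further below) produces the individual timings for averaging:

# run the decompression five times and note the timings
for i in {1..5}; do time bzcat randfile.bz2 >/dev/null; done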

The command for emulating a guest:

$ qemu-system-x86_64 -cpu Broadwell-v1 -smp cpus=4 -m 4G \
    -drive file=image-v3.qcow2,if=none,id=hd0 \
    -device virtio-blk-pci,drive=hd0 \
    -netdev bridge,br=virbr0,id=n1 -device virtio-net,netdev=n1 \
    -object rng-random,filename=/dev/urandom,id=viornd0 \
    -device virtio-rng-pci,rng=viornd0 \
    -boot order=c
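From inside a guest, one can verify that the emulated CPU exposes the relevant feature flags; x86-64-v3 requires, among others, AVX2, BMI2 and FMA. A quick check along these lines (my own sketch, not from the original measurements):

# inside the guest: show which of the v3-relevant flags the CPU advertises
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E '^(avx2|bmi2|fma)$'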

Why is that?

We see here a side effect of the distros getting optimized for newer CPUs. Let’s consider RHEL7: released in 2014, it’s built for the plain x86-64 architecture. RHEL10 was released in 2025, and has glibc, one of its core libraries, built in a way that requires CPUs of the x86-64-v3 microarchitecture level. In these 11 years, various new CPU generations have been released. Among other changes, the new CPUs also got new features, i.e. more efficient ways of doing certain computations. These are not used “automatically”; code needs to explicitly utilize the features.

When emulating a full system with QEMU, one gets to choose which CPU features should be emulated - the recommendation is to configure as many features as possible. Our two VMs here are both configured with “-cpu Broadwell-v1”: that’s a CPU following the x86-64-v3 spec. RHEL7 is not utilizing the additional features over plain “x86-64”, but as RHEL10’s glibc (and thus RHEL10 as a whole) is optimized for -v3, we really need that level.
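On distributions with a sufficiently new glibc (2.33 or later, so on RHEL10 but not on RHEL7), the dynamic loader itself can report which hwcaps levels it considers supported - a quick check, assuming the standard loader path:

# ask glibc’s dynamic loader which x86-64 feature levels it recognizes
/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'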

The decision makes total sense: systems which implement -v3 and higher have been available for many years, so we should utilize these features. Florian Weimer has written on this.

Now back to our emulated systems: both support -v3, but only on the emulated RHEL10 is it actually used. RHEL7 gets along with less CPU functionality, and for this fully emulated stack that turns out to be more efficient than RHEL10, where we offload things to CPU features which are themselves just emulated on the host system. QEMU has a nice overview of which chipset has which feature set.
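That overview can also be generated locally: QEMU lists all CPU models (including the versioned ones like Broadwell-v1) that a given build knows about:

# list the CPU models this QEMU build can emulate
qemu-system-x86_64 -cpu help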

Of course, this is a niche and rare situation.. but here we have no choice but to emulate the whole system. :)

Fascinating to see.. and it took a ticket on qemu-devel, and explanations from colleagues, for me to really understand this.

What do debug tools see?

So now with the explanation at hand, our overall expectation is that RHEL7 does more of the workload in its userland, while for RHEL10 more is done in QEMU, emulating these extra registers.

For the curious, here is a text file with

  • the details of the individual decompression runs
  • output from “strace -c bzcat [..]” in the guests, and from “strace -c -p pidofqemu” watching from the host while data is extracted (a sketch of these invocations follows after this list)
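The strace invocations could look like this (the pgrep pattern is my assumption; “pidofqemu” above stands for the PID of the QEMU process):

# inside the guest: per-syscall summary of the decompression
strace -c bzcat randfile.bz2 >/dev/null

# on the host: attach to the (assumed single) QEMU process while the guest extracts
strace -c -p "$(pgrep -f qemu-system-x86_64)"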

I’m not sure whether one can see from QEMU which CPU features are triggered, but with perf inside the guest one would be able to see that.
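For example, a perf session inside the guest along these lines (a sketch, assuming perf is installed in the guest) shows the hot functions, and from there the actual instructions being executed:

# inside the guest: sample where bzcat spends its cycles
perf record -- bzcat randfile.bz2 >/dev/null
perf report    # press 'a' on a hot symbol to annotate down to single instructions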

Containers to the rescue

As final confirmation of the above theory, we can run the workload in a RHEL7 container inside the RHEL10 guest: this will utilize the container’s glibc, and thus do the computations in the way which is more efficient in our situation.
It works indeed:

[rhel10]$ podman login registry.redhat.io
[rhel10]$ podman pull registry.redhat.io/ubi7/ubi:7.9-1445
Trying to pull registry.redhat.io/ubi7/ubi:7.9-1445...
Getting image source signatures
[..]
[rhel10]$ podman run -it registry.redhat.io/ubi7/ubi:7.9-1445 bash
[root@dae7ac8c8859 /]# for i in {1..5}; do time bzcat randfile.bz2 >/dev/null; done
real	3m50.920s
user	3m24.340s
sys	0m12.341s
[..]

Average over 5 runs: 249sec, quite close to the 223sec from plain RHEL7!
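One detail the transcript does not show is how randfile.bz2 got into the container; a simple way (my assumption, not part of the transcript above) is to bind-mount the host directory holding the file:

# mount the working directory into the container; :Z relabels it for SELinux
[rhel10]$ podman run -v "$PWD":/data:Z -it registry.redhat.io/ubi7/ubi:7.9-1445 bash
[root@dae7ac8c8859 /]# cd /data && time bzcat randfile.bz2 >/dev/null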

Closing

To repeat: this is an exotic situation to be in, but it’s very satisfying to have understood it. Interestingly, AlmaLinux10 for x86 is available as a -v3 build (like RHEL10), and also as a -v2 build for compatibility with older CPUs. I’ll compare the download stats in one or two years, curious whether there is much demand for -v2.

Corrections? Questions? -> Fediverse thread


Last modified on 2025-06-27