Language Selection

English French German Italian Portuguese Spanish

Kernel Planet

Syndicate content
Kernel Planet - http://planet.kernel.org
Updated: 1 week 4 days ago

Matthew Garrett: Making hibernation work under Linux Lockdown

Sunday 21st of February 2021 08:37:25 AM
Linux draws a distinction between code running in kernel (kernel space) and applications running in userland (user space). This is enforced at the hardware level - in x86-speak[1], kernel space code runs in ring 0 and user space code runs in ring 3[2]. If you're running in ring 3 and you attempt to touch memory that's only accessible in ring 0, the hardware will raise a fault. No matter how privileged your ring 3 code, you don't get to touch ring 0.

Kind of. In theory. Traditionally this wasn't well enforced. At the most basic level, since root can load kernel modules, you could just build a kernel module that performed any kernel modifications you wanted and then have root load it. Technically user space code wasn't modifying kernel space code, but the difference was pretty semantic rather than useful. But it got worse - root could also map memory ranges belonging to PCI devices[3], and if the device could perform DMA you could just ask the device to overwrite bits of the kernel[4]. Or root could modify special CPU registers ("Model Specific Registers", or MSRs) that alter CPU behaviour via the /dev/msr interface, and compromise the kernel boundary that way.

It turns out that there were a number of ways root was effectively equivalent to ring 0, and the boundary was more about reliability (ie, a process running as root that ends up misbehaving should still only be able to crash itself rather than taking down the kernel with it) than security. After all, if you were root you could just replace the on-disk kernel with a backdoored one and reboot. Going deeper, you could replace the bootloader with one that automatically injected backdoors into a legitimate kernel image. We didn't have any way to prevent this sort of thing, so attempting to harden the root/kernel boundary wasn't especially interesting.

In 2012 Microsoft started requiring vendors ship systems with UEFI Secure Boot, a firmware feature that allowed[5] systems to refuse to boot anything without an appropriate signature. This not only enabled the creation of a system that drew a strong boundary between root and kernel, it arguably required one - what's the point of restricting what the firmware will stick in ring 0 if root can just throw more code in there afterwards? What ended up as the Lockdown Linux Security Module provides the tooling for this, blocking userspace interfaces that can be used to modify the kernel and enforcing that any modules have a trusted signature.

But that comes at something of a cost. Most of the features that Lockdown blocks are fairly niche, so the direct impact of having it enabled is small. Except that it also blocks hibernation[6], and it turns out some people were using that. The obvious question is "what does hibernation have to do with keeping root out of kernel space", and the answer is a little convoluted and is tied into how Linux implements hibernation. Basically, Linux saves system state into the swap partition and modifies the header to indicate that there's a hibernation image there instead of swap. On the next boot, the kernel sees the header indicating that it's a hibernation image, copies the contents of the swap partition back into RAM, and then jumps back into the old kernel code. What ensures that the hibernation image was actually written out by the kernel? Absolutely nothing, which means a motivated attacker with root access could turn off swap, write a hibernation image to the swap partition themselves, and then reboot. The kernel would happily resume into the attacker's image, giving the attacker control over what gets copied back into kernel space.

This is annoying, because normally when we think about attacks on swap we mitigate it by requiring an encrypted swap partition. But in this case, our attacker is root, and so already has access to the plaintext version of the swap partition. Disk encryption doesn't save us here. We need some way to verify that the hibernation image was written out by the kernel, not by root. And thankfully we have some tools for that.

Trusted Platform Modules (TPMs) are cryptographic coprocessors[7] capable of doing things like generating encryption keys and then encrypting things with them. You can ask a TPM to encrypt something with a key that's tied to that specific TPM - the OS has no access to the decryption key, and nor does any other TPM. So we can have the kernel generate an encryption key, encrypt part of the hibernation image with it, and then have the TPM encrypt it. We store the encrypted copy of the key in the hibernation image as well. On resume, the kernel reads the encrypted copy of the key, passes it to the TPM, gets the decrypted copy back and is able to verify the hibernation image.

That's great! Except root can do exactly the same thing. This tells us the hibernation image was generated on this machine, but doesn't tell us that it was done by the kernel. We need some way to be able to differentiate between keys that were generated in kernel and ones that were generated in userland. TPMs have the concept of "localities" (effectively privilege levels) that would be perfect for this. Userland is only able to access locality 0, so the kernel could simply use locality 1 to encrypt the key. Unfortunately, despite trying pretty hard, I've been unable to get localities to work. The motherboard chipset on my test machines simply doesn't forward any accesses to the TPM unless they're for locality 0. I needed another approach.

TPMs have a set of Platform Configuration Registers (PCRs), intended for keeping a record of system state. The OS isn't able to modify the PCRs directly. Instead, the OS provides a cryptographic hash of some material to the TPM. The TPM takes the existing PCR value, appends the new hash to that, and then stores the hash of the combination in the PCR - a process called "extension". This means that the new value of the TPM depends not only on the value of the new data, it depends on the previous value of the PCR - and, in turn, that previous value depended on its previous value, and so on. The only way to get to a specific PCR value is to either (a) break the hash algorithm, or (b) perform exactly the same sequence of writes. On system reset the PCRs go back to a known value, and the entire process starts again.

Some PCRs are different. PCR 23, for example, can be reset back to its original value without resetting the system. We can make use of that. The first thing we need to do is to prevent userland from being able to reset or extend PCR 23 itself. All TPM accesses go through the kernel, so this is a simple matter of parsing the write before it's sent to the TPM and returning an error if it's a sensitive command that would touch PCR 23. We now know that any change in PCR 23's state will be restricted to the kernel.

When we encrypt material with the TPM, we can ask it to record the PCR state. This is given back to us as metadata accompanying the encrypted secret. Along with the metadata is an additional signature created by the TPM, which can be used to prove that the metadata is both legitimate and associated with this specific encrypted data. In our case, that means we know what the value of PCR 23 was when we encrypted the key. That means that if we simply extend PCR 23 with a known value in-kernel before encrypting our key, we can look at the value of PCR 23 in the metadata. If it matches, the key was encrypted by the kernel - userland can create its own key, but it has no way to extend PCR 23 to the appropriate value first. We now know that the key was generated by the kernel.

But what if the attacker is able to gain access to the encrypted key? Let's say a kernel bug is hit that prevents hibernation from resuming, and you boot back up without wiping the hibernation image. Root can then read the key from the partition, ask the TPM to decrypt it, and then use that to create a new hibernation image. We probably want to prevent that as well. Fortunately, when you ask the TPM to encrypt something, you can ask that the TPM only decrypt it if the PCRs have specific values. "Sealing" material to the TPM in this way allows you to block decryption if the system isn't in the desired state. So, we define a policy that says that PCR 23 must have the same value at resume as it did on hibernation. On resume, the kernel resets PCR 23, extends it to the same value it did during hibernation, and then attempts to decrypt the key. Afterwards, it resets PCR 23 back to the initial value. Even if an attacker gains access to the encrypted copy of the key, the TPM will refuse to decrypt it.

And that's what this patchset implements. There's one fairly significant flaw at the moment, which is simply that an attacker can just reboot into an older kernel that doesn't implement the PCR 23 blocking and set up state by hand. Fortunately, this can be avoided using another aspect of the boot process. When you boot something via UEFI Secure Boot, the signing key used to verify the booted code is measured into PCR 7 by the system firmware. In the Linux world, the Shim bootloader then measures any additional keys that are used. By either using a new key to tag kernels that have support for the PCR 23 restrictions, or by embedding some additional metadata in the kernel that indicates the presence of this feature and measuring that, we can have a PCR 7 value that verifies that the PCR 23 restrictions are present. We then seal the key to PCR 7 as well as PCR 23, and if an attacker boots into a kernel that doesn't have this feature the PCR 7 value will be different and the TPM will refuse to decrypt the secret.

While there's a whole bunch of complexity here, the process should be entirely transparent to the user. The current implementation requires a TPM 2, and I'm not certain whether TPM 1.2 provides all the features necessary to do this properly - if so, extending it shouldn't be hard, but also all systems shipped in the past few years should have a TPM 2, so that's going to depend on whether there's sufficient interest to justify the work. And we're also at the early days of review, so there's always the risk that I've missed something obvious and there are terrible holes in this. And, well, given that it took almost 8 years to get the Lockdown patchset into mainline, let's not assume that I'm good at landing security code.

[1] Other architectures use different terminology here, such as "supervisor" and "user" mode, but it's broadly equivalent
[2] In theory rings 1 and 2 would allow you to run drivers with privileges somewhere between full kernel access and userland applications, but in reality we just don't talk about them in polite company
[3] This is how graphics worked in Linux before kernel modesetting turned up. XFree86 would just map your GPU's registers into userland and poke them directly. This was not a huge win for stability
[4] IOMMUs can help you here, by restricting the memory PCI devices can DMA to or from. The kernel then gets to allocate ranges for device buffers and configure the IOMMU such that the device can't DMA to anything else. Except that region of memory may still contain sensitive material such as function pointers, and attacks like this can still cause you problems as a result.
[5] This describes why I'm using "allowed" rather than "required" here
[6] Saving the system state to disk and powering down the platform entirely - significantly slower than suspending the system while keeping state in RAM, but also resilient against the system losing power.
[7] With some handwaving around "coprocessor". TPMs can't be part of the OS or the system firmware, but they don't technically need to be an independent component. Intel have a TPM implementation that runs on the Management Engine, a separate processor built into the motherboard chipset. AMD have one that runs on the Platform Security Processor, a small ARM core built into their CPU. Various ARM implementations run a TPM in Trustzone, a special CPU mode that (in theory) is able to access resources that are entirely blocked off from anything running in the OS, kernel or otherwise.

comments

Rusty Russell: Protected: Bitcoin Consensus and Solidarity

Thursday 18th of February 2021 03:29:02 AM

This content is password protected. To view it please enter your password below:

Password:

Kees Cook: security things in Linux v5.8

Tuesday 9th of February 2021 12:47:47 AM

Previously: v5.7

Linux v5.8 was released in August, 2020. Here’s my summary of various security things that caught my attention:

arm64 Branch Target Identification
Dave Martin added support for ARMv8.5’s Branch Target Instructions (BTI), which are enabled in userspace at execve() time, and all the time in the kernel (which required manually marking up a lot of non-C code, like assembly and JIT code).

With this in place, Jump-Oriented Programming (JOP, where code gadgets are chained together with jumps and calls) is no longer available to the attacker. An attacker’s code must make direct function calls. This basically reduces the “usable” code available to an attacker from every word in the kernel text to only function entries (or jump targets). This is a “low granularity” forward-edge Control Flow Integrity (CFI) feature, which is important (since it greatly reduces the potential targets that can be used in an attack) and cheap (implemented in hardware). It’s a good first step to strong CFI, but (as we’ve seen with things like CFG) it isn’t usually strong enough to stop a motivated attacker. “High granularity” CFI (which uses a more specific branch-target characteristic, like function prototypes, to track expected call sites) is not yet a hardware supported feature, but the software version will be coming in the future by way of Clang’s CFI implementation.

arm64 Shadow Call Stack
Sami Tolvanen landed the kernel implementation of Clang’s Shadow Call Stack (SCS), which protects the kernel against Return-Oriented Programming (ROP) attacks (where code gadgets are chained together with returns). This backward-edge CFI protection is implemented by keeping a second dedicated stack pointer register (x18) and keeping a copy of the return addresses stored in a separate “shadow stack”. In this way, manipulating the regular stack’s return addresses will have no effect. (And since a copy of the return address continues to live in the regular stack, no changes are needed for back trace dumps, etc.)

It’s worth noting that unlike BTI (which is hardware based), this is a software defense that relies on the location of the Shadow Stack (i.e. the value of x18) staying secret, since the memory could be written to directly. Intel’s hardware ROP defense (CET) uses a hardware shadow stack that isn’t directly writable. ARM’s hardware defense against ROP is PAC (which is actually designed as an arbitrary CFI defense — it can be used for forward-edge too), but that depends on having ARMv8.3 hardware. The expectation is that SCS will be used until PAC is available.

Kernel Concurrency Sanitizer infrastructure added
Marco Elver landed support for the Kernel Concurrency Sanitizer, which is a new debugging infrastructure to find data races in the kernel, via CONFIG_KCSAN. This immediately found real bugs, with some fixes having already landed too. For more details, see the KCSAN documentation.

new capabilities
Alexey Budankov added CAP_PERFMON, which is designed to allow access to perf(). The idea is that this capability gives a process access to only read aspects of the running kernel and system. No longer will access be needed through the much more powerful abilities of CAP_SYS_ADMIN, which has many ways to change kernel internals. This allows for a split between controls over the confidentiality (read access via CAP_PERFMON) of the kernel vs control over integrity (write access via CAP_SYS_ADMIN).

Alexei Starovoitov added CAP_BPF, which is designed to separate BPF access from the all-powerful CAP_SYS_ADMIN. It is designed to be used in combination with CAP_PERFMON for tracing-like activities and CAP_NET_ADMIN for networking-related activities. For things that could change kernel integrity (i.e. write access), CAP_SYS_ADMIN is still required.

network random number generator improvements
Willy Tarreau made the network code’s random number generator less predictable. This will further frustrate any attacker’s attempts to recover the state of the RNG externally, which might lead to the ability to hijack network sessions (by correctly guessing packet states).

fix various kernel address exposures to non-CAP_SYSLOG
I fixed several situations where kernel addresses were still being exposed to unprivileged (i.e. non-CAP_SYSLOG) users, though usually only through odd corner cases. After refactoring how capabilities were being checked for files in /sys and /proc, the kernel modules sections, kprobes, and BPF exposures got fixed. (Though in doing so, I briefly made things much worse before getting it properly fixed. Yikes!)

RISCV W^X detection
Following up on his recent work to enable strict kernel memory protections on RISCV, Zong Li has now added support for CONFIG_DEBUG_WX as seen for other architectures. Any writable and executable memory regions in the kernel (which are lovely targets for attackers) will be loudly noted at boot so they can get corrected.

execve() refactoring continues
Eric W. Biederman continued working on execve() refactoring, including getting rid of the frequently problematic recursion used to locate binary handlers. I used the opportunity to dust off some old binfmt_script regression tests and get them into the kernel selftests.

multiple /proc instances
Alexey Gladkov modernized /proc internals and provided a way to have multiple /proc instances mounted in the same PID namespace. This allows for having multiple views of /proc, with different features enabled. (Including the newly added hidepid=4 and subset=pid mount options.)

set_fs() removal continues
Christoph Hellwig, with Eric W. Biederman, Arnd Bergmann, and others, have been diligently working to entirely remove the kernel’s set_fs() interface, which has long been a source of security flaws due to weird confusions about which address space the kernel thought it should be accessing. Beyond things like the lower-level per-architecture signal handling code, this has needed to touch various parts of the ELF loader, and networking code too.

READ_IMPLIES_EXEC is no more for native 64-bit
The READ_IMPLIES_EXEC flag was a work-around for dealing with the addition of non-executable (NX) memory when x86_64 was introduced. It was designed as a way to mark a memory region as “well, since we don’t know if this memory region was expected to be executable, we must assume that if we need to read it, we need to be allowed to execute it too”. It was designed mostly for stack memory (where trampoline code might live), but it would carry over into all mmap() allocations, which would mean sometimes exposing a large attack surface to an attacker looking to find executable memory. While normally this didn’t cause problems on modern systems that correctly marked their ELF sections as NX, there were still some awkward corner-cases. I fixed this by splitting READ_IMPLIES_EXEC from the ELF PT_GNU_STACK marking on x86 and arm/arm64, and declaring that a native 64-bit process would never gain READ_IMPLIES_EXEC on x86_64 and arm64, which matches the behavior of other native 64-bit architectures that correctly didn’t ever implement READ_IMPLIES_EXEC in the first place.

array index bounds checking continues
As part of the ongoing work to use modern flexible arrays in the kernel, Gustavo A. R. Silva added the flex_array_size() helper (as a cousin to struct_size()). The zero/one-member into flex array conversions continue with over a hundred commits as we slowly get closer to being able to build with -Warray-bounds.

scnprintf() replacement continues
Chen Zhou joined Takashi Iwai in continuing to replace potentially unsafe uses of sprintf() with scnprintf(). Fixing all of these will make sure the kernel avoids nasty buffer concatenation surprises.

That’s it for now! Let me know if there is anything else you think I should mention here. Next up: Linux v5.9.

© 2021, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

Greg Kroah-Hartman: 8 bits are enough for a version number...

Friday 5th of February 2021 02:29:20 PM

As was pointed out to us stable kernel maintainers last week, the overflow of the .y release number was going to happen soon, and our proposed solution for it (use 16 bits instead of 8), turns out to be breaking a userspace-visable api.

As we can’t really break this, I did a release of the 4.4.256 and 4.9.256 releases today that contain nothing but a new version number. See the links for the full technical details if curious.

Right now I’m asking that everyone who uses these older kernel releases to upgrade to this release, and do a full rebuild of their systems in order to see what might, or might not, break. If problems happen, please let us know on the stable@vger.kernel.org mailing list as soon as possible as I can only hold off on doing new stable releases for these branches for a single week only (i.e. February 12, 2021).

Greg Kroah-Hartman: Helping out with LTS kernel releases

Wednesday 3rd of February 2021 04:37:33 PM

A recent email thread about “Why isn’t the 5.10 stable kernel listed as supported for 6 years yet!” on the linux-kernel mailing list ended up generating a bunch of direct emails to me asking what could different companies and individuals due to help out. What exactly was I looking for here?

Instead of having to respond to private emails with the same information over and over, I figured it was better to just put it here so that everyone can see what exactly I am expecting with regards to support in order to be able to maintain a kernel for longer than 2 years:

What I need help with

All I request is that people test the -rc releases when I announce them, and let me know if they work or not for their systems/workloads/tests/whatever.

If you look at the -rc announcements today, you will see a number of different people/groups responding with this information. If they want, they can provide a Tested-by: ... line that I will add to the release commit, or not, that’s up to them.

Here and here and here and here and here are all great examples of how people let me know that all is ok with the -rc kernels so that I know it is “safe” to do the release.

I also have a few companies send me private emails that all is good, there’s no requirement to announce this in public if you don’t want to (but it is nice, as kernel development should be done in public.)

Some companies can’t do tests on -rc releases due to their build infrastructures not handling that very well, so they email me after the stable release is out, saying all is good. Worst case, we end up reverting a patch in a released kernel, but it’s better to quickly do that based on testing than to miss it entirely because no one is testing at all.

And that’s it!

But, if you want to do more, I always really appreciate when people email me, or stable@vger.kernel.org, git commit ids that are needed to be backported to specific stable kernel trees because they found them in their testing/development efforts. You know what problems you hit better than anyone, and once those issues are found and fixed, making sure they get backported is a good thing, so I always want to know that.

Again, if you look on the stable@vger.kernel.org list, you will see different companies and developers providing backports of things they want backported, or just a list of the git commit ids if the backports apply cleanly.

Does that sound reasonable? I want to make sure that the LTS kernels that you rely on actually work for you without regressions, so testing is key, as is finding any fixes that are needed for them.

It’s not much, but I can’t do it alone :)

So, 6 years or not for 5.10?

The above is what I need in order to be able to support a kernel for 6 years, constant testing by users of the kernels. If we don’t have that, then why even do these releases because that must mean that no one is using them? So email me and let me know.

As of this point in time (February 3, 2021), I do not have enough committments by companies to help out with this effort to be able to say I can do this for 6 years right now (note, no response yet from the company that originally asked this question…) Hopefully that changes soon, and if it does, the kernel.org release page will be updated with the new date.

James Bottomley: Deploying Encrypted Images for Confidential Computing

Thursday 31st of December 2020 10:40:15 PM

In the previous post I looked at how you build an encrypted image that can maintain its confidentiality inside AMD SEV or Intel TDX. In this post I’ll discuss how you actually bring up a confidential VM from an encrypted image while preserving secrecy. However, first a warning: This post represents the state of the art and includes patches that are certainly not deployed in distributions and may not even be upstream, so if you want to follow along at home you’ll need to patch things like qemu, grub and OVMF. I should also add that, although I’m trying to make everything generic to confidential environments, this post is based on AMD SEV, which is the only confidential encrypted1 environment currently shipping.

The Basics of a Confidential Computing VM

At its base, current confidential computing environments are about using encrypted memory to run the virtual machine and guarding the encryption key so that the owner of the host system (the cloud service provider) can’t get access to it. Both SEV and TDX have the encryption technology inside the main memory controller meaning the L1 cache isn’t encrypted (still vulnerable to cache side channels) and DMA to devices must also be done via unencryped memory. This latter also means that both the BIOS and the Operating System of the guest VM must be enlightened to understand which pages to encrypted and which must not. For this reason, all confidential VM systems use OVMF2 to boot because this contains the necessary enlightening. To a guest, the VM encryption looks identical to full memory encryption on a physical system, so as long as you have a kernel which supports Intel or AMD full memory encryption, it should boot.

Each confidential computing system has a security element which sits between the encrypted VM and the host. In SEV this is an aarch64 processor called the Platform Security Processor (PSP) and in TDX it is an SGX enclave running Intel proprietary code. The job of the PSP is to bootstrap the VM, including encrypting the initial OVMF and inserting the encrypted pages. The security element also includes a validation certificate, which incorporates a Diffie-Hellman (DH) key. Once the guest owner obtains and validates the DH key it can use it to construct a one time ECDH encrypted bundle that can be passed to the security element on bring up. This bundle includes an encryption key which can be used to encrypt secrets for the security element and a validation key which can be used to verify measurements from the security element.

The way QEMU boots a Q35 machine is to set up all the configuration (including a disk device attached to the VM Image) load up the OVMF into rom memory and start the system running. OVMF pulls in the QEMU configuration and constructs the necessary ACPI configuration tables before executing grub and the kernel from the attached storage device. In a confidential VM, the first task is to establish a Guest Owner (the person whose encrypted VM it is) which is usually different from the Host Owner (the person running or controlling the Physical System). Ownership is established by transferring an encrypted bundle to the Secure Element before the VM is constructed.

The next step is for the VMM (QEMU in this case) to ask the secure element to provision the OVMF Firmware. Since the initial OVMF is untrusted, the Guest Owner should ask the Secure Element for an attestation of the memory contents before the VM is started. Since all paths lead through the Host Owner, who is also untrusted, the attestation contains a random nonce to prevent replay and is HMAC’d with a Guest Supplied key from the Launch Bundle. Once the Guest Owner is happy with the VM state, it supplies the Wrapped Key to the secure element (along with the nonce to prevent replay) and the Secure Element unwraps the key and provisions it to the VM where the Guest OS can use it for disc encryption. Finally, the enlightened guest reads the encrypted disk to unencrypted memory using DMA but uses the disk encryptor to decrypt it to encrypted memory, so the contents of the Encrypted VM Image are never visible to the Host Owner.

The Gaps in the System

The most obvious gap is that EFI booting systems don’t go straight from the OVMF firmware to the OS, they have to go via an EFI bootloader (grub, usually) which must be an efi binary on an unencrypted vFAT partition. The second gap is that grub must be modified to pick the disk encryption key out of wherever the Secure Element has stashed it. The third is that the key is currently stashed in VM memory before OVMF starts, so OVMF must know not to use or corrupt the memory. A fourth problem is that the current recommended way of booting OVMF has a flash drive for persistent variable storage which is under the control of the host owner and which isn’t part of the initial measurement.

Plugging The Gaps: OVMF

To deal with the problems in reverse order: the variable issue can be solved simply by not having a persistent variable store, since any mutable configuration information could be used to subvert the boot and leak the secret. This is achieved by stripping all the mutable variable handling out of OVMF. Solving key stashing simply means getting OVMF to set aside a page for a secret area and having QEMU recognise where it is for the secret injection. It turns out AMD were already working on a QEMU configuration table at a known location by the Reset Vector in OVMF, so the secret area is added as one of these entries. Once this is done, QEMU can retrieve the injection location from the OVMF binary so it doesn’t have to be specified in the QEMU Machine Protocol (QMP) command. Finally OVMF can protect the secret and package it up as an EFI configuration table for later collection by the bootloader.

The final OVMF change (which is in the same patch set) is to pull grub inside a Firmware Volume and execute it directly. This certainly isn’t the only possible solution to the problem (adding secure boot or an encrypted filesystem were other possibilities) but it is the simplest solution that gives a verifiable component that can be invariant across arbitrary encrypted boots (so the same OVMF can be used to execute any encrypted VM securely). This latter is important because traditionally OVMF is supplied by the host owner rather than being part of the VM image supplied by the guest owner. The grub script that runs from the combined volume must still be trusted to either decrypt the root or reboot to avoid leaking the key. Although the host owner still supplies the combined OVMF, the measurement assures the guest owner of its correctness, which is why having a fairly invariant component is a good idea … so the guest owner doesn’t have potentially thousands of different measurements for approved firmware.

Plugging the Gaps: QEMU

The modifications to QEMU are fairly simple, it just needs to scan the OVMF file to determine the location for the injected secret and inject it correctly using a QMP command.. Since secret injection is already upstream, this is a simple find and make the location optional patch set.

Plugging the Gaps: Grub

Grub today only allows for the manual input of the cryptodisk password. However, in the cloud we can’t do it this way because there’s no guarantee of a secure tty channel to the VM. The solution, therefore, is to modify grub so that the cryptodisk can use secrets from a provider, in addition to the manual input. We then add a provider that can read the efi configuration tables and extract the secret table if it exists. The current incarnation of the proposed patch set is here and it allows cryptodisk to extract a secret from an efisecret provider. Note this isn’t quite the same as the form expected by the upstream OVMF patch in its grub.cfg because now the provider has to be named on the cryptodisk command line thus

cryptodisk -s efisecret

but in all other aspects, Grub/grub.cfg works. I also discovered several other deviations from the initial grub.cfg (like Fedora uses /boot/grub2 instead of /boot/grub like everyone else) so the current incarnation of grub.cfg is here. I’ll update it as it changes.

Putting it All Together

Once you have applied all the above patches and built your version of OVMF with grub inside, you’re ready to do a confidential computing encrypted boot. However, you still need to verify the measurement and inject the encrypted secret. As I said before, this isn’t easy because, due to replay defeat requirements, the secret bundle must be constructed on the fly for each VM boot. From this point on I’m going to be using only AMD SEV as the example because the Intel hardware doesn’t yet exist and AMD kindly gave IBM research a box to play with (Anyone with a new EPYC 7xx1 or 7xx2 based workstation can likely play along at home, but check here). The first thing you need to do is construct a launch bundle. AMD has a tool called sev-tool to do this for you and the first thing you need to do is obtain the platform Diffie Hellman certificate (pdh.cert). The tool will extract this for you

sevtool --pdh_cert_export

Or it can be given to you by the cloud service provider (in this latter case you’ll want to verify the provenance using sevtool –validate_cert_chain, which contacts the AMD site to verify all the details). Once you have a trusted pdh.cert, you can use this to generate your own guest owner DH cert (godh.cert) which should be used only one time to give a semblance of ECDHE. godh.cert is used with pdh.cert to derive an encryption key for the launch bundle. You can generate this with

sevtool --generate_launch_blob <policy>

The gory details of policy are in the SEV manual chapter 3, but most guests use 1 which means no debugging. This command will generate the godh.cert, the launch_blob.bin and a tmp_tk.bin file which you must save and keep secure because it contains the Transport Encryption and Integrity Keys (TEK and TIK) which will be used to encrypt the secret. Figuring out the qemu command line options needed to launch and pause a SEV guest is a bit of a palaver, so here is mine. You’ll likely need to change things, like the QMP port and the location of your OVMF build and the launch secret.

Finally you need to get the launch measure from QMP, verify it against the sha256sum of OVMF.fd and create the secret bundle with the correct GUID headers. Since this is really fiddly to do with sevtool, I wrote this python script3 to do it all (note it requires qmp.py from the qemu git repository). You execute it as

sevsecret.py --passwd <disk passwd> --tiktek-file <location of tmp_tk.bin> --ovmf-hash <hash> --socket <qmp socket>

And it will verify the launch measure and encrypt the secret for the VM if the measure is correct and start the VM. If you got everything correct the VM will simply boot up without asking for a password (if you inject the wrong secret, it will still ask). And there you have it: you’ve booted up a confidential VM from an encrypted image file. If you’re like me, you’ll also want to fire up gdb on the qemu process just to show that the entire memory of the VM is encrypted …

Conclusions and Caveats

The above script should allow you to boot an encrypted VM anywhere: locally or in the cloud, provided you can access the QMP port (most clouds use libvirt which introduces yet another additional layering pain). The biggest drawback, if you refer to the diagram, is the yellow box: you must trust the secret element, which in both Intel and AMD is proprietary4, in order to get confidential computing to work. Although there is hope that in future the secret element could be fully open source, it isn’t today.

The next annoyance is that launching a confidential VM is high touch requiring collaboration from both the guest owner and the host owner (due to the anti-replay nonce). For a single launch, this is a minor annoyance but for an autoscaling (launch VMs as needed) platform it becomes a major headache. The solution seems to be to have some Hardware Security Module (HSM), like the cloud uses today to store encryption keys securely, and have it understand how to measure and launch encrypted VMs on behalf of the guest owner.

The final conclusion to remember is that confidentiality is not security: your VM is as exploitable inside a confidential encrypted VM as it was outside. In many ways confidentiality and security are opposites, in that security in part requires reducing the trusted code and confidentiality requires pulling as much as possible inside. Confidential VMs do have an answer to the Cloud trust problem since the enterprise can now deploy VMs without fear of tampering by the cloud provider, but those VMs are as insecure in the cloud as they were in the Enterprise Data Centre. All of this argues that Confidential Computing, while an important milestone, is only one step on the journey to cloud security.

Patch Status

The OVMF patches are upstream (including modifications requested by Intel for TDX). The QEMU and grub patch sets are still on the lists.

Paul E. Mc Kenney: Parallel Programming: December 2020 Update

Wednesday 30th of December 2020 05:33:25 AM
This release of Is Parallel Programming Hard, And, If So, What Can You Do About It? features numerous improvments:

 


  1. LaTeX and build-system upgrades (including helpful error checking and reporting), formatting improvements (including much nicer display of hyperlinks and of Quick Quizzes, polishing of numerous figures and tables, plus easier builds for A4 paper), refreshing of numerous broken URLs, an improved “make help” command (see below), improved FAQ-BUILD material, and a prototype index, all courtesy of Akira Yokosawa.
  2. A lengthy Quick Quiz on the relationship of half-barriers, compilers, CPUs, and locking primitives, courtesy of Patrick Yingxi Pan.
  3. Updated performance results throughout the book, courtesy of a large x86 system kindly provided by Facebook.
  4. Compiler tricks, RCU semantics, and other material from the Linux-kernel memory model added to the memory-ordering and tools-of-the-trade chapters.
  5. Improved discussion of non-blocking-synchronization algorithms.
  6. Many new citations, cross-references, fixes, and touchups throughout the book.
A number of issues were spotted by Motohiro Kanda in the course of his translation of this book to Japanese, and Borislav Petkov, Igor Dzreyev, and Junchang Wang also provided much-appreciated fixes.

The output of the aforementioned make help is as follows:
Official targets (Latin Modern Typewriter for monospace font): Full, Abbr. perfbook.pdf, 2c: (default) 2-column layout perfbook-1c.pdf, 1c: 1-column layout Set env variable PERFBOOK_PAPER to change paper size: PERFBOOK_PAPER=A4: a4paper PERFBOOK_PAPER=HB: hard cover book other (default): letterpaper make help-full" will show the full list of available targets.
The following excerpt of the make help-full command's output might be of interest to those who find Quick Quizzes distracting:
Experimental targets: Full, Abbr. perfbook-qq.pdf, qq: framed Quick Quizzes perfbook-nq.pdf, nq: no inline Quick Quizzes (chapterwise Answers)
Thus, the make nq command creates a perfbook-nq.pdf with Quick Quizzes and their answers grouped at the end of each chapter, in the usual textbook style, while still providing PDF navigation from each Quick Quiz to the relevant portion of that chapter.

Finally, this release also happens to be the first release candidate for the long-awaited Second Edition, which should be available shortly.

James Bottomley: Building Encrypted Images for Confidential Computing

Wednesday 23rd of December 2020 06:10:08 PM

With both Intel and AMD announcing confidential computing features to run encrypted virtual machines, IBM research has been looking into a new format for encrypted VM images. The first question is why a new format, after all qcow2 only recently deprecated its old encrypted image format in favour of luks. The problem is that in confidential computing, the guest VM runs inside the secure envelope but the host hypervisor (including the QEMU process) is untrusted and thus runs outside the secure envelope and, unfortunately, even for the new luks format, the encryption of the image is handled by QEMU and so the encryption key would be outside the secure envelope. Thus, a new format is needed to keep the encryption key (and, indeed, the encryption mechanism) within the guest VM itself. Fortunately, encrypted boot of Linux systems has been around for a while, and this can be used as a practical template for constructing a fully confidential encrypted image format and maintaining that confidentiality within a hostile cloud environment. In this article, I’ll explore the state of the art in encrypted boot, constructing EFI encrypted boot images, and finally, in the follow on article, look at deploying an encrypted image into a confidential environment and maintaining key secrecy in the cloud.

Encrypted Boot State of the Art

Luks and the cryptsetup toolkit have been around for a while and recently (in 2018), the luks format was updated to version 2. However, actually booting a linux kernel from an encrypted partition has always been a bit of a systems problem, primarily because the bootloader (grub) must decrypt the partition to actually load the kernel. Fortunately, grub can do this, but unfortunately the current grub in most distributions (2.04) can only read the version 1 luks format. Secondly, the user must type the decryption passphrase into grub (so it can pull the kernel and initial ramdisk out of the encrypted partition to boot them), but grub currently has no mechanism to pass it on to the initial ramdisk for mounting root, meaning that either the user has to type their passphrase twice (annoying) or the initial ramdisk itself has to contain a file with the disk passphrase. This latter is the most commonly used approach and only has minor security implications when the system is in motion (the ramdisk and the key file must be root read only) and the password is protected at rest by the fact that the initial ramdisk is also on the encrypted volume. Even more annoying is the fact that there is no distribution standard way of creating the initial ramdisk. Debian (and Ubuntu) have the most comprehensive documentation on how to do this, so the next section will look at the much less well documented systemd/dracut mechanism.

Encrypted Boot for Systemd/Dracut

Part of the problem here seems to be less that stellar systems co-ordination between the two components. Additionally, the way systemd supports passphraseless encrypted volumes has been evolving for a while but changed again in v246 to mirror the Debian method. Since cloud images are usually pretty up to date, I’ll describe this new way. Each encrypted volume is referred to by UUID (which will be the UUID of the containing partition returned by blkid). To get dracut to boot from an encrypted partition, you must pass in

rd.luks.uuid=<UUID>

but you must also have a key file named

/etc/cryptsetup-keys.d/luks-<UUID>.key

And, since dracut hasn’t yet caught up with this, you usually need a cryptodisk.conf file in /etc/dracut.conf.d/ which contains

install_items+=" /etc/cryptsetup-keys.d/* " Grub and EFI Booting Encrypted Images

Traditionally grub is actually installed into the disk master boot record, but for EFI boot that changed and the disk (or VM image) must have an EFI System partition which is where the grub.efi binary is installed. Part of the job of the grub.efi binary is to find the root partition and source the /boot/grub1/grub.cfg. When you install grub on an EFI partition a search for the root by UUID is actually embedded into the grub binary. Another problem is likely that your distribution customizes the location of grub and updates the boot variables to tell the system where it is. However, a cloud image can’t rely on the boot variables and must be installed in the default location (\EFI\BOOT\bootx64.efi). This default location can be achieved by adding the –removable flag to grub-install.

For encrypted boot, this becomes harder because the grub in the EFI partition must set up the cryptographic location by UUID. However, if you add

GRUB_ENABLE_CRYPTODISK=y

To /etc/default/grub it will do the necessary in grub-install and grub-mkconfig. Note that on Fedora, where every other GRUB_ENABLE parameter is true/false, this must be ‘y’, unfortunately grub-install will look for =y not =true.

Putting it all together: Encrypted VM Images

Start by extracting the root of an existing VM image to a tar file. Make sure it has all the tools you will need, like cryptodisk and grub-efi. Create a two partition raw image file and loopback mount it (I usually like 4GB) with a small efi partition (p1) and an encrypted root (p2):

truncate -s 4GB disk.img parted disk.img mklabel gpt parted disk.img mkpart primary 1Mib 100Mib parted disk.img mkpart primary 100Mib 100% parted disk.img set 1 esp on parted disk.img set 1 boot on

Now setup the efi and cryptosystem (I use ext4, but it’s not required). Note at this time luks will require a password. Use a simple one and change it later. Also note that most encrypted boot documents advise filling the encrypted partition with random numbers. I don’t do this because the additional security afforded is small compared with the advantage of converting the raw image to a smaller qcow2 one.

losetup -P -f disk.img # assuming here it uses loop0 l=($(losetup -l|grep disk.img)) # verify with losetup -l mkfs.vfat ${l}p1 blkid ${l}p1 # remember the EFI partition UUID cryptsetup --type luks1 luksFormat ${l}p2 # choose temp password blkid ${l}p2 # remember this as <UUID> you'll need it later cryptsetup luksOpen ${l}p2 cr_root mkfs.ext4 /dev/mapper/cr_root mount /dev/mapper/cr_root /mnt tar -C /mnt -xpf <vm root tar file> for m in run sys proc dev; do mount --bind /$m /mnt/$m; done chroot /mnt

Create or modify /etc/fstab to have root as /dev/disk/cr_root and the EFI partition by label under /boot/efi. Now set up grub for encrypted boot2

echo "GRUB_ENABLE_CRYPTODISK=y" >> /etc/default/grub mount /boot/efi grub-install --removable --target=x86_64-efi grub-mkconfig -o /boot/grub/grub.cfg

For Debian, you’ll need to add an /etc/crypttab entry for the encrypted disk:

cr_root UUID=<uuid> luks none

And then re-create the initial ramdisk. For dracut systems, you’ll have to modify /etc/default/grub so the GRUB_CMDLINE_LINUX has a rd.luks.uuid=<UUID> entry. If this is a selinux based distribution, you may also have to trigger a relabel.

Now would also be a good time to make sure you have a root password you know or to install /root/.ssh/authorized_keys. You should unmount all the binds and /mnt and try EFI booting the image. You’ll still have to type the password a couple of times, but once the image boots you’re operating inside the encrypted envelope. All that remains is to create a fast boot high entropy low iteration password and replace the existing one with it and set the initial ramdisk to use it. This example assumes your image is mounted as SCSI disk sda, but it may be a virtual disk or some other device.

dd if=/dev/urandom bs=1 count=33|base64 -w 0 > /etc/cryptsetup-keys.d/luks-<UUID>.key chmod 600 /etc/cryptsetup-keys.d/luks-<UUID>.key cryptsetup --key-slot 1 luksAddKey /dev/sda2 # permanent recovery key cryptsetup --key-slot 0 luksRemoveKey /dev/sda2 # remove temporary cryptsetup --key-slot 0 --iter-time 1 luksAddKey /dev/sda2 /etc/cryptsetup-keys.d/luks-<UUID>.key

Note the “-w 0” is necessary to prevent the password from having a trailing newline which will make it difficult to use. For mkinitramfs systems, you’ll now need to modify the /etc/crypttab entry

cr_root UUID=<UUID> /etc/cryptsetup-keys.d/luks-<UUID>.key luks

For dracut you need the key install hook in /etc/dracut.conf.d as described above and for Debian you need the keyfile pattern:

echo "KEYFILE_PATTERN=\"/etc/cryptsetup-keys.d/*\"" >>/etc/cryptsetup-initramfs/conf-hook

You now rebuild the initial ramdisk and you should now be able to boot the cryptosystem using either the high entropy password or your rescue one and it should only prompt in grub and shouldn’t prompt again. This image file is now ready to be used for confidential computing.

Michael Kerrisk (manpages): man-pages-5.10 is released

Tuesday 22nd of December 2020 09:56:44 AM
Starting with this release, Alejandro (Alex) Colomar has joined me as project comaintainer, and we've released man-pages-5.10. The release tarball is available on kernel.org. The browsable online pages can be found on man7.org. The Git repository for man-pages is available on kernel.org.

This release resulted from patches, bug reports, reviews, and comments from around 25 contributors. The release includes just over 150 commits that changed around 140 pages.

The most notable of the changes in man-pages-5.10 are the following:
  • I've added documentation of the faccessat2() system call (new in Linux 5.10) to the access(2) manual page.
  •  I've added a new subsection to the signal(7) manual page that provides a "big picture" of what happens when a signal handler is executed.

Pete Zaitcev: Google outage

Wednesday 16th of December 2020 06:20:27 AM

It's very funny to hear about people who were unable to turn on their lights because their houses were "smart". Not a good look for Google Nest! But I had a real problem:

Google outage crashed my Thunderbird so good that the only fix is to delete the ~/.thunderbird and re-add all accounts.

Yes, really.

Dave Airlie (blogspot): lavapipe: a *software* swrast vulkan layer FAQ

Friday 13th of November 2020 02:16:10 AM

(project was renamed from vallium to lavapipe)

I had some requirements for writing a vulkan software rasterizer within the Mesa project. I took some time to look at the options and realised that just writing a vulkan layer on top of gallium's llvmpipe would be a good answer for this problem. However in doing so I knew people would ask why this wouldn't work for a hardware driver.

tl;dr DO NOT USE LAVAPIPE OVER A GALLIUM HW DRIVER,

What is lavapipe?The lavapipe layer is a gallium frontend. It takes the Vulkan API and roughly translates it into the gallium API.How does it do that?Vulkan is a lowlevel API, it allows the user to allocate memory, create resources, record command buffers amongst other things. When a hw vulkan driver is recording a command buffer, it is putting hw specific commands into it that will be run directly on the GPU. These command buffers are submitted to queues when the app wants to execute them.

Gallium is a context level API, i.e. like OpenGL/D3D10. The user has to create resources and contexts and the driver internally manages command buffers etc. The driver controls internal flushing and queuing of command buffers.
 In order to bridge the gap, the lavapipe layer abstracts the gallium context into a separate thread of execution. When recording a vulkan command buffer it creates a CPU side command buffer containing an encoding of the Vulkan API. It passes that recorded CPU command buffer to the thread on queue submission. The thread then creates a gallium context, and replays the whole CPU recorded command buffer into the context, one command at a time.That sounds horrible, isn't it slow?Yes.Why doesn't that matter for *software* drivers?Software rasterizers are a very different proposition from an overhead point of view than real hardware. CPU rasterization is pretty heavy on the CPU load, so nearly always 90% of your CPU time will be in the rasterizer and fragment shader. Having some minor CPU overheads around command submission and queuing isn't going to matter in the overall profile of the user application. CPU rasterization is already slow, the Vulkan->gallium translation overhead isn't going to be the reason for making it much slower. For real HW drivers which are meant to record their own command buffers in the GPU domain and submit them direct to the hw, adding in a CPU layer that just copies the command buffer data is a massive overhead and one that can't easily be removed from the lavapipe layer.
The lavapipe execution context is also pretty horrible, it has to connect all the state pieces like shaders etc to the gallium context, and disconnect them all at the end of each command buffer. There is only one command submission queue, one context to be used. A lot of hardware exposes more queues etc that this will never model.
I still don't want to write a vulkan driver, give me more reasons.Pipeline barriers:
Pipeline barriers in Vulkan are essential to efficient driver hw usage. They are one of the most difficult to understand and hard to get right pieces of writing a vulkan driver. For a software rasterizer they are also mostly unneeded. When I get a barrier I just completely hardflush the gallium context because I know the sw driver behind it. For a real hardware driver this would be a horrible solution. You spend a lot of time trying to make anything optimal here.Memory allocation:
Vulkan is built around the idea of separate memory allocation and objects binding to those allocations. Gallium is built around object allocation with the memory allocs happening implicitly. I've added some simple memory allocation objects to the gallium API for swrast. These APIs are in no way useful for hw drivers. There is no way to expose memory types or heaps from gallium usefully. The current memory allocation API works for software drivers because I know all they want is an aligned_malloc. There is no decent way to bridge this gap without writing a new gallium API that looks like Vulkan. (in which case just write a vulkan driver already).
Can this make my non-Vulkan capable hw run Vulkan?

No. If the hardware can't do virtual memory properly, or expose features for vulkan this can't be fixed with a software layer that just introduces overhead.


Dave Airlie (blogspot): Linux graphics, why sharing code with Windows isn't always a win.

Thursday 12th of November 2020 12:05:45 AM

A recent article on phoronix has some commentary about sharing code between Windows and Linux, and how this seems to be a metric that Intel likes.

I'd like to explore this idea a bit and explain why I believe it's bad for Linux based distros and our open source development models in the graphics area.

tl;dr there is a big difference between open source released and open source developed projects in terms of sustainability and community.

The Linux graphics stack from a distro vendor point of view is made up of two main projects, the Linux kernel and Mesa userspace. These two projects are developed in the open with completely open source vendor agnostic practices. There is no vendor controlling either project and both projects have a goal of try to maximise shared code and shared processes/coding standards across drivers from all vendors.

This cross-vendor synergy is very important to the functioning ecosystem that is the Linux graphics stack. The stack also relies in some places on the LLVM project, but again LLVM upstream is vendor agnostic and open source developed.

The value to distros is they have central places to pick up driver stacks with good release cycles and a minimal number of places they have to deal with to interact with those communities. Now usually hardware vendors don't see the value in the external communities as much as Linux distros do. From a hardware vendor internal point of view they see more benefit in creating a single stack shared between their Windows and Linux to maximise their return on investment, or make their orgchart prettier or produce less powerpoints about why their orgchart isn't optimal.

A shared Windows/Linux stack as such is a thing the vendors want more for their own reasons than for the benefit of the Linux community.

Why is it a bad idea?

I'll start by saying it's not always a bad idea. In theory it might be possible to produce such a stack with the benefits of open source development model, however most vendors seem to fail at this. They see open source as a release model, they develop internally and shovel the results over the fence into a github repo every X weeks after a bunch of cycles. They build products containing these open source pieces, but they never expend the time building projects or communities around them.

As an example take AMDVLK vs radv. I started radv because AMD had been promising the world an open source Vulkan driver for Linux that was shared with their Windows stack. Even when it was delivered it was open source released but internally developed. There was no avenue for community participation in the driver development. External contributors were never on the same footing as an AMD employee. Even AMD employees on different teams weren't on the same footing. Compare this to the radv project in Mesa where it allowed Valve to contribute the ACO backend compiler and provide better results than AMD vendor shared code could ever have done, with far less investement and manpower.

Intel have a non-mesa compiler called Intel Graphics Compiler mentioned in the article. This is fully developed by intel internally, there is little info on project direction or how to get involved or where the community is. There doesn't seem to be much public review, patches seem to get merged to the public repo by igcbot which may mean they are being mirrored from some internal repo. There are not using github merge requests etc. Compare this to development of a Mesa NIR backend where lots of changes are reviewed and maximal common code sharing is attempted so that all vendors benefit from the code.

One area where it has mostly sort of worked out what with the AMD display code in the kernel. I believe this code to be shared with their Windows driver (but I'm not 100% sure). They do try to engage with community changes to the code, but the code is still pretty horrible and not really optimal on Linux. Integrating it with atomic modesetting and refactoring was a pain. So even in the best case it's not an optimal outcome even for the vendor. They have to work hard to make the shared code be capable of supporting different OS interactions.

How would I do it?

If I had to share Windows/Linux driver stack I'd (biased opinion) start from the most open project and bring that into the closed projects. I definitely wouldn't start with a new internal project that tries to disrupt both. For example if I needed to create a Windows GL driver, I could:

a) write a complete GL implementation and throw it over the wall every few weeks. and make Windows/Linux use it, Linux users lose out on the shared stack, distros lose out on one dependency instead having to build a stack of multiple per vendor deps, Windows gains nothing really, but I'm so in control of my own destiny (communities don't matter).

b) use Mesa and upstream my driver to share with the Linux stack, add the Windows code to the Mesa stack. I get to share the benefits of external development by other vendors and Windows gains that benefit, and Linux retains the benefits to it's ecosystem.

A warning then to anyone wishing for more vendor code sharing between OSes it generally doesn't end with Linux being better off, it ends up with Linux being more fragmented, harder to support and in the long run unsustainable.


Brendan Gregg: BPF binaries: BTF, CO-RE, and the future of BPF perf tools

Wednesday 4th of November 2020 08:00:00 AM
Two new technologies, BTF and CO-RE, are paving the way for BPF to become a billion dollar industry. Right now there are many BPF (eBPF) startups building networking, security, and performance products (and more in stealth), yet requiring customers to install LLVM, Clang, and kernel headers – which can consume over 100 Mbytes of storage – to use BPF is an adoption drag. BTF and CO-RE eliminate these dependencies at runtime, not only making BPF more practical for embedded Linux environments, but for adoption everywhere. These technologies are: - BTF: BPF Type Format, which provides struct information to avoid needing Clang and kernel headers. - CO-RE: BPF Compile-Once Run-Everywhere, which allows compiled BPF bytecode to be relocatable, avoiding the need for recompilation by LLVM. Clang and LLVM are still required for compilation, but the result is a lightweight ELF binary that includes the precompiled BPF bytecode and can be run everywhere. The BCC project has a collection of these, called libbpf tools. As an example, I ported over my opensnoop(8) tool: # ./opensnoop PID COMM FD ERR PATH 27974 opensnoop 28 0 /etc/localtime 1482 redis-server 7 0 /proc/1482/stat 1657 atlas-system-ag 3 0 /proc/stat […] This opensnoop(8) is an ELF binary that doesn't use libLLVM or libclang: # file opensnoop opensnoop: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/l, for GNU/Linux 3.2.0, BuildID[sha1]=b4b5320c39e5ad2313e8a371baf5e8241bb4e4ed, with debug_info, not stripped # ldd opensnoop linux-vdso.so.1 (0x00007ffddf3f1000) libelf.so.1 => /usr/lib/x86_64-linux-gnu/libelf.so.1 (0x00007f9fb7836000) libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9fb7619000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9fb7228000) /lib64/ld-linux-x86-64.so.2 (0x00007f9fb7c76000) # ls -lh opensnoop opensnoop.stripped -rwxr-xr-x 1 root root 645K Feb 28 23:18 opensnoop -rwxr-xr-x 1 root root 151K Feb 28 23:33 opensnoop.stripped ... and stripped is only 151 Kbytes. Now imagine a BPF product: instead of requiring customers install various heavyweight (and brittle) dependencies, a BPF agent may now be a single tiny binary that works on any kernel that has BTF. ## How this works It's not just a matter of saving the BPF bytecode in ELF and then sending it to any other kernel. Many BPF programs walk kernel structs that can change from one kernel version to another. Your BPF bytecode may still execute on different kernels, but it may be reading the wrong struct offsets and printing garbage output! opensnoop(8) doesn't walk kernel structs since it instruments stable tracepoints and their arguments, but many other tools do. This is an issue of *relocation*, and both BTF and CO-RE solve this for BPF binaries. BTF provides type information so that struct offsets and other details can be queried as needed, and CO-RE records which parts of a BPF program need to be rewritten, and how. CO-RE developer Andrii Nakryiko has written long posts explaining this in more depth: [BPF Portability and CO-RE] and [BTF Type Information]. ## CONFIG_DEBUG_INFO_BTF=y These new BPF binaries are only possible if this kernel config option is set. It adds about 1.5 Mbytes to the kernel image (this is tiny in comparison to DWARF debuginfo, which can be hundreds of Mbytes). Ubuntu 20.10 has already made this config option the default, and all other distros should follow. Note to distro maintainers: it requires pahole >= 1.16. ## The future of BPF performance tools, BCC Python, and bpftrace For BPF performance tools, you should start with running [BCC] and [bpftrace] tools, and then coding in bpftrace. The BCC tools should eventually be switched from Python to libbpf C under the hood, but will work the same. **Coding performance tools in BCC Python is now considered deprecated** as we move to libbpf C with BTF and CO-RE (although we still have library work to do, such as for USDT support, so the Python versions will be needed for a while). Note that there are other use cases of BCC that may continue to use the Python interface; both BPF co-maintainer Alexei Starovoitov and myself briefly discussed this on [iovisor-dev]. My [BPF Performance Tools] book focused on running BCC tools and coding in bpftrace, and that doesn't change. However, **Appendix C's Python programming examples are now considered deprecated.** Apologies for the inconvenience. Fortunately it's only 15 pages of appendix material out of the 880-page book. What about bpftrace? It does support BTF, and in the future we're looking at reducing its installation footprint as well (it can currently get to [29 Mbytes], and we think it can go a lot smaller). Given an average libbpf program size of 229 Kbytes (based on the current libbpf tools, stripped), and an average bpftrace program size of 1 Kbyte (my book tools), a large collection of bpftrace tools plus the bpftrace binary may become a smaller installation footprint than the equivalent in libbpf. Plus the bpftrace versions can be modified on the fly. libbpf is better suited for more complex and mature tools that needs custom arguments and libraries. As screenshots, the future of BPF performance tools is this: # ls /usr/share/bcc/tools /usr/sbin/*.bt argdist drsnoop mdflush pythongc tclobjnew bashreadline execsnoop memleak pythonstat tclstat [...] /usr/sbin/bashreadline.bt /usr/sbin/mdflush.bt /usr/sbin/tcpaccept.bt /usr/sbin/biolatency.bt /usr/sbin/naptime.bt /usr/sbin/tcpconnect.bt [...] ... and this: # bpftrace -e 'BEGIN { printf("Hello, World!\n"); }' Attaching 1 probe... Hello, World! ^C ... and **not** this: #!/usr/bin/python from bcc import BPF from bcc.utils import printb prog = """ int hello(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; } """ [...] Thanks to Yonghong Song (Facebook) for leading development of BTF, Andrii Nakryiko (Facebook) for leading development of CO-RE, and everyone else involved in making this happen. [BPF Portability and CO-RE]: https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html [BTF Type Information]: https://facebookmicrosites.github.io/bpf/blog/2018/11/14/btf-enhancement.html [BPF Performance Tools]: /bpf-performance-tools-book.html [29 Mbytes]: https://github.com/iovisor/bpftrace/issues/342 [iovisor-dev]: https://lists.iovisor.org/g/iovisor-dev/topic/future_of_bcc_python_tools/77827559?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,77827559 [BCC]: https://github.com/iovisor/bcc [bpftrace]: https://github.com/iovisor/bpftrace

More in Tux Machines

IBM/Red Hat/Fedora Leftovers

  • Siemens, IBM, Red Hat Launch Hybrid Cloud Initiative to Increase Real-time Value of Industrial IoT Data

    Siemens, IBM and Red Hat today announced a new collaboration that will use a hybrid cloud designed to deliver an open, flexible and more secure solution for manufacturers and plant operators to drive real-time value from operational data. In one month, a single manufacturing site can generate more than 2,200 terabytes of data according to a report by IBM – yet most data goes unanalyzed. Through the joint initiative, Siemens Digital Industries Software will apply IBM’s open hybrid cloud approach, built on Red Hat OpenShift, to extend the deployment flexibility of MindSphere®, the industrial IoT as a service solution from Siemens. This will enable customers to run MindSphere on-premise, unlocking speed and agility in factory and plant operations, as well as through the cloud for seamless product support, updates and enterprise connectivity.

  • IT leaders see open source as higher quality

    While enterprises believe open source software provides benefits including higher quality software and innovations, they also perceive barriers to adoption including levels of support and compatibility, according to a Red Hat report assessing enterprise open source usage. Curiously, security shows up as both a positive and negative in the report, with open source seen as offering better security but the security of the code seen as a barrier. Released on March 2, the 2021 State of Enterprise Open Source report covers data collected from interviews with 1,250 IT leaders worldwide, who were not necessarily Red Hat customers, Red Hat said.

  • Operators over easy: an introduction to Kubernetes Operators

    You've probably been hearing a lot about Kubernetes Operators, but if you don't work directly with Red Hat OpenShift or another Kubernetes distribution you may not know precisely what an Operator is. In this post, we'll explain what Operators are and why they're important. To better understand the "what" and the "how" about Kubernetes Operators, we need to understand the problem(s) that motivated the need for Kubernetes Operators. Kubernetes is notorious in its ability to integrate and facilitate declarative configuration and automation. This was out-of-the-box manageable for most stateless applications. However, for stateful applications this was a bit problematic. How do you manage and persist the state of your application and it’s dependencies? How do you keep the rest of your application going when you add or remove dependencies? Of course, much of this management was done manually and/or required additional personnel resources to help manage (i.e., DevOps) and, in general, required more of your attention. Much of these pains, boiled down to one ultimate question at hand: How do you effectively automate stateful applications on Kubernetes? That answer came in the form of what we call Kubernetes Operators.

  • Friday’s Fedora Facts: 2021-09

    Here’s your weekly Fedora report. Read what happened this week and what’s coming up. Your contributions are welcome (see the end of the post)! The Beta freeze is underway. The Fedora Linux 34 Beta Go/No-Go meeting is Thursday. I have weekly office hours on Wednesdays in the morning and afternoon (US/Eastern time) in #fedora-meeting-1. Drop by if you have any questions or comments about the schedule, Changes, elections, or anything else. See the upcoming meetings for more information.

  • Colin Walters: Why I work on OpenShift and Fedora/RHEL

    Every weekday for many years now I’ve woken up, dropped my kids off at school, then grabbed a coffee and sat down at my computer to work on OpenShift and Fedora+RHEL. Doing this for so long, over time I’ve thought about and refined the why I do this, and I want to write it down so that I can refer to this in various places. Some of this is a more condensed/rephrased variant of this blog post. I was inspired to be here originally (over 20 years ago) by the Free Software movement – one thing in particular I remember is seeing the Emacs start screen linking to the FSF website on our school’s Solaris workstations (In app advertisement worked!). Along with that, one thing I always found fascinating about software in general is the feeling of "the power of creation" – I can type something and make it happen. Since then, software has become much, much more foundational to our society (in some cases, probably too much re: social media, etc). In particular here the rise of software-as-a-service and the public clouds. And while we say "public" which has connotations of "openness" – these are all very proprietary clouds.

Games: Proton, Crusader Kings III, Rogue State Revolution, and plus-x

  • Proton Has Enabled 7000 Windows Games to Run on Linux

    We are reaching another milestone with ProtonDB: we are very close to 7000 Windows games confirmed to be working out of the box with Proton on Linux. Proton has been receiving many updates in the past few months as well, with the introduction of the Soldier Linux runtime container and Proton Experimental on top of the regular Proton releases. We are still getting about 100 new titles working flawlessly (according to user reports) on a monthly basis, which is a very healthy and steady growth. Another point is the percentage of Windows games working out of the box in Proton over time. The number has been close to 50% since for a long time and seems to be fairly stable.

  • Brace yourself, Winter is coming to Crusader Kings III | GamingOnLinux

    As if you didn't have enough problems with backstabbers, finding someone to marry and keeping your kingdom together - winter is coming to Crusader Kings III in the 1.3 update. This will be a free update for everyone that drops along side the first DLC. While there's no date, we should find out a little bit more when the Paradox Insider event happens on March 13. Fear not though, we'll keep an eye out for any interesting announcements and let you know after the event. When it comes to winter, snow will be heading to Crusader Kings III making the already difficult world much harsher overall. The map will gradually get covered in snow and Paradox said their system is pretty flexible so they can control where it flows. It's not just a cosmetic change though and does a few interesting things. For starters, there's going to be variants of it like mild and harsh winters, including visual effects to show the differences.

  • Political strategy game Rogue State Revolution gets a demo and a release date

    Rogue State Revolution from LRDGames, Inc. (Deep Sixed, Precipice) and publisher Modern Wolf is an upcoming challenging roguelike geopolitical thriller strategy game. In the game you take control of the presidency and rebuild, reform and prepare for new challenges as the Glorious People's Republic of Basenji becomes a new political, economic and cultural hotspot. The developer just announced it's going to release on March 18 with full Linux support.

  • plus-x is a simple tool to help developers on Windows set Linux permissions for games | GamingOnLinux

    Here's a small and very useful sounding application from game developer Cheeseness. It's called plus-x and the aim is to allow developers on Windows to set the correct permissions on Linux executables. The problem: when game developers put out a Linux build and then zip it up for download, Linux users download it and then often need to manually set permissions on the executable for it to be launched. plus-x gets around that by allowing developers to inspect the package and then set the correct permissions.



Qubes OS 4.0.4 has been released!

We’re pleased to announce the release of Qubes OS 4.0.4! This is the fourth stable release of Qubes 4.0. Read more Also: XSAs released on 2021-03-04

Linux Magazine Latest Issue (Paywall)