- The PPC KVM paravirtual interface
- =================================
- The basic execution principle by which KVM on PowerPC works is to run all kernel
- space code in PR=1 which is user space. This way we trap all privileged
- instructions and can emulate them accordingly.
- Unfortunately that is also the downfall. There are quite a few privileged
- instructions that needlessly return us to the hypervisor even though they
- could be handled differently.
- This is what the PPC PV interface helps with. It takes privileged instructions
- and transforms them into unprivileged ones with some help from the hypervisor.
- This cuts down virtualization costs by about 50% on some of my benchmarks.
- The code for that interface can be found in arch/powerpc/kernel/kvm*
- Querying for existence
- ======================
- To find out if we're running on KVM or not, we leverage the device tree. When
- Linux is running on KVM, a node /hypervisor exists. That node contains a
- compatible property with the value "linux,kvm".
- Once you have determined that you are running under a PV-capable KVM, you can
- use hypercalls as described below.
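- As a rough sketch of the check described above: a device tree "compatible"
- property is a list of NUL-terminated strings, so the test amounts to searching
- that list for "linux,kvm". The helper name below is hypothetical, and reading
- the property itself (for example from the flattened device tree) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `want` is one of the NUL-terminated strings packed into the
 * property value `prop` of `len` bytes, 0 otherwise.
 * (Hypothetical helper, not taken from arch/powerpc/kernel/kvm*.)
 */
static int example_compat_contains(const char *prop, size_t len,
				   const char *want)
{
	size_t pos = 0;

	assert(prop != NULL && want != NULL);
	while (pos < len) {
		/* Find the end of the current entry without overrunning. */
		const char *end = memchr(prop + pos, '\0', len - pos);
		size_t n = end ? (size_t)(end - (prop + pos)) : len - pos;

		if (n == strlen(want) && memcmp(prop + pos, want, n) == 0)
			return 1;
		pos += n + 1;	/* skip the entry and its terminator */
	}
	return 0;
}
```

- The comparison is on whole entries, so a property holding only "linux,kvm-foo"
- would not match.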
- KVM hypercalls
- ==============
- Inside the device tree's /hypervisor node there's a property called
- 'hypercall-instructions'. This property contains at most 4 opcodes that make
- up the hypercall. To issue a hypercall, execute these instructions.
- The parameters are as follows:
- Register   IN                 OUT
- r0         -                  volatile
- r3         1st parameter      Return code
- r4         2nd parameter      1st output value
- r5         3rd parameter      2nd output value
- r6         4th parameter      3rd output value
- r7         5th parameter      4th output value
- r8         6th parameter      5th output value
- r9         7th parameter      6th output value
- r10        8th parameter      7th output value
- r11        hypercall number   8th output value
- r12        -                  volatile
- Hypercall definitions are shared in generic code, so the same hypercall numbers
- apply for x86 and powerpc alike with the exception that each KVM hypercall
- also needs to be ORed with the KVM vendor code which is (42 << 16).
- Return codes can be as follows:
- Code   Meaning
- 0      Success
- 12     Hypercall not implemented
- <0     Error
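- The vendor code and the return codes above can be illustrated with a small
- sketch. All names here are made up for the example. How positive codes other
- than 12 should be treated is not stated in this document, so the sketch
- reports them separately.

```c
#include <assert.h>
#include <stdint.h>

/* Vendor code from the text above; the macro names are hypothetical. */
#define EXAMPLE_KVM_VENDOR_ID	42u
#define EXAMPLE_VENDOR_SHIFT	16

/* Build the value placed in r11 from a generic KVM hypercall number. */
static uint32_t example_kvm_hcall_token(uint32_t generic_nr)
{
	/* The generic number must not spill into the vendor field. */
	assert(generic_nr < (1u << EXAMPLE_VENDOR_SHIFT));
	return (EXAMPLE_KVM_VENDOR_ID << EXAMPLE_VENDOR_SHIFT) | generic_nr;
}

enum example_hcall_status {
	EXAMPLE_HCALL_OK,		/* r3 == 0 */
	EXAMPLE_HCALL_UNIMPLEMENTED,	/* r3 == 12 */
	EXAMPLE_HCALL_ERROR,		/* r3 < 0 */
	EXAMPLE_HCALL_OTHER,		/* positive code not listed above */
};

/* Interpret the return code found in r3 after the hypercall. */
static enum example_hcall_status example_classify_ret(long r3)
{
	if (r3 == 0)
		return EXAMPLE_HCALL_OK;
	if (r3 == 12)
		return EXAMPLE_HCALL_UNIMPLEMENTED;
	if (r3 < 0)
		return EXAMPLE_HCALL_ERROR;
	return EXAMPLE_HCALL_OTHER;
}
```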
- The magic page
- ==============
- To enable communication between the hypervisor and guest there is a new shared
- page that contains parts of supervisor visible register state. The guest can
- map this shared page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE.
- With this hypercall issued the guest always gets the magic page mapped at the
- desired location. The first parameter indicates the effective address when the
- MMU is enabled. The second parameter indicates the address in real mode, if
- applicable to the target. For now, we always map the page to -4096. This way we
- can access it using absolute load and store instructions. The following
- instruction reads the first field of the magic page:
- ld rX, -4096(0)
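- To see why -4096 works with a base register of 0, the following sketch
- computes the displacement for a field at a given byte offset and the effective
- address that results on 32 bit and 64 bit targets. The function names are
- hypothetical, and the field offsets themselves come from
- struct kvm_vcpu_arch_shared, not from this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_MAGIC_PAGE_BASE	(-4096)	/* mapping address from the text */
#define EXAMPLE_MAGIC_PAGE_SIZE	4096

/* Signed 16-bit displacement that reaches byte `off` of the magic page. */
static int16_t example_magic_disp(unsigned int off)
{
	assert(off < EXAMPLE_MAGIC_PAGE_SIZE);
	return (int16_t)(EXAMPLE_MAGIC_PAGE_BASE + (int)off);
}

/* Effective address of that byte once the displacement is sign extended. */
static uint64_t example_magic_ea(unsigned int off, int is_64bit)
{
	uint64_t ea = (uint64_t)(int64_t)example_magic_disp(off);

	return is_64bit ? ea : (ea & 0xffffffffu);
}
```

- Every offset inside the page yields a displacement between -4096 and -1, which
- fits the 16-bit displacement field of a load or store.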
- The interface is designed to be extensible should there be need later to add
- additional registers to the magic page. If you add fields to the magic page,
- also define a new hypercall feature to indicate that the host can give you more
- registers. Only if the host supports the additional features, make use of them.
- The magic page layout is described by struct kvm_vcpu_arch_shared
- in arch/powerpc/include/asm/kvm_para.h.
- Magic page features
- ===================
- When mapping the magic page using the KVM hypercall KVM_HC_PPC_MAP_MAGIC_PAGE,
- a second return value is passed to the guest. This second return value contains
- a bitmap of available features inside the magic page.
- The following enhancements to the magic page are currently available:
-   KVM_MAGIC_FEAT_SR              Maps SR registers r/w in the magic page
-   KVM_MAGIC_FEAT_MAS0_TO_SPRG7   Maps MASn, ESR, PIR and high SPRGs
- Before using an enhanced feature of the magic page, check that the host
- reports it in the feature bitmap!
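- A feature test on the returned bitmap could look like the sketch below. The
- bit positions assigned to the two features are assumptions made for the
- example; take the real values from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit assignments, for illustration only. */
#define EXAMPLE_MAGIC_FEAT_SR			(1u << 0)
#define EXAMPLE_MAGIC_FEAT_MAS0_TO_SPRG7	(1u << 1)

/*
 * `features` is the second return value of the map hypercall.
 * Return 1 only if every bit in `feat` is set in it.
 */
static int example_magic_has_feature(uint32_t features, uint32_t feat)
{
	assert(feat != 0);
	return (features & feat) == feat;
}
```

- A guest would fall back to the trapping instruction for any register whose
- feature bit is not set.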
- Magic page flags
- ================
- In addition to features that indicate whether a host is capable of a particular
- feature, we also have a channel for a guest to tell the host whether it's capable
- of something. This is what we call "flags".
- Flags are passed to the host in the low 12 bits of the Effective Address.
- The following flags are currently available for a guest to expose:
- MAGIC_PAGE_FLAG_NOT_MAPPED_NX Guest handles NX bits correctly wrt magic page
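- Because the magic page is page aligned, the low 12 bits of its effective
- address are free to carry flags. A minimal sketch of packing and unpacking
- that argument follows; the helper names and the bit chosen for the NX flag are
- assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_MAGIC_FLAG_MASK		0xfffu		/* low 12 bits */
#define EXAMPLE_FLAG_NOT_MAPPED_NX	(1u << 0)	/* assumed bit */

/* Combine a page-aligned effective address with guest flags. */
static uint64_t example_magic_ea_arg(uint64_t page_ea, uint32_t flags)
{
	assert((page_ea & EXAMPLE_MAGIC_FLAG_MASK) == 0);
	assert((flags & ~EXAMPLE_MAGIC_FLAG_MASK) == 0);
	return page_ea | flags;
}

/* What a host would recover from the same argument. */
static uint64_t example_magic_arg_page(uint64_t arg)
{
	return arg & ~(uint64_t)EXAMPLE_MAGIC_FLAG_MASK;
}

static uint32_t example_magic_arg_flags(uint64_t arg)
{
	return (uint32_t)(arg & EXAMPLE_MAGIC_FLAG_MASK);
}
```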
- MSR bits
- ========
- The MSR contains bits that require hypervisor intervention and bits that do
- not require direct hypervisor intervention because they only get interpreted
- when entering the guest or don't have any impact on the hypervisor's behavior.
- The following bits are safe to be set inside the guest:
- MSR_EE
- MSR_RI
- If any other bit changes in the MSR, please still use mtmsr(d).
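- The rule above boils down to comparing the old and new MSR values and looking
- at which bits differ. The sketch below uses the Book3S bit values for EE and
- RI; the macro and function names are hypothetical. It only decides whether a
- real mtmsr(d) is required and does not model the pending-interrupt check that
- setting EE=1 needs.

```c
#include <assert.h>
#include <stdint.h>

/* Book3S bit values; treat them as assumptions on other targets. */
#define EXAMPLE_MSR_EE	0x8000u	/* external interrupts enabled */
#define EXAMPLE_MSR_RI	0x0002u	/* recoverable interrupt */
#define EXAMPLE_MSR_SAFE_BITS	((uint64_t)(EXAMPLE_MSR_EE | EXAMPLE_MSR_RI))

/* Return 1 if the change touches any bit outside the safe set. */
static int example_msr_change_needs_mtmsr(uint64_t old_msr, uint64_t new_msr)
{
	return ((old_msr ^ new_msr) & ~EXAMPLE_MSR_SAFE_BITS) != 0;
}

/* Toggle only EE; by construction this never needs the privileged path. */
static uint64_t example_msr_set_ee(uint64_t msr, int enable)
{
	uint64_t new_msr = enable ? (msr | EXAMPLE_MSR_EE)
				  : (msr & ~(uint64_t)EXAMPLE_MSR_EE);

	assert(!example_msr_change_needs_mtmsr(msr, new_msr));
	return new_msr;
}
```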
- Patched instructions
- ====================
- The "ld" and "std" instructions are transformed to "lwz" and "stw" instructions
- respectively on 32 bit systems with an added offset of 4 to accommodate for big
- endianness.
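- The offset of 4 follows from big-endian byte order: the low 32 bits of a
- 64-bit field sit in its last four bytes. The sketch below demonstrates this on
- a plain byte buffer standing in for one magic page field. All names are
- hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Read a big-endian 32-bit word from a byte buffer. */
static uint32_t example_load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a big-endian 64-bit doubleword from a byte buffer. */
static uint64_t example_load_be64(const uint8_t *p)
{
	return ((uint64_t)example_load_be32(p) << 32) | example_load_be32(p + 4);
}

/* Offset a 32 bit guest uses for a field listed with ld/std at `off64`. */
static unsigned int example_field_offset_32(unsigned int off64)
{
	assert(off64 <= 4096 - 8);	/* the 64-bit field must fit the page */
	return off64 + 4;
}
```

- For a field holding 0x12345678, the 64-bit read at offset 0 and the 32-bit
- read at offset 4 return the same value, while a 32-bit read at offset 0 would
- return the (zero) upper half.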
- The following is a list of the mappings the Linux kernel performs when running
- as a guest. Implementing any of those mappings is optional, as the instruction
- traps also act on the shared page. So calling privileged instructions still
- works as before.
- From              To
- ====              ==
- mfmsr   rX        ld   rX, magic_page->msr
- mfsprg  rX, 0     ld   rX, magic_page->sprg0
- mfsprg  rX, 1     ld   rX, magic_page->sprg1
- mfsprg  rX, 2     ld   rX, magic_page->sprg2
- mfsprg  rX, 3     ld   rX, magic_page->sprg3
- mfsrr0  rX        ld   rX, magic_page->srr0
- mfsrr1  rX        ld   rX, magic_page->srr1
- mfdar   rX        ld   rX, magic_page->dar
- mfdsisr rX        lwz  rX, magic_page->dsisr
- mtmsr   rX        std  rX, magic_page->msr
- mtsprg  0, rX     std  rX, magic_page->sprg0
- mtsprg  1, rX     std  rX, magic_page->sprg1
- mtsprg  2, rX     std  rX, magic_page->sprg2
- mtsprg  3, rX     std  rX, magic_page->sprg3
- mtsrr0  rX        std  rX, magic_page->srr0
- mtsrr1  rX        std  rX, magic_page->srr1
- mtdar   rX        std  rX, magic_page->dar
- mtdsisr rX        stw  rX, magic_page->dsisr
- tlbsync           nop
- mtmsrd  rX, 0     b    <special mtmsr section>
- mtmsr   rX        b    <special mtmsr section>
- mtmsrd  rX, 1     b    <special mtmsrd section>
- [Book3S only]
- mtsrin  rX, rY    b    <special mtsrin section>
- [BookE only]
- wrteei  [0|1]     b    <special wrteei section>
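- As an illustration of the simple one-to-one rows in the table, the sketch
- below rewrites the 32-bit encoding of "mfmsr rX" into "ld rX, disp(0)" while
- keeping the target register. The opcode constants follow the Power ISA
- encodings; the names and the function are hypothetical, and the displacement
- of the msr field is passed in rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_INST_MFMSR	0x7c0000a6u	/* mfmsr r0 */
#define EXAMPLE_INST_LD		0xe8000000u	/* ld r0, 0(0) */
#define EXAMPLE_MASK_RT		0x03e00000u	/* target register field */

/*
 * If `inst` is "mfmsr rX", return "ld rX, disp(0)"; otherwise return
 * `inst` unchanged so callers can scan arbitrary code.
 */
static uint32_t example_patch_mfmsr(uint32_t inst, int16_t disp)
{
	if ((inst & ~EXAMPLE_MASK_RT) != EXAMPLE_INST_MFMSR)
		return inst;

	/* ld is DS-form: the low two displacement bits are not encodable. */
	assert((disp & 3) == 0);
	return EXAMPLE_INST_LD | (inst & EXAMPLE_MASK_RT) |
	       ((uint16_t)disp & 0xfffcu);
}
```

- On a 32 bit guest the same idea would emit lwz with the displacement
- increased by 4.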
- Some instructions require more logic to determine what's going on than a load
- or store instruction can deliver. To enable patching of those, we keep some
- RAM around into which we can translate instructions on the fly. What happens
- is the following:
- 1) copy emulation code to memory
- 2) patch that code to fit the emulated instruction
- 3) patch that code to return to the original pc + 4
- 4) patch the original instruction to branch to the new code
- That way we can inject an arbitrary amount of code as replacement for a single
- instruction. This allows us to check for pending interrupts when setting EE=1
- for example.
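- The four steps can be simulated in user space on an array of instruction
- words. This is only a sketch: the function names, the template layout (last
- word reserved for the return branch) and the single patch slot are assumptions,
- and a real implementation also has to keep the instruction cache coherent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXAMPLE_INST_B		0x48000000u	/* relative branch, AA=0, LK=0 */
#define EXAMPLE_INST_B_MASK	0x03fffffcu	/* branch offset field */

/* Encode "b" from word index `from` to word index `to` of one buffer. */
static uint32_t example_branch(size_t from, size_t to)
{
	int64_t delta = ((int64_t)to - (int64_t)from) * 4;

	assert(delta >= -0x2000000 && delta <= 0x1fffffc);
	return EXAMPLE_INST_B | ((uint32_t)delta & EXAMPLE_INST_B_MASK);
}

/*
 * Replace mem[orig] with a branch to a copy of `tmpl` placed at
 * mem[scratch]. tmpl[patch_slot] receives the instruction-specific word;
 * the last template word is overwritten with the branch back.
 */
static void example_install(uint32_t *mem, size_t orig, size_t scratch,
			    const uint32_t *tmpl, size_t tmpl_len,
			    size_t patch_slot, uint32_t patched_word)
{
	size_t last = scratch + tmpl_len - 1;

	assert(tmpl_len >= 2 && patch_slot < tmpl_len - 1);
	memcpy(&mem[scratch], tmpl, tmpl_len * sizeof(*tmpl));	/* step 1 */
	mem[scratch + patch_slot] = patched_word;		/* step 2 */
	mem[last] = example_branch(last, orig + 1);		/* step 3 */
	mem[orig] = example_branch(orig, scratch);		/* step 4 */
}
```

- The original instruction is patched last, so until step 4 completes the code
- path still executes the unmodified privileged instruction.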
- Hypercall ABIs in KVM on PowerPC
- =================================
- 1) KVM hypercalls (ePAPR)
- This is the ePAPR-compliant hypercall implementation (mentioned above). Even
- generic hypercalls are implemented here, like the ePAPR idle hcall. These are
- available on all targets.
- 2) PAPR hypercalls
- PAPR hypercalls are needed to run server PowerPC PAPR guests (-M pseries in QEMU).
- These are the same hypercalls that pHyp, the POWER hypervisor, implements. Some of
- them are handled in the kernel, some are handled in user space. This is only
- available on book3s_64.
- 3) OSI hypercalls
- Mac-on-Linux is another user of KVM on PowerPC, which has had its own hypercalls
- since long before KVM. These are supported to maintain compatibility. All these
- hypercalls get forwarded to user space. This is only useful on book3s_32, but
- can be used with book3s_64 as well.