- ==============================
- UNEVICTABLE LRU INFRASTRUCTURE
- ==============================
- ========
- CONTENTS
- ========
- (*) The Unevictable LRU
- - The unevictable page list.
- - Memory control group interaction.
- - Marking address spaces unevictable.
- - Detecting unevictable pages.
- - vmscan's handling of unevictable pages.
- (*) mlock()'d pages.
- - History.
- - Basic management.
- - mlock()/mlockall() system call handling.
- - Filtering special vmas.
- - munlock()/munlockall() system call handling.
- - Migrating mlocked pages.
- - mmap(MAP_LOCKED) system call handling.
- - munmap()/exit()/exec() system call handling.
- - try_to_unmap().
- - try_to_munlock() reverse map scan.
- - Page reclaim in shrink_*_list().
- ============
- INTRODUCTION
- ============
- This document describes the Linux memory manager's "Unevictable LRU"
- infrastructure and the use of this to manage several types of "unevictable"
- pages.
- The document attempts to provide the overall rationale behind this mechanism
- and the rationale for some of the design decisions that drove the
- implementation. The latter design rationale is discussed in the context of an
- implementation description. Admittedly, one can obtain the implementation
- details - the "what does it do?" - by reading the code. One hopes that the
- descriptions below add value by providing the answer to "why does it do that?".
- ===================
- THE UNEVICTABLE LRU
- ===================
- The Unevictable LRU facility adds an additional LRU list to track unevictable
- pages and to hide these pages from vmscan. This mechanism is based on a patch
- by Larry Woodman of Red Hat to address several scalability problems with page
- reclaim in Linux. The problems have been observed at customer sites on large
- memory x86_64 systems.
- To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
- main memory will have over 32 million 4k pages in a single zone. When a large
- fraction of these pages are not evictable for any reason [see below], vmscan
- will spend a lot of time scanning the LRU lists looking for the small fraction
- of pages that are evictable. This can result in a situation where all CPUs are
- spending 100% of their time in vmscan for hours or days on end, with the system
- completely unresponsive.
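The page count quoted above follows from simple division. The helper below is only an illustration of that arithmetic; the name pages_in_memory() is made up for this sketch and does not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of base pages needed to cover mem_bytes
 * of memory when every page is page_bytes long.
 */
static uint64_t pages_in_memory(uint64_t mem_bytes, uint64_t page_bytes)
{
	return mem_bytes / page_bytes;
}
```

For 128GB (2^37 bytes) and 4k pages (2^12 bytes) this gives 2^25 = 33,554,432 pages, which is the "over 32 million" figure in the text.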
- The unevictable list addresses the following classes of unevictable pages:
- (*) Those owned by ramfs.
- (*) Those mapped into SHM_LOCK'd shared memory regions.
- (*) Those mapped into VM_LOCKED [mlock()ed] VMAs.
- The infrastructure may also be able to handle other conditions that make pages
- unevictable, either by definition or by circumstance, in the future.
- THE UNEVICTABLE PAGE LIST
- -------------------------
- The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
- called the "unevictable" list and an associated page flag, PG_unevictable, to
- indicate that the page is being managed on the unevictable list.
- The PG_unevictable flag is analogous to, and mutually exclusive with, the
- PG_active flag in that it indicates on which LRU list a page resides when
- PG_lru is set.
- The Unevictable LRU infrastructure maintains unevictable pages on an additional
- LRU list for a few reasons:
- (1) We get to "treat unevictable pages just like we treat other pages in the
- system - which means we get to use the same code to manipulate them, the
- same code to isolate them (for migrate, etc.), the same code to keep track
- of the statistics, etc..." [Rik van Riel]
- (2) We want to be able to migrate unevictable pages between nodes for memory
- defragmentation, workload management and memory hotplug. The Linux kernel
- can only migrate pages that it can successfully isolate from the LRU
- lists. If we were to maintain pages elsewhere than on an LRU-like list,
- where they can be found by isolate_lru_page(), we would prevent their
- migration, unless we reworked migration code to find the unevictable pages
- itself.
- The unevictable list does not differentiate between file-backed and anonymous,
- swap-backed pages. This differentiation is only important while the pages are,
- in fact, evictable.
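As a rough sketch of how the flags select a list, the toy model below maps a page's flags to a list index. It assumes the evictable lists are split into anon/file and active/inactive variants; all TOY_* names and bit values are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the page flags named in the text. */
#define TOY_PG_LRU		(1u << 0)
#define TOY_PG_ACTIVE		(1u << 1)
#define TOY_PG_UNEVICTABLE	(1u << 2)

/* Toy list indices: one unevictable list next to the evictable ones. */
enum toy_lru_list {
	TOY_LRU_INACTIVE_ANON,
	TOY_LRU_ACTIVE_ANON,
	TOY_LRU_INACTIVE_FILE,
	TOY_LRU_ACTIVE_FILE,
	TOY_LRU_UNEVICTABLE,
	TOY_NR_LRU_LISTS
};

/* Return the list a page sits on, or -1 if it is not on any LRU list. */
static int toy_page_lru(unsigned int flags, bool file_backed)
{
	int lru;

	if (!(flags & TOY_PG_LRU))
		return -1;

	/* Unevictable pages share one list, whatever backs them. */
	if (flags & TOY_PG_UNEVICTABLE)
		return TOY_LRU_UNEVICTABLE;

	lru = file_backed ? TOY_LRU_INACTIVE_FILE : TOY_LRU_INACTIVE_ANON;
	if (flags & TOY_PG_ACTIVE)
		lru += 1;	/* the active list follows its inactive twin */
	return lru;
}
```

The file_backed argument only matters on the evictable branch, which mirrors the point that the file/anon distinction is irrelevant while a page is unevictable.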
- The unevictable list benefits from the "arrayification" of the per-zone LRU
- lists and statistics originally proposed and posted by Christoph Lameter.
- The unevictable list does not use the LRU pagevec mechanism. Rather,
- unevictable pages are placed directly on the page's zone's unevictable list
- under the zone lru_lock. This allows us to prevent the stranding of pages on
- the unevictable list when one task has the page isolated from the LRU and other
- tasks are changing the "evictability" state of the page.
- MEMORY CONTROL GROUP INTERACTION
- --------------------------------
- The unevictable LRU facility interacts with the memory control group [aka
- memory controller; see Documentation/cgroups/memory.txt] by extending the
- lru_list enum.
- The memory controller data structure automatically gets a per-zone unevictable
- list as a result of the "arrayification" of the per-zone LRU lists (one per
- lru_list enum element). The memory controller tracks the movement of pages to
- and from the unevictable list.
- When a memory control group comes under memory pressure, the controller will
- not attempt to reclaim pages on the unevictable list. This has a couple of
- effects:
- (1) Because the pages are "hidden" from reclaim on the unevictable list, the
- reclaim process can be more efficient, dealing only with pages that have a
- chance of being reclaimed.
- (2) On the other hand, if too many of the pages charged to the control group
- are unevictable, the evictable portion of the working set of the tasks in
- the control group may not fit into the available memory. This can cause
- the control group to thrash or to OOM-kill tasks.
- MARKING ADDRESS SPACES UNEVICTABLE
- ----------------------------------
- For facilities such as ramfs none of the pages attached to the address space
- may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
- address space flag is provided, and this can be manipulated by a filesystem
- using a number of wrapper functions:
- (*) void mapping_set_unevictable(struct address_space *mapping);
- Mark the address space as being completely unevictable.
- (*) void mapping_clear_unevictable(struct address_space *mapping);
- Mark the address space as being evictable.
- (*) int mapping_unevictable(struct address_space *mapping);
- Query the address space, and return true if it is completely
- unevictable.
- These are currently used in two places in the kernel:
- (1) By ramfs to mark the address spaces of its inodes when they are created,
- and this mark remains for the life of the inode.
- (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
- Note that SHM_LOCK is not required to page in the locked pages if they're
- swapped out; the application must touch the pages manually if it wants to
- ensure they're in memory.
- DETECTING UNEVICTABLE PAGES
- ---------------------------
- The function page_evictable() in vmscan.c determines whether a page is
- evictable or not using the query function outlined above [see section "Marking
- address spaces unevictable"] to check the AS_UNEVICTABLE flag.
- For address spaces that are so marked after being populated (as SHM regions
- might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
- the page tables for the region as does, for example, mlock(), nor need it make
- any special effort to push any pages in the SHM_LOCK'd area to the unevictable
- list. Instead, vmscan will do this if and when it encounters the pages during
- a reclamation scan.
- On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
- the pages in the region and "rescue" them from the unevictable list if no other
- condition is keeping them unevictable. If an unevictable region is destroyed,
- the pages are also "rescued" from the unevictable list in the process of
- freeing them.
- page_evictable() also checks for mlocked pages by testing an additional page
- flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
- and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
- VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
- update the appropriate statistics if the vma is VM_LOCKED. This method allows
- efficient "culling" of pages in the fault path that are being faulted in to
- VM_LOCKED VMAs.
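The checks described in this section can be sketched in userspace as follows. This is a simplified model, not the kernel function: the toy_* structures reduce each kernel object to the one bit the text talks about, and the counter stands in for the mlocked-page statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping { bool unevictable; };	/* models AS_UNEVICTABLE */
struct toy_vma     { bool vm_locked; };		/* models VM_LOCKED */
struct toy_page {
	struct toy_mapping *mapping;		/* NULL for anonymous pages */
	bool mlocked;				/* models PG_mlocked */
};

static long toy_nr_mlock;	/* models the mlocked-page statistic */

/* Return true if the page may be evicted; vma may be NULL. */
static bool toy_page_evictable(struct toy_page *page, const struct toy_vma *vma)
{
	/* Whole address space marked unevictable (ramfs, SHM_LOCK). */
	if (page->mapping && page->mapping->unevictable)
		return false;

	/* Already noticed as mlocked. */
	if (page->mlocked)
		return false;

	/* Fault-path culling: mark the page now and count it once. */
	if (vma && vma->vm_locked) {
		page->mlocked = true;
		toy_nr_mlock++;
		return false;
	}
	return true;
}
```

Calling the model a second time on the same page takes the PG_mlocked branch, so the statistic is not incremented twice.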
- VMSCAN'S HANDLING OF UNEVICTABLE PAGES
- --------------------------------------
- If unevictable pages are culled in the fault path, or moved to the unevictable
- list at mlock() or mmap() time, vmscan will not encounter the pages until they
- have become evictable again (via munlock() for example) and have been "rescued"
- from the unevictable list. However, there may be situations where we decide,
- for the sake of expediency, to leave an unevictable page on one of the regular
- active/inactive LRU lists for vmscan to deal with. vmscan checks for such
- pages in all of the shrink_{active|inactive|page}_list() functions and will
- "cull" such pages that it encounters: that is, it diverts those pages to the
- unevictable list for the zone being scanned.
- There may be situations where a page is mapped into a VM_LOCKED VMA, but the
- page is not marked as PG_mlocked. Such pages will make it all the way to
- shrink_page_list() where they will be detected when vmscan walks the reverse
- map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
- shrink_page_list() will cull the page at that point.
- To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
- using putback_lru_page() - the inverse operation to isolate_lru_page() - after
- dropping the page lock. Because the condition which makes the page unevictable
- may change once the page is unlocked, putback_lru_page() will recheck the
- unevictable state of a page that it places on the unevictable list. If the
- page has become evictable, putback_lru_page() removes it from the list and
- retries, including the page_evictable() test. Because such a race is a rare
- event and movement of pages onto the unevictable list should be rare, these
- extra evictability checks should not occur in the majority of calls to
- putback_lru_page().
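The recheck-and-retry behaviour can be modelled with a single-threaded sketch. Everything here is hypothetical: the "racer" hook merely simulates another task changing the page's state in the window after the page has been placed on a list, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_list { TOY_NONE, TOY_EVICTABLE, TOY_UNEVICTABLE };

struct toy_page {
	bool evictable;			/* current result of the evictability test */
	enum toy_list on_list;		/* list the page currently sits on */
	int rechecks;			/* number of retries that were needed */
	void (*racer)(struct toy_page *);	/* simulated concurrent change */
};

static void toy_make_evictable(struct toy_page *page)
{
	page->evictable = true;
}

static void toy_putback(struct toy_page *page)
{
	for (;;) {
		bool was_evictable = page->evictable;

		page->on_list = was_evictable ? TOY_EVICTABLE : TOY_UNEVICTABLE;

		/* Window in which another task may change the state (fires once). */
		if (page->racer) {
			void (*racer)(struct toy_page *) = page->racer;

			page->racer = NULL;
			racer(page);
		}

		/* Only a placement on the unevictable list is rechecked. */
		if (was_evictable || !page->evictable)
			return;

		/* State changed under us: take the page off and try again. */
		page->on_list = TOY_NONE;
		page->rechecks++;
	}
}
```

Without a racer the loop body runs exactly once, which matches the expectation that the extra check is rarely needed.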
- =============
- MLOCKED PAGES
- =============
- The unevictable page list is also useful for mlock(), in addition to ramfs and
- SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
- NOMMU situations, all mappings are effectively mlocked.
- HISTORY
- -------
- The "Unevictable mlocked Pages" infrastructure is based on work originally
- posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
- Nick posted his patch as an alternative to a patch posted by Christoph Lameter
- to achieve the same objective: hiding mlocked pages from vmscan.
- In Nick's patch, he used one of the struct page LRU list link fields as a count
- of VM_LOCKED VMAs that map the page. This use of the link field for a count
- prevented the management of the pages on an LRU list, and thus mlocked pages
- were not migratable as isolate_lru_page() could not find them, and the LRU list
- link field was not available to the migration subsystem.
- Nick resolved this by putting mlocked pages back on the LRU list before
- attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
- Nick's patch was integrated with the Unevictable LRU work, the count was
- replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
- mapped the page. More on this below.
- BASIC MANAGEMENT
- ----------------
- mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
- pages. When such a page has been "noticed" by the memory management subsystem,
- the page is marked with the PG_mlocked flag. This can be manipulated using the
- PageMlocked() functions.
- A PG_mlocked page will be placed on the unevictable list when it is added to
- the LRU. Such pages can be "noticed" by memory management in several places:
- (1) in the mlock()/mlockall() system call handlers;
- (2) in the mmap() system call handler when mmapping a region with the
- MAP_LOCKED flag;
- (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
- flag;
- (4) in the fault path, if mlocked pages are "culled" in the fault path,
- and when a VM_LOCKED stack segment is expanded; or
- (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
- reclaim a page in a VM_LOCKED VMA via try_to_unmap()
- all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
- already have it set.
- mlocked pages become unlocked and rescued from the unevictable list when:
- (1) mapped in a range unlocked via the munlock()/munlockall() system calls;
- (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
- unmapping at task exit;
- (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
- or
- (4) before a page is COW'd in a VM_LOCKED VMA.
- mlock()/mlockall() SYSTEM CALL HANDLING
- ---------------------------------------
- Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
- for each VMA in the range specified by the call. In the case of mlockall(),
- this is the entire active address space of the task. Note that mlock_fixup()
- is used for both mlocking and munlocking a range of memory. A call to mlock()
- an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is
- treated as a no-op, and mlock_fixup() simply returns.
- If the VMA passes some filtering as described in "Filtering Special Vmas"
- below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
- off a subset of the VMA if the range does not cover the entire VMA. Once the
- VMA has been merged or split or neither, mlock_fixup() will call
- __mlock_vma_pages_range() to fault in the pages via get_user_pages() and to
- mark the pages as mlocked via mlock_vma_page().
- Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
- get_user_pages() will be unable to fault in the pages. That's okay. If pages
- do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
- fault path or in vmscan.
- Also note that a page returned by get_user_pages() could be truncated or
- migrated out from under us, while we're trying to mlock it. To detect this,
- __mlock_vma_pages_range() checks page_mapping() after acquiring the page lock.
- If the page is still associated with its mapping, we'll go ahead and call
- mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
- In the worst case, this will result in a page mapped in a VM_LOCKED VMA
- remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
- detect and cull such pages.
- mlock_vma_page() will call TestSetPageMlocked() for each page returned by
- get_user_pages(). We use TestSetPageMlocked() because the page might already
- be mlocked by another task/VMA and we don't want to do extra work. We
- especially do not want to count an mlocked page more than once in the
- statistics. If the page was already mlocked, mlock_vma_page() need do nothing
- more.
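The reason for a test-and-set operation is easiest to see in a small model. The sketch below uses C11 atomics in userspace; the toy_* names are invented, and the counter only stands in for the kernel's mlocked-page statistic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PG_MLOCKED	0x1u

struct toy_page { atomic_uint flags; };

static atomic_long toy_nr_mlock;	/* models the mlocked-page statistic */

/* Set the bit and report whether it was already set. */
static bool toy_test_set_mlocked(struct toy_page *page)
{
	return (atomic_fetch_or(&page->flags, TOY_PG_MLOCKED) & TOY_PG_MLOCKED) != 0;
}

/* Clear the bit and report whether it was set before. */
static bool toy_test_clear_mlocked(struct toy_page *page)
{
	return (atomic_fetch_and(&page->flags, ~TOY_PG_MLOCKED) & TOY_PG_MLOCKED) != 0;
}

/* Count the page only on the 0 -> 1 transition. */
static void toy_mlock_page(struct toy_page *page)
{
	if (!toy_test_set_mlocked(page))
		atomic_fetch_add(&toy_nr_mlock, 1);
}

/* Uncount the page only on the 1 -> 0 transition. */
static void toy_munlock_page(struct toy_page *page)
{
	if (toy_test_clear_mlocked(page))
		atomic_fetch_sub(&toy_nr_mlock, 1);
}
```

Because the old value of the bit is returned atomically, two callers racing to lock the same page cannot both see "was not set", so the statistic moves by exactly one.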
- If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
- page from the LRU, as it is likely on the appropriate active or inactive list
- at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
- back the page - by calling putback_lru_page() - which will notice that the page
- is now mlocked and divert the page to the zone's unevictable list. If
- mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
- it later if and when it attempts to reclaim the page.
- FILTERING SPECIAL VMAS
- ----------------------
- mlock_fixup() filters several classes of "special" VMAs:
- 1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
- these mappings are inherently pinned, so we don't need to mark them as
- mlocked. In any case, most of the pages have no struct page in which to so
- mark the page. Because of this, get_user_pages() will fail for these VMAs,
- so there is no sense in attempting to visit them.
- 2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We
- neither need nor want to mlock() these pages. However, to preserve the
- prior behavior of mlock() - before the unevictable/mlock changes -
- mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
- allocate the huge pages and populate the ptes.
- 3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
- kernel pages, such as the VDSO page, relay channel pages, etc. These pages
- are inherently unevictable and are not managed on the LRU lists.
- mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
- make_pages_present() to populate the ptes.
- Note that for all of these special VMAs, mlock_fixup() does not set the
- VM_LOCKED flag. Therefore, we won't have to deal with them later during
- munlock(), munmap() or task exit. Neither does mlock_fixup() account these
- VMAs against the task's "locked_vm".
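The three filter classes can be summarised as a small decision function. This is a sketch of the rules as stated above, not the kernel code: the TOY_VM_* bit values are arbitrary and the action names are made up.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy flag bits; the values are arbitrary, not the kernel's. */
#define TOY_VM_IO		(1u << 0)
#define TOY_VM_PFNMAP		(1u << 1)
#define TOY_VM_DONTEXPAND	(1u << 2)
#define TOY_VM_RESERVED		(1u << 3)

enum toy_mlock_action {
	TOY_SKIP,		/* class 1: nothing to do */
	TOY_POPULATE_ONLY,	/* classes 2 and 3: make pages present only */
	TOY_MLOCK		/* normal VMA: set VM_LOCKED, mlock the pages */
};

static enum toy_mlock_action toy_filter_vma(unsigned int vm_flags, bool is_hugetlb)
{
	if (vm_flags & (TOY_VM_IO | TOY_VM_PFNMAP))
		return TOY_SKIP;
	if (is_hugetlb || (vm_flags & (TOY_VM_DONTEXPAND | TOY_VM_RESERVED)))
		return TOY_POPULATE_ONLY;
	return TOY_MLOCK;
}

/* Only normal VMAs get VM_LOCKED and are charged to locked_vm. */
static bool toy_counts_toward_locked_vm(enum toy_mlock_action action)
{
	return action == TOY_MLOCK;
}
```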
- munlock()/munlockall() SYSTEM CALL HANDLING
- -------------------------------------------
- The munlock() and munlockall() system calls are handled by the same functions -
- do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
- lock operation indicated by an argument. So, these system calls are also
- handled by mlock_fixup(). Again, if called for an already munlocked VMA,
- mlock_fixup() simply returns. Because of the VMA filtering discussed above,
- VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
- ignored for munlock.
- If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
- specified range. The range is then munlocked via the function
- __mlock_vma_pages_range() - the same function used to mlock a VMA range -
- passing a flag to indicate that munlock() is being performed.
- Because the VMA access protections could have been changed to PROT_NONE after
- faulting in and mlocking pages, get_user_pages() was unreliable for visiting
- these pages for munlocking. Because we don't want to leave pages mlocked,
- get_user_pages() was enhanced to accept a flag to ignore the permissions when
- fetching the pages - all of which should be resident as a result of previous
- mlocking.
- For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
- munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
- flag using TestClearPageMlocked(). As with mlock_vma_page(),
- munlock_vma_page() uses the Test*PageMlocked() function to handle the case where
- the page might have already been unlocked by another task. If the page was
- mlocked, munlock_vma_page() updates the zone statistics for the number of
- mlocked pages. Note, however, that at this point we haven't checked whether
- the page is mapped by other VM_LOCKED VMAs.
- We can't call try_to_munlock(), the function that walks the reverse map to
- check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
- try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
- not be on an LRU list [more on these below]. However, the call to
- isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
- we go ahead and clear PG_mlocked up front, as this might be the only chance we
- have. If we can successfully isolate the page, we go ahead and
- try_to_munlock(), which will restore the PG_mlocked flag and update the zone
- page statistics if it finds another VMA holding the page mlocked. If we fail
- to isolate the page, we'll have left a potentially mlocked page on the LRU.
- This is fine, because we'll catch it later if and when vmscan tries to reclaim
- the page. This should be relatively rare.
- MIGRATING MLOCKED PAGES
- -----------------------
- A page that is being migrated has been isolated from the LRU lists and is held
- locked across unmapping of the page, updating the page's address space entry
- and copying the contents and state, until the page table entry has been
- replaced with an entry that refers to the new page. Linux supports migration
- of mlocked pages and other unevictable pages. This involves simply moving the
- PG_mlocked and PG_unevictable states from the old page to the new page.
- Note that page migration can race with mlocking or munlocking of the same page.
- This has been discussed from the mlock/munlock perspective in the respective
- sections above. Both processes (migration and m[un]locking) hold the page
- locked. This provides the first level of synchronization. Page migration
- zeros out the page_mapping of the old page before unlocking it, so m[un]lock
- can skip these pages by testing the page mapping under page lock.
- To complete page migration, we place the new and old pages back onto the LRU
- after dropping the page lock. The "unneeded" page - old page on success, new
- page on failure - will be freed when the reference count held by the migration
- process is released. To ensure that we don't strand pages on the unevictable
- list because of a race between munlock and migration, page migration uses the
- putback_lru_page() function to add migrated pages back to the LRU.
- mmap(MAP_LOCKED) SYSTEM CALL HANDLING
- -------------------------------------
- In addition to the mlock()/mlockall() system calls, an application can request
- that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap()
- call. Furthermore, any mmap() call or brk() call that expands the heap by a
- task that has previously called mlockall() with the MCL_FUTURE flag will result
- in the newly mapped memory being mlocked. Before the unevictable/mlock
- changes, the kernel simply called make_pages_present() to allocate pages and
- populate the page table.
- To mlock a range of memory under the unevictable/mlock infrastructure, the
- mmap() handler and task address space expansion functions call
- mlock_vma_pages_range() specifying the vma and the address range to mlock.
- mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in
- "Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have
- already been set by the caller, in filtered VMAs. Thus these VMAs need not be
- visited for munlock when the region is unmapped.
- For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
- fault/allocate the pages and mlock them. Again, like mlock_fixup(),
- mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
- attempting to fault/allocate and mlock the pages and "upgrades" the semaphore
- back to write mode before returning.
- The callers of mlock_vma_pages_range() will have already added the memory range
- to be mlocked to the task's "locked_vm". To account for filtered VMAs,
- mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
- callers then subtract a non-negative return value from the task's locked_vm. A
- negative return value represents an error - for example, from get_user_pages()
- attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
- memory range accounted as locked_vm, as the protections could be changed later
- and pages allocated into that region.
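The accounting rule above amounts to a few lines of arithmetic. The function below is a hypothetical caller-side model: "ret" plays the role of the value returned by mlock_vma_pages_range(), and nothing here is taken from the kernel source.

```c
#include <assert.h>

/*
 * Model of a caller: charge nr_pages to locked_vm up front, then give
 * back the pages that were reported as NOT mlocked. A negative ret is
 * an error code, in which case the full charge is kept.
 */
static long toy_account_locked_vm(long locked_vm, long nr_pages, long ret)
{
	locked_vm += nr_pages;		/* charged before the call */
	if (ret >= 0)
		locked_vm -= ret;	/* pages in filtered VMAs are uncharged */
	return locked_vm;
}
```

With an initial locked_vm of 100 and a 16-page range, a return value of 4 leaves 112 pages charged, while an error return leaves all 116.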
- munmap()/exit()/exec() SYSTEM CALL HANDLING
- -------------------------------------------
- When unmapping an mlocked region of memory, whether by an explicit call to
- munmap() or via an internal unmap from exit() or exec() processing, we must
- munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
- Before the unevictable/mlock changes, mlocking did not mark the pages in any
- way, so unmapping them required no processing.
- To munlock a range of memory under the unevictable/mlock infrastructure, the
- munmap() handler and the task address space teardown function call
- munlock_vma_pages_all(). The name reflects the observation that one always
- specifies the entire VMA range when munlock()ing during unmap of a region.
- Because of the VMA filtering when mlocking() regions, only "normal" VMAs that
- actually contain mlocked pages will be passed to munlock_vma_pages_all().
- munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
- for the munlock case, calls __munlock_vma_pages_range() to walk the page table
- for the VMA's memory range and munlock_vma_page() each resident page mapped by
- the VMA. This effectively munlocks the page, only if this is the last
- VM_LOCKED VMA that maps the page.
- try_to_unmap()
- --------------
- Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
- have the VM_LOCKED flag set. It is possible for a page mapped into one or more
- VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
- of the active or inactive LRU lists. This could happen if, for example, a task
- in the process of munlocking the page could not isolate the page from the LRU.
- As a result, vmscan/shrink_page_list() might encounter such a page as described
- in section "vmscan's handling of unevictable pages". To handle this situation,
- try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
- map.
- try_to_unmap() is always called, by either vmscan for reclaim or for page
- migration, with the argument page locked and isolated from the LRU. Separate
- functions handle anonymous and mapped file pages, as these types of pages have
- different reverse map mechanisms.
- (*) try_to_unmap_anon()
- To unmap anonymous pages, each VMA in the list anchored in the anon_vma
- must be visited - at least until a VM_LOCKED VMA is encountered. If the
- page is being unmapped for migration, VM_LOCKED VMAs do not stop the
- process because mlocked pages are migratable. However, for reclaim, if
- the page is mapped into a VM_LOCKED VMA, the scan stops.
- try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
- the mm_struct to which the VMA belongs. If this is successful, it will
- mlock the page via mlock_vma_page() - we wouldn't have gotten to
- try_to_unmap_anon() if the page were already mlocked - and will return
- SWAP_MLOCK, indicating that the page is unevictable.
- If the mmap semaphore cannot be acquired, we are not sure whether the page
- is really unevictable or not. In this case, try_to_unmap_anon() will
- return SWAP_AGAIN.
- (*) try_to_unmap_file() - linear mappings
- Unmapping of a mapped file page works the same as for anonymous mappings,
- except that the scan visits all VMAs that map the page's index/page offset
- in the page's mapping's reverse map priority search tree. It also visits
- each VMA in the page's mapping's non-linear list, if the list is
- non-empty.
- As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
- page, try_to_unmap_file() will attempt to acquire the associated
- mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
- is successful, and SWAP_AGAIN, if not.
- (*) try_to_unmap_file() - non-linear mappings
- If a page's mapping contains a non-empty non-linear mapping VMA list, then
- try_to_un{map|lock}() must also visit each VMA in that list to determine
- whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit
- all VMAs in the non-linear list to ensure that the page is not/should not
- be mlocked.
- If a VM_LOCKED VMA is found in the list, the scan could terminate.
- However, there is no easy way to determine whether the page is actually
- mapped in a given VMA - either for unmapping or testing whether the
- VM_LOCKED VMA actually pins the page.
- try_to_unmap_file() handles non-linear mappings by scanning a certain
- number of pages - a "cluster" - in each non-linear VMA associated with the
- page's mapping, for each file mapped page that vmscan tries to unmap. If
- this happens to unmap the page we're trying to unmap, try_to_unmap() will
- notice this on return (page_mapcount(page) will be 0) and return
- SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to
- recirculate this page. We take advantage of the cluster scan in
- try_to_unmap_cluster() as follows:
- For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the
- mmap semaphore of the associated mm_struct for read without blocking.
- If this attempt is successful and the VMA is VM_LOCKED,
- try_to_unmap_cluster() will retain the mmap semaphore for the scan;
- otherwise it drops it here.
- Then, for each page in the cluster, if we're holding the mmap semaphore
- for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to
- mlock the page. This call is a no-op if the page is already locked,
- but will mlock any pages in the non-linear mapping that happen to be
- unlocked.
- If one of the pages so mlocked is the page passed in to try_to_unmap(),
- try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default
- SWAP_AGAIN. This will allow vmscan to cull the page, rather than
- recirculating it on the inactive list.
- Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it
- returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED
- VMA, but couldn't be mlocked.
- try_to_munlock() REVERSE MAP SCAN
- ---------------------------------
- [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
- page_referenced() reverse map walker.
- When munlock_vma_page() [see section "munlock()/munlockall() System Call
- Handling" above] tries to munlock a page, it needs to determine whether or not
- the page is mapped by any VM_LOCKED VMA without actually attempting to unmap
- all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
- introduced a variant of try_to_unmap() called try_to_munlock().
- try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
- mapped file pages with an additional argument specifying unlock versus unmap
- processing. Again, these functions walk the respective reverse maps looking
- for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
- pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
- attempt to acquire the associated mmap semaphore, mlock the page via
- mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
- pre-clearing of the page's PG_mlocked done by munlock_vma_page.
- If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
- semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
- recycle the page on the inactive list and hope that it has better luck with the
- page next time.
- For file pages mapped into non-linear VMAs, the try_to_munlock() logic works
- slightly differently. On encountering a VM_LOCKED non-linear VMA that might
- map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the
- page. munlock_vma_page() will just leave the page unlocked and let vmscan deal
- with it - the usual fallback position.
- Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
- reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
- However, the scan can terminate when it encounters a VM_LOCKED VMA and can
- successfully acquire the VMA's mmap semaphore for read and mlock the page.
- Although try_to_munlock() might be called a great many times when munlocking a
- large region or tearing down a large address space that has been mlocked via
- mlockall(), overall this is a fairly rare event.
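The termination rule for the reverse map walk can be sketched as a loop over an array of VMAs. This is one reading of the text for anonymous and linear file mappings only; non-linear VMAs are left out, the toy_* types are invented, and "sem_available" simply stands for whether a non-blocking read acquisition of the mmap semaphore would succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_swap { TOY_SWAP_SUCCESS, TOY_SWAP_AGAIN, TOY_SWAP_MLOCK };

struct toy_vma {
	bool vm_locked;		/* models VM_LOCKED */
	bool sem_available;	/* would a read trylock of mmap_sem succeed? */
};

/*
 * Walk the VMAs that map a page. *visited reports how many entries
 * were examined, to make early termination observable.
 */
static enum toy_swap toy_try_to_munlock(const struct toy_vma *vmas, size_t n,
					size_t *visited)
{
	enum toy_swap ret = TOY_SWAP_SUCCESS;	/* no VM_LOCKED VMA seen yet */
	size_t i;

	for (i = 0; i < n; i++) {
		if (!vmas[i].vm_locked)
			continue;	/* must keep looking */
		if (vmas[i].sem_available) {
			ret = TOY_SWAP_MLOCK;	/* page re-mlocked, stop here */
			i++;
			break;
		}
		ret = TOY_SWAP_AGAIN;	/* locked VMA seen, but could not mlock */
	}
	*visited = i;
	return ret;
}
```

Proving that a page is not mlocked costs a full walk, while a positive answer can end the walk at the first usable VM_LOCKED VMA.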
- PAGE RECLAIM IN shrink_*_list()
- -------------------------------
- shrink_active_list() culls any obviously unevictable pages - i.e.
- !page_evictable(page, NULL) - diverting these to the unevictable list.
- However, shrink_active_list() only sees unevictable pages that made it onto the
- active/inactive LRU lists. Note that these pages do not have PageUnevictable
- set - otherwise they would be on the unevictable list and shrink_active_list()
- would never see them.
- Some examples of these unevictable pages on the LRU lists are:
- (1) ramfs pages that have been placed on the LRU lists when first allocated.
- (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
- allocate or fault in the pages in the shared memory region. This happens
- when an application accesses the page the first time after SHM_LOCK'ing
- the segment.
- (3) mlocked pages that could not be isolated from the LRU and moved to the
- unevictable list in mlock_vma_page().
- (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
- acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
- munlock_vma_page() was forced to let the page back on to the normal LRU
- list for vmscan to handle.
- shrink_inactive_list() also diverts any unevictable pages that it finds on the
- inactive lists to the appropriate zone's unevictable list.
- shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
- after shrink_active_list() had moved them to the inactive list, or pages mapped
- into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
- recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
- but will pass on to shrink_page_list().
- shrink_page_list() again culls obviously unevictable pages that it could
- encounter for reasons similar to those of shrink_inactive_list(). Pages mapped into
- VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
- try_to_unmap(). shrink_page_list() will divert them to the unevictable list
- when try_to_unmap() returns SWAP_MLOCK, as discussed above.
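How vmscan reacts to the return codes mentioned in this document can be condensed into a lookup. The sketch covers only the three codes discussed in the text; the enum and action names are made up, and treating SWAP_SUCCESS as "carry on reclaiming" is an assumption of this model.

```c
#include <assert.h>

enum toy_swap { TOY_SWAP_SUCCESS, TOY_SWAP_AGAIN, TOY_SWAP_MLOCK };

enum toy_action {
	TOY_CONTINUE_RECLAIM,		/* page was unmapped, reclaim may proceed */
	TOY_KEEP_ON_INACTIVE,		/* recirculate and retry on a later scan */
	TOY_CULL_TO_UNEVICTABLE		/* divert to the unevictable list */
};

static enum toy_action toy_after_unmap(enum toy_swap ret)
{
	switch (ret) {
	case TOY_SWAP_MLOCK:
		return TOY_CULL_TO_UNEVICTABLE;
	case TOY_SWAP_AGAIN:
		return TOY_KEEP_ON_INACTIVE;
	case TOY_SWAP_SUCCESS:
		return TOY_CONTINUE_RECLAIM;
	}
	return TOY_KEEP_ON_INACTIVE;	/* conservative default for unknown codes */
}
```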