        flush_cache_foo(...);
        modify_address_space();
        flush_tlb_foo(...);

The logic here is:
        void flush_cache_all(void);
        void flush_tlb_all(void);

These routines are to notify the architecture-specific code that a change has been made to the kernel address space mappings, which means that the mappings of every process have effectively changed.
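As one concrete illustration (a sketch only, not actual kernel code; remap_kernel_area() is a hypothetical stand-in for whatever code actually edits the kernel page tables), a change to the kernel mappings would be bracketed like so:

        /* Sketch: remap_kernel_area() is hypothetical. */
        flush_cache_all();              /* write back/invalidate cached data first */
        remap_kernel_area(addr, len);   /* now edit the kernel page tables */
        flush_tlb_all();                /* finally shoot down stale translations */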
An implementation shall:
        void flush_cache_mm(struct mm_struct *mm);
        void flush_tlb_mm(struct mm_struct *mm);

These routines notify the system that the entire address space described by the mm_struct passed is changing. Please take note of two things in particular:
        void flush_cache_range(struct mm_struct *mm, unsigned long start,
                               unsigned long end);
        void flush_tlb_range(struct mm_struct *mm, unsigned long start,
                             unsigned long end);

A change to a particular range of user addresses in the address space described by the mm_struct passed is occurring. The two notes above for flush_*_mm() concerning the mm_struct passed apply here as well.
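For instance, code unmapping a range of user addresses would follow the general sequence shown earlier (again a sketch; unmap_region() is a hypothetical name for the page table editing code):

        /* Sketch: unmap_region() is hypothetical. */
        flush_cache_range(mm, start, end);   /* before the page tables change */
        unmap_region(mm, start, end);
        flush_tlb_range(mm, start, end);     /* after the page tables change */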
An implementation shall:
        void flush_cache_page(struct vm_area_struct *vma, unsigned long address);
        void flush_tlb_page(struct vm_area_struct *vma, unsigned long address);

A change to a single page at 'address' within the user address space described by the vm_area_struct passed is occurring. An implementation, if need be, can get at the associated mm_struct for this address space via vma->vm_mm. The VMA is passed for convenience so that an implementation can inspect vma->vm_flags. This way, in an implementation where the instruction and data spaces are not unified, one can check to see if VM_EXEC is set in vma->vm_flags to possibly avoid flushing the instruction space, for example.
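To illustrate the VM_EXEC trick, here is a sketch for a machine with split, virtually indexed instruction and data caches; flush_dcache_page_lines() and flush_icache_page_lines() are hypothetical names for whatever line-flushing primitives the hardware provides:

        /* Sketch only; the *_page_lines() helpers are hypothetical. */
        void flush_cache_page(struct vm_area_struct *vma, unsigned long address)
        {
                struct mm_struct *mm = vma->vm_mm;

                address &= PAGE_MASK;
                flush_dcache_page_lines(mm, address);

                /* With split I/D caches, only touch the instruction
                 * cache if this mapping could contain code. */
                if (vma->vm_flags & VM_EXEC)
                        flush_icache_page_lines(mm, address);
        }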
The two notes above for flush_*_mm() concerning the mm_struct (passed indirectly via vma->vm_mm) apply here as well.
An implementation shall:
        void flush_page_to_ram(unsigned long page);

This is the ugly duckling. But its semantics are necessary on so many architectures that I needed to add it to the flush architecture for Linux.
Briefly, when (as one example) the kernel services a COW fault, it uses the aliased mappings of all physical memory in kernel space to perform the copy of the page in question to a new page. This presents a problem for virtually indexed caches which are write-back in nature. In this case, the kernel touches two physical pages in kernel space. The code sequence being described here essentially looks like:
        do_wp_page()
        {
                [ ... ]
                copy_cow_page(old_page, new_page);
                flush_page_to_ram(old_page);
                flush_page_to_ram(new_page);
                flush_cache_page(vma, address);
                modify_address_space();
                free_page(old_page);
                flush_tlb_page(vma, address);
                [ ... ]
        }

(Some of the actual code has been simplified for example purposes.)
Consider a virtually indexed cache which is write-back. At the point in time at which the copy of the page to the kernel space aliases occurs, it is possible for the user space view of the original page to be in the caches (at the user's address, i.e. where the fault is occurring). The page copy can bring this data (for the old page) into the caches. It will also place the data being copied into the cache at the new kernel aliased mapping of the page, and for write-back caches this data will be dirty or modified in the cache.
In such a case main memory will not see the most recent copy of the data. The caches are stupid: unless we force the cached data at the kernel alias of the new page we are giving to the user out to main memory, the process will see the old contents of the page (i.e. whatever garbage was there before the copy done by the COW processing above).
A concrete example of what was just described:
Consider a process which shares a page read-only with another task (or many) at virtual address 0x2000 in user space. And for example purposes let us say that this virtual address maps to physical page 0x14000.
        Virtual Pages
   task 1
   --------------
   | 0x00000000 |
   --------------
   | 0x00001000 |                        Physical Pages
   --------------                        --------------
   | 0x00002000 | --\                    | 0x00000000 |
   --------------    \                   --------------
                      \                  |    ...     |
   task 2              \                 --------------
   --------------       +-------------> | 0x00014000 |
   | 0x00000000 |      /                 --------------
   --------------     /                  |    ...     |
   | 0x00001000 |    /                   --------------
   --------------   /
   | 0x00002000 | -/
   --------------

If task 2 tries to write to the read-only page at address 0x2000 we will get a fault and eventually end up at the code fragment shown above in do_wp_page().
The kernel will get a new page for task 2, let us say this is physical page 0x26000, and let us also say that the kernel alias mappings for physical pages 0x14000 and 0x26000 can reside in two unique cache lines at the same time, based upon the line indexing scheme of this cache. (For instance, on a hypothetical 64KB direct-mapped virtually indexed cache, the two kernel aliases would index to 0x4000 and 0x6000 within the cache respectively, so both can be resident at once.)
The page contents get copied from the kernel mappings for physical page 0x14000 to the ones for physical page 0x26000.
At this point in time, on a write-back virtually indexed cache architecture we have a potential inconsistency. The new data copied into physical page 0x26000 is not necessarily in main memory at this point; in fact it could all be sitting in the cache, only at the kernel alias of the physical address. Also, the (non-modified, i.e. clean) data for the original (old) page is in the cache at the kernel alias for physical page 0x14000; this can produce an inconsistency later on, so to be safe it is best to eliminate the cached copies of this data as well.
Let us say we did not write back the data for the page at 0x26000 and just let it stay there. We would return to task 2 (who has this new page now mapped in at virtual address 0x2000), he would complete his write, then he would read some other piece of data in this new page (i.e. expecting the contents that existed there beforehand). At this point in time, if the data is left in the cache at the kernel alias for the new physical page, the user will get whatever was in main memory before the copy for his read. This can lead to disastrous results.
Therefore an architecture shall:
NOTE: It is actually necessary for this routine to invalidate lines even in a virtual cache which is not write-back in nature. To see why this is really necessary, replay the above example with tasks 1 and 2, but this time fork() yet another task 3 before the COW faults occur, and consider the contents of the caches in both kernel and user space if the following sequence occurs in exact succession:
        void update_mmu_cache(struct vm_area_struct *vma,
                              unsigned long address, pte_t pte);

Although not strictly part of the flush architecture, on certain architectures some critical operations and checks need to be performed here for things to work out properly and for the system to remain consistent.
In particular, for virtually indexed caches this routine must check to see that the new mapping being added by the current page fault does not add a "bad alias" to user space.
A "bad alias" is defined as two or more mappings (at least one of which is writable) to two or more virtual pages which all translate to the same exact physical page, and due to the indexing algorithm of the cache can also reside in unique and mutually exclusive cache lines.
If such a "bad alias" is detected, an implementation needs to resolve this inconsistency somehow; one solution is to walk through all of the mappings and change the page tables to mark these pages as "non-cacheable", if the hardware allows such a thing.
The checks for this are very simple; essentially, all an implementation needs to do is:
        if((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))
                check_for_potential_bad_aliases();

So for the common case (shared writable mappings are extremely rare) only one comparison is needed for systems with virtually indexed caches.
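What such a check might look like on a direct-mapped virtually indexed cache is sketched below; the call above omits the arguments for brevity. Everything besides the vma/address/pte parameters is hypothetical: cache_color(), for_each_user_mapping_of_page() and make_mapping_uncacheable() stand in for whatever a given port actually provides:

        /* Sketch only; all helpers named here are hypothetical. */
        static void check_for_potential_bad_aliases(struct vm_area_struct *vma,
                                                    unsigned long address,
                                                    pte_t pte)
        {
                unsigned long color = cache_color(address);
                struct user_mapping *m;

                for_each_user_mapping_of_page(pte, m) {
                        if (cache_color(m->vaddr) != color) {
                                /* Two virtual mappings of the same physical
                                 * page can sit in two different cache lines
                                 * at once: a "bad alias".  One resolution
                                 * is to make every mapping uncacheable. */
                                make_mapping_uncacheable(m);
                        }
                }
        }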
The main concern is whether one of the above flush operations causes the flush to be seen globally by the entire system, or whether the flush is only guaranteed to be seen by the local processor.
In the latter case a cross-call mechanism is needed. The current two SMP systems supported under Linux (Intel and Sparc) use inter-processor interrupts to "broadcast" the flush operation and cause it to run locally on all processors if necessary.
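In rough terms, such an implementation wraps the local flush with a broadcast, something like the following sketch (smp_cross_call() and local_flush_tlb_mm() are illustrative names, not actual interfaces):

        /* Sketch of an IPI-broadcast TLB flush; names are illustrative. */
        void smp_flush_tlb_mm(struct mm_struct *mm)
        {
                /* Ask all other processors to run the local flush... */
                smp_cross_call(local_flush_tlb_mm, mm);

                /* ...and perform it on this processor as well. */
                local_flush_tlb_mm(mm);
        }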
As an example, on sun4m Sparc systems all processors in the system must execute the flush request to guarantee consistency across the entire system. However, on sun4d Sparc machines, TLB flushes performed on the local processor are broadcast over the system bus by the hardware and therefore a cross call is not necessary.
Taking full advantage of such a facility, while still maintaining coherency as described above, requires some extra consideration from the implementor.
The issues involved will vary greatly from one implementation to another; at least, this has been the experience of the author. But in particular, some of the issues are likely to be:
Such issues are most cleanly dealt with at the device driver level. The author is convinced of this after his experience with a common set of Sparc device drivers which all needed to function correctly on more than a handful of cache/MMU and bus architectures in the same kernel.
In fact this implementation is more efficient because the driver knows exactly when DMA needs to see consistent data or when DMA is going to create an inconsistency which must be resolved. Any attempt to reach this level of efficiency via hooks added to the generic kernel memory management code would be complex, and if anything very unclean.
As an example, consider how DMA buffers are handled on the Sparc. When a device driver must perform DMA to/from either a single buffer or a scatter list of many buffers, it uses a set of abstract routines:
        char *(*mmu_get_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);
        void (*mmu_get_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
        void (*mmu_release_scsi_one)(char *, unsigned long, struct linux_sbus *sbus);
        void (*mmu_release_scsi_sgl)(struct mmu_sglist *, int, struct linux_sbus *sbus);
        void (*mmu_map_dma_area)(unsigned long addr, int len);

Essentially the mmu_get_* routines are passed a pointer or a set of pointers and size specifications to areas in kernel space for which DMA will occur, and they return a DMA-capable address (i.e. one which can be loaded into the DMA controller for the transfer). When the driver is done with the DMA and the transfer has completed, the mmu_release_* routines must be called with the DMA'able address(es) so that the resources can be freed (if necessary) and cache flushes can be performed (if necessary).
The final routine is there for drivers which need a block of DMA memory for a long period of time; for example, a networking driver would use this for a pool of transmit and receive buffers.
The final argument is a Sparc-specific entity which allows the machine-level code to perform the mapping when DMA mappings are set up on a per-bus basis.
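Put together, a driver's transfer path for a single buffer looks roughly like the sketch below. The mmu_* hooks are shown called directly for brevity, and start_dma() and wait_dma_done() are hypothetical driver-internal routines:

        /* Sketch of single-buffer DMA through the abstract hooks above;
         * start_dma() and wait_dma_done() are hypothetical. */
        char *dvma = mmu_get_scsi_one(buffer, len, sbus);

        start_dma(dvma, len);           /* program the DMA controller */
        wait_dma_done();                /* the transfer completes */

        /* Free resources and perform any cache flushes needed so the
         * CPU now sees the data the device just wrote. */
        mmu_release_scsi_one(dvma, len, sbus);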
There has been heated talk lately about adding page flipping facilities for very intelligent networking hardware. It may be necessary to extend the flush architecture to provide the interfaces and facilities necessary for these changes to the networking code.
And by all means, the flush architecture is always subject to improvements and changes to handle new issues or new hardware which presents problems that were unknown up to this point.
David S. Miller davem@caip.rutgers.edu