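When a binary file is overwritten by a copy, and the destination is an executable that some process is currently running, the copy fails with a "Text file busy" (ETXTBSY) error. That is actually the good case. If the destination is a shared object (.so) that some process is currently using, the copy succeeds, but the process may then crash.
I wrote this phenomenon up once before, but after a while the details had faded. Then I came across this LWN article online, which is concise and to the point, so here is another pass at it.
Linus's main point is that ETXTBSY is merely a courtesy feature of the kernel, something done out of good manners to keep people from doing something extremely stupid ("we'll help you avoid shooting yourself in the foot when we notice"). Writes to shared library files, however, are not blocked in the same way: perhaps this is not considered the kernel's duty, and perhaps nobody is expected to do anything that "incredibly stupid".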
The kernel ETXTBUSY thing is purely a courtesy feature, and as people have noticed it only really works for the main executable because of various reasons. It's not something user space should even rely on, it's more of a "ok, you're doing something incredibly stupid, and we'll help you avoid shooting yourself in the foot when we notice".
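The main-executable case can be probed from inside a running program. The following is a minimal sketch, assuming Linux (it relies on the Linux-specific `/proc/self/exe` link); the function name `try_write_open_self` is made up for illustration. It deliberately omits O_TRUNC so nothing is damaged if the open were to succeed.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open the running program's own executable for writing.
 * Returns 0 if the open unexpectedly succeeds, otherwise the errno value. */
int try_write_open_self(void)
{
    /* /proc/self/exe refers to the executable of the calling process. */
    int fd = open("/proc/self/exe", O_WRONLY);
    if (fd >= 0) {
        close(fd);
        return 0;
    }
    return errno;
}
```

When the caller has write permission on the binary, the expected result is ETXTBSY; without write permission the open fails earlier with a permission error instead.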
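The article also mentions that shared libraries used to be protected against modification by passing the MAP_DENYWRITE flag to mmap. According to the mmap man page, however, that behaviour was a source of denial-of-service attacks, so the flag is now ignored. This matches the article's statement that the protection has been removed from the kernel, so the kernel will happily let you overwrite a .so file that a process is using ("When MAP_DENYWRITE went away, so did that protection; current Linux systems will happily allow a suitably privileged user to overwrite in-use, shared libraries").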
MAP_DENYWRITE
This flag is ignored. (Long ago—Linux 2.0 and earlier—it
signaled that attempts to write to the underlying file
should fail with ETXTBSY. But this was a source of
denial-of-service attacks.)
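That the flag is ignored can be checked with a short experiment. This is only a sketch, assuming Linux with glibc and a writable /tmp; the function name `denywrite_is_ignored` and the temporary file name are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for MAP_DENYWRITE in <sys/mman.h> */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a file can still be opened for writing while a
 * MAP_DENYWRITE mapping of it exists, -1 on any failure. */
int denywrite_is_ignored(void)
{
    char path[] = "/tmp/denywrite-XXXXXX";
    int result = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, "x", 1) == 1) {
        /* Ask for write denial; current kernels silently drop the flag. */
        void *p = mmap(NULL, 1, PROT_READ, MAP_PRIVATE | MAP_DENYWRITE, fd, 0);
        if (p != MAP_FAILED) {
            int wfd = open(path, O_WRONLY);   /* a second writer */
            if (wfd >= 0) {
                result = 0;
                close(wfd);
            }
            munmap(p, 1);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```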
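If the overwritten .so is one you built yourself (the usual case), typically only some processes crash. But if a system library such as libc.so is overwritten, most of the processes on the system could fail.
The article explains that the i_writecount field of each inode counts writers, with a twist in how the sign is interpreted. A negative value means the file is being executed by a process as its main executable. A positive value is the number of times the file is currently open for writing (the same file can be opened for writing several times, in one process or in different processes). Because any execution and any open-for-write are mutually exclusive, the negative and positive ranges can safely carry different meanings.
As the code below shows, when write access is requested and the value is negative (some process is executing the file), ETXTBSY is returned. Likewise, executing a file fails if the file is currently open for writing. One more note: a file can be open for writing several times at once, and of course several process instances of the same executable can run at once.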
///@file: linux-3.12.6\include\linux\fs.h
/*
* get_write_access() gets write permission for a file.
* put_write_access() releases this write permission.
* This is used for regular files.
* We cannot support write (and maybe mmap read-write shared) accesses and
* MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
* can have the following values:
* 0: no writers, no VM_DENYWRITE mappings
* < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
* > 0: (i_writecount) users are writing to the file.
*
* Normally we operate on that counter with atomic_{inc,dec} and it's safe
* except for the cases where we don't hold i_writecount yet. Then we need to
* use {get,deny}_write_access() - these functions check the sign and refuse
* to do the change if sign is wrong.
*/
static inline int get_write_access(struct inode *inode)
{
    return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
}
static inline int deny_write_access(struct file *file)
{
    struct inode *inode = file_inode(file);
    return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
}
static inline void put_write_access(struct inode * inode)
{
    atomic_dec(&inode->i_writecount);
}
static inline void allow_write_access(struct file *file)
{
    if (file)
        atomic_inc(&file_inode(file)->i_writecount);
}
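Once a file has been mapped into a process's address space with mmap and the file's contents are then modified, does the process see the modified contents? The mmap man page addresses this as well: it depends mainly on whether the mapping was created with MAP_SHARED or MAP_PRIVATE.
In other words, with a MAP_SHARED mapping, modifications made through the mapping are visible to every other MAP_SHARED mapping of the same region, and modifications made to the file through the filesystem are likewise visible to all mappings. With a MAP_PRIVATE mapping, whether later changes to the file are visible is unspecified.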
MAP_SHARED
Share this mapping. Updates to the mapping are visible to
other processes mapping the same region, and (in the case
of file-backed mappings) are carried through to the
underlying file. (To precisely control when updates are
carried through to the underlying file requires the use of
msync(2).)
MAP_PRIVATE
Create a private copy-on-write mapping. Updates to the
mapping are not visible to other processes mapping the
same file, and are not carried through to the underlying
file. It is unspecified whether changes made to the file
after the mmap() call are visible in the mapped region.
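The shared case is actually the simplest to implement: each file has exactly one inode in the kernel, and each part of the file's contents has exactly one page in that inode's page cache. In shared mode every modification happens in that page, and since the page is visible to all mappings, the modifications appear immediately visible to one another.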
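A small experiment illustrates the shared case. It is a sketch under the assumption of Linux and a writable /tmp; the function name `shared_sees_file_write` is invented. It maps one byte of a temporary file with MAP_SHARED, overwrites that byte through the file descriptor with pwrite, and reports what the mapping shows afterwards.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the byte seen through a MAP_SHARED mapping after the file
 * was overwritten via pwrite(), or -1 on error. */
int shared_sees_file_write(void)
{
    char path[] = "/tmp/shared-XXXXXX";
    int seen = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives on through fd */
    if (write(fd, "X", 1) == 1) {
        volatile char *p = mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            char before = p[0];         /* fault the page in */
            if (before == 'X' && pwrite(fd, "Y", 1, 0) == 1)
                seen = p[0];            /* same page-cache page */
            munmap((void *)p, 1);
        }
    }
    close(fd);
    return seen;
}
```

On Linux the function is expected to return 'Y', because the write and the mapping go through the same page.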
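The private case, by contrast, works like copy-on-write for pages: the file contents are loaded on first access, and a private page is allocated on first write. Since the timing of the load and of the write is not fixed, the visibility is not fixed either.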
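The one direction that is well defined for MAP_PRIVATE, namely that writes through the mapping never reach the file, can be checked as follows. This is a sketch assuming Linux and a writable /tmp; `private_write_stays_private` is an invented name.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writes 'Z' through a MAP_PRIVATE mapping and returns the byte that
 * is still stored in the file afterwards, or -1 on error. */
int private_write_stays_private(void)
{
    char path[] = "/tmp/private-XXXXXX";
    int result = -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "X", 1) == 1) {
        volatile char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            char in_file;
            p[0] = 'Z';                 /* triggers the private copy */
            if (p[0] == 'Z' && pread(fd, &in_file, 1, 0) == 1)
                result = in_file;       /* file content is untouched */
            munmap((void *)p, 1);
        }
    }
    close(fd);
    return result;
}
```

The mapping shows 'Z' while the file still contains 'X', so the function returns 'X'.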
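When the cp command copies a file, it checks whether the destination exists. If it does, cp adds the O_TRUNC flag when opening it for writing, so the open system call itself empties the file.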
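The effect of O_TRUNC can be seen before a single byte is written. The sketch below assumes Linux and a writable /tmp, and the function name is invented. It checks two things: the size is zero right after the open, and the inode number is unchanged, which means any process that already has the file mapped is looking at the very file being emptied.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if open(O_WRONLY | O_TRUNC) empties the existing file in
 * place (same inode, size 0), 0 if not, -1 on error. */
int trunc_open_empties_same_inode(void)
{
    char path[] = "/tmp/trunc-XXXXXX";
    int result = -1;
    struct stat before, after;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (write(fd, "old contents", 12) == 12 && fstat(fd, &before) == 0) {
        int wfd = open(path, O_WRONLY | O_TRUNC);   /* what cp does */
        if (wfd >= 0) {
            if (fstat(wfd, &after) == 0)
                result = (after.st_ino == before.st_ino &&
                          after.st_size == 0);
            close(wfd);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```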
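The key question now is: when a file's contents are modified through filesystem operations (open/write), do the memory mappings created with mmap notice the change, and if so, when and how?
The truncate code shows that the modification is visible immediately. The key data structures are the i_mmap red-black tree and the i_mmap_nonlinear list in each inode's address_space. That is, after an mmap, not only does each process know through its vmas which files it has mapped, but each file (inode) also needs a structure recording which vmas map its contents. Only with that information can the kernel find and operate on the vmas that map the file when its contents change.
In the truncate code, once part of a file has been cut away, every mapping of that part is unmapped, and the next access triggers a demand fault. The mechanism mirrors the per-page bookkeeping: each page also needs an rmap recording which vmas map it, except that the page-level records mainly serve to unmap the page when it is swapped out to disk.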
///@file: linux-3.12.6\fs\namei.c
/*
* Handle the last step of open()
*/
static int do_last(struct nameidata *nd, struct path *path,
                   struct file *file, const struct open_flags *op,
                   int *opened, struct filename *name)
{
    ///....
opened:
    error = open_check_o_direct(file);
    if (error)
        goto exit_fput;
    error = ima_file_check(file, op->acc_mode);
    if (error)
        goto exit_fput;
    if (will_truncate) {
        error = handle_truncate(file);
        if (error)
            goto exit_fput;
    }
```c
int ext3_setattr(struct dentry *dentry, struct iattr *attr)
{
    ///...
    if ((attr->ia_valid & ATTR_SIZE) &&
        attr->ia_size != i_size_read(inode)) {
        truncate_setsize(inode, attr->ia_size);
        ext3_truncate(inode);
    }
    ///...
}
/**
 * truncate_setsize - update inode and pagecache for a new file size
 * @inode: inode
 * @newsize: new file size
 *
 * truncate_setsize updates i_size and performs pagecache truncation (if
 * necessary) to @newsize. It will be typically be called from the filesystem's
 * setattr function when ATTR_SIZE is passed in.
 *
 * Must be called with inode_mutex held and before all filesystem specific
 * block truncation has been performed.
 */
void truncate_setsize(struct inode *inode, loff_t newsize)
{
    i_size_write(inode, newsize);
    truncate_pagecache(inode, newsize);
}
```
The function below walks every vma recorded in the address_space as mapping the file and removes the affected pages from each of those vmas.
/**
* unmap_mapping_range - unmap the portion of all mmaps in the specified address_space corresponding to the specified page range in the underlying file.
* @mapping: the address space containing mmaps to be unmapped.
* @holebegin: byte in first page to unmap, relative to the start of
* the underlying file. This will be rounded down to a PAGE_SIZE
* boundary. Note that this is different from truncate_pagecache(), which
* must keep the partial page. In contrast, we must get rid of
* partial pages.
* @holelen: size of prospective hole in bytes. This will be rounded
* up to a PAGE_SIZE boundary. A holelen of zero truncates to the
* end of the file.
* @even_cows: 1 when truncating a file, unmap even private COWed pages;
* but 0 when invalidating pagecache, don't throw away private data.
*/
void unmap_mapping_range(struct address_space *mapping,
        loff_t const holebegin, loff_t const holelen, int even_cows)
{
    struct zap_details details;
    pgoff_t hba = holebegin >> PAGE_SHIFT;
    pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
    /* Check for overflow. */
    if (sizeof(holelen) > sizeof(hlen)) {
        long long holeend =
            (holebegin + holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
        if (holeend & ~(long long)ULONG_MAX)
            hlen = ULONG_MAX - hba + 1;
    }
    details.check_mapping = even_cows? NULL: mapping;
    details.nonlinear_vma = NULL;
    details.first_index = hba;
    details.last_index = hba + hlen - 1;
    if (details.last_index < details.first_index)
        details.last_index = ULONG_MAX;
    mutex_lock(&mapping->i_mmap_mutex);
    if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
        unmap_mapping_range_tree(&mapping->i_mmap, &details);
    if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
        unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
    mutex_unlock(&mapping->i_mmap_mutex);
}
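The index arithmetic at the top of unmap_mapping_range can be replayed in isolation. The following is a simplified sketch, not kernel code: it assumes 4 KiB pages, uses invented names (`toy_page_range`, `toy_hole_to_pages`), and leaves out the overflow check. It shows in particular how a holelen of zero turns into "up to the end of the file".

```c
#include <limits.h>

#define TOY_PAGE_SHIFT 12                     /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

struct toy_page_range {
    unsigned long first_index;                /* first page to unmap */
    unsigned long last_index;                 /* last page to unmap  */
};

/* Convert a byte range (holebegin, holelen) into page indexes using
 * the same rounding as the excerpt above. */
struct toy_page_range toy_hole_to_pages(unsigned long long holebegin,
                                        unsigned long long holelen)
{
    struct toy_page_range r;
    unsigned long hba = (unsigned long)(holebegin >> TOY_PAGE_SHIFT);
    unsigned long hlen =
        (unsigned long)((holelen + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT);

    r.first_index = hba;
    r.last_index = hba + hlen - 1;            /* wraps when hlen == 0 */
    if (r.last_index < r.first_index)
        r.last_index = ULONG_MAX;             /* "to the end of the file" */
    return r;
}
```

For example, `toy_hole_to_pages(4096, 0)` covers pages 1 through ULONG_MAX, while `toy_hole_to_pages(0, 8193)` covers pages 0 through 2.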
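A small test program can verify that, for a file mapped with MAP_SHARED, a modification made through the filesystem is immediately visible through the mapping. In the transcript that follows the program, the command typed at the prompt (apparently `echo Y > X`) is interleaved with the output of the background process.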
tsecer@harry: cat truncate.after.mmap.cpp
/* For the size of the file. */
#include <sys/stat.h>
/* This contains the mmap calls. */
#include <sys/mman.h>
/* These are for error printing. */
#include <errno.h>
#include <string.h>
#include <stdarg.h>
/* This is for open. */
#include <fcntl.h>
#include <stdio.h>
/* For exit. */
#include <stdlib.h>
/* For the final part of the example. */
#include <ctype.h>
#include <unistd.h>
/* "check" checks "test" and prints an error and exits if it is
   true. */
static void
check (int test, const char * message, ...)
{
    if (test) {
        va_list args;
        va_start (args, message);
        vfprintf (stderr, message, args);
        va_end (args);
        fprintf (stderr, "\n");
        exit (EXIT_FAILURE);
    }
}
int main (int argc, const char *argv[])
{
    /* The file descriptor. */
    int fd;
    /* Information about the file. */
    struct stat s;
    int status;
    size_t size;
    /* The file name to open, taken from the command line. */
    const char * file_name;
    /* The memory-mapped thing itself. */
    const char * mapped;
    check (argc < 2, "usage: %s FILE", argv[0]);
    file_name = argv[1];
    /* Open the file for reading. */
    fd = open (file_name, O_RDONLY);
    check (fd < 0, "open %s failed: %s", file_name, strerror (errno));
    /* Get the size of the file. */
    status = fstat (fd, & s);
    check (status < 0, "stat %s failed: %s", file_name, strerror (errno));
    size = s.st_size;
    /* mmap rejects a zero length, so refuse empty files up front. */
    check (size == 0, "%s is empty, nothing to map", file_name);
    /* Memory-map the file. */
    mapped = (const char *)mmap (nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    check (mapped == MAP_FAILED, "mmap %s failed: %s",
           file_name, strerror (errno));
    /* Print the first byte once per second; the output changes as soon
       as the file is rewritten through the filesystem. */
    while(true)
    {
        printf("mapped %c\n", mapped[0]);
        sleep(1);
    }
    return 0;
}
tsecer@harry: g++ truncate.after.mmap.cpp
tsecer@harry: ./a.out ./X &
[1] 8931
tsecer@harry: mapped X
ecmapped X
ho mapped X
mapped X
echomapped X
mapped X
Ymapped X
mapped X
>mapped X
Xmapped X
tsecer@harry: mapped Y
mapped Y
mapped Y
mapped Y
mapped Y