Curve Cloud Native
high performance cloud native distributed block storage
• Curve File System (CurveFS)
• CurveFS: a high performance cloud native file system (based on CurveBS / S3-compatible storage)
Operator capability
CAPABILITY LEVEL | CURVE | COMMENT
BASIC INSTALL | Y (by Helm) | automated application provisioning and configuration management
SEAMLESS UPGRADES | Y (by Helm) | patch and minor version upgrades supported
FULL LIFECYCLE
failure domains of Kubernetes
• Support for public cloud environments
• Dashboard-driven configuration after minimal Curve install
Feature list for cluster
• CurveBS mirroring configured with CRDs
9 pages | 2.85 MB | 6 months ago

Curve for CNCF Main
performance cloud native distributed block storage
• Curve File System (CurveFS)
• CurveFS: a high performance cloud native file system
Use Cases
• Container
• Database
• Data apps (middleware/bigdata/ai)
Features
• RAFT for data consistency
• minor impact when a chunk server fails
• Precreated chunk files for volume space mapping
• high performance framework
• Use bthread (M bthreads mapped onto N pthreads)
Engine Comparison (vs. Ceph)
META MANAGEMENT | CURVE CHUNK SERVER | BLUESTORE
META | Precreate Chunk File Pool on ext4 | RocksDB
META OVERHEAD | without ext4 meta overhead | increased read/write amplification
21 pages | 4.56 MB | 6 months ago

Raft Engineering Practice in Curve Storage
Closure* done);
void remove_peer(const PeerId& peer, Closure* done);
void change_peers(const Configuration& new_peers, Closure* done);
StateMachine
void on_apply(::raft::Iterator& iter);
void on_sn
29 pages | 2.20 MB | 6 months ago

CurveFs User Permission System Survey
w_other" (this configuration item takes no value). See the official libfuse documentation: https://github.com/libfuse/libfuse#security-implications
# The file /etc/fuse.conf allows for the following parameters:
#
# user_allow_other - Using the allow_other mount
file access to the filesystem owner, so that all users (including root) can access the files.
allow_root This option is similar to allow_other but file access is
nt$ touch file1
wanghai01@pubbeta1-nostest2:/tmp/fsmount$ ls -l
total 0
-rw-r--r-- 0 wanghai01 neteaseusers 0 Jan 7 2079 file1
wanghai01@pubbeta1-nostest2:/tmp/fsmount$ echo "hello" > file1
wanghai0
33 pages | 732.13 KB | 6 months ago

Open Flags Survey
FASYNC, O_TMPFILE
Conclusion
References
open interface prototype
# man page
open, openat, creat - open and possibly create a file
#include
int open(const char *pathname, int flags);
int open(const char *pathname, int
File creation flags affect only the open operation itself; file status flags affect the read and write operations that follow.
file creation flags: O_CLOEXEC, O_CREAT, O_DIRECTORY, O_EXCL, O_NOCTTY, O_NOFOLLOW, O_TMPFILE, and O_TRUNC
file status flags: O_APPEND, FASYNC, O_DIRECT, O_SYNC(O_DSYNC)
O_NOCTTY: ... all input will affect the user's process.
O_TRUNC: if the file exists, is a regular file, and the caller has write permission on it, this flag truncates the file length to 0.
O_APPEND: append mode; every write first moves the file offset to the end of the file (the offset update and the write are performed as one atomic operation).
O_NONBLOCK, O_NDELAY: the results produced by O_NONBLOCK and O_NDELAY
23 pages | 524.47 KB | 6 months ago

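The O_APPEND and O_TRUNC behaviour described in this excerpt can be observed with a short script. This is a minimal sketch using Python's `os.open`, which passes the flags through to the underlying open call; the function name `demo_append_and_trunc` and the temporary file are illustrative and are not taken from the surveyed document.

```python
import os
import tempfile


def demo_append_and_trunc():
    """Return (file content after an O_APPEND write, file size after an O_TRUNC open)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"hello")
        os.close(fd)

        # O_APPEND: the write lands at end of file even though we seek to 0 first.
        fd = os.open(path, os.O_WRONLY | os.O_APPEND)
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, b" world")
        os.close(fd)
        with open(path, "rb") as f:
            appended = f.read()

        # O_TRUNC: opening an existing regular file for writing cuts it to length 0.
        fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
        os.close(fd)
        return appended, os.path.getsize(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo_append_and_trunc())
```

The seek before the second write is there only to show that O_APPEND overrides the current file offset.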
CurveBS IO Processing Flow
virtual block device to a file. For example, block device /dev/sda corresponds to file /foo/bar in CurveBS
2. The address space of the block device /dev/sda maps to chunks of the file in the system. For example
Each file (/foo/bar) contains chunks scattered across the storage nodes. ChunkServer provides 4KB random read/write capability to support 4KB-aligned read/write on block devices.
CurveBS file structure
look at the metadata for a file in CurveBS.
1. A file in CurveBS consists of chunks. The default chunk size is 16MB. If the file maps directly to chunks, a 4TB file will consist of 256K (262,144) chunks
13 pages | 2.03 MB | 6 months ago

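The chunk arithmetic in this excerpt can be checked directly. The sketch below assumes binary units (16 MB read as 16 MiB, 4 TB as 4 TiB) and a flat file-to-chunk mapping; the helper names `chunk_count` and `locate` are illustrative, not CurveBS APIs.

```python
from typing import Tuple

CHUNK_SIZE = 16 * 1024 * 1024  # default chunk size from the text, read as 16 MiB


def chunk_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to cover file_size bytes, rounded up."""
    return (file_size + chunk_size - 1) // chunk_size


def locate(offset: int, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Map a byte offset in the file to (chunk index, offset inside that chunk)."""
    return divmod(offset, chunk_size)


if __name__ == "__main__":
    four_tib = 4 * 1024 ** 4
    print(chunk_count(four_tib))             # 262144, i.e. 256 Ki chunks
    print(locate(16 * 1024 * 1024 + 4096))   # (1, 4096)
```

Under these assumptions a 4 TiB file needs 2^42 / 2^24 = 2^18 = 262,144 chunks.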
CurveFS Client File and Directory Deletion Design
request from
* the kernel even after calls to unlink, rmdir or (when
* overwriting an existing file) rename. Filesystems must handle
* such requests properly and it is recommended to defer removal
closely by forget
* unless the file or directory is open, in which case the
* kernel issues forget only after the release or releasedir
* calls.
*
* Note that if a file system will be exported over
unmount the lookup count for all inodes implicitly drops
* to zero. It is not guaranteed that the file system will
* receive corresponding forget messages for the affected
* inodes
15 pages | 325.42 KB | 6 months ago

OID CND Asia Slide: CurveFS
○ apps bundled with data locations
● Requirements for elastic block storage
● Requirements for file system
open-source storage
● Requirements
○ Cloud Native
○ Easy operation and maintenance
○ High
● CopySet pre-allocation algorithm
● Raft consistency protocol
High performance
● pre-created file pool
● data striping like RAID
● Zero data copy
● RDMA
Cloud Native
Cluster topology
The physical
on a physical server
Curve metadata organization
Curve maps virtual block devices to files
Each file contains chunks scattered across storage nodes in the cluster
Chunkservers are grouped by failure
24 pages | 3.47 MB | 6 months ago

CurveFS Client High-Level Design
+retrieve_reply +forget_multi +flock +fallocate +readdirplus +copy_file_range +lseek
Key interface analysis
init
void (*init) (void *userdata, struct fuse_conn_info *conn);
According to
void (*write) (fuse_req_t req, fuse_ino_t ino, const char *buf, size_t size, off_t off, struct fuse_file_info *fi);
First, look up the corresponding inode structure in the cache by inode id;
if the inode is not in the inode cache, obtain from the MDS the copyset where the inode resides and the metaserver structure, and cache them;
check whether the inode structure has space allocated for the requested [off, size] range: if the range is unallocated or only partly allocated, call the space allocator to allocate space, then update the inode structure (including file length) according to the allocator's result;
inode changes must be persisted to the underlying layer and the local cache updated;
call the curve client interface to write the data for [offset, len] on the curve volume.
(This raises a question: whether from fus
11 pages | 487.92 KB | 6 months ago

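The allocation check in the write path (is the requested [off, size] range already backed by space?) comes down to finding the uncovered parts of a range. This is a minimal sketch under stated assumptions: the inode's allocated space is modelled as a list of (file offset, length) extents, which is a hypothetical layout, and `unallocated_ranges` is an illustrative name rather than a CurveFS function.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (file offset, length); hypothetical layout


def unallocated_ranges(extents: List[Extent], off: int, size: int) -> List[Extent]:
    """Return the parts of [off, off + size) that no extent covers."""
    gaps = []
    cursor = off            # first byte not yet known to be covered
    end = off + size
    for ext_off, ext_len in sorted(extents):
        ext_end = ext_off + ext_len
        if ext_end <= cursor:
            continue        # extent lies entirely before the uncovered part
        if ext_off >= end:
            break           # extent starts after the request; later ones do too
        if ext_off > cursor:
            gaps.append((cursor, ext_off - cursor))
        cursor = max(cursor, ext_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end - cursor))
    return gaps


if __name__ == "__main__":
    print(unallocated_ranges([(0, 4096), (8192, 4096)], 2048, 8192))
```

An empty result means the write can go straight to the volume; any returned gap is what a client would ask the space allocator to fill before updating the inode.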
Curve File System Space Allocation Scheme
Curve file system space allocation scheme (block-based scheme, implemented)
Background
Space allocation features of local file systems
Locality
Delayed allocation / Allocate-on-flush
Inline file/data
Space allocation
Overall design
Space allocation flow
Special cases
Space reclamation
Small file handling
Concurrency issues
File system expansion
Interface design
RPC interface
Space allocator interface
Background
According to , the file system, based on the current block
space.
Delayed allocation / Allocate-on-flush
Before sync/flush, accumulate as many file data blocks as possible and only then allocate space; this improves locality and also reduces disk fragmentation.
Inline file/data
Small files of a few hundred bytes get no separate disk space; their data is stored directly in the file's metadata.
Given the local file system features above, Curve file system allocation needs to focus on .
Locality
Although Curve is a distributed
extent to record it, (0, 100MiB, 2MiB).
So if a file's repeated space requests can be given contiguous address space, the number of extents recorded in the inode drops sharply, which reduces the metadata volume of the whole file system.
Delayed allocation and inline file both need cooperation from the fuse client side.
Space allocation
Overall design
The allocator has a two-layer structure:
The first layer is a bitmap in which each bit marks whether its corresponding block of space (4MiB in this example; the size is configurable) has been allocated.
11 pages | 159.17 KB | 6 months ago

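The first allocator layer described in this excerpt, one bit per fixed-size block, can be sketched as below. Only the bitmap layer is modelled because the excerpt ends before the second layer is described. The class name `BitmapAllocator`, its methods and the first-fit search for a contiguous run are assumptions made for illustration, not the CurveFS implementation.

```python
from typing import Optional


class BitmapAllocator:
    """One flag per block; True means the block has been handed out."""

    def __init__(self, total_bytes: int, block_size: int = 4 * 1024 * 1024):
        self.block_size = block_size  # 4 MiB as in the text; configurable
        self.used = [False] * (total_bytes // block_size)

    def allocate(self, nblocks: int = 1) -> Optional[int]:
        """Mark the first run of nblocks free blocks as used.

        Returns the byte offset of the run, or None if no such run exists.
        """
        if nblocks <= 0:
            return None
        run = 0
        for i, bit in enumerate(self.used):
            run = 0 if bit else run + 1
            if run == nblocks:
                start = i - nblocks + 1
                for j in range(start, i + 1):
                    self.used[j] = True
                return start * self.block_size
        return None

    def free(self, offset: int, nblocks: int = 1) -> None:
        """Clear the bits for nblocks blocks starting at byte offset."""
        start = offset // self.block_size
        for j in range(start, start + nblocks):
            self.used[j] = False


if __name__ == "__main__":
    alloc = BitmapAllocator(16 * 1024 * 1024)  # four 4 MiB blocks
    print(alloc.allocate(2), alloc.allocate(1))
```

Handing out contiguous runs ties back to the locality point in the excerpt: a file whose requests are served from adjacent blocks needs fewer extents in its inode.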