This forum invites authors who have recently won Best Paper Awards at CCF-A international conferences for storage-related research to share cutting-edge academic results and research insights. It focuses on key techniques for the convergence of emerging storage applications and new storage hardware, offering in-depth technical analysis, interactive discussion, and experience sharing grounded in frontier academic work.
Yu Hua
Huazhong University of Science and Technology
Professor at Huazhong University of Science and Technology, recipient of the National Science Fund for Distinguished Young Scholars, and a CCF Distinguished Member and Distinguished Speaker. His long-term research covers emerging storage devices, high-performance storage systems, and secure architectures, with a focus on advancing large-memory systems toward persistence, atomicity, and intelligence, forming an integrated device-system-security technology stack. He has published papers at OSDI, ASPLOS, MICRO, FAST, and other venues. He has served as program chair/vice-chair of international conferences including ICDCS 2021 and ACM APSys 2019, served on the program committees of OSDI, SIGCOMM, FAST, NSDI, MICRO, and ASPLOS, and is an editorial board member of ACM Transactions on Storage. His research has received three provincial- and ministerial-level science and technology awards, including the First Prize in Natural Science from the Ministry of Education, as well as four best paper awards from international conferences and journals including FAST.
Yinjin Fu
Sun Yat-sen University
Yinjin Fu is an associate professor and graduate advisor at Sun Yat-sen University, a distinguished expert under the Pengcheng Peacock Program, and a CCF Distinguished Member. His research focuses on distributed storage, cloud computing, and big data protection. He has led multiple projects, including a sub-task of the National Key R&D Program, grants from the National Natural Science Foundation of China and the Jiangsu Provincial Natural Science Foundation, and the Huawei Huyanglin Fund. He has published over 50 papers in mainstream journals and conferences such as ACM/IEEE Transactions, FAST, Middleware, MSST, DATE, ICCD, and Cluster, holds more than 10 granted Chinese invention patents, and has published 3 textbooks. He serves as a standing committee member of the CCF Technical Committee on Information Storage, an executive committee member of the CCF Technical Committee on System Software, and an editorial board member of Computer Engineering.
Mengwei Xu
Beijing University of Posts and Telecommunications
Associate professor and doctoral advisor at the School of Computer Science, Beijing University of Posts and Telecommunications. He received his bachelor's and Ph.D. degrees from Peking University, was selected for the Young Elite Scientists Sponsorship Program of the China Association for Science and Technology and the Beijing Nova Program, and has been a visiting scholar at Microsoft Research Asia (StarTrack Program) and Purdue University. His research focuses on mobile/edge computing and system software, with results published at top venues including ACM MobiCom/MobiSys/ASPLOS, IEEE TMC, and the Journal of Software; he received the USENIX ATC 2024 Best Paper Award. He leads multiple projects funded by the National Natural Science Foundation of China and the Ministry of Science and Technology's Key R&D Program.
Talk: An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise (USENIX ATC 2024 Best Paper Award)
Abstract: Developed for over 30 years, Linux has already become the computing foundation for today's digital world; from gigantic, complex mainframes (e.g., supercomputers) to cheap, wimpy embedded devices (e.g., IoTs), countless applications are built on top of it. Yet, such an infrastructure has been plagued by numerous memory and concurrency bugs since the day it was born, because the C language permits many rogue memory operations. A recent project, Rust-for-Linux (RFL), has the potential to address Linux's safety concerns once and for all: by embracing Rust's static ownership and type checkers in the kernel code, the kernel may finally be free from memory and concurrency bugs without hurting its performance. While RFL has gradually matured and even been merged into the Linux mainline, it is rarely studied, and it remains unclear whether it has indeed reconciled the safety and performance dilemma for the kernel. To this end, we conduct the first empirical study on RFL to understand its status quo and benefits, especially how Rust fuses with C and whether the fusion assures driver safety without overhead. We collect and analyze 6 key RFL drivers, which involve hundreds of issues and PRs, thousands of GitHub commits and mail exchanges on the Linux mailing list, as well as over 12K discussions on Zulip. We find that while Rust mitigates kernel vulnerabilities, fully eliminating them is beyond Rust's capability; what is more, if not handled properly, its safety assurance can cost developers dearly in terms of both runtime overhead and development effort.
Chenxi Wang
Institute of Computing Technology, Chinese Academy of Sciences
Associate professor and doctoral advisor. His research focuses on cloud computing system software for data centers that spans programming languages, runtimes, and operating systems. His work has been published at top systems conferences such as OSDI, PLDI, and NSDI, and in top systems journals such as TOCS and TDSC, and has been recognized with the OSDI 2022 Best Paper Award, the ACM ChinaSys Rising Star Award, and selection into a national-level young talent program. He has also served as a reviewer for international venues including ASPLOS, NSDI, ATC, and TACO.
Talk: MemLiner: Lining up Tracing and Application for a Far-Memory-Friendly Runtime (OSDI 2022 Best Paper Award)
Abstract: Far-memory techniques that enable applications to use remote memory are increasingly appealing in modern data centers, supporting applications' large memory footprint and improving machines' resource utilization. Unfortunately, most far-memory techniques focus on OS-level optimizations and are agnostic to managed runtimes and garbage collection (GC) underneath applications written in high-level languages. With different object-access patterns from applications, GC can severely interfere with existing far-memory techniques, breaking remote memory prefetching algorithms and causing severe local-memory misses. We developed MemLiner, a runtime technique that improves the performance of far-memory systems by "lining up" memory accesses from the application and the GC so that they follow similar memory access paths, thereby (1) reducing the local-memory working set and (2) improving remote-memory prefetching through simplified memory access patterns. We implemented MemLiner in two widely-used GCs in OpenJDK: G1 and Shenandoah. Our evaluation with a range of widely-deployed cloud systems shows MemLiner improves applications' end-to-end performance by up to 2.5x.
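The core observation in the MemLiner abstract — that GC tracing which follows the application's access path shrinks the local-memory working set — can be illustrated with a toy cache simulation. This is a hypothetical model, not MemLiner's implementation: the page trace, cache capacity, and function names below are all invented for illustration.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count local-memory misses for a page-access trace under an LRU cache."""
    cache, misses = OrderedDict(), 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used page
            cache[page] = page
    return misses

# The application touches pages 0..31 sequentially, twice.
app = list(range(32)) * 2

# Misaligned runtime: GC tracing scans the heap in the opposite order while
# the application runs, so the interleaved trace follows two divergent paths.
misaligned = [p for a, g in zip(app, reversed(app)) for p in (a, g)]

# "Lined-up" runtime: tracing follows the application's own access path.
aligned = [p for a in app for p in (a, a)]

cap = 8  # small local memory: most pages live in far memory
print(lru_misses(aligned, cap), lru_misses(misaligned, cap))
```

In this toy model the misaligned trace keeps roughly twice as many distinct pages inside each LRU window, so it misses far more often; MemLiner achieves the "aligned" behavior inside real collectors (G1 and Shenandoah) rather than by reordering a recorded trace.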
Guangyu Sun
Peking University
Tenured associate professor. His research covers the design and automation of domain-specific architectures, including energy-efficient computing architectures, emerging memory architectures, and cross-layer "application-architecture-circuit-device" co-design. In recent years he has published over 100 papers at high-quality conferences and in journals such as ISCA, MICRO, HPCA, DAC, TCAD, and TC, receiving 4 best paper awards and 3 best paper nominations. He has received the CCF-IEEE CS Young Computer Scientist Award and the DAC Under-40 Innovators Award, and was selected as a "Young Scientist" of the Beijing Academy of Artificial Intelligence.
Talk: DIMM-Link: Enabling Efficient Inter-DIMM Communication for Near-Memory Processing (HPCA 2023 Best Paper Award)
Abstract: DIMM-based near-memory processing architectures (DIMM-NMP) have received growing interest from both academia and industry. They have the advantages of large memory capacity, low manufacturing cost, high flexibility, compatible form factor, etc. However, inter-DIMM communication (IDC) has become a critical obstacle for generic DIMM-NMP architectures because it involves costly forwarding transactions through the host CPU. Recent research has demonstrated that, for many applications, the overhead induced by IDC may even offset the performance and energy benefits of near-memory processing. To tackle this problem, we propose DIMM-Link, which enables high-performance IDC in DIMM-NMP architectures and supports seamless integration with existing host memory systems. It adopts bidirectional external data links to connect DIMMs, via which point-to-point communication and inter-DIMM broadcast are efficiently supported in a packet-routing way. We present the full-stack design of DIMM-Link, including the hardware architecture, interconnect protocol, system organization, routing mechanisms, optimization strategies, etc. Comprehensive experiments on typical data-intensive tasks demonstrate that the DIMM-Link-equipped NMP system can achieve a 5.93x average speedup over the 16-core CPU baseline. Compared to other IDC methods, DIMM-Link outperforms MCN, AIM, and ABC-DIMM by 2.42x, 1.87x, and 1.77x, respectively. More importantly, DIMM-Link fully considers the implementation feasibility and system integration constraints, which are critical for designing NMP architectures based on modern DDR4/DDR5 DIMMs.
Mingkai Dong
Shanghai Jiao Tong University
Assistant research professor. His research focuses on operating systems, file and storage systems, and DNA storage, with work published at top international systems conferences including SOSP, OSDI, FAST, and USENIX ATC. He has received the SOSP 2023 Best Paper Award and the Huawei "Spark Award", and was named a 2022 Honorary Scholar of the Intel China Academic Talent Program. SoupFS, the high-performance persistent-memory file system he developed, has been included in the openEuler innovation edition; EROFS, developed in collaboration with Huawei, became the first file system independently developed in China to be merged into the Linux kernel mainline, and is now the official file system for Android system partitions, running on hundreds of millions of smartphones.
Talk: TreeSLS: A Tree-structured Microkernel with Efficient Whole-system Persistence on NVM (SOSP 2023 Best Paper Award)
Abstract: Whole-system persistence promises simplified application deployment and near-instantaneous recovery. This can be implemented using a single-level store (SLS) through periodic checkpointing of ephemeral state to persistent devices. However, traditional SLSs suffer from two main issues, checkpointing efficiency and external synchrony, which are critical for low-latency services with persistence needs.
In this paper, we note that the decentralized state of microkernel-based systems can be exploited to simplify and optimize state checkpointing. To this end, we propose TreeSLS, a whole-system persistent microkernel that simplifies whole-system state maintenance to a capability tree and a failure-resilient checkpoint manager. TreeSLS further exploits emerging non-volatile memory to minimize checkpointing pause time by eliminating the distinction between ephemeral and persistent devices. With efficient state maintenance, TreeSLS further proposes delayed external visibility to provide transparent external synchrony with little overhead. Evaluation on microbenchmarks and real-world applications (e.g., Memcached, Redis, and RocksDB) shows that TreeSLS can complete a whole-system persistence in around 100 μs and even take a checkpoint every 1 ms with reasonable overhead to applications.
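The capability-tree checkpointing idea above can be sketched as an incremental checkpoint over a tree with dirty tracking: only state that changed since the last checkpoint is copied, keeping each pause short. This is a toy single-process model, not TreeSLS's kernel implementation; `CapNode`, the `shadow` dictionary (standing in for the NVM-resident checkpoint), and all names are invented for illustration.

```python
import copy

class CapNode:
    """A node in a capability tree; `dirty` marks state changed since the last checkpoint."""
    def __init__(self, name, state=None):
        self.name, self.state = name, state or {}
        self.children, self.dirty = [], True  # fresh nodes are dirty

    def update(self, key, value):
        self.state[key] = value
        self.dirty = True

def checkpoint(node, shadow):
    """Incrementally mirror the tree into a persistent shadow copy.
    Returns how many nodes were actually copied this round."""
    copied = 0
    if node.dirty:
        shadow[node.name] = copy.deepcopy(node.state)
        node.dirty = False
        copied = 1
    for child in node.children:
        copied += checkpoint(child, shadow)
    return copied

# Build a tiny capability tree: root -> two tasks, each with an IPC endpoint.
root = CapNode("root")
for t in ("task-a", "task-b"):
    task = CapNode(t, {"pc": 0})
    task.children.append(CapNode(t + "/endpoint"))
    root.children.append(task)

shadow = {}                       # stands in for the NVM checkpoint area
print(checkpoint(root, shadow))   # first checkpoint copies all 5 nodes
root.children[0].update("pc", 42)
print(checkpoint(root, shadow))   # next round copies only the 1 dirty node
```

The second checkpoint's cost is proportional to the amount of changed state, which is why periodic checkpoints every millisecond can stay cheap; real TreeSLS additionally handles failure resilience and delayed external visibility.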
Erci Xu
Alibaba Cloud
His long-term research covers distributed storage systems and hardware/software reliability. His work has been published at multiple top international conferences, including USENIX OSDI, FAST, ATC, and ACM EuroSys. He is a two-time winner of the USENIX FAST Best Paper Award (FAST '23, FAST '24) and a recipient of the 2023 ACM SIGOPS China Rising Star Award.
Talk: What's the Story in EBS Glory: Evolutions and Lessons in Building Cloud Block Store (FAST 2024 Best Paper Award)
Abstract: In this paper, we qualitatively and quantitatively discuss the design choices, production experience, and lessons in building the Elastic Block Storage (EBS) at Alibaba Cloud over the past decade. To cope with hardware advancement and users' demands, we shift our focus from design simplicity in EBS1 to high performance and space efficiency in EBS2, and finally to reducing network traffic amplification in EBS3.
In addition to the architectural evolutions, we also summarize development lessons and experiences under four topics: (i) achieving high elasticity in latency, throughput, IOPS, and capacity; (ii) improving availability by minimizing the blast radius of individual, regional, and global failure events; (iii) identifying the motivations and key tradeoffs in various hardware offloading solutions; and (iv) identifying the pros/cons of alternative solutions and explaining why seemingly promising ideas would not work in practice.
Pengfei Li
Huazhong University of Science and Technology
His research focuses on learned indexes, disaggregated memory systems, and distributed file systems. As first author, he has published papers at international conferences and in journals including FAST, VLDB, ACM TOS, IEEE TKDE, Middleware, and JCST, and has received the FAST 2023 Best Paper Award and the 2023 ACM ChinaSys Doctoral Dissertation Award.
Talk: ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems (FAST 2023 Best Paper Award)
Abstract: Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide range query service via the high-performance network. However, existing ordered KVs fail to work well on disaggregated memory systems, because they either consume multiple network roundtrips to search the remote data or rely heavily on memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to coalesce the ordered KV store in disaggregated systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme to dissociate model retraining from data modification operations by adding a bias and some data-movement constraints to learned models. Based on this operation decoupling, data modifications are directly executed in compute nodes via one-sided RDMA verbs with high scalability. Model retraining is hence removed from the critical path of data modification and asynchronously executed in memory nodes using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on static workloads, and significantly improves performance on dynamic workloads, by up to 2.2x over state-of-the-art schemes on disaggregated memory systems. We have released the open-source code for public use on GitHub.
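The retraining-decoupled idea in the ROLEX abstract can be sketched with a single-node toy learned segment: a linear model predicts a key's position, and its measured error bound is padded with a bias so that newly inserted keys remain findable without retraining the model on the critical path. This is an illustrative sketch only — `LearnedSegment` and `eps_bias` are invented names, and the real system operates over one-sided RDMA with fixed-size leaves rather than an in-memory Python list.

```python
import bisect

class LearnedSegment:
    """A linear model over sorted keys with a guaranteed error bound (eps).
    Lookups only probe positions [pred - eps, pred + eps], which in a
    disaggregated setting maps to a single bounded-size one-sided read."""
    def __init__(self, keys, eps_bias):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.intercept = -self.slope * lo
        # Measured model error, padded with a bias so future inserts can
        # shift positions without forcing a retrain (retraining decoupling).
        err = max(abs(self.predict(k) - i) for i, k in enumerate(self.keys))
        self.eps = int(err) + eps_bias

    def predict(self, key):
        return self.slope * key + self.intercept

    def lookup(self, key):
        pred = int(self.predict(key))
        lo = max(0, pred - self.eps)
        hi = min(len(self.keys), pred + self.eps + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)  # bounded probe window
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        # Inserts stay legal as long as the (stale) model still lands within
        # eps of every key's shifted position; retraining happens elsewhere,
        # off the critical path.
        bisect.insort(self.keys, key)

seg = LearnedSegment(range(0, 1000, 10), eps_bias=4)
seg.insert(55)                          # found via the stale model, no retrain
print(seg.lookup(55), seg.lookup(56))   # True False
```

Once enough inserts accumulate to threaten the padded bound, the model is retrained asynchronously; in ROLEX this runs on the memory nodes' dedicated compute, off the data-modification path.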
Copyright: China Computer Federation. Technical support email: conf_support@ccf.org.cn