RainbowFS final workshop

RainbowFS is a joint project to develop an industry-scale geo-distributed multi-master file system.

Accordingly, RainbowFS: studies relevant approaches to consistency; develops principled tools for the design, verification, deployment and monitoring of geo-scale systems; and applies the above to developing a correct and efficient geo-scale, petabyte-sized storage system. The resulting file system, ElmerFS, is designed as a CRDT and runs on top of AntidoteDB.

Registration

The workshop takes place 28 March 2022 at Sorbonne-Université–LIP6, Paris, France. Online attendance will be supported. Registration is free but mandatory. Please register here: https://framaforms.org/registration-for-rainbowfs-final-workshop-1644573244.

Programme

Invited speakers are marked with an asterisk (*).

CRDTs
09:00  Keynote: Martin Kleppmann* (U. of Cambridge) — Automerge: CRDTs meet version control
09:40  Romain Vaillant (Scality) — Design & Implementation of ElmerFS
10:10  Thuy Linh Nguyen (U. Grenoble–Alpes) — Cloudal: a tool for managing experiment campaigns on cloud platforms
10:40  Demo of ElmerFS

AntidoteDB
11:10  Annette Bieniusa* (TU Kaiserslautern) — AntidoteDB: past, present and future
11:40  Saalik Hatia (Sorbonne-Université Delys) — Correct backend for a highly-available database
12:10  Ayush Pandey (TUK and S-U) — Designing and Implementing a Persistent Cache for a CRDT Data Store
12:35  Lunch
13:35  Demo of Cloudal

Consistency
14:00  Sreeja Nair* (ZettaScale Technology) — Consistency in Zenoh, an edge data fabric
14:30  Pierre Sutra (Télécom SudParis) — Modern techniques for data availability and durability
15:00  Sébastien Monnet & Étienne Mauffret (U. Savoie–Mont-Blanc) — Towards a consistency-aware data placement mechanism
15:30  Nuno Preguiça* (U. NOVA de Lisboa) — Weak Consistency Semantics for Insecure Settings
16:00  Demo of ElmerFS

Looking to the future
16:30  James Arthur* (Vaxine) — Vaxine: building a rich-CRDT database on AntidoteDB
17:00  Brad King (Scality) — Scality and RainbowFS, the Search for the Grail Continues
17:30  Marc Shapiro & Ilyas Toumlilt (S-U and Inria) — Power to the edge! Data-oriented collaboration at the edge with Concordant.io
18:00  End

Abstracts — invited speakers

Download all slides as a zip archive

James Arthur (Vaxine.io) — Vaxine: building a rich-CRDT database on AntidoteDB

Slides (PDF)

Geo-distributed apps need a geo-distributed database that’s fast for both reads and writes. CRDTs unlock low write-path latency. However, commutativity is not enough to guarantee data integrity. Vaxine is a rich-CRDT database system, built on AntidoteDB, that aims to provide both commutativity and invariant safety.
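To see why commutativity alone does not protect application invariants, consider a minimal sketch (our illustration, not Vaxine's API or data types): a PN-counter merges concurrent updates to the same total everywhere, yet an invariant such as "stock >= 0" can still be violated once concurrent, locally-valid decrements meet.

```python
# Illustrative sketch (not Vaxine's API): a PN-counter CRDT is
# commutative -- concurrent updates merge to the same value on every
# replica -- yet that says nothing about an invariant like "stock >= 0".

class PNCounter:
    """Increment/decrement counter; merge takes per-replica maxima."""
    def __init__(self):
        self.incs = {}  # replica id -> total increments
        self.decs = {}  # replica id -> total decrements

    def update(self, replica, delta):
        if delta >= 0:
            self.incs[replica] = self.incs.get(replica, 0) + delta
        else:
            self.decs[replica] = self.decs.get(replica, 0) - delta

    def merge(self, other):
        for src, dst in ((other.incs, self.incs), (other.decs, self.decs)):
            for r, v in src.items():
                dst[r] = max(dst.get(r, 0), v)

    def value(self):
        return sum(self.incs.values()) - sum(self.decs.values())

# Two replicas start from stock = 1 and each locally sells one item.
paris, tokyo = PNCounter(), PNCounter()
paris.update("paris", 1)     # initial stock, replicated to both sites
tokyo.merge(paris)
paris.update("paris", -1)    # sale at Paris: locally valid (1 - 1 = 0)
tokyo.update("tokyo", -1)    # concurrent sale at Tokyo: also locally valid
paris.merge(tokyo)
print(paris.value())         # -1: the merged state violates "stock >= 0"
```

Closing this gap, i.e. preserving such invariants without giving up commutativity, is precisely what a rich-CRDT system aims at.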

In this talk, James introduces the Vaxine project: its rationale, architecture and objectives. This includes how Vaxine builds on and aims to support Antidote, and the design challenges of developing a production-ready rich-CRDT system that's accessible to general developers.

Bio:

James is a software developer and technology entrepreneur. Prior to Vaxine, he was co-founder and CTO of synthetic data company Hazy, edge AI company LGN, deep tech venture builder Post Urban and digital manufacturing company Opendesk.

Annette Bieniusa (TU Kaiserslautern) — AntidoteDB: past, present and future

Slides (PDF)

The open-source project AntidoteDB originated in the European project “SyncFree”. In a collaborative effort of several project partners, we investigated how to enrich a CRDT-based key-value store with transactions that provide consistent snapshot reads and atomic updates. Since then, AntidoteDB has provided the basis for a variety of research on CRDT processing; it inspired, and served as a platform for, numerous PhD, Master and Bachelor students. Since the first commit in March 2014, the code base has undergone several significant improvements and extensions, from the storage and caching layer to the different client-facing APIs and corresponding libraries. In my talk, I will give an overview of AntidoteDB’s past, describe its current state and discuss how we plan to evolve it, both as a research platform and for production environments.
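As a loose illustration of what those transactions add (a toy sketch, not AntidoteDB's actual API or protocol), the following shows the two guarantees named above: readers work on a stable snapshot, and a transaction's updates become visible atomically, together or not at all.

```python
# Toy sketch only (not AntidoteDB's API): atomic multi-key updates plus
# snapshot reads on top of a simple key-value store of counters.

import copy

class ToyStore:
    def __init__(self):
        self.committed = {}  # key -> value (here: plain counters)

    def snapshot(self):
        # Readers get an immutable copy of committed state.
        return copy.deepcopy(self.committed)

    def commit(self, updates):
        # All updates of one transaction become visible together.
        for key, delta in updates.items():
            self.committed[key] = self.committed.get(key, 0) + delta

store = ToyStore()
snap = store.snapshot()
store.commit({"likes:post1": 1, "likes:user-total": 1})  # atomic pair
print(snap.get("likes:post1", 0))        # 0: the old snapshot is unaffected
print(store.snapshot()["likes:post1"])   # 1: both updates appear together
```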

Bio:
Annette Bieniusa is a senior researcher and lecturer at the TU Kaiserslautern, working on concurrent and distributed programming, with a focus on (geo-)replication, synchronization, and programming language concepts. Annette is the project lead of AntidoteDB, an open-source planet-scale, highly available, transactional database based on CRDTs.

Martin Kleppmann (U. of Cambridge) — Automerge: CRDTs meet version control

Slides (PDF)

Most collaboration software today assumes a Google-Docs-like model in which each collaborator’s edits are applied to the shared document or data structure as quickly as possible, often keystroke by keystroke. While this real-time collaboration model is great in some situations, it is not always appropriate. Sometimes, a user will want to work in isolation on a separate version of a document for a while, and share their updates with collaborators only when they are ready. Users might have multiple versions of a document side-by-side, which may or may not be merged later.

Software developers often use version control systems such as Git to enable such branching and merging workflows, to compare versions of a document, and to inspect the editing history. However, while Git works okay for plain text files, its support for more complex file formats (e.g. spreadsheets, graphics, CAD files) is poor: if two users modify the same file on different branches, resolving that merge conflict is left as a manual task for the user.

But we already have an excellent tool for automatically merging concurrent updates to a data structure: CRDTs! In this talk I will introduce our work-in-progress research on Automerge, which aims to bring together the world of Git-like version control and the world of CRDTs.
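For a flavour of what "Git-like merging, but automatic" could look like, here is an illustrative toy using per-field last-writer-wins registers (our simplification, not Automerge's actual algorithm or API):

```python
# Illustrative toy only (not Automerge's algorithm or API): each field
# carries a (counter, actor) Lamport tag; merging two branches keeps,
# per field, the entry with the highest tag, so merges are automatic
# and deterministic -- no manual conflict resolution.

def merge(doc_a, doc_b):
    """Field-wise merge: the highest (counter, actor) tag wins."""
    merged = {}
    for field in doc_a.keys() | doc_b.keys():
        candidates = [d[field] for d in (doc_a, doc_b) if field in d]
        merged[field] = max(candidates, key=lambda entry: entry[0])
    return merged

base = {"title": ((1, "alice"), "Draft")}
branch_a = dict(base)
branch_a["title"] = ((2, "alice"), "Final report")  # Alice edits her branch
branch_b = dict(base)
branch_b["status"] = ((2, "bob"), "reviewed")       # Bob edits his branch

merged = merge(branch_a, branch_b)
print(merged["title"][1], "/", merged["status"][1])  # Final report / reviewed
```

Both branches' edits survive the merge; real CRDTs for text and rich documents refine this idea far beyond last-writer-wins.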

Bio:

Dr. Martin Kleppmann is a research fellow and affiliated lecturer at the University of Cambridge, and author of the bestselling book “Designing Data-Intensive Applications” (O'Reilly Media). He works on distributed systems and security, in particular collaboration software and CRDTs. Previously he was a software engineer and entrepreneur, co-founding and selling two startups, and working on large-scale data infrastructure at LinkedIn.

Sreeja Nair (ZettaScale Technology) — Consistency in Zenoh, an edge data fabric

Slides (PDF)

Zenoh is a middleware for real-time systems that enables dependable, high-performance and scalable data exchange. It unifies data in motion, data at rest and computations, blending traditional pub/sub with geo-distributed storage, queries and computation, while retaining the time and space efficiency required by the industrial internet. It is adapted to the needs of industries such as aeronautics, automotive and robotics. Storages in Zenoh subscribe to updates through the network; plugins connect them to different storage technologies, so data can be stored in whichever technology the user prefers. In case of a partition, different storages might diverge. In this talk, we present our approach to achieving strong eventual consistency for data replicated in different data stores connected through Zenoh.
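The sketch below gives the flavour of that approach (our simplification, not Zenoh's actual replication protocol): every sample carries a unique (time, source) timestamp, and an anti-entropy pass keeps the highest timestamp per key, so storages that diverged during a partition converge to the same state.

```python
# Hedged sketch of the general idea, not Zenoh's protocol: samples are
# tagged with unique (time, source) timestamps; after a partition heals,
# an anti-entropy pass keeps the highest timestamp per key, so all
# storages converge (strong eventual consistency).

def anti_entropy(store_a, store_b):
    for key in store_a.keys() | store_b.keys():
        entry_a = store_a.get(key, ((0, ""), None))
        entry_b = store_b.get(key, ((0, ""), None))
        winner = max(entry_a, entry_b, key=lambda e: e[0])
        store_a[key] = store_b[key] = winner

# Two storages diverge during a partition...
s1 = {"demo/sensor": ((10, "nodeA"), 21.5)}
s2 = {"demo/sensor": ((12, "nodeB"), 22.0),
      "demo/door": ((11, "nodeB"), "open")}

anti_entropy(s1, s2)
print(s1 == s2, s1["demo/sensor"])  # True ((12, 'nodeB'), 22.0)
```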

Nuno Preguiça (U. NOVA de Lisboa) — Weak Consistency Semantics for Insecure Settings

Slides (PDF)

Client-side replication and direct client-to-client synchronization can be used to create highly available, low-latency interactive applications. Causal consistency, the strongest consistency model available under network partitions, is attractive for these applications. This work focuses on how client misbehaviour impacts causal consistency. We analyse the possible attacks on causal consistency and derive secure consistency models that preclude different types of misbehaviour. We propose a set of techniques for implementing such secure consistency models, which exhibit different trade-offs between application guarantees, latency and communication overhead.
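For context, this is the causal-delivery check that such systems build on; the secure models in the talk additionally protect this metadata against misbehaving clients (e.g. by authenticating it), which the sketch below (ours, not the paper's) omits.

```python
# Minimal causal-delivery check with vector clocks. A misbehaving client
# could forge msg_clock; the secure variants discussed in the talk
# authenticate this metadata, which is omitted here.

def causally_ready(local_clock, msg_clock, sender):
    """Deliver msg only if it is the next event from `sender` and every
    other dependency in its vector clock has already been delivered."""
    for node, count in msg_clock.items():
        expected = local_clock.get(node, 0) + (1 if node == sender else 0)
        if count > expected:
            return False  # a causal dependency is still missing
    return True

local = {"alice": 2, "bob": 1}
print(causally_ready(local, {"alice": 3, "bob": 1}, "alice"))  # True
print(causally_ready(local, {"alice": 3, "bob": 2}, "alice"))  # False: bob's 2nd event unseen
```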

Abstracts — RainbowFS speakers

Download all slides as a zip archive

Saalik Hatia (Sorbonne-Université Delys) — Correct backend for a highly-available database

Slides (PDF)

A large-scale application is typically built on top of a geo-distributed database, running in multiple data centres (DCs) around the globe. Replication, journaling, materialization, checkpointing and journal truncation are well-known mechanisms used in modern databases. However, their interplay is known to be tricky, and implementing these features correctly is hard. In this work, we adopt a stepwise approach to implementing a fully-featured geo-distributed database that is correct by design. We first study existing databases to identify the key invariants that capture the correctness of a highly available storage backend. From those key invariants, we formalize an operational semantics and (manually) derive a reference implementation that satisfies this model. Finally, we show through test cases that our implementation respects the consistency invariants.
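A minimal sketch of the interplay the abstract refers to (our illustration, not the talk's reference implementation): writes are journaled first, reads replay the journal tail over the last checkpoint, and a key safety invariant is that truncation never drops journal entries not yet covered by a checkpoint.

```python
# Hedged sketch of a journal + checkpoint backend (ours, not the talk's
# reference implementation), showing why the mechanisms interact.

class Backend:
    def __init__(self):
        self.journal = []      # append-only list of (key, value) updates
        self.checkpoint = {}   # materialized state up to `checkpointed`
        self.checkpointed = 0  # journal index covered by the checkpoint

    def write(self, key, value):
        self.journal.append((key, value))  # durability first: journal it

    def read(self, key):
        state = dict(self.checkpoint)      # start from the checkpoint...
        for k, v in self.journal[self.checkpointed:]:
            state[k] = v                   # ...and replay the journal tail
        return state.get(key)

    def take_checkpoint(self):
        for k, v in self.journal[self.checkpointed:]:
            self.checkpoint[k] = v
        self.checkpointed = len(self.journal)

    def truncate(self, upto):
        # Invariant: never drop journal entries not yet checkpointed.
        assert upto <= self.checkpointed
        self.journal = self.journal[upto:]
        self.checkpointed -= upto

db = Backend()
db.write("x", 1); db.take_checkpoint(); db.write("x", 2)
db.truncate(1)       # safe: entry 0 is covered by the checkpoint
print(db.read("x"))  # 2
```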

Brad King (Scality) — Scality and RainbowFS, the Search for the Grail Continues

Slides (PDF)

Scality has been researching the possibility of offering a geo-distributed, POSIX-compliant file system that safely permits local updates without synchronous global coordination; the RainbowFS project is Scality’s key research activity in this area. Progress in the field of CRDTs presented interesting possibilities for making this a reality. This talk briefly presents the idealized and realistic outcomes that motivated, and resulted from, this work. Additionally, storage needs, expectations and alternatives to traditional file-system interfaces have evolved significantly during the roughly five-year timespan of this effort. These industry changes will be discussed, as well as the progress made and the challenges that remain in transforming this research into a commercially viable product.

Sébastien Monnet & Étienne Mauffret (U. Savoie–Mont-Blanc) — Towards a consistency-aware data placement mechanism

Slides (PDF)

Data management has become crucial: distributed applications and users manipulate large amounts of data, and more and more distributed data-management solutions are emerging, e.g. Cassandra and CosmosDB. Some of them offer multiple consistency protocols, so that for each piece of data the developer can choose a protocol adapted to their needs. The consistency protocol should be taken into account when placing copies of pieces of data. We propose an approach that dynamically adapts data placement according to data usage (read/write frequencies and locations) and the consistency protocol used to manage each piece of data.
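A toy model of the idea (hypothetical numbers and scoring function, not the authors' mechanism): the best placement for a copy depends not only on where reads and writes come from, but also on how expensive the chosen consistency protocol makes writes.

```python
# Toy placement model (ours, not the authors' algorithm): score each
# candidate site from observed access frequencies and locations, then
# weight writes by what the consistency protocol makes them cost
# (e.g. local for a CRDT vs. coordinated for a primary copy).

def placement_score(site, accesses, latency, write_weight):
    """Lower is better: expected access latency from `site`."""
    score = 0.0
    for client, (reads, writes) in accesses.items():
        score += latency[(site, client)] * (reads + write_weight * writes)
    return score

accesses = {"eu": (100, 10), "us": (20, 50)}  # client -> (reads, writes)
latency = {("paris", "eu"): 5, ("paris", "us"): 80,
           ("virginia", "eu"): 80, ("virginia", "us"): 5}

for protocol, w in (("CRDT (writes are cheap)", 1.0),
                    ("primary-copy (writes are costly)", 5.0)):
    best = min(("paris", "virginia"),
               key=lambda s: placement_score(s, accesses, latency, w))
    print(protocol, "->", best)   # the protocol changes the best placement
```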

Thuy Linh Nguyen (U. Grenoble–Alpes) — Cloudal: a tool for managing experiment campaigns on cloud platforms

Slides (PDF)

Experiment-driven research plays an important role in many sciences, to verify new hypotheses or to unveil hidden interactions between the many factors of a phenomenon. For modern computing experiments, with the ever-increasing size of infrastructures and the complexity of distributed systems, it is essential to run large-scale experiments on heterogeneous cloud platforms to better understand the system.

In general, researchers have to overcome several challenges and obstacles to deal with provisioning resources and deploying services on one specific cloud system. Additionally, they also have to ensure the reproducibility of the experiments and the control of the parameter space.

To help researchers shorten the time-consuming and error-prone setup stages, and to manage experiments productively, we developed Cloudal, an experiment management tool for designing and running full factorial experiments on a cloud system automatically and reproducibly. The current version of Cloudal supports several cloud platforms: Grid'5000, Google Cloud Platform (GCP), Microsoft Azure and OVHcloud. To illustrate the usefulness of Cloudal, we implement a performance-benchmark experiment with a complex workflow involving the deployment of a geo-distributed file system on top of Grid’5000. The whole experiment is encapsulated in an easy-to-follow script with a separate, easy-to-manage system settings file.
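For illustration, a full factorial campaign simply enumerates every combination of every factor level. The factor names below are hypothetical, and Cloudal's real input is a settings file rather than inline Python; this only shows what "full factorial" amounts to.

```python
# Sketch of a full factorial campaign (hypothetical factor names; not
# Cloudal's API): run every combination of every factor level and
# record each outcome, so the campaign is reproducible.

import itertools

factors = {
    "n_datacenters": [1, 3, 5],
    "n_clients": [16, 32],
    "workload": ["read-heavy", "write-heavy"],
}

def run_experiment(combo):  # stand-in for provisioning + benchmarking
    return {"params": combo, "status": "done"}

campaign = [run_experiment(dict(zip(factors, values)))
            for values in itertools.product(*factors.values())]
print(len(campaign), "runs")  # 3 * 2 * 2 = 12 combinations
```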

Ayush Pandey (TUK and Sorbonne-Université Delys) — Designing and Implementing a Persistent Cache for a CRDT Data Store

Slides (PDF)

Most in-memory databases provide object persistence through transaction logging and snapshots. During routine restarts, snapshots are read from disk into memory for access. When failures happen, the logs support recovery by undoing or redoing transactions.

There is another way of managing objects in databases: instead of treating snapshots as the primary source and logs as a fallback mechanism, we can use the log as the primary source of objects’ states and fall back on snapshots in case of failures. This strategy is implemented in journal-based databases.

In our work, we present the design of a cache and a persistent checkpoint store for a database that uses the log to create objects’ states. We study how to materialize objects from the journal with the help of snapshots, while maintaining causal dependencies, and how to keep these snapshots in memory to improve transaction performance. The design is supported by benchmarks comparing it with a reference implementation.
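Our illustrative sketch of that read path (not the paper's code): the journal is the source of truth; a read starts from the nearest checkpoint at or before the requested offset, replays only the journal suffix, and caches the materialized result for later transactions.

```python
# Sketch of journal-first materialization with a cache (ours, not the
# paper's implementation): the journal is the primary source of state;
# checkpoints and the cache only speed up reads.

journal = [("x", 1), ("y", 7), ("x", 2), ("x", 5)]  # append-only operations
checkpoints = {0: {}, 2: {"x": 1, "y": 7}}          # journal offset -> state
cache = {}                                          # (key, offset) -> value

def materialize(key, offset):
    if (key, offset) in cache:
        return cache[(key, offset)]
    base = max(o for o in checkpoints if o <= offset)  # nearest checkpoint
    state = dict(checkpoints[base])
    for k, v in journal[base:offset]:                  # replay the suffix only
        state[k] = v
    cache[(key, offset)] = state.get(key)
    return cache[(key, offset)]

print(materialize("x", 4))  # 5: start at checkpoint 2, replay entries 2..3
print(materialize("x", 2))  # 1: served straight from the checkpoint
```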

Marc Shapiro & Ilyas Toumlilt (Sorbonne-Université and Inria) — Power to the edge! Data-oriented collaboration at the edge with Concordant.io

Slides (PDF)

Concordant.io is a decentralised, edge-first platform for powerful edge apps. It brings data to its point of use, with the strongest consistency compatible with availability. This enables immediate response, high availability, seamless online/offline operation, and direct collaboration between nearby devices. Concordant's data-centric API lets developers focus on application logic, without the latency and lock-in of the cloud. Applications use CRDT data and are guaranteed Transactional Causal Consistency; Concordant is inspired by AntidoteDB and Colony.

Pierre Sutra (Télécom SudParis) — Modern techniques for data availability and durability

Slides (PDF)

The scale and intensity of today's economic exchanges require modern web services to be always-on and responsive. One of the key concerns when building these services is the durability and availability of their data. Data must remain accessible despite network, machine, or even data center-scale temporary (or permanent) outages. This talk explores some of the recent techniques to achieve such guarantees.

First, we focus on geo-replicated state machines. Geo-replication places several copies of a logical data item in different data centres to improve access locality and fault tolerance. State-machine replication (SMR) ensures that these copies stay in sync. Recent advances in SMR focus on leaderless protocols, which sidestep the availability and performance limitations of traditional Paxos-based solutions. We detail several leaderless protocols and compare their pros and cons under different application workloads.

The second part of this talk focuses on non-volatile main memory (NVMM). NVMM is a new tier in the memory hierarchy that jointly offers the durability of spinning disks, near-DRAM speed and byte addressability. We present J-NVM, a fully-fledged interface for using NVMM in the Java language. Internally, J-NVM relies on proxy objects that intermediate direct off-heap access to NVMM. We present the internals of J-NVM and how to use it in the context of a modern distributed data store.
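J-NVM itself is a Java framework; the following sketch (in Python, ours, not J-NVM's API) only illustrates the proxy-object pattern it builds on: field accesses on the proxy are redirected to a durable region instead of the volatile heap, so re-attaching a proxy after a restart recovers the object.

```python
# Proxy-object pattern sketch (not J-NVM's API): reads and writes of
# object fields are intercepted and redirected to a durable region.

class DurableRegion:
    """Stand-in for an NVMM region: here just a surviving dict."""
    def __init__(self):
        self.fields = {}

class PersistentProxy:
    def __init__(self, region):
        # Bypass our own __setattr__ so _region stays on the (volatile) heap.
        object.__setattr__(self, "_region", region)

    def __getattr__(self, name):           # field reads go to the region
        return self._region.fields[name]

    def __setattr__(self, name, value):    # field writes go to the region
        self._region.fields[name] = value  # (a real system would flush here)

region = DurableRegion()
obj = PersistentProxy(region)
obj.balance = 42                 # lands in durable storage, not the heap
obj2 = PersistentProxy(region)   # "recover": re-attach a proxy to the region
print(obj2.balance)              # 42
```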

Romain Vaillant (Scality) — Design & Implementation of ElmerFS

Slides (PDF)

In this work, we design and implement a POSIX-compliant file system, ElmerFS, that supports active-active geo-replication. We study how to handle concurrent updates so as to preserve application correctness, satisfy important POSIX requirements and respect user intent. The design leverages the Conflict-free Replicated Data Types (CRDTs) supported by the AntidoteDB geo-distributed database. CRDTs allow application designers to build highly available, geo-distributed systems with competitive response times. We compare ElmerFS experimentally with GlusterFS in a large-scale geo-distributed deployment, and analyse how both approaches behave in the presence of network latency and failures.
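As one plausible illustration of the kind of rule such a design needs (not necessarily ElmerFS's exact policy): if a directory is modelled as a set CRDT of (name, unique id) links, two sites that concurrently create the same name both keep their file after merging, and the ambiguity is resolved deterministically when the directory is displayed.

```python
# Hedged illustration (not necessarily ElmerFS's exact policy): a
# directory as a set CRDT of (name, inode-id) links; merge is set
# union, so no concurrent creation is lost, and ambiguous names are
# disambiguated deterministically for display.

def merge_dirs(dir_a, dir_b):
    """Directory = set of (name, inode-id) links; merge is set union."""
    return dir_a | dir_b

def display(entries):
    """Qualify a name with its id only when it is ambiguous after merge."""
    names = [n for n, _ in entries]
    return sorted(n if names.count(n) == 1 else f"{n}@{i}"
                  for n, i in entries)

site_a = {("report", "uid-17")}                        # created at site A
site_b = {("report", "uid-42"), ("notes", "uid-43")}   # concurrent work at B

merged = merge_dirs(site_a, site_b)
print(display(merged))  # ['notes', 'report@uid-17', 'report@uid-42']
```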