Alluxio: Next-Generation Virtual Distributed File System for AI Analytics

Data orchestration platform brings your data closer to compute across clusters, regions, clouds, and countries

Alluxio: A Virtual Distributed File System

Alluxio, Inc. is developing an open source virtual distributed storage system that bridges the gap between computation frameworks and storage systems and accelerates analytics. It sits between the computation and storage layers in the big data analytics stack.

It originated at the University of California, Berkeley’s AMPLab as the research behind Haoyuan Li’s Ph.D. thesis (published in 2018), advised by Professor Scott Shenker and Professor Ion Stoica.

The world continues to move deeper into the data revolution era. With the latest advancements in the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and the Internet of Things (IoT), the amount of data we generate, collect, store, manage, and analyze is growing exponentially. Storing and processing this data exposes tremendous challenges and opportunities that Alluxio seeks to address.

A decade ago, the computation layer of the ecosystem began with the MapReduce framework and grew into many different general and specialized systems, such as Apache Spark for general data processing; Apache Storm and Apache Samza for stream processing; Apache Mahout for machine learning; TensorFlow and Caffe for deep learning; and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks today for various workloads, and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases, each realizing different tradeoffs in cost, speed, and semantics.

Alluxio enables data orchestration for compute in any cloud. It unifies data silos on-premises and across clouds to give you the data locality, accessibility, and elasticity needed to reduce the complexity of orchestrating data for today’s big data and AI/ML workloads.

Reference Architecture at DBS Bank

The platform scales to over a billion files in a single cluster and supports the following use cases:

  • Get in-memory data access for Spark, Presto, or any analytics framework on Amazon Web Services (AWS), Google Cloud Platform, or Microsoft Azure.
  • Simplify Hadoop for the hybrid cloud by making on-prem HDFS accessible to any compute in the cloud.
  • Accelerate your Spark, Presto, TensorFlow, or any other analytics workload for your object stores.
  • Logically unify your geo-distributed data from different clusters, datacenters, regions, and countries.

Alluxio presents a set of disparate data stores as a single file system, greatly reducing the complexity of the storage APIs and semantics exposed to applications. Alluxio is designed with a memory-centric architecture, enabling applications to leverage memory-speed I/O simply by using Alluxio.
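
To make the idea of a single file system over disparate stores more concrete, here is a minimal sketch in Python of a mount table that maps virtual paths to backing-store locations. It is an illustration only, not Alluxio's API or implementation: the class name VirtualNamespace, its methods, and the example URIs (the HDFS host and the S3 bucket) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualNamespace:
    """Toy mount table (hypothetical, not Alluxio code).

    Maps virtual path prefixes to backing-store URIs so that callers
    only ever deal with one path hierarchy.
    """
    mounts: dict = field(default_factory=dict)

    def mount(self, virtual_prefix: str, store_uri: str) -> None:
        # Normalize trailing slashes so "/logs/" and "/logs" are one mount point.
        self.mounts[virtual_prefix.rstrip("/") or "/"] = store_uri.rstrip("/")

    def resolve(self, virtual_path: str) -> str:
        # Collect every mount point that covers the path, then use the longest.
        candidates = [
            p for p in self.mounts
            if p == "/" or virtual_path == p or virtual_path.startswith(p + "/")
        ]
        if not candidates:
            raise FileNotFoundError(f"no mount covers {virtual_path}")
        prefix = max(candidates, key=len)
        remainder = virtual_path if prefix == "/" else virtual_path[len(prefix):]
        return self.mounts[prefix] + remainder


# Example URIs below are made up for illustration.
ns = VirtualNamespace()
ns.mount("/", "hdfs://namenode:8020/warehouse")
ns.mount("/logs", "s3://example-bucket/logs")

print(ns.resolve("/tables/t1"))
print(ns.resolve("/logs/2020/app.log"))
```

In this sketch an application asks only for /logs/2020/app.log and never needs to know that the bytes live in an object store rather than HDFS. A real system would also have to reconcile the differing semantics of each store and keep hot data in memory, which this example leaves out.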

Today, Alluxio is deployed in production at hundreds of leading companies, serving critical workloads. Its open source community has attracted more than 800 contributors worldwide from over 200 companies, including Baidu, Barclays, China Unicom, Comcast, DBS Bank, Huawei, IBM, Intel, and more.

To understand more about the technology, read the paper authored by Haoyuan Li.

Here you will find use cases and future directions presented by Calvin Jia and Bin Fan.

