A milestone for FaaS pipelines; object storage vs VM-driven data exchange
Abstract
Serverless functions provide high levels of parallelism, short startup times, and "pay-as-you-go" billing. These attributes make them a natural substrate for data analytics workflows. However, the impossibility of direct communication between functions makes the execution of workflows challenging. The current practice to share intermediate data among functions is through remote object storage (e.g., IBM COS). Contrary to conventional wisdom, the performance of object storage is not well understood. For instance, object storage can even be superior to other simpler approaches like the execution of shuffle stages (e.g., GroupBy) inside powerful VMs to avoid all-to-all transfers between functions. Leveraging a genomics pipeline, we show that object storage is a reasonable choice for data passing when the appropriate number of functions is used in shuffling stages.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.