Homomorphism Calculus for User-Defined Aggregations
Abstract
Data processing frameworks like Apache Spark and Flink provide built-in support for user-defined aggregation functions (UDAFs), enabling the integration of domain-specific logic. However, for these frameworks to support efficient UDAF execution, the function needs to satisfy a homomorphism property, which ensures that partial results from independent computations can be merged correctly. Motivated by this problem, this paper introduces a novel homomorphism calculus that can both verify and refute whether a UDAF is a dataframe homomorphism. If so, our calculus also enables the construction of a corresponding merge operator which can be used for incremental computation and parallel execution. We have implemented an algorithm based on our proposed calculus and evaluate it on real-world UDAFs, demonstrating that our approach significantly outperforms two leading synthesizers.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.