Adaptive Resource Management for Data-Intensive Applications

Aiming to make it easier to run data-intensive applications efficiently on distributed computing infrastructures from small devices to large-scale clusters, we work on adaptive resource management, following an iterative systems research approach.

Background

Today, many organizations have to deal with very large volumes of collected data, be it to search through billions of websites, to recommend songs or TV shows to millions of users, to accurately identify genetic disorders by comparing terabytes of human genomic data, to monitor current environmental conditions such as seismic activity using large distributed sensor networks, or to detect fraudulent behavior in millions of business transactions. For this, businesses, sciences, municipalities, as well as other large and small organizations deploy data-intensive applications that run on scalable and fault-tolerant distributed systems and large-scale computing infrastructures. Prominent examples of such systems include cluster resource managers, distributed storage systems, scalable databases, messaging systems, distributed dataflow systems, and scalable systems for machine learning and graph processing.

The computing infrastructures for data-intensive applications and scalable systems are often large clusters of commodity hardware. These infrastructures are becoming increasingly diverse, distributed, and dynamic. Beyond the data center, there will be more edge and fog resources as well as IoT devices. This will make it possible to run applications closer to data sources and users, allowing for lower latencies, improved security and privacy, and possibly less energy spent on wide-area networking. At the same time, data centers are also becoming more diverse. In public clouds, users can choose among hundreds of different virtual machines, including instances optimized for compute, memory, and storage functions as well as for accelerated computing. Similarly, dedicated cluster infrastructures at larger organizations are becoming more heterogeneous. Scientists at universities, for instance, often have access to several clusters, each potentially with multiple different types of machines, such as machines equipped with large amounts of memory or graphics processors.

Motivation

The data-intensive applications that run on these systems and infrastructures are created by many different users. These include software engineers, system operators, data analysts, machine learners, scientists, and domain experts, yet typically also users without a strong background in and deep understanding of parallel computer architectures and networking, scalable distributed systems, and efficient data management and processing. At the same time, it is still very difficult to run data-intensive applications on today’s diverse and dynamic distributed computing environments in such a way that the applications provide the required performance and dependability and also run efficiently. Users are largely left alone with the question of how many and which resources they should use for their applications. Meanwhile, configuring scalable fault-tolerant distributed systems so that they run as required on large-scale distributed computing infrastructure is frequently not straightforward even for expert users. Anticipating the runtime behavior of these systems and infrastructures is particularly difficult for three reasons: it depends on several factors, there are usually many options and large parameter spaces to configure, and computing environments and workloads often change dynamically over time.

As a result, efficiently running data-intensive applications is hard, especially when given targets for an application’s performance and dependability. In fact, we argue that, while high-level programming abstractions and comprehensive processing frameworks have made it easier to develop data-intensive applications, efficiently operating data-intensive systems and applications has become more difficult over the last decade. There is abundant evidence of low resource utilization, limited energy efficiency, and severe failures in applications deployed in practice that backs up this claim. Therefore, as an increasing number of data-intensive applications are developed and deployed in businesses, in the sciences, and by municipalities and governments, it is critical from both an economic and an ecological point of view that the computing infrastructures are used efficiently.

Approach and Methods

The main objective of our work is to support organizations and users in using distributed computing infrastructures and scalable fault-tolerant systems efficiently for their data-intensive applications. Towards this goal, we develop methods, systems, and tools that make the implementation, testing, and operation of efficient and dependable data-intensive applications easier. Ultimately, we aim to realize systems that adapt fully automatically to current workloads, dynamically changing distributed computing environments, and given performance and dependability requirements.

More specifically, we work on adaptive resource management in distributed computing environments from small IoT devices to large clusters of virtual resources, following these three approaches:

Central to realizing adaptive resource management according to high-level user-defined objectives and constraints are techniques that accurately model the performance, dependability, and efficiency of distributed data-intensive applications. Additionally, we investigate methods for efficient monitoring, profiling, and experimentation.
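To illustrate how a performance model can drive a resource decision, the following minimal Python sketch fits a simple scale-out model to a few profiling runs and then selects the smallest number of workers predicted to meet a runtime target. It is not our actual method: the model form (runtime ≈ a + b / n), the function names, and the profiling numbers are all assumptions chosen for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical model: runtime(n) ≈ a + b / n, where n is the number of workers,
# a is the non-parallelizable part and b the perfectly parallelizable part.
Model = Tuple[float, float]


def fit_scaleout_model(workers: Sequence[int], runtimes: Sequence[float]) -> Model:
    """Fit runtime ≈ a + b / n by ordinary least squares on x = 1 / n."""
    xs = [1.0 / n for n in workers]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(runtimes) / len(runtimes)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, runtimes))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_runtime(model: Model, n: int) -> float:
    """Predict the runtime for n workers under the fitted model."""
    a, b = model
    return a + b / n


def smallest_scaleout(model: Model, target_runtime: float, max_workers: int) -> Optional[int]:
    """Return the smallest worker count predicted to meet the target, or None."""
    for n in range(1, max_workers + 1):
        if predict_runtime(model, n) <= target_runtime:
            return n
    return None


if __name__ == "__main__":
    # Made-up profiling data: (workers, runtime in seconds).
    profiled_workers = [2, 4, 8, 16]
    profiled_runtimes = [220.0, 120.0, 70.0, 45.0]
    model = fit_scaleout_model(profiled_workers, profiled_runtimes)
    print(smallest_scaleout(model, target_runtime=65.0, max_workers=64))
```

A two-parameter model keeps the fit in closed form and needs only a handful of profiling runs. A realistic model would likely also have to account for factors such as communication overhead, machine types, and input data characteristics.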

Topics of Interest

In our work, we make use of and improve methods in the areas of Distributed Systems, Operating Systems, and Information Systems, and investigate them in the context of current topics such as big data analytics, cloud and fog computing, and the Internet of Things.

Research Methodology

We mostly do empirical systems research. Therefore, we evaluate new ideas by implementing them prototypically in the context of relevant open-source systems (such as Flink, Kubernetes, and FreeRTOS) and by conducting experiments on actual hardware, with representative applications and real-world data. For this, we have access to state-of-the-art infrastructures, including commodity clusters, a GPU cluster, an HPC cluster, private and public clouds, as well as IoT devices and sensors.

We believe iterative research processes and short feedback cycles are important for research. Consequently, we follow a multi-staged approach to research: first presenting new ideas in focused workshops and work-in-progress tracks of conferences, then submitting mature results to main research tracks of international conferences, before summarizing findings comprehensively in journal articles. At the same time, we are also convinced that it is critical to be involved in applied and interdisciplinary joint projects, for a more direct experience of relevant problems and to uncover opportunities for well-motivated, impactful research.

Believing in the value of scientific discourse and feedback, we seek frequent exchange with other research groups and take on an active role in the international scientific community with academic service. Moreover, we make results available to the public as far and as soon as possible with openly accessible publications, open-source software prototypes, and research-based university teaching.