Frugal computing

Companies care about cheap computing.

Well, the first thing they care about is usability: Is it easy for average programmers to develop solutions using the framework?

And the second thing they care about is cheap computing: Does this break the bank? Can I do this cheaper?

Speed, efficiency, elegance, technical strength... These are often not of interest. People are OK with their analytic jobs return results hours later. We watched aghast, at the beginning of Hadoop epidemic, people use MapReduce for doing analytics in a wasteful and slow manner.

I was recently thinking about this question: How can we trade-off speed with monetary cost of computing? If the constraints are that the user will not need the results before a couple hours but it would be nice to get the results in a day, what is the cheapest way to get this analytics job done in the cloud?

While distributed systems may be the answer for a lot of questions (such as providing fault-tolerance, low-latency access for geo-distributed deployment, scalability, etc.), it is not very advantageous for cheap computing. This is because distributed processing often comes with a large overhead for state synchronization, which takes a long time to get compensated, as the "Scalability at what COST paper" showed. With frugal computing, we should try to avoid the cost of state synchronization as much as possible. So work should be done on one machine if it is cheaper to do so and the generous time budget is not exceeded. Other machines should be involved when it is cheaper to involve them, and when there is not a lot of state to synchronize.

Primitives for frugal computing

Memory is expensive but storage via local disk is not. And time is not pressing. So we can consider out-of-core execution, juggling between memory and disk.

Communication costs money. So batching communication and trading off computation with communication (when possible) would be useful for frugal computing. If something is computationally heavy, we can have lookup tables stored in disk or S3 (if still feasible monetarily).

We may then need schemes for data-naming (which may be more sophisticated then simple key), so that a node can locate the result it needs in S3 instead of computing itself. This can allow nodes to collaborate with other nodes in an asynchronous, offline, or delay-tolerant way. For this, maybe the Linda tuplespaces idea could be a good fit. We can have an await based synchronization. This may even allow a node or process to collaborate with itself. The node may might switch to other threads when it is stuck waiting for a data item, and then come pick up that thread when the awaited thing becomes ready through work of the other threads and continue execution from there.

In frugal computing, we cannot afford to allocate extra resources for fault-tolerance, and we need to do in a way commensurate with the risk of fault and the cost of restarting computation from scratch. Snapshots that are saved for offline collaboration may be useful for building frugal fault-tolerance. Also self-stabilizing approach can also be useful, because it can provide forward error correction instead of a costly roll-back error correction.

This is all raw stuff, and in the abstract. I wonder if it would be possible to start with an opensource data processing framework (such as Spark), and customize it to prioritize frugality over speed. How much work would that entail?

Comments

Mark Callaghan said…
Systems and algorithms optimized for efficiency rather than response time is an interesting topic. Concurrency in general isn't frugal.

More things that aren't frugal:
* Spin locks and mutexes that do a try-spin-wait loop, see https://bugs.mysql.com/bug.php?id=52806
* Read, write and space amplification from using the wrong index structure for a workload - http://smalldatum.blogspot.com/2019/05/crum-conjecture-read-write-space-and.html

Popular posts from this blog

Hints for Distributed Systems Design

Learning about distributed systems: where to start?

Making database systems usable

Looming Liability Machines (LLMs)

Foundational distributed systems papers

Advice to the young

Linearizability: A Correctness Condition for Concurrent Objects

Understanding the Performance Implications of Storage-Disaggregated Databases

Scalable OLTP in the Cloud: What’s the BIG DEAL?

Designing Data Intensive Applications (DDIA) Book