Cost-effective Distributed Training with Auto ML and AWS Spot Instances

Dr. Adrian Keller

Professor, Department of Cloud Computing and Distributed Systems, Technical University of Munich, Germany

DOI: 10.63665/ijmlaidse-y1f1a004

Abstract

Training machine learning models in the cloud is too expensive for many companies. This paper shows how AWS Spot Instances can be combined with Auto ML methods and distributed training in a cost-effective and scalable way. By using checkpointing and smart orchestration strategies, and by exploiting the fact that work on an interrupted Spot Instance can be resumed later, models can be trained at a fraction of the cost without sacrificing much performance. The system is designed to work with several Auto ML libraries built for different tasks while tolerating instance interruptions. Our tests on large datasets show cost savings of up to 80% compared to On-Demand instances, with no apparent effect on training quality. The paper thus serves as a practical guide to automating cloud-based machine learning workflows that scale while keeping costs down.
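
The checkpoint-and-resume idea summarized above can be illustrated with a minimal sketch. It is not the system described in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON checkpoint format, and the `interrupt_at` parameter that simulates a Spot reclamation are all hypothetical, and a local file stands in for whatever durable storage a real deployment would use.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within one filesystem


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, checkpoint_every=10, interrupt_at=None):
    """Toy training loop (assumption: one 'step' adds 0.5 to a weight).

    interrupt_at simulates the instance being reclaimed at that step.
    Returns (state, finished).
    """
    state = load_checkpoint(path)  # resume from the last checkpoint, if any
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            # Instance reclaimed: progress since the last checkpoint is lost.
            return state, False
        state["weight"] += 0.5  # stand-in for one optimizer update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state, True
```

Under these assumptions, a run interrupted at step 25 with a checkpoint every 10 steps loses only the 5 steps since the last save; calling `train` again with the same path picks up at step 20 and runs to completion. The checkpoint interval trades storage writes against the amount of work repeated after each interruption.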

Keywords

Distributed Training, Auto ML, AWS Spot Instances, Cloud Computing, Cost Optimization, Fault Tolerance, Deep Learning, SageMaker, Checkpointing, Cloud AI Infrastructure
