Using Reinforcement Learning in AWS EC2 for Dynamic Resource Allocation

Kayode Soladeji
Ladoke Akintola University of Technology, ksoginni@student.lautech.edu.ng


Abstract

Achieving good performance at low cost on cloud computing services such as Amazon EC2 requires allocating resources dynamically. Traditional scaling techniques rely on static thresholds or on predictive models that do not adapt as workloads change over time. We show how reinforcement learning (RL) can be used to allocate AWS EC2 resources in real time. The scaling problem is formulated as a Markov decision process (MDP), in which an RL agent learns from system metrics and workload patterns to identify the optimal instance type and scaling actions. The system is implemented using the AWS SDK and CloudWatch metrics, and it can be evaluated in both laboratory and real-world settings. Tests show that the RL agent can reduce costs without significantly degrading service-level goals. Although this approach is not yet widely adopted, it offers improved automated scaling and could potentially allow cloud infrastructure to scale autonomously.
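To make the MDP framing concrete, the following is a minimal sketch and not the implementation described in the article. It uses tabular Q-learning as a simple stand-in for the Deep Q-Learning named in the keywords. The state fields, the three scaling actions, the per-instance capacity, the unit cost, and the SLA penalty are all illustrative assumptions; in a real deployment the load figure would come from CloudWatch metrics and the chosen action would be applied through the AWS SDK.

```python
from dataclasses import dataclass

# All constants below are hypothetical values chosen for illustration.
ACTIONS = (-1, 0, 1)  # remove one instance, hold, add one instance
MIN_INSTANCES, MAX_INSTANCES = 1, 10


@dataclass(frozen=True)
class State:
    instances: int   # number of instances currently running
    cpu_bucket: int  # bucketed utilisation: 0 = low, 1 = medium, 2 = high


def cpu_bucket(load: float, instances: int, capacity: float = 100.0) -> int:
    """Map load per unit of fleet capacity to a coarse utilisation bucket."""
    utilisation = load / (instances * capacity)
    if utilisation < 0.4:
        return 0
    if utilisation < 0.8:
        return 1
    return 2


def step(state: State, action: int, next_load: float):
    """Apply a scaling action and return (next_state, reward)."""
    instances = min(MAX_INSTANCES, max(MIN_INSTANCES, state.instances + action))
    bucket = cpu_bucket(next_load, instances)
    cost = 1.0 * instances                     # assumed unit price per instance
    sla_penalty = 5.0 if bucket == 2 else 0.0  # assumed penalty when overloaded
    return State(instances, bucket), -(cost + sla_penalty)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; unseen entries default to 0."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

The reward is the negative of instance cost plus a penalty for running in the high-utilisation bucket, so an agent maximising it is pushed toward the smallest fleet that avoids overload, which mirrors the cost-versus-service-level trade-off described above.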

Keywords

Reinforcement Learning (RL), Cloud Computing, AWS EC2, Dynamic Resource Allocation, Auto Scaling, Markov Decision Process (MDP), Cost Optimization, Cloud Infrastructure Management, Deep Q-Learning, Performance-Aware Scheduling.
