Optimizing Distributed Systems for Scalable Machine Learning Workflows using AI-Driven Software Engineering

Stephen Eteng

Authors

Stephen Eteng¹
¹University of Ibadan

Abstract

The rapid growth of machine learning applications demands distributed systems that can handle large datasets, complex models, and heavy computational loads. The main challenges in building scalable distributed systems for machine learning are data storage management, resource allocation, load balancing, and maintaining system performance. This work investigates how AI-driven software engineering can help distributed systems handle machine learning workloads efficiently. We show how AI techniques such as reinforcement learning can support resource management, bug detection, and task partitioning. We also show how AI can enhance two well-known distributed computing frameworks, Apache Spark and Kubernetes, for machine learning tasks. Finally, we outline future directions for AI in distributed systems and illustrate AI-driven optimizations in real-world settings through examples and case studies. The results indicate that AI-driven software engineering is essential to meeting the performance and scalability demands of state-of-the-art distributed systems that run machine learning workloads.

Keywords

AI-driven optimization; Distributed systems; Scalable machine learning; Resource allocation; Load balancing; Machine learning workflows; Reinforcement learning; Cloud infrastructure; Distributed computing frameworks; Apache Spark

How to Cite This Article

Eteng, S. (2026). Optimizing distributed systems for scalable machine learning workflows using AI-driven software engineering. International Journal of Engineering & Tech Development, 1(3), 18-26.
