A Comparative Study of Data Modelling Strategies for Hybrid Cloud Analytics Platforms

Vempalli Mopuru Rakesh Reddy
Systems Engineer, Tata Consultancy Services

Abstract

Hybrid cloud analytics platforms appeal to businesses that need scalability together with data sovereignty. However, the distributed nature of these environments makes data modelling more challenging, and modelling choices can substantially affect performance, governance, and analytical outcomes. This paper presents an in-depth comparative analysis of data modelling strategies suited to hybrid cloud analytics platforms. We evaluate conventional relational models, NoSQL paradigms, Data Vault modelling, and schema-on-read methodologies along four dimensions: performance, scalability, adaptability, and cost-efficiency. We illustrate the strengths and weaknesses of each strategy using real-world examples and benchmarks. Our findings are intended to guide architects and data engineers in choosing suitable data modelling patterns for hybrid cloud infrastructures.

Keywords

Hybrid Cloud, Data Modelling, Analytics Platforms, Schema-on-Read, Data Vault, NoSQL, Polyglot Persistence, Cloud Data Warehousing, Data Federation, Performance Benchmarking.
