Peer-Reviewed

Survey the Storage Systems Used in HPC and BDA Ecosystems

Received: 25 March 2022    Accepted: 28 April 2022    Published: 19 May 2022
Abstract

The advancement of HPC and Big Data Analytics (BDA) ecosystems demands a better understanding of storage systems in order to plan effective solutions. The amount of data generated by an ever-growing number of devices has increased tremendously over the years. To let applications access data more efficiently for computation, HPC and BDA ecosystems adopt different storage systems, each with its own pros and cons. It is therefore worthwhile to explore the storage systems used in HPC and BDA, respectively, and to understand how such systems handle data consistency and fault tolerance at massive scale. In this paper, we survey four storage systems: Lustre, Ceph, HDFS, and CockroachDB. Lustre and HDFS are among the most prominent file systems in the HPC and BDA ecosystems. Ceph is an emerging file system that is already used by supercomputers. CockroachDB is a NewSQL system, an approach widely used in industry for BDA applications. The study helps us understand the underlying architecture of these storage systems and the building blocks used to create them. The protocols and mechanisms used for data storage, data access, data consistency, fault tolerance, and failover recovery are also reviewed. This comparative study will help system designers understand the key features and architectural goals of these storage systems and select better storage solutions.

Published in Internet of Things and Cloud Computing (Volume 10, Issue 1)
DOI 10.11648/j.iotcc.20221001.12
Page(s) 12-28
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2022. Published by Science Publishing Group

Keywords

HPC, BDA, Storage Systems, CockroachDB, HDFS, Ceph, Lustre

Cite This Article
  • APA Style

    Priyam Shah, Jie Ye, Xian-He Sun. (2022). Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet of Things and Cloud Computing, 10(1), 12-28. https://doi.org/10.11648/j.iotcc.20221001.12


  • ACS Style

    Priyam Shah; Jie Ye; Xian-He Sun. Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet Things Cloud Comput. 2022, 10(1), 12-28. doi: 10.11648/j.iotcc.20221001.12


  • AMA Style

    Priyam Shah, Jie Ye, Xian-He Sun. Survey the Storage Systems Used in HPC and BDA Ecosystems. Internet Things Cloud Comput. 2022;10(1):12-28. doi: 10.11648/j.iotcc.20221001.12


  • @article{10.11648/j.iotcc.20221001.12,
      author = {Priyam Shah and Jie Ye and Xian-He Sun},
      title = {Survey the Storage Systems Used in HPC and BDA Ecosystems},
      journal = {Internet of Things and Cloud Computing},
      volume = {10},
      number = {1},
      pages = {12-28},
      doi = {10.11648/j.iotcc.20221001.12},
      url = {https://doi.org/10.11648/j.iotcc.20221001.12},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.iotcc.20221001.12},
      year = {2022}
    }
    


  • TY  - JOUR
    T1  - Survey the Storage Systems Used in HPC and BDA Ecosystems
    AU  - Priyam Shah
    AU  - Jie Ye
    AU  - Xian-He Sun
    Y1  - 2022/05/19
    PY  - 2022
    N1  - https://doi.org/10.11648/j.iotcc.20221001.12
    DO  - 10.11648/j.iotcc.20221001.12
    T2  - Internet of Things and Cloud Computing
    JF  - Internet of Things and Cloud Computing
    JO  - Internet of Things and Cloud Computing
    SP  - 12
    EP  - 28
    PB  - Science Publishing Group
    SN  - 2376-7731
    UR  - https://doi.org/10.11648/j.iotcc.20221001.12
    VL  - 10
    IS  - 1
    ER  - 


Author Information
  • Priyam Shah, Computer Department, Illinois Institute of Technology, Chicago, USA

  • Jie Ye, Computer Department, Illinois Institute of Technology, Chicago, USA

  • Xian-He Sun, Computer Department, Illinois Institute of Technology, Chicago, USA
