Data Driven High Performance Data Access

Date

2020-03-05

Authors

Ramljak, Dusan

Publisher

WTAMU Cornette Library

Abstract

Low-latency, high-throughput mechanisms for retrieving data become increasingly crucial as cyber and cyber-physical systems pour out growing amounts of data that often must be analyzed online. Generally, as data volume increases, the marginal utility of an "average" data item tends to decline, so greater effort is required to identify the most valuable data items and make them available with minimal overhead. We believe that data-analytics-driven mechanisms have a big role to play in solving this needle-in-a-haystack problem. We rely on the claim that efficient pattern discovery and description, coupled with the observed predictability of complex patterns within many applications, offers significant potential to enable many I/O optimizations. Our research covers exploitation of the storage hierarchy for data-driven caching and tiering; reduction of the distance between data and computation; removal of redundancy in data; use of sparse representations of data; the impact of data access mechanisms on resilience, energy consumption, and storage usage; and the enablement of new classes of data-driven applications. For caching and prefetching, we offer a powerful model that separates the process of access prediction from the data retrieval mechanism. Predictions are made on a per-data-entity basis and use the notion of "context" and its aspects, such as "belief", to uncover and leverage future data needs. This approach allows truly opportunistic utilization of predictive information. For areas other than caching and prefetching, we elaborate on which aspects of the context we use in each situation and why they are appropriate there. We present in more detail the two methods we have developed: BeliefCache, for data-driven caching and prefetching, and AVSC, for pattern-mining-based compression of data.
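The separation of access prediction from data retrieval described above can be illustrated with a minimal Python sketch. This is not the BeliefCache implementation: the class name `BeliefDrivenCache`, the `belief` and `fetch` callables, the fixed `prefetch_threshold`, and the evict-lowest-belief policy are all assumptions made for illustration. The only idea taken from the abstract is that a belief (an estimated probability that an entity will be needed) is supplied independently of the mechanism that fetches the data.

```python
from typing import Callable, Dict, Hashable, Iterable


class BeliefDrivenCache:
    """Toy cache whose eviction and prefetching consult a pluggable belief function."""

    def __init__(
        self,
        capacity: int,
        belief: Callable[[Hashable], float],   # hypothetical predictor: key -> probability of future use
        fetch: Callable[[Hashable], object],   # retrieval mechanism, independent of the predictor
        prefetch_threshold: float = 0.5,       # assumed cutoff, not from the source
    ) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.belief = belief
        self.fetch = fetch
        self.prefetch_threshold = prefetch_threshold
        self.store: Dict[Hashable, object] = {}

    def _admit(self, key: Hashable) -> None:
        """Bring `key` into the cache, evicting the resident with the lowest belief if full."""
        if key in self.store:
            return
        if len(self.store) >= self.capacity:
            victim = min(self.store, key=self.belief)
            del self.store[victim]
        self.store[key] = self.fetch(key)

    def get(self, key: Hashable) -> object:
        """Serve a demand access; a miss always admits the requested key."""
        if key not in self.store:
            self._admit(key)
        return self.store[key]

    def prefetch(self, candidates: Iterable[Hashable]) -> None:
        """Opportunistically admit candidates whose belief reaches the threshold."""
        for key in candidates:
            if self.belief(key) >= self.prefetch_threshold:
                self._admit(key)
```

Because the predictor is just a callable, it can be swapped or retrained without touching the retrieval path, which is the design point this sketch is meant to show.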
In BeliefCache, we use belief, an aspect of context representing an estimate of the probability that a storage element will be needed, within a modular framework that makes unified, informed decisions about that element or a group of elements. For the workloads we examined, we were able to capture complex non-sequential access patterns better than a state-of-the-art framework for optimizing cloud storage gateways. Moreover, our framework adjusts to variations in the workload faster. It also does not require a static workload to be effective, since its modular design allows it to discover and adapt to changes in the workload. In AVSC, we use an aspect of context to gauge the similarity of events and compress the data by keeping relevant events intact and approximating the others. We do that in two stages: we first generate a summarization of the data, then approximately match the remaining events with the existing patterns if possible, or add new patterns to the summary otherwise. For a specified level of accuracy in identifying the state of the system, we show gains over plain lossless compression, as well as a clear tradeoff between compressibility and fidelity. In the other research areas mentioned, we present challenges and opportunities in the hope that they will spur researchers to further examine those issues in the space of rapidly emerging data-intensive applications. We also discuss how our research in other domains could be applied in our attempts to provide high-performance data access.
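The two-stage compression idea described for AVSC can be sketched roughly as follows. This is an illustrative Python toy, not the AVSC algorithm itself: the event representation (tuples of floats), the `distance` measure, the `is_relevant` predicate, the `tolerance` parameter, and the way stage one seeds the summary (taking the first few distinct non-relevant events) are all assumptions introduced here.

```python
from typing import Callable, List, Tuple, Union

Event = Tuple[float, ...]
# Encoded item: ("raw", event) for events kept intact,
# or ("ref", index) for events approximated by a summary pattern.
Encoded = Tuple[str, Union[Event, int]]


def distance(a: Event, b: Event) -> float:
    """Largest per-field absolute difference (an assumed similarity measure)."""
    return max(abs(x - y) for x, y in zip(a, b))


def compress(
    events: List[Event],
    is_relevant: Callable[[Event], bool],
    tolerance: float,
    seed_count: int = 2,
) -> Tuple[List[Event], List[Encoded]]:
    # Stage 1 (assumed form): seed the summary with the first few
    # distinct non-relevant events.
    summary: List[Event] = []
    for ev in events:
        if len(summary) >= seed_count:
            break
        if not is_relevant(ev) and ev not in summary:
            summary.append(ev)

    # Stage 2: keep relevant events intact; approximate the rest by the
    # nearest summary pattern, or extend the summary when none is close enough.
    encoded: List[Encoded] = []
    for ev in events:
        if is_relevant(ev):
            encoded.append(("raw", ev))
            continue
        best = min(
            range(len(summary)),
            key=lambda i: distance(ev, summary[i]),
            default=None,
        )
        if best is not None and distance(ev, summary[best]) <= tolerance:
            encoded.append(("ref", best))
        else:
            summary.append(ev)
            encoded.append(("ref", len(summary) - 1))
    return summary, encoded


def decompress(summary: List[Event], encoded: List[Encoded]) -> List[Event]:
    """Rebuild the event stream; approximated events come back as their pattern."""
    return [item if kind == "raw" else summary[item] for kind, item in encoded]
```

In this sketch a larger `tolerance` yields fewer summary patterns (better compressibility) at the cost of larger reconstruction error (lower fidelity), which mirrors the tradeoff the abstract mentions.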

Description

I started research in intelligent, data-driven management of computer storage as soon as I finished my undergraduate studies. The presentation offers ideas on how to approach high-performance data access, drawn from more than 20 years of practical and research experience. Parts of the presentation were shown at the 15th USENIX Conference on File and Storage Technologies (FAST), Work in Progress (WiP) session, Santa Clara, Feb 27 - Mar 2, 2017; HotStorage '17, WACI session, Santa Clara, July 10-11, 2017; the ICDCN Workshop on Smart and Connected Communities: Technological Foundations, Challenges and Opportunities (SCC-2018), Varanasi, India, January 4-7, 2018; the International Conference on Edge Computing, Seattle, USA, June 25-30, 2018; internal meetings at companies such as Dell, HPE, Huawei, and Salesforce; Industry–University Cooperative Research Centers Program (IUCRC) meetings; and my dissertation defense.

Citation

Ramljak, Dusan (2019) "Data Driven High Performance Data Access", Temple University, Ann Arbor, ProQuest, https://search.proquest.com/docview/2171051321

Ramljak, D., Tom, D.A., Voigt, D., Kant, K. (2018) "Modular framework for Data Prefetching and Replacement in Storage Systems," International Conference on Edge Computing, Seattle, USA, June 25-30, 2018

Ramljak, D., Pal, A., Kant, K. (2018) "Pattern Mining Based Compression of IoT Data," In ICDCN Workshop on Smart and Connected Communities: Technological Foundations, Challenges and Opportunities (SCC-2018), Varanasi, India, January 4-7, 2018

Ramljak, D., Kant, K. (2017) "Belief-Based Storage Systems," In The HotStorage '17, WACI session, Santa Clara, July 10-11, 2017

Ramljak, D., Alazzawe, A., Uversky, A., Kant, K. (2017) "Belief-Based Data Prefetching and Replacement in Storage Systems," In The 15th USENIX Conference on File and Storage Technologies (FAST), Work in progress (WiP) session, Santa Clara, Feb 27 - Mar 2, 2017