Last month, the SNIA Cloud Storage Technologies Initiative was fortunate to have artificial intelligence (AI) expert, Parviz Peiravi, explore the topic of AI Operations (AIOps) at our live webcast, “IT Modernization with AIOps: The Journey.” Parviz explained why the journey to cloud native and microservices, and the complexity that comes along with that, requires a rethinking of enterprise architecture. If you missed the live presentation, it’s now available on demand together with the webcast slides.
We had some interesting questions from our live audience. As promised, here are answers to them all:
Q. Can you please define the Data Lake and how different it is from other data storage models?
A. A data lake is a form of data repository that can ingest data from different sources and of different types (structured, unstructured and semi-structured), storing the data as is rather than transforming it first. Data lakes follow an Extract, Load, Transform (ELT) process with schema on read, in contrast to the Extract, Transform, Load (ETL) process with schema on write that has been used in traditional database management systems. See the definition of data lake in the SNIA Dictionary.
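The difference between schema on write and schema on read can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration (the sample events, the `to_reading` schema and the function names); it is not tied to any particular data lake product.

```python
import json

# Hypothetical raw events from different sources, kept exactly as received.
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": 1}',
    '{"device": "sensor-2", "humidity": 40, "ts": 2}',
    '{"device": "sensor-3", "temp_c": "bad", "ts": 3}',
]


def to_reading(record):
    """Illustrative schema: a device name plus a numeric temperature."""
    return {"device": str(record["device"]), "temp_c": float(record["temp_c"])}


def etl_load(raw_lines):
    """Schema on write (ETL): transform first, store only conforming rows."""
    table, rejected = [], []
    for line in raw_lines:
        try:
            table.append(to_reading(json.loads(line)))
        except (KeyError, ValueError):
            rejected.append(line)  # does not fit the schema at ingest time
    return table, rejected


def elt_load(raw_lines):
    """Schema on read (ELT), load step: store every record untouched."""
    return list(raw_lines)


def elt_read(lake, schema):
    """Schema on read (ELT), read step: apply a schema at query time."""
    rows = []
    for line in lake:
        try:
            rows.append(schema(json.loads(line)))
        except (KeyError, ValueError):
            continue  # record does not fit this particular schema
    return rows
```

With ETL, the humidity event is rejected at ingest and is no longer available. With ELT, it stays in the lake, so a different schema (for example, one that reads `humidity`) can still be applied to the same raw data later.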
In 2005, Roger Mougalas coined the term Big Data. It refers to the large-volume, high-velocity data generated by the Internet and billions of connected intelligent devices, which was impossible to store, manage, process and analyze with traditional database management and business intelligence systems. The need for high-performance data management systems and advanced analytics that can deal with a new generation of applications, such as Internet of Things (IoT), real-time applications and streaming apps, led to the development of data lake technologies. Initially, the term “data lake” referred to the Hadoop framework and its distributed computing and file system, which bring storage and compute together and allow faster data ingestion, processing and analysis. In today’s environment, “data lake” could refer to both physical and logical forms: a logical data lake could include Hadoop, a data warehouse (SQL/NoSQL) and object-based storage, for instance.
Q. One of the aspects of replacing and enhancing a brownfield environment is that there are different teams in the midst of different budget cycles. This makes greenfield very appealing. On the other hand, greenfield requires a massive capital outlay. How do you see the percentages of either scenario working out in the short term?
A. I do not have an exact percentage, but the majority of enterprises use a brownfield implementation strategy, since their environments have been in place for a long time. In order to develop and deliver new capabilities with velocity, greenfield approaches are gaining significant traction. Most new application development based on microservices/cloud native is being implemented in greenfield to reduce risk and cost, using the cloud resources available today at a smaller scale at first and adding more resources later.
Q. There is a heavy reliance upon mainframes in banking environments. There’s quite a bit of error that has been eliminated through decades of best practices. How do we ensure that we don’t build in error because these models are so new?
A. The compelling reasons behind mainframe migration, besides cost, are the ability to develop and deliver new application capabilities and business services, and to make data available to all other applications.
There are four methods for mainframe migration:
- Data migration only
Each approach gives enterprises different degrees of risk and freedom. Applying best practices to both application design/development and operational management is the best way to ensure smooth application migration from a monolith to a new distributed environment such as microservices/cloud native. Data architecture plays a pivotal role in the design process, in addition to applying Continuous Integration and Continuous Delivery (CI/CD) processes.
Q. With the changes into a monolithic data lake, will we be seeing different data lakes with different security parameters, which just means that each lake is simply another data repository?
A. If we follow a domain-driven design principle, you could have multiple data lakes, each with specific governance and security policies appropriate to its domain. Multiple data lakes could be accessed through data virtualization to mimic a monolithic data lake; this approach is based on a logical data lake architecture.
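As a rough illustration of that idea, the Python sketch below puts a single query interface in front of several domain lakes, each enforcing its own access policy. The class names, the role-based policy and the sample records are all assumptions made for the example; real data virtualization layers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class DomainLake:
    """Hypothetical per-domain lake with its own simple security policy."""
    name: str
    allowed_roles: set
    records: list = field(default_factory=list)

    def query(self, role, predicate):
        # Each domain enforces its own policy before returning any data.
        if role not in self.allowed_roles:
            raise PermissionError(f"{role} may not read {self.name}")
        return [r for r in self.records if predicate(r)]


class LogicalDataLake:
    """Hypothetical virtualization layer: one interface over many lakes."""

    def __init__(self, lakes):
        self.lakes = {lake.name: lake for lake in lakes}

    def query(self, role, predicate):
        results = []
        for lake in self.lakes.values():
            try:
                results.extend(lake.query(role, predicate))
            except PermissionError:
                continue  # this domain's policy denies the role; skip it
        return results


# Illustrative data: two domains with different access rules.
finance = DomainLake("finance", {"auditor"}, [{"id": 1, "amount": 900}])
marketing = DomainLake("marketing", {"auditor", "analyst"}, [{"id": 2, "amount": 50}])
logical = LogicalDataLake([finance, marketing])
```

A caller queries `logical` as if it were one monolithic lake, while an `analyst` only ever sees marketing records and an `auditor` sees both domains.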
Q. What’s the difference between multiple data lakes and multiple data repositories? Isn’t it just a matter of quantity?
A. From a Big Data perspective, a data lake not only stores data but also provides capabilities to process and analyze it (e.g., the Hadoop framework/HDFS). New trends are emerging that separate storage and compute (e.g., disaggregated storage architectures). Hence, some vendors use the term “data lake” loosely and offer only storage capability, while others provide both storage and data processing capabilities as an integrated solution. More important than the definition of data lake are your usage and specific application requirements, which determine which solution is a good fit for your environment.