Physical or virtualized data marts – what’s right for your organization?
Data marts and data warehouses have served business users' requirements for a long time. But in today's agile environment, organizations cannot wait weeks or even months for data to become available, which is what typically happens when companies follow a full Software Development Life Cycle (SDLC) to create data marts.
Why were physical Data Marts created?
An organization typically created a physical data mart to meet one or more of the following requirements:
- Blending of data: Data comes from varied sources and may be a mix of relational and non-relational data. Most Business Intelligence (BI) tools allow blending of data, but their performance suffers when data volumes are large, and the user experience for drill-downs and aggregations falls short of expectations.
- Data transformations and aggregations: Transforming raw data and aggregating it to a higher level of granularity are the main use cases for using Extract, Transform and Load (ETL) tools to create data marts. Basic transformations and aggregations can be done in BI tools, but aggregating data on the fly can be costly, and complex transformations still require ETL tools (a minimal example of such a mart-building job follows this list).
- Single source of truth: For financial companies and other organizations where data accuracy, consistency, and reliability are of utmost importance, having a single source of truth is essential from a reporting perspective, and data marts can serve as the container for all financial and audit reporting.
- Easy and direct integration with BI or visualization tools: Data marts have generally been created in relational databases, which provide a direct Open Database Connectivity / Java Database Connectivity (ODBC/JDBC) source for BI tools and avoid integration challenges.
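As a concrete (and purely illustrative) picture of the blending and aggregation work described above, here is a minimal PySpark sketch of the kind of job that loads a physical data mart. The connection strings, table names, and file path are placeholders, not references to real systems.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of an ETL job that feeds a physical data mart.
spark = SparkSession.builder.appName("sales_mart_load").getOrCreate()

# Relational source: daily order transactions (hypothetical database).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://oltp-host:5432/sales")  # placeholder URL
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Non-relational source: product catalog exported as JSON (hypothetical path).
products = spark.read.json("/landing/catalog/products.json")

# Blend, transform, and aggregate to the grain the mart needs.
mart = (
    orders.join(products, "product_id")
    .withColumn("net_amount",
                F.col("quantity") * F.col("unit_price") * (1 - F.col("discount")))
    .groupBy("order_date", "product_category")
    .agg(F.sum("net_amount").alias("daily_net_sales"),
         F.countDistinct("order_id").alias("order_count"))
)

# Persist the aggregate as the mart table BI tools will query over ODBC/JDBC.
(mart.write.format("jdbc")
     .option("url", "jdbc:postgresql://mart-host:5432/marts")  # placeholder URL
     .option("dbtable", "sales.daily_product_sales")
     .mode("overwrite")
     .save())
```

The point of the example is the shape of the work, not the specific engine: every column, table, and path would differ in a real mart, but the blend-transform-aggregate-persist pattern is what an SDLC-driven mart project spends its weeks on.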
Are physical Data Marts the only solution?
Organizations may no longer need to build extensive and expensive physical data marts. With the wide variety of distributed computing and virtualization tools available on-premises and in the cloud, we can now deliver data to the business in an agile manner. Data virtualization covers all of the capabilities of physical data marts listed above (blending of data, data transformations and aggregations, a single source of truth, and integration with BI tools) while also providing other advantages:
- Agility: Saves time in creation and delivery of data structures.
- Cost advantages: Saves cost in creation, upgrades and maintenance.
- Extensibility: The existing marts or lakes can be quickly extended with new sources or one-time data sources.
- Real-time and streaming analytics: For real-time analytics, data virtualization can provide near-real-time capabilities (see the streaming sketch after this list).
- Performance: Because data virtualization tools can hold data in server memory, they provide enhanced query performance for BI and analytics tools.
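To make the real-time point concrete, here is a minimal Spark Structured Streaming sketch that serves a continuously updated aggregate without ever materializing a physical mart. The Kafka broker address, topic name, and event schema are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("realtime_sales_view").getOrCreate()

# Assumed shape of the incoming events.
event_schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("region", StringType()),
    StructField("amount", DoubleType()),
])

# Read the stream from a hypothetical Kafka topic and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "sales_events")               # placeholder topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Rolling 5-minute revenue by region, kept in memory and queryable with SQL
# from the same session -- no persisted mart table involved.
revenue = (
    events.groupBy(F.window("event_time", "5 minutes"), "region")
          .agg(F.sum("amount").alias("revenue"))
)

query = (
    revenue.writeStream.outputMode("complete")
    .format("memory")
    .queryName("revenue_by_region")
    .start()
)

# Consumers in the same session could now run:
# spark.sql("SELECT * FROM revenue_by_region").show()
```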
Data Virtualization and Big Data Lake Architecture
Virtual data marts provide a lot of value in the following patterns of big data architecture:
- Big Data Analytics: Newer data stores like NoSQL, document stores, and graph databases do not typically provide a standard JDBC/ODBC-based SQL interface, which makes them difficult to use with BI and reporting tools and for data analysts who are familiar with SQL queries. A data virtualization platform provides an abstraction and federation layer that hides the complexities of the big data stores and makes it easy to integrate data from these stores with other data within the enterprise.
- Hybrid Virtual Data Warehouses or Marts: More and more companies are moving data to data lakes, but there is still a substantial amount of data in traditional relational databases and appliances. The DV platform can sit on top of both data stores and federate queries across them, so that the hybrid data warehouse appears as a single data store to users. We can leverage virtualization to combine data from operational databases, the enterprise data warehouse, flat files, and big data systems and provide a common analysis layer; the query sketch after this list shows how such a federated join looks to a consumer.
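The following sketch, written against a hypothetical ODBC DSN and hypothetical virtual view names, shows how a hybrid warehouse appears from the consumer's side: one connection and one SQL statement, even though one view is backed by the data lake and the other by the relational warehouse.

```python
import pyodbc  # any ODBC-capable client works; the DSN below is a placeholder

# One connection to the data virtualization layer, regardless of where the
# underlying data actually lives.
conn = pyodbc.connect("DSN=dv_platform;UID=analyst;PWD=***")

sql = """
    SELECT c.customer_segment,
           COUNT(*)                     AS event_count,
           COUNT(DISTINCT e.session_id) AS sessions
    FROM   lake.web_events  e           -- virtual view over lake/Hadoop data
    JOIN   dw.customer_dim  c           -- virtual view over the EDW
           ON e.customer_id = c.customer_id
    WHERE  e.event_date >= '2024-01-01'
    GROUP  BY c.customer_segment
"""

# The DV engine decides where each part of the query runs (pushdown,
# federation, caching); the client just receives a result set.
cursor = conn.cursor()
cursor.execute(sql)
for row in cursor.fetchall():
    print(row.customer_segment, row.event_count, row.sessions)
```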
Leveraging Data Virtualization Tools
Data Virtualization (DV) tools enable independently designed and deployed data structures to be leveraged together as a single source, in real time, and with limited (or often no) data movement. According to noted data virtualization expert Rick van der Lans, author of Data Virtualization for Business Intelligence Systems, “data virtualization is the technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.”
There are various types of virtualization tools and technologies which can be categorized broadly as:
- Traditional Virtualization Tools: Traditional virtualization tools like Composite and Denodo can provide a common business (canonical) layer over multiple sources of data. The virtualization layer provides pass-through capabilities to BI tools using direct ODBC/JDBC connectivity, but high data volumes may present challenges depending on cluster sizing and server RAM limits. These tools handle simple transformations well, so this approach works best for small to medium data sets with simple transformations.
- Big Data Architecture Based Virtualization Tools: With distributed computing, new frameworks and tools are now on the market that provide big data based (mostly Spark-based) processing engines. These engines are not constrained to a single server's CPU and memory for a single job, because they can distribute that job across multiple servers, and installations can utilize an existing Spark or Hadoop cluster. These tools can fulfill high-performance requirements on large volumes without creating physical data structures; the data can be kept virtualized or written back to databases or file systems for historical and audit reporting purposes. Presto and Spark can be used to provide virtualized, real-time processing, and some of these frameworks are used by companies like Airbnb, Netflix, and Facebook (a minimal Spark-based sketch follows this list). A new generation of real-time analytics stores like Druid can be used to power real-time visualizations. The open source Apache Arrow framework and tools like Dremio look very promising for virtualization and big data processing. Big data based MOLAP tools like Kyvos, AtScale, or Kylin can also serve as a virtual layer but need prep work to get data into cubes.
- Data Preparation Tools: Data prep tools are a new set of lightweight, agile ETL tools gaining popularity for creating virtualized data marts and logical data lakes. These tools can create a physical or virtual data mart or data lake, but development and execution are more agile than with traditional ETL tools. Alteryx and Paxata are some of the prominent tools.
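As a rough sketch of the Spark-based approach mentioned above, the snippet below exposes a lake file and an operational table as views and queries them on demand; nothing is persisted unless write-back is explicitly requested. The paths, JDBC URL, and table names are placeholders.

```python
from pyspark.sql import SparkSession

# A "virtual mart": external sources are exposed as views and queried on
# demand instead of being copied into a physical mart.
spark = SparkSession.builder.appName("virtual_mart").getOrCreate()

# Lake data stays where it is (e.g., Parquet on HDFS/S3).
spark.read.parquet("s3a://lake/curated/orders/").createOrReplaceTempView("orders")

# Reference data stays in the operational database; Spark reads it over JDBC.
(spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://oltp-host:5432/sales")  # placeholder
      .option("dbtable", "public.customers")
      .option("user", "reader").option("password", "***")
      .load()
      .createOrReplaceTempView("customers"))

# Consumers query the blended view with plain SQL.
monthly = spark.sql("""
    SELECT date_trunc('month', o.order_date) AS order_month,
           c.segment,
           SUM(o.amount) AS revenue
    FROM   orders o
    JOIN   customers c ON o.customer_id = c.customer_id
    GROUP  BY 1, 2
""")

monthly.cache()   # optional: keep hot aggregates in cluster memory
monthly.show(10)

# Optional write-back for historical or audit reporting, as noted above:
# monthly.write.mode("append").parquet("s3a://lake/marts/monthly_revenue/")
```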
Physical vs. Virtual – What’s right for your organization?
Companies looking for more agile options for delivering blended, high-volume data, with simple to complex transformations and persistence options, should look at the new frameworks and tools above for BI, ETL, and data preparation. Virtualization also becomes an important tool for any real-time reporting and analytics requirements.
Some of the key features of data virtualization tools to consider during the selection process are:
- Allow all types of data sources to be accessed, including Hadoop files, SQL databases, flat files, spreadsheets, packaged applications, social media networks, and messaging systems.
- Complex data preparation, data cleansing, and data manipulation functions should be built into the tool, and reusable user-defined functions (UDFs) should be available.
- Detailed lineage and impact analysis capabilities that show from which data sources results are derived and which transformations have been applied to the data.
- A searchable data catalog that contains definitions, descriptions, tags, and classifications of all the available data. This feature is indispensable for more investigative and experimental users, such as data scientists.
- Advanced query optimization techniques for federated queries across a heterogeneous set of data sources. Joins combining data stored in Hadoop with data stored in a SQL database should be executed as efficiently as possible.
- Advanced query pushdown and parallel processing capabilities to handle complex queries on truly large data volumes. It's important that query processing is executed as close to where the data is stored as possible, regardless of the storage technology used (a small pushdown illustration follows this list).
- Centralized data security features for specifying, at a granular level, which user is allowed to do what with which data.
- Support for different access interfaces such as SQL, REST/JSON, XQuery, MDX, SOAP/XML, or JMS.
- A distributed processing and memory architecture for high-volume processing and high-volume in-memory storage.
- Managed services or an auto-scaling server architecture based on load and concurrency.
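To illustrate the pushdown point from the checklist above, the sketch below contrasts pulling a full table into Spark with pushing the aggregation down to the source database. The JDBC URL and table are hypothetical, and Spark's JDBC reader is used only as a stand-in for whatever federation engine a DV platform provides.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown_demo").getOrCreate()

# Naive approach: read the whole table across the network, then aggregate
# in the compute layer.
full_table = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/warehouse")  # placeholder
    .option("dbtable", "finance.transactions")
    .load())
slow = full_table.groupBy("region").sum("amount")

# Pushed-down approach: ship the aggregation to the source as a query, so
# only a handful of summary rows travel back to the engine.
pushed = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dw-host:5432/warehouse")
    .option("query",
            "SELECT region, SUM(amount) AS total_amount "
            "FROM finance.transactions GROUP BY region")
    .load())

# explain() shows what actually reaches the source: the second plan scans a
# pre-aggregated result instead of the full transactions table.
slow.explain()
pushed.explain()
```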
It can be challenging to navigate the plethora of tools on the market today, but starting with some quick proofs of concept (POCs) is always the best way to understand their capabilities and determine what best fits your organization. To discover how Leidos can help your organization find the best solution, contact us today to speak to one of our experts.