VMware : Efficient Data Engineering with Versatile Data Kit

October 07, 2021 at 02:47 pm EDT

Description of icon when needed8 Min Read

Digital innovation goes hand in hand with data. To be successful, businesses need to be data driven. But extracting value from data is a tedious process, and it's only possible when the data system value grows faster than its cost. Data engineers face a lot of challenges to focus only on work that requires data engineers' business domain expertise.

Versatile Data Kit is a data engineering framework that enables data engineers to develop, troubleshoot, run, and manage data processing workloads (called "Data Jobs"). A "Data Job" enables Data Engineers to implement automated pull ingestion (Extract and Load in ELT/ETL) and batch data transformation (Transform in ELT/ETL) into a data warehouse. This framework, in production for more than 3 years in VMware's internal Data Analytics platform, helps VMware data engineers, analysts, and scientists manage hundreds of terabytes of data. Today, we are making this framework available in open source on GitHub as Versatile Data Kit.

A Closer Look

Let's look at what data engineering challenges Versatile Data Kit tackles.

A core skill of data engineers is to understand the business domain and apply that business knowledge and data model or ML knowledge to create the correct model. For example, it may be a data model or fact and dimension tables that not only enable historical reporting and BI, but also enables companies to look forward by creating machine learning models and performing predictive analytics.

A big challenge for data engineers is how to focus on only those tasks that require their core data engineering skills. When building out a data product with standard tools, 10% of the effort requires using data engineer core skills, while the remaining 90% are tasks that require system or infrastructure knowledge. Those tasks are still critical work such as setting up packaging, configuration, monitoring, testing, deployment, and automation. Similar to how electricity today has become infrastructure that you buy from a utility company, that 90% should be provided by the tools around the data engineers as utility.

In the last decade, software development has evolved to fully owning the entire application lifecycle (DevOps) and introduced methodologies for building services (12 factor app). We can make those practices accessible to data engineers and analysts without needing to learn them.

Many best practices and patterns are established in the data engineering world (like Kimball's patterns for data warehouse). Those practices can be automated and abstracted so that engineers do not reinvent the wheel or do the same tasks repeatedly.

A framework that provides such capabilities can make data engineers focus all their efforts only in the 10%.

Another big challenge is reducing "unplanned work."

Unplanned work comes from incidents and issues or due to unforeseen consequences of the development work-data in a report suddenly shows unrealistic numbers, or a customer-facing application makes incorrect recommendations based on bugs in data.

Unplanned work is unpredictable as it can come at any time. It often pushes out planned work, breaks commitments, and lowers productivity.

To minimize time spent in unplanned work, it's important to troubleshoot efficiently, and when in firefighting mode-quickly return to a stable state. All data applications consist of four basic elements-input (data), code (version), configuration, and dependencies (other software applications). By keeping track of both data lineage and code lineage (version of the code and its dependencies), a data engineer can scan the data jobs to quickly pinpoint and resolve issues.

Versatile Data Kit

Versatile Data Kit optimizes the work of data engineers at every step of the engineering cycle. In doing so, data practitioners can focus on work that requires business domain expertise.

Define and discover: If we look at a data engineer's development cycle, it starts with defining the business problem and identifying the required data. Data engineers need to track data source owners, check out contacts, and find documentation about the data. Versatile Data Kit can keep lineage -enabling tracking this information. Versatile Data Kit keeps a catalog of all data applications which enables better knowledge sharing.
Integrate: Versatile Data Kit automates data integration in an almost codeless flow-data engineers only need to specify the source and the source objects required and identify the destination database and tables. Versatile Data Kit then extracts the specified data and loads the database and tables automatically.
Transform: Efficient data transformation requires simplifying complex transformation tasks. This starts with removing unneeded information-known as boilerplate-so engineers can focus on data modeling logic. This enables creation of declarative transformations using SQL only, implementing common data patterns like Kimball and when necessary, using the full power of Python.
Deploy: The ideal state for data engineers is a single click deployment of their data applications. Versatile Data Kit takes care of building, packaging, versioning, and scheduling the data applications.
Manage: To efficiently manage many applications, Versatile Data Kit takes care of monitoring, tracking data lineage, and code lineage-thus enabling engineers to quickly pinpoint issues and revert to the latest stable version.

Versatile Data Kit consists of:

A Data SDK which includes the building blocks to build data applications with minimal effort and a plugin framework to extend or inspect any part of the data application.

A Control Service Server that enables users to manage data jobs by building, packaging, installing, tracking dependencies, and maintaining versioning. Jobs are run in a managed runtime environment with telemetry, logging, notifications.

Try it Out and Get Involved

While the project has been in development for some time, it still has a long way to go. And now that it's open source we look forward to new insights and fresh ideas. We welcome everyone to join and contribute-whether you're a data engineer or an analyst who wants to try out the project and provide feedback, ideas and a fresh perspective or an engineer who wants to contribute to the project by adding a new database connector via the pluggable architecture, or a completely new core feature-you are welcome.

You can check out Versatile Data Kit GitHub repository here: https://github.com/vmware/versatile-data-kit

Please check out the Versatile Data Kit GitHub repository and join our public Slack workspace.

Attachments

Original document
Permalink

Disclaimer

VMware Inc. published this content on 07 October 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 07 October 2021 18:46:04 UTC.

	1st Jan change	Capi.
ZSCALER, INC.	-20.09%	26.2B
MONDAY.COM LTD.	+1.38%	9.06B
WALKME LTD.	-25.68%	714M
TRAFFIC CONTROL TECHNOLOGY CO., LTD.	-1.92%	486M
XPERI INC.	-9.71%	443M
INSYDE SOFTWARE CORP.	-9.63%	270M
ZHENGZHOU TIAMAES TECHNOLOGY CO.,LTD	-28.88%	206M
HANGZHOU HOPECHART IOT TECHNOLOGY CO.,LTD	-42.63%	184M
ELLIPTIC LABORATORIES ASA	-6.34%	159M

1st Jan change

Capi.

ZSCALER, INC.

-20.09%

26.2B

MONDAY.COM LTD.

+1.38%

9.06B

WALKME LTD.

-25.68%

714M

TRAFFIC CONTROL TECHNOLOGY CO., LTD.

-1.92%

486M

XPERI INC.

-9.71%

443M

INSYDE SOFTWARE CORP.

-9.63%

270M

ZHENGZHOU TIAMAES TECHNOLOGY CO.,LTD

-28.88%

206M

HANGZHOU HOPECHART IOT TECHNOLOGY CO.,LTD

-42.63%

184M

ELLIPTIC LABORATORIES ASA

-6.34%

159M

Liquid Web Expands Business Continuity Solutions with Disaster Recovery Protection Powered by VMware	Mar. 26	CI
VIAVI Solutions Inc. and VMware, Inc. Expand RIC Testbed as a Service to Advance Digital Twin Environments	Feb. 21	CI
VMware, Inc.(NYSE:VMW) dropped from S&P Software & Services Select Industry Index	Nov. 28	CI
VMware, Inc.(NYSE:VMW) dropped from S&P TMI Index	Nov. 28	CI
VMware, Inc.(NYSE:VMW) dropped from S&P Global BMI Index	Nov. 28	CI
VMware, Inc.(NYSE:VMW) dropped from FTSE All-World Index	Nov. 27	CI
Global markets live: Vertex, Amazon, Microsoft, BP, Unilever...	Nov. 23
Broadcom Borrows $28.39 Billion to Fund VMware Acquisition	Nov. 22	MT
Global markets live: HP, Nordstrom, Deere, Nvidia, Broadcom...	Nov. 22
Trending : Broadcom Finally Seals VMware Deal	Nov. 22	DJ
Broadcom Completes $69 Billion Acquisition of VMware	Nov. 22	MT
Broadcom Closes VMware Acquisition	Nov. 22	MT
Broadcom closes $69 bln VMware deal after China approval	Nov. 22	RE
'Boomerang' CEOs of major companies	Nov. 22	RE
Broadcom Inc. completed the acquisition of VMware, Inc. from Michael S. Dell, Dodge & Cox, Silver Lake Management, L.L.C., Silver Lake Partners V DE, L.P., SL SPV-2, L.P. and others.	Nov. 21	CI
Broadcom, VMware Receive Final Regulatory Approval for Merger; Expect to Close Deal by Wednesday	Nov. 21	MT
Broadcom, VMware Say $61 Billion Takeover Deal is Cleared by Regulators	Nov. 21	DJ
Broadcom plans to close $69 billion VMWare deal on Wednesday	Nov. 21	RE
Broadcom, VMWare Acquire All Regulatory Approvals For Merger, Intend to Close Acquisition November 22	Nov. 21	MT
China market regulator grants conditional approval of Broadcom-VMware deal	Nov. 21	RE
CHINA'S MARKET REGULATOR GAVE CONDITIONAL APPROVAL FOR BROADCOM'…	Nov. 21	RE
Orange Business and VMware, Inc. Transform Flexible Sd-Wan to Simplify Customer Experience with Digitalization and Automation	Nov. 08	CI
NetApp Launches Hybrid Cloud Offering for Small, Medium-Sized Businesses	Nov. 07	MT
VMware Expands Partnership With Google Cloud	Nov. 07	MT
VMware Unveils Advanced Automation Capabilities Aimed to Simplify Information Technology Workflows	Nov. 07	MT

VMware, Inc.

Equities

VMW

US9285634021

Software

VMware : Efficient Data Engineering with Versatile Data Kit

Latest news about VMware, Inc.

Chart VMware, Inc.

Company Profile

Sector System Software