To achieve business success with AI, you need rapid experimentation. Although the emerging world of machine learning operations (MLOps) offers many tools for iterative model training and deployment, most of them don't streamline data management. And enterprise-caliber storage and data management platforms are often complex and unapproachable for the data scientists and data engineers who work on AI projects.

To fill this gap, we've developed the NetApp® Data Science Toolkit. The toolkit exposes NetApp's industry-leading, multitenant data management capabilities through a simple interface designed for data scientists and data engineers. Packaged in the familiar form of a Python program, it lets them provision and destroy data volumes in seconds. Because it also gives easy access to advanced storage features that would normally require help from a storage administrator, the toolkit can significantly speed up projects.

A quicker, easier AI workflow

With the NetApp Data Science Toolkit, a data scientist can almost instantaneously create a data volume that's an exact copy of an existing volume, even if the existing volume contains terabytes or even petabytes of data. Data scientists can quickly create clones of datasets that they can reformat, normalize, and manipulate, while preserving the original 'gold-source' dataset. Under the hood, these operations use highly efficient and battle-tested NetApp cloning technology, but they can be performed by a data scientist without storage expertise. What used to take days or weeks (and the assistance of a storage administrator) now takes seconds.
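
To see why a clone can appear almost instantly regardless of dataset size, consider the toy model below. It is a conceptual sketch only: it is not the toolkit's API and not how NetApp cloning is implemented internally. The Volume class and its methods are hypothetical names invented for this illustration. The idea it shows is block sharing: the clone points at the same stored blocks as the source and keeps only its own changes.

```python
class Volume:
    """Toy volume (hypothetical): block number -> bytes, built from shared layers."""

    def __init__(self, name, layers=()):
        self.name = name
        self._layers = layers  # frozen dicts shared between volumes, newest first
        self._delta = {}       # blocks written by this volume since its last freeze

    def write(self, block_no, data):
        self._delta[block_no] = data

    def read(self, block_no):
        if block_no in self._delta:
            return self._delta[block_no]
        for layer in self._layers:
            if block_no in layer:
                return layer[block_no]
        return None

    def clone(self, new_name):
        # Freeze the current private writes into a shared layer.
        # No block data is copied, so the cost does not grow with dataset size.
        if self._delta:
            self._layers = (self._delta,) + self._layers
            self._delta = {}
        return Volume(new_name, self._layers)

    def private_blocks(self):
        return len(self._delta)


# The 'gold-source' dataset stays untouched while the clone is modified.
gold = Volume("gold")
for i in range(3):
    gold.write(i, f"raw-{i}".encode())

scratch = gold.clone("scratch")
scratch.write(0, b"normalized")
```

After this runs, reading block 0 from gold still returns the raw data, while scratch sees its own normalized version and stores only that one changed block.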

Data scientists can also save a space-efficient, read-only copy of an existing data volume. Based on the famed NetApp Snapshot technology, this functionality can be used to version datasets and implement dataset-to-model traceability. In regulated industries, traceability is a baseline requirement, and implementing it is extremely complicated with most other tools. Now, with the Data Science Toolkit, it's quick and easy.
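
Dataset-to-model traceability comes down to recording, next to each trained model, which read-only snapshot of the data it was trained on. The sketch below illustrates that bookkeeping in plain Python. It is an assumption-laden illustration, not the toolkit's interface: Snapshot, take_snapshot, the volume name, and the snapshot name are all hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical read-only record of a volume at one point in time."""
    volume: str
    name: str
    blocks: MappingProxyType  # read-only view of the block map


def take_snapshot(volume, name, live_blocks):
    # dict(live_blocks) copies the block map (references), not the bytes themselves.
    return Snapshot(volume, name, MappingProxyType(dict(live_blocks)))


# Live training data (toy stand-in for a data volume).
live = {0: b"images-v1", 1: b"labels-v1"}

# Version the dataset just before training.
snap = take_snapshot("train_data", "run_0042", live)

# Store the snapshot identity alongside the model's metadata.
model_record = {
    "model": "classifier",
    "dataset_volume": snap.volume,
    "dataset_snapshot": snap.name,
}

# Later changes to the live data do not affect the snapshot.
live[1] = b"labels-v2"
```

Given model_record, an auditor can look up the named snapshot and see exactly the data the model was trained on, even after the live dataset has moved on.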

Data Science Toolkit versus AI Control Plane

The NetApp AI Control Plane is a full-stack solution that pairs popular open-source MLOps tools with NetApp technology so that you can rapidly manage AI data and experimentation. The Data Science Toolkit enhances this solution by making it much easier to manage data. A data scientist working within a Jupyter Notebook that was provisioned using the AI Control Plane can use the Data Science Toolkit to implement a data management task in one simple line of Python code. Likewise, a data engineer can easily run a Data Science Toolkit operation as a step within an Apache Airflow or Kubeflow Pipelines automated workflow.

You can also use the Data Science Toolkit to integrate advanced NetApp data management capabilities into other MLOps platforms, including custom and homegrown platforms.

Alternatively, the Data Science Toolkit can serve as an easy-to-use, simple-to-manage standalone solution for smaller teams or teams that don't need the overhead of a full-blown MLOps platform. The toolkit is compatible with NetApp Cloud Volumes ONTAP® software, so teams can use on-demand cloud compute resources in AWS, Microsoft Azure, or Google Cloud.

With the NetApp Data Science Toolkit, data management is no longer an impediment to a fast, streamlined AI process. To learn more, visit the toolkit's GitHub repository.

Disclaimer

NetApp Inc. published this content on 17 December 2020 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 17 December 2020 21:14:05 UTC.