Running Apache Spark applications on Kubernetes has a lot of benefits, but operating and managing Kubernetes at scale has significant challenges for data teams. With the recent addition of Ocean for Apache Spark to Spot's suite of Kubernetes solutions, data teams have the power and flexibility of Kubernetes without the complexities. A cloud-native managed service, Ocean Spark automates cloud infrastructure and application management for Spark-on-Kubernetes.

Designed to be developer-friendly, Ocean Spark comes with built-in integrations with popular data tools, including scheduling solutions like Airflow and Jupyter notebooks. There are multiple ways to run a Spark application on Ocean for Apache Spark:

  • You can connect Jupyter Notebooks to work with Spark interactively.
  • You can submit Spark applications using schedulers like Airflow, Azure Data Factory, Kubeflow, Argo, Prefect, or just a simple CRON job.
  • You can also directly call the Ocean Spark REST API to submit Spark applications from anywhere, thereby enabling custom integrations with your infrastructure, CI/CD tools, and more (see the sketch after this list).
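For example, a REST submission from Python might look roughly like the sketch below. The endpoint path, query parameter, and payload fields are assumptions based on the general shape of the Spot API, so treat them as illustrative and check the Ocean Spark API reference for the authoritative contract.

```python
import requests

API_TOKEN = "<your Spot API token>"   # placeholder
ACCOUNT_ID = "act-12345678"           # placeholder Spot account ID
CLUSTER_ID = "osc-abcdef12"           # placeholder Ocean Spark cluster ID

# Assumed endpoint for submitting an application to an Ocean Spark cluster;
# verify the exact path in the Ocean Spark API documentation.
url = f"https://api.spotinst.io/ocean/spark/cluster/{CLUSTER_ID}/app"

response = requests.post(
    url,
    params={"accountId": ACCOUNT_ID},
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        # Illustrative application payload; the real schema may differ.
        "jobId": "spark-pi",
        "configOverrides": {
            "type": "Scala",
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
        },
    },
)
response.raise_for_status()
print(response.json())
```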

In this tutorial, we're going to show you how to connect the orchestration service most popular among enterprise customers, Apache Airflow, and illustrate how you can schedule and monitor your workflows and pipelines on Ocean for Apache Spark.

We're going to use the AWS service Managed Workflows for Apache Airflow (MWAA) as our main example, because it is easy to set up and handles the management of the underlying infrastructure for scalability, availability and security. But these instructions are easy to adapt to alternative ways of running Airflow.

(Optional) Set up Amazon Managed Workflows for Apache Airflow (MWAA)

An Amazon S3 bucket is used to store Apache Airflow Directed Acyclic Graphs (DAGs), custom plugins in a plugins.zip file, and Python dependencies in a requirements.txt file. Please make sure that the S3 bucket is configured to Block all public access, with Bucket Versioning enabled, and located in the same AWS Region as the Amazon MWAA environment.
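If you are creating the bucket from scratch, a minimal boto3 sketch of these settings might look like the following (the bucket name is a placeholder, and us-east-1 is assumed; other Regions also require a LocationConstraint on bucket creation):

```python
import boto3

REGION = "us-east-1"           # must match the Region of your MWAA environment
BUCKET = "my-mwaa-artifacts"   # placeholder bucket name

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket (us-east-1 does not take a LocationConstraint).
s3.create_bucket(Bucket=BUCKET)

# Block all public access, as required by MWAA.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable Bucket Versioning so MWAA can track new versions of plugins.zip and requirements.txt.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)
```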

The following image shows how to set up the locations on S3 to store the different artifacts.

[Link]

Please follow the instructions here to sync the files between your Git repository and S3.

Install and Configure the Ocean Spark Airflow Provider

In MWAA, you can provide a requirements.txt file listing all the Python packages you want to install. You should include the ocean-spark-airflow-provider package, which is available here. On other distributions of Airflow, you can simply install this package by running pip install ocean-spark-airflow-provider.
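For instance, a minimal requirements.txt for this integration might contain little more than the provider package (the unpinned form shown here installs the latest release available when MWAA builds the environment; pin a version if you need reproducible builds):

```
# MWAA requirements.txt
ocean-spark-airflow-provider
```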

This open-source package (see the GitHub repository) provides an OceanSparkOperator, which we will show you later, and a connection type used to configure how Airflow talks to Ocean Spark.

Please enter the connection details as shown below. You can reach this screen from Admin -> Connections -> Add a new record (the + sign), then select Ocean For Apache Spark from the Connection Type dropdown.

[Link]

Enter the following details in the connection window, and then click Save.

  • Connection Id: Use ocean_spark_default by default. You may use a different name.
  • Connection Type: Select "Ocean For Apache Spark" from the dropdown
  • Description: Enter any optional text to describe the connection.
  • Cluster Id: The ID of your Ocean Spark cluster
  • Account Id: The Spot Account ID the cluster belongs to, which corresponds to a cloud provider account.
  • API token: Your Spot by NetApp API token (see How to create an API token)

Using the Ocean Spark Operator in your Airflow DAGs

In Airflow, a DAG (Directed Acyclic Graph) is a collection of the tasks that you want to run, organized in a way that reflects their relationships and dependencies as a graph. Airflow will only start running a task once all its upstream tasks are finished.

When you define an Airflow task using the Ocean Spark Operator, the task consists of running a Spark application on Ocean Spark. For example, you can run multiple independent Spark pipelines in parallel, and only run a final Spark (or non-Spark) application once the parallel pipelines have completed.

[Link]

The final Spark job in this DAG will be executed once the two parallel jobs are finished.
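As a rough sketch of how that dependency wiring looks in code (the DAG ID, task IDs, job IDs, and the application config are placeholders, and the operator parameters follow the provider's README; a fuller single-job example appears further down):

```python
from datetime import datetime

from airflow import DAG
from ocean_spark.operators import OceanSparkOperator

# Minimal placeholder application config shared by all three tasks.
APP_CONFIG = {
    "type": "Scala",
    "mainClass": "org.apache.spark.examples.SparkPi",
    "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
}

with DAG(
    dag_id="parallel-pipelines-then-final",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # Two independent Spark pipelines that run in parallel.
    pipeline_a = OceanSparkOperator(task_id="pipeline-a", job_id="pipeline-a", config=APP_CONFIG)
    pipeline_b = OceanSparkOperator(task_id="pipeline-b", job_id="pipeline-b", config=APP_CONFIG)

    # The final job starts only after both upstream pipelines have finished.
    final_job = OceanSparkOperator(task_id="final-job", job_id="final-job", config=APP_CONFIG)

    [pipeline_a, pipeline_b] >> final_job
```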

A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code. When using MWAA, you should upload the DAG Python script to the S3 DAGs folder. Here's an example DAG consisting of a single Spark job.


[Link]
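For reference, a sketch of what such a single-job DAG might look like (the Spark application shown is the SparkPi example; the Spark version, application file, and arguments are illustrative, and the full config schema is documented in the provider's README):

```python
from datetime import datetime

from airflow import DAG
from ocean_spark.operators import OceanSparkOperator

with DAG(
    dag_id="ocean-spark-single-job",
    schedule_interval="@daily",
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    spark_pi = OceanSparkOperator(
        task_id="compute-pi",
        job_id="spark-pi",
        # Application config submitted to Ocean Spark; adapt it to your own application.
        config={
            "type": "Scala",
            "sparkVersion": "3.2.0",
            "mainClass": "org.apache.spark.examples.SparkPi",
            "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
            "arguments": ["10000"],
        },
    )
```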

Once the file is uploaded to the S3 DAGs folder, the DAG will appear in the MWAA environment within a few minutes.

[Link]

Click on Run to run the DAG. The Spark application will start running in your Spot environment in a couple of minutes (Note: you can reduce this startup time by configuring headroom).

[Link]

Once the application completes, you should see the DAG marked as successful in the MWAA environment.

[Link]

Note: If you want to give the connection a name other than the default (ocean_spark_default), pass that name via the conn_id parameter of OceanSparkOperator.

[Link]
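For illustration, a minimal sketch of overriding the connection name might look like this (the connection name and application config are placeholders):

```python
from ocean_spark.operators import OceanSparkOperator

# Same SparkPi task as before, but pointing at a custom connection
# instead of ocean_spark_default.
spark_pi = OceanSparkOperator(
    task_id="compute-pi",
    job_id="spark-pi",
    conn_id="my_ocean_spark_conn",  # must match the Connection Id created in the Airflow UI
    config={
        "type": "Scala",
        "mainClass": "org.apache.spark.examples.SparkPi",
        "mainApplicationFile": "local:///opt/spark/examples/jars/examples.jar",
    },
)
```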

Start using Ocean for Apache Spark

Airflow is just one of several built-in integrations that Ocean for Apache Spark supports to help data teams run their Spark applications on Kubernetes. Learn how you can easily set up, configure, and scale Spark applications and Kubernetes clusters with Ocean Spark. Schedule an initial meeting with our team of Apache Spark Solutions Architects, so we can discuss your use case and help you with a successful onboarding of our platform.
