Cluster computing goes local with Spark Connect


Gone are the days when data/ML engineers had to repeatedly package their data processing logic into a Spark app and send it to the cluster to test, customize, and tune it. Spark Connect can now power local computing environments on all platforms with direct access to Spark's cluster compute engine.

Ajay Gupta

Cluster computing on Spark is accessed primarily by launching the Spark shell on a node that has access to the cluster, or by packaging the desired data processing logic into a Spark app and submitting it to the cluster manager via the spark-submit command, which must likewise be run from a node that has access to the cluster.

These constraints make it hard for data engineers to seamlessly test their code on a real cluster while building their data processing logic with the Spark APIs. They also prevent data applications from seamlessly leveraging the compute capabilities of an on-demand Spark cluster.

To address these constraints to some extent, there are some standard solutions today, such as the Spark Thrift Server and Apache Livy. The Spark Thrift Server (essentially a Thrift service implemented by the Apache Spark community on top of HiveServer2) lets data applications leverage the power of Spark SQL remotely in standard SQL over the standard JDBC interface, while Livy lets you submit snippets of code and whole applications to a Spark cluster remotely via REST and programmatic APIs.

However, none of these solutions offers a native execution experience for Spark's rich DataFrame APIs on all platforms, the kind of experience you typically get on a Spark shell. These solutions also involve a learning curve, may require custom modifications to a native Spark application, and may need add-on installation and maintenance.

But with Spark Connect, released in the latest version of Spark, 3.4, you can natively experience and leverage the power of Spark's cluster computing from a remote environment. Spark Connect is built on a decoupled client-server architecture based on gRPC, in which unresolved logical plans serve as the common contract between client and server.

The architecture is shown below (Reference: Spark Docs):


The gRPC service (the server) is hosted in the driver as a plug-in. Multiple Spark Connect clients can connect to it to execute their respective query plans. In general, the Connect service analyzes, optimizes, and executes the logical plans received from the various clients and transmits the results back to the respective clients.
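The plan-based contract described above can be pictured with a toy sketch in plain Python. This is only an analogy, not Spark's actual protobuf/gRPC machinery: the client merely records operations as data (an unresolved plan), and the "server" resolves the table reference against its catalog, executes the operations, and ships back the rows.

```python
# Toy illustration of a plan-based client-server contract (NOT the real
# Spark Connect protocol): the client builds an *unresolved* plan as plain
# data; the server resolves and executes it, returning the results.

from dataclasses import dataclass, field


@dataclass
class Plan:
    """An unresolved logical plan: a table name plus pending operations."""
    table: str
    ops: list = field(default_factory=list)

    def filter(self, predicate):
        # Client side: nothing executes, the op is just recorded.
        self.ops.append(("filter", predicate))
        return self

    def select(self, *cols):
        self.ops.append(("select", cols))
        return self


class Server:
    """Stands in for the driver-side Connect service."""

    def __init__(self, catalog):
        self.catalog = catalog  # table name -> list of row dicts

    def execute(self, plan):
        rows = self.catalog[plan.table]  # "resolve" the table reference
        for op, arg in plan.ops:
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            elif op == "select":
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows  # results travel back to the client


server = Server({"users": [{"name": "a", "age": 41}, {"name": "b", "age": 17}]})
plan = Plan("users").filter(lambda r: r["age"] >= 18).select("name")
print(server.execute(plan))  # [{'name': 'a'}]
```

The point of the sketch is the division of labor: the client never needs a catalog, an optimizer, or an execution engine, only a way to describe what it wants, which is exactly why the real Connect client can be a thin library.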

Further, Spark Connect provides a thin client library that can be embedded in application servers, IDEs, notebooks, and programming languages. The thin client library lets developers write data processing logic in their preferred DataFrame APIs and automatically triggers remote evaluation of the underlying query plan when an action is invoked. Once remote execution completes, the desired output is available in the same scope.

The Spark Connect client library actually provides applications with a special SparkSession object that points to a remote Spark driver. This special SparkSession instance encapsulates all the logic to package and push unresolved query execution plans over the gRPC contract to the configured driver when required, collect the results the driver returns for each successfully executed plan, and then serve the collected results to the application.
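A minimal client-side sketch of this in PySpark 3.4+ is shown below. It assumes a Spark Connect server has already been started on the cluster (the Spark distribution ships a start-connect-server.sh script for this) and is reachable at the placeholder address sc://localhost:15002; swap in your own host and port.

```python
# Client-side sketch: connect to a remote Spark Connect server instead of
# starting a local JVM driver. Requires: pip install "pyspark[connect]>=3.4"
# Assumes a Connect server is already listening at sc://localhost:15002.
from pyspark.sql import SparkSession

# This special SparkSession holds no local SparkContext; every action ships
# the unresolved plan over gRPC to the configured remote driver.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).filter("id % 2 == 0")  # only builds a plan, nothing runs
rows = df.collect()  # action: plan is sent, executed remotely, rows returned
print(rows)
```

Note that the code after builder.remote() is ordinary DataFrame code; the same logic runs unchanged in a notebook, an IDE, or an application server, which is the portability the article describes.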

To summarize, it should now be easy to see that with Spark Connect enabled, productivity and the development experience for data engineers improve many times over. It also lets anyone interactively explore large datasets remotely, and it ultimately opens up opportunities to build rich data applications that seamlessly leverage the remote cluster computing paradigm to enrich customer experiences and interactions.

If you have any questions or comments on this story, you can contact me on LinkedIn.
