Use Apache Spark from Anywhere: Remote Connectivity with Spark Connect
Stefania Leone, Sr. Manager Product Management, Databricks
Martin Grund, Sr. Staff Software Engineer, Databricks

Product Safe Harbor Statement
This information is provided to outline Databricks' general product direction and is for informational purposes only. Customers who purchase Databricks services should make their purchase decisions relying solely upon services, features, and functions that are currently available. Unreleased features or functionality described in forward-looking statements are subject to change at Databricks' discretion and may not be delivered as planned or at all.
Who develops with OSS Spark locally? What about the data? Who uses Apache Livy or JDBC to connect to Spark?

Today's developer experience requirements
Be close to data during development:
- Software engineering best practices
- Interactive exploration
- High production fidelity: develop & run close to data
Better remote connectivity:
- From any application
- From any language

How to build on Spark?
Up until Spark 3.4, it was hard to support today's developer experience requirements. Applications, IDEs/notebooks, and programming languages/SDKs ran into the same walls: no JVM interop, nothing close to a REPL, and SQL only. Spark's monolithic driver bundles the application logic together with the analyzer, optimizer, scheduler, and distributed execution engine, so a modern data application has to run inside it.

Apache Spark 3.4: Spark Connect
Remote connectivity: thin client, full power of Apache Spark. The Spark Connect Client API lets applications, IDEs/notebooks, and programming languages/SDKs talk to Spark's driver through an application gateway in front of the analyzer, optimizer, scheduler, and distributed execution engine: the modern data application moves out of the driver.
How Spark Connect works
The client holds only the DataFrame API; the Spark server runs the analyzer, optimizer, scheduler, and distributed execution engine.

  spark.read.table("logs")
    .select("id", extract_profile("blob"))
    .write.insertInto("profiles")

  InsertInto profiles
  +- Project
     +- UnresolvedTable logs

1. Declarative API: the DataFrame API call is translated into an unresolved ("parsed") logical plan, shown above.
2. Translate: the unresolved logical plan is sent to the server via gRPC/protobuf (language agnostic).
3. Process: the server analyzes, optimizes, and schedules the plan on the distributed execution engine.
4. Result: results are streamed back to the client via gRPC/Arrow (language agnostic).
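To make the four stages concrete, here is a minimal sketch, assuming a Spark Connect server reachable at sc://localhost and an existing table named "logs"; the extract_profile UDF from the slide is omitted so the snippet stands on its own.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

  # Declarative API: this only assembles an unresolved logical plan on
  # the client; nothing is executed yet.
  df = spark.read.table("logs").select("id")

  # Translate + Process: an action serializes the plan to protobuf, ships
  # it over gRPC, and the server analyzes, optimizes, and schedules it.
  # Result: rows stream back to the client as Arrow record batches.
  df.show()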
Spark Connect Client: Connect from anywhere
Getting started:
1. Install PySpark: pip install "pyspark>=3.4.0"
2. Create your Spark remote session by configuring a connection string.
3. Develop, debug, and run your applications!

It's only the session
Example: only the session creation changes, the rest of your code stays the same.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://...").getOrCreate()

  df = (spark.read.table("samples.nyctaxi.trips")
        .filter("trip_distance > 10"))  # filter threshold assumed; value truncated in source

Develop in your favorite IDE! Download the example from GitHub and interactively develop & debug from your IDE.
Check out Databricks Connect. Contribute to the Spark Go client.

New connectors and SDKs in any language!
Build interactive data applications. Get started with our GitHub example!

Databricks Connect: Simple ETL Example
Source: taxi dataset (CSV). The pipeline loads the CSV into a bronze Delta table, refines it into a silver table (drop columns, transform columns, filter rows), and produces the destination-specific views Yellow View (vendor = 1) and Green View (vendor = 2):
1. Load data from the source into the bronze table.
2. Transform the input data, apply filters, and write to the silver table.
3. Produce destination-specific views.

ETL example in Python using PyCharm; it can be run from any IDE or notebook application. A sketch of the three steps follows below.
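As a rough illustration of the three steps, a minimal PySpark sketch; the connection string, CSV path, and table and column names are placeholders, Delta writes assume a Delta-enabled cluster (e.g. Databricks), and the actual GitHub example differs in details.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

  # 1. Load raw taxi CSV data into the bronze table.
  raw = spark.read.option("header", True).csv("/tmp/taxi/*.csv")
  raw.write.format("delta").mode("overwrite").saveAsTable("taxi_bronze")

  # 2. Drop columns, transform columns, filter rows; write to silver.
  silver = (spark.read.table("taxi_bronze")
            .drop("store_and_fwd_flag")
            .withColumn("trip_distance", F.col("trip_distance").cast("double"))
            .filter(F.col("trip_distance") > 0))
  silver.write.format("delta").mode("overwrite").saveAsTable("taxi_silver")

  # 3. Destination-specific views, split by vendor.
  spark.read.table("taxi_silver").filter("vendor = 1").createOrReplaceTempView("yellow_view")
  spark.read.table("taxi_silver").filter("vendor = 2").createOrReplaceTempView("green_view")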
ETL example in Scala using IntelliJ: the same pipeline, with debugging of the code directly from the IDE.

Connectors and Language SDKs
Databricks Connect, powered by Spark Connect, connects data applications, IDEs/notebooks, partner integrations, and your application to Databricks from anywhere.

New SDKs:
- Experimental official Golang client: Apache Spark Connect Client for Golang.
- R support via reticulate, easy to set up and use directly from RStudio.

Data Application Development
"JDBC for PySpark": integrate your existing applications. Example: data-driven interactive dashboard applications using Plotly. Write once, deploy anywhere (Docker, K8s, Raspberry Pi). Get started with our GitHub example.

Sample application with Dash (https:/ ): a Dash app in an environment with dash and pyspark installed, connected to a Spark cluster or Databricks cluster.
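A minimal sketch of such a dashboard, assuming `pip install dash pyspark`, a Connect endpoint at sc://localhost, and the Databricks sample table samples.nyctaxi.trips; the GitHub example is more elaborate.

  from dash import Dash, dcc, html
  import plotly.express as px
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

  # Aggregate on the cluster; only the small result is pulled into pandas.
  pdf = (spark.read.table("samples.nyctaxi.trips")
         .groupBy("pickup_zip").count()
         .toPandas())

  app = Dash(__name__)
  app.layout = html.Div([
      html.H1("Trips per pickup ZIP"),
      dcc.Graph(figure=px.bar(pdf, x="pickup_zip", y="count")),
  ])

  if __name__ == "__main__":
      app.run(debug=True)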
Data applications with pyspark.ai: live demo!

Server extensibility: Spark Server Libraries
The Dataset/DataFrame API is just the beginning. In the Spark Connect architecture (client app, Connect client, Connect server, driver, executors), two questions arise: what about existing Spark libraries, and what about new APIs?

Goal: extensibility, compatibility, stability.
- An extensible, declarative surface for extending Spark on the server side with simple extensions of the Spark Connect protocol.
- Clients are deployed independently of Spark, reducing friction when upgrading to new versions of Spark and Spark Server applications.
22、ations.Spark Server LibrariesThe future of extensibility in SparkDriverExecutorExecutorExecutorConnect ServerConnect ClientClient AppExtExtExtSpark Server LibrariesProtocol Extensibility-Spark Connect protocol provides extension points for Relations,Commands,and Expressions.-Extensions are registere
23、d during Spark startup and associated with custom Protobuf definitions and invoked if necessary.Spark Server Library ExampleSQL Query History DataFrame(in 200 LOC)Goal:Extend the Spark Session to return a DataFrame with the SQL Query executions for this particular Spark Session that can be used for
What we're going to build (https:/ ):
- A Spark session extension for the query history data.
- A Spark Connect extension for the API.
- A Python client to use it.

Example: SQL Query History DF, the server
- A Spark Session extension with a planner strategy.
- A planner strategy that converts the custom logical plan into a physical plan.
- Logical and physical plan nodes.

Example: SQL Query History DF, the plan nodes
A logical plan node and an exec (physical) node.

Example: SQL Query History DF, the Spark plumbing
Define the Spark Session extension and register the new planner strategy. It is loaded via the spark.sql.extensions Spark conf.
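As a sketch of the wiring when starting the Spark Connect server: the class names com.example.QueryHistoryExtension and com.example.QueryHistoryRelationPlugin are hypothetical, and the flags mirror the Spark 3.4 configuration surface, including the Connect relation-plugin config.

  ./sbin/start-connect-server.sh \
    --packages org.apache.spark:spark-connect_2.12:3.4.0 \
    --jars query-history-extension.jar \
    --conf spark.sql.extensions=com.example.QueryHistoryExtension \
    --conf spark.connect.extensions.relation.classes=com.example.QueryHistoryRelationPlugin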
Example: SQL Query History DF, manual testing
Create a Scala DataFrame directly from the logical plan to verify the server-side pieces.

Example: SQL Query History DF, the Spark Connect server
- Define the extension Protobuf message.
- Create the Spark Connect Relation plugin that converts the proto message into a Spark logical plan.

Example: SQL Query History DF, the Python client
- Generate Python code from the proto.
- Create a PySpark Connect logical plan wrapper.
- Monkey-patch the Spark Session. A sketch of these pieces follows below.
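A rough sketch of those client pieces, not the repo's actual code: it assumes Spark 3.4-era internal PySpark Connect APIs (these internals change between versions) and a hypothetical query_history_pb2 module generated from the extension's .proto file.

  import pyspark.sql.connect.proto as proto
  from pyspark.sql.connect.plan import LogicalPlan
  from pyspark.sql.connect.dataframe import DataFrame
  from pyspark.sql.connect.session import SparkSession

  import query_history_pb2  # hypothetical generated module

  class SQLQueryHistory(LogicalPlan):
      """Leaf plan resolved by the server-side relation plugin."""

      def __init__(self):
          super().__init__(None)  # leaf relation: no child plan

      def plan(self, session):
          rel = proto.Relation()
          # Pack the custom message into the protocol's Any-typed extension
          # slot; the server plugin unpacks it and returns the matching
          # Spark logical plan.
          rel.extension.Pack(query_history_pb2.SQLQueryHistory())
          return rel

  def sql_query_history(self):
      """Return this session's query history as a regular DataFrame."""
      return DataFrame.withPlan(SQLQueryHistory(), self)

  # Monkey-patch the Spark Session so the API feels built in, e.g.
  # spark.sqlQueryHistory().count()
  SparkSession.sqlQueryHistory = sql_query_history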
Related talks
Thursday, 03:30 PM PT: English SDK for Apache Spark: Boosting Development with LLMs
Thursday, 03:30 PM PT: Python with Spark Connect

Thank you!