English SDK for Apache Spark: Boosting Development with LLMs
Gengliang Wang (GitHub: gengliangwang), Allison Wang (GitHub: allisonwang-db), Amanda Liu (GitHub: asl3)

About Us
- The Spark team at Databricks

Agenda: Why We're Excited
01 Story Behind English SDK
02 SDK Features (with demo)
03 Future Work
Story Behind English SDK

Apache Spark by the numbers:
- 100+ data sources
- 1 billion+ annual downloads
- 100K+ Stack Overflow questions
- 40K+ commits
- 3,600+ GitHub PR authors
- PySpark downloaded in 208 countries and regions in 2022

Apache Spark: Power & Complexity
- A robust analytics engine for large-scale data processing.
- Its rich feature set provides great capability, but takes time to master (print length of the reference book: 951 pages).

LLMs and Apache Spark: A Powerful Synergy
- LLMs have extensive resources to learn Apache Spark: over 37,000 commits on GitHub and over 120,000 questions on Stack Overflow.
- LLMs understand Apache Spark.
The Challenge for Spark Development
- GitHub Copilot:
  - Requires understanding of complex code.
  - Limited to editors; not usable in notebooks.
  - Suggestions for Spark development can be inconsistent (e.g. "ERROR: A column or function parameter with name dept_id cannot be resolved").
- LangChain:
  - Facilitates the creation of LLM-powered applications.
  - Challenge for Spark development: it returns strings instead of PySpark objects like DataFrames, so code integration is less seamless.

Can we have an easy-to-use tool that seamlessly integrates with Spark? What if we use English as code?

English as Code

Design Elements: Integration
- Make AI the chauffeur while we take the luxury back seat, instead of AI as the copilot while we do the driving.
- Accept English instructions as input.
- Return PySpark classes like DataFrames for seamless Spark API integration.

Integrating with Spark APIs
- GitHub Copilot: code suggestions inside the editor.
- LangChain:
    agent = create_spark_dataframe_agent(llm=llm, df=df)
    agent.run("get 4 week moving average sales by dept")
    Final Answer: the 4 week moving average sales by dept is (dep_1, X), (dep_2, Y), (dep_3, Z)
- English SDK:
    df = df.ai.transform("get 4 week moving average sales by dept")
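To make the contrast concrete, here is a minimal sketch of the English SDK call path, assuming the pyspark-ai package is installed and an LLM API key is available in the environment; the sample columns (dept, week, sales) are made up for illustration.

    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.getOrCreate()
    spark_ai = SparkAI()      # uses the project's default LLM backend
    spark_ai.activate()       # registers the .ai accessor on DataFrames

    df = spark.createDataFrame(
        [("dept_1", "2023-01-02", 100.0), ("dept_2", "2023-01-02", 80.0)],
        ["dept", "week", "sales"],
    )

    # Unlike the agent's string answer, the result is a regular PySpark DataFrame,
    # so any Spark API can be chained from here.
    avg_df = df.ai.transform("get 4 week moving average sales by dept")
    avg_df.show()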
Design Elements: Cache
- Stores prompts and LLM results for reproducibility.
- Less need for repeated LLM interactions, so execution is faster.

English SDK for Spark
- Use English as code, instead of generating code.
- Make Apache Spark accessible to all.

SDK Features

pyspark.ai, the English SDK for Apache Spark: simplify and accelerate your development process. Tell the assistant what you want in English; get back DataFrames, functions, and plots.

Key Features
- DataFrame operations
- Visualization
- Interpretability
- Safety
- Reproducibility
- Data validation
- Python UDFs
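Before the feature demos, a sketch of how the SDK is typically set up; the package name (pyspark-ai), the LangChain ChatOpenAI backend, and the constructor arguments follow the project README at the time of the talk and may have changed since.

    # pip install pyspark-ai
    from langchain.chat_models import ChatOpenAI
    from pyspark_ai import SparkAI

    # Pass an explicit LLM; omitting it falls back to the project's default model.
    llm = ChatOpenAI(model_name="gpt-4", temperature=0)
    spark_ai = SparkAI(llm=llm, verbose=True)
    spark_ai.activate()   # enables the df.ai.* methods used in the demos below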
Data Transformation (Without English SDK)
github_df contains all of Spark's historical pull request data. Let's do some data analysis by computing the 7-day moving average of daily pull requests created:
  1) Add a column called date from created_at
  2) Aggregate the pull requests created by date
  3) Compute the 7-day moving average of the pull requests created

    github_df.withColumn("date", to_date("created_at"))...  # window? moving average?
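The slide stops at the first step. For reference, a hand-written PySpark version of the three steps might look roughly like the sketch below; github_df is assumed to be the PR DataFrame above with a created_at timestamp column, and the window spec and aliases are assumptions chosen to match the SQL shown later in the Safety section.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    daily_df = (
        github_df
        .withColumn("date", F.to_date("created_at"))      # 1) derive a date column
        .groupBy("date")                                   # 2) aggregate PRs per day
        .agg(F.count("*").alias("num_pr_created"))
    )

    # 3) 7-day moving average over the ordered dates
    win = Window.orderBy("date").rowsBetween(-6, Window.currentRow)
    moving_avg_df = daily_df.withColumn("7_day_avg", F.avg("num_pr_created").over(win))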
Data Transformation (With GPT-4)
Oh no! ChatGPT does not know the schema of the data frame :( I need to tell ChatGPT the schema as well.
Data Transformation (With GPT-4)
Only after pasting the schema of github_df into the prompt can GPT-4 produce working code.

Data Transformation (With English SDK)
A single df.ai.transform call against github_df replaces the hand-written steps.

Data Loading (Without English SDK)
holiday.csv:
    df = spark.read.csv("holiday.csv")

Data Loading (With English SDK)
Describe the data in English instead of writing a reader; a sketch follows.
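A sketch of English-SDK data loading, assuming the SparkAI.create_df API (which accepts a URL or a plain-English description of a public dataset) described in the project README; the holiday description and the follow-up transform are hypothetical.

    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()

    # Describe the data instead of writing a reader; the SDK finds and ingests a source.
    holiday_df = spark_ai.create_df("public holidays in the USA in 2022")

    # The result is a DataFrame, so loading and transformation chain naturally.
    monthly_df = holiday_df.ai.transform("count the holidays per month")
    monthly_df.show()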
Data Loading & Transformation (With English SDK)

Data Loading, the Fun Way (With English SDK)
"What are the top 10 most popular songs by Taylor Swift?"

Visualization (Without English SDK)
    df.pandas_api().plot()
    ValueError: Plotly Express cannot process wide-form data with columns of different type.

Visualization (With English SDK)
    df.ai.plot()
    joined_df.ai.plot("show the 7 day moving average and annotate the spark version with a red line")
    label_df.ai.plot("show a stacked bar plot by label since 2018 with top 10 labels")
    label_df.ai.plot("show a pie chart of top 5 labels since 2022")
    github_df.ai.transform("add a column called time_to_close that computes the days between created_at and closed_at").ai.plot("show a histogram of time_to_close and remove outliers")
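A runnable-shaped sketch of the plotting calls above, assuming a notebook environment where the generated chart renders inline; github_df is the PR DataFrame from the running example.

    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()

    # With no description, the SDK picks a default visualization for the DataFrame.
    github_df.ai.plot()

    # Transform and plot can be chained, each step described in English.
    (
        github_df
        .ai.transform("add a column called time_to_close that computes the days "
                      "between created_at and closed_at")
        .ai.plot("show a histogram of time_to_close and remove outliers")
    )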
Interpretability (Without English SDK)

Interpretability (With English SDK)
Why read the code to understand the logic? The English SDK can explain it for you!

    df.ai.explain()
    "In summary, this dataframe is retrieving the number of pull requests created per day and the 7-day moving average of the number of pull requests created for each day from the githubdata table in the hive_metastore.testing schema."
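A sketch of the explain workflow, assuming df.ai.explain() produces an English summary of the DataFrame's query plan as shown on the slide; the transform instruction and variable names are illustrative.

    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()

    daily_avg_df = github_df.ai.transform(
        "count pull requests created per day and add the 7 day moving average"
    )

    # Ask for an English description instead of reading the generated code or plan.
    summary = daily_avg_df.ai.explain()
    print(summary)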
Safety (With English SDK)
We don't like executing arbitrary code. Let's turn on verbose mode to see how the English SDK works under the hood: it's generating SQL queries!

    spark_ai = SparkAI(verbose=True)

    INFO: SQL query for the transform:
    SELECT *, DATE(created_at) AS date FROM temp_view_for_transform
    INFO: SQL query for the transform:
    SELECT date, COUNT(*) AS num_pr_created FROM temp_view_for_transform GROUP BY date
    INFO: SQL query for the transform:
    SELECT date, num_pr_created, AVG(num_pr_created) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS 7_day_avg FROM temp_view_for_transform
Reproducibility (With English SDK)
LLM results are usually non-deterministic. How can we ensure the results can be reproduced? Use the LLM cache.

    # Cache the results
    spark_ai.commit()
    # Uncache
    df.ai.plot("description", cache=False)
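A sketch of the cache workflow, assuming SparkAI.commit() persists the cached prompt/result pairs so that rerunning the same English instructions is reproducible and skips the LLM; the instructions and variable names are illustrative.

    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()

    avg_df = github_df.ai.transform("7 day moving average of daily pull requests created")
    avg_df.ai.plot("line chart of the moving average")

    spark_ai.commit()   # persist cached results so a rerun reproduces the same output

    # Bypass the cache when a fresh LLM call is wanted.
    avg_df.ai.plot("line chart of the moving average", cache=False)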
Data Validation (Without English SDK)
How do we validate the data of a DataFrame? For example, let's check that each date in github_df is valid.
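Without the SDK, a check like this is typically written by hand; a minimal sketch, assuming created_at is stored as a string column:

    from pyspark.sql import functions as F

    # Rows whose created_at cannot be parsed into a date are considered invalid.
    invalid = github_df.filter(F.to_date("created_at").isNull() & F.col("created_at").isNotNull())
    all_dates_valid = invalid.count() == 0
    print(all_dates_valid)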
Data Validation (With English SDK)
    df.ai.verify("expect ...")   # result: bool

    df.ai.verify("expect the 7 day moving average to be greater than 0")

Example expectations from the demo:
- Expect all moving averages to be greater than 0.
- Expect all PRs to be created between 2014 and 2023.
- Expect outliers in the number of pull requests created.
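A sketch of verify in use, assuming it returns a boolean as the slide's "result: bool" indicates; github_df and daily_avg_df are the DataFrames from the running example.

    from pyspark_ai import SparkAI

    spark_ai = SparkAI()
    spark_ai.activate()

    ok = github_df.ai.verify("expect all PRs to be created between 2014 and 2023")
    print(ok)   # True or False

    assert daily_avg_df.ai.verify("expect the 7 day moving average to be greater than 0")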
Python User-Defined Function (UDF), Without English SDK
    @udf
    def day_of_week(timestamp: datetime) -> str:
        ...
Python User-Defined Function (UDF), With English SDK
    @spark_ai.udf
    def day_of_week(timestamp: datetime) -> str:
        """Find the day of the week ..."""
        ...

Python UDF: What do these PRs mean? AI on AI
Use an English-defined UDF to label PR titles, then call it from SQL:
    spark.sql("SELECT title, label_title(title) FROM githubdata").show()
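A sketch of the UDF feature, assuming the @spark_ai.udf decorator fills in the implementation from the signature and docstring (per the project README); the query assumes the githubdata table from the demo is available.

    from datetime import datetime
    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.getOrCreate()
    spark_ai = SparkAI()
    spark_ai.activate()

    @spark_ai.udf
    def day_of_week(timestamp: datetime) -> str:
        """Find the day of the week for the given timestamp"""
        ...

    # Register the generated function and call it like any other Python UDF.
    spark.udf.register("day_of_week", day_of_week)
    spark.sql("SELECT created_at, day_of_week(created_at) FROM githubdata").show()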
Future Work
- Multiple-DataFrame operations
- Robust text-to-SQL creation
- Automatic PySpark test generation
- User-defined Python table functions and data sources

Recap
- Make Spark more user friendly and accessible.
- Allow you to concentrate on the data insights.

Join our journey and explore the English SDK today at pyspark.ai

Q&A