English SDK for Apache Spark (slide deck, PDF)


English SDK for Apache Spark
Boosting Development with LLMs
Gengliang Wang, Allison Wang, Amanda Liu

About Us
The Spark team at Databricks
Gengliang Wang (GitHub: gengliangwang)
Allison Wang (GitHub: allisonwang-db)
Amanda Liu (GitHub: asl3)

Agenda
01 Story Behind English SDK: Why We're Excited
02 SDK Features: With Demo
03 Future Work

Story Behind English SDK

Apache Spark: Power & Complexity
Apache Spark is a robust analytics engine for large-scale data processing:
100+ data sources, 1+ billion annual downloads, 100K+ Stack Overflow questions,
40K+ commits, 3,600+ GitHub PR authors, and PySpark downloads from 208 countries
and regions in 2022.
Its rich feature set provides great capability, but takes time to master
(print length: 951 pages).

LLMs and Apache Spark: A Powerful Synergy
LLMs have extensive resources to learn Apache Spark: over 37,000 commits on GitHub
and over 120,000 questions on Stack Overflow. LLMs understand Apache Spark.

The Challenge for Spark Development

GitHub Copilot:
Requires understanding of complex code.
Limited to editors, not usable in notebooks.
Suggestions for Spark development can be inconsistent, for example:
    ERROR: A column or function parameter with name dept_id cannot be resolved

LangChain:
Facilitates the creation of LLM-powered applications.
Challenge for Spark development: it returns strings instead of PySpark objects
like DataFrame, so code integration is less seamless.

Can we have an easy-to-use tool that seamlessly integrates with Spark?
What if we use English as code?

Design Elements: Integration
Make AI the chauffeur while we take the luxury back seat, instead of AI as the
copilot while we do the driving.
Accept English instructions as input.
Return PySpark classes like DataFrames for seamless Spark API integration.

Integrating with Spark APIs
English SDK:
    df = df.ai.transform("get 4 week moving average sales by dept")
LangChain:
    agent = create_spark_dataframe_agent(llm=llm, df=df)
    agent.run("get 4 week moving average sales by dept")
    Final Answer: the 4 week moving average sales by dept is (dep_1, X), (dep_2, Y), (dep_3, Z)
GitHub Copilot: code suggestions stay in the editor (see the challenges above).

Design Elements: Cache
Stores prompts and LLM results for reproducibility.
Less need for repeated LLM interactions, so execution is faster.

English SDK for Spark
Use English as code, instead of generating code.
Make Apache Spark accessible to all.
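To make the integration design concrete, here is a minimal sketch. It assumes the open-source pyspark-ai package and an LLM API key configured in the environment; the toy sales DataFrame is hypothetical and constructor details may differ between SDK versions.

    from pyspark.sql import SparkSession
    from pyspark_ai import SparkAI

    spark = SparkSession.builder.getOrCreate()
    spark_ai = SparkAI()   # wraps an LLM; by default expects an OpenAI key in the environment
    spark_ai.activate()    # registers the df.ai accessor on DataFrames

    # Hypothetical toy data standing in for the sales table from the slide.
    df = spark.createDataFrame(
        [("dep_1", "2023-06-05", 120.0), ("dep_2", "2023-06-05", 80.0)],
        ["dept", "week", "sales"],
    )

    # The result comes back as a PySpark DataFrame, not a string, so it composes
    # with the rest of the Spark API.
    result = df.ai.transform("get 4 week moving average sales by dept")
    result.filter("dept = 'dep_1'").show()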

SDK Features

pyspark.ai: English SDK for Apache Spark
Use English as code, instead of generating code, to simplify and accelerate your
development process: tell the assistant what you want in English and get back
DataFrames, functions, and plots.

Key Features
- DataFrame operations
- Visualization
- Interpretability
- Safety
- Reproducibility
- Data validation
- Python UDFs

Data Transformation
github_df contains all historical pull request data from Spark. Let's do some
data analysis by computing the 7-day moving average of daily pull requests created:
1) Add a column called date from created_at.
2) Aggregate the pull requests created by date.
3) Compute the 7-day moving average of the pull requests created.

Data Transformation Without English SDK
    github_df.withColumn("date", to_date("created_at"))...  window? # moving average?
Writing the window expression by hand takes some digging through the API.

Data Transformation With GPT-4
Oh no! ChatGPT does not know the schema of the DataFrame, so the schema of
github_df has to be pasted into the prompt as well.

Data Transformation With English SDK
A single English instruction on github_df produces the result directly.
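For contrast, here is a sketch of both routes to the 7-day moving average. It assumes pyspark-ai is activated as above and that github_df has a created_at timestamp column; the manual window code is a plausible completion of the fragment on the slide, not the presenters' exact code.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Manual PySpark version: aggregate per day, then average over a 7-row window.
    daily = (github_df
             .withColumn("date", F.to_date("created_at"))
             .groupBy("date")
             .agg(F.count("*").alias("num_pr_created")))
    w = Window.orderBy("date").rowsBetween(-6, 0)
    manual_df = daily.withColumn("7_day_avg", F.avg("num_pr_created").over(w))

    # English SDK version: one instruction, returning an equivalent DataFrame.
    sdk_df = github_df.ai.transform(
        "compute the 7-day moving average of daily pull requests created"
    )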

Data Loading Without English SDK
    df = spark.read.csv("holiday.csv")
You have to locate holiday.csv yourself first.

Data Loading With English SDK
Describe the data you want and the SDK loads it into a DataFrame.

Data Loading & Transformation With English SDK

Data Loading, the fun way, With English SDK
"What are the top 10 most popular songs by Taylor Swift?"
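The "fun way" presumably goes through the SDK's ingestion helper. A sketch, assuming pyspark-ai's spark_ai.create_df, which asks the model to find matching public data and load it as a DataFrame (method name from the open-source project; behavior may vary by version).

    # English-described ingestion: the SDK searches for a matching public dataset
    # and returns it as a PySpark DataFrame.
    songs_df = spark_ai.create_df("What are the top 10 most popular songs by Taylor Swift?")
    songs_df.show()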

Visualization Without English SDK
    df.pandas_api().plot()
    ValueError: Plotly Express cannot process wide-form data with columns of different type.

Visualization With English SDK
    df.ai.plot()
    joined_df.ai.plot("show the 7 day moving average and annotate the spark version with a red line")
    label_df.ai.plot("show a stacked bar plot by label since 2018 with top 10 labels")
    label_df.ai.plot("show a pie chart of top 5 labels since 2022")
    github_df.ai.transform("add a column called time_to_close that computes the days between created_at and closed_at") \
        .ai.plot("show a histogram of time_to_close and remove outliers")
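Because every intermediate result is an ordinary DataFrame, English transforms, English plots, and the plain Spark API mix freely. A sketch using the demo's github_df (column names assumed from the slides):

    # Keep the transformed DataFrame around: inspect it with regular Spark calls,
    # then hand it to the English plotting helper.
    closed = github_df.ai.transform(
        "add a column called time_to_close that computes the days between "
        "created_at and closed_at"
    )
    closed.select("time_to_close").summary().show()   # plain Spark API still works
    closed.ai.plot("show a histogram of time_to_close and remove outliers")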

Interpretability Without English SDK
Reading a chain of transformations to work out what a DataFrame computes is tedious.

Interpretability With English SDK
Why read the code to understand the logic? The English SDK can explain it for you:
    df.ai.explain()
    "In summary, this dataframe is retrieving the number of pull requests created
    per day and the 7-day moving average of the number of pull requests created
    for each day from the githubdata table in the hive_metastore.testing schema."
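Continuing with the sdk_df moving-average DataFrame from the transform sketch above; depending on the SDK version, explain() may print the description or return it as text.

    # Ask the SDK for an English description of what this DataFrame's plan
    # computes, instead of reading the chained transformations by hand.
    sdk_df.ai.explain()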

Safety With English SDK
We don't like executing arbitrary code. Let's turn on verbose mode to see how the
English SDK works under the hood: it's generating SQL queries!
    spark_ai = SparkAI(verbose=True)
    INFO: SQL query for the transform: SELECT *, DATE(created_at) as date FROM temp_view_for_transform
    INFO: SQL query for the transform: SELECT date, COUNT(*) as num_pr_created FROM temp_view_for_transform GROUP BY date
    INFO: SQL query for the transform: SELECT date, num_pr_created, AVG(num_pr_created) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS 7_day_avg FROM temp_view_for_transform

Reproducibility With English SDK
LLM results are usually non-deterministic. How can we ensure the results can be
reproduced? Use the LLM cache:
    spark_ai.commit()                        # cache the results
    df.ai.plot("description", cache=False)   # bypass the cache
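A sketch that puts the two knobs together, assuming pyspark-ai's SparkAI(verbose=True), commit(), and the cache=False flag shown on the slide (names may differ across versions), plus the demo's github_df:

    from pyspark_ai import SparkAI

    spark_ai = SparkAI(verbose=True)   # log the generated SQL before it runs
    spark_ai.activate()

    # Cached by default: rerunning the same instruction reuses the stored LLM
    # answer instead of asking the model again.
    github_df.ai.plot("show the 7 day moving average of daily pull requests created")
    spark_ai.commit()                  # persist the prompt/result cache

    # Opt out of the cache for one call when a fresh LLM answer is wanted.
    github_df.ai.plot("show the 7 day moving average of daily pull requests created",
                      cache=False)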

Data Validation Without English SDK
How do we validate the data of a DataFrame? For example, let's check whether each
date in github_df is valid. Doing this by hand takes several steps.

Data Validation With English SDK
    df.ai.verify("expect ...")   # result: bool
    df.ai.verify("expect the 7 day moving average to be greater than 0")
Expect all moving averages to be > 0.
Expect all PRs are created between 2014 and 2023.
Expect outliers in the number of pull requests created.
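Because verify() returns a boolean, the English expectations can double as pipeline assertions. A sketch, assuming the demo's github_df:

    # Each call compiles the English expectation into a check and returns True/False.
    assert github_df.ai.verify("expect each date in created_at to be a valid date")
    assert github_df.ai.verify("expect all PRs are created between 2014 and 2023")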

Python User-Defined Functions (UDFs)

Without English SDK
    @udf
    def day_of_week(timestamp: datetime) -> str:
        ...   # you write the implementation yourself

With English SDK
    @spark_ai.udf
    def day_of_week(timestamp: datetime) -> str:
        """Find the day of the week"""
        ...

What do these PRs mean? AI on AI: an LLM-generated UDF labels the PR titles.
    sql("select title, label_title(title) from githubdata").show()
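A fuller sketch of the UDF flow, assuming an activated SparkAI instance and the demo's githubdata table; with pyspark-ai the decorated function's body is generated from its signature and docstring, and it is then registered like any other Python UDF.

    from datetime import datetime
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    @spark_ai.udf
    def day_of_week(timestamp: datetime) -> str:
        """Find the day of the week"""
        ...

    # Register the generated function and call it from SQL, as in the demo.
    spark.udf.register("day_of_week", day_of_week)
    spark.sql("select created_at, day_of_week(created_at) from githubdata").show()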

Future Work
- Multiple DataFrame operations
- Robust text-to-SQL creation
- Automatic PySpark test generation
- User-defined Python table functions and data sources

Recap
Make Spark more user friendly and accessible, and let you concentrate on the data
insights. Join our journey and explore the English SDK today at pyspark.ai.

Q&A
