AI Infrastructure Ecosystem 2022
Version 20220721

Table of Contents

Management Summary
Introduction
  The Three Approaches to AI/ML Platforms
  Build Your Own
  End-to-End
  Best of Breed
The Blueprint and the Landscape
Orchestration, Pipelines and Compute Engines
  The Two Types of Orchestration
  Compute Engines
  Data Engineering Orchestration and Pipelines
  Experimentation Pipelines
  Loosely Coupled and Tightly Coupled Orchestrators
  Metadata Engines, Experiment Trackers and Visualizers
  Current Trends and the Next Five Years
  The Next Five Years
Model Deployment and Serving
  Introduction
  Deployment Workflows
  Serving Patterns of Deployment
  Current Trends and the Next Five Years
  The Next Five Years
AI Supervision: Monitoring, Observability and Explainability
  Introduction
  Overview of Monitoring, Observability and Explainability
  Survey of the Field
  Understanding Monitoring, Observability, Explainability and Data Quality Capabilities in AI/ML Systems
  Monitoring and Observability
  Explainability Explained
  Data Quality
  Current Trends and the Next Five Years
  The Next Five Years
Training
  Introduction
  Distributed Training and Training Multiple Models
  Supercomputers versus the Cloud
  Current Trends and the Next Five Years
  The Next Five Years
Data Versioning, Lineage and Metadata
  Introduction
  Database Versioning
  File and Object Store Data Versioning
  Data Lineage
  Current Trends and the Next Five Years
  The Next Five Years
Labeling
  Introduction
  Understanding the Basic and Advanced Capabilities of Labeling Platforms
  Advanced Capabilities
  Current Trends and the Next Five Years
  The Next Five Years
Feature Stores
  Current Trends and the Next Five Years
  The Next Five Years
Synthetic Data
  A Survey of Platforms
  Current Trends and the Next Five Years
  The Next Five Years
AI Infrastructure Ecosystem Conclusions and Wrap-Up
Glossary
MANAGEMENT SUMMARY

Intelligence is the great competitive advantage in history. It's not just whether we have it, but whether we can wield it effectively. But over the last decade, intelligence has undergone a profound change. It's no longer just confined to our heads, because the rise of artificial intelligence and machine learning (AI/ML) techniques has now made it possible to productionize intelligence to solve previously unsolvable challenges. We can use AI to spot problems early on in production lines, predict customer churn, reclaim budgets, streamline support requests, translate languages, highlight key passages in legal documents, detect fraud, iterate on new design ideas and much, much more.

Yet this production intelligence has largely remained the province of highly technical teams and big tech companies. Often these teams built their own AI/ML infrastructure from scratch because there was nothing on the market to support their efforts. Yet over the last five years, we have seen a rapid proliferation of new tools and platforms that allow enterprises and small to medium businesses to benefit from the intelligence revolution.

However, building the right AI/ML infrastructure that fits specific company needs is still a significant challenge. Only 26% of teams we surveyed were very satisfied with their current AI/ML infrastructure. Fifty-five percent were only somewhat satisfied, while 17% were somewhat unsatisfied, and 3% were very unsatisfied. In other words, most teams see a lot of room for improvement.

This is because the AI/ML infrastructure landscape is vast, complex and rapidly evolving. It's difficult to understand the capabilities of each platform and to see where they fit into existing systems without a lot of research and time invested. AI/ML systems are also complex because there is no one tool that does everything perfectly, so building a modern AI/ML stack involves many different tools and components. Even worse, marketing teams often obscure the capabilities of systems or promise that a platform can do everything equally well, when the reality is usually very different. Finally, building a robust infrastructure requires buy-in from many different stakeholders across an organization, everyone from data scientists to data engineers to IT infrastructure architects to support teams to network and security engineers.

Up to this point, much of what has been written on AI/ML focuses on building excitement around AI, describing the state of ML adoption, or outlining the state of AI/ML research. While it is nice to read about the incredible possibilities that come from utilizing AI/ML in your company, what is truly needed is thorough coverage of the infrastructure available and the possible directions your company can take to achieve your business goals. Here we focus on how to sustainably build your AI/ML infrastructure to set your company up for success over the long term, as your capabilities, needs and demands evolve. AI/ML teams are growing, and this report aims to give every company the keys to build their AI/ML infrastructure by providing a comprehensive and clear overview of the AI/ML infrastructure landscape. We provide insights on realistic capabilities and tradeoffs for many different platforms, as well as projections about how infrastructure requirements may evolve over the next five years according to AI/ML experts.

The good news is that the majority of companies we surveyed found that the benefits they got from their AI/ML infrastructure outweighed the costs in two years or less. That means if you invest in the right infrastructure, after surveying the field and considering it carefully, you can reap rewards swiftly. By aligning your business goals with your wisely built AI/ML infrastructure, you can push the boundaries of what is possible for your company now and in the future.
INTRODUCTION

Over the past five years, we've seen a massive surge in the use of artificial intelligence (AI) and machine learning (ML) across every industry, as well as a stream of reports that cover the adoption rate of ML in enterprises and reports on the state of AI/ML research from prominent academic and research institutes like Stanford. This report has a distinctly different focus. Here we deliver a comprehensive overview of the state of AI/ML infrastructure software, the software that powers the building, deploying, scaling and running of AI/ML models.

The target audience for this report is any executive or engineering team lead looking to find the right tools to establish or expand their in-house AI/ML practice. It will show you how the field has developed, as well as what the key capabilities of various AI infrastructure categories are now and where they are going over the next five years. By the end, you will have a clear vision of where to invest your budget to accelerate your projects and take them to the next level.

We describe the various categories according to the AI Infrastructure Alliance's (AIIA) MLOps blueprint and then discuss each category in its own section. If you are unsure what category would make the biggest impact, then you can read the report from start to finish to get a better idea of what parts of the stack are essential to building or expanding your team's capabilities. Alternately, if you are already clear on what kinds of tools you need next, then you can skip to the section that covers your needs straightaway. However, you may still wish to read the rest of the report, as there may be capabilities in the space that you aren't aware of yet and that may prove valuable to your efforts.
The Three Approaches to AI/ML Platforms

There are three primary approaches to building an AI/ML platform:

1. Build your own
2. Buy an end-to-end solution
3. Best of breed

These are roughly the same approaches that organizations need to consider when it comes to building a web application platform or an in-house IT system, but there are some wrinkles when it comes to AI/ML that make it different. The most notable difference is the maturity of the space. Because enterprise AI/ML is a relatively young field, there is a vast array of products and services. In some categories of the space, there are clear market leaders, and in others there is a range of potential choices that could make sense depending on your organizational needs.

This wider range of platforms mirrors earlier technological advances. There were hundreds of car companies in the early days of the automobile industry, many of them small shops producing a few cars a year, before the assembly-line processes pioneered by Ford took over. In the early days of the web, there were no fewer than fifty different web servers, but most IT administrators would be hard pressed to think of more than three or four today, with Apache, NGINX, IIS and a few other servers dominating the marketplace.
Build Your Own

The first companies to adopt AI/ML techniques at scale were largely tech companies like Google, Tesla, OpenAI, DeepMind and Netflix. Because the field was new, they had little choice but to build their own solutions for building, training, deploying and running AI/ML models in production. But as those pioneers blazed new territory and AI/ML techniques came out of the labs, universities and big-tech companies, enterprise software companies and startups emerged to fulfill the demand from enterprises, who generally don't have the in-house expertise or resources to build an entire bespoke IT system from scratch and then maintain and update it.

The AIIA does not recommend that most companies attempt to build their own AI/ML platform from scratch or that they attempt to stitch one together from pure open-source components alone. This approach is incredibly complex and prone to failure. It is useful only for highly advanced teams with very specific and idiosyncratic use-case requirements that are not met by current software platforms. Even if you suspect that your needs are unique, we highly recommend you put together a team to study whether that is truly the case or whether you would be better served by a set of tools already developed.

The good news is that, outside of very specialized circumstances, most companies are not building their own architecture from scratch anymore. According to our enterprise survey, only 20% of companies built their entire infrastructure in-house, while 45% use a mix of in-house and third-party tools, and 31% use third-party tools exclusively. While early pioneers had no choice but to build their own infrastructure, increasingly we're seeing companies choose a mixture of buying and building, or simply buying, as more and more robust products make it to market. This is because long-term maintenance of custom-built applications requires tremendous engineering resources. In addition, teams often discover too late that the platform they expected to do everything proved too brittle and rigid for use cases beyond the initial design. In fact, that was the third-biggest challenge teams told us they faced when building their AI/ML infrastructure: the platform turned out to be only good for certain applications and not for others.

That was the case with the Michelangelo platform at Uber, headed by Mike Del Balso, who later left Uber to start Tecton, a feature store platform. While the platform proved highly useful for UberEats when they built it, Del Balso later noted that the platform proved very effective for the few use cases they built it for, but it was not easily generalizable to other use cases. The Uber team moved on to newer platforms, and at the time of this writing, many of the original components of Michelangelo have fractured into smaller, more nimble projects on the Uber open-source GitHub.

That said, we do expect advanced teams to have some bespoke parts of their platform, including but not limited to glue code, custom overarching workflow interfaces across platforms, and in-house built programs, libraries and frameworks. This trend is expected to persist over the next five years: as software platforms evolve to meet the vast majority of AI/ML lifecycle needs, there will always be gaps that need to be filled during that evolution.
End-to-End

The second solution is to buy a single, unified end-to-end AI/ML platform to serve all your machine-learning and analytics needs. These are platforms that attempt to cover every single aspect of the AI/ML lifecycle. Buying a single solution that handles the vast majority of an organization's needs is a well-known and trusted buying practice in the industry, and so it is tempting to take the same approach to buying an AI/ML platform. For instance, many organizations standardized on a single database, such as Oracle, in the mid-2000s, or on the VMware suite for virtualization in the 2010s, before the rise of public clouds largely displaced VMware as the dominant force in enterprise data centers.

However, the AIIA recommends that organizations resist the siren song of a unified, end-to-end solution for AI/ML platforms for several reasons. The first is that we are still largely in the early adopter phase of the AI-infrastructure ecosystem. In his 1962 book Diffusion of Innovations, sociologist Everett Rogers showed us that people and enterprises fall into five distinct groups when it comes to taking on new tech. Geoffrey Moore built on these ideas in his business bestseller Crossing the Chasm.

(Source: "Technology Adoption Lifecycle" by Craig Chelius, licensed under CC BY 3.0.)

As is typical of the development in this phase of a technological cycle, there is no single solution that meets every need and has surpassed all of its competitors in every aspect of building, training, deploying, securing and managing models in production. Instead there are a number of platforms in rapid evolution that will likely expand their capabilities over time, as well as consolidate and merge. Highly developed platforms that cover a broad range of needs tend to develop and solidify late in the early majority to late majority stage of the technological development lifecycle. However, this has not stopped many marketing teams from declaring their solution the one solution to rule them all. We recommend that organizations cast a wary eye on vendor marketing. We also recommend that when considering a purchase, organizations look very carefully at the capabilities outlined in this report and then ask serious questions of any vendor that claims to support every aspect of the AI/ML lifecycle. As of the date of this report's writing, we have found no single solution that legitimately covers the vast array of innovation happening in the space currently, which includes work on robust ingestion and storage of structured and unstructured data, data versioning and lineage, synthetic data generators, feature stores, model registries, highly scalable pipelining and orchestration systems, deployment systems, and highly scalable serving engines, as well as state-of-the-art monitoring, explainability and observability.

That said, it is entirely possible to select one or two vendors or platforms as the core of your AI/ML platform and then build around those cores. Many platforms detailed in this report do cover a wide range of capabilities that would serve the needs of complex enterprises. It is up to each organization to deeply understand their own needs, now and in the future, before buying their next-generation platform. For instance, if your use cases are largely structured and semi-structured today, with a focus on classic analytics tasks like customer demand, churn prediction and fraud detection, then you may standardize on something like Spark as your processing engine. However, you may later move into deep learning and unstructured use cases like video analytics, only to discover that Spark is not ideal for those workloads. At the same time, you may get a limited number of monitoring capabilities from your core vendor, but not get the full range of monitoring, observability and explainability tools offered by a vendor dedicated to those capabilities.

Lastly, it's worth noting that while cloud vendors offer solutions that seem more end-to-end, if we look a little closer, we often find that's not the case. Amazon's SageMaker is a suite of tools that offers everything from data wrangling to pipelines. However, often those tools are not well integrated, and they exist as standalone tools in the suite, much as if you'd bought a series of tools yourself. They're also highly focused on particular use cases, such as structured data, and don't handle unstructured data use cases like video, images, audio and free-form text as easily. In addition, many of these tools are ones that cloud vendors developed themselves instead of adopting a well-known, industry-leading platform. This is often because a leader hasn't emerged, because we're still on the early part of the adoption curve. That also means that in the long run, as tools become more widespread, cloud vendors are likely to swap out big pieces of their suite for alternatives that have gained more traction, which raises the question: why not start with the best of breed in the first place? It's unlikely that in the long run Amazon's feature store will end up the standard feature store versus a dedicated open-source solution like Feast or a commercial platform like Molecula or Tecton, all of which run on multiple clouds. Public clouds are best at commoditizing components that have already developed much further along the technology adoption curve. Because we are still at an early point on the adoption curve, it's best to evaluate any public cloud AI/ML solution as no different from any other vendor's solution in terms of capabilities, and not expect them to have already delivered a comprehensive, unified, end-to-end solution, despite what their marketing might promise.
Best of Breed

The AIIA advocates the best-of-breed approach for medium to advanced data science and data engineering teams. That means taking a modular approach to building an AI/ML stack. Organizations should look to evaluate and select the leaders in different categories, or consider the special capabilities of a specific vendor in a category that meets a need unique to their use case. We recommend that you choose one or two core platforms that meet a wide variety of your needs, from data processing and pipelining to versioning and lineage, experiment tracking and deployment. After you've selected a core platform, you can more easily choose satellite platforms that meet more specific needs, such as synthetic data, feature stores, or monitoring, observability and explainability.

At this point in the evolution of AI/ML systems, the best-of-breed solution will likely demand some integration work from your team, so ensure that the platforms you choose have clean, well-documented APIs, as well as simple points of ingress and egress into and out of them. If your core platform already has integrations with other platforms you're looking to adopt, that's especially promising, but be sure to investigate the depth of those integrations. Are the integrations loose, well developed or tightly integrated at multiple levels?

To choose a core platform, carefully evaluate all of your current machine-learning use cases for the current moment, the next year and the next five years. Ensure that you completely understand the kinds of problems you're looking to solve now and over your future timeline. Because the state-of-the-art technology in AI/ML and the infrastructure that supports it are rapidly advancing, the AIIA believes that it's difficult to know beyond a shadow of a doubt whether you can support all the use cases that will develop beyond five years, and so it is best to focus on a platform that can support the vast majority of your workloads over that time horizon. However, even though you may not be able to predict every possible use case beyond that timeframe, give yourself room to expand into other use cases you might not imagine at the moment. That means choosing a platform with maximum flexibility, language agnosticism and the ability to process structured, semi-structured and unstructured data. Ensure that your core platforms have the ability to meet all of those potential use cases. While your team may be starting with low-hanging-fruit use cases like churn prediction and customer-demand forecasting, it's not enough to select a core tool that only meets those needs, especially if you see the possibility of more advanced use cases like computer vision, audio transcription, NLP and more. Your core platform should be flexible enough to handle a wide variety of use cases. When evaluating each platform, don't simply accept the marketing copy that claims a product can do anything and everything. Look for well-documented use cases and examples that cover each and every aspect of what you hope to achieve now and in the future.

A mistake in choosing a core platform is one of the most costly mistakes a team can make. You may find yourself using a second or even third platform to accomplish tasks because the original core platform claimed to meet your needs in its marketing, but the reality proved very different. Mistakes in choosing the satellite platforms that support your core platform are more correctable. It's easier to swap out a monitoring platform that doesn't meet your needs than to swap out your primary pipeline and orchestration system.

Of course, while a best-of-breed approach provides the best chance of success in the current AI/ML infrastructure landscape, there are several downsides that must be considered. The first is cost. There is a cost associated with buying multiple platforms, and that will have to be weighed against the cost of developing, upgrading and supporting an in-house platform, or of being forced to add on to an "end-to-end" solution that didn't end up being end-to-end. The cost will also depend on whether you have a rich and varied set of use cases and whether you expect your team and use cases to expand over the next five years. The second challenge is support. You will not have "one throat to choke" when it comes to support and will need to deal with multiple teams. However, in the modern enterprise, we find that teams are already used to juggling multiple support contracts, and this is generally not considered as big a barrier as it was a decade ago. Despite those two caveats, the AIIA still considers the modular, best-of-breed approach to be the most effective way to build an AI/ML stack today that will meet a wide variety of needs now and tomorrow, while delivering the most flexibility and ROI to your team.
THE BLUEPRINT AND THE LANDSCAPE

The major challenge any organization faces when trying to choose an AI/ML platform or set of platforms to meet their needs is how to categorize the capabilities of those platforms. What's needed is a clear set of categories and the capabilities that fit each category. They should match the reality of the features of the platforms that are available now. It seems simple, but it has proven challenging because the space is evolving so quickly and it exists on a new branch of the software development tree. There are a number of overlaps with traditional software development, but there are some striking differences as well. For instance, there is no step analogous to training in hand-coded software development.

A number of teams and organizations have tried to help create those categories, but by and large they've made the problem worse. Almost inevitably, the categories are poorly thought out, ill-defined and overlapping. Unfortunately, marketing largely adds to the confusion by creating an all-too-familiar landscape guide with those poorly chosen categories and then slotting each vendor's logo into a neat little box. It creates a slick graphic that is shareable on social media but is utterly ineffective at helping people understand the AI/ML software landscape. The biggest problem with this kind of graphic is that software often does not fit neatly into a single category. A platform might have labeling capabilities, experiment tracking capabilities, a feature store and more. Putting that platform's logo in the experiment tracking category alone is reductionist at best and outright wrong at worst.

To really help organizations understand AI infrastructure, we've come up with clear category abstractions that do not have many edge cases or overlap. For instance, computer vision is a sub-type of machine learning, but it doesn't have much to do with the capabilities of AI-infrastructure software. Many different platforms listed here support the building and training of computer-vision models along with other kinds of models. "Labeling platforms," by contrast, is a baseline category that describes an entire range of capabilities in AI-infrastructure software.

We've taken a two-fold approach to helping companies clearly understand the capabilities of the platforms in the ecosystem. First, we've distilled the capabilities down to the point where it would be difficult to boil them down any further, and we added them to a blueprint that represents an idealized stack. Vendors do not fit neatly into any single box but may have capabilities across multiple boxes. Even a platform that functions first and foremost as a model serving framework has capabilities in other parts of the blueprint. Solidly colored boxes indicate complete support and focus for that set of capabilities, while striped shading indicates a partial set of capabilities in that area.

Second, we created a new kind of landscape guide, unique to the AIIA, that includes a high-level feature matrix, and we include vendor capabilities across each of the categories rather than pigeonholing them as having only one capability. We also rate the vendors on whether they offer complete support for a category or partial support. For instance, a company might have explainability features but not robust monitoring and observability features, so it would only rate as partial support.

With those abstractions in mind, we break down the software contenders in each of the major categories below. We go over the major companies, startups and open-source platforms, as well as the various capabilities to expect in each category. We also explore the development of each of these capabilities. Some are highly developed and some less so. Lastly, we give you a projection of how these capabilities will develop over the next five years and what gaps need to be filled in for missing capabilities.
ORCHESTRATION, PIPELINES, COMPUTE ENGINES

Companies and platforms covered in this section include: Apache Spark, Ray, Kubernetes, Databricks, Argo, Airflow, HPE, ClearML, Pachyderm, Comet ML, Neu.ro, DAGsHub, Google Vertex, Amazon SageMaker, Azure Machine Learning, Valohai, Arrikto, Modzy, Kubeflow, Iguazio, Neptune AI, Infuse AI, Dbt, Flyte, Domino Data Labs, Dataiku, Prefect, Weights & Biases, H2O, ZenML.

Orchestrators are the control engines in the AI/ML development lifecycle. They organize everything from how data is ingested, cleaned and transformed, to the training and tuning of models, to the deployment of finished models. In short, they're the puppet masters of your AI/ML workflow and one of the contenders for your core platform choice.

The terms orchestrator and pipeline are often used interchangeably, and though they are similar terms, they have subtle differences. Orchestration involves the command and control of the various steps, whereas a pipeline is the series of steps themselves. It's worth noting that while pipeline has become the standard term in the industry, it's a bit of a misnomer. The movement of data and models from ingestion to deployment is often a DAG or a decision tree with lots of branching steps, and it is also cyclical, in that a model is never really done, but rather moves back through the process to learn from new training data or to get tuned and updated. Nevertheless, we've chosen to stick with the standard terminology instead of inventing new terms so as not to create confusion. We've also chosen to use the words orchestration and pipelines together to indicate both the flow of the data, code and models through their journey from beginning to end and back again, as well as the modules that control that flow. Unlike traditional coding pipelines, most notably continuous integration/continuous delivery (CI/CD) pipelines, which only track code and automated tests, AI/ML pipelines track code, tests, training and the movement and transformation of data.

Returning to the blueprint for a moment, you may have noticed that the AIIA divides orchestration pipelines into two areas of primary focus:

- Experimentation Pipelines
- Data Engineering Pipelines
Experimentation pipelines are heavily focused on data science workflows. In an experimentation pipeline, a data scientist runs different experiments, trains various versions of models, often in parallel, and packages up models for production to a serving engine. The primary focus is experimentation and building models. Notice that they tend to run from data cleaning, after data has been ingested and transformed, all the way through training and deployment.

Data engineering pipelines are more focused on the data engineer persona. Data engineers ingest data from various data sources, clean it, transform it, check it for errors and prepare it for use by data scientists, who tend to focus on data at a higher level of abstraction. Of course, there are data scientists who act as data engineers as well, but in more advanced teams we find the roles increasingly specialized. Data pipelines run from ingestion to packaging models up for production. At the end, they tend to overlap with experimentation pipelines in deployment, but they tend to do the underlying systems work of deployment, such as packaging the model up with dependencies and scheduling it to a container or serving engine.

We've discovered that the teams surveyed often faced their biggest challenges with collecting and cleaning data, and with QAing and transforming data, which falls squarely on the shoulders of data engineers. This is reflected in the composition of teams as well. Most companies surveyed employed more data engineers than data scientists. It's also reflected in where teams are spending the most time and money.

To further illustrate the difference between the two roles, consider the following: a data engineer might be more worried about connecting to external data sources via RBAC tokens, ingesting that data from Snowflake or Amazon Redshift or an object store like Amazon S3, and transforming the data into a format that is readable by the tools a data scientist is using, such as PyTorch or TensorFlow. On the other hand, a data scientist is largely concerned with data at a higher level. A data scientist focuses on understanding the features of the data, such as calculating age from date of birth or discovering the clusters of pixels that indicate the outlines of a person or a building in an image. This is roughly equivalent to the old systems administrator and programmer dichotomy in traditional programming, though it is not a perfect analogy. Today's data engineers have the programming skills necessary to automate many aspects of their work, and many data scientists are comfortable altering data to make it ready for feature extraction. In practice, there is often a lot of overlap between these two styles of pipelines, and many platforms do aspects of both data engineering and experimentation, but it is helpful to consider them separately because, in our experience, platforms tend to lean towards one or the other end of the spectrum, and that affects how their interface, API, programming and visualizations manifest.
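To make that split concrete, here is a minimal sketch (in pandas) of the higher-level feature work described above, deriving an age feature from a date-of-birth column. The column names and the tiny inline dataset are hypothetical; the data engineer's share of the job is everything that happens before this point, getting the raw records out of Snowflake, Redshift or S3 into a readable form.

```python
# Minimal sketch: the data-scientist side of the work, deriving an "age"
# feature from a raw date-of-birth column. Column names and rows are
# hypothetical examples, not from any real dataset.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "date_of_birth": ["1985-04-12", "1992-11-30", "1978-01-05"],
})

# Parse the raw strings, then compute age in whole years as of today.
customers["date_of_birth"] = pd.to_datetime(customers["date_of_birth"])
today = pd.Timestamp.today()
customers["age"] = ((today - customers["date_of_birth"]).dt.days // 365).astype(int)

print(customers[["customer_id", "age"]])
```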
The Two Types of Orchestration

There are two major types of orchestration:

- Loosely coupled
- Tightly coupled

Both have their advantages and disadvantages. The basic difference between them is simple. Loosely coupled orchestration engines are not tied to the underlying execution. Airflow is an example of a loosely coupled orchestrator. It does not have a dedicated compute and scheduling engine. Because it is loosely coupled, it can execute tasks on a diverse set of compute engines, such as Kubernetes or Spark, and it can even execute other distributed application frameworks like Ray or Iguazio's MLRun. Tightly coupled orchestration systems are tightly bound to their underlying compute engine. Spark is a perfect example, as it is both an orchestrator and a distributed big-data processing engine. Pachyderm is another example of an orchestrator that is tightly coupled with its underlying execution engine.

The biggest advantage of a loosely coupled framework is that it is general purpose and can execute tasks on many different engines. It can also execute the tasks of other tightly coupled orchestrators. That makes it useful as an orchestrator of orchestrators. Loosely coupled frameworks tend to be higher level and more abstracted in terms of the kinds of actions they can perform. An example of a loosely coupled framework is Airflow.

The biggest disadvantage of a loosely coupled framework is that, because it is more general purpose, it can't take advantage of all the unique capabilities of the underlying execution engine without plugins. Even with plugins, a loosely coupled orchestrator is not a two-way street. Because Spark has knowledge of the underlying data, it can share data between tasks. It also has knowledge of memory allocation and compute resources, and different scheduling options, that it can't share with Airflow. Loosely coupled orchestrators often have no concept of the underlying data and so can't trigger pipelines based on changes in data. On the other hand, tightly coupled systems, such as ClearML, can trigger pipelines based on changes to the data, such as when new telematics data flows into a data lake, which in turn triggers a job with only the changed data instead of the entire dataset. There are exceptions to this rule, such as the loosely coupled orchestrator Prefect, which can pass state and data dependencies to underlying systems, but loosely coupled systems are always at a disadvantage to tightly coupled systems when they need to utilize a special feature of the underlying system. The disadvantage of tightly coupled systems is that their orchestration often cannot extend to other systems to execute tasks, and you must deploy the underlying compute engine to make them work. Their orchestration is siloed to that system.

Which style of orchestration system to use depends on an organization's needs and scale. Smaller teams may only need the capabilities of a specific engine and can thus focus all their energy on that engine to meet their needs. Large enterprise AI/ML teams, with diverse model production needs, are likely to need more than one compute engine for different kinds of AI/ML capabilities, and so they may use a combination of loosely coupled frameworks, as well as the underlying tightly coupled frameworks, at different stages in the machine-learning pipeline.
Compute Engines

Now that we've explored orchestration, let's turn to the underlying compute engines. The bedrock of any AI/ML workload is its compute engine. These are the workhorses of execution in any AI/ML platform. They process data and code; schedule resources like memory, disk space, GPU, TPU and CPU; execute steps; and scale and parallelize execution of those steps. We're going to assume you have a generalized understanding of virtualization, containers and parallelization in general and not go deeply into the details of these systems. But they are important to reflect on for a moment, because we've found that the vast majority of current AI/ML platforms rely on one of three compute engines:

- Kubernetes
- Spark
- Ray

There are others, but it is worth focusing on these three as they encapsulate the three key focuses a compute engine can have, namely:

- General purpose compute engine
- Tightly coupled orchestration and compute
- Machine-learning-application-focused compute

There are some platforms that run "bare metal" in theory, but today that mostly means running on virtualized containers or serverless instances in the cloud, or on virtualized or serverless local instances in a data center.

Of the three, Kubernetes is the most general purpose compute engine. It evolved from Google's own internal container and orchestration capabilities and the desire to build a clean, universal cloud "operating system." You can see its history and timeline here. Kubernetes is generic enough that it can encapsulate and run other compute engines on top of it, making it the most general purpose of the compute engines. It is language agnostic, allowing teams to run any kind of code, including Bash, C/C++, Python, Rust and Java. It also does not matter if you are building a web-scale distributed web application or a machine-learning training pipeline; it can run any kind of workload because it is so general purpose.
Apache Spark is the second most well-known compute engine, and it has a long history. Spark has its own battle-tested parallel processing engine and does not rely on third-party scaling platforms like Kubernetes, although Spark can be deployed on Kubernetes. It's important to note that Spark was created before the AI/ML deep-learning revolution that started with AlexNet. It was primarily designed to deal with processing big data, based on Google's MapReduce paper, and as a better version of Hadoop, an early disk-bound version of a MapReduce-type system. Spark's biggest innovation was keeping and processing most of the data in memory, which made it up to one hundred times faster than Hadoop, which was heavily disk dependent. It's a testament to the design of the system that it allows for additional use cases like AI/ML, but it is not always a perfect fit. Spark is most adept at dealing with structured data, such as columnar text data in a database, and semi-structured data, like JSON files, though there have been efforts to make it more capable of dealing with unstructured data, like video, audio, images and unstructured text such as legal documents, specifically through the Delta Lake project, though that is still not the primary strength of the platform. As noted earlier, Spark is both an orchestrator/pipelining system and a compute engine. Its orchestration is built around Scala, though it does support other languages through ports and wrappers, such as Python.
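As a small illustration of the structured, columnar work Spark is strongest at, and of the Python wrapper layer (PySpark) mentioned above, here is a minimal sketch; the tiny in-memory dataset and column names are invented for the example.

```python
# Minimal PySpark sketch: a typical structured aggregation of the kind Spark
# excels at. The in-memory rows and column names are illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("structured-example").getOrCreate()

orders = spark.createDataFrame(
    [("alice", 120.0), ("alice", 80.0), ("bob", 42.5)],
    ["customer", "order_total"],
)

# Aggregate spend per customer, the bread and butter of analytics workloads.
spend = orders.groupBy("customer").agg(F.sum("order_total").alias("total_spend"))
spend.show()

spark.stop()
```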
Ray is the third compute framework and a relative newcomer to the market. It was designed at UC Berkeley's RISELab, primarily as a tool for reinforcement learning (RL) and as a "replacement for Spark" with a specific ML focus, since many kinds of ML jobs do not fit into MapReduce-style paradigms. As computer science professor Michael Jordan wrote in an article from 2017, in the early days of the project: "You need flexibility. You need to put together not just things like neural nets but planning and search and simulation. This creates all kinds of complex task dependencies. It's not very easy simply to write a MapReduce kind of paradigm. You can write it, but it's not going to execute very effectively when you have very heterogeneous workloads and tasks. It needs to be adapted to the performance of the algorithms, as the system is learning, to change its flow, its tasks."

The platform has since been adapted to be a more general purpose compute framework for machine learning through Dataset, a distributed data loading and compute library, though it is still most often used for reinforcement learning and as a serving engine. Ray is perhaps the newest and most cutting-edge of the frameworks, but it is simultaneously the least general purpose and the most specifically designed for ML workloads.

It should be noted that calling Ray a replacement for Spark is not exactly correct. Ray does not have its own underlying compute engine, and it is not low-level enough to act as one currently. It relies on general purpose compute frameworks like Kubernetes to run. Instead, it is a set of Python libraries for building distributed machine-learning applications and serving them. It focuses on some of the key dependencies for newer and more cutting-edge AI/ML techniques that do not fit into the do-a-task-and-then-wait paradigm of MapReduce. For example, reinforcement learning often has many parallel and less linear tasks that work together in a dependency tree, and they all need to finish before the next step can proceed. Lastly, it is exclusively for Python applications.

Another potential advantage of Ray is that you can deploy a single node without needing to deploy Kubernetes. This gives developers the chance to write code, test it and then scale it out to Kubernetes clusters without changing that code later, which lowers the barrier to entry. However, according to the documentation, "Ray Serve lacks the ability to declaratively configure your ML application via YAML files. In Ray Serve, you configure everything by Python code."
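That quote captures Ray's Python-first model: parallelism is expressed directly in Python rather than in YAML or a MapReduce-style job description. A minimal sketch of the core task API, run on a single local node, might look like the following; the toy workload is a stand-in for real feature or training work.

```python
# Minimal Ray sketch: turn a plain Python function into a remote task and run
# several of them in parallel. The workload here is a trivial placeholder.
import ray

ray.init()  # starts a local, single-node Ray instance; no cluster needed


@ray.remote
def square(x: int) -> int:
    return x * x


# Launch eight tasks concurrently and collect their results.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))

ray.shutdown()
```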
As noted earlier, we can think of the three kinds of platforms in different ways. Kubernetes is the most low-level. It is totally general purpose, and it does not matter if you are building a web application on top of it or a machine-learning training pipeline, but because of that, the logic of those applications exists higher up, so it is not enough by itself to create a machine-learning orchestrator and pipeline. Spark is the most long-lived and mature of the tightly coupled orchestrators, but it was not designed with AI/ML in mind. It has built-in scaling and processing capabilities, and at its best it can process data that fits cleanly in a database with great flexibility and power. Ray is the newest of the frameworks and was designed with AI/ML in mind, and it takes into account the latest techniques in the space, such as RL, but it is not an execution engine in its own right and needs something like Kubernetes to run on. Understanding the capabilities of those engines helps us understand the offerings of many platforms on the market, because they sit at the base of those platforms and inform their capabilities higher up the stack. Understanding the limits and peculiarities of those platforms in particular helps us see through some of the marketing claims of various offerings, especially when marketing teams promise capabilities that are not an easy fit within the limitations of each style of platform.

Data Engineering Orchestration and Pipelines
Kubeflow is one of the earliest and most well-known orchestration/pipelining systems. It gestated at Google as an open-source project designed to work on Kubernetes, which is the de facto standard for cloud application hosting and scaling. Kubeflow is not a single project but a collection of projects. When referring to Kubeflow here, we are referring only to Kubeflow Pipelines, which is by far the most popular project in the group, along with notebooks. Kubeflow pipelines themselves are largely based on Argo Workflows. In relation to the AIIA blueprint, Kubeflow most closely fits the data engineering pipeline definition, and that is reflected in the usage of the project, roughly 73% of which is by ML engineers, another term for data engineers, as of 2021. The Kubeflow project predates the concept of MLOps, and it reflects a more DevOps workflow in its design. It does not include a concept of a data-driven pipeline, where new datums are able to trigger events and kick off pipelines and automation steps. It also lacks pure language agnosticism, with a heavy focus on Python; R support is rudimentary. It also lacks commercial support, which means any team running it in production will need to support it completely on their own or through the community. Several commercial products have built their stack to include Kubeflow, most notably Google's Vertex AI, HPE's Greenlake for MLOps and Arrikto's MLOps platform. As Kubeflow pipelines are still in active development and often tricky to support, the AIIA does not recommend deploying the open-source version in production at this time, unless you have a strong open-source team with a history of supporting open-source projects for mission-critical applications. To date, no company has specifically created a dedicated, supported, standalone commercial version of Kubeflow Pipelines. Instead, the pipelines are wrapped into a larger tooling structure, such as Arrikto's data snapshotting filesystem, Google's AutoML tools and HPE's suite of open-source tools for AI/ML in Greenlake. For teams that are not highly sophisticated in supporting upstream open-source projects, we recommend choosing a vendor that can support Kubeflow as part of a package with rapid bug fixes, a clean upgrade path and regression patching of earlier versions.
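For readers who have not seen Kubeflow Pipelines code, the sketch below shows roughly how a pipeline is defined with the kfp SDK (v1-era style) and compiled into the Argo-backed workflow that actually runs on Kubernetes. The step functions, names and output path are placeholders, not part of any real deployment.

```python
# Minimal Kubeflow Pipelines (kfp v1-style) sketch: two toy steps wired into a
# DAG and compiled to a workflow file. Step logic and names are placeholders.
import kfp
from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def clean_data() -> str:
    # Placeholder for a real containerized data-cleaning step.
    return "rows-cleaned: 1000"


@func_to_container_op
def report(summary: str):
    print(summary)


@dsl.pipeline(name="example-data-prep", description="Toy two-step pipeline")
def data_prep_pipeline():
    cleaned = clean_data()
    report(cleaned.output)  # wiring the output creates the dependency edge


if __name__ == "__main__":
    # Compile to the YAML workflow definition that Kubeflow/Argo executes.
    kfp.compiler.Compiler().compile(data_prep_pipeline, "data_prep_pipeline.yaml")
```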
Airflow is another data-engineering-focused pipelining system that allows users to build DAG objects in Python to define their workflow as code. As mentioned earlier, it is loosely coupled. Many companies and organizations use Airflow for CI/CD-style orchestration of AI/ML pipelines, but it wasn't purpose-built for AI. It is 100% Python focused and doesn't allow any easy way to plug in other languages. Airflow works best with workflows that are mostly static and slowly changing. It is not built for passing large quantities of data from one task to the next, and the project is not recommended for "high-volume, data-intensive tasks," according to the project readme. Furthermore, "a best practice is to delegate to external services specializing in that type of work." Also, "Airflow is not a streaming solution, but it is often used to process real-time data, pulling data off streams in batches."
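A minimal sketch of Airflow's workflow-as-code model is shown below: the DAG and its dependency edge are declared in Python, and the tasks only coordinate work, with heavy data processing assumed to be delegated to external systems as the project guidance above suggests. The DAG id and task functions are hypothetical.

```python
# Minimal Airflow (2.x-style) sketch: a two-step DAG defined as code. The task
# bodies are placeholders; in practice they would call out to external systems.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling a small batch of records from an upstream source")


def transform():
    print("handing the batch to an external engine for the heavy lifting")


with DAG(
    dag_id="example_ml_prep",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The dependency edge: extract must complete before transform starts.
    extract_task >> transform_task
```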
Prefect was developed specifically to address some of Airflow's limitations. Most notably, it's a loosely coupled framework that takes the underlying data dependencies into account as well as the task state, whereas Airflow accounts only for state. It also adds capabilities like task versioning. It is a Python library that looks to minimize extra dependencies, such as its own scheduler. To run production or parallelizable workflows, it relies on Dask to run distributed.
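The sketch below shows that difference in a Prefect (2.x-style) flow: the value returned by one task is passed straight into the next, so the orchestrator tracks the data dependency itself rather than just a success or failure signal. The task names and toy records are hypothetical.

```python
# Minimal Prefect (2.x-style) sketch: the data returned by load_rows flows into
# score as a tracked dependency. Task names and the toy records are made up.
from prefect import flow, task


@task
def load_rows():
    return [{"user_id": 1, "events": 12}, {"user_id": 2, "events": 3}]


@task
def score(rows):
    # A trivial transformation standing in for real feature or scoring logic.
    return [{**r, "active": r["events"] > 5} for r in rows]


@flow
def daily_scoring():
    rows = load_rows()   # returns the actual data, not just a task state
    return score(rows)   # Prefect records rows as score's upstream input


if __name__ == "__main__":
    print(daily_scoring())
```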
Dbt is another data-engineering-focused transformation engine, but it is not a pipelining engine that allows for a series of orchestrated steps. It is exclusively focused on SQL and on running those queries directly in data warehouses, so it is mostly useful for BI and analytics but is not an ideal choice for AI/ML workloads. It has no support for unstructured workloads and should not be considered when looking for general orchestrators, either loosely coupled or tightly coupled, but it can be very useful for streamlining work with SQL backends.

Pachyderm is a tightly coupled, data-engineering-focused orchestrator/pipeline that also combines versioning and lineage tracking with an immutable data lake that includes data deduplication. It allows users to string together complex transformation steps in any language because it is container-based and thus language agnostic. It relies on Kubernetes execution to scale, unlike Airflow, which can run with or without Kubernetes. Pachyderm excels at unstructured data like video, audio and imagery, and semi-structured data like JSON, but it can handle structured data as well, such as CSV files. While there is an advantage to keeping all data in a unified data lake, organizations that are heavily reliant on structured data will find databases more performant for highly transactional processing. Pachyderm automatically builds a DAG, rather than requiring the user to pre-build it, by treating each step in the pipeline as an atomic unit of work, with well-defined JSON or YAML definitions describing the steps in its pipelines. Those definitions call individual containers to execute code to transform, train and track models as they move through the ML lifecycle. However, while Pachyderm can do training and deploy models, it is primarily used in complex data preprocessing and data preparation. A team with complex training requirements should turn to frameworks like Horovod or Determined AI (now part of HPE). When it comes to deployment, a dedicated deployment framework such as Algorithmia (now part of DataRobot), UbiOps, MLRun or Seldon is usually used.
As noted earlier, Apache Spark is both a processing/compute engine and a data engineering pipeline, though the tool can also be used for data science experimentation. We'll focus on its pipelining capabilities here. It's important to differentiate between Spark, an open-source tool with multiple vendors such as Microsoft offering it as a service, and Databricks, which is the primary corporate backer of the project and the largest player in Spark deployments. Spark also underpins many other platforms as the scaling and processing engine, as in some parts of the DataRobot AI Cloud. It's important to understand that the capabilities of these various platforms are highly dependent on what they built on top of Spark, namely their various APIs and GUIs as well as their proprietary overlay applications. When you're evaluating Spark as a pipelining tool versus a company backing Spark, you are evaluating very different things. In many ways, you should see Spark as a choice of compute engine with some core capabilities and limitations, and then evaluate the vendor platforms built on top of Spark as a separate consideration.

Let's start with Databricks's version of Spark. While Databricks's offerings have morphed into a suite of products that supports everything from ML to analytics to visualizations, Spark was primarily designed as a replacement for Hadoop, with a focus on data engineers working with big data. While Databricks has shifted much of its focus to AI in its marketing, the primary use case for Spark was big data and analytics workloads for most of its history. Because it uses Parquet files for storing data, which are columnar database files, it excels at structured and semi-structured workloads. While Databricks does say it's possible to use Spark as a unified storage backend for any kind of data with its lakehouse architecture, in practice this is rarely done if an organization is dealing with large audio, video or image files (such as high-resolution satellite images), as storing unstructured data in Parquet files is largely impractical, even with the compression capabilities of the format. However, Databricks's Spark is one of the most popular platforms for structured and semi-structured batch workloads, and it has a large community supporting it. In addition, Databricks's Spark and the suite of tools surrounding it make them one of the most well-supported and well-funded companies in the space for high-speed structured data processing and structured AI/ML workloads, such as churn prediction, demand forecasting and anomaly detection. Scala is the primary language underpinning Spark, and though Databricks supports Python, which has emerged as the undisputed lingua franca of ML, support for key Python libraries requires porting by Databricks, and the ports largely act as wrappers around Scala. This can create some anomalies in error debugging, and it means that teams may need to wait for a specific Python library or specific features of a Python library to be ported, rather than simply getting the latest Python library from its source repo and interacting with the system natively. That said, Databricks's Python libraries are among their most consistently updated and widely supported.
Now let's turn to DataRobot, one of the few systems that looks to focus on both data engineering pipelines and experimentation pipelines. Again, we shouldn't think of DataRobot as a single product but as a suite of tools. Some of these tools use different pipelining backends. The data-engineering-focused tool in the AI Cloud is called Data Mesh and came out of an acquisition of Paxata. The tool has a strong GUI, and it excels at data prep for ML, though it is not a generic data engineering pipeline for any kind of transformation. It uses a set of data connectors to a variety of backends, such as Amazon Redshift, Snowflake, MySQL, Oracle and SAP HANA. Since most of the data connectors are for databases, DataRobot is largely focused on structured and semi-structured data, and most of their excellent visualization tools support columnar data, but they can work with unstructured data as well. We'll discuss the data science aspects of the DataRobot platform in the next section on experimentation pipelines. While DataRobot existed as a fully integrated set of tools and as a tightly coupled pipelining system, the team has worked to decouple the tooling over the last few years so that outside tools can be incorporated into the workflow more easily.
Amazon's Data Wrangler is part of the SageMaker toolset. Like DataRobot, SageMaker is not a single unified tool but a set of tools. Data Wrangler itself is less a generic pipelining tool and more of a visual way to do data transformations. It focuses completely on data cleaning, transformation and prep, and on nearly 100% structured data transformations. It includes a wide variety of connectors and over 300 pre-baked transforms, such as one-hot encodes, and gives users the ability to design their own data wrangling steps as well. It is not a generic data transformer, and it is strictly focused on creating as many pre-built data science transforms as possible for structured data. The largest challenge with using SageMaker is that it is designed as a totally standalone suite, unlike other tools that strive for easy connectivity to third-party tools, and thus it is difficult to use it with external third-party tools. However, if a third-party provider has either a built-in integration or provides an API, then it is possible to use those tools with SageMaker.
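To show what one of those pre-baked transforms actually does, here is a one-hot encode performed with plain pandas rather than with Data Wrangler itself; the column and its categories are hypothetical.

```python
# Minimal sketch of a one-hot encode, the kind of pre-built transform a tool
# like Data Wrangler packages up. The "plan" column is a hypothetical example.
import pandas as pd

df = pd.DataFrame({"plan": ["free", "pro", "free", "enterprise"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["plan"], prefix="plan")
print(encoded)
```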
Azure Machine Learning is the most open-source-focused of the big cloud providers. It offers two distinct sets of proprietary pipelines and an open-source offering. Its Data Factory pipelines are focused on data ingestion and transformation, and they have many pre-built connectors for everything from major databases like Oracle, Snowflake and MS SQL Server to files via FTP and NFS and object stores. Their Data Factory also includes predefined transformation steps that can ease the process of ingesting and transforming data. Their open-source alternative to Data Factory is Airflow.

Dataiku has some of the most advanced visual tooling for constructing data pipelines in a GUI, and it is able to act as a transformer for many kinds of data, though, again, the primary focus is on structured data. Its GUIs provide excellent dashboards and visualizations along with advanced markdown capabilities for columnar data and text. It does provide a series of plugins to deal with unstructured data, but they are largely either Tier 2 supported or unsupported, which means they are used at an organization's own risk. Dataiku does provide some examples of unstructured use cases for deep learning in its knowledge base, but they are mostly pre-baked solutions such as object detection with a pretrained model. Dataiku uses a mixture of pipelines on the backend to accomplish its goals. Under the hood, it can use its own pipeline engine, Hadoop/Spark or Kubernetes/Docker, and it can run computation directly in SQL databases such as Oracle or Snowflake. The platform provides one of the largest arrays of connectors for external data, from widely supported platforms common to most data transformation engines, such as Amazon Redshift, to less common ones like those pulling data directly from Twitter.

The last data-engineering-focused pipeline worth a minor mention is Dagger, from the former creators of Docker. It includes some novel features, most notably the idea of portable container pipelines. Similar to Pachyderm, it uses templating to define steps in the pipeline, which also makes it language agnostic. It uses Google's CUE configuration language framework instead of YAML or JSON. Dagger does not include any kind of data storage or any concept of data-driven pipelines. Lastly, Dagger was only launched recently, and it should be considered only for early adopters at this time.
By far the largest design pattern in AI/ML platforms is experimentation pipelines that focus on data scientists. These pipelines are usually DAGs (directed acyclic graphs), which are conceptual representations of a series of steps or tasks, a mathematical abstraction of a data pipeline. Experimentation pipelines tend to hide the actual DAG (a creation by the user) and include a well-known interface, such as Jupyter Notebooks, as the primary way to interact with the system. They often allow the creation of steps through Python, the default language for most AI/ML tasks. By contrast, data-engineering-focused pipelines often require DAGs to be written directly.

Almost every AI/ML platform includes some notion of an experimentation pipeline, which tends to blur the lines between engineering-focused pipelines and experimentation pipelines. For instance, despite being used mostly by data engineers, Kubeflow includes support for Jupyter Notebooks, as do Pachyderm and Arrikto. Nevertheless, we maintain the distinction at the AIIA by asking the question: where does the primary user of the platform spend their time? If the primary user is creating containers, writing steps to the pipeline in YAML or JSON and creating DAGs, then they're most likely data engineers, and we would classify the purpose of the pipeline as data engineering. If most users are working in notebooks, running experiments, tuning hyperparameters and building models, then it's mostly data scientists using the platform, and we would classify it as an experimentation pipeline. Just be aware that many of the pipelines can function in both arenas, and you will have to make the choice based on the skills and composition of your team.
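To make the DAG idea concrete, here is a toy sketch, independent of any platform discussed in this report, that defines a handful of pipeline steps, declares their dependencies and runs them in topological order. Real orchestrators layer scheduling, containers, retries and data passing on top of exactly this abstraction; the step names below are purely illustrative.

```python
# Toy DAG runner: steps and their dependencies, executed in a valid order.
# Illustrative only; not the API of any platform covered in this report.
from graphlib import TopologicalSorter  # Python 3.9+

def ingest():   print("ingest raw data")
def clean():    print("impute and normalize")
def features(): print("extract features")
def train():    print("train and evaluate a model")

steps = {"ingest": ingest, "clean": clean, "features": features, "train": train}
# Each key lists the steps it depends on.
dag = {"clean": {"ingest"}, "features": {"clean"}, "train": {"features"}}

for name in TopologicalSorter(dag).static_order():
    steps[name]()
```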
As a side note, while many of the platforms discussed in this report use the phrase end-to-end in their marketing, the AIIA recommends that organizations adopt a more comprehensive definition that includes the work of data engineers and data scientists as a whole and accept that a collection of tools may be needed to serve both groups well. Organizations should evaluate the capabilities of any platform along those lines to decide if it truly has end-to-end capabilities.

In practice, most experimentation pipelines assume that the data is prepped and largely ready to go. They include data ingestion capabilities or the ability to point to stored data, but they generally do not include steps like imputing missing or corrupted data, transcoding audio from WAV to MP4, resizing all images or changing images to a different format, which is the work of data engineers. Again, sometimes they include this ability, but in general, advanced ETL concepts are not part of experimentation-focused platforms. Instead, they focus on helping data scientists extract features, test multiple models against each other, tune hyperparameters and finalize a model for production.

Experimentation engines tend to fall into two categories:
1. Platforms that include their own orchestration/pipelining engines and/or schedulers
2. Platforms that aggregate metadata from other pipelining systems and orchestrate and/or visualize those pipelines
ClearML, DataRobot, Google Vertex, Amazon SageMaker, Azure Machine Learning, Neu.ro, Valohai, Iguazio, Flyte and ZenML fall into the first category, in that they have their own pipelining engines. While Neu.ro includes its own pipelining engine, it primarily sees itself as an orchestrator and integrator of third-party tooling, so we place it here. Companies that focus on third-party integrations and look to act as a glue layer could constitute a third category; however, since many platforms focus on integrations with third parties, such as Dataiku, Iguazio and Domino Data Labs, we decided it would not add much value. Weights & Biases, Neptune AI, Comet.ML, Infuse AI's PipeRider, Domino Data Labs and DAGsHub broadly fall into the second category, in that they rely on other pipelining systems, such as Spark, to run. They put their emphasis on visualizations and tracking experiments across different tools. You can usually recognize this second style of platform by a telltale sign in their marketing that usually reads something along the lines of "integration with a single line of code." Metadata stores are designed to be easily integrated with other platforms, and they look to act as a single source of truth across the disparate metadata of the systems they connect with.

Let's look at the platforms in turn, starting with the loosely and tightly coupled orchestrators.

Loosely Coupled and Tightly Coupled Orchestrators
ClearML is an open-source suite of tools that covers a broad range of AI/ML tasks. Its orchestration and pipelines serve both data engineering and experimentation, with an eye towards making the transition between the two simple and easy. The orchestration module introduces scheduling and remote job execution on bare-metal machines, cloud infrastructure and Kubernetes clusters. The UI is geared towards building and testing models and getting them into production, allowing job scheduling directly from the experimentation UI. It also includes tools like a metadata index for data that allows data scientists to search through that data as well as split and sort it. The tools allow for classic data science tasks like slicing up training data and testing data on the fly, along with cloning previous experiments. Along the way, the platform captures artifacts from code and versions of all created models in a queryable model repository. It allows integrations with TensorBoard, Matplotlib and Git. Lastly, it allows for hyperparameter optimization on multiple machines, data preprocessing and model deployment, giving data scientists a more end-to-end experience to get models into production.
DataRobot's experimentation capabilities are largely built around Spark, but they also have a number of excellent proprietary interfaces that make the data science experience easier, particularly for structured use cases, along with extensive visualizations. In addition, DataRobot has one of the better AutoML capabilities on the market. We do not spend much time discussing AutoML in this report because it calls up ideas of AI that can automatically do most of the work of a data scientist, when that is not the reality. However, DataRobot's vision of AutoML is quite focused and clear. It essentially involves trying out a lot of known solutions to a specific problem in ML all at once and offering users the ability to see which one works best without a lot of manual experimentation. A data science team can then focus on refining or fine-tuning various methods from there. Like other large platforms in the space, DataRobot wants to create a true end-to-end experience, and their tools not only deal with experimentation but also with some aspects of data engineering, as well as deployment, serving and monitoring.
Vertex AI is Google's competitor to SageMaker and the Azure Machine Learning cloud platform. Like other big platforms, it's really a suite of loosely connected tools rather than a seamless and unified experience. It's one of the only commercial platforms built on top of Kubeflow Pipelines (and TensorFlow Extended), but it also uses a lot of proprietary tech under the hood, such as neural architecture search and its own feature store. Unlike some of the other platforms it competes with, it doesn't focus as heavily on a tremendous GUI design, and that's never been Google's strongest asset anyway. But it does leverage Google's know-how with containers, large-scale systems and the command line. It's somewhat more modular than SageMaker, but it is still largely a walled garden, though its API is stronger than some more monolithic engines. The system is focused on teams with heavy programming and command-line experience.
H2O is an open-source, in-memory big-data processing engine for machine-learning and analytics workloads that competes with Spark. Like many other large platforms profiled in this report, it's best to think of it as a suite of tools with the in-memory pipeline as the core behind it. It's best suited to structured data workloads, but it does have support for images, audio and other unstructured data, and it can read common third-party formats like Parquet files. It supports many common ML algorithms, with the developers custom-writing them for distributed processing. That means H2O is mostly used for running predefined machine-learning models versus developing your own neural networks. H2O's framework predates PyTorch and other popular frameworks, and it competes with those more widely adopted frameworks in addition to being a fast pipelining system. It consists of four primary products: H2O, its proprietary in-memory pipeline; Deep Water, which integrates it with and allows it to leverage TensorFlow; Sparkling Water, which integrates with Apache Spark; and Driverless AI, an AutoML tool that competes with DataRobot's offering and allows for automatic feature engineering, model selection and tuning.
Amazon SageMaker is a suite of tools that includes everything from data wrangling to pipelines to a proprietary feature store to a labeling system. Its experimentation pipeline runs on the proprietary SageMaker Pipelines and is accessed primarily through SageMaker Studio. The suite of tools inside SageMaker expands regularly. Most of the visual interfaces are designed almost entirely for structured data workloads, and it is not ideally suited to running unstructured workloads. It's currently one of the most walled-garden platforms of all and is designed to be an all-in-one integrated suite.
Azure Machine Learning splits its experimentation engines into a proprietary pipelining system, Azure Machine Learning pipelines, and a CI/CD-style orchestrator called Azure Pipelines, which is comparable to Jenkins. Its pipelines are modular and allow independent execution on Docker containers. Its orchestration allows the scripting of steps and the orchestration of Kubernetes, VMs, Azure Functions, Azure Web Apps and more. The orchestrator uses stages, gates and approvals to create a deployment strategy, and it allows orchestration steps from other CI systems, like Jenkins.
Iguazio's orchestration is built around its open-source MLRun Python framework, which executes steps on its Nuclio serverless platform. The orchestrator is loosely coupled and designed to orchestrate a number of platforms, like Spark. It has built-in integrations with TensorFlow, PyTorch and other major machine-learning frameworks. Its interface is weighted towards model experimentation, feature engineering and training, but it's also excellent at deployment and can serve models as well, and it has built-in integrations with distributed analytics frameworks like Dask.
Neu.ro is a loosely coupled orchestration system that prides itself on being able to easily integrate with other systems and to orchestrate them. Because it was designed with ML in mind, it is more granular than a higher-level, abstract loosely coupled orchestrator like Airflow, which was designed for many tasks. Because of the development team's focus on making it an orchestrator first, it includes many integrations with other tools in the ecosystem, like Seldon Alibi, Seldon Core, Prometheus and Grafana, DVC and other platforms, with the goal of knitting them together and weaving them into a single cross-platform CI/CD workflow for ML.
HPE's Ezmeral platform leverages a number of open-source platforms, like Spark and MLflow, under the hood, along with Kubernetes for scaling. It also includes Airflow for loosely coupled orchestrations that can execute against Kubeflow Pipelines, Spark or other external platforms. Because it leverages open-source platforms so extensively, it is more modular than other platforms, and it allows for swapping in components more easily. It includes a proprietary interface called the App Workbench to help connect these different offerings, as well as an API to schedule jobs more easily.

Similar to HPE Ezmeral, Shakudo built its platform on Kubernetes from a suite of open-source tools such as TensorFlow, Spark, MLflow and Jupyter, while layering a proprietary interface on top. This is a trend we expect to see continue in the space as open-source stacks mature. Integrating disparate open-source offerings is a challenge, and we expect to see more companies offer a clean interface to these solutions.
Valohai's platform is similar to Pachyderm in that it uses Kubernetes on the backend, YAML/JSON to define steps in the pipeline and containers to execute code. That makes it language agnostic and even framework-version agnostic. A customer may use one version of Anaconda in one step and another in a later step, while using a completely different language like Rust in yet another step. The platform's focus is primarily on deep learning, and its UI makes it lean closer to the data science side of the pipeline because it emphasizes experiments and comparing experiments. Because it is language and tool agnostic, it can be used for data engineering workloads, but because it has a well-defined experimentation interface, we classify it primarily as an experimentation pipeline.
Flyte is a tightly coupled framework that came out of Lyft and eventually found a home in the Linux Foundation's LF AI & Data group. It was designed as a replacement for Airflow but with a focus on ML experimentation and deployment, and because it runs on Kubernetes clusters natively, it is considered more tightly coupled. The system looks to avoid YAML configurations for steps, instead focusing on direct code, such as Python, to orchestrate steps, much like MLRun, though it also supports writing orchestration steps in Java and Scala. Because it is container-based, it can execute steps agnostically, in a manner similar to Pachyderm or Valohai.
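As an illustration of this code-first style, the following is a minimal sketch using Flyte's Python SDK (flytekit), with typed tasks composed into a workflow. The task bodies, names and return values are hypothetical stand-ins rather than anything taken from the report.

```python
# Minimal sketch of Flyte's code-first orchestration style (assumes flytekit
# is installed); the task logic is a placeholder, not a real training job.
from typing import List
from flytekit import task, workflow

@task
def prepare(n: int) -> List[int]:
    return list(range(n))

@task
def train(data: List[int]) -> float:
    # Stand-in for real training: return a fake score.
    return sum(data) / (len(data) or 1)

@workflow
def experiment(n: int = 100) -> float:
    return train(data=prepare(n=n))

if __name__ == "__main__":
    # Workflows can be executed locally like ordinary Python for quick tests.
    print(experiment(n=10))
```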
ZenML is a loosely coupled Python-based orchestrator that, like Neurolabs, prides itself on a wide range of integrations with everything from Spark to SageMaker to Argo. It was designed as an orchestrator of orchestrators, with the ability to stitch together an MLOps stack from a wide variety of proprietary and open-source components.

Metadata Engines, Experiment Trackers and Visualizers
Now that we've covered the platforms that include their own orchestration engines, we can focus on the platforms that are more metadata focused, for the purpose of acting as systems of record across multiple platforms, and that offer visualizations and tracking.

The first, Weights & Biases, is not an orchestrator or pipeline system itself; rather, it integrates with other orchestrators to offer experiment tracking and visualizations of experiments. It also offers checkpointing and the ability to rerun experiments. Lastly, it monitors CPU and GPU performance. Comet.ML also emphasizes experimentation and visualization via easy integration with one line of code. It's designed to be simple to integrate into existing pipelines, even home-grown ones, and it enables teams to track and compare models more easily. It allows for workspaces that let teams consolidate, manage, collaborate and report on their machine-learning projects and experiments. Comet.ML provides automatic logging for a number of popular Python machine-learning frameworks, even if those frameworks don't support logging natively.
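To show what that "single line of code" style looks like in practice, here is a hedged sketch of instrumenting an existing training loop with Weights & Biases; the project name and metrics are hypothetical, and it assumes the wandb package is installed and an API key is configured. Comet.ML and Neptune AI follow a very similar pattern with their own client libraries.

```python
# Sketch of lightweight experiment tracking bolted onto an existing loop.
# Hypothetical values; assumes `pip install wandb` and a configured API key.
import random
import wandb

run = wandb.init(project="demo-experiments", config={"lr": 0.01, "epochs": 5})

for epoch in range(run.config["epochs"]):
    loss = 1.0 / (epoch + 1) + random.random() * 0.05  # stand-in for training
    wandb.log({"epoch": epoch, "loss": loss})           # streamed to the tracker

run.finish()
```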
Domino Data Labs acts as a metadata store across platforms, and its Workbench platform acts as an easy interface to Spark and Ray compute engines. Its goal is to act as a system of record across platforms and to make interfacing with other platforms easier, so it falls into the loosely coupled visual orchestrator category. Infuse AI's PipeRider system monitors changes across pipelines and notifies users when they break expectations, to help avoid repeated failures. Like the other frameworks in this space, it is designed to be easily integrated with other platforms, starting with DBT and moving to Snowflake, TensorFlow, Weights & Biases and MLflow. Since it is not designed to do the orchestration itself, it can be gradually introduced to a stack. DAGsHub is the last metadata and visualization platform we'll profile here. It includes the ability to compare and diff files and to create versions of DAGs, and it allows for team collaboration with the commenting, annotating and sharing of DAGs. It offers integrations with third-party tools like Jenkins, and it is designed for modularity.
241、ost other aspects of the system.This is an area in which every company should study the available options extensively and take their time to make sure the platforms match their needs and their future needs.Many of the companies in this space offer similar capabilities,and companies looking to make a
242、 decision on which platform to use should consider the following:301.The key capabilities of their team Are they strong programmers or do they need a powerful visual GUI or both?2.The range of capabilities of the platform3.The kinds of AI/ML projects they need to support Structured,semi-structured,u
243、nstructured or all of the above?4.Connector and integration support5.Customer team and support6.The proprietary tooling capabilities7.Their graphical front end We recommend that companies carefully consider not just their crrent AI/ML use cases but any future use cases they might want to work with i
244、n the future.There are a number of capable orchestration and pipelining platforms on the market,but not every one will meet every need.If you are not careful,you might find that youve selected a powerful system for working with structured data and databases but will struggle with deep learning later
245、.Its likely you may need a suite of tools to accomplish all of your goals as an organization.Its important to look beyond marketing.Many platforms promise that they can do everything and handle any kind of workload,but be sure to ask companies to demonstrate a range of capabilities and beware of sle
246、ight of hand.If a company promises they can handle high-resolution satellite imagery as easily as textual data,be sure to examine examples and case studies that demonstrate that capability in detail.It is also difficult to know what kinds of capabilities you will need for future workloads,and the li
247、mits of platforms may not make themselves known until your team has used the platform for a time and then pivoted to a new use case,only to find out it is challenging to achieve their aims within that platform.It is likely your team may need to use a range of tools,including one or more tightly coup
248、led data engineering frameworks,as well as a loosely coupled orchestrator and an experimentation engine.The Next Five YearsIn the current generation of platforms,Spark is one of the most dominant.Thats largely because it is the oldest existing codebase,and companies that built their capabilities aro
249、und it have had the most time to mature their interfaces,APIs and those capabilities.If you are working with structured workloads,Spark has excellent capabilities and thousands of successful use cases.However,MapReduce has limitations with regard to AI/ML.Newer AI paradigms like RL have cascading se
250、ts of task dependencies that do not easily fit into a MapReduce structure.Unstructured data,despite efforts like Delta Lake,is not an ideal fit on the platform,and that makes it less than ideal for deep learning.Just as Spark was built with data processing in mind,newer frameworks,like Ray or the Go
251、ogle Pathways training architecture,are being built from the ground up with existing AI/ML workloads and more cutting-edge workloads in mind.31As the field of AI/ML advances and new techniques come out of the research labs and become commonplace,we expect the limitations of MapReduce to continue to
252、show their age.We expect more frameworks like Ray and Pathways to emerge over the next five years and for them to compete to become the default distributed processing and distributed AI/ML application framework for the next ten years.We also expect frameworks for AI/ML that are more language agonist
253、ic;even though most major work starts in Python,other languages can and do add value to the various steps of the AI/ML workload.We also expect those platforms to absorb newer AI/ML workloads that rely on that underlying logic.That said,we firmly expect Spark to continue to be a powerhouse in structu
254、red data applications and to continue to excel at a broad range of other types of BI and analytics.While we do expect competition on the AI/ML framework front,we do not expect any other general purpose compute engine to emerge and compete with Kubernetes.Kubernetes itself emerged as the de facto win
255、ner against other strong contenders like Apache Mesos,Docker Swarm and Rancher.None of them had the broad mindshare and sticking power of Kubernetes or its general purpose capabilities and scaling potential,other than Mesos,which is no longer a serious competitor in the cloud compute space.Kubernete
256、s services are now available on premises and in every major public cloud,and we do not see serious competition to it suddenly emerging.As such,expect Kubernetes to largely act as the underlying compute layer,and AI/ML framework layers to run on top of it or a serverless platform like Nuclio or Amazo
257、ns Lambda.We expect competition at both the data engineering pipeline layer and the experimentation pipeline layer to remain fierce over the next five years.We will likely see some consolidation among the lightweight integration-style platforms like Weights&Biases,and there are a number of excellent
258、 visualization platforms already.Beyond that,we expect some companies to fail or run out of runway due to funding or management or market fit.We also see some acquisitions and mergers narrowing the space.That said,we certainly have not seen the last of new players to the field,as it is an exciting s
259、pace and still developing,so new players may offset the losses of older players who have failed.Finally,its likely that several platforms will begin to pull ahead of the competition,gobbling up market share and customers,but we do not expect the field to be completely decided in the next five years,
260、with one or two totally dominant players sitting at the top of the heap.MODEL DEPLOYMENT AND SERVINGCompanies and platforms covered in this section include:Seldon,DataRobot,Iguazio,Modzy,ClearML,ONNX,OctoML,UbiOps,Dask,Flask,NVIDIA Triton,Superwise,Cortex,BentoML,Domino Data Labs,LGN,Ray 32 Introduc
261、tionOver the last few decades,traditional code-based application deployments have evolved to allow for faster and more automated rollouts.We went from large waterfall deployment methodologies,where companies typically deployed new monolithic code once or twice a year,to Agile-and DevOps-style deploy
262、ments,where applications became more modular and discreet,with deployments happening monthly,weekly,multiple times per day and even hundreds of times per day.We went from small-scale applications running inside of corporate firewalls to web-scale applications that can serve millions of requests per
263、second on distributed infrastructure.That evolution in speed came through a virtuous loop of new strategies,new tools,more mature languages,and better libraries,as well as more robust serving and scaling frameworks.That loop includes a well-known and highly effective strategy of rolling out code,tes
264、ting it,rolling it back,serving requests and scaling it to meet surging requests,as well as the use of software tools to support serving and scaling.While AI/ML deployment builds on many of those foundations,AI/ML models also present some very different factors that currently make them more challeng
265、ing to deploy and scale.In 2020,Algorithmia(now DataRobot),found that only 14%of organizations were able to deploy a model in 07 days.They reported that 28%of organizations took 830 days to deploy a model,22%took 3190 days and 13%took 91365 days.Two years later,in 2022,our survey shows that the numb
266、ers have changed.Teams are getting better and faster at getting models into production,but many are still struggling.Only 5%were able to get a model into production in a day,with 17%taking a week or less and 34%getting models into production in 30 days or less.That said,a significant portion,23%,wer
267、e still taking 23 months to get models deployed,and 20%were still taking 6 months to a year.33Teams are also deploying and managing a growing number of models in production,with 37%managing between 510,21%managing 1125,7%managing 2699 and 7%managing a 100 or more.34That means its time to finally put
268、 the often quoted stat that 87%of models never make it into production to rest for good.Its just not true anymore.Of course,there is still room for improvement,but MLOps teams are starting to match their DevOps counterparts in terms of speed and skill.Google found in their 2021 annual survey,The Sta
269、te of DevOps,which tracks traditional application deployment and serving,that the majority of the teams they surveyed were elite or high-performing teams.They found that 26%of teams surveyed hit elite status,meaning they were able to deploy applications on demand or multiple times per day,versus onl
270、y 7%in 2018.Furthermore,40%of teams hit“high”status,meaning they were able to deploy once per week or at least once a month.Twenty-two percent of MLOps teams can deploy a model in a week or less and 34%can do it in 30 days or less,which means many MLOps teams would qualify for a“high”status.But with
271、 all that said,teams still say the hardest thing to plan for when building and operationalizing models is the time it takes to deploy a model.When we talk about model deployment and serving were really talking about two things:1.Deployment:The workflow to get a trained model into production2.Serving
272、:The platform that hosts and serves the model to respond to requests The workflow involves both the human workflow and/or an automated or partially automated series of steps to package up a trained model and get it into production,as well as the software tools to support the workflow.35When it comes
273、 to serving the model,were talking about a platform to receive and respond to requests with predictions and inference,as well as the software tools to maintain,manage and monitor those platforms.Typically the companies looking to assist teams with this part of the MLOps landscape are looking to help
274、 your team handle both the deployment workflow and to serve the model at scale.Deployment WorkflowsThe workflow consists of all the steps that are involved in taking a trained model and getting it into production.That might include creating an instance of the model that can receive requests for infe
275、rence,packaging up all of the specific runtime libraries in either a proprietary format or a container like Docker,defining ports and opening the firewall,running manual and automated tests,security checks,pushing the model version to a version tracking databases,running regression style tests,spinn
276、ing up instances of the model to receive requests and more.The speed that a model gets through this workflow depends on how much of the pipeline is automated versus manual.It should come as no surprise that teams relying on a lot of manual steps are further down the speed-to-production curve,taking
277、months or even a year to get a model into production.Teams that have a high degree of existing automation in the web applications and in-house applications are often better suited to carry that knowledge over to machine learning.Every company surveyed here offers some set of workflow wizards and aut
278、omation steps to help speed through the process of packaging up,versioning and deploying a model to a serving framework,including Seldon,DataRobot,Iguazio,Modzy,ClearML,OctoML,BentoML and UbiOps.The workflow tooling platforms use either their own proprietary wizard-like steps,or they combine their t
279、ools with external CI/CD tools like Git.They also typically leverage open-source package management solutions like Docker,though some support bare-metal deployments,and some,like BentoML offer their own unified model packaging format for deployment.Some companies,like LGN,offer a deployment workflow
280、 specifically geared to AI at the edge,such as on a smartphone,which has special considerations such as memory,and speed and size constraints.ServingAfter deployment comes serving,where a model serves requests for inference.This is similar to a web application in that it needs to receive,queue and r
281、espond to requests quickly.AI/ML models typically have higher compute and memory requirements versus traditional web applications,although there are exceptions with memory-hungry languages like Java.There are a number of key characteristics of a good model-serving framework:Framework agnostic Scalab
282、le36 Replicable Request queuing,batching and load balancing High concurrency and low latency Able to support GPU,TPU or other AI-accelerating hardware Well highlight how each of these appear in a typical model serving architecture as we describe the architecture below.Generally a model server takes
283、a model“endpoint,”which is an instance of the model that is able to process and return inference requests.Requests may come from external applications via REST,gRPC or HTTP/HTTPS or from a message streaming service such as Kafka or RabbitMQ.Requests are often queued up or batched by a scheduler,whic
284、h routes the requests back to the various instances of the model.The scheduling queue connects to the model agent,which in turn runs AI/ML frameworks,such as TensorFlow or Pytorch,or custom-built frameworks that deliver an inference response.The response is funneled back via HTTP/HTTPS,REST or RPC.S
285、ometimes the scheduler is limited to single instances of the model endpoint,in which case a deployment would need a software load balancer to route traffic across instances.In other platforms,the scheduler acts like a load balancer itself,intelligently routing to backend instances of the model endpo
286、int.Companies cited latency as the number one concern in production.This differs dramatically from in academia,where teams publishing a paper are looking to maximize their state-of-the-art results and throughput.In academia,teams dont much care if their model doesnt respond swiftly or uses a lot of
287、memory,but in production,latency really matters.If the ML model that powers Gmails suggested sentence completion doesnt respond faster than you can think of or type that sentence,then the model is largely worthless.37The primary architectures of serving a model are the following:Web service Containe
288、rs and microservices Serverless The first approach is to wrap the model up as a web service,with something like Flask,which works well for Python-based models.The second approach is to serve it as a container or series of containers,such as in a Kubernetes pod.The last approach is to serve it as a m
289、icroservice like OpenWisk or Nuclio or Fission,which is essentially a very thin container,and route requests to it via a service mesh like Istio.Scaling of models is generally done via one of the following:A proprietary or open-source clustering mechanism Containers and microservices controlled by a
290、n orchestrator such as Kubernetes Serverless platforms 38An example of proprietary or open-source clustering mechanisms are Spark clusters,which can pull from the MLflow model registry,which is a database of model versions.However,Spark is generally not used for serving,and Databricks considers serv
291、ing via MLflow a public preview and recommends it only for“low throughput and non-critical applications.”It is also currently limited to Conda-based Python applications only,which means its not framework agnostic.Another popular open-source autoscaling and clustering framework came out of the Rise L
292、ab at Berkley and spun off into a company that extends and maintains the framework,Anyscale,which offers a managed services cloud-based version of Ray.Ray is a compute framework for distributed machine-learning applications,as we discussed earlier,and it is not an alternative to containers or micros
293、ervices.Instead,when it comes to serving,it is similar to something like Dask,which offers a framework for building distributed analytics applications in Python.KServe is one of the more well-known and widely used open source serving frameworks.It creates Custom Resource Definitions for ML models on
294、 arbitrary frameworks.It was developed jointly with contributions from NVIDIA,Google,Bloomberg,IBM,and Seldon as a cloud native model server for Kubernetes.Recently,it graduated from the Kubeflow project as a standalone project.It can be deployed on Kubernetes or serverlessly.For serverless,the syst
295、em uses Knative Serving,which bakes in automatic scale-up and scale-down capabilities.It uses Istio to expose service endpoints to the API.It can support canary and blue/green deployments and supports a good range of ML frameworks.It can also work with existing serving engines like TorchServe,Tensor
296、flow serving and Triton.It can host PyTorch,TensorFlow,and ONNX runtimes through Triton.It can also serve XGBoost and SKLearn through Seldons MLServer.Most clustering mechanisms can also use Kubernetes or Slurm as a compute substrate,in essence acting as an orchestrator of orchestrators.Of course,th
297、e most well-known open-source clustering and orchestration platform is Kubernetes,and weve found that most of the frameworks here utilize Kubernetes or support it at some level for orchestration and scaling.Weve also found that most serving engines are agnostic to the AI/ML framework or support a wi
298、de variety of frameworks.Typically this accomplished by packaging the AI/ML framework up in a container.Framework agnosticism or wide support for frameworks is crucial because there has been a massive proliferation of open-source tools that data scientists leverage to train and build models,includin
299、g,but not limited to,Pytorch,Tensorflow,ScikitLearn,Caffee,Theano and XGBoost.There are also pre-trained models available,like HuggingFaces transformers library,which includes foundational models like BERT,CLIP and Perceiver IO,which teams can fine tune for their own tasks.A conversion system like O
300、NNX,which creates a common file format to represent machine-learning models,may help,but most teams surveyed ended up running their models in their native format to avoid challenges with conversions or performance issues.Teams may use a framework to compress or optimize the model before serving it,a
301、s with OctoMLs compression service based on the open-source TVM project,in order to lower latency,memory and compute requirements.As of the time of our survey,the majority of large-scale,pioneering AI/ML organizations like Google,Uber,Lyft and Netflix,as well as the top AI/ML research organizations
302、like OpenAI and DeepMind,use their own custom deployment 39workflows and tools,typically built around a collection of in-house proprietary and/or open tools,like Git,Docker and Kubernetes,which they saw as an extension of their existing automated CI/CD process.This is typical in any early stage soft
303、ware ecosystem because when the pioneers start to build their applications,there are simply not commercial or open frameworks to build on,so they have to roll their own.Over time,as the market develops,commercial vendors learn from the successes and failures of the pioneers and create stable softwar
304、e that gets adopted by the next set of organizations and enterprises.These organizations arent usually research organizations,but they have the staff and know-how to implement the research in their own environments and they begin to leverage open and proprietary tooling to advance state-of-the-art t
305、echnology.The vast majority of enterprises today,outside of the pioneering tech and research organizations,leverage one of three kinds of tools:open source,enterprise versions of open source or proprietary serving engines.Open-source tools like Seldon Core or Modzys Chassis are usually supported via
306、 their in-house IT operations on the existing Kubernetes platforms they already run for traditional web applications.Organizations often turn to commercial vendors that offer enhanced versions of open source tools,such as Seldon Deploy or Anyscale Rays managed service or proprietary serving framewor
307、ks like Algorithmia(now part of DataRobots AI Cloud).Most of the model-serving commercial platforms like Modzy,Seldon or DataRobots AI Cloud either support a wide variety of frameworks or they are framework agnostic.Scaling involves creating multiple instances of a model and routing traffic across t
308、he instances to handle large volumes or concurrent requests.Scaling on almost every platform surveyed is done via containers running on Kubernetes or on a dedicated microservices architecture,which is essentially very thin containers and which still often runs on Kubernetes and containers on the bac
309、kend.Examples of serverless platforms used for model endpoint serving and scaling are Nuclio,Amazons Lambda and OpenWhisk,which runs on Kubernetes and OpenShift.Examples of microservices are Algorithmias(now DataRobot)serving platform,Microsoft Azures Kubernetes service and Azure Service Fabric.The
310、vast majority of these platforms are managed versions of Kubernetes and containers with extensions to make it easier for them to manage the Kubernetes clusters.Beyond scaling models,weve noticed that many of the platforms support advanced deployment scenarios,such as canary and shadow deployments.It
311、s worth checking the support for advanced deployments with every vendor youre considering for your deployment and serving needs.Shadow and canary deployments in AI/ML differ from traditional applications,and its worth digging to discover whether your vendor understands those differences.For example,
312、a shadow deployment,where requests are sent to the current production version of a model and the test version of the model,may need to persist for a lot longer than a web application because it may take a longer time to establish whether the models inference is improved or degraded.40Patterns of Dep
313、loymentLastly,while we often refer to model deployment as singular,as if models were monolithic,typical production deployments are often a series of multiple models or transformation stages,along with traditional business logic to support those models.Its best to think of deployments as deployments
314、of a cluster of models and business logic that forms a complete application.The team at Anyscale noted at least four different common model deployment patterns that each come with their own difficulties.While its not an exhaustive list,weve found that it largely matches the typical production deploy
315、ments today.Each type of model presents its own problems in production,and its worth knowing whether your vendor has a detailed understanding of these kinds of deployments.The first is the pipeline-style deployment.Pipelines go through a series of linked stages.A request passes through various algor
316、ithms or models like beads on a string.A good example is something like a movie recommendation engine,where an inference request passes through various stages like an embeddings lookup,feature interaction,nearest neighbors and rankings before returning its result.Ensemble models are a collection of
317、upstream models that receive the same request and all return an answer to the requester.That could be for a number of reasons.The first is that you may have a newer version of a model running in a shadow deployment.That means a newer version of the model is live and receiving traffic,but you want to
318、 make sure that the model is providing good results,so the older model remains primary and acts as a baseline until the shadow model is ready to take over as the primary.41The second is to give aggregate answers,which can decrease inconsistencies across different models or different versions of a mo
319、del.A final request may simply take the average of the answers or select one of the answers to test its response with users.Lastly,dynamic selection may route requests to different models based on the models individual specialization.A pet owner might indicate they own a dog on a web form,and the se
320、rvice would route to the model that specializes in dogs instead of the model that specializes in cats.The third pattern is business logic.A model is usually surrounded by traditional coded business logic and rarely exists independently.That means traditional application code needs to interface with
321、the model.The last pattern is reinforcement learning.Reinforcement learning(RL)is unique in that an agent endpoint is making decisions and receiving a reward or punishment for its decision-making.This is the most advanced style of deployment,and for now it is largely confined to the most advanced re
322、search houses and large-scale tech and financial companies as it requires real-time data feeds and tremendous scalability.Ray is one of the few platforms built to handle RL.All of the patterns present challenges for scaling,and the top platforms provide fast interconnects between nodes and endpoints
323、,fault tolerance to address non-responsive or downed nodes,and error correction.Each of these styles of models has its own challenges for maintenance and uptime,and require excellent traditional IT monitoring and man-agement.For example,a pipeline model may fail at any stage,and if any stage fails,t
324、he entire output fails.An ensemble model may be more forgiving if it is simply taking the aggregate of answers from all the models,but if the failure per-sists,then the results are coming from a reduced set of models and thus suffering from the weakness in the individual models that the ensemble was
325、 looking to overcome.In the monitoring section of this report,well cover the kinds of monitoring typically found in modern AI/ML production deployments to ensure that models stay running smoothly.Current Trends and the Next Five YearsDeployment and serving are some of the most challenging and yet th
326、e most important pieces in the AI/ML puzzle.Too often,teams find themselves struggling to productionize machine-learning models after they were developed with a 42wide variety of open-source,homespun and proprietary tools.Memory issues,the security of the data and models,inference latency,dependenci
327、es,software versions,scaling and a thousand other factors can hurt performance.Drift can make a model useless in a short period of time,and teams need a smooth way to get models into production.Monitoring,as well see in a later section,requires different tools and different metrics than a typical we
328、b monitoring setup,requiring teams to familiarize themselves with additional platforms and concepts.There is a wide range of deployment assistance in almost every product surveyed.We recommend every team look carefully at the deployment process on any platform they are considering and take a close l
329、ook at how it works with a variety of simple and complex model deployments,from single models to pipeline models to ensemble models.Ask questions,such as what frameworks are supported?Is the process agnostic?How many methods of rollout are supported and how do they work?When it comes to serving mode
330、ls,we are seeing a split in approaches.The dominant approach is to containerize the model and serve it with all its dependencies and code and then replicate the container.This is the most popular approach and the most flexible.The reason for this is simple.There are thousands of libraries and dozens
331、 of major frameworks in multiple languages.Python is the most dominant,but apps that use R,Scala or Java are important too,especially in areas like fraud prevention,which has a long history in enterprises that use more than just Python.Containers are the most flexible way to deal with that diversity
332、.The second approach is to build a framework in a chosen language,usually Python.This is the approach Ray and Dask have taken,and its useful in that it goes where most machine-learning libraries currently exist and leverages that to allow people to build scalable,distributed machine-learning apps wi
333、thout altering much or any of their code.However,we recommend teams look at serving as being three layered.The first layer is an underlying general purpose compute,such as Kubernetes.The second layer is the leveraging of a serving engine like Seldon that uses something like Kubernetes but adds in paradigms and functions for machine learning.The third layer is a framework that runs atop either a fi