STATE OF AI REPORT. October 10, 2024. Nathan Benaich, AIR STREET CAPITAL. stateof.ai
About the authors
Nathan Benaich is the General Partner of Air Street Capital, a venture capital firm investing in AI-first companies. He runs the Research and Applied AI Summit (RAAIS), the RAAIS Foundation (funding open-source AI projects), AI communities in the US and Europe, and Spinout.fyi (improving university spinout creation). He studied biology at Williams College and earned a PhD from Cambridge in cancer research as a Gates Scholar.
Alex Chalmers is Platform Lead at Air Street Capital and regularly writes research, analysis, and commentary on AI via Air Street Press. Before joining Air Street, he was an associate director at Milltown Partners, where he advised big technology companies, start-ups, and investors on policy and positioning. He graduated from the University of Oxford in 2017 with a degree in History.
Introduction
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines. We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its seventh year. Consider this report a compilation of the most interesting things we've seen, with a goal of triggering an informed conversation about the state of AI and its implications for the future. We consider the following key dimensions in our report:
- Research: Technology breakthroughs and their capabilities.
- Industry: Areas of commercial application for AI and its business impact.
- Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
- Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
- Predictions: What we believe will happen in the next 12 months and a 2023 performance review to keep us honest.
Produced by Nathan Benaich and the Air Street Capital team.
Definitions
Artificial intelligence (AI): a broad discipline with the goal of creating intelligent machines, as opposed to the natural intelligence that is demonstrated by humans and animals. Artificial general intelligence (AGI): a term used to describe future machines that could match and
10、 then exceed the full range of human cognitive ability across all economically valuable tasks.AI Agent:an AI-powered system that can take actions in an environment.For example,an LLM that has access to a suite of tools and has to decide which one to use in order to accomplish a task that it has been
11、 prompted to do.AI Safety:a field that studies and attempts to mitigate the risks(minor to catastrophic)which future AI could pose to humanity.Computer vision(CV):the ability of a program to analyse and understand images and video.Deep learning(DL):an approach to AI inspired by how neurons in the br
12、ain recognise complex patterns in data.The“deep”refers to the many layers of neurons in todays models that help to learn rich representations of data to achieve better performance gains.Diffusion:An algorithm that iteratively denoises an artificially corrupted signal in order to generate new,high-qu
13、ality outputs.In recent years it has been at the forefront of image generation and protein design.Generative AI:A family of AI systems that are capable of generating new content(e.g.text,images,audio,or 3D assets)based on prompts.Graphics Processing Unit(GPU):a semiconductor processing unit that ena
14、bles a large number calculations to be computed in parallel.Historically this was required for rendering computer graphics.Since 2012 GPUs have adapted for training DL models,which also require a large number of parallel calculations.Definitionsstateof.ai 2024(Large)Language model(LM,LLM):a model tr
15、ained on vast amounts of(often)textual data to predict the next word in a self-supervised manner.The term“LLM”is used to designate multi-billion parameter LMs,but this is a moving definition.Machine learning(ML):a subset of AI that often uses statistical techniques to give machines the ability to le
16、arn from data without being explicitly given the instructions for how to do so.This process is known as“training”a“model”using a learning“algorithm”that progressively improves model performance on a specific task.Model:a ML algorithm trained on data and used to make predictions.Natural language proc
17、essing(NLP):the ability of a program to understand human language as it is spoken and written.Prompt:a user input often written in natural language that is used to instruct an LLM to generate something or take action.Reinforcement learning(RL):an area of ML in which software agents learn goal-orient
18、ed behavior by trial and error in an environment that provides rewards or penalties in response to their actions(called a“policy”)towards achieving that goal.Self-supervised learning(SSL):a form of unsupervised learning,where manually labeled data is not needed.Raw data is instead modified in an aut
19、omated way to create artificial labels to learn from.An example of SSL is learning to complete text by masking random words in a sentence and trying to predict the missing ones.Transformer:a model architecture at the core of most state of the art(SOTA)ML research.It is composed of multiple“attention
20、”layers which learn which parts of the input data are the most important for a given task.Transformers started in NLP(specifically machine translation)and subsequently were expanded into computer vision,audio,and other modalities.Definitionsstateof.ai 2024#stateofai|6 Introduction|Research|Industry|
Model type legend: in the rest of the slides, icons in the top-right corner indicate input and output modalities for the model. Input/output types: text, image, code, software tool use (text, code generation & execution), video, music, 3D, robot state, and biological modalities. Model types: LLMs, multimodal LLMs, multimodal LLMs for robotics, text-to-code, text-to-software-tool-use, text-to-image, text-to-video, text-to-music, image-to-3D, text-to-3D, and biological models.
Executive Summary
Research
- Frontier lab performance converges, but OpenAI maintains its edge following the launch of o1, as planning and reasoning emerge as a major frontier.
- Foundation models demonstrate their ability to break out of language as multimodal research drives into mathematics, biology, genomics, the physical sciences, and neuroscience.
- US sanctions fail to stop Chinese (V)LLMs rising up community leaderboards.
Industry
- NVIDIA remains the most powerful company in the world, enjoying a stint in the $3T club, while regulators probe the concentrations of power within GenAI.
- More established GenAI companies bring in billions of dollars in revenue, while start-ups begin to gain traction in sectors like video and audio generation. Although companies begin to make the journey from model to product, long-term questions around pricing and sustainability remain unresolved.
- Driven by a bull run in public markets, AI companies reach $9T in value, while investment levels grow healthily in private companies.
Politics
- While global governance efforts stall, national and regional AI regulation has continued to advance, with controversial legislation passing in the US and EU.
- The reality of compute requirements forces Big Tech companies to reckon with real-world physical constraints on scaling and their own emissions targets. Meanwhile, governments' own attempts to build capacity continue to lag.
- Anticipated AI effects on elections, employment and a range of other sensitive areas are yet to be realized at any scale.
Safety
- A vibe-shift from safety to acceleration takes place as companies that previously warned us about the pending extinction of humanity need to ramp up enterprise sales and usage of their consumer apps.
- Governments around the world emulate the UK in building up state capacity around AI safety, launching institutes and studying critical national infrastructure for potential vulnerabilities.
- Every proposed jailbreaking fix has failed, but researchers are increasingly concerned with more sophisticated, long-term attacks.
Scorecard: reviewing our predictions from 2023
Our 2023 predictions and the evidence:
- A Hollywood-grade production makes use of generative AI for visual effects. YES: largely badly, but GenAI visual effects have been seen in Netflix and HBO productions.
- A generative AI media company is investigated for its misuse during the 2024 US election circuit. Not yet, but there's still time.
- Self-improving AI agents crush SOTA in a complex environment (e.g. AAA game, tool use, science). NO: not yet, despite promising work on open-endedness, including strong game performance.
- Tech IPO markets unthaw and we see at least one major listing for an AI-focused company (e.g. DBRX). While the Magnificent Seven have enjoyed strong gains, private companies are hanging on until markets settle. However, AI chip company Cerebras has filed to IPO.
- The GenAI scaling craze sees a group spend $1B to train a single large-scale model. NO: not quite yet - let's give it another year.
- The US's FTC or UK's CMA investigate the Microsoft/OpenAI deal on competition grounds. YES: both regulators are investigating this partnership.
- We see limited progress on global AI governance beyond high-level voluntary commitments. YES: the commitments from the Bletchley and Seoul summits remain voluntary and high-level.
- Financial institutions launch GPU debt funds to replace VC equity dollars for compute funding. NO: some VC funds are rumored to be offering GPUs for equity, but we're yet to see anyone go down the debt route.
- An AI-generated song breaks into the Billboard Hot 100 Top 10 or the Spotify Top Hits 2024. YES: it turns out this had already happened last year with "Heart on My Sleeve", but we've also seen an AI-generated song reach #27 in Germany and spend several days in the Top 50.
- As inference workloads and costs grow significantly, a large AI company (e.g. OpenAI) acquires or builds an inference-focused AI chip company. YES: Sam Altman is reportedly raising huge sums of money to do this, while each of Google, Amazon, Meta and Microsoft continue to build and improve their own AI silicon.
Section 1: Research
For much of the year, both benchmarks and community leaderboards pointed to a chasm between GPT-4 and the best of the rest.
However, Claude 3.5 Sonnet, Gemini 1.5, and Grok 2 have all but eliminated this gap as model performance now begins to converge. OpenAI's reign of terror came to an end, until... On both formal benchmarks and vibes-based
39、 analysis,the best-funded frontier labs are able to rack up scores within low single digits of each other on individual capabilities.Models are now consistently highly capable coders,are strong at factual recall and math,but less good at open-ended question-answering and multi-modal problem solving.
40、Many of the variations are sufficiently small that they are now likely to be the product of differences in implementation.For example,GPT-4o outperforms Claude 3.5 Sonnet on MMLU,but apparently underperforms it on MMLU-Pro-a benchmark designed to be more challenging.Considering the relatively subtle
41、 technical differences between architectures and likely heavy overlaps in pre-training data,model builders are now increasingly having to compete on new capabilities and product features.The OpenAI team had clearly clocked the potential of inference compute early,with OpenAI o1 appearing within week
s of papers from other labs exploring the technique. The Strawberry landed, doubling down on scaling inference compute. By shifting compute from pre- and post-training to inference, o1 reasons through complex prompts step-by-step in a chain-of-thought (CoT) style, employing RL to sharpen the CoT and the strategies it uses. This unlocks the possibility of solving multi-layered math, science, and coding problems where LLMs have historically struggled due to the inherent limitations of next-token prediction. OpenAI reports
significant improvements on reasoning-heavy benchmarks versus 4o, with the starkest on AIME 2024 (competition math): a whopping score of 83.83 versus 13.4. However, this capability comes at a steep price: 1M input tokens of o1-preview cost $15, while 1M output tokens will set you back $60.
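At list prices, a minimal back-of-the-envelope sketch of what this means per request; the GPT-4o prices used for comparison (roughly $5/$15 per million input/output tokens at the time) are an assumption, and in practice o1 also bills its hidden reasoning tokens as output, so real ratios run higher.

```python
# Back-of-the-envelope cost comparison for a single request.
# Assumption: GPT-4o list prices of ~$5 / 1M input and ~$15 / 1M output tokens;
# o1-preview prices are as quoted above ($15 / $60).

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Cost in dollars for one request at the given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

# Example: a 2k-token prompt producing a 1k-token answer.
o1_cost = request_cost(2_000, 1_000, price_in_per_m=15.0, price_out_per_m=60.0)
gpt4o_cost = request_cost(2_000, 1_000, price_in_per_m=5.0, price_out_per_m=15.0)

print(f"o1-preview: ${o1_cost:.3f}, GPT-4o: ${gpt4o_cost:.3f}, ratio: {o1_cost / gpt4o_cost:.1f}x")
# -> o1-preview: $0.090, GPT-4o: $0.025, ratio: 3.6x (consistent with the 3-4x figure below)
```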
This makes it roughly 3-4x more expensive than GPT-4o. OpenAI is clear in its API documentation that it is not a like-for-like 4o replacement and that it is not the best model for tasks that require consistently quick responses, image inputs, or function calling. The community was quick to put o1 through its paces, finding
46、that it performed significantly better than other LLMs on certain logical problems and puzzles.Its true edge shone through,however,on complex math and science tasks,with a viral video of a PhD student reacting with astonishment as it reproduced a year of his PhD code in approximately an hour.However
47、,the model remains weaker on certain kinds of spatial reasoning.Like its predecessors,it cant play chess to save its life yet.stateof.ai 2024#stateofai|14o1 showcases both areas of incredible strength and persistent weakness Introduction|Research|Industry|Politics|Safety|Predictions In April,Meta dr
48、opped the Llama 3 family,3.1 in July,and 3.2 in September.Llama 3.1 405B,their largest to-date,is able to hold its own against GPT-4o and Claude 3.5 Sonnet across reasoning,math,multilingual,and long-context tasks.This marks the first time an open model has closed the gap with the proprietary fronti
49、er.stateof.ai 2024#stateofai|15Llama 3 closes the gap between open and closed models Introduction|Research|Industry|Politics|Safety|Predictions Meta stuck to the same decoder-only transformer architecture that its used since Llama 1,with minor adaptations,namely more transformer layers and attention
50、 heads.Meta used an incredible 15T tokens to train the family.While this blew through the“Chinchilla-optimal”amount of training compute,they found that both the 8B and 70B models improved log-linearly up to 15T.Llama 3.1 405B was trained over 16,000 H100 GPUs,the first Llama model trained at this sc
51、ale.Meta followed up with Llama 3.2 in September,which incorporated 11B and 90B VLMs(Llamas multimodal debut).The former was competitive with Claude 3 Haiku,the latter with GPT-4o-mini.The company also released 1B and 3B text-only models,designed to operate on-device.Llama-based models have now rack
52、ed up over 440M downloads on Hugging Face.#stateofai|16 Introduction|Research|Industry|Politics|Safety|PredictionsBut how open are open source models?stateof.ai 2024 With open source commanding considerable community support and becoming a hot button regulatory issue,some researchers have suggested
53、that the term is often used misleadingly.It can be used to lump together vastly different openness practices across weights,datasets,licensing,and access methods.#stateofai|17 Introduction|Research|Industry|Politics|Safety|PredictionsIs contamination inflating progress?stateof.ai 2024 With new model
54、 families reporting incredibly strong benchmark performance straight out-of-the-gate,researchers have increasingly been shining a light on dataset contamination:when test or validation data leaks into the training set.Researchers from Scale retested a number of models on a new Grade School Math 1000
55、(GSM1k)that mirrors the style and complexity of the established GSM8k benchmark,finding significant performance drops in some cases.Similarly,researchers at X.ai re-evaluated models using a dataset based on the Hungarian national finals math exam that post-dated their release,with similar results.Bu
56、t benchmarking challenges cut both ways.There are alarmingly high error rates in some of the most popular benchmarks that could be leading us to underestimate the capabilities of some models,with safety implications.Meanwhile,the temptation to overfit is strong.stateof.ai 2024#stateofai|18Researcher
57、s try to correct problems in widely used benchmarks Introduction|Research|Industry|Politics|Safety|Predictions A team from the University of Edinburgh flagged up the number of mistakes in MMLU,including the wrong ground truth,unclear questions,and multiple correct answers.While low across most indiv
58、idual topics,there were big spikes in certain fields,such as virology,where 57%of the analyzed instances contained errors.On a manually corrected MMLU subset,models broadly gain in performance,although worsened on professional law and formal logic.This says inaccurate MMLU instances are being learne
59、d during pre-training.In more safety-critical territory,OpenAI has warned that SWE-bench,which evaluates models ability to solve real-world software issues,was underestimating the autonomous software engineering capabilities of models,as it contained tasks that were hard or impossible to solve.The r
60、esearchers partnered with the creators of the benchmark to create SWE-bench Verified.The LMSYS Chatbot Arena Leaderboard has emerged as the communitys favorite method of formalizing evaluation by“vibes”.But as model performance improves,its beginning to produce counterintuitive resultsstateof.ai 202
61、4#stateofai|19Live by the vibes,die by the vibesor close your eyes for a year and OpenAI is still#1 Introduction|Research|Industry|Politics|Safety|Predictions The arena,which allows users to interact with two randomly selected chatbots side-by-side provides a rough crowdsourced evaluation.However,co
62、ntroversially,this led to GPT-4o and GPT-4o Mini receiving the same scores,with the latter also outperforming Claude Sonnet 3.5.This has led to concerns that the ranking is essentially becoming a way of assessing which writing style users happen to prefer most.Additionally,as smaller models tend to
63、perform less well on tasks involving more tokens,the 8k context limit arguably gives them an unfair advantage.However,the early version of the vision leaderboard is now beginning to gain traction and aligns better with other evals.Deficiencies in both reasoning capabilities and training data mean th
64、at AI systems have frequently fallen short on math and geometry problems.With AlphaGeometry,a symbolic deduction engine comes to the rescue.stateof.ai 2024#stateofai|20Are neuro-symbolic systems making a comeback?Introduction|Research|Industry|Politics|Safety|Predictions A Google DeepMind/NYU team g
65、enerated millions of synthetic theorems and proofs using symbolic engines,using them to train a language model from scratch.AlphaGeometry alternates between the language model proposing new constructions and symbolic engines performing deductions until a solution is found.Impressively,It solved 25 o
66、ut of 30 on a benchmark of Olympiad-level geometry problems,nearing human International Mathematical Olympiad gold medalist performance.The next best AI performance scored only 10.It also demonstrated generalisation capabilities-for example,finding that a specific detail in a 2004 IMO problem was un
necessary for the proof. Research suggests that models are robust to having their deeper layers - which are meant to handle complex, abstract, or task-specific information - pruned intelligently. Maybe it's possible to go even further. It's possible to shrink models with minimal impact on performance. A Meta/MIT team looking at open-weight pre-trained LLMs concluded that it's possible to do away with up to half of a model's layers and suffer only negligible performance drops on question-answering benchmarks. They identified optimal layers for removal based on similarity and then "healed" the model through small amounts of efficient fine-tuning. NVIDIA researchers took a more radical approach by pruning layers, neurons, attention heads, and embeddings, and then using knowledge distillation for efficient retraining. The resulting MINITRON models, derived from Nemotron-4 15B, achieved comparable or superior performance to models like Mistral 7B and Llama-3 8B while using up to 40x fewer training tokens.
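A minimal sketch of similarity-based layer pruning along these lines: find the contiguous block of layers whose input and output hidden states are most similar (i.e. the block that changes the representation least), drop it, then "heal" with a little fine-tuning. The model name, block size, and cosine-distance criterion here are illustrative assumptions, not the paper's exact recipe.

```python
# Similarity-based depth pruning sketch: drop the most redundant block of layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"   # any decoder-only HF model exposing model.model.layers
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_name)

@torch.no_grad()
def block_distances(model, text: str, n_drop: int):
    """Cosine distance between hidden states entering and leaving each candidate
    block of n_drop consecutive layers (lower = more redundant block)."""
    ids = tok(text, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states  # (n_layers + 1) tensors
    dists = []
    for start in range(len(hs) - 1 - n_drop):
        cos = torch.nn.functional.cosine_similarity(hs[start], hs[start + n_drop], dim=-1).mean()
        dists.append((1 - cos).item())
    return dists

n_drop = 8
dists = block_distances(model, "The quick brown fox jumps over the lazy dog.", n_drop)
start = int(torch.tensor(dists).argmin())          # most redundant contiguous block
print(f"Dropping layers {start}..{start + n_drop - 1}")

# Remove the block in place; a short LoRA or full fine-tune would then "heal" the model.
layers = model.model.layers
model.model.layers = torch.nn.ModuleList(list(layers[:start]) + list(layers[start + n_drop:]))
model.config.num_hidden_layers = len(model.model.layers)
```

In practice the similarity would be averaged over a calibration set rather than a single sentence.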
As Andrej Karpathy and others have argued, current large model sizes could be a reflection of inefficient training. Using these big models to refine and synthesize training data could help train capable smaller models, as distilled models become more fashionable. Google has embraced this approach, distilling Gemini 1.5 Flash from Gemini 1.5 Pro, while Gemma 2 9B was distilled from Gemma 2 27B, and Gemma 2B from a larger unreleased model. There is also community speculation that Claude 3 Haiku, a highly capable smaller model, is a distilled version of the larger Opus, but Anthropic has never confirmed this. These distillation efforts are going
multimodal too. Black Forest Labs have released FLUX.1 dev, an open-weight text-to-image model distilled from their Pro model. To support these efforts, the community has started to produce open-source distillation tools, like arcee.ai's DistillKit, which supports both logit-based and hidden-states-based distillation. Llama 3.1 405B is also being used for distillation, after Meta updated its terms so output logits can be used to improve any model, not just Llama ones.
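A minimal sketch of the logit-based distillation these tools support: the student is trained to match the teacher's softened token distribution alongside the usual next-token loss. Model names, temperature, and loss weighting are illustrative assumptions; teacher and student must share a tokenizer for this to make sense.

```python
# Logit-based knowledge distillation sketch (teacher and student share a vocabulary).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

teacher = AutoModelForCausalLM.from_pretrained("big-teacher-model")      # hypothetical checkpoints
student = AutoModelForCausalLM.from_pretrained("small-student-model")

def distill_loss(input_ids, attention_mask, T: float = 2.0, alpha: float = 0.5):
    labels = input_ids.clone()
    with torch.no_grad():
        t_logits = teacher(input_ids, attention_mask=attention_mask).logits
    out = student(input_ids, attention_mask=attention_mask, labels=labels)
    # KL between softened teacher and student distributions (token shift omitted for brevity).
    kd = F.kl_div(
        F.log_softmax(out.logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * kd + (1 - alpha) * out.loss   # blend soft-label and hard-label losses
```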
As big tech companies think through large-scale end-user deployment, we're starting to see high-performing LLMs and multimodal models that are small enough to run on smartphones. Models built for mobile compete with their larger peers. Microsoft's phi-3.5-mini is a 3.8B LM that competes with larger 7B-class models and Llama 3.1 8B. It performs well on reasoning
76、and question-answering,but size restricts its factual knowledge.To enable on-device inference,the model was quantized to 4 bits,reducing its memory footprint to approximately 1.8GB.Apple introduced MobileCLIP,a family of efficient image-text models optimized for fast inference on smartphones.Using n
77、ovel multimodal reinforced training,they improve the accuracy of compact models by transferring knowledge from an image captioning model and an ensemble of strong CLIP encoders.Hugging Face also got in on the action with SmolLM,a family of small language models,available in 135M,360M,and 1.7B format
s. By using a highly curated synthetic dataset created via an enhanced version of Cosmopedia (see slide 31), the team achieved SOTA performance for the size. It's possible to shrink the memory requirements of LLMs by reducing the precision of their parameters, and researchers are increasingly managing to minimize the performance trade-offs. Strong results in quantization point to an on-device future. Microsoft's BitNet uses a "BitLinear" layer to replace standard linear layers, employing 1-bit weights and quantized activations. It shows competitive performance compared to full-precision models and demonstrates a scaling law similar to full-precision transformers, with significant memory and energy savings. Microsoft followed up with BitNet b1.58, which uses ternary weights to match full-precision LLM performance at 3B size while retaining the efficiency gains. Meanwhile, ByteDance's TiTok (Transformer-based 1-Dimensional Tokenizer) quantizes images into compact 1D sequences of discrete tokens for image reconstruction and generation tasks. This allows images to be represented with as few as 32 tokens, instead of hundreds or thousands.
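A minimal sketch of the "absmean" ternary weight quantization that BitNet b1.58 describes: scale each weight tensor by its mean absolute value, round to {-1, 0, +1}, and keep a per-tensor scale. The straight-through-estimator detail used during training is only noted in a comment here.

```python
# BitNet-b1.58-style ternary weight quantization sketch.
# In real training, a straight-through estimator passes gradients through the rounding.
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    gamma = w.abs().mean()                           # per-tensor scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)  # values in {-1, 0, +1}
    return w_q, gamma

w = torch.randn(4096, 4096)
w_q, gamma = absmean_ternary(w)
w_deq = w_q * gamma                                  # dequantized weights used in the matmul
print(w_q.unique(), f"~{w_q.eq(0).float().mean():.0%} zeros")

# Memory intuition: ~1.58 bits per weight (log2 of 3 states) vs 16 bits for bf16,
# which is where the headline footprint and energy savings come from.
```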
Parameter-efficient fine-tuning (e.g. via LoRA) is nothing new, but Stanford researchers believe a more targeted approach offers greater efficiency and adaptation. Will representation fine-tuning unlock on-device personalization? Inspired by model interpretability research, ReFT (Representation Fine-tuning) doesn't alter the model's weights. Instead, it manipulates the model's internal representations at inference time to steer its behavior. While it comes with a slight inference-time penalty, ReFT requires 15-65x fewer parameters compared to weight-based fine-tuning methods. It also enables more selective interventions on specific layers and token positions, giving fine-grained control over the adaptation process. The researchers show its potential in few-shot adaptation, where a chat model is given a new persona with just five examples. Combined with the small storage footprint for learned interventions, it could be used for real-time personalization on devices with sufficient compute power.
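A minimal sketch of the general idea: a small learned module edits hidden states at one layer and a few token positions while the base model stays frozen. The low-rank form is loosely modelled on the published LoReFT parameterization; the hook mechanics and hyperparameters are illustrative assumptions, not the paper's exact method.

```python
# Representation-level intervention sketch: only the intervention's parameters are trained.
import torch
import torch.nn as nn

class LowRankIntervention(nn.Module):
    def __init__(self, hidden: int, rank: int = 8):
        super().__init__()
        self.R = nn.Linear(hidden, rank, bias=False)   # projection into a low-rank subspace
        self.W = nn.Linear(hidden, rank)               # learned target within that subspace
    def forward(self, h):                              # h: (batch, n_positions, hidden)
        # h + R^T (W h + b - R h): edit the representation only inside the subspace.
        return h + (self.W(h) - self.R(h)) @ self.R.weight

def attach(model, layer_idx: int, positions: list[int], intervention: nn.Module):
    """Register a forward hook that edits hidden states at the given token positions."""
    def hook(module, inputs, output):
        hs = output[0] if isinstance(output, tuple) else output   # (batch, seq, hidden)
        hs[:, positions, :] = intervention(hs[:, positions, :])
        return output
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage sketch: attach(model, layer_idx=15, positions=[0, 1, 2], intervention=LowRankIntervention(4096))
# then train only the intervention on the few persona examples, keeping the base model frozen.
```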
Models that combine attention and other mechanisms are able to maintain or even improve accuracy, while reducing computational costs and memory footprint. Hybrid models begin to gain traction. Selective state-space models like Mamba, designed last year to handle long sequences more efficiently, can to some extent compete with transformers, but
87、lag on tasks that require copying or in-context learning.That said,Falcons Mamba 7B shows impressive benchmark performance versus similar-sized transformer models.Hybrid models appear to be a more promising direction.Combined with self-attention and MLP layers,the AI21s Mamba-Transformer hybrid mode
88、l outperforms the 8B Transformer across knowledge and reasoning benchmarks,while being up to 8x faster generating tokens in inference.In a nostalgia trip,there are early signs of a comeback for recurrent neural networks,which had fallen out of fashion due to training and scaling difficulties.Griffin
89、,trained by Google DeepMind,mixes linear recurrences and local attention,holding its own against Llama-2 while being trained on 6x fewer tokens.By transferring knowledge from a larger,more powerful model,one could improve the performance of subquadratic models,allowing us to harness their efficiency
90、 on downstream tasks.stateof.ai 2024#stateofai|27And could we distill transformers into hybrid models?Itscomplicated.Introduction|Research|Industry|Politics|Safety|Predictions MOHAWK is a new method for distilling knowledge from a large,pre-trained transformer model(teacher)to a smaller,subquadratic
91、 model(student)like a state-space model(SSM).It aligns i)the sequence transformation matrices of the student and teacher models ii)and the hidden states of each layer,then iii)transfers the remaining weights of the teacher model to the student model to finetune it.The authors create Phi-Mamba,a new
92、student model combining Mamba-2 and an MLP block and a variant called Hybrid-Phi-Mamba that retains some attention layers from the teacher model.Mohawk can train Phi-Mamba and Hybrid-Phi-Mamba to achieve performance close to the teacher model.Phi-Mamba is distilled with only 3B tokens,less than 1%of
93、 the data used to train either the previously best-performing Mamba models and 2%for the Phi-1.5 model itself.Work with transformer alternatives and hybrid models is interesting,but at this stage remains niche.One paradigm still seems to rule them all.stateof.ai 2024#stateofai|28Either way,the trans
former continues to reign supreme (for now). Transformers vs other paradigms: 74% vs 22%. Last year's report pointed to the divides of opinion around synthetic data: with some finding it useful, others fearing its potential to trigger model collapse by com
95、pounding errors.Opinion seems to be warming.stateof.ai 2024#stateofai|29Synthetic data starts gaining more widespread adoption Introduction|Research|Industry|Politics|Safety|Predictions As well as being the main source of training data for the Phi family,synthetic data was used by Anthropic when tra
96、ining Claude 3 to help represent scenarios that might have been missing in the training data.Hugging Face used Mixtral-8x7B Instruct to generate over 30M files and 25B tokens of synthetic textbooks,blog posts,and stories to recreate the Phi-1.5 training dataset,which they dubbed Cosmopedia.To make t
97、his process easier,NVIDIA released the Nemotron-4-340B family,a suite of models designed specifically for synthetic data generation,available via a permissive license.Metas Llama can also be used for synthetic data generation.It also appears possible to create synthetic high-quality instruction data
by extracting it directly from an aligned LLM, with techniques like Magpie. Models fine-tuned this way sometimes perform comparably to Llama-3-8B-Instruct.
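A minimal sketch of the Magpie-style extraction idea: prompt an aligned chat model with only the "start of user turn" portion of its chat template so it invents a plausible instruction, then feed that instruction back to get the response. The template strings and sampling settings below are illustrative assumptions.

```python
# Magpie-style synthetic instruction extraction sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Pre-query template: everything up to (and including) the start of the user turn.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

def sample(prompt: str, max_new_tokens: int = 256) -> str:
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, do_sample=True, temperature=1.0, top_p=0.99,
                         max_new_tokens=max_new_tokens, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

instruction = sample(pre_query).strip()   # the model invents a plausible user query
response = sample(
    pre_query + instruction + "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
print({"instruction": instruction, "response": response})
```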
As model builders motor ahead, researchers have focused on trying to assess whether there is a tipping point in the quantity of synthetic data that triggers these kinds of outcomes, and whether any mitigations work. But Team Model Collapse isn't going down without a fight. A Nature paper from Oxford and Cambridge researchers found model collapse occurs across various AI arc
100、hitectures,including fine-tuned language models,challenging the idea that pre-training or periodic exposure to small amounts of original data can prevent degradation(measured by Perplexity score).This creates a“first mover advantage”,as sustained access to diverse,human-generated data will become in
101、creasingly critical for maintaining model quality.However,these results are primarily focused on a scenario where real data is replaced with synthetic data over generations.In practise,real and synthetic data usually accumulates.Other research suggests that,provided the proportion of synthetic data
102、doesnt get too high,collapse can usually be avoided.Team Hugging Face built a 15T token dataset for LLM pre-training,using 96 CommonCrawl snapshots,which produces LLMs that outperform other open pre-training datasets.They also released an instruction manual.stateof.ai 2024#stateofai|31Web data is de
canted openly at scale - proving quality is key. FineWeb, the dataset, was created through a multi-step process including base filtering, independent MinHash deduplication per dump, selected filters derived from the C4 dataset, and the team's custom filters. Text extraction using the trafilatura library produced higher-quality data than the default CommonCrawl WET files, even though the resulting dataset was meaningfully smaller. They found deduplication drove performance improvements up to a point, before hitting diminishing returns and then worsening performance. The team also used llama-3-70b-instruct to annotate 500k samples from FineWeb, scoring each for its educational quality on a scale from 0 to 5. FineWeb-edu, which filtered out samples scored below 3, outperformed FineWeb and all other open datasets, despite being significantly smaller.
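A minimal sketch of MinHash near-deduplication, the kind of step FineWeb applies independently per CommonCrawl dump. The datasketch library, 5-word shingles, and 0.8 Jaccard threshold are illustrative choices, not the team's exact configuration.

```python
# MinHash near-duplicate removal sketch.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - 4, 1)):          # 5-word shingles
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, doc in enumerate(docs):
        m = minhash(doc)
        if lsh.query(m):                             # a near-duplicate is already indexed
            continue
        lsh.insert(str(idx), m)
        kept.append(doc)
    return kept

docs = ["the cat sat on the mat today", "the cat sat on the mat today", "a completely different document"]
print(len(deduplicate(docs)))  # -> 2
```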
While retrieval and embeddings are not new, growing interest in retrieval-augmented generation (RAG) has prompted improvements in the quality of embedding models. Retrieval and embeddings hit the center stage. Following the playbook that's proven effective in regular LLMs, massive performance improvements have come from scale (GritLM has 47B parameters vs the 110M common among prior embedding models). Similarly, the use of broad web-scale corpora and improved filtering methods has led to large improvement
108、s in the smaller models.Meanwhile,ColPali is a vision-language embedding model that exploits the visual structure of documents,not just their text embeddings,to improve retrieval.Retrieval models are one of the few subdomains where open models commonly outperform proprietary models from the biggest
labs. On the MTEB Retrieval Leaderboard, OpenAI's embedding model ranks 29th, while NVIDIA's open NV-Embed-v2 is top. Traditional RAG solutions usually involve creating text snippets 256 tokens at a time with sliding windows (128 tokens overlapping the prior chunk), as sketched below.
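A minimal sketch of that fixed-size chunking: 256-token windows with a 128-token stride, so each chunk overlaps half of the previous one. The tokenizer choice is an illustrative assumption.

```python
# Sliding-window chunking sketch for a RAG index.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk(text: str, size: int = 256, overlap: int = 128) -> list[str]:
    ids = tok(text, add_special_tokens=False)["input_ids"]
    stride = size - overlap
    chunks = []
    for start in range(0, max(len(ids) - overlap, 1), stride):
        chunks.append(tok.decode(ids[start:start + size]))
    return chunks

# Each chunk is then embedded and indexed independently, which is efficient but loses
# surrounding document context - the problem contextual retrieval targets.
```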
This makes retrieval more efficient, but significantly less accurate. Context proves a crucial driver of performance. Anthropic solved this using contextual embeddings, where a prompt instructs the model to generate text explaining the context of each chunk within the document.
They found that this approach leads to a 35% reduction in the top-20 retrieval failure rate (5.7% → 3.7%). It can then be scaled using Anthropic's prompt caching.
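A minimal sketch of the contextual-retrieval idea: before embedding each chunk, ask the model to write a short, document-aware context blurb and prepend it to the chunk. The prompt wording is an illustrative paraphrase of Anthropic's published recipe, and the model name is an assumption.

```python
# Contextual embeddings sketch: the combined (context + chunk) string is what gets embedded.
import anthropic

client = anthropic.Anthropic()

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short, succinct context to situate this chunk within the overall document
for the purposes of improving search retrieval of the chunk. Answer only with the context."""

def contextualize(document: str, chunk: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=150,
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    context = msg.content[0].text.strip()
    return f"{context}\n\n{chunk}"   # embed and index this combined string

# Prompt caching would reuse the (large, repeated) document portion across chunks, keeping costs down.
```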
As Fernando Diaz of CMU observed in a recent thread, this is a great example of techniques pioneered in one area of AI research (e.g. early speech retrieval and document expansion work) being applied to another. Another version of "what is new, is old". Research from Chroma shows that the choice of chunking strategy can affect retrieval performance by up to 9% in recall. Researchers are now pioneering novel approaches, like Ragnarök, which intro
113、duces a novel web-based arena for human evaluation through pairwise system comparisons.This addresses the challenge of assessing RAG quality beyond traditional automated metrics.Meanwhile,Researchy Questions provides a large-scale collection of complex,multi-faceted questions that require in-depth r
114、esearch and analysis to answer,drawn from real user queries.Many commonly used RAG benchmarks are repurposed retrieval or question answering datasets.They dont effectively evaluate the accuracy of citations,the importance of each piece of text to the overall answer,or the impact of conflicting point
s of information. Evaluation for RAG remains unsolved. As compute clusters grow larger, they become harder to build and maintain. Clusters require high-bandwidth, low-latency connections and are sensitive to device heterogeneity. Researchers see the potential for alternatives. Frontier labs face up to the realities of the power grid and work on mitigations. Google DeepMind has proposed Distributed Low-Communication (DiLoCo), an optimization algorithm that allows training to occur on multiple loosely connected "islands" of devices. Each island performs a large number of local update steps before communicating with the others, reducing the need for frequent data exchange. They're able to demonstrate fully synchronous optimization across 8 of these islands while reducing communication 500x. GDM also proposed a refined version of DiLoCo, optimized for asynchronous settings. Researchers at Prime Intellect released an open-source implementation and replication of DiLoCo, scaling it up 3x to demonstrate its effectiveness on 1B-parameter models.
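A minimal single-process sketch of the local-update / infrequent-sync pattern DiLoCo describes: each "island" takes many local AdamW steps, then islands synchronize only occasionally by averaging their parameter deltas and applying them with an outer optimizer. The inner/outer step counts, outer Nesterov SGD, and toy model are illustrative simplifications of the paper.

```python
# DiLoCo-style local-SGD sketch with infrequent synchronization.
import copy
import torch
import torch.nn as nn

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

global_model = make_model()
islands = [copy.deepcopy(global_model) for _ in range(8)]          # 8 loosely connected islands
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

H = 500  # local steps between synchronizations (communication reduced roughly H-fold)
for outer_step in range(10):
    for island in islands:
        island.load_state_dict(global_model.state_dict())
        inner_opt = torch.optim.AdamW(island.parameters(), lr=1e-3)
        for _ in range(H):                                          # each island's local data shard
            x, y = torch.randn(16, 32), torch.randn(16, 1)
            loss = nn.functional.mse_loss(island(x), y)
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
    # Communicate once: average each island's parameter delta into a "pseudo-gradient".
    outer_opt.zero_grad()
    for name, p in global_model.named_parameters():
        avg = torch.stack([dict(i.named_parameters())[name].detach() for i in islands]).mean(0)
        p.grad = -(avg - p.detach())                                # outer step moves toward the averaged island weights
    outer_opt.step()
```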
Data curation is an essential part of effective pre-training, but it is often done manually and inefficiently. This is both hard to scale and wasteful, especially for multimodal models. Could better data curation methods reduce training compute requirements? Usually, an entire dataset is processed upfront, which doesn't account for how the relevance of a training example can change over the course of learning. These methods are frequently applied before training, so they cannot adapt to changing needs during training. Google DeepMind's JEST selects entire batches of data jointly, rather than individual examples independently. The selection is guided by a learnability score (determined by a pre-trained reference model) which evaluates how useful a batch will be for training. It's able to integrate data selection directl
121、ining.Google DeepMinds JEST selects entire batches of data jointly,rather than individual examples independently.The selection is guided by a learnability score(determined by a pre-trained reference model)which evaluates how useful it will be for training.Its able to integrate data selection directl
122、y into the training process,making it dynamic and adaptive.JEST uses lower-resolution image processing for both data selection and part of the training,significantly reducing computational costs while maintaining performance benefits.Models produced by DeepSeek,01.AI,Zhipu AI,and Alibaba have achiev
123、ed strong spots on the LMSYS leaderboard,displaying particularly impressive results in math and coding.stateof.ai 2024#stateofai|37Chinese(V)LLMs storm the leaderboards despite sanctions Introduction|Research|Industry|Politics|Safety|Predictions The strongest models from Chinese labs are competitive
with the second-most powerful tier of frontier models produced by US labs, while challenging the SOTA on certain subtasks. These labs have prioritized computational efficiency to compensate for constraints around GPU access, learning to stretch their resources much further than their US peers. Chi
125、nese labs have different strengths.For example,DeepSeek has pioneered techniques like Multi-head Latent Attention to reduce memory requirements during inference and an enhanced MoE architecture.Meanwhile,01.AI has focused less on architectural innovation and more on building a strong Chinese languag
126、e dataset to compensate for its relative paucity in popular repositories like Common Crawl.+To drive international uptake and evaluation,Chinese labs have become enthusiastic open source contributors.A few models have emerged as strong contenders in individual sub-domains.stateof.ai 2024#stateofai|3
127、8And Chinese open source projects win fans around the world Introduction|Research|Industry|Politics|Safety|Predictions DeepSeek has emerged as a community favorite on coding tasks,with deepseek-coder-v2 for its combination of speed,lightness,and accuracy.Alibaba released the Qwen-2 family recently,a
128、nd the community has been particularly impressed by its vision capabilities,ranging from challenging OCR tasks to its ability to analyse complex art work.At the smaller end,the NLP lab at Tsinghua University has funded OpenBMB,a project that has spawned the MiniCPM project.These are small 1M diverse
129、 CRISPR-Cas operons,including various effector systems,that were mined from 26.2 terabases of assembled microbial genomes and metagenomes,spanning diverse phyla and biomes.Generated sequences are 4.8x more diverse vs.natural proteins from the CRISPR-Cas atlas.The median identity to the nearest natur
130、al protein typically fell between 40-60%.A model fine-tuned on Cas9 proteins can generate novel editors that were then validated in human cells.One such editor offered the best editing performance and 71.7%sequence similarity to SpCas9 and was open sourced as OpenCRISPR-1.We previously profiled how
131、LLMs(e.g.ProGen2)pre-trained on large and diverse datasets of natural protein sequences could be used to design functional proteins with vastly different sequences to their natural peers.Now,Profluent has finetuned ProGen2 on their CRISPR-Cas Atlas to generate functional genome editors with novel se
132、quences that,importantly,were shown to edit the DNA of human cells in vitro for the first time.stateof.ai 2024#stateofai|50Language models that learn to design human genome editors Introduction|Research|Industry|Politics|Safety|Predictions Two works from late 2023,PoseCheck and PoseBusters,showed th
133、at ML models for molecule generation and protein-ligand docking gave structures(poses)with gross physical violations.The fundamental problem with research at the intersection of biology and ML is that there are very few people with the skills to both train a frontier model and give it a rigorous bio
134、logical appraisal.stateof.ai 2024#stateofai|51Yet,evals and benchmarking in BioML remains poor Introduction|Research|Industry|Politics|Safety|Predictions+Even the AlphaFold3 paper didnt get away without a few bruises when a small start-up showed that using a slightly more advanced conventional docki
135、ng pipeline beat AF3.A new industry consortium led by Valence Labs,including major pharma companies(i.e.Recursion,Relay,Merck,Novartis J&J,Pfizer),is developing Polaris,a benchmarking platform for AI-driven drug discovery.Polaris will provide high-quality datasets,facilitate evaluations,and certify
136、benchmarks.Meanwhile,Recursions work on perturbative map-building led them to create a new set of benchmarks and metrics.In 2022,equivariant message passing neural networks(MPNN)combined with efficient many-body messages(MACE)were introduced at NeurIPS.Now,the authors present MACE-MP-0,which uses th
137、e MACE architecture and is trained on the Materials Project Trajectory dataset,which contains millions of structures,energies,magnetic moments,forces and stresses.The model reduces the number of message passing layers to two by considering interactions involving four atoms simultaneously,and it only
138、 uses nonlinear activations in selective parts of the network.It is capable of molecular dynamics simulation across a wide variety of chemistries in the solid,liquid and gaseous phases.To determine the properties of physical materials and how they behave under reactions,it is necessary to run atomic
-scale simulations that today rely on density functional theory. This method is powerful, but slow and computationally expensive. While faster, alternative approaches that calculate force fields (interatomic potentials) tend to have insufficient accuracy to be useful, particularly for reactive events and pha
140、se transitions.stateof.ai 2024#stateofai|52Foundation models across the sciences:inorganic materials Introduction|Research|Industry|Politics|Safety|Predictions+Characterising and generating structures for proteins that are not found in soluble form but are in membrane environments is challenging and
141、 hinders the development of drugs meant to target membrane receptors.So too is the design of protein folds that are large and include non-local topologies.Can AF2 and sequence models remedy this and give drug designers access to a larger soluble proteome with previously inaccessible folds?stateof.ai
142、 2024#stateofai|53Expanding the protein function design space:challenging folds and soluble analogues Introduction|Research|Industry|Politics|Safety|Predictions To do so,the authors first use an inverted AF2 model that generates an initial sequence given a target fold structure.These sequences are t
143、hen optimised by ProteinMPNN before structures are re-predicted by AF2 followed by filtering on the basis of structure similarity to the target structure.This AF2-MPNN pipeline was tested on three challenging folds:IGF,BBF and TBF,which have therapeutic utility.It was also possible to generate solub
le analogues of membrane-only folds, which could massively speed up drug discovery targeting membrane-bound receptor proteins. Deep learning, originally inspired by neuroscience, is now making inroads into modelling the brain itself. BrainLM is a foundation model built on 6,700 hours of human brain activity rec
145、ordings generated by functional magnetic resonance imaging(fMRI),which detects changes in blood oxygenation(left figure).The model learns to reconstruct masked spatiotemporal brain activity sequences and,importantly,it can generalise to held-out distributions(right figure).This model can be fine-tun
146、ed to predict clinical variables e.g.age,neuroticism,PTSD,and anxiety disorder scores better than a graph convolutional model or an LSTM.stateof.ai 2024#stateofai|54Foundation models for the mind:learning brain activity from fMRI Introduction|Research|Industry|Politics|Safety|Predictions+Classical a
147、tmospheric simulation methods like numerical weather prediction are costly and unable to make use of diverse and often scarce atmospheric data modalities.But,foundation models are well suited here.Microsoft researchers created Aurora,a foundation model that produces forecasts for a wide range of atm
148、ospheric forecasting problems such as global air pollution and high-resolution medium-term weather patterns.It can also adapt to new tasks by making use of a general-purpose learned representation of atmospheric dynamics.stateof.ai 2024#stateofai|55Foundation models across the sciences:the atmospher
149、e Introduction|Research|Industry|Politics|Safety|Predictions The 1.3B model is pre-trained on 1M hours of weather and climate data from 6 datasets,including forecasts,analysis data,reanalysis data,and climate simulations.The models encodes heterogeneous inputs into a standard 3D representation of th
150、e atmosphere across space and pressure-levels,which is evolved over time at inference by a vision transformer and decoded into specific predictions.Importantly,it is the first model to predict atmospheric chemistry(6 major air pollutants,e.g.ozone,carbon monoxide),which involves hundreds of stiff eq
151、uations,better than numerical models.The model is also 5,000 x faster than the Integrated Forecasting System that uses numerical forecasting.+MindEye2,is a generative model that maps fMRI activity to a rich CLIP space from which images of what the individual sees are reconstructed using a fine-tuned
Stable Diffusion XL. The model is trained on the Natural Scenes Dataset, an fMRI dataset built from 8 subjects whose brain responses were captured for 30-40 hours as they viewed hundreds of rich naturalistic stimuli from the COCO dataset during scanning sessions, each image shown for 3 seconds. Foundation models for the mind: reconstructing what you see. Decoding speech from brain recordings with implantable microelectrodes could enable communication for patients with impaired speech. In a recent case, a 45-year-old man with amyot
154、rophic lateral sclerosis(ALS)with tetraparesis and severe motor speech damage underwent surgery to implant microelectrodes into his brain.The arrays recorded neural activity as the patient spoke in both prompted and unstructured conversational settings.At first,cortical neural activity was decoded i
155、nto a small vocabulary of 50 words with 99.6%accuracy by predicting the most likely English phoneme being attempted.Sequences of phonemes were combined into words using an RNN,before moving to a larger 125,000-word vocabulary enabled by further training.stateof.ai 2024#stateofai|57Speaking what you
think. François Chollet, the creator of Keras, has partnered with Zapier co-founder Mike Knoop to launch the ARC Prize, offering a $1M prize fund for teams that make significant progress on the ARC-AGI benchmark. A new ch
157、allenge aims to refocus the industry on the path to AGI Introduction|Research|Industry|Politics|Safety|Predictions Chollet created the benchmark back in 2019 as a means of measuring models ability to generalize,focusing on tasks that are easier for humans and hard for AI.The tasks require minimal pr
158、ior knowledge and emphasise visual problem-solving and puzzle-like tasks to make it resistant to memorization.Historically,LLMs have performed poorly on the benchmark,with performance peaking at about 34%.Chollet is sceptical of LLMs ability to generalize to new problems outside of their training da
ta and is hoping the prize will encourage new research directions that will lead to a more human-like form of intelligence. The highest score so far is 46% (short of the 85% target). It's been achieved by the MindsAI team, who have used an LLM-based approach, employing active inference, fine-tuning the LLM
160、on test task examples and expanding it with synthetic examples to improve performance.On novel tasks,where LLMs are unable to rely on memory and retrieval,performance often degrades.This suggests that they still often struggle to generalize beyond familiar patterns without external help.stateof.ai 2
161、024#stateofai|59LLMs still struggle with planning and simulation tasks Introduction|Research|Industry|Politics|Safety|Predictions Even advanced LLMs like GPT-4 have difficulty reliably simulating state transitions in text-based games,especially for environment-driven changes.Their inability to consi
162、stently grasp causality,physics,and object permanence,makes them poor world-modellers,even on relatively straightforward tasks.Researchers found that LLMs accurately predict direct action results,like a sink turning on,around 77%of the time,but struggle with environmental effects,such as water filli
163、ng a cup in the sink,achieving only 50%accuracy for these indirect changes.Other research evaluated LLMs on planning domains,including Blocksworld,and Logistics.GPT-4 produced executable plans 12%of the time.However,using iterative prompting with external verification,Blocksworld plans hit 82%accura
164、cy and Logistics plans 70%accuracy after 15 rounds of feedback.When re-run with o1,performance jumped but was still far from perfect.Researchers are exploring methods to generate stronger internal reasoning processes,variously targeting both training and inference.The latter approach appears to unde
165、rpin OpenAI o1s jump in capabilities.stateof.ai 2024#stateofai|60Can LLMs learn to think before they speak?Introduction|Research|Industry|Politics|Safety|Predictions Quiet-STaR from a joint Stanford-Notbad AI team generates internal rationales during pre-training,using a parallel sampling algorithm
166、and custom meta-tokens to mark the beginning and end of these thoughts.The approach employs a reinforcement learning-inspired technique to optimize the usefulness of generated rationales,rewarding those that improve the models ability to predict future tokens.Meanwhile,Google DeepMind have targeted
inference, showing that for many types of problems, strategically applying more computation at test time can be more effective than using a much larger pre-trained model. A Stanford/Oxford team has also looked at scaling inference compute, finding that repeated sampling can significantly improve coverage. They suggest that using weaker and cheaper models with many attempts can outperform single attempts from their stronger and more expensive peers.
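A minimal sketch of repeated sampling: draw many attempts from a small, cheap model and pick an answer, here by simple majority vote (for code or formal math, a verifier such as unit tests or a checker would select instead). The model choice and sampling settings are illustrative assumptions.

```python
# Repeated sampling / best-of-n sketch with a cheap small model.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2-1.5B-Instruct")

def solve(question: str, n: int = 32) -> str:
    prompt = f"Question: {question}\nAnswer with a single number.\nAnswer:"
    outs = generator(prompt, num_return_sequences=n, do_sample=True, temperature=0.8,
                     max_new_tokens=16, return_full_text=False)
    answers = [(o["generated_text"].strip().split() or [""])[0] for o in outs]
    # Coverage (the chance at least one attempt is correct) grows with n; here we
    # reduce the n attempts to one answer by majority vote.
    return max(set(answers), key=answers.count)

print(solve("What is 17 * 24?"))
```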
One path to improving the robustness of LLM reasoning is to embrace an open-ended approach, such that models are capable of generating new knowledge. Open-endedness gathers momentum as a promising research direction. In a position paper, a Google DeepMind team framed open-ended systems as able to "continuously generate artifacts that are novel and learnable to an observer
170、”.They outline potential paths towards open-ended foundation models,including reinforcement learning,self-improvement,task generation,and evolutionary algorithms.On the self-improvement front,we saw STRATEGIST,a method for allowing LLMs to learn new skills for multi-agent games.The researchers used
171、a bi-level tree search approach,combining high-level strategic learning with low-level simulated self-play for feedback.It outperformed RL and other LLM-based approaches on Game of Pure Strategy and The Resistance:Avalon at action planning and dialogue generation.After prolonged training beyond the
172、point of overfitting(known as grokking),some researchers have argued that transformers learn to reason over parametric knowledge through composition and comparison tasks.stateof.ai 2024#stateofai|62But were implicit reasoning capabilities staring us in the face the whole time?Introduction|Research|I
173、ndustry|Politics|Safety|Predictions Researchers at Ohio State University argued that a fully grokked transformer outperformed then SOTA models like GPT-4-Turbo and Gemini-1.5-Pro on complex reasoning tasks with a large search space.They conducted mechanistic analyses to understand the internal worki
ngs of the models during grokking, revealing distinct generalizing circuits for different tasks. However, they found that while fully grokked models performed well on comparison tasks (comparing attributes based on atomic facts), they were less good at out-of-distribution generalization in composition tasks
175、.This raises questions about whether these are really meaningful reasoning capabilities versus memorization by another name,although the researchers believe that enhancing the transformer with better cross-layer memory sharing could resolve this.Drawing on a combination of LLMs and evolutionary algo
176、rithms,FunSearch uses an LLM to generate and modify programs,guided by an evaluation function that scores the quality of solutions.Searching for programs rather than direct solutions allows it to discover concise,interpretable representations of complex objects or strategies.This form of program sea
177、rch is one of the avenues that Chollet believes has the most potential to solve the ARC challenge.The Google DeepMind team applied it to the cap set problem in extremal combinatorics and online bin picking.In both cases,FunSearch discovered novel solutions that surpassed human-designed approaches.st
ateof.ai 2024 Program search unlocks new discoveries in the mathematical sciences. For agents to be useful, they need to be robust to real-world stochasticity, which SOTA models have historically struggled with. We're beginning to see
179、signs of progress.stateof.ai 2024#stateofai|64RL drives improvements in VLM performance Introduction|Research|Industry|Politics|Safety|Predictions DigiRL is a novel autonomous reinforcement learning approach for training in-the-wild device control agents specifically for Android devices.The method i
180、nvolves a two-stage process:offline reinforcement learning followed by offline-to-online reinforcement learning.It achieves a 62.7%task success rate on the Android-in-the-Wild dataset,a significant improvement on the prior SOTA.In 2019,Uber published Go-Explore,an RL agent that solved hard-explorati
181、on problems by archiving discovered states and iteratively returning to and exploring from promising ones.In 2024,LLMs are supercharging it.stateof.ai 2024#stateofai|65while LLMs improve RL performance Introduction|Research|Industry|Politics|Safety|Predictions Intelligent Go-Explore(IGE)uses an LLM
182、to guide state selection,action choice,and archive updating,rather than the original Go-Explores hand-crafted heuristics.This enabled more flexible and intelligent exploration in complex environments.This approach also allowed IGE to recognize and capitalize on promising discoveries,a key aspect of
183、open-ended learning systems.It significantly outperformed other LLM agents on mathematical reasoning,grid worlds,and text-based adventure games.Switching from GPT-4 to GPT-3.5 resulted in a significant performance drop across all environments,suggesting that IGEs performance scales with the capabili
184、ties of the underlying language model.To improve planning,approaches like MCTS,which helped to power AlphaGo,are slowly returning to the fore.Early results are promising,but will they be enough?stateof.ai 2024#stateofai|66Who remembers Monte Carlo Tree Search?Introduction|Research|Industry|Politics|
185、Safety|Predictions MultiOn and Stanford combined an LLM with MCTS,along with a self-criticism mechanism and direct preference optimization,to learn from different success and failure criteria.They found this improved Llama-3 70Bs zero-shot performance from 18.6%to 81.7%in real-world booking scenario
186、s,after a day of data collection,and up to 95.4%with online search.The longer-term question will be whether next-token prediction loss is too fine-grained.This risks limiting the ability of RL and MCTS to achieve agentic behavior by focusing too much on individual tokens and hindering the exploratio
187、n of broader,more strategic solutions.One of the big bottlenecks for training RL agents is a shortage of training data.Standard approaches like converting pre-existing environments(e.g.Atari)or manually building them are labor-intensive and dont scale.stateof.ai 2024#stateofai|67Could foundation mod
188、els make it easier to train RL agents at scale?Introduction|Research|Industry|Politics|Safety|Predictions Genie(winner of a Best Paper award at ICML 2024)is a world model that can generate action-controllable virtual worlds.It analyzed 30,000 hours of video game footage from 2D platformer games,lear
189、ning to compress the visual information and infer the actions that drive changes between frames.By learning a latent action space from video data,it can handle action representations without requiring explicit action labels,which distinguishes it from other world models.Genie is both able to imagine
190、 entirely new interactive scenes and demonstrate significant flexibility:it can take prompts in various forms,from text descriptions to hand-drawn sketches,and bring them to life as playable environments.This approach demonstrated applicability beyond games,with the team successfully applying the hy
191、perparameters from the game model to robotics data,without fine-tuning.
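A toy latent-action model conveys the core trick (heavily simplified relative to Genie's spatiotemporal tokenizer and VQ codebook; all sizes here are arbitrary): an encoder infers a discrete latent action from two consecutive frames, and a dynamics model must use that action to predict the next frame, so an action space emerges from unlabeled video.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy latent action model in the spirit of Genie (assumption: tiny MLPs on flattened
# 16x16 "frames" instead of Genie's spatiotemporal video tokenizer). No action labels
# are ever provided - the latent action space emerges from video alone.

FRAME_DIM, N_LATENT_ACTIONS, EMB = 16 * 16, 8, 32

class LatentActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * FRAME_DIM, 128), nn.ReLU(),
                                     nn.Linear(128, N_LATENT_ACTIONS))
        self.action_emb = nn.Embedding(N_LATENT_ACTIONS, EMB)
        self.dynamics = nn.Sequential(nn.Linear(FRAME_DIM + EMB, 128), nn.ReLU(),
                                      nn.Linear(128, FRAME_DIM))

    def forward(self, frame_t, frame_t1):
        logits = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Soft selection over discrete latent actions (a stand-in for Genie's VQ codebook).
        action = F.softmax(logits, dim=-1) @ self.action_emb.weight
        pred_t1 = self.dynamics(torch.cat([frame_t, action], dim=-1))
        return pred_t1, logits

model = LatentActionModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frames = torch.rand(64, 9, FRAME_DIM)           # stand-in for real video clips

for step in range(100):
    t = torch.randint(0, 8, (1,)).item()
    pred, _ = model(frames[:, t], frames[:, t + 1])
    loss = F.mse_loss(pred, frames[:, t + 1])    # reconstruction drives action discovery
    opt.zero_grad(); loss.backward(); opt.step()

# At inference, a user-chosen latent action index can drive the dynamics model directly,
# turning a single prompt frame into a controllable, playable rollout.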
192、Imperial and UBC's OMNI-EPIC used LLMs to create a theoretically endless stream of RL tasks and environments to help agents build upon previously learned skills.The system generates executable Python code that can implement simulated environments and reward functions for each task,and employs a model to assess whether newly generated tasks are sufficiently novel and complex.#stateofai|68Could foundation models make it easier to train RL agents at scale?Introduction|Research|Industry|Politics|Safety|Predictions
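The OMNI-EPIC loop can be sketched as follows (the "LLM" calls are canned stubs and the environment template is hypothetical): generate task code, keep it only if it executes and a model-of-interestingness judges it novel, then add it to the curriculum.

import random

# A sketch of the OMNI-EPIC pattern (assumption: the "LLM" here is a stub returning a
# canned environment; the real system prompts a model to write new simulated
# environments and reward functions, then uses another model to judge novelty).

ENV_TEMPLATE = '''
class GeneratedEnv:
    def __init__(self):
        self.state = 0
    def step(self, action):
        self.state += action
        return self.state, reward_fn(self.state)

def reward_fn(state):
    return 1.0 if state >= {target} else 0.0
'''

def llm_generate_task(archive):
    """Stub for 'write me a new task that builds on what the agent already solved'."""
    return ENV_TEMPLATE.format(target=random.randint(3, 30))

def llm_is_novel_and_feasible(task_code, archive):
    """Stub for the model-of-interestingness that filters near-duplicates."""
    return task_code not in archive

def validate(task_code):
    """Only keep tasks whose generated code actually executes."""
    namespace = {}
    try:
        exec(task_code, namespace)
        env = namespace["GeneratedEnv"]()
        env.step(1)
        return True
    except Exception:
        return False

archive = []
while len(archive) < 5:
    task = llm_generate_task(archive)
    if validate(task) and llm_is_novel_and_feasible(task, archive):
        archive.append(task)            # the agent would now be trained on this task

print(f"Curriculum of {len(archive)} generated environments ready.")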
193、New lab Sakana AI has been focused on attempting to enhance the creative capabilities of current frontier models.One of their first papers looks at using foundation models to automate research itself.stateof.ai 2024#stateofai|69Are scientists inventing their AI replacement?Introduction|Research|Industry|Politics|
194、Safety|Predictions The AI Scientist is an end-to-end framework designed to automate the generation of research ideas,implementation,and the production of research papers.After being given a starting template,it brainstorms novel research directions,before executing the experiments,and writing them u
195、p.The researchers claim their LLM-powered reviewer evaluates the generated papers with near-human accuracy.They used it to generate example papers about diffusion,language modeling,and grokking.These were convincing at first glance,but contained some flaws on closer examination.Yet,the s
196、ystem periodically displayed signs of unsafe behavior,e.g.importing unfamiliar Python libraries and editing code to extend experiment timelines.Metas TestGen-LLM combines multiple LLMs,prompts and configurations to leverage different models strengths to improve unit testing coverage for Android code
197、 on Instagram and Facebook.stateof.ai 2024#stateofai|70An ensemble approach appears to drive strong performance improvements in code Introduction|Research|Industry|Politics|Safety|Predictions It uses an assured approach,filtering generated tests to ensure they build successfully,pass reliably,and in
198、crease coverage before recommending them.This is the first large-scale industrial deployment of an approach that combines LLMs with verifiable guarantees of code improvement,addressing concerns about LLM hallucinations and reliability in a software engineering context.
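The "assured" part is essentially a filter pipeline, sketched below with stubbed build, flakiness, and coverage checks (Meta's system targets real Kotlin/Java test classes and production build infrastructure, so every function here is a placeholder).

# A sketch of the assured filtering idea behind TestGen-LLM (assumption: the
# generation and measurement steps are stubs, not Meta's tooling).

def generate_candidates(test_class: str) -> list[str]:
    """Stub: ask several LLMs / prompts / temperatures for extra test cases."""
    return [f"// candidate test {i} for {test_class}" for i in range(4)]

def builds(test_src: str) -> bool:
    return True                      # stub: compile the class with the new test added

def passes_reliably(test_src: str, runs: int = 5) -> bool:
    return True                      # stub: rerun to filter out flaky tests

def coverage_gain(test_src: str) -> float:
    return 0.7 if "candidate test 0" in test_src else 0.0   # stub: coverage delta

def assured_test_improvement(test_class: str) -> list[str]:
    recommended = []
    for candidate in generate_candidates(test_class):
        # Each filter gives a verifiable guarantee before a human ever sees the test.
        if builds(candidate) and passes_reliably(candidate) and coverage_gain(candidate) > 0:
            recommended.append(candidate)
    return recommended

print(assured_test_improvement("FeedUnitTest"))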
199、In deployment,TestGen-LLM improved about 10%of test classes it was applied to,with 73%of its recommendations accepted by developers.#stateofai|71 Introduction|Research|Industry|Politics|Safety|PredictionsSelf-driving embraces more modalities stateof.ai 2024 Wayve's LINGO-2 is the second generation of its vision-language-action mod
200、el,that,unlike its predecessor,can both generate real-time driving commentary and control a car,linking language explanations directly with decision-making and actions.Meanwhile,the company is using generative models to enhance its simulator with more real-world detail.PRISM-1 creates realistic 4D s
201、imulations of dynamic driving scenarios using only camera inputs.It enables more effective testing and training by accurately reconstructing complex urban environments,including moving elements like pedestrians,cyclists,and vehicles,without relying on LiDAR or 3D bounding boxes.+Last year,Meta's Seg
202、ment Anything impressed with its ability to identify and segment objects in images given any prompt.In July,they released Segment Anything 2(SAM 2),which stunned observers.stateof.ai 2024#stateofai|72Segment Anything gets boosters and expands to video Introduction|Research|Industry|Politics|Safety|P
203、redictions Meta has extended SAM to include video segmentation,training it on their own dataset(SA-V)of 51,000 real-world videos and 600,000 spatio-temporal masks.This dataset has been made available,along with the model,under an Apache 2.0 license.To build a unified model that works for both video
204、as well as image,Meta have made some adaptations.For example,they have included a memory mechanism to track objects across frames and an occlusion head to handle objects that disappear or reappear.They find it is more accurate and 6x faster than SAM 1 at image segmentation,while able to surpass the a
205、ccuracy of prior leading video segmentation models with 3x fewer interactions.The model is,however,less efficient at segmenting multiple objects simultaneously in video and can struggle with longer clips.LLMs and VLMs demonstrate their potential to help resolve data bottlenecks and longstand
206、ing usability hurdles stateof.ai 2024#stateofai|73Robotics(finally)becomes fashionable(again)as the big labs pile in Introduction|Research|Industry|Politics|Safety|Predictions(Chart:2021 vs 2024)Despite all eyes being on Gemini,the Google DeepMind team has steadily been increasing its robotics output,improving
207、the efficiency,adaptability,and data collection of robots.stateof.ai 2024#stateofai|74Google DeepMind quietly emerges as a robotics leader Introduction|Research|Industry|Politics|Safety|Predictions The team created AutoRT,a system that uses a VLM for environmental understanding and an LLM to suggest
208、 a list of creative tasks the robot could carry out.These models are then combined with a robot control policy.This helps to scale up deployment quickly in previously unseen environments.RT-Trajectory enhances robotic learning through video input.For each video in the dataset of demonstrations,a 2D
209、sketch of the gripper performing the task is overlaid.This provides practical visual hints to the model as it learns.The team have also improved the efficiency of transformers.SARA-RT is a novel up-training method to convert pre-trained or fine-tuned robotic policies from quadratic to linear attention
210、,while maintaining quality.Researchers have found Gemini 1.5 Pros multimodal capabilities and long context window makes it an effective way of interacting with robots via natural language.+Historically,robotics had significantly fewer open source datasets,tools,and libraries than other areas of AI-c
211、reating an artificially high barrier to entry.Hugging Face's LeRobot aims to bridge the gap,hosting pretrained models,datasets with human-collected demonstrations,and pre-trained demonstrations.And the community's loving it.stateof.ai 2024#stateofai|75Hugging Face pulls down barriers to entry Introduc
212、tion|Research|Industry|Politics|Safety|Predictions+Well-established in image and audio generation,diffusion models continue to demonstrate their effectiveness in generating complex action sequences in robotics.stateof.ai 2024#stateofai|76Diffusion models drive improvements in policy and action gener
213、ation Introduction|Research|Industry|Politics|Safety|Predictions A number of research groups are aiming to bridge the gap between high-dimensional observation and low-dimensional action spaces in robot learning.They create a unified representation that allows the learning algorithm to understand the
214、 spatial implications of actions.Diffusion models excel at modeling these kinds of complex,non-linear multimodal distributions,while their iterative denoising process allows for the gradual refinement of actions or trajectories.There are multiple ways of attacking this.Researchers at Imperial and Sh
215、anghai Qizhi Institute have opted for RGB images,which offer rich visual information and compatibility with pre-trained models.Meanwhile,a team at UC Berkeley and Stanford have leveraged point clouds,for their explicit 3D information.
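A minimal diffusion-policy sketch illustrates the mechanics (a toy MLP denoiser over flattened action trajectories conditioned on a low-dimensional state, whereas the cited systems condition on RGB images or point clouds): training learns to predict the noise added to demonstrated action sequences, and inference iteratively denoises a random trajectory into an executable one.

import torch
import torch.nn as nn

# A minimal diffusion-policy sketch (assumption: tiny MLP denoiser, toy demonstrations,
# arbitrary hyperparameters; real systems use far larger networks and schedulers).

HORIZON, ACT_DIM, OBS_DIM, STEPS = 16, 2, 10, 50
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(nn.Linear(HORIZON * ACT_DIM + OBS_DIM + 1, 256), nn.ReLU(),
                         nn.Linear(256, HORIZON * ACT_DIM))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

def demo_batch(n=64):
    """Stand-in for human demonstrations: obs and the action trajectory to imitate."""
    obs = torch.rand(n, OBS_DIM)
    actions = obs[:, :1].unsqueeze(1).repeat(1, HORIZON, ACT_DIM).reshape(n, -1)
    return obs, actions

for it in range(200):                       # training: predict the noise added to actions
    obs, actions = demo_batch()
    t = torch.randint(0, STEPS, (obs.shape[0], 1))
    noise = torch.randn_like(actions)
    noisy = alphas_bar[t].sqrt() * actions + (1 - alphas_bar[t]).sqrt() * noise
    pred = denoiser(torch.cat([noisy, obs, t / STEPS], dim=-1))
    loss = nn.functional.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def sample_actions(obs):                    # inference: iteratively denoise a trajectory
    x = torch.randn(obs.shape[0], HORIZON * ACT_DIM)
    for step in reversed(range(STEPS)):
        t = torch.full((obs.shape[0], 1), step)
        eps = denoiser(torch.cat([x, obs, t / STEPS], dim=-1))
        x = (x - betas[step] / (1 - alphas_bar[step]).sqrt() * eps) / (1 - betas[step]).sqrt()
        if step > 0:
            x += betas[step].sqrt() * torch.randn_like(x)
    return x.reshape(-1, HORIZON, ACT_DIM)

print(sample_actions(torch.rand(2, OBS_DIM)).shape)   # (2, 16, 2) action sequence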
216、+Robotics policies have often been hampered by a lack of generalizability,due to limited real-world data.Rather than finding more data,researchers are injecting more structure and knowledge into what we already have.stateof.ai 2024#stateofai|77Can we stretch existing real-world robotics data further than we currently do?Introduction|Research|Industry|Politics|Safety
217、|Predictions One approach,outlined by a Carnegie Mellon team,involves learning more“affordance”information from human video data,such as hand poses,object interactions,and contact points.This information can then be used to fine-tune existing visual representations to make them more suitable for ro
218、botic tasks.This consistently improved performance on real-world manipulation tasks.Meanwhile,a Berkeley/Stanford team found that chain-of-thought reasoning could have a similar impact.Rather than just predicting actions directly,the enhanced models are trained to reason step-by-step about plans,sub
219、-tasks,and visual features before deciding on actions.This approach uses LLMs to generate training data for the reasoning steps.+Its challenging to model the intricacies of human behavior with imitation learning,which relies on human demonstrators.While effective,its difficult to implement at scale.
220、Stanford has some workarounds.stateof.ai 2024#stateofai|78Can we overcome the data bottleneck for humanoids?Introduction|Research|Industry|Politics|Safety|Predictions HumanPlus is a full-stack system for humanoids to learn from human data.It combines a real-time shadowing system and an imitation lea
221、rning algorithm.The shadowing system uses a single RGB camera and a low-level policy to allow human operators to control the humanoids whole body in real-time.This low-level control policy is trained on a large dataset of human motion data in simulation and transfers to the real world without additi
222、onal training.The imitation learning component enables efficient learning of autonomous skills from shadowing data.It uses binocular egocentric vision and combines action prediction with forward dynamics prediction.The system demonstrates impressive results on a variety of tasks,including complex ac
223、tions like wearing a shoe and walking,using only up to 40 demonstrations.+Boston Dynamics Spot showcased progress in mobility and stability for embodied AI but lacked manipulation skills.Researchers are now addressing this gap.A Stanford/Columbia team combined real-world demonstration data with simu
224、lation-trained controllers to focus on controlling the robots gripper movement rather than individual joints.This approach simplifies transferring manipulation skills from stationary arms to mobile robots.Meanwhile,a UC San Diego team developed a two-part system:a low-level policy for executing comm
225、ands and a high-level policy for generating visual-based commands,enhancing the robot's manipulation capabilities.stateof.ai 2024#stateofai|79Back with a vengeance:robot doggos Introduction|Research|Industry|Politics|Safety|Predictions+While consumer demand for the Vision Pro has been lacklustre so far,it's ta
226、king robotics research by storm,where its high-res displays,advanced tracking,and processing power are being leveraged by researchers working on teleoperation(controlling robot movements and actions at a distance).Systems like Open-TeleVision and Bunny-Vision Pro use it to help enable precise control of multi-
227、finger robotic hands(at a 3000 mile distance in the case of the former),demonstrating improved performance on complex manipulation tasks compared to previous approaches.They address challenges such as real-time control,safety through collision avoidance,and effective bimanual coordination.stateof.ai
228、 2024#stateofai|80The Apple Vision Pro emerges as the must-have robotics research tool Introduction|Research|Industry|Politics|Safety|Predictions Last year,a non-finetuned GPT-4 via one API call was highly competitive with Googles Med-PaLM2 on certain medical knowledge benchmarks.Gemini has ridden t
229、o the rescue.stateof.ai 2024#stateofai|81To finetune or not to finetune(in medicine)?Introduction|Research|Industry|Politics|Safety|Predictions The Med-Gemini family of multimodal models for medicine are finetuned from Gemini Pro 1.0 and 1.5 using various medical datasets and incorporate web search
230、for up-to-date information.They achieved SOTA 91.1%accuracy on MedQA,surpassing GPT-4.For multimodal tasks(e.g.in radiology and pathology),Med-Gemini set a new SOTA on 5 out of 7 datasets.When quality errors in questions were fixed,model performance improved and it exhibited strong reasoning across oth
231、er benchmarks.It also achieved high precision and recall in retrieving rare findings in lengthy EHRs-a challenging needle-in-a-haystack task.In a preliminary study,clinicians rated Med-Geminis outputs equal or better than human-written examples in most cases.High-quality medical imaging datasets are
232、 hard to come by or,even when they exist,difficult to license for research or commercial products.They are also not immune to distributional shifts.And yet,realistic image generators have flooded the internet in the last year.Could these be repurposed to generate realistic medical images that are useful for model training
233、,despite the large visual and semantic differences between natural images and medical images?stateof.ai 2024#stateofai|82Generating synthetic data in medicine By jointly fine-tuning both the U-Net and the CLIP text encoder from Stable Diffusion on a large dataset of real chest x-rays(CXR)and correspond
234、ing radiologist reports,it is possible to generate synthetic CXR scans with high fidelity and conceptual correctness as evaluated by board-certified radiologists.Generated CXRs can be used for data augmentation and self-supervised learning.Consistent with other modalities,supervised classification p
235、erformance drops slightly when training on purely synthetic data.Introduction|Research|Industry|Politics|Safety|Predictions Moreover,generative models can improve the fairness of medical classifiers by enriching training datasets with synthetic examples that fill out underrepresented data points.
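The joint fine-tuning recipe roughly follows the standard text-to-image training loop, sketched below with the diffusers and transformers libraries (the base checkpoint, dummy data, and hyperparameters are placeholders, and the paper's exact code will differ): both the U-Net and the CLIP text encoder receive gradients, so the text encoder adapts to radiology-report language while the U-Net adapts to CXR pixels.

import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

# A hedged sketch of the joint fine-tuning recipe described above (assumption: the
# model id, dummy data, and hyperparameters are placeholders, not the paper's code).

model_id = "runwayml/stable-diffusion-v1-5"          # placeholder base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").requires_grad_(False)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

opt = torch.optim.AdamW(list(unet.parameters()) + list(text_encoder.parameters()), lr=1e-5)

def get_batch():
    """Placeholder for a real loader of (chest x-ray, radiology report) pairs."""
    images = torch.randn(2, 3, 512, 512)
    reports = ["No acute cardiopulmonary process.", "Right lower lobe consolidation."]
    return images, reports

for step in range(10):
    images, reports = get_batch()
    with torch.no_grad():
        latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    ids = tokenizer(reports, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
    text_emb = text_encoder(ids)[0]                   # gradients flow into the text encoder
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy = noise_scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)                    # standard epsilon-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()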
236、Traditional Robotic Process Automation(RPA),embodied by UiPath,has struggled with high set-up costs,brittle execution,and burdensome maintenance.Two novel approaches,FlowMind(JP Morgan)and ECLAIR(Stanford),use foundation models to address these limitations.FlowMind focuses on financial workflows,using LLMs to
237、 generate executable workflows via APIs.In experiments on the NCEN-QA dataset,FlowMind achieved 99.5%accuracy in workflow understanding.ECLAIR takes a broader approach,using multimodal models to learn from demonstrations and interact directly with graphical user interfaces across various enterprise
238、settings.On web navigation tasks,ECLAIR improved completion rates from 0%to 40%.stateof.ai 2024#stateofai|83Enterprise automation set to get an AI-first upgrade Introduction|Research|Industry|Politics|Safety|Predictions As AI emerges as the new competitive battleground,big tech companies begin to ho
239、ld more details of their work close to their chest.Frontier labs have meaningfully cut publication levels for the first time since this report began,while academia gets into gear.stateof.ai 2024#stateofai|84The global balance of power in AI research remains unchanged,but academia gains Introduction|
240、Research|Industry|Politics|Safety|Predictions(Charts:Year-on-year change in AI publication levels;Proportion of AI publications by country)Section 2:Industry stateof.ai 2024#stateofai|85 Introduction|Research|Industry|Politics|Safety|Predictions Amid growing demand for its hardware to power demanding gen AI wo
241、rkloads,every major lab depends on NVIDIA for its hardware.Its market cap hit$3T in June,only the third US company to reach this milestone(following Microsoft and Apple).Following blowout earnings in Q2,its position looks as unassailable as ever.stateof.ai 2024#stateofai|86NVIDIA becomes the worlds
242、most powerful company Introduction|Research|Industry|Politics|Safety|PredictionsChatGPT launches NVIDIA has already booked significant pre-sales on its new Blackwell family of GPUs and is making a strong play for governments.stateof.ai 2024#stateofai|87and its ambitions are only growing Introduction
243、|Research|Industry|Politics|Safety|Predictions The new Blackwell B200 GPU and GB200 Superchip promise significant performance gains over the Hopper architecture of H100 fame.NVIDIA claims it can reduce cost and energy consumption 25x over an H100.In a mark of NVIDIA's power,every major AI lab CEO prov
244、ided a supporting quote in the press release.While the Blackwell architecture was delayed by manufacturing issues,the company is still confident of booking several billion in revenue from it by the end of the year.Jensen Huang,NVIDIAs Founder and CEO is expanding the pitch,outlining the companys vis
245、ion of sovereign AI.He has argued that every government needs to build its own LLM to preserve its national heritage.Youll never guess whose hardware he thinks is optimal for the task AMD and Intel have started to invest in their software ecosystems,while AMD has made a heavy pitch to the open sourc
246、e community using ROCm(its CUDA competitor).However,they are yet to develop compelling alternatives to NVIDIAs portfolio of networking solutions.AMD is hoping its planned$4.9B acquisition of server builder ZT Systems will change this.Meanwhile,Intel has seen its hardware sales decline.Short of regul
247、atory intervention,a change in research paradigm or supply constraints,NVIDIA's position seems unassailable.stateof.ai 2024#stateofai|88Established competitors fail to narrow the gap Introduction|Research|Industry|Politics|Safety|Predictions(Charts:Q2 2024 revenue;Q2 2024 earnings per share)We looked at the$6
248、B invested in AI chip challengers since 2016 and asked what would have happened if investors had just bought the equivalent amount of NVIDIA stock at that day's price.The answer is the lime green line:that$6B would be worth$120B of NVIDIA stock today(20x!)vs.the$31B(5x)in its startup contenders.stateof.ai 20
249、24#stateofai|89Buying NVIDIA stock would've been far better than investing in its start-up contenders Introduction|Research|Industry|Politics|Safety|Predictions(Chart:NAV of challengers vs.$invested)Note:Market pricing and valuation data retrieved as of 9 Oct 2024.NAV=net asset value.#stateofai|90 Introd
250、uction|Research|Industry|Politics|Safety|PredictionsBut not everyone believes the line can only go up stateof.ai 2024 A vocal minority of analysts and commentators arent convinced.They point to the decline in GPU scarcity,how only a few companies are currently generating reliable revenue from AI-fir
251、st offerings,and how even Big Techs infrastructure build-out is unlikely to be big enough to justify the companys current valuation.The market is currently ignoring these voices and seems more inclined to agree with early Tesla investor James Andersons view that the company could be worth“double-dig
252、it trillions”in a decade.The number of large-scale NVIDIA A100 GPU clusters has stayed constant as the industry focuses its dollars on the H100 and shinier Blackwell systems(more on the next slide!)stateof.ai 2024#stateofai|91Compute Index:NVIDIA A100 clusters Introduction|Research|Industry|Politics|S
253、afety|PredictionsThe real large-scale GPU cluster growth has come from H100s.The largest continues to be Metas 350k H100s,followed by xAIs 100k cluster and Teslas 35k.Meanwhile,Lambda,Oracle and Google have been building large clusters summing over 72k H100s.Companies including Poolside,Hugging Face
254、,DeepL,Recursion,Photoroom and Magic have built over 20k worth of H100 capacity.Moreover,the first GB200 clusters are going live(e.g.10,752 at the Swiss National Supercomputing Centre),while OpenAI will have access to 300,000 by the end of next year.stateof.ai 2024#stateofai|92Compute Index:NVIDIA H100
255、 clusters(while GB200s are loading)Introduction|Research|Industry|Politics|Safety|PredictionsBy last years count,NVIDIA was used 19x more than all of its peers combined in AI research papers(note the log-scale y-axis!).This year,this lead has compressed to 11x,due in part to the 522%growth in papers
256、 that use TPUs(gap is now 34x with NVIDIA).We also note the 353%growth in the use of Huaweis Ascend 910,the 61%growth of large AI chip start-up contenders and the new appearance of Apples silicon.stateof.ai 2024#stateofai|93Compute Index:NVIDIA continues to be the preferred option in AI research pap
257、ers Introduction|Research|Industry|Politics|Safety|PredictionsUsage of A100s continues to grow(+59%YoY)alongside the H100(+477%)and the 4090(+262%),albeit from a much lower base.The V100(now 7 years old,-20%),continues to be used at half the rate of the A100(now 4 years old),further demonstrating th
258、e longevity of NVIDIA systems for AI research.stateof.ai 2024#stateofai|94Compute Index:NVIDIA continues to be the preferred option in AI research papers Introduction|Research|Industry|Politics|Safety|PredictionsMeanwhile in start-up land,Cerebras appears to be pulling ahead of the pack with 106%gr
259、owth in the number of AI research papers that make use of its wafer-scale systems.Groq,which launched their LPU recently,saw its first usage in AI research papers last year.Meanwhile,Graphcore was acquired by SoftBank in mid-2024.Unlike their common enemy,NVIDIA,these AI chip start-ups have mos
260、tly pivoted from selling systems to inference interfaces on top of open models.stateof.ai 2024#stateofai|95Compute Index:AI chip start-ups Introduction|Research|Industry|Politics|Safety|PredictionsEver since the A100 launch in 2020,NVIDIA has been cutting down the time to ship its next datacenter GP
261、U while significantly increasing the TFLOPs they deliver.In fact,timelines have come down by 60%from A100 to H100 and down a further 80%from H200 to GB200.During that time,TFLOPs have gone up 6x.Large cloud companies are buying huge amounts of these GB200 systems:Microsoft between 700k-1.4M,Google 4
262、00k and AWS 360k.OpenAI is rumored to have at least 400k GB200 to itself.stateof.ai 2024#stateofai|96More TFLOPs:NVIDIA compresses its product release timelines Introduction|Research|Industry|Politics|Safety|PredictionsThe speed of data communication between GPUs within a node(scale-up fabric),as we
263、ll as between nodes(scale-out fabric),is critical to large-scale cluster performance.NVIDIA's technology for the former,NVLink,has seen bandwidth per link,the number of links,and the total number of GPUs connected per node increase significantly over the last 8 years.Coupled with their InfiniBand technology
264、 for connecting nodes into large-scale clusters,NVIDIA is ahead of the pack.Meanwhile,Chinese companies like Tencent have reportedly innovated around sanctions for similar outcomes.Their Xingmai 2.0 high-performance computing network,which is said to support over 100,000 GPUs in a single cluster,im
265、proves network communication efficiency by 60%and LLM training by 20%.That said,it is not clear whether Tencent possesses clusters of this size.stateof.ai 2024#stateofai|97Scaling up and out with faster connections between GPUs and nodes Introduction|Research|Industry|Politics|Safety|Predictions#sta
266、teofai|98 Introduction|Research|Industry|Politics|Safety|PredictionsBut running large clusters continues to be an art and a science of interruptionsstateof.ai 2024 On publishing their Llama 3 family of models,Meta shared a breakdown of the 8.6 job interruptions per day they experienced during a 54-d
267、ay period of pre-training Llama 3 405B.GPUs tend to experience failures more frequently than CPUs,and clusters are by no means created equal.Continuous monitoring is essential,misconfigurations and dead-on-arrival components happen too often due to insufficient testing,and low-cost power,afforda
268、ble networking rates and availability are paramount.More on power needs in the Politics section!While big tech companies have long produced their own hardware,these efforts are accelerating as they seek to at least improve their bargaining power with NVIDIA-but these arent tackling the most challeng
269、ing workloads.stateof.ai 2024#stateofai|99Big labs seek to weaken their NVIDIA addiction Introduction|Research|Industry|Politics|Safety|Predictions Known for its TPUs,Google has unveiled the Axion,built on the Armv9 architecture and instruction set.These will be made available through Cloud for gene
270、ral-purpose workloads and achieves 30%better performance than the fastest general-purpose Arm-based instances currently available.Meta has unveiled the second generation of its in-house AI inference accelerator,which more than doubles the compute and memory bandwidth of its predecessor.The chip is c
271、urrently used for ranking and recommendation algorithms,but Meta plans to expand its capabilities to cover training for generative AI.Meanwhile,OpenAI has been hiring from Googles TPU team and is in talks with Broadcom about developing a new AI chip.Sam Altman has also reportedly been in talks with
272、major investors,including the UAE government,for a multi-trillion dollar initiative to boost chip production.Riding the NVIDIA tidal wave,AI chip challengers are fighting for a slice of the(VC and customer)pie.stateof.ai 2024#stateofai|100And a handful of challengers demonstrate signs of traction Cere
273、bras,known for their Wafer-Scale Engine,which integrates an entire supercomputer's worth of compute onto one wafer-sized processor,has filed to IPO on$136M in revenue for H1 2024(up 15.6x YoY),87%of which came from Abu Dhabi-based and state-backed G42.The company has raised over$700M with customers in
274、 the compute-intensive energy and pharma sectors.It recently launched an inference service to serve LLMs with faster token generation.Introduction|Research|Industry|Politics|Safety|Predictions Meanwhile,Groq raised a$640M Series D at a$2.8B valuation for its Language Processing Unit,designed solely
275、for AI inference tasks.So far,Groq has landed partnerships with Aramco,Samsung,Meta,and green compute provider Earth Wind&Power.Both companies are focusing on speed as a core differentiator and are working on cloud services,with Cerebras recently launching an inference service.This helps them swerve NVIDIA's
276、 software ecosystem advantage,but gives them a new(challenging)competitor in the form of cloud services providers.Known for betting big,SoftBank is entering the arena,tasking subsidiary Arm with launching its first AI chips in 2025 and acquiring struggling UK start-up Graphcore for a rumoured$600-70
277、0M.stateof.ai 2024#stateofai|101While SoftBank starts to build its own chip empire(after prematurely selling NVIDIA)Introduction|Research|Industry|Politics|Safety|Predictions Arm is already a player in the AI world,but historically,its instruction set architecture has not been optimal for the large-
278、scale parallel processing infrastructure required for datacenter training and inference.Its also struggled against NVIDIAs entrenched data center business and mature software ecosystem.With a current market cap of over$140B,markets arent bothered.The company is reportedly already in talks with TSMC
279、and others about manufacturing.SoftBank also scooped up Graphcore,which pioneered Intelligent Processing Units,a processor designed to handle AI workloads more efficiently than GPUs and CPUs,using small volumes of data.Despite its sophistication,the hardware was often not a logical choice for genAI
280、applications as they took off.The company will operate semi-autonomously under the Graphcore brand.Meanwhile,Softbanks talks with Intel on designing a GPU challenger stalled after they were unable to agree on requirements.As US export controls widen,previously sanctions-compliant chips have found th
281、emselves on the wrong side of tougher performance thresholds.That hasnt deterred chip manufacturers.stateof.ai 2024#stateofai|102The US Commerce Department plays whack-a-mole with chip manufacturers Introduction|Research|Industry|Politics|Safety|Predictions In last years report,we documented how NVI
282、DIA had booked over$1B in sales of the A800/H800(their special China-compliant chip)to major Chinese AI labs.The US then banned sales to China,forcing a rethink.US Commerce Secretary Gina Raimondo has warned that“if you redesign a chip around a particular cut line that enables China to do AI Im goin
283、g to control it the very next day”.NVIDIAs new China chip,the H20 is theoretically significantly weaker than top-line NVIDIA hardware,if you measure by raw computing power.However,NVIDIA have optimised it for LLM inference workloads,meaning it is now 20%faster than the H100 on reasoning tasks.NVIDIA
284、 are set to book$12B in sales.China,however,is proportionally becoming less important to US chip manufacturers.It's gone from representing 20%of NVIDIA's data center business to“mid-single digits”,according to NVIDIA.While Chinese labs face restrictions in their ability to import hardware,there are cu
285、rrently no controls on their local affiliates renting access to it overseas.ByteDance rents access to NVIDIA H100s via Oracle in the US,while Alibaba and Tencent are reportedly in conversations with NVIDIA about setting up their own US-based datacenters.Meanwhile,Google and Microsoft have directly p
286、itched big Chinese firms on their cloud offerings.The US is planning to make hyperscalers report this kind of usage via a KYC scheme,but is yet to draw up plans to prohibit it.stateof.ai 2024#stateofai|103but opts not to restrict the use of hardware by Chinese labs in US data centers Introduction|Re
287、search|Industry|Politics|Safety|Predictions There is a growing number of increasingly sizeable NVIDIA chip sales to Chinese end customers via Asian intermediary dealers(particularly Malaysia,Hong Kong and Japan)who facilitate trades using shell companies with fictitious business presences and even t
288、emporary data centers.stateof.ai 2024#stateofai|104Small-scale no more:Semiconductor smugglers get increasingly sophisticated Introduction|Research|Industry|Politics|Safety|Predictions In one case,a Chinese electric appliance company placed a$120 million order for a 2,400 NVIDIA H100 cluster via a M
289、alaysian broker.Given the size of the order,NVIDIA mandated an in-person check to ensure proper installation of the system.The broker told The Information,which reported on this event,that he had“coordinated the rental,installation and activation of the servers in a spare data center facility in Joh
290、or Bahru,a Malaysian town adjacent to the Singapore border and home to large clusters of data centers.NVIDIA inspectors checked the servers there and left.Shortly afterward,the servers were whisked away to China via Hong Kong.”Another Hong Kong-based chip broker has accumulated restricted 4,800 H100
291、s via purchases from Dell and Supermicro using shell companies based in non-US-sanctioned countries.These were sold for$230M to a Chinese buyer,a substantial premium on their acquisition cost of$180M.Many of the buzziest start-ups working on generative AI are raising at record,often three-digit,reve
292、nue multiples.While these might be an indication of investor confidence in future returns,it sets a high bar,as many of these companies currently have no identified path to profitability.However,this isnt true for everyone,as the biggest model providers see revenue begin to ramp up.stateof.ai 2024#s
293、tateofai|105But wheres the revenue?Introduction|Research|Industry|Politics|Safety|Predictions OpenAI is on course to see revenues triple in the space of a year,but training,inference,and staffing costs mean losses are continuing to mount.Theyre not the only leader in search of functional economics.s
294、tateof.ai 2024#stateofai|106and wheres the margin?Introduction|Research|Industry|Politics|Safety|Predictions Meta has produced an incredible vibe shift in public markets by ditching their substantial metaverse investments and pivoting hard into open source AI with their Llama models.Mark Zuckerberg
295、is,arguably,the de facto messiah of open source AI,counter-positioning vs.OpenAI,Anthropic,and Google DeepMind.stateof.ai 2024#stateofai|107Perhaps it's neither:vibes are all you need(to recover your share price)Introduction|Research|Industry|Politics|Safety|Predictions(Chart of Meta's share price,annotated:1:Oct 28,'21:Metaverse investment an
296、nounced.2:Nov 9,'22:Major layoffs and metaverse tempering.3:Feb 24,'23:Llama 1.4:Jul 18,'23:Llama 2.5:Apr 18,'24:Llama 3.6:Jul 23,'24:Llama 3.1 405B.7:Sept 25,'24:Llama 3.2.Down 68.4%/-$601B,then up 457%/+$1.134T.tl;dr:FAIR+GenAI+Llama saved Meta.)stateof.ai 2024#stateofai|108The top quality model,OpenAI's o1,comes
297、at significant price and latency premiums Introduction|Research|Industry|Politics|Safety|Predictions As the model menu matures,developers are choosing the right tool for the job(and their budget).stateof.ai 2024#stateofai|109Inferencing all the way down:models get cheaper Introduction|Research|Ind
298、ustry|Politics|Safety|Predictions(Charts:a 60x and a 100x price drop between 2023 and 2024)Once thought to be prohibitively expensive to serve,the inference cost of serving strong models is dropping.stateof.ai 2024#stateofai|110Google Gemini produced a strong model series with very competitive pricing Introduction|Rese
299、arch|Industry|Politics|Safety|PredictionsNote:Pricing for 128k token prompts and outputs.Retrieved 4 Oct 2024 Prices on Gemini 1.5 Pro and 1.5 Flash have been dropped by 64-86%a few months after launch,while offering strong performance,e.g.Flash-8B is 50%cheaper than 1.5 Flash yet comparable across
300、 many benchmarks(chart annotations:76%cut,86%cut).#stateofai|111 Introduction|Research|Industry|Politics|Safety|Predictionsstateof.ai 2024Over the summer,Anthropic and then Vercel launched the capability for their chat agents Claude and V0 to open coding environments in which code is written and run in the browser to so
301、lve a users request.This brings previously static code snippets to life,enabling users to iterate with the agent in real time,and to reduce the barrier for creating software products.Needless to say,social media GenAI fans loved this!Below are examples of Claude Artifacts and V0 generating a playabl
302、e Minesweeper game from a single prompt.Chat agents as interactive developer sidekicks#stateofai|112 Introduction|Research|Industry|Politics|Safety|Predictionsstateof.ai 2024The most successful technology companies like Apple,Google,or TikTok have taken a product-first approach,rather than simply building a
303、foundational technology and an API.As base model performance converges,OpenAI,Anthropic,and Meta are visibly putting more thought into what their product looks and feels like-whether its Artifacts from Claude,OpenAIs Advanced Voice functionality,or Metas hardware partnerships and lip-syncing tools.S
304、imply building a good model won't be all you need,as AI labs move from building models to designing products.European leaders have been desperate to point to a domestic success story as US labs have occupied the spotlight.For now,Mistral remains the continent's primary bright spark.stateof.ai 2024#sta
305、teofai|113While les grands modèles catch on,another European challenger loses steam Introduction|Research|Industry|Politics|Safety|Predictions With over €1B in the bank,Mistral has emerged as the undisputed European foundation model champion,demonstrating both impressive computational efficiency a
306、nd multilingual capabilities.Au Large,its flagship model is available via Azure as part of the companys new partnership with Microsoft.The company has started striking partnerships with both French companies like BNP Paribas and international start-ups like Harvey AI.The company is also beginning to
307、 bulk out its US sales function.Meanwhile,self-styled German sovereign AI champions Aleph Alpha have struggled.Following a controversy in German media about how its$500M funding round was declared,the companys closed models have underperformed freely available peers.As a result,the company appears t
308、o be pivoting to licensing Llama 2-3 and DBRX.In last years report,we touched on Databricks and Mosaics LLM combined strategy,which focused on fine-tuning models on customers data.Is the bring your own model era over?stateof.ai 2024#stateofai|114 Introduction|Research|Industry|Politics|Safety|Predic
309、tions The Mosaic research team,now folded into Databricks,open-sourced DBRX in March.A 132B MoE model,DBRX was trained on just over 3,000 NVIDIA GPUs at a cost of$10M.Databricks is pitching the model as a foundation for enterprises to build on and customize,while remaining in control of their own da
310、ta.Meanwhile,Snowflakes Arctic is pitched as the most efficient model for enterprise workflows,based on a set of metrics covering tasks including coding and instruction following.Its unclear how much enterprises are willing to invest in costly custom model tuning,given the constant set of releases a
311、nd improvements driven by bigger players.With readily available open source frontier models,the appeal of training custom models is increasingly unclear.Databricks and Snowflake pivot to build their own models,but can they compete?Given the high compute costs involved,model builders increasingly rely on
312、 partnership arrangements with established Big Tech companies.Antitrust regulators worry that this will further entrench incumbents.stateof.ai 2024#stateofai|115Regulators scrutinize the relationships between key generative AI players Introduction|Research|Industry|Politics|Safety|Predictions Regula
313、tors have particularly zeroed in on the close relationship between OpenAI and Microsoft,along with Anthropics ties to Google and Amazon.Regulators fear that big tech companies are either essentially buying out competition or providing friendly service provision deals to companies that theyve investe
314、d in-potentially disadvantaging competitors.Theyre particularly nervous about the influence NVIDIA wields over the ecosystem and its decision to make direct investments.France is contemplating NVIDIA-specific charges.Big Tech companies are attempting to place some clear blue water between themselves
315、 and start-ups,with Microsoft and Apple both voluntarily surrendering their OpenAI board observer seats.#stateofai|116 Introduction|Research|Industry|Politics|Safety|Predictionsleading to the rise of pseudo-acquisitions as an exit strategystateof.ai 2024 Regulatory action can only do so much to shap
316、e a market,when economic logic dictates otherwise.Given the converging performance of many of the rest and these companies' high cap-ex needs,consolidation is unsurprising.Given some of the regulatory hurdles,we've seen the rise of pseudo-acquisitions,where a Big Tech company i)hires the founders and m
317、uch of the team of a start-up;ii)the start-up exits the model-building game to focus on its enterprise offer;iii)investors are paid out via a licensing agreement.This model has been used by Microsoft with Inflection and Amazon with Adept.However,regulators have become wise to the move and regulators
318、 on both sides of the Atlantic are beginning to scrutinize these arrangements.By far the most widely-used AI-powered developer tool,Copilot adoption is growing 180%year-over-year and its annual revenue run rate is now$2B(double its 2022 figure).Copilot(40%of Github revenue)alone is now a bigger busi
319、ness than Github was when Microsoft acquired it.However,its just one of a number of coding companies,some of which are raising blockbuster rounds.stateof.ai 2024#stateofai|117Github reigns supreme,but an ecosystem of AI coding companies is growing Introduction|Research|Industry|Politics|Safety|Predi
320、ctions(Chart of AI coding start-up fundraising:$626M Series B;$465M Series C;$252M Series A;$68M Series A;$196M Series B;$243M Series C;$60M Series A.Figures are total raised and latest round as of 7 Oct 2024.)In a now familiar cycle,we're seeing specialist tools and frameworks gain popularity before struggling to scale and enter production,while i
321、ncumbents demonstrate impressive resilience and adaptability.stateof.ai 2024#stateofai|118ML tools for AI struggle(again)Introduction|Research|Industry|Politics|Safety|Predictions Following the explosive growth of vector databases,the uniqueness of searching in vector space has worn off.Existing dat
322、abase providers have launched their own vector search methods.Hyperscalers like AWS,Azure,and Google Cloud have expanded their native DB offerings to support vector search and retrieval at scale,while data clouds like MongoDB,Snowflake,Databricks and Confluent are seeking to capture RAG workloads fr
323、om their existing customer base.Core Vector DB providers like Pinecone and Weviate now support traditional keyword search,such as ElasticSearch and OpenSearch along with introducing support for simple and efficient filtering and clustering.Over in framework land,the likes of LangChain and LlamaIndex
324、,having achieved popularity for experimentation,their high-level abstractions and limited flexibility have been called out as a source of friction by some developers,as their needs become more sophisticated.While H is being cagey about the specifics of its work,its early team contained experts in re
325、inforcement learning and multi-agent systems.Other agentic efforts are already up and running.stateof.ai 2024#stateofai|119Are AI agents going commercial?Introduction|Research|Industry|Politics|Safety|Predictions Devin,launched by Cognition,made a splash in March.Pitched as“the first AI software eng
326、ineer”,it is meant to plan and execute tasks requiring thousands of decisions,while fixing mistakes and learning over time.The product itself split users,attracting fans,as well as detractors who point to the need for guardrails and manual intervention.Either way,investors are impressed,and within s
327、ix months of launch,the company secured a$2B valuation.Devin has an open source competitor in OpenDevin,which beat the proprietary Devin on SWE-bench by 13 percentage points.MultiOn is also betting big on RL,with its autonomous web agent-Agent Q(see slide 65)-combining search,self-critique,and RL.It
328、 will be made available to users later this year.Meta's TestGen-LLM has gone from paper to product at breakneck pace(4 months),being integrated into Qodo's Cover-Agent.With$165M raised,Perplexity has emerged as the buzziest AI-first search challenger,while Google is rolling out its own search summari
329、es.Both companies are finding that the output is only as good as the information.stateof.ai 2024#stateofai|120AI-powered search begins to make a dent,amid teething problems Introduction|Research|Industry|Politics|Safety|Predictions Within 18 months of being founded,Perplexity hit a$1B valuation,wi
330、th rumours that it is already looking to potentially triple it.The LLM analyzes user input,sourcing responses either via a web search or from its knowledge base,before producing a summary with in-line citations.Google has rolled out AI summary boxes to illustrate the potential of Gemini to power up it
331、s standard offering.Both services,however,have struggled with reliability issues.Gemini was found to be using satirical Reddit posts as advice sources(e.g.advising users to consume a rock a day),while Perplexity struggles with the same hallucination issues that hit other LLM-powered services.OpenAI
332、has started trialling a prototype search function-SearchGPT-which will eventually be integrated into ChatGPT.While we dont know technical specifics yet,promotional imagery suggests a Perplexity-esque user experience.While copyright concerns are nothing new in generative AI,2024 saw model builders co
333、me under greater scrutiny from media organizations,record labels,and content creators.stateof.ai 2024#stateofai|121Industry attitudes to copyright diverge as anger from content creators rises Introduction|Research|Industry|Politics|Safety|Predictions OpenAI and Google are negotiating with major media organizations,hoping that licensing arrangements will take the sting out of criticism.In a similar