Can ChatGPT Pass the Supply Chain Test? Blue Yonder Reveals Study Results 

There's been a lot of talk about how generative AI will change work in the supply chain. At Blue Yonder, we wanted to examine those impacts through a benchmarking study. In our research experiment, we explored how capable large language models (LLMs) are out of the box, and whether they can be effectively applied to supply chain analysis to address the real issues faced in supply chain management.

LLMs, including ChatGPT, are a type of artificial intelligence trained on massive amounts of data, which allows them to learn the patterns, grammar and semantics of language. Over the past few years, LLMs have exploded in growth and are used in a range of applications worldwide, including content creation, customer service and market research.  

IDC data reveals that the software and information services, banking and retail industries are projected to allocate approximately $89.6 billion to AI in 2024, with generative AI accounting for more than 19% of the total investment.  

This rapidly evolving technology offers businesses increased creativity, efficiency and decision-making capabilities — which have the power to revolutionize industries and processes. So how do LLMs currently handle supply chain situations?  

About Blue Yonder’s generative AI benchmark study

Our generative AI supply chain test is loosely based on the viral experiment in which ChatGPT took the Uniform Bar Examination. In that study, the latest version of ChatGPT passed the bar exam with a combined score of 297, approaching the 90th percentile of all test takers. By passing the bar with a score near the top 10%, the model demonstrated generative AI's capacity to comprehend and apply legal principles and regulations. This groundbreaking study sparked global conversation and highlighted the transformative potential of AI.

Blue Yonder decided to take this conversation a step further by studying how leading LLM systems would do on supply chain industry exams. We had LLMs face off against two standard certification tests: the Certified Professional in Supply Management (CPSM) and the Certified Supply Chain Professional (CSCP) exams. Our objective? To see if LLMs could function as supply chain professionals, understanding the niche rules and context of the supply chain industry with no training.

We designed the experiment to programmatically run each LLM through the practice tests, with no context around the test, no access to the internet and no coding ability. We wanted to assess how the LLMs would perform straight out of the box, enabling a consistent and unbiased evaluation.  

Both the CPSM and the CSCP certification tests are multiple-choice. Rather than have the LLMs simply select an answer, we required each model to explain every choice it made. This approach gave us valuable insight into each model's reasoning process and helped us understand why it was getting answers right or wrong, which in turn helped us evaluate each model's abilities.
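To make the setup concrete, here is a minimal sketch of what such a harness might look like, using the OpenAI Python client as one example backend. The prompt wording, data shapes, and model name here are illustrative assumptions on our part, not the exact code used in the study.

```python
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are taking a multiple-choice supply chain certification exam.\n"
    "Question: {question}\n"
    "Choices:\n{choices}\n"
    'Reply with JSON: {{"answer": "<letter>", "explanation": "<one sentence>"}}'
)

def ask(question: str, choices: dict[str, str], model: str = "gpt-4o") -> dict:
    """Pose one exam question; return the chosen letter plus the model's reasoning."""
    formatted = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(question=question, choices=formatted)}],
        response_format={"type": "json_object"},  # force machine-readable output
    )
    return json.loads(resp.choices[0].message.content)

def accuracy(exam: list[dict], model: str = "gpt-4o") -> float:
    """Run every question in `exam` and return the fraction answered correctly."""
    hits = sum(ask(q["question"], q["choices"], model)["answer"] == q["key"]
               for q in exam)
    return hits / len(exam)
```

Swapping in another provider's client changes only the request inside `ask`, which keeps the comparison across models on identical questions.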

After updated versions of the LLMs were released, we ran the test again this summer to collect new benchmark results.

So, can LLMs pass supply chain exams?  

The LLMs performed surprisingly well on the supply chain exams without any training. We first looked at their out-of-the-box performance, with no context, and then added advantages in stages.

Stage 1: No context, no internet access, no coding ability

While most models achieved a solid passing grade with no context, Claude 3.5 Sonnet stood out, securing an impressive 79.71% accuracy on the CPSM certification test. On the CSCP exam, OpenAI’s o1-Preview and GPT 4o models edged out Claude Opus, scoring 48.30% accuracy compared to the latter’s 45.7%.

[Chart: Stage 1 overall exam accuracy by model]

While LLMs performed well in certain areas, they also showed limitations, particularly when faced with mathematics-related questions or deeply domain-specific questions.  

When examining only the math problems in each certification exam, OpenAI o1 Mini showcased a significant improvement in accuracy for OpenAI models, outperforming the Claude models tested.  

[Chart: Stage 1 accuracy on math questions only, by model]

These results were generated with no context, no internet access and no coding ability. Next, we explored what would happen if we started to give the LLMs more assistance.

Stage 2: Adding internet access 

In the next stage of testing, we gave the LLMs access to the internet, allowing them to search using you.com. With that added capability, OpenAI GPT 4 Turbo achieved the most significant advancement on the CSCP test, improving from 42.38% to 48.34%.

When we looked at only the questions missed on the first no-context test, the Claude Sonnet model achieved an accuracy of approximately 53.84% on the CPSM questions and 20% on the CSCP questions.

While internet access allowed the models to search for information independently, it also introduced the potential for inaccuracies due to unreliable online sources.
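As a rough illustration of this stage, the sketch below shows how search snippets can be folded into the exam prompt. The `web_search` helper is a hypothetical placeholder: the study routed queries through you.com, whose actual API we don't reproduce here.

```python
from openai import OpenAI

client = OpenAI()

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical placeholder for a search backend (the study used you.com).
    Wire in whatever search API you have and return the top-k result snippets."""
    raise NotImplementedError("plug in a real search backend here")

def ask_with_search(question: str, choices: str, model: str = "gpt-4o") -> str:
    """Ground the model's answer in fresh search results before it chooses."""
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        f"Search results (note: these may be unreliable):\n{context}\n\n"
        f"Question: {question}\nChoices:\n{choices}\n"
        "Pick the best choice and justify it in one sentence."
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```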

Stage 3: Providing context with RAG 

For the next test, we used retrieval-augmented generation (RAG), providing the LLMs with study materials for the exams. Using RAG, the LLMs outperformed both the no-context and open-internet-access runs on non-mathematical questions, achieving the highest accuracy scores on both tests.
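A bare-bones version of this RAG setup might look like the following: embed the study-guide passages once, retrieve the few most similar to each question, and prepend them to the prompt. The chunking, embedding model and `k` below are our illustrative choices, not the study's exact configuration.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # illustrative choice, not the study's

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(study_chunks: list[str]) -> np.ndarray:
    """Embed every study-guide passage once and L2-normalize for cosine search."""
    vecs = embed(study_chunks)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def retrieve(question: str, chunks: list[str], vecs: np.ndarray, k: int = 4) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed([question])[0]
    sims = vecs @ (q / np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[-k:][::-1]]

def ask_with_rag(question: str, choices: str, chunks: list[str],
                 vecs: np.ndarray, model: str = "gpt-4o") -> str:
    """Answer one exam question with the retrieved study material as context."""
    context = "\n---\n".join(retrieve(question, chunks, vecs))
    prompt = (f"Study material:\n{context}\n\n"
              f"Question: {question}\nChoices:\n{choices}\n"
              "Pick the best choice and explain briefly.")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```

Because the retrieved passages come from the exams' own study materials rather than the open web, the context is both relevant and trustworthy, which is consistent with the accuracy gains we observed.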

[Chart: Stage 3 (RAG) accuracy by model on non-mathematical questions]

Stage 4: Adding coding abilities  

Finally, we gave the models the ability to write and run their own code using the Code Interpreter and Open Interpreter frameworks.

Using these frameworks, the LLMs could write code to help solve the mathematical questions in the exams, which they struggled with in the first iteration of the test. With coding abilities, the LLMs outperformed the no-context test by an average of approximately 28% in accuracy across all models for mathematical questions.  
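Those frameworks handle this loop for you, but the underlying pattern is simple enough to sketch by hand: ask the model to emit Python for the arithmetic, execute it, and feed the printed result back for a final answer. Everything below is a simplified stand-in for what Code Interpreter or Open Interpreter actually do, and `exec` on model-written code should only ever run inside a sandbox.

```python
import contextlib
import io

from openai import OpenAI

client = OpenAI()

def solve_math_question(question: str, model: str = "gpt-4o") -> str:
    """Hand-rolled sketch of the code-interpreter pattern for one math question."""
    gen = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   "Write only Python code (no prose) that prints the numeric "
                   f"answer to this exam question:\n{question}"}],
    )
    code = gen.choices[0].message.content.strip()
    code = code.strip("`").removeprefix("python\n")  # crude fence stripping, fine for a sketch

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):  # CAUTION: sandbox model-written code in practice
        exec(code, {})
    result = buf.getvalue().strip()

    final = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"{question}\nYour computation printed: {result}. "
                   "State the final answer choice."}],
    )
    return final.choices[0].message.content
```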

Are LLMs useful for solving supply chain problems?

On the whole, the LLM systems passed the industry-standard supply chain exams. This performance presents a very exciting possibility for integrating LLMs into supply chain management. However, the models aren’t perfect yet. They struggled with both math problems and specific supply chain logic.  

With the added ability to write code, the LLMs were able to overcome many of the math problems — but still needed very specific supply chain context to solve some of the more complex questions within the exams.

What our study revealed is that generative AI can be extremely useful for solving supply chain problems, with the right tools and training.

Fortunately, that’s what Blue Yonder excels at. We’re committed to harnessing the power of generative AI to create practical, innovative solutions for supply chain challenges. Our newly launched AI Innovation Studio is a hub for developing these solutions, bridging the gap between complex AI technologies and real-world applications.

Our focus is on creating intelligent agents tailored to specific roles within the supply chain, ensuring that these agents are equipped to solve the real problems and challenges faced right now. Learn more about AI and machine learning at Blue Yonder, or contact us to start a one-on-one conversation.
 
