{"id":24022,"date":"2024-09-25T15:25:07","date_gmt":"2024-09-25T14:25:07","guid":{"rendered":"https:\/\/www.riskinsight-wavestone.com\/?p=24022"},"modified":"2024-09-25T15:25:09","modified_gmt":"2024-09-25T14:25:09","slug":"which-llm-suits-you-optimizing-the-use-of-llm-benchmarks-internally","status":"publish","type":"post","link":"https:\/\/www.riskinsight-wavestone.com\/en\/2024\/09\/which-llm-suits-you-optimizing-the-use-of-llm-benchmarks-internally\/","title":{"rendered":"Which LLM Suits You? Optimizing the use of LLM Benchmarks Internally."},"content":{"rendered":"\n<p style=\"text-align: justify;\">Ever since the launch of ChatGPT in November 2022, many companies began developing and releasing their own Large Language Models (LLMs). \u00a0So much so that we are currently in a phase that many experts describe as an \u201cAI Race\u201d. Not just between companies \u2013 but countries and international organizations as well. This AI race describes the global frenzy to build better models alongside the guidelines and regulations to handle them. <strong>But what exactly is a better model?<\/strong><\/p>\n<p style=\"text-align: justify;\">To answer this question, researchers and engineers from around the world came up with a standardized system to test LLMs in various settings, knowledge domains and to quantify it in an objective manner. These tests are commonly known as \u201cBenchmarks\u201d, and different benchmarks reflect very different use cases.<\/p>\n<p style=\"text-align: justify;\">However, for the average user, these benchmarks alone don\u2019t mean much. 
There is a clear awareness gap for the end-user: a 97.3% result on the \u201cMMLU\u201d benchmark is hard to interpret and even harder to translate into their daily tasks.<\/p>\n<p style=\"text-align: justify;\">To clear up this confusion, this article introduces the factors that narrow down a user\u2019s LLM choice, the most popular and widely used LLM benchmarks, their use cases, and how they can help users choose the LLM that suits them best.<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702525\"><\/a>Factors that Impact LLM Choice<\/h2>\n<p style=\"text-align: justify;\">Various factors impact the quality of a model: the cut-off date and internet access, multi-modality, data privacy, context window, and speed and parameter size. These factors should be settled before moving on to benchmark assessments and model comparison, since they limit which models you can use in the first place.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702526\"><\/a>Cut-off Date and Internet Access<\/h3>\n<p style=\"text-align: justify;\">Almost all models on the market have a knowledge cut-off date. This is the date at which data collection for model training ends. For example, if the cut-off date is September 2021, then the model has no way of knowing any information after that date. Cut-off dates usually fall 1-2 years before the model\u2019s release.<\/p>\n<p style=\"text-align: justify;\">However, to overcome this issue, some models such as Copilot (GPT-4) and Gemini have been given access to the internet, allowing them to browse the web. This allows models with past cut-off dates to still access the most recent news and articles. 
This also allows LLMs to provide the user with references, which reduces the risk of hallucination and makes the answer more trustworthy.<\/p>\n<p style=\"text-align: justify;\">Nevertheless, internet access is a product of the model\u2019s packaging rather than the model itself, so it is limited to cloud-hosted, primarily closed-source offerings. For this reason, it is important to consider your needs and whether up-to-date information really matters for achieving your goals.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702527\"><\/a>Multi-Modality<\/h3>\n<p style=\"text-align: justify;\">Different applications make different demands of LLMs. While most of us use them for their text generation abilities, many LLMs can in fact analyze images and audio, and reply with images as well.<\/p>\n<p style=\"text-align: justify;\">However, not all LLMs have this ability. The ability to process different forms of input (text, image, voice) is called \u201cmulti-modality\u201d. This is an important factor to consider: if your task requires analyzing voice messages or corporate diagrams, look for multi-modal models such as Claude 3 and ChatGPT.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702528\"><\/a>Data Privacy<\/h3>\n<p style=\"text-align: justify;\">Data privacy and leakage are a risk with most models on the market right now. More specifically, data privacy and safety in LLMs can be separated into two parts:<\/p>\n<ol style=\"text-align: justify;\">\n<li><strong>Data privacy in pre-training and fine-tuning<\/strong>, i.e. whether the model has been trained on data that contains PII and could leak that PII during chats with users<strong>. 
<\/strong>This is a product of the model\u2019s training dataset and fine-tuning process.<\/li>\n<li><strong>Data privacy in re-training and memory,<\/strong> i.e. whether the model uses chats with users for re-training, potentially leaking information from one chat to another. This risk is limited to some online models. It is a product of the model\u2019s packaging and the software layer(s) between the model and the user.<\/li>\n<\/ol>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702529\"><\/a>Context Window<\/h3>\n<p style=\"text-align: justify;\">The context window refers to the number of input tokens that a model can accept. Thus, a larger context window means that the model can accept a larger input text. For example, Google\u2019s latest model, Gemini 1.5 Pro, has a 1 million token context window, which gives it the ability to read entire textbooks and then answer based on their contents.<\/p>\n<p style=\"text-align: justify;\">For context, a 1 million token window allows the model to analyze ~60 full books purely from user input before answering the user prompt.<\/p>\n<p style=\"text-align: justify;\">Thus, models with larger context windows can often be customized to answer questions based on specific corporate documents without using RAG (Retrieval-Augmented Generation), which is the most common solution for this problem on the market.<\/p>\n<p style=\"text-align: justify;\">However, LLM providers often bill users based on the number of input tokens, so expect to pay more when filling a larger context window. 
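As a rough illustration of input-token billing, here is a minimal sketch; the per-token price and the ~4-characters-per-token heuristic are made-up assumptions for illustration, not any provider's actual pricing:

```python
# Rough input-cost estimate for a large-context prompt.
# PRICE_PER_1K_INPUT_TOKENS and the 4-chars-per-token heuristic are
# illustrative assumptions, not real provider pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.005  # hypothetical USD rate

def estimate_input_cost(text: str) -> float:
    est_tokens = len(text) / 4  # crude heuristic: ~4 characters per token
    return est_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

# A ~1M-character document (~250K tokens) sent as a single input:
print(round(estimate_input_cost("x" * 1_000_000), 2))  # 1.25
```

The point is simply that cost scales linearly with the input you place in the window, so filling a million-token context on every request adds up quickly.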
Additionally, it isn\u2019t uncommon for models to take upwards of 10 minutes to answer when using a large context window.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702530\"><\/a>Speed and Parameter Size<\/h3>\n<p style=\"text-align: justify;\">LLMs have technical variations that can impact the speed of processing the user prompt and the speed of generating a response. The most important technical variation that affects LLM speed is parameter size, which refers to the number of variables the model has internally. This number, usually in billions, reflects how sophisticated a model is but also indicates that the model might require more time to generate a response.<\/p>\n<p style=\"text-align: justify;\">However, the internal architecture of the model also matters. For instance, some of the latest 70B+ parameter models on the market can reply in real time while some 8B parameter models need minutes to generate a response.<\/p>\n<p style=\"text-align: justify;\">Overall, it is important to consider the trade-off between speed on one hand and parameter size (sophistication and complexity) on the other, although this is also highly dependent on the internal model architecture and the environment it is used in (API, cloud service, self-deployed, etc.).<\/p>\n<p style=\"text-align: justify;\">Nevertheless, speed is a key distinguisher that straddles the line between factor and benchmark, since it is measured and used to compare the different SOTA (state-of-the-art) models. However, speed isn\u2019t a standardized form of assessment and for this reason isn\u2019t considered a benchmark.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702531\"><\/a>Next Steps<\/h3>\n<p style=\"text-align: justify;\">After having reviewed these factors, users can narrow down their LLM choice and use the benchmarks covered in the next section to help them choose the most suitable model. 
This helps users maximize their efficiency and benchmark only the models that are relevant to them (from a cut-off date, speed, data privacy, etc. perspective).<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702532\"><\/a>How Benchmarks are Conducted<\/h2>\n<p style=\"text-align: justify;\">Benchmarks are tools used to assess LLM performance in a specific area. Benchmarks can be conducted in different ways \u2013 the key distinguisher being the number of example question-answer pairs the LLM is given before it is asked to solve a real question.<\/p>\n<p style=\"text-align: justify;\">Benchmarks assess the LLM\u2019s ability to do a certain task. Most benchmarks ask an LLM a question and compare the LLM\u2019s answer with a reference correct answer; if they match, the LLM\u2019s score increases. In the end, the benchmark outputs an Acc (accuracy) score: the percentage of questions the LLM answered correctly.<\/p>\n<p style=\"text-align: justify;\">However, depending on the method of assessment, the LLM might get some context on the benchmark, the type of questions, or more. 
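Concretely, the Acc score described above is just the fraction of exact matches against the reference answers. A minimal sketch, with a hypothetical `ask_model` stub standing in for a real LLM call:

```python
# Minimal sketch of how a multiple-choice benchmark scores a model.
def ask_model(question: str, choices: list[str]) -> str:
    # A real implementation would query an LLM; this stub always picks "A".
    return "A"

def accuracy(dataset: list[dict]) -> float:
    """Fraction of questions where the model's choice matches the reference."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)

sample = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": "A"},
    {"question": "Capital of France?", "choices": ["Lyon", "Paris"], "answer": "B"},
]
print(accuracy(sample))  # 0.5: one exact match out of two questions
```

Real harnesses add details (answer extraction, log-likelihood scoring), but the headline number is this simple ratio.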
This is done through multi-shot or multi-example testing.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702533\"><\/a>Multi-shot Testing<\/h3>\n<p style=\"text-align: justify;\">Benchmarks are conducted in three distinct ways:<\/p>\n<ol style=\"text-align: justify;\">\n<li>Zero-Shot<\/li>\n<li>One-Shot<\/li>\n<li>Multi-shot (often multiples of 2 or 5)<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">Here, \u201cshots\u201d refers to the number of sample question-answer pairs given to the LLM prior to its assessment.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24029\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-1-EN.png\" alt=\"\" width=\"605\" height=\"194\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-1-EN.png 605w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-1-EN-437x140.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-1-EN-71x23.png 71w\" sizes=\"auto, (max-width: 605px) 100vw, 605px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Figure 1: illustration of 3-shot vs. 0-shot prompting<\/em><\/p>\n<p style=\"text-align: justify;\">The reason we have different shot settings is that certain LLMs outperform others in short-term memory and context usage. For example, LLM1 could have been trained on more data and thus outperforms LLM2 in zero-shot prompting. 
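As an aside, the shot settings above differ only in how the prompt is assembled before the real question. A minimal sketch (the Q/A format and example questions are illustrative):

```python
# Build a k-shot prompt: k worked examples precede the real question.
def build_prompt(examples: list[tuple[str, str]], question: str, k: int) -> str:
    shots = examples[:k]
    lines = [f"Q: {q}\nA: {a}" for q, a in shots]
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

examples = [("What is 2+2?", "4"), ("What is 3+3?", "6"), ("What is 5+1?", "6")]

zero_shot = build_prompt(examples, "What is 7+2?", k=0)   # question only
three_shot = build_prompt(examples, "What is 7+2?", k=3)  # 3 worked examples first
print(three_shot)
```

Everything else in the benchmark (questions, scoring) stays identical; only `k` changes between the 0-shot, 1-shot, and multi-shot runs.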
However, LLM2\u2019s underlying technology may give it superior reasoning and contextualizing abilities that would only be measured through one-shot or multi-shot assessment.<\/p>\n<p style=\"text-align: justify;\">For this reason, each time an LLM is assessed, multiple shot settings are used to ensure that we get a complete understanding of the model and its capabilities.<\/p>\n<p style=\"text-align: justify;\">For instance, if you are interested in finding a model that contextualizes well and is able to logically reason through new and diverse problems, consider looking at how the model\u2019s performance increases as the number of shots increases. If a model improves significantly, it has a strong ability to reason and learn from previous examples.<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702534\"><\/a>Key Benchmarks and Their Differentiators<\/h2>\n<p style=\"text-align: justify;\">Many benchmarks evaluate the same thing. 
Thus, when looking at benchmarks, it is important to understand what they are assessing, how they are assessing it, and what their implications are.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702535\"><\/a>Massive Multitask Language Understanding (MMLU)<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24038\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-1-EN.png\" alt=\"\" width=\"626\" height=\"225\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-1-EN.png 626w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-1-EN-437x157.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-1-EN-71x26.png 71w\" sizes=\"auto, (max-width: 626px) 100vw, 626px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24006\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-2.png\" alt=\"\" width=\"1386\" height=\"339\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-2.png 1386w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-2-437x107.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-2-71x17.png 71w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-2-768x188.png 768w\" sizes=\"auto, (max-width: 1386px) 100vw, 1386px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Figure 2: example of an MMLU question<\/em><\/p>\n<p style=\"text-align: justify;\">MMLU is one of the most widely used benchmarks. It is a large multiple-choice dataset that covers 57 unique subjects at an undergraduate level. These subjects include Humanities, Social Sciences, STEM and more. 
For this reason, MMLU is considered the most comprehensive benchmark for testing an LLM\u2019s general knowledge across all domains. Additionally, it is also used to find gaps in an LLM\u2019s pre-training data, since it isn\u2019t rare for an LLM to be exceptionally good at one topic and underperform in another.<\/p>\n<p style=\"text-align: justify;\">Nevertheless, MMLU only contains English-language questions, so a great result on MMLU doesn\u2019t necessarily translate to a great result when asking general knowledge questions in French or Spanish. Additionally, MMLU is purely multiple choice, which means that the LLM is tested only on its ability to pick the correct answer. This doesn\u2019t necessarily mean the LLM is good at generating coherent, well-structured, and non-hallucinatory answers when prompted with open-ended questions.<\/p>\n<p style=\"text-align: justify;\">An MMLU result can be interpreted as the percentage of questions that the LLM was able to answer correctly. Thus, for MMLU, a higher percentage is a better score.<\/p>\n<p style=\"text-align: justify;\">Generally, a high average MMLU score across all 57 fields indicates that the model was trained on a large amount of data containing information from many different topics. 
Thus, a model performing well in MMLU is a model that can effectively be used (perhaps with some prompt engineering) to answer FAQs, examination questions and other common everyday questions.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702536\"><\/a>HellaSwag (HS)<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24036\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-2-EN.png\" alt=\"\" width=\"620\" height=\"222\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-2-EN.png 620w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-2-EN-437x156.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-2-EN-71x25.png 71w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24000\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3.png\" alt=\"\" width=\"2063\" height=\"351\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3.png 2063w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3-437x74.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3-71x12.png 71w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3-768x131.png 768w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3-1536x261.png 1536w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-3-2048x348.png 2048w\" sizes=\"auto, (max-width: 2063px) 100vw, 2063px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Figure 3: example of a HellaSwag question<\/em><\/p>\n<p style=\"text-align: justify;\">HellaSwag is an acronym for \u201cHarder Endings, Longer contexts, and Low-shot Activities for Situations with Adversarial 
Generations\u201d. It is another massive (10K+ questions), English-focused multiple-choice benchmark. However, unlike MMLU, HS does not assess factual or domain knowledge. Instead, HS focuses on coherency and LLM reasoning.<\/p>\n<p style=\"text-align: justify;\">Questions like the one above challenge the LLM by asking it to choose the continuation of the sentence that makes the most human sense. Grammatically, all the options are valid sentences, but only one follows common sense.<\/p>\n<p style=\"text-align: justify;\">The reason this benchmark was chosen is that it works in tandem with MMLU. While MMLU assesses factual knowledge, HS assesses whether the LLM can use that factual knowledge to provide coherent and sensible responses.<\/p>\n<p style=\"text-align: justify;\">A great way to visualize how MMLU and HS are used is by imagining the world we live in today. We have engineers and developers who possess great understanding and technical knowledge but struggle to communicate it due to language and social barriers. Because of this, we have consultants and managers who may not possess the same depth of knowledge, but instead have the ability to organize and communicate the engineers\u2019 knowledge coherently and concisely.<\/p>\n<p style=\"text-align: justify;\">In this case, MMLU is the engineer and HS is the consultant. 
One assesses the knowledge while the other assesses the communication.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702537\"><\/a>HumanEval (HE)<\/h3>\n<p style=\"text-align: justify;\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24034\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-3-EN.png\" alt=\"\" width=\"620\" height=\"222\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-3-EN.png 620w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-3-EN-437x156.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-3-EN-71x25.png 71w\" sizes=\"auto, (max-width: 620px) 100vw, 620px\" \/><\/p>\n<p style=\"text-align: justify;\">While MMLU and HS test the LLM\u2019s ability to reason and answer accurately, HumanEval is the most popular benchmark for purely assessing the LLM\u2019s ability to generate usable code, across 164 different scenarios. Unlike the previous two, HumanEval is not multiple-choice based and instead allows the LLM to generate its own response. However, not all responses are accepted by the benchmark. Whenever an LLM is asked to code a solution to a scenario, HumanEval runs the LLM\u2019s code against a variety of test and edge cases. If any of these test cases fail, the LLM fails that scenario.<\/p>\n<p style=\"text-align: justify;\">Additionally, HumanEval also expects the code generated by the LLM to be algorithmically optimized for time and space. Thus, if an LLM outputs a working but suboptimal algorithm, it loses points. 
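The pass/fail mechanics described above can be sketched in miniature; `candidate_max` below is a toy stand-in for model-generated code, and the test cases are illustrative:

```python
# HumanEval-style functional check: a candidate solution passes only if
# every unit test (including edge cases) succeeds.
def candidate_max(xs):  # stands in for code generated by the model
    return sorted(xs)[-1]

tests = [
    (([3, 1, 2],), 3),
    (([-5],), -5),       # edge case: single negative element
    (([0, 0, 0],), 0),   # edge case: all elements equal
]

passed = all(candidate_max(*args) == expected for args, expected in tests)
print(passed)  # True only if all test cases succeed
```

A single failing edge case flips `passed` to `False`, which is why benchmarks of this style punish superficially plausible but fragile code.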
For this reason, HumanEval also tests the LLM\u2019s ability to accurately understand the question and respond in a precise manner.<\/p>\n<p style=\"text-align: justify;\">HumanEval is an important benchmark even for non-technical use cases, since it indirectly reflects an LLM\u2019s general sophistication and quality. For most models, the target audience is developers and tech enthusiasts. For this reason, there is a strong positive correlation between higher HumanEval scores and higher scores in many other benchmarks, signifying that the model is of higher quality. However, it is important to keep in mind that this is merely a correlation, not causation, and so things might differ in the future as models start targeting new users.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702538\"><\/a>Chatbot Arena<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24032\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-4-EN.png\" alt=\"\" width=\"622\" height=\"227\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-4-EN.png 622w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-4-EN-437x159.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Table-4-EN-71x26.png 71w\" sizes=\"auto, (max-width: 622px) 100vw, 622px\" \/> <img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24004\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-4.png\" alt=\"\" width=\"1386\" height=\"348\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-4.png 1386w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-4-437x110.png 437w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-4-71x18.png 71w, 
https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-4-768x193.png 768w\" sizes=\"auto, (max-width: 1386px) 100vw, 1386px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Figure 4: example of Chatbot Arena interface<\/em><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-24002\" src=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-5.png\" alt=\"\" width=\"341\" height=\"248\" srcset=\"https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-5.png 341w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-5-263x191.png 263w, https:\/\/www.riskinsight-wavestone.com\/wp-content\/uploads\/2024\/09\/Figure-5-54x39.png 54w\" sizes=\"auto, (max-width: 341px) 100vw, 341px\" \/><\/p>\n<p style=\"text-align: center;\"><em>Figure 5: Chatbot Arena July 2024 rankings<\/em><\/p>\n<p style=\"text-align: justify;\">Unlike the past three benchmarks, Chatbot Arena is not an objective benchmark but a subjective ranking of the LLMs available on the market. Chatbot Arena collects users\u2019 votes and determines which LLM provides the best overall user experience, including the ability to maintain complex dialogues, understand user inquiries, and other customer satisfaction factors. Chatbot Arena\u2019s subjective nature makes it the best benchmark for assessing the end-user experience. However, this subjectivity also makes it non-reproducible and difficult to quantify.<\/p>\n<p style=\"text-align: justify;\">The current user rankings put OpenAI\u2019s GPT-4o at the top of the list with a sizable margin between it and second place. This ranking carries great weight since it is built from 1.3M user votes. 
However, these voters are primarily from a tech background, so the ranking might be biased towards models with greater coding abilities.<\/p>\n<p style=\"text-align: justify;\">The rankings are built on the Elo system, a zero-sum system in which a model gains Elo points by producing a better reply than its opponent, while the opposing model loses the same amount.<\/p>\n<h3 style=\"text-align: justify;\"><a name=\"_Toc171702539\"><\/a>Overall benchmarking<\/h3>\n<p style=\"text-align: justify;\">Benchmarks can have internal biases and limitations, so they are best used together to better represent a model\u2019s capabilities. Newer models are also advantaged because of their architecture, their training data size, and the leakage of benchmark questions into training data.<\/p>\n<p style=\"text-align: justify;\">The three + one (Chatbot Arena) benchmarks mentioned are the most popular and most widely used in research to compare LLMs. Together, MMLU, HellaSwag, HumanEval, and Chatbot Arena assess many sides of an LLM, from factual understanding and coherence to coding and user experience. 
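Returning for a moment to Chatbot Arena's Elo mechanism: the zero-sum update can be sketched as follows (the K-factor of 32 and the ratings are illustrative values, not Arena's actual configuration):

```python
# Zero-sum Elo update: the winner gains exactly what the loser loses.
K = 32  # illustrative K-factor controlling update size

def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_winner: float, r_loser: float) -> tuple[float, float]:
    delta = K * (1 - expected(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# An upset: a 1200-rated model beats a 1300-rated one.
new_w, new_l = update(1200, 1300)
print(round(new_w, 1), round(new_l, 1))
```

Note that an upset win moves the ratings more than a win by the favorite, and the two ratings always sum to the same total, which is what makes the system zero-sum.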
For this reason, these four benchmarks alone are widely used in many online rankings, since together they give a good picture of an LLM\u2019s overall capability.<\/p>\n<p style=\"text-align: justify;\">However, one thing to consider is that the newest LLMs are heavily advantaged, for two primary reasons:<\/p>\n<ol style=\"text-align: justify;\">\n<li>They are built on a more robust architecture, have better underlying technologies and have more data to train on due to later cut-off dates and larger hardware capacity.<\/li>\n<li>Many questions from the benchmarks have leaked into the model\u2019s training data.<\/li>\n<\/ol>\n<p style=\"text-align: justify;\">Nevertheless, there are many more benchmarks available online that assess different aspects of an LLM and are often used in tandem to paint a complete picture of the model\u2019s performance.<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702540\"><\/a>Factors, Benchmarks and How to Choose Your LLM<\/h2>\n<p style=\"text-align: justify;\">By using the aforementioned factors and benchmarks, you can effectively compare LLMs in a quantifiable and objective way \u2013 helping you make an informed decision and choose the most suitable model for your business needs and tasks.<\/p>\n<p style=\"text-align: justify;\">Additionally, each of the above benchmarks has strengths and weaknesses that make it unique and valuable in different respects. However, at Wavestone we recognize the importance of diversification to minimize risk. For this reason, we developed a checklist that allows users to make a more informed decision when choosing a set of benchmarks to follow and using them to compare the latest models. 
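A weighted comparison of the kind such a checklist supports can be sketched as follows; the weights, model names, and scores below are made-up illustrations, not real benchmark results:

```python
# Combine normalized benchmark scores with user-defined weights.
# All numbers here are invented for illustration only.
weights = {"MMLU": 0.4, "HellaSwag": 0.2, "HumanEval": 0.1, "ChatbotArena": 0.3}

models = {
    "model_a": {"MMLU": 0.86, "HellaSwag": 0.95, "HumanEval": 0.90, "ChatbotArena": 0.88},
    "model_b": {"MMLU": 0.82, "HellaSwag": 0.89, "HumanEval": 0.97, "ChatbotArena": 0.91},
}

def weighted_score(scores: dict) -> float:
    """Weighted sum of a model's benchmark scores."""
    return sum(weights[b] * scores[b] for b in weights)

best = max(models, key=lambda m: weighted_score(models[m]))
print(best, round(weighted_score(models[best]), 3))
```

Changing the weights (say, boosting HumanEval for a developer-facing use case) can flip which model comes out on top, which is exactly the point of weighting benchmarks by business need.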
The checklist covers a wide variety of domains, benchmarks and factors that give the end-user more granular control over their benchmark choice.<\/p>\n<p style=\"text-align: justify;\">The tool, which doubles as a priority tracker, allows users to set different weights for the benchmarks to accurately reflect their business needs and the nature of their tasks. For example, a consultant might prioritize multi-modality for diagram and chart analysis over mathematical skills and thus give multi-modality a higher weighting.<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702541\"><\/a>Finishing thoughts<\/h2>\n<p style=\"text-align: justify;\">In the rapidly evolving landscape of LLMs, understanding the nuances of different models and their capabilities is crucial. Before adopting any LLM, several factors must be taken into account, including cut-off date, data privacy, speed, parameter size, context window, and multi-modality. After considering these factors, users can consult different benchmarks to make a more informed decision. The ones covered in this article \u2013 MMLU, HellaSwag, HumanEval, and Chatbot Arena \u2013 provide a robust system to quantitatively evaluate these models across various domains.<\/p>\n<p style=\"text-align: justify;\">In conclusion, the AI Race is not just about developing better models but also about leveraging these models effectively. The journey of choosing the most suitable LLM is not a sprint but a marathon, requiring continuous learning, adaptation, and strategic decision-making through benchmarking and testing. 
As we continue to explore the potential of LLMs, let us remember that the true measure of success lies not in the sophistication of the technology but in its ability to add value to our work and lives.<\/p>\n<p style=\"text-align: justify;\">\u00a0<\/p>\n<h3>Acknowledgements<\/h3>\n<p>We would like to thank Awwab Kamel Hamam for his contribution to this article.<\/p>\n<p>\u00a0<\/p>\n<h2 style=\"text-align: justify;\"><a name=\"_Toc171702542\"><\/a>Further Reading and Reference<\/h2>\n<p style=\"text-align: justify;\">[1] D. Hendrycks et al., \u201cMeasuring Massive Multitask Language Understanding.\u201d arXiv, 2020. doi: 10.48550\/ARXIV.2009.03300. Available: <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">https:\/\/arxiv.org\/abs\/2009.03300<\/a><\/p>\n<p style=\"text-align: justify;\">[2] D. Hendrycks et al., \u201cAligning AI With Shared Human Values.\u201d arXiv, 2020. doi: 10.48550\/ARXIV.2008.02275. Available: <a href=\"https:\/\/arxiv.org\/abs\/2008.02275\">https:\/\/arxiv.org\/abs\/2008.02275<\/a><\/p>\n<p style=\"text-align: justify;\">[3] M. Chen et al., \u201cEvaluating Large Language Models Trained on Code.\u201d arXiv, 2021. doi: 10.48550\/ARXIV.2107.03374. Available: <a href=\"https:\/\/arxiv.org\/abs\/2107.03374\">https:\/\/arxiv.org\/abs\/2107.03374<\/a><\/p>\n<p style=\"text-align: justify;\">[4] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, \u201cHellaSwag: Can a Machine Really Finish Your Sentence?\u201d arXiv, 2019. doi: 10.48550\/ARXIV.1905.07830. Available: <a href=\"https:\/\/arxiv.org\/abs\/1905.07830\">https:\/\/arxiv.org\/abs\/1905.07830<\/a><\/p>\n<p style=\"text-align: justify;\">[5] W.-L. Chiang et al., \u201cChatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.\u201d arXiv, 2024. doi: 10.48550\/ARXIV.2403.04132. 
Available: <a href=\"https:\/\/arxiv.org\/abs\/2403.04132\">https:\/\/arxiv.org\/abs\/2403.04132<\/a><\/p>\n","protected":false}}