In this series of posts, our Head of AI, Paul Golding, explores the challenges of LLMs in the enterprise. He extrapolates lessons from the detailed review paper Challenges and Applications of LLMs.
Summary
The rapid emergence of Large Language Models (LLMs) in the machine learning landscape has been remarkable, transforming from obscurity to widespread prominence in just a short span of time. The ever-evolving nature of the field makes it challenging to discern the persistent hurdles and the areas where LLMs need careful management.
Drawing upon the paper Challenges and Applications of LLMs, we translate its identified unresolved challenges into practical insights for the enterprise.
There remain many pitfalls for the unsuspecting user, further cementing the contention that, contrary to initial impressions, LLM deployments can take a lot of work to get right.
It remains our view that without a strategic commitment, many enterprises will fail to leverage the full transformative potential of LLMs. The paper illustrates why this might be so.
Introduction
To the uninitiated, the power of LLMs is highly deceptive. The power and confidence with which they deliver convincing results are sufficient for many to conclude that such power generalizes to all use cases, or all samples. This is often far from the truth. Nonetheless, many enterprises are pressing ahead with LLM experiments and deployments, increasingly under pressure to get quick wins. This leaves even less time to become aware of the many pitfalls waiting for the unsuspecting user.
Data scientists know well the pitfalls of naive reliance upon a model that appears to work well at first glance. Good data scientists train themselves to be skeptical of such results and to use strategies to probe the real efficacy of a model.
Moreover, great data scientists train themselves to be wary of the online-offline gap. This is where models appear to obtain fantastic results during development (offline), yet fail to deliver in production (online). The reasons for this are many, not always technical, but it takes experience and discipline to manage.
Pitfalls can flummox even seasoned data scientists. Witness how a recent paper revealed that a popular technique (SMOTE, cited 25000 times) that promises to “fix” certain dataset problems only gives the illusion of doing so. Indeed, it can make matters worse.
LLMs probably require way more guardrails than many enterprises currently use for substantially less complex models. Hence it pays to treat LLMs as beasts to be tamed rather than unleashed.
The paper includes a pictorial overview of the challenges.
In this post, we focus only on the issue of Unfathomable Datasets, borrowing from the language and lessons of the paper, but interpreting for an enterprise audience.
Unfathomable Datasets
Dataset Knowability
The gargantuan size of modern pre-training datasets makes it impractical for individuals to thoroughly assess the quality and content of each document. For an enterprise, this means that caution is needed.
LLMs can generate unknowable responses. We do not mean just from a stochastic point of view, but rather due to the unknowability of the dataset’s contents, its quality and vagaries. This presents potential risks that enterprises must deal with. Do not be fooled by the relative sanity of a few examples seen during an offline proof-of-concept; it in no way reflects what the model might generate online in a production context.
Let’s examine some of the issues we might fix.
Data Duplications and Purity
Content that is nearly, but not exactly, duplicated can negatively impact model performance, significantly so under certain circumstances. Techniques like the MinHash algorithm help to filter out near-duplicates. MinHash efficiently approximates document similarity from compact signatures, without needing to compare entire documents pairwise, a task that would otherwise scale quadratically with the number of documents.
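To make this concrete, here is a minimal sketch of near-duplicate filtering using MinHash signatures with locality-sensitive hashing via the open-source datasketch library. The shingle size, signature length, and similarity threshold are illustrative assumptions, not recommendations.

```python
# Minimal near-duplicate filtering sketch using MinHash + LSH (datasketch library).
# The word-shingle size, num_perm, and threshold are illustrative assumptions.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 3-shingles of a document."""
    tokens = text.lower().split()
    shingles = {" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf8"))
    return m

def dedupe(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Return doc ids, keeping only the first document of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):        # no sufficiently similar document seen yet
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```

In practice the choice of shingling, threshold, and what counts as a “duplicate” should be driven by the downstream use case, for the bias reasons discussed next.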
For the enterprise, such techniques will prove vitally important in data preparation (e.g. for fine-tuning). There is a range of subtleties to this problem, including the prevention of model bias. For example, poorly configured similarity heuristics might prune datasets in a naive way that unwittingly increases bias by removing subtle, yet key, differences that are highly useful for a particular use case.
Enterprises need to ensure that the data used is representative, diverse, and free from any unintended biases. Prepare your data carefully! Dumping lots of text into a model and letting it sort things out is a naive and poor approach.
Test Contamination
Training data can overlap with evaluation test sets, leading to inflated performance metrics. With poor data management, detecting and eliminating these overlaps is difficult.
The practical constraints of managing large datasets and the complex relationships between training and test data can complicate data processes. More rigor is required than existing enterprise processes can handle.
Whilst it is tempting in an enterprise to dump lots of unstructured data into the training pool and let the LLM sort it out, this is too naive. Data auditing is necessary. However, many enterprises lack sufficient rigor in this regard. Such lax procedures will cost an enterprise dearly when it becomes necessary to adopt LLMs with alacrity.
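As a rough illustration of what such auditing can involve, the following minimal sketch flags evaluation examples that share long n-grams with the training pool. The 13-token window and whitespace tokenization are simplifying assumptions; real pipelines usually normalize text and tune the window size.

```python
# Minimal sketch of a train/test contamination check based on long n-gram overlap.
# The 13-token window and whitespace tokenization are simplifying assumptions.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_examples(train_docs: list[str], test_docs: list[str], n: int = 13) -> list[int]:
    """Return indices of test documents sharing at least one n-gram with the training pool."""
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    return [i for i, doc in enumerate(test_docs) if ngrams(doc, n) & train_ngrams]
```

Flagged examples can then be removed from the evaluation set, or at least reported alongside any performance claims.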
Data Infiltration and Exfiltration
Personally Identifiable Information (PII), such as phone numbers and email addresses, has been found in pre-training data, raising privacy concerns and potential security risks during model prompting. The same applies to any confidential enterprise data.
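As a starting point only, here is a minimal sketch of a first-pass regex scan for obvious PII in candidate training text. The patterns are deliberately simple assumptions and would need considerable hardening, or a dedicated PII tool, for production use.

```python
# Minimal first-pass PII scan sketch; the regexes are illustrative and far from exhaustive.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def scan_for_pii(doc: str) -> dict[str, list[str]]:
    """Return any matches per PII category found in a document."""
    hits = {name: pat.findall(doc) for name, pat in PII_PATTERNS.items()}
    return {name: matches for name, matches in hits.items() if matches}
```

As the next sections argue, such naive detection is only a first line of defense.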
Of course, it goes without saying that confidentiality infiltration or exfiltration is potentially a massive risk with serious legal ramifications. There are many ways this could manifest. For example, consider the presence of confidential sales data within a dataset used to fine-tune a model subsequently made available for customer support. Confidential data could leak in any number of ways, many of them exacerbated by the ability of the LLM to both hallucinate and generate novel texts via diffuse synthesis.
Confidential data might be hard to detect in and of itself because it is spread diffusely throughout the dataset. Its risk, in terms of a confidentiality breach, might only become apparent when connected via the generative capabilities of the model over large spans of text, i.e. when the model joins the dots in a way that makes the leak more evident and damaging.
There is also a risk that enterprises might accidentally consume confidential data leaked into the pre-trained (baseline) model. In some cases, this could be a bigger threat than leaking data outwards. Enterprises need to pay close attention to the dataset claims and terms of use of any model before jumping head-first into LLM adoption and rollouts.
Naive Detection Can Easily Fail
Classifying texts (e.g. to detect confidential data) is made harder by the diffuse spread of corporate facts throughout the corpora. Advanced classification techniques, like weak supervision (with labeling functions), are probably a good idea. They will beef up detection of confidential text that doesn’t easily fall into a narrow set of detection rules. Moreover, the use of labeling functions makes the task achievable without employing an army of legal checkers.
The mere presence of a customer’s name in a dataset might compromise confidentiality agreements. This could cause irreversible damage to revenue and reputation. The removal of such data could boil down to the effectiveness of Named Entity Recognition (NER). However, naive NER could overlook certain client names for all kinds of reasons. Even the use of grounding data (like knowledge graphs) could fail here if the client in question has yet to be signed up and is therefore absent from any official “fact-checking” records (e.g. CRMs).
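To make the weak-supervision idea concrete, here is a minimal sketch in which hand-written labeling functions cast noisy votes that are combined by simple majority. A real pipeline would typically learn to weight the functions (e.g. with a label model), and the client names and keywords shown are hypothetical.

```python
# Minimal weak-supervision sketch: labeling functions cast noisy votes on whether a
# text span is confidential; votes are combined by simple majority as an illustration.
# The client names and sensitive terms below are hypothetical.
CONFIDENTIAL, SAFE, ABSTAIN = 1, 0, -1

KNOWN_CLIENTS = {"acme corp", "globex"}            # hypothetical CRM export
SENSITIVE_TERMS = {"pipeline value", "discount schedule", "term sheet"}

def lf_client_name(text: str) -> int:
    return CONFIDENTIAL if any(c in text.lower() for c in KNOWN_CLIENTS) else ABSTAIN

def lf_sensitive_terms(text: str) -> int:
    return CONFIDENTIAL if any(t in text.lower() for t in SENSITIVE_TERMS) else ABSTAIN

def lf_public_boilerplate(text: str) -> int:
    return SAFE if "for public release" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_client_name, lf_sensitive_terms, lf_public_boilerplate]

def weak_label(text: str) -> int:
    """Combine labeling-function votes; abstain when no function fires."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return CONFIDENTIAL if votes.count(CONFIDENTIAL) >= votes.count(SAFE) else SAFE
```

The point of the labeling-function style is that domain experts encode their heuristics once, rather than labeling every document by hand, and the coverage of the heuristics can be audited and extended over time.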
Dataset Heterogeneity Can Be Bad
The practice of combining datasets from various sources can lead to issues if not well-considered. The optimal mixture of data sources for strong downstream performance is still an under-explored area in most enterprises.
Combining datasets is an obvious step: sales data, competitive intelligence, procurement and so on. It is easy to assume that this will lead to greater generalizability. We see the temptation to create a MyOrgGPT all-in-one model that encompasses all enterprise knowledge in an oracular fashion.
However, whilst these aggregations might seem useful, given the naive view that the data is all related (somehow), their efficacy for specific downstream tasks is unknown.
For example, the inclusion of competitive and market intelligence data might seem like a good idea when training a MyOrgSalesGPT model. However, perhaps CI data comes from a variety of sources that are in some way incoherent with the organization’s actual sales strategies.
The inclusion of such data might negatively affect the use of such a model for, say, campaign identification. Anecdotally, we could imagine a strong trend in market data exerting excessive influence on the model even though it might not be so relevant to the current corporate sales strategy. In other words, the semantic assumptions of the model, owing to a dataset bias, do not translate into relevant recommendations.
This bias could dilute sales campaign strategy through misalignment with corporate sales goals. However, the misalignment might hide itself. The model could still give a plausible impression, perhaps hallucinatory, that its outputs are still aligned with goals. As many a behavioral economist has told us: discerning plausibility from truth is hard. LLMs are brilliant at generating plausible outputs.
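One way to make the mixture question explicit rather than accidental is to sample each source under tunable weights, and then evaluate those weights against the downstream tasks that matter. Below is a minimal sketch; the source names, weights, and placeholder documents are hypothetical.

```python
# Minimal sketch of explicit domain-mixture sampling. Source names, weights, and the
# placeholder documents are hypothetical; weights should be tuned against downstream
# task evaluations rather than guessed.
import random

SOURCES = {
    "crm_notes":       {"docs": ["example CRM note"], "weight": 0.5},
    "market_intel":    {"docs": ["example market intelligence report"], "weight": 0.2},
    "support_tickets": {"docs": ["example support ticket"], "weight": 0.3},
}

def sample_mixture(num_docs: int, seed: int = 0) -> list[str]:
    """Draw a training mixture according to per-source weights."""
    rng = random.Random(seed)
    names = list(SOURCES)
    weights = [SOURCES[n]["weight"] for n in names]
    sample = []
    for _ in range(num_docs):
        source = rng.choices(names, weights=weights, k=1)[0]
        sample.append(rng.choice(SOURCES[source]["docs"]))
    return sample
```

Treating the weights as experimental variables, rather than as an afterthought, at least makes the bias question visible and testable.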
Fine-Tuning for Specific Tasks
Here we focus only on some of the dataset issues of fine-tuning, leaving the practical mechanics and unanswered challenges of fine-tuning to a later post. For an introduction to fine-tuning, please see our previous post on Fine Tuning.
For enterprise users looking to fine-tune pre-trained language models (LMs) for various tasks, the following data considerations are important.
Task Mixtures
It often pays to explore how far a particular fine-tuning exercise can be pushed in terms of training on a diverse set of tasks. The resulting models are called multitask-prompted fine-tuned LLMs (MTLMs). The approach can enhance generalization with minimal training effort. However, the issue of balancing tasks is similar to that of balancing domain mixtures.
Challenges such as negative task transfer (where learning multiple tasks impedes learning specific tasks) and catastrophic forgetting (forgetting previous tasks when learning new ones) have been observed.
Jang et al. report that MTLMs can underperform expert LLMs fine-tuned on only a single task because of negative task transfer. Enterprises should also watch out for catastrophic forgetting, where learning new tasks erases previous ones. Either effect could easily impact enterprise use cases negatively.
LLMs are deep-learning networks (DLNs). As with all DLNs, experimenting is almost always necessary because their results can be highly unpredictable. Enterprises should experiment with different task-set proportions in order to understand trade-offs for optimal performance against priority business goals.
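As an illustration of the kind of experiment implied here, the following minimal sketch builds candidate fine-tuning mixtures under different task proportions, with a per-task cap so that no single task dominates. The task names, example counts, and cap are hypothetical assumptions.

```python
# Minimal sketch of building multitask fine-tuning mixtures under different task proportions.
# Task names, example counts, budget, and the per-task cap are hypothetical assumptions.
TASKS = {
    "opportunity_classification": 40_000,   # available fine-tuning examples per task
    "email_summarisation":        15_000,
    "contract_qa":                 5_000,
}

def mixture(proportions: dict[str, float], budget: int, cap: int = 10_000) -> dict[str, int]:
    """Allocate a training-example budget across tasks, capping any single task's share.

    Note: allocations may total less than the budget when caps or availability bind.
    """
    return {
        task: min(int(budget * p), cap, TASKS[task])
        for task, p in proportions.items()
    }

# Candidate mixtures to compare against per-task evaluation sets.
candidates = [
    {"opportunity_classification": 0.6, "email_summarisation": 0.2, "contract_qa": 0.2},
    {"opportunity_classification": 0.4, "email_summarisation": 0.3, "contract_qa": 0.3},
]
for c in candidates:
    print(mixture(c, budget=30_000))
```

Each candidate mixture should be evaluated against the business-priority tasks, not against an aggregate score that can mask negative transfer on the task that matters most.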
Task Prioritization
Consider again the use case of corporate sales and defensive strategies against a particularly aggressive competitor. When fine-tuning a model to classify CRM entries as various types of sales opportunities, there comes a point where the attempt to generalize might best be jettisoned in favor of specialization for this sensitive use case.
Imagine that a defensive campaign is worth $1B in potentially lost revenue due to a rapidly emerging threat. There will be some trade-off point where the additional effort and cost of fine-tuning a specialized model on this one task is worth it.
Of course, this means the enterprise LLM teams need to be agile. They also need sufficient awareness of critical business needs. Any MLOps put into place to detect task drift will likely fail to detect the opportunity. Resource alignment is also important here: a team motivated to conserve AI-compute costs could cost the business millions, or worse, in lost revenue!
Model Imitation
A common hack is to take an open-source LLM and fine-tune it to imitate proprietary models (e.g. GPT-4) using API outputs from the proprietary model. Initially, this scheme appears to work by benefitting from the human alignment of the proprietary model. However, enterprises should note that such “imitation models” can mimic style but not content accurately. This discrepancy has been discussed in the causality literature. The capability gap between fine-tuned open-source and closed-source models remains, motivating further work on better imitation approaches. For now, enterprises should take care not to be fooled by superficial results when eyeballing the outputs of an imitation model.
Conclusion
A key message is that despite the common (and accurate) notion that LLMs are effective due to the vast quantities of data, it is not the case that quantity trumps quality. The quantity of data is important, but so is data quality, sometimes more so. Incorporating low-quality, biased, duplicate or irrelevant data can negatively impact model performance and the integrity of downstream applications.
Moreover, LLMs are highly complex models where the actual business performance of a task is hard to measure. This might tempt some LLM teams to focus on techniques that appear to generalize and lean into the better-performing use cases. This also gives the impression of economic and resource benefits. However, this approach might prove harmful in key enterprise use cases that really move the business needle. Thus the efficacy of valuable downstream tasks should always be a consideration.
It is our view that the main challenge for enterprises is still a mindset one, pushing back on the over-inflated optimism that these models conjure. With the democratization of LLMs, it is easy to get some results quickly. However, don’t be fooled by quick wins into thinking that the LLM takes care of everything. As the paper explains, and as we have briefly summarized and extrapolated to the enterprise, key results can be highly sensitive to dataset vagaries.