Generative AI in the legal industry: is accuracy everything?
We are 2 years on from the initial launch of ChatGPT, and the legal industry is still grappling with what generative AI means for the legal profession.
Unsurprisingly, many have been concentrating on the applications of generative AI that will truly change the game for law firms and lawyers. Many of these applications put generative AI in the role of the junior lawyer, producing first drafts of legal work or carrying out activities such as legal research that would otherwise take a real lawyer hours to undertake.
These kinds of applications face the same issue: the output of generative AI is not always accurate and nearly always requires human oversight. The accuracy of LLMs has improved over the last couple of years, but the legal profession is one where many lawyers strive for total accuracy, all of the time. If accuracy scores fall short of that, there will always be a need to verify the output of LLMs.
Many use cases for LLMs depend on high accuracy levels — especially if you hope to minimise the need for human oversight. But these are usually the obvious ones: “draft me a research memo”, “review this contract”, “draft me a contract”, “summarise this case law” etc. There are many less obvious use cases where accuracy helps, but it is not the be-all-and-end-all. This might be because the output is extremely easy to verify (e.g. extracting wording), or it might be because the nature of the activity does not demand high levels of accuracy (e.g. searching for precedents).
Too often, I find the discussion of accuracy one-dimensional. Much discussion focuses on how we can reach higher levels of accuracy without discussing the circumstances in which this is required. I think a few things warrant more discussion:
- The use cases where accuracy matters and where it does not
- The ability of humans to effectively verify LLM-produced output
- The consequences of using AI — even if accurate — in some situations
But first, let’s introduce what the accuracy problem is and why it exists…
The accuracy problem
Hallucinations
We’ve known about “hallucinations” for some time in the context of generative AI. A hallucination is where generative AI produces an output that is either incorrect, fabricated or based on citations concocted out of thin air.
We didn’t use the term “hallucination” before the age of generative AI because, in machine learning classification algorithms, these things tended to be reflected in accuracy metrics: if the algorithm wrongly classified something, this was an error. This was a world where it was fairly easy to distinguish between “correct” and “incorrect”.
LLMs are designed to do one thing: predict the next word (more accurately, next “token” prediction — don’t ask…). Drawing on their ginormous training datasets, they do this using their understanding of the semantic relationships between different words. They are also smart enough to take into account the context in which a word is used. In a strict sense (leaving “temperature” to one side), their only success metric is whether they accurately predict the next token in the sentence: the way they work is explicitly not related to the accuracy of the output they produce.
Realistically, very few people’s use case for LLMs is “next token prediction”. Usually, they are interested in the finished output this mechanism produces. It is this output that will be judged, not whether the LLM successfully predicted the most likely next word in a sentence.
The term “hallucination” appears to have been created to account for the fact that LLMs are often judged against something they are not designed to do (i.e. factual accuracy rather than whether or not the “correct” next token was predicted). Out of the box, an LLM is not designed to be accurate: it’s just that there is often a correlation between the product of a series of successful “next token predictions” and accuracy. But, crucially, this is not always the case.
As with most things around generative AI, nobody really knows — but my hypothesis is that for as long as the generative AI tooling operates based on “next token prediction”, we cannot avoid hallucinations. That’s the case even if we tell an LLM, “Don’t hallucinate”: all this does is affect the calculation of the most likely next word — which is done based on the conceptual similarity between words, not verified facts.
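To make this concrete, here is a deliberately crude sketch in Python of what “predict the next token” means. The probabilities are made up and the “model” is just a lookup table, but the point stands: the only objective is a plausible continuation, and nothing checks the result against verified facts.

```python
# Toy illustration of next-token prediction (made-up probabilities, not a real model).
import random

# A pretend model: for a given context, a weighted list of plausible next tokens.
FAKE_NEXT_TOKEN_PROBS = {
    "The claimant relied on the case of": [
        ("Smith", 0.40),     # plausible-sounding, but may not be a real case
        ("Donoghue", 0.35),
        ("Carlill", 0.25),
    ],
}

def predict_next_token(context: str) -> str:
    """Pick the next token weighted by probability.

    Nothing here checks whether the completed sentence is true; the only
    objective is a statistically plausible continuation.
    """
    tokens, weights = zip(*FAKE_NEXT_TOKEN_PROBS[context])
    return random.choices(tokens, weights=weights, k=1)[0]

print(predict_next_token("The claimant relied on the case of"))
```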
New or larger datasets
What makes things more complex is that earlier machine learning algorithms could be tuned by introducing new datasets that might fix the inaccuracies. This is much harder with LLMs, which are “large”, “general purpose”, and specifically not designed to be right or wrong.
Some studies have shown that larger LLMs tend to hallucinate less. Some have suggested that LLMs trained on domain-specific content hallucinate less. Other LLMs can “reason” (in my view, a poor choice of word, but I’m not getting into that debate), which describes a process whereby the initial output of the LLM is fed back into the LLM one or more times to influence a subsequent output. This, in my own experimentation, also helps to reduce the number of hallucinations, but it does not eliminate them.
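As a rough illustration of that feedback loop, here is a minimal sketch. The `call_llm` function is a placeholder standing in for whatever model API you happen to use; the pattern is simply draft, critique, revise.

```python
# Minimal sketch of the "feed the output back in" pattern described above.
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (illustrative only)."""
    return "Draft answer..."

def answer_with_self_review(question: str, rounds: int = 2) -> str:
    """Ask once, then ask the model to critique and revise its own draft."""
    draft = call_llm(question)
    for _ in range(rounds):
        critique = call_llm(
            f"Question: {question}\nDraft answer: {draft}\n"
            "List any errors, unsupported claims or invented citations in the draft."
        )
        draft = call_llm(
            f"Question: {question}\nDraft answer: {draft}\nCritique: {critique}\n"
            "Rewrite the draft, fixing the issues identified in the critique."
        )
    return draft
```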
My conclusion is that instead of talking about eliminating hallucinations, we have to accept that they will occur and that our strategy has to be around how we can mitigate them.
Mitigating hallucinations
Technical methods
So, we need to distinguish between (1) LLMs avoiding hallucinations and (2) mitigating the effects of hallucinations. If text generation is the success criterion of LLMs rather than verified facts, it seems that (1) is challenging because it is not a metric baked into the operational methodology of LLMs. Nonetheless, there are things we can do to mitigate the effect of hallucinations. Here are some examples:
- Retrieval-augmented generation. “Retrieval-augmented generation” connects LLMs to trusted datasets. In simple terms, this kind of system takes a query/prompt and uses a trusted dataset — along with the general linguistic capability of the LLM — to produce the output. In my own experimentation, this reduces hallucinations and also provides provenance to the output. But it does not completely eliminate hallucinations because (1) we cannot distinguish easily between what came from the LLM’s foundational training data and what came from the trusted dataset, (2) the wrong parts of the trusted dataset may be used, or taken out of context, and (3) it relies on the “trusted dataset” being completely accurate. (A rough sketch of this, together with “verification by lookup” below, follows this list.)
- Verification by LLM. See “reasoning” above.
- Verification by lookup. This is where the output of LLMs can be verified against a specifically identified trusted source using a database lookup. For example, if the output of an LLM references a case citation and you have access to a database of case citations, you could run a query to see whether that citation actually exists (also covered in the sketch below).
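Here is the rough sketch promised above, covering retrieval-augmented generation and verification by lookup. It is illustrative only: `call_llm` is a placeholder for a real model call, the “trusted dataset” is a couple of in-memory strings, the retrieval is naive keyword overlap rather than a proper vector store, and the citation “database” is a hard-coded set.

```python
import re

# Illustrative stand-ins: a real system would use a vector store, an actual
# case-law database and a real model API in place of these.
TRUSTED_PASSAGES = [
    "Clause 12.1: The Supplier shall indemnify the Customer against third-party claims.",
    "Clause 14.3: Either party may terminate on 30 days' written notice.",
]
KNOWN_CITATIONS = {"[2023] EWCA Civ 1", "[2019] UKSC 20"}  # toy citation database

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (illustrative only)."""
    return "Under Clause 14.3, either party may terminate on 30 days' written notice."

def retrieve(query: str, k: int = 2) -> list[str]:
    """Naive retrieval: rank trusted passages by word overlap with the query."""
    query_words = set(query.lower().split())
    return sorted(
        TRUSTED_PASSAGES,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )[:k]

def answer_with_rag(query: str) -> str:
    """Retrieval-augmented generation: ground the prompt in retrieved passages."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using ONLY the passages below, quoting the clause you rely on.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

def citations_exist(output: str) -> bool:
    """Verification by lookup: check any cited cases against a known database."""
    cited = re.findall(r"\[\d{4}\] [A-Z]+(?: [A-Za-z]+)? \d+", output)
    return all(c in KNOWN_CITATIONS for c in cited)

print(answer_with_rag("How much notice is needed to terminate?"))
print(citations_exist("The court followed [2023] EWCA Civ 1."))
```

Even with all three mitigations in place, nothing above guarantees the final sentence is correct; it only narrows the room for error and makes checking easier.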
Perhaps the most obvious mitigation is to insist that humans always check the output of LLMs…
Human verification
There have been a number of instances in the legal industry where a lawyer has used AI to produce their work, only to be then embarrassed by hallucinations contained within it. The reaction to these kinds of situations has been one of the following:
- “This is why AI is so dangerous and should be banned”
- “This is just like a senior lawyer working with a junior lawyer. They can’t abdicate responsibility. The same applies to AI: they need to check it first”
I wholeheartedly endorse the second view, with one caveat: I do not think it is quite right to equate work produced by a junior lawyer with work produced by AI. I say this for a few reasons:
- The smell test. It’s often easy to spot work that needs to be scrutinised carefully. It might be produced late, in the middle of the night, riddled with typos, named clumsily, the name of the firm spelled wrong on the first page, etc. On the other hand, well-presented work free of typos often instils confidence from the start. While humans only sometimes produce bad but well-presented work, LLMs always present their work well. That makes it harder to spot the errors.
- Thought process. You can easily reverse-engineer human work and form a view on how closely you need to look at it based on how it was created: “Did you speak to [x]?”, “Did you check [y]?”, “How long did it take you?”. Again, all these things are hallmarks of quality that help you decide how much scrutiny to apply. LLM “reasoning” capabilities help a little here, but it’s hard to deconstruct the thought process completely.
I’m open to the argument that I’m thinking of old ways of doing things and that LLMs have the potential to rip up and reinvent how legal work is created. But even if I’m wrong on the distinction between lawyer-produced and LLM-produced work, I still don’t think the fallback on “humans should check everything” is as effective as many people think it is. To explain why, I distinguish between “easy” and “hard” verification exercises.
Easy verification exercises
The ease with which a human can check the output of an LLM varies based on what is being done. For example, a common use case for an LLM might be to extract a particular data point from a contract.
A well-designed legal tech product that leverages AI will extract the data point and make it as easy as possible for the human to verify its output. It might do this by displaying the relevant extract from the contract and taking the user to the very point in the actual document where this extract is present. This is a simple cross-referencing exercise, driven both by the nature of the exercise and the user interface. The human can conduct it quickly and reliably: it’s an easy verification exercise.
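As a sketch of what that might look like under the hood (hypothetical field names and an assumed design, not any particular product), the tool returns not just the extracted value but also where in the document it came from, so the reviewer can jump straight to the passage:

```python
from dataclasses import dataclass

@dataclass
class ExtractedField:
    """Hypothetical shape of an extraction result that is easy to verify."""
    name: str          # e.g. "governing_law"
    value: str         # what the model extracted
    source_text: str   # the verbatim passage it relied on
    start_offset: int  # character positions in the original document
    end_offset: int

def matches_source(document: str, field: ExtractedField) -> bool:
    """Cheap cross-check: does the quoted passage appear where the tool claims?"""
    return document[field.start_offset:field.end_offset] == field.source_text
```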
Hard verification exercises
In other cases, verification of LLM-produced output is more challenging. This might be because the question of accuracy is not binary — for example, with legal research questions. These kinds of questions are often not a simple case of looking at text and seeing whether it matches what the LLM said, as is the case with easy verification exercises.
Here, various sources might need to be read, interpreted and compared against each other — often based on a solid grasp of legal principles and complicated jargon. In other words, checking them requires effort. These are hard verification exercises.
Hard verification exercises present three problems:
- Laziness: humans are often inclined to be lazy in exercises that require more cognitive load. As a result, it may be too tempting to trust what the LLM has said without really questioning it. You can almost imagine a lawyer looking at a daunting task and over-relying on the LLM’s output because they just want to go home and get some sleep
- Lack of rigour: the output of the LLM may influence the basis on which the human is doing their verification, which may not necessarily be the correct one (i.e. the LLM has sent the human down the wrong rabbit hole). The phenomenon of “what you see is all there is” is relevant here, meaning humans are likely to confine their thought process to the text in front of them rather than considering points not raised within it
- Benefits offset: hard verification exercises take longer. If LLMs aim to increase speed, their benefit is offset by every minute spent checking the response. We might also see a reduction in quality based on the prior two points
Degrees of difficulty
The effectiveness of human verification depends on what the human has to verify. For easy exercises, human verification is something you can generally rely on, but with harder tasks you might offset the benefits of using AI, both in terms of speed and quality. In short, the usefulness of human verification depends on the specific task to which AI is applied.
For example, it is easy for humans to verify an AI-produced summary of a discrete part of a document, particularly if the user can see the relevant extracts of the document side-by-side.
What’s harder to verify is a more detailed analysis of one or more long documents that is not purely extractive but involves a series of steps that would often include a number of human judgement calls. It’s not necessarily the case that longer output is harder to verify; what matters more is the judgement calls required and the extent to which you can verify these once you have been influenced by the AI’s output. Many lawyers struggle to work out how they can be responsible for these kinds of exercises merely by checking the LLM output, without directly carrying out the work themselves.
Does accuracy matter for all use cases?
Often overlooked are the applications of AI where human verification is unimportant because accuracy doesn’t matter that much. It really does depend on the use case and, in my view, the level of abstraction at which you define it.
Use cases at a high level of abstraction
When people first saw LLMs in action, they saw immediate potential in the legal industry. The use cases people came up with were all the obvious ones based on a high-level understanding of what lawyers do: “Maybe it can draft contracts”, “Maybe it can draft research notes”, etc.
These kinds of use cases all fit within the classic conception of what lawyers do: they draft documents, they advise on the law, they read long and dusty books, they appear in court, etc.
But if you have been a lawyer (or you have spent a lot of time working with one), you will know that a bunch of manual processes and tasks fall under all of these things. For example, “appearing in court” involves a vast amount of process work to get your evidence and files in shape, as well as the effort you put in to keep your inbox under control and ensure no tasks slip through the net.
Bringing in the lower levels of abstraction
If you don’t understand the layer that fits under the classic conception of legal work, your use case for generative AI will not get to a low level of abstraction. Your use case will be “drafting contracts” instead of “finding a starting point to draft a contract” because you are tackling things at the process level rather than analysing tasks that fit within a process: in the real world, nobody is just “drafting contracts”, they are carrying out a series of micro-tasks that help them accomplish that overall process.
If you only have a superficial understanding of what lawyers do, the thought process, more often than not, appears to be “generative AI produces a lot of words, lawyers also produce a lot of words; therefore, generative AI can replace lawyers”. Whereas those who understand what lawyers do on a day-to-day basis can deconstruct the obvious use cases into specific tasks and point generative AI at those tasks.
I’m not saying one approach is better than the other. I’m saying you need both. You can’t expect much to change if you are tackling tasks without looking to redesign the overarching process. But equally, you can’t expect to change a process without deconstructing it into tasks. (In practice, I see the latter as a greater risk because of the temptation to “get behind a keyboard and start writing code” instead of spending time understanding how users might interact with the application you are about to write.)
How does this relate to accuracy?
As discussed above, the degree to which human verification can be relied upon to mitigate the risk of hallucinations from generative AI depends on the purpose to which generative AI is being applied. In some cases, it will be easy for a human to verify the output. In other cases, it is more challenging.
When you define use cases at a high level of abstraction, you tend to talk about things like “drafting a contract” — a process encapsulating all the tasks below it. If your use case for AI is “drafting a contract”, people might think AI is doing all of the tasks within it, e.g. finding a starting point, changing the names and numbers, varying the language to meet the commercial terms, checking clauses for consistency, proofreading etc. Or, of course, they might not know what you mean when you say “drafting a contract”. Neither of these is a good outcome. (I go into more detail about contract drafting in this article).
Some tasks, such as replacing names and numbers, are mechanical and easy to check. Others might be more complex, such as ensuring the contract as a whole meets the terms you agreed. To the extent these complex tasks are included within the use case, the ease with which a human review can mitigate the risk of hallucinations is reduced.
In contrast, defining use cases at a lower level of abstraction allows you to pick and choose the parts that generative AI can help with. For example, you might decide that within the contract drafting process, you are only going to focus on using LLMs to power a semantic search or Q&A interface to help people find suitable starting points more easily (i.e. instead of having to know what keywords to type in a search bar, your search is more “fuzzy” and tailored to your needs).
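As an illustration of that narrower use case, here is a minimal sketch of semantic search over a precedent bank. The `embed` function is a stand-in: a real system would call an embedding model, whereas this toy version just counts words so the example runs on its own.

```python
# Minimal sketch of semantic search over a precedent bank (toy embeddings).
import math
from collections import Counter

PRECEDENTS = {
    "Short-form SaaS agreement, supplier-friendly": "saas software subscription supplier",
    "Loan facility agreement, single lender": "loan facility lender borrower interest",
    "Mutual NDA, 2-year term": "confidentiality nda disclosure mutual",
}

VOCAB = sorted({w for words in PRECEDENTS.values() for w in words.split()})

def embed(text: str) -> list[float]:
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def find_starting_points(query: str, k: int = 2) -> list[str]:
    """Rank precedents by similarity to a natural-language request."""
    q = embed(query)
    ranked = sorted(
        PRECEDENTS,
        key=lambda name: cosine(q, embed(PRECEDENTS[name])),
        reverse=True,
    )
    return ranked[:k]

print(find_starting_points("I need an nda with mutual confidentiality obligations"))
```

Note that a wrong result here costs the user a few seconds of scanning a list of starting points, which is a very different risk profile from a wrong sentence in an advice note.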
Focusing on use cases at a low level of abstraction helps you weigh up not only the ease with which human review can mitigate the effect of hallucinations, but also whether accuracy is actually important at all. For example, accuracy is dead important when drafting legal advice notes, but does it really matter all that much when the role AI plays is presenting lawyers with a more relevant series of starting points they can choose from when drafting a contract?
Accuracy in generative AI is a very hot topic right now, but underpinning this discussion is often the (in my view) erroneous assumption that accuracy is essential for every use case. Accuracy only matters if the use case you choose demands it. Complex tasks often demand a high degree of accuracy, and high-level use cases such as “contract drafting” often encapsulate complex tasks. Breaking down processes and casting use cases at a lower level of abstraction enables you to have a more sensible discussion about how much accuracy actually matters and how easy it is to mitigate the risk of hallucinations.
A lack of ambition?
In response to this, you can argue that only the complex applications of AI will shift the dial in the legal profession. This might be a valid argument, but in my experience, there is often no correlation between the complexity of the technology and the value it delivers to lawyers.
I have been burned on several projects by prioritising my curiosity in technology over what people care about. The result satisfies my own curiosity but does not satisfy my users. Everyone working in the technology industry should be conscious of this bias. In my view, you can only resolve it if you explore applications of AI irrespective of their complexity.
What are the consequences of using AI?
AI is such an interesting technology because it makes us question how we currently work at a fundamental level. We can use AI to improve processes and deliver things more quickly than ever. But we can also use AI to do something quicker while simultaneously missing the point entirely.
Before talking about improvements in accuracy, we should think about the fundamental task we want to achieve and ask which camp it falls into: (1) positive process improvement or (2) quicker process that misses the point.
The latter might arise in two circumstances: (1) where the process matters more than the product, and (2) where using AI will remove a valuable human cognitive capability.
Where the process matters more than the product
Let’s take a legal opinion in a banking transaction. The purpose of a legal opinion is to comfort a bank that the borrower in question is of sufficient legal standing and can enter into the transaction. Banks ask for it from a risk perspective, and its value largely lies in the words on the paper.
Contrast this with a complex legal advice memo delivered during a piece of litigation. It serves as a discussion piece for clients to clarify their understanding, “ask stupid questions,” etc. The words on paper carry value, but they also evidence the existence of a knowledgeable lawyer who can think outside the box and facilitate a helpful discussion.
Consider what would happen if both the legal opinion and the legal advice memo could be automated with complete accuracy. The legal opinion is likely okay because its value lies in the risk it resolves. However, the point of an automated legal advice memo is questionable if a discussion cannot subsequently take place with a lawyer who knows the relevant facts and case law in depth.
Before assuming we need to involve AI in a specific workflow and that AI needs to be accurate, we should consider whether we are automating away a process that itself has intrinsic value.
Where using AI will remove a valuable human cognitive capability
I’m told that when calculators were first introduced, people were worried that we would lose the ability to add up. Similar concerns arose around using sat navs — what if people got lost and didn’t have a sense of direction?
Similar considerations exist for AI. The fact is that AI will cause us to stop doing things we used to do. The real question is whether that matters or not. Consider these two scenarios:
- AI helps me find a specific clause in a document. As a result, I am no longer very good at scrolling through documents and using CTRL+F
- AI helps me summarise case law. As a result, I no longer need to read cases in full
While lawyers losing their scrolling skills hardly seems to be a cause for concern, lawyers no longer reading case law may compromise their ability to spot nuanced issues, read between the lines, or gain important context.
Again, before we start automating things and asking for higher levels of accuracy, we should consider whether doing this might adversely impact the skills lawyers need to have and their ability to learn new ones.
People tend to be polarised on this question, either saying “AI is bad as it will stop lawyers learning” or “AI is good and inevitable, and people need to learn to use it”. The better position is to be nuanced. Consider exactly what it is you are trying to do, and carefully consider its impact. If you conclude you should do it, the accuracy of an LLM might become relevant; otherwise, it might not.
Here are my actionable takeaways:
- Don’t always assume accuracy is the be-all-and-end-all
- When evaluating tools, consider the ease with which humans can verify the output
- Try (also) focusing on the use cases where accuracy doesn’t matter
- Think carefully about the consequences of using AI to automate processes before getting hung up on accuracy
Sometimes, I wonder whether much of the discussion in this area is driven by the desire to do the most interesting and impressive thing possible with generative AI rather than stepping back and considering a more measured approach that prioritises the value we can deliver today. As I’ve shown, focusing on the most impressive things is hard.
But there’s another risk here. Lawyers are renowned for trying a piece of technology and, if it doesn’t work for them, never using it again. I hear more and more stories from lawyers about this exact thing happening with generative AI, and I worry that this might have an impact on the uptake of generative AI capabilities more generally. I wonder whether this is because we are starting with the most complex applications of AI instead of learning through the simpler but still valuable ones.
As far as accuracy is concerned, the quest for higher rates of accuracy is often inextricably linked with trying to resolve the most interesting and complex problems with AI. In my view, when we are talking about accuracy, we should also be talking about why it matters, how easy verification of the output is, and whether there are any (more) valuable applications of AI where accuracy doesn’t matter quite as much.