AI, process design and benchmarking
The Vals AI legal report was released last week. It is described as a “first-of-its-kind study [evaluating] how four legal AI tools perform across seven legal tasks, benchmarking their results against those produced by a lawyer control group”.
It is impressive and ambitious, pitting AI against humans on data extraction, document Q&A, document summarization, redlining, transcript analysis, chronology generation and EDGAR research tasks.
I encourage anybody interested in legal tech and AI to give it a read; it will be useful to anyone looking to learn more about the capabilities of legal AI tooling. I would like to share a few thoughts of my own about how it can best be used in practice.
AI is a building block within a process
The first thing that’s important to note about any exercise like this is that AI is only ever a building block within a process. Its value manifests itself in how a given outcome can be delivered more effectively in a process redesigned and enabled by AI.
I have made this point a number of times before. It’s an important one, because it would be easy to read a report such as Vals AI and draw one of a few hasty conclusions, e.g.:
- “Lawyers are going to be replaced by AI”
- “AI won’t replace you but a lawyer using AI will replace you”
- “AI is better at redlining than humans”
- “AI is the death of the billable hour”
You see these kinds of conclusions a lot on social media. Sometimes they may be true, but often they are underpinned by oversimplification and misconception.
I’m going to choose chronology generation as the example for this article.
(I could also have chosen transcript analysis, redlining or EDGAR research, because these are examples of multi-faceted legal processes. Things like summarization, document Q&A and data extraction tend to be building blocks within overarching processes. Nonetheless the principles are exactly the same.)
I am not an expert in chronology generation, but I have worked on numerous projects where I have personally made one, or have supervised one being made. Here’s what the broad process looks like based on my experience:
How might this process look if we start involving AI in it? The most common conception of the AI-enabled chronology usually puts AI in the driving seat. To cover the risk that AI hallucinates, is not 100% accurate etc., a step is often included at the end for a human to review the chronology.
In reality, the process probably cannot be this simple. How can a human review the chronology unless they themselves are familiar with the facts? Do they need to read the documents themselves? If so, does the process look more like this?
The real answer is that it probably depends. If I already know the documents like the back of my hand, we can probably settle on Model #2. If I do not know the documents, then we will have to go with Model #3 because I do not have the knowledge to carry out the final step.
But then this calls into question the fundamental goal of the chronology exercise. When I made chronologies, the thing that was helpful was not always the words on a piece of paper; it was the knowledge the exercise had built up in my head. I could jump on client calls and quickly comment, “that’s not going to work because of [x] document” when we were doing structuring analysis.
I only acquired that knowledge by spending more time with the documents. If you’re anything like me, actively engaging with documents builds your knowledge far more than passively engaging with them.
When I’m reading long cases (nowadays, it’s more arXiv articles people send me), I always do so with pen and paper in hand so I can draw diagrams as I read along. (I wonder, by the way, whether this is why lawyers like having things printed.)
If the fundamental goal is to acquire knowledge, aren’t we better off with a new Model #4 that puts humans in the driving seat with AI checking things?
Or perhaps you can think of different variations, e.g. where AI helps summarise a document before a human reads it, or where AI highlights parts of a document to draw your attention to them?
You see how it can become very complex? You can also see, hopefully, how it depends on the context in which I am operating.
A piece of complex litigation where I need all the facts in my head (and where the rest of the team probably needs to know them well too) might steer you towards Model #4 or Model #4A.
A simple piece of litigation where everyone knows everything already, and the goal is just to produce a reference artefact might push you towards Model #2 or Model #3.
The value of initiatives such as Vals AI is that they help us know how good a given tool is likely to be within the context of a process. But you can’t just assume everything is going to be simple like Model #2, and that you are swapping humans for AI. You have to think about the process, and the process will change depending on what you are doing.
Not all processes require the same level of accuracy
The reason all of this matters is that different processes require different levels of accuracy, depending on what you want out of them.
Taking Model #2 as an example, if you know the facts already and you simply need them recorded on a piece of paper, the thing you are trying to automate is, at its most basic, a typing exercise. If you can spot errors from a mile off, you don’t necessarily need a high level of accuracy for AI to add value to the process.
The new AI-powered process is still useful, but it becomes more useful the more accurate the AI part is. Although, as I note in the graph below, very low levels of accuracy might make it quicker just to do it yourself (because the time to check and correct the output exceeds the time it would have taken to type it yourself accurately).
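To make that break-even point concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is invented purely for illustration (how long an entry takes to type, review or fix is not something the Vals report measures); the only point is that once the error rate is high enough, checking and correcting costs more time than typing the chronology yourself.

```python
# Purely illustrative numbers: when does an AI-drafted chronology (Model #2,
# where the reviewer already knows the facts) stop saving time overall?

def net_minutes_saved(accuracy: float,
                      entries: int = 200,
                      minutes_to_type_entry: float = 3.0,
                      minutes_to_review_entry: float = 0.5,
                      minutes_to_fix_error: float = 6.0) -> float:
    """Minutes saved versus typing the chronology yourself (invented figures)."""
    manual_time = entries * minutes_to_type_entry
    review_time = entries * minutes_to_review_entry
    fixing_time = entries * (1 - accuracy) * minutes_to_fix_error
    return manual_time - (review_time + fixing_time)

if __name__ == "__main__":
    for accuracy in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"accuracy {accuracy:.0%}: net saving {net_minutes_saved(accuracy):6.0f} minutes")
```

On those made-up figures, the AI-assisted process only starts beating the manual one somewhere above roughly 60% accuracy; with different assumptions the break-even point moves, which is precisely why the process context matters.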
Again, the context matters. If you know the facts, but you need to present the output to court, accuracy might matter a little more. For example, if the chronology directly quotes correspondence, you cannot run any risk of anybody being misquoted.
This is the case even with minor errors, such as small word variations: courts will be quick to seize on these and form a bad impression of the document’s reliability as a whole. However well you know the documents, I bet you can’t spot minor variations in quotes, inconsequential as they may be to the substance of what is being said.
For this kind of use case, accuracy matters. It matters even more if you are using Model #2 in a context where you don’t know the documents at all and cannot check them effectively.
You can’t use Model #2 in contexts like this unless the AI in question has an extremely high level of accuracy. 70%, 75%, 80%, 85%: these are not likely to be good enough to deliver any value in this particular process context. You only start getting there when you are well into the 90s.
Push me hard enough and I might even argue that there is a dip in the curve as work transitions from (1) obviously inaccurate to (2) deceptively inaccurate, because it becomes harder to spot errors. This, in fact, is my main objection to the oft-repeated caveat that “humans should always check AI’s output”: it is hard to spot errors in deceptively inaccurate content.
Moving away from Model #2, the curve might look different for a tool that just needs to be “pretty much accurate” because you might end up with diminishing returns after a certain point.
For example, in Model #4A, you might be using AI as a “better search” within your document to find things. Value can be added here if the AI can help humans find things they would have otherwise ignored.
But it can hardly be said that value is reduced if the AI flags an irrelevant part of the document that the human can simply ignore (as long as this does not adversely affect the user experience). The use case simply does not call for high levels of accuracy.
Indeed, being too accurate might stop people from considering issues they hadn’t thought to enter into their search query. So, depending on the use case, value might either plateau or even go down after a point.
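If you wanted to sketch these two shapes, the toy Python snippet below does so. The curves and numbers are entirely made up; they are only meant to illustrate the shapes described above, not to reproduce any real benchmark data.

```python
# Entirely made-up curves, sketched only to illustrate the two shapes
# discussed above; none of this is real benchmark data.
import numpy as np
import matplotlib.pyplot as plt

accuracy = np.linspace(0.5, 1.0, 500)

# Court-facing chronology (Model #2): modest value while errors are obvious,
# a possible dip where errors become "deceptively inaccurate" and harder to
# spot, then value only really arriving well into the 90s.
base = 0.25 * (accuracy - 0.5) / 0.5
threshold = 0.75 / (1 + np.exp(-80 * (accuracy - 0.94)))
dip = 0.12 * np.exp(-((accuracy - 0.82) ** 2) / 0.001)
high_stakes_value = np.clip(base + threshold - dip, 0, None)

# "Better search" aid (Model #4A): value arrives early, then plateaus.
search_value = 1 - np.exp(-6 * (accuracy - 0.5))

plt.plot(accuracy, high_stakes_value, label="Court-facing chronology (Model #2)")
plt.plot(accuracy, search_value, label='"Better search" aid (Model #4A)')
plt.xlabel("Accuracy of the AI component")
plt.ylabel("Value added to the process (arbitrary units)")
plt.legend()
plt.tight_layout()
plt.show()
```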
Finally, the ability of a tool to help humans verify the accuracy of its output – through its UX – plays a huge part.
Don’t just think about accuracy in a silo. Think about it in the context of the process in which it is being deployed, and how proportionate accuracy is to value. If accuracy matters a great deal because of how AI has been inserted into the process, you need to reach a certain level of accuracy before the process becomes viable. At the opposite end of the spectrum, accuracy is never make-or-break: improvements add value, but with diminishing returns after a point.
Benchmarking processes as well as building blocks
Whether the process is chronology generation, redlining documents or something else, its goal might be multifaceted. I’ve already touched upon one example: sometimes the goal of a legal process is not just to produce an artefact, but to instil knowledge in the person doing the work.
In a similar vein, other factors are always in play here alongside knowledge and quality:
- Speed: is time of the essence, and is it better to have something rather than nothing?
- Importance: how “good” does the work product actually need to be to serve the fundamental use case?
- Cost: what can the person paying for the work product afford to pay for it, given its importance?
- Transparency: how important is it that every single conclusion can be justified by a human who can reverse-engineer the thought process?
- Consistency: how important is it that an exercise is done in exactly the same way each time?
- Insights: is a key part of the process to store data that can be reused in the future, that a human would never have time to capture or record?
Following a Model #4A-esque process in writing this article, ChatGPT tells me I should also mention ethical concerns, implicit biases and confidentiality. Indeed, these are important considerations.
Weighing up each of these factors is a delicate and complex balancing exercise. The line will be drawn differently depending on who you are.
I used to work in an environment where cost was rarely an issue, but speed, knowledge and quality were of utmost importance. This might drive you more towards Model #4A than Model #2.
But everyone is different. If you work in an environment where speed is important, knowledge is a secondary concern, and there is a genuine trade-off between quality of the work product and the cost to deliver it, you might be more interested in Model #2 or even Model #1.
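To make the balancing exercise a little more tangible, here is a deliberately crude sketch in Python. The factor scores and weighting profiles are invented (none of them come from the Vals report), and the model labels follow the variants sketched earlier in this article; the only point is that changing the weights changes which process model comes out on top.

```python
# Invented scores (0-5) for how well each process model serves each factor.
# "cost" here means cheapness (higher = cheaper to deliver). Illustration only.
MODEL_SCORES = {
    "Model #2 (AI drafts, reviewer already knows the documents)":
        {"speed": 5, "quality": 3, "cost": 4, "knowledge": 1},
    "Model #3 (AI drafts, reviewer reads the documents first)":
        {"speed": 2, "quality": 4, "cost": 2, "knowledge": 4},
    "Model #4A (human drives, AI assists and highlights)":
        {"speed": 3, "quality": 5, "cost": 2, "knowledge": 5},
}

# Two invented weighting profiles reflecting the environments described above.
PROFILES = {
    "cost rarely an issue; speed, knowledge and quality paramount":
        {"speed": 4, "quality": 5, "cost": 0, "knowledge": 5},
    "speed- and cost-driven; knowledge a secondary concern":
        {"speed": 5, "quality": 2, "cost": 5, "knowledge": 1},
}


def rank_models(weights: dict[str, int]) -> list[tuple[str, int]]:
    """Rank process models by a simple weighted sum of factor scores."""
    totals = {
        model: sum(weights[factor] * score for factor, score in factors.items())
        for model, factors in MODEL_SCORES.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for profile_name, weights in PROFILES.items():
        best_model, best_score = rank_models(weights)[0]
        print(f"{profile_name}:\n  -> favours {best_model} (score {best_score})")
```

On those invented weights, the first profile favours Model #4A and the second favours Model #2, mirroring the point above. Real decisions will of course involve far more than a weighted sum, but making the weights explicit is a useful forcing function.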
In order to really understand the value of AI as a building block within legal processes, we have to step back and understand what we are looking to achieve from that process. Then, we have to find out the best way to weave AI into it. Then, we have to benchmark that AI-enabled process against how we would do it without AI. Only then can we really understand the value of AI in the legal world.
AI alone will not deliver any value — it only delivers value within the context of a process. The most fundamental thing we need to do is benchmark AI-enabled processes against each other. Benchmarks such as Vals are incredibly useful, but we shouldn’t treat them as the be-all and end-all. There is still more room for discussion around how we can apply AI in redesigned processes, rather than talking as if a single AI component replaces an entire workflow.