Lawyer or language model? Testing AI's competence in answering Australian legal questions





The Allens AI Australian Law Benchmark for generative AI



7 minute read


The last 24 months have seen generative artificial intelligence (AI) tools advance in leaps and bounds, driven by notable developments in large language models (LLMs). These new capabilities are already having a significant impact on the way businesses operate, including the legal function. However, the exact effectiveness of generative AI as it relates to the law remains largely a matter of speculation and anecdote. Certainly, those of us who have tried each new model have noticed improvements (and sometimes setbacks), but how good are they, really, as lawyers?

Conceptually, the ability of AI tools to quickly identify patterns in large volumes of data and generate well-formed sentences or phrases would seem like a highly desirable skill in a lawyer. However, the limitations of generative AI when faced with a prompt requesting legal advice are also well publicized: AI faces many challenges when it comes to replicating the judgment of a human lawyer, which plays a crucial role in legal practice. Language also operates differently in a legal context compared to many other linguistic contexts.

A repeatable benchmark is needed to systematically test, compare, and track advances in generative AI's ability to answer legal questions over time. In consultation with Linklaters LLP, Allens has developed the Allens AI Australian Law Benchmark (Allens AI Benchmark) to test LLMs' ability to answer legal questions under Australian law. We tested general-purpose implementations of market-leading LLMs (as at the time of testing: February 2024), approximating how a lay user might attempt to answer legal questions using AI rather than a human lawyer.



Key results

  • The models we tested should not be used for legal advice on Australian law without expert human supervision. There are real risks in using them if you don't already know the answer.
  • The best overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 tested relatively similarly.
  • In 2024, even the best-performing LLMs we tested were not consistently reliable when asked to answer legal questions. While these LLMs could have a practical role in helping legal professionals summarize relatively well-understood areas of law, inconsistencies in performance mean that these results still need careful review by someone capable of verifying that they are accurate and correct.
  • For tasks involving critical reasoning, none of the tools we tested (which are publicly available chatbots implementing GPT-4, Gemini 1, Claude 2, Perplexity, and LLaMa 2) can be trusted to produce correct legal advice without expert human supervision. The LLMs we tested frequently produced answers that got the law wrong and/or missed the point of the question, while expressing their answers with falsely inflated confidence. There are therefore real risks in using these tools to generate legal advice if you don't already know the answer.
  • Poor citation remains a major problem for many of the models. For example, some tools demonstrated:
    • an inability to choose authoritative legal sources (cases, legislation or authoritative texts) over non-authoritative sources (such as a law firm publication);
    • a tendency to fabricate (hallucinate) case names;
    • a tendency to name a correct source but attribute a fictitious extract to it, or to choose an incorrect pinpoint quote; or
    • a tendency to cite an entire statute without a pinpoint reference to a specific section.
  • "Infection" by legal analysis from larger jurisdictions with different laws is a major problem for smaller jurisdictions like Australia. In particular, although asked to respond from the perspective of Australian law, many of the responses cited authorities from UK and EU law, or incorporated analyzes of UK and EU law that did not are correct for Australian law.
  • Legal teams within any company considering the use of generative AI technologies should ensure they have safeguards in place to govern how the output can be used. In the legal context, AI results need careful review by someone able to verify that they are accurate and correct, and that they do not contain irrelevant or fictitious quotes.
  • Even if (and when) LLMs reach or exceed parity with the benchmark, the role and importance of the human lawyer will comfortably endure. The ability to answer legal questions succinctly and correctly is only a fraction of what is required in the daily work of an Australian lawyer, whose role today is more akin to that of a strategic advisor.

Who in your organization needs to know about this?

Legal leaders and teams, IT staff, innovation and procurement teams.

The tests: methodology and review


The Allens AI Benchmark is an extension of the LinksAI English Law Benchmark. It consists of 30 questions spanning 10 different practice areas, each of which would normally require the advice of a competent mid-level lawyer specializing in that area of practice. The intention was to test whether AI models can reasonably replicate certain tasks performed by a human lawyer.

While our question set has some questions in common with the LinksAI English Law Benchmark, others are designed to test issues unique to the Australian legal context.

We put the questions to five different models: GPT-4, Gemini 1, Claude 2, Perplexity, and LLaMa 2. We used general-purpose implementations of these LLMs, which are not specially trained or tuned to provide legal advice. Our methodology therefore approximates how a lay user might attempt to perform these tasks using AI instead of a human lawyer.

We asked each of the 30 questions of each AI three times, starting a new session each time. LLMs use probabilistic algorithms to assemble their written output, so repeating each question helps control for this randomness (as illustrated by cases where the same model's answers differed significantly each time a question was asked).
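To make the repetition protocol concrete, here is a minimal sketch in Python. It assumes a hypothetical `ask_model` callable that starts a fresh chat session with whichever chatbot is under test and returns its answer; neither the function nor its interface comes from the benchmark itself.

```python
from typing import Callable

REPETITIONS = 3  # each question is put to each model three times


def collect_responses(questions: list[str],
                      ask_model: Callable[[str], str]) -> dict[str, list[str]]:
    """Ask every benchmark question several times, restarting the session
    between attempts so earlier answers cannot influence later ones."""
    responses: dict[str, list[str]] = {}
    for question in questions:
        attempts = []
        for _ in range(REPETITIONS):
            # ask_model is assumed (hypothetically) to open a brand-new chat
            # per call, approximating a lay user typing into a fresh session.
            attempts.append(ask_model(question))
        responses[question] = attempts
    return responses
```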

Responses were scored by senior lawyers from each practice area. Each response received a score out of 10, comprising the components below (a short scoring sketch follows the list):

  • 5 points for substance (is the answer correct?)
  • 3 for citations (is the answer supported by relevant laws, case law or regulations?)
  • 2 for clarity.
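As a rough illustration of how the 5/3/2 rubric rolls up into an overall mark, the Python sketch below averages marks across questions and repetitions for a single model. The marking itself is done by lawyers; this only shows the arithmetic, and the example marks are invented.

```python
from statistics import mean


def total_score(substance: int, citations: int, clarity: int) -> int:
    """Combine the three rubric components into a single mark out of 10."""
    assert 0 <= substance <= 5 and 0 <= citations <= 3 and 0 <= clarity <= 2
    return substance + citations + clarity


def model_average(marks: list[tuple[int, int, int]]) -> float:
    """Average mark out of 10 across every question and repetition for one model."""
    return mean(total_score(s, c, cl) for s, c, cl in marks)


# Invented example: three repetitions of one question, marked by a reviewer.
example_marks = [(4, 2, 2), (3, 1, 2), (5, 3, 2)]
print(f"Average mark: {model_average(example_marks):.1f} / 10")
```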

The best overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 tested relatively similarly.

LLMs performing at the GPT-4 level could play a practical role in helping legal professionals summarize relatively well-understood areas of law. GPT-4 seems capable, for example, of preparing a sensible first-draft summary of the law in some cases. However, inconsistencies in the performance of even the best-performing model mean that the draft still needs careful review by someone able to verify that it is accurate and correct, and that it does not contain irrelevant or fictitious quotes. Many of the models frequently cited non-authoritative sources, hallucinated case names, and hallucinated quotes from real sources.

For tasks involving critical reasoning, even the best-performing LLMs performed poorly. This finding is consistent with the LinksAI report from October 2023. Furthermore, we found that (as predicted in that report) LLMs were further disadvantaged in the context of a smaller jurisdiction such as Australia: responses frequently adopted analyses from larger jurisdictions (especially the EU and the UK) without recognizing the differences in law.


What's next?



