What might LLMs/generative AI mean for public benefits and the safety net/tech?

There is so much excitement about things like GPT-4, and the Executive Summary Epistemology™ is proliferating: broad, sweeping summaries of things people are saying about things other people are saying, with very little grounding in tactile interactions with the technology.

In general, given just how new this wave of AI is, my perspective is that people should be spending more time experimenting with it on their problems than generating takes. But doing so does require a solid understanding of actual problems, rather than the Imagined Problems (so high-level as to be meaningless) that both drive a lot of work and drive a lot of well-intentioned people to madness.

So here are some scattered thoughts to hopefully inform and encourage more experimentation. I mean them as generative (ha!) — I hope they catalyze your own thoughts for experiments; these are less predictions than provocations.

Replace “a chatbot that knows things” with “a calculator for words” as your anchor mental model. The chat experience is an interface or affordance. The underlying technology breakthrough is software that can process and reason about words much, much more effectively. Similarly, it’s not about these models having the answers. GPT-4 scores in the 88th percentile on the LSAT. That is a test of reasoning, not knowledge. Don’t think of these models as domain experts. Think of them more like an intern in your office who got a 163 on the LSAT and is going to law school in the fall.
Much of the substance of what constitutes “government” is in fact text: law, policy, regulations, guidance, business processes and operating procedures, official letters and notices. A technology that can do orders of magnitude more with text is therefore potentially massively impactful here. Almost by definition, this gives LLMs more potential in interacting with or delivering government than in many other domains.
Many of the sub-tasks of the work of delivering public benefits seem amenable to the application of large language models to help people do this hard work. Eligibility operations are a value chain with concrete work involved. Processing. Verifying. Mapping messy reality to abstract rules. I see many opportunities for large language models to assist the public servants doing that in ways that may increase throughput and decrease the difficulty (and frustration) of parts of that work. Examples just off the top of my head:
- Next-level OCR for documents: OCR is currently good at well-structured tasks like processing a single form that is very common (think a tax return). We likely now have the technology to effectively extract arbitrary information from virtually any paystub, for example, without requiring more than a human review.
- Pulling up applicable rule citations for an edge case / copilot for policy: Public benefits programs represent the accreted complexity of decades. It is very complicated and difficult work to identify the applicable rules in more complicated cases, often an exception to the exception to the exception. A human may have to reason about convoluted logic spread across 3 distinct sources of policy to get to an answer. LLMs’ reasoning capabilities can likely make this much easier.
- Sensemaking, analysis, and prioritization of complaints or appeals
- Automated support for simplifying client-facing language
- Next-generation self-service options that avoid the “talking to a wall” robot experience: Chatbots may not be the core of this technology, but the sub-task of “compose an answer that is more directly responsive to this question (even if that is that you can’t answer it)” is one LLMs handle extremely well.
- LLMs + RPA to streamline interactions with legacy systems: Robotic Process Automation has been an increasingly common shim layer for making common tasks in legacy systems easier or quicker where the task might be time-intensive (many clicks). LLMs likely supercharge this, given that they can take an arbitrary task, click around in a test environment, and generate a reasonable starting place for an RPA script that does the task. (Remember, a system screen is generally a page with text! Text, folks!)
- Lots more
Many of the sub-tasks of interacting with or assisting someone with public benefits are amenable to LLMs. Some examples:
- Triaging and assisting with escalations and appeals of issues
- Lots more
Flowing from the above two points: we may see a divergence in the speed of adoption inside vs. outside government. This has implications worth gaming out and considering deeply. In particular, the scale that software and low- or no-cost user discovery bring could well overwhelm systems whose capacity currently scales only linearly with human staff.
Also related to the above: LLMs may enable a new generation of software-based agents on top of government systems; figuring out how to align incentives in the right direction for those would be a useful conversation to start. The metaphor of “TurboTax for X” is so ubiquitous as to be somewhat annoying at this point. But its ubiquity is a function of how densely it compresses complicated experience information. People see two things: (1) “a simple, guided experience for navigating a complicated form”; (2) pernicious industry rent seeking on top of government services. The only point I seek to make here is that #2 largely came from the incentive design of the Free File program. My prior work building GetCalFresh was also a software-based agent helping people navigate a complicated program, but with fundamentally different incentives. If the cost of developing such agents has collapsed, the more useful question to start asking is: what is a strong interface-access regime that aligns incentives towards public aims? (Corollary: absent this conversation, such agents will likely arise anyway but be managed in a much more ad hoc way, missing opportunities and creating costs and frustration on both sides.)
LLMs are “unstable material” like so many other engineering materials, so strong quality checks, monitoring, and humans in the loop by default are probably necessary to ensure low failure rates. We don’t yet have a good sense of failure rates. In fact they may well be dynamic: some people report GPT-4’s performance changing when they re-run the same tasks, as the model takes in input, is modified, and is aligned for competing goals. This makes monitoring, QA, and human checks critical to any use of these things, particularly in sensitive contexts where the costs of downside risk are significant. (Obligatory note: the status quo’s costs are also worth weighing here.)
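One cheap experiment along these lines is to ask the same question several times and send anything unstable to a person. The sketch below is a minimal illustration under stated assumptions: `model_fn` is a hypothetical callable that takes a prompt and returns an answer string (swap in whatever client you actually use), `make_fake_model` is a scripted stand-in so the example runs without any API, and the 0.8 agreement threshold is an arbitrary placeholder, not a recommendation.

```python
from collections import Counter
from itertools import cycle


def check_consistency(model_fn, prompt, runs=5, min_agreement=0.8):
    """Ask the same question several times and flag unstable answers.

    model_fn is a hypothetical callable (prompt -> answer string).
    """
    answers = [model_fn(prompt) for _ in range(runs)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / runs
    return {
        "answer": top_answer,
        "agreement": agreement,
        # Anything below the threshold goes to a human rather than auto-processing.
        "needs_human_review": agreement < min_agreement,
    }


def make_fake_model(scripted_answers):
    """Scripted stand-in for a real model: replays the given answers in a loop."""
    replies = cycle(scripted_answers)
    return lambda prompt: next(replies)


question = "Is this household eligible?"
stable = check_consistency(make_fake_model(["eligible"]), question)
flaky = check_consistency(make_fake_model(["eligible", "ineligible"]), question)

print(stable)
print(flaky)
```

Agreement across re-runs is only a proxy: a model can be consistently wrong, so a check like this complements, rather than replaces, sampling outputs for human QA.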
Moonshot hope—maybe zero cost code generation solves the “our legacy system is so hard to change” problem. This could be a post all on its own. The short version:
- if the starting point for reducing the cost of change is adding lots of automated tests that characterize the system’s current behavior, and
- generation of such tests has become effectively zero-cost, and
- models in short order get good at reasoning about existing code bases,
- then maybe “the legacy system problem” is about to get much, much easier (if we define “modernizing” as 1. making the changes we need, and 2. making changes that make other changes easier in the future)
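For readers who haven't seen tests that "characterize the system's current behavior," here is a minimal sketch of the idea. Everything in it is an assumption made up for illustration: `legacy_benefit_amount` is a hypothetical stand-in for opaque legacy logic (it is not a real program rule), and `refactored_benefit_amount` is a hypothetical rewrite that accidentally changes behavior. The technique is to record what the existing code does over a grid of inputs, then check any change against that recording.

```python
import itertools


def legacy_benefit_amount(household_size, monthly_income):
    """Hypothetical stand-in for opaque legacy logic (not a real program rule)."""
    amount = 200 * household_size - monthly_income * 3 // 10
    # Quirk buried in the "legacy" code: small households never get less than 20.
    if household_size <= 2 and amount < 20:
        return 20
    return max(amount, 0)


def record_snapshot(func, input_grid):
    """Run func over every input and record what it currently returns."""
    return {args: func(*args) for args in input_grid}


def find_regressions(func, snapshot):
    """Return {inputs: (recorded, actual)} wherever func departs from the snapshot."""
    regressions = {}
    for args, expected in snapshot.items():
        actual = func(*args)
        if actual != expected:
            regressions[args] = (expected, actual)
    return regressions


# Household sizes 1-5 crossed with monthly incomes 0-3000 in steps of 250.
INPUT_GRID = list(itertools.product(range(1, 6), range(0, 3001, 250)))
SNAPSHOT = record_snapshot(legacy_benefit_amount, INPUT_GRID)


def refactored_benefit_amount(household_size, monthly_income):
    """Hypothetical 'cleaned up' rewrite that silently drops the minimum-benefit quirk."""
    return max(200 * household_size - monthly_income * 3 // 10, 0)


regressions = find_regressions(refactored_benefit_amount, SNAPSHOT)
for args, (expected, actual) in sorted(regressions.items()):
    print(f"inputs={args}: legacy returned {expected}, rewrite returns {actual}")
```

The snapshot does not say whether the legacy behavior is correct, only what it is. That is the point: it turns "did my change alter anything?" into a question the computer can answer, and generating the input grid and assertions is the kind of work that may now be close to free.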

Overall: I’m cautiously optimistic. Zooming out, what are the fundamentals that make me optimistic? We have complicated programs that are largely composed of text and a fundamental technology breakthrough about processing and reasoning about text. Of course things are more complicated; these are all complex, decentralized systems at the end of the day. Complex systems are intrinsically hazardous. Things can and will get messy. But it seems to me that these fundamentals have net positive implications for reducing burden (on many different actors) and making things better overall.

But—all of this is fairly low confidence unless and until I see these things working against concrete problems! A 90% success rate vs. a 20% success rate will have very different implications.

So my final exhortation: take these thoughts as inspiration for tests and experiments to run, not as opinions or prognostications to evaluate. (And share what you find!)
