Dave Guarino

Some ideas on more safely prototyping LLM products

'a computer mainframe with duct tape on it in cyberpunk style' (Dalle)

I really enjoyed two recent posts about engineering and LLMs:

  1. Mitchell Hashimoto's "Prompt Engineering vs. Blind Prompting"
  2. Apenwarr's "System design 2: what we hope we know"

A theme that struck me as underlying both posts is nicely captured in a (rough) quote from the latter author's engineering professor:

Engineering isn't about building a paperclip that will never break, it's about building a paperclip that will bend enough times to get the job done, at a reasonable price, in sufficient quantities, out of attainable materials, on schedule.

Put another way: engineering is about building things that meet people's needs at an appropriate cost and an appropriate level of safety — there is no 100% safety level, and the work is in large part applying methods to get at the empirical math of tradeoffs across cost and safety.

Unsurprisingly, both these posts were concerned with Large Language Models (LLMs) such as ChatGPT, GPT-4, Llama, etc.

Mitchell's post gestured towards a few examples of how to safely develop on top of such innately unstable material, under the umbrella of "Trust But Verify and Continuous Improvement" - for example:

For our calendar application example, we may want to explicitly ask users: "is this event correct?" And if they say "no," then log the natural language input for human review. Or, we can maybe do better to automatically track any events our users manually change after our automatic information extraction.
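
To make that concrete, here is a minimal sketch of that trust-but-verify loop, assuming a hypothetical `call_llm` stand-in for whatever model call you're actually using; the point is simply that unconfirmed extractions get queued for human review rather than silently saved.

```python
import json
from datetime import datetime, timezone

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned answer here."""
    return '{"title": "Dentist", "start": "2023-05-02T15:00:00"}'

def extract_event(user_text: str) -> dict:
    raw = call_llm(f"Extract a calendar event as JSON from: {user_text}")
    return json.loads(raw)

def confirm_with_user(event: dict) -> bool:
    # In a real UI this would render the parsed event and ask "Is this correct?"
    answer = input(f"Create this event? {event} [y/n] ")
    return answer.strip().lower().startswith("y")

def log_for_review(user_text: str, event: dict) -> None:
    # Rejected extractions go to a file a human can review; over time this
    # becomes an evaluation / fine-tuning set of input-output pairs.
    with open("review_queue.jsonl", "a") as f:
        f.write(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "input": user_text,
            "model_output": event,
        }) + "\n")

if __name__ == "__main__":
    text = "dentist appointment next Tuesday at 3pm"
    event = extract_event(text)
    if confirm_with_user(event):
        print("Saved:", event)
    else:
        log_for_review(text, event)
        print("Flagged for human review.")
```

The rejected pairs in `review_queue.jsonl` are also exactly the kind of input-output data you'd want later for evaluation or fine-tuning.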

Now I very squarely identify as a product engineer when it comes to building software. What that means is that I'm most interested in building things where the undefined variable is the user's needs and preferences (or call those the requirements), and in applying the right set of methods to meet the goal of building the right thing. (I'd juxtapose this with what might be considered more purely technical work aimed at building the thing right - "the thing" is much better defined at that point!)

So these posts got me thinking it might be useful to put fingers-to-keyboard on some of the methods that I think might be useful in starting to build products, prototypes, and services on top of LLMs.

Nota bene: AI safety is wholly its own field and I'm not an expert. But I do have the benefit of having built actually-in-production-and-use things on top of unstable material in domains and contexts where the potential downside human cost is not small. I hope this catalyzes more thinking and exchange of ideas rather than being taken as gospel.

Some ideas for building on top of LLMs with increased safety

  1. Let the user review and tell you if it's right: this is Mitchell's example above of extracting a date and asking the user if it's right before using it (sketched above, after the quote)
  2. Delayed-response for 100% manual human review: for example, start prototyping a service as one where a user can email and get a response within 24 hours; every LLM-generated response is reviewed by a human before it goes to the actual end user, and human reviewers make corrections while building up a training data set of input-output pairs for evaluation and fine-tuning (sketched below, after this list)
  3. Standardize and "whitelist" outputs that can be returned to users safely: if you have some common responses that you know are generally never a bad idea for the end user, you might whitelist pre-vetted responses, use the model only to select among them (or to say none are appropriate and route to a human), and return that pre-vetted response; an example from a domain I've worked in: when helping people navigate SNAP (the food stamp program), it's often a quite safe escape hatch to tell them to call their local agency and give them the phone number (sketched below, after this list)
  4. Use LLM output to try to handle edge cases/error states, but with user review: for example, if someone enters info that fails some sort of validation, instead of just blocking them until they fix it themselves, try using an LLM to propose a fix and show the user the output to review first (sketched below, after this list)
  5. Try using a separate model as a safety check: if you have some concern that certain responses might not be safe, try having a separate model review them against some rules; again, you can be very aggressive in flagging anything that seems even possibly unsafe by asking the checker to approve only when certain criteria seem definitively met (sketched below, after this list)
  6. (Weird but true!) Limit user volume to always be able to have a human review!: this may sound somewhat obvious, but I think people tend to under-appreciate how the way users find and get to use a thing is a key variable for safety, and you can control some of that by design (for example, only sharing it with certain users to start, having occasional intentional downtime, offering it only for a limited testing period that is communicated up front, using ads or other links for user acquisition that can be toggled on/off and point to a URL that cannot otherwise be publicly discovered, etc.)
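
For idea 2 above, a rough sketch of the delayed-response flow might look like the following, where `call_llm` and `send_email` are hypothetical stand-ins and the "admin UI" is just an inline prompt; nothing is sent until a human has approved or edited the draft, and every approved pair is saved as future evaluation/fine-tuning data.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "You can check your SNAP balance by calling the number on your EBT card."

def send_email(to: str, body: str) -> None:
    """Hypothetical stand-in for whatever actually delivers the reply."""
    print(f"[sending to {to}]\n{body}")

def draft_reply(incoming: dict) -> dict:
    draft = call_llm(f"Draft a helpful reply to this message: {incoming['body']}")
    return {"to": incoming["from"], "input": incoming["body"], "draft": draft}

def human_review(item: dict) -> dict:
    # A real system would use a small admin UI; here the reviewer edits inline.
    print("Draft reply:\n", item["draft"])
    edited = input("Press Enter to approve, or type a corrected reply: ").strip()
    item["final"] = edited or item["draft"]
    return item

def record_pair(item: dict) -> None:
    # Every approved (input, output) pair becomes evaluation / fine-tuning data.
    with open("training_pairs.jsonl", "a") as f:
        f.write(json.dumps({"input": item["input"], "output": item["final"]}) + "\n")

if __name__ == "__main__":
    msg = {"from": "user@example.com", "body": "How do I check my SNAP balance?"}
    reviewed = human_review(draft_reply(msg))
    record_pair(reviewed)
    send_email(reviewed["to"], reviewed["final"])
```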
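
For idea 3 above, here is a minimal sketch of the whitelisting approach, again with a hypothetical `call_llm` stand-in: the model only ever picks a letter from a set of pre-vetted responses, and anything else (including "NONE") routes to a human.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "B"

# Pre-vetted responses a human has already approved as safe to send.
VETTED_RESPONSES = {
    "A": "You can apply for SNAP online through your state's benefits portal.",
    "B": "The safest next step is to call your local SNAP office; the number is on your state's website.",
    "C": "You can upload verification documents through your online account.",
}

def answer(question: str) -> str | None:
    options = "\n".join(f"{k}: {v}" for k, v in VETTED_RESPONSES.items())
    choice = call_llm(
        "Pick the single best response letter for this question, "
        f"or reply NONE if none fit.\nQuestion: {question}\nOptions:\n{options}"
    ).strip().upper()
    # Anything outside the whitelist (including "NONE") returns None,
    # which routes the question to a human instead.
    return VETTED_RESPONSES.get(choice)

if __name__ == "__main__":
    reply = answer("Who do I talk to about my food stamps case?")
    print(reply if reply else "Escalating to a human reviewer.")
```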
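
For idea 4 above, a small sketch of using an LLM to repair a validation failure while keeping the user in the loop; the phone-number format and `call_llm` are hypothetical, and the model's suggestion is re-validated before the user is even asked.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return "(555) 123-4567"

PHONE_RE = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")

def normalize_phone(raw: str) -> str | None:
    if PHONE_RE.match(raw):
        return raw
    suggestion = call_llm(f"Reformat this US phone number as (XXX) XXX-XXXX: {raw}")
    # Don't trust the fix blindly: re-validate it, then let the user confirm.
    if PHONE_RE.match(suggestion):
        ok = input(f"Did you mean {suggestion}? [y/n] ").strip().lower()
        if ok.startswith("y"):
            return suggestion
    return None  # fall back to asking the user to re-enter the value

if __name__ == "__main__":
    print(normalize_phone("5551234567"))
```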
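
And for idea 5 above, a sketch of a separate model acting as an aggressive safety check: the checker is instructed to approve only when explicit criteria are clearly met, so anything borderline gets held for a human. Both functions here are hypothetical stand-ins.

```python
def generate(question: str) -> str:
    """Hypothetical stand-in for the model that drafts a reply."""
    return "You may qualify for expedited SNAP; your local office can tell you how to ask for it."

def safety_check(draft: str) -> str:
    """Hypothetical stand-in for a second model used only as a filter.

    Imagine a prompt like: "Reply APPROVE only if this draft makes no
    eligibility, legal, or medical determinations and points the person to
    an official source. Otherwise reply HOLD."
    """
    return "APPROVE"

def respond(question: str) -> str:
    draft = generate(question)
    verdict = safety_check(draft).strip().upper()
    if verdict == "APPROVE":
        return draft
    # Anything the checker doesn't explicitly approve goes to a human queue.
    return "Thanks - we've sent your question to a member of our team."

if __name__ == "__main__":
    print(respond("Can I get food stamps faster if I have no income?"))
```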

All of these are ways that I think one could start to offer services to real people (the greatest source of learning, to put my own biases on the table!) while having a lot more control over potential negative cases.

Have other ideas? Have ways these could be better or further de-risked? Drop me a note and I'll happily add yours to this post or a follow-up with attribution so that we can collectively get smarter at this and make more things people need and want on top of these (pretty darn impressive if unstable) materials.


You can subscribe to new posts from me by email (Substack) or RSS feed.