i guess i could define acceptance criteria and then have a benchmark agent try different versions in a consistent harness with an inner agent multiple times and grade outcomes. is that the way to not go mad? waiting manually for the dice roll and trying to guess if this time it worked better sucks
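
for what it's worth, a minimal sketch of that loop could look like the code below. everything in it is an assumption for illustration: `Criterion`, `run_trials`, `compare`, and `fake_agent` are hypothetical names, and `fake_agent` is a stand-in that simulates a flaky inner agent instead of calling a real one. the idea is just: fixed criteria, N runs per variant under the same seeds, pass rate out.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: an acceptance criterion is a named predicate over the agent's output.
@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str], bool]

# Hypothetical signature for the inner agent: (variant name, rng) -> output text.
Agent = Callable[[str, random.Random], str]

def run_trials(agent: Agent, variant: str, criteria: List[Criterion],
               trials: int, seed: int = 0) -> float:
    """Run one variant `trials` times; return the fraction of runs that pass every criterion."""
    passes = 0
    for i in range(trials):
        # Trial i gets the same seed for every variant, so variants face the same "dice".
        rng = random.Random(seed * 1_000_003 + i)
        output = agent(variant, rng)
        if all(c.check(output) for c in criteria):
            passes += 1
    return passes / trials

def compare(agent: Agent, variants: List[str], criteria: List[Criterion],
            trials: int = 20, seed: int = 0) -> Dict[str, float]:
    """Pass rate per variant, all measured in the same harness."""
    return {v: run_trials(agent, v, criteria, trials, seed) for v in variants}

# Assumption: a fake inner agent with made-up success rates, only to make the sketch runnable.
def fake_agent(variant: str, rng: random.Random) -> str:
    success_rate = {"prompt_a": 0.4, "prompt_b": 0.8}[variant]
    return "DONE: tests pass" if rng.random() < success_rate else "ERROR: gave up"

criteria = [
    Criterion("finished", lambda out: out.startswith("DONE")),
    Criterion("tests_pass", lambda out: "tests pass" in out),
]

if __name__ == "__main__":
    for variant, rate in compare(fake_agent, ["prompt_a", "prompt_b"], criteria).items():
        print(f"{variant}: {rate:.0%} of runs met all criteria")
```

with a real agent the seeds won't pin down the model's randomness, so the number of trials matters more than the seeding; the point is to compare pass rates over many runs rather than judging a single roll.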

