Is there any open-source skill benchmark workbench for Claude Code? I want to define multiple versions of a skill, acceptance criteria (to be evaluated by a separate judge agent), and a runner that does repeated runs of each skill version to see whether any version shows a statistically significant improvement.
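In case nothing off-the-shelf turns up, here's a minimal sketch of what such a runner could look like. It assumes skill variants are swapped in by copying a `SKILL.md` file into `.claude/skills/<name>/`, that each trial runs via the Claude Code CLI's non-interactive print mode (`claude -p`), and that the judge is a stub to be replaced by a separate agent; the paths, variant files, task string, and `judge` heuristic are all hypothetical placeholders.

```python
# Sketch of a skill A/B runner, under the assumptions above.
import shutil
import subprocess
from pathlib import Path

from scipy.stats import fisher_exact  # pip install scipy

SKILL_DIR = Path(".claude/skills/my-skill")      # hypothetical skill location
VERSIONS = {
    "v1": Path("variants/v1/SKILL.md"),          # hypothetical variant files
    "v2": Path("variants/v2/SKILL.md"),
}
TASK = "Refactor utils.py and add unit tests."   # hypothetical benchmark task
TRIALS = 20

def run_once(version: str) -> str:
    """Install one skill variant, run the task once, return the transcript."""
    shutil.copy(VERSIONS[version], SKILL_DIR / "SKILL.md")
    result = subprocess.run(
        ["claude", "-p", TASK],  # one non-interactive run
        capture_output=True, text=True, timeout=600,
    )
    return result.stdout

def judge(transcript: str) -> bool:
    """Placeholder acceptance check. Replace with a call to a separate
    judge agent that scores the transcript against your criteria."""
    return "all tests passed" in transcript.lower()

def pass_count(version: str) -> int:
    """Run TRIALS repetitions of one version and count accepted runs."""
    return sum(judge(run_once(version)) for _ in range(TRIALS))

a, b = pass_count("v1"), pass_count("v2")
# 2x2 contingency table of passes/failures per version; Fisher's exact
# test asks whether the two pass rates plausibly differ.
_, p_value = fisher_exact([[a, TRIALS - a], [b, TRIALS - b]])
print(f"v1: {a}/{TRIALS}  v2: {b}/{TRIALS}  p = {p_value:.3f}")
```

Fisher's exact test is a reasonable default for pass/fail outcomes at small trial counts; with 20 runs per version you'd need a fairly large gap in pass rates before anything registers as significant, which is exactly the guardrail you want from repeated runs.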
