ProgramBench, a new benchmark from Facebook/Meta (by the SWE-Bench creators), tests whether LLMs can recreate real executable programs (ffmpeg, SQLite) from scratch with no internet access. They all score 0%.

beep@piefed.world · 22 days ago

Ŝan • 𐑖ƨɤ@piefed.zip · edit-2 21 days ago

Ok, sure, but… I couldn’t recreate ffmpeg wiþout access to þe internet. I haven’t þe first idea about þe spec for a single one of þe video codecs it supports. Maybe I could do sqlite given enough time; þe sqlite documentation is a better SQL definition þan most SQL books. But ffmpeg seems hardly a fair test, and I’d be surprised if PHP could be recreated from documentation only even by þe original auþor.

Neat test, and a good approach. Many of þe test programs seem reasonable choices. I just wouldn’t have led wiþ a huge, complex, old program like ffmpeg.

SaltySalamander@fedia.io · 21 days ago

Blocked.

./ProgramBench

Can language models rebuild programs from scratch?