What is Hackers' Pub?

Hackers' Pub is a place for software engineers to share their knowledge and experience with each other. It's also an ActivityPub-enabled social network, so you can follow your favorite hackers in the fediverse and get their latest posts in your feed.


FormalMATH: Benchmarking formal mathematical reasoning of large language models. ~ Zhouliang Yu et al. arxiv.org/abs/2505.02735


FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.



"Watch your thoughts, for they will become your words. Watch your words, for they will become your actions. Watch your actions, for they will become your habits. Watch your habits, for they will become your destiny." ~ Mahatma Gandhi (1869-1948).


Input on the English language:

In Danish we have a term called the "not test", and we sometimes say that a political view does not pass the not test if no one would take the opposite view.

For instance, if a politician campaigns on wanting children to live happy lives, we may say "it does not pass the not test", because no candidate is running on NOT wanting children to live happy lives.

Is there an established English term for such a "not test"?


- You really want to ban all of it?

No, that's not the case. It's worth taking a second look: our draft law is balanced and leaves room for advertising pillars (Litfaßsäulen), posters, and much more.

What we want to ban is advertising that is genuinely disruptive, e.g. screens in public street space.

hamburg-werbefrei.de/gesetzent


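:exclamation_sign: Parody alert

:syuiloneko_face: "Murakami of io: Super Deluxe! Revenge of Ai Knight!"
:crying_ai: "Alright, spare mask!"
:syuiloneko_face: "Uh-oh, this is bad, nya! Murakami-san is getting away on the Nya-Wheelie, nya!"
:angry_ai: "What!? You're not getting away!"
:syuiloneko_face: "Waaaait!!"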

RE:
https://misskey.io/notes/a7623tse7vg30jjg


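"Chinese sexual-minority information account 'Voice of Tongzhi' (同志之聲) renamed; sexual-minority organizations again face 'foreign forces' accusations | Initium Media (端傳媒)" theinitium.com/article/2025050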

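> Yet the space for sexual minorities to exist has grown ever narrower. On 6 July 2021, the WeChat official accounts of university LGBTQ+ societies, including those at Tsinghua University, Peking University, Fudan University, and Renmin University of China, were suspended en masse, and all of their past content was wiped.

> That same month, several long-running sexual-minority nonprofit organizations were also ordered to change their names. The "Gay Equal Rights Promotion Association" (同志平等權益促進會), which had operated for eight years, became "Tongcu Online" (同促在線); the "Beijing LGBT Center" (北京同志中心), founded in 2008, became "Beitong Culture" (北同文化); and "PFLAG China" (同性戀親友會) was likewise renamed "Outstanding Partners" (出色夥伴).

> Since then, platforms have handled LGBT topics more covertly, rarely issuing explicit bans and instead suppressing content through technical means such as throttling reach, down-ranking, and pulling topics from trending lists. In an information environment dominated by recommendation algorithms, this silent approach is even harder to detect, and its boundaries and standards are difficult to pin down.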
