Canonical numeric encoding (Cantor pairing)

3A-LLM — An Alternative Axiomatic Algebraic LLM

The following provides some of the mathematical details of the canonical numeric encoding. It can be skipped without the main ideas of the article being misunderstood.

Beyond conceptual construction, A-LLM supports a canonical and reversible numeric encoding of semantic structures, enabling storage, indexing, synchronization, and versioning of knowledge graphs while preserving semantic provenance and supporting structural comparison across graph instances.

Cantor's bijective pairing function \pi:\mathbb{N}×\mathbb{N}→\mathbb{N} is given by \pi(x,y)=(x+y)(x+y+1)/2+y [HopcroftUllman1979]. Let \mathit{pid} (semantic prime id) be an injective assignment of natural numbers to atomic concepts and operators (the primitive vocabulary). Typically, all entities referred to by proper nouns---such as Paris_(France), Moon_(Earth), or Einstein_(Albert)---are also assigned a unique \mathit{pid} and can then be used in compounds like \mathit{highest}(\mathit{mountain})(\mathit{Moon_(Earth)}). The Cantor ID for an arbitrary concept is \mathit{cid}(c)=\mathit{pid}(c) for atoms and \mathit{cid}(a(b))=\pi(\mathit{cid}(a),\mathit{cid}(b)) for compounds; both a and b may be atoms or compounds, so the recursion is well-defined for arbitrarily nested expressions. Thus every concept is assigned a unique natural number z=\pi(x,y), its Cantor ID. Example: \mathit{pid}(\mathit{air})=1, \mathit{pid}(\mathit{pressure})=2, \mathit{pid}(\mathit{instrument})=3 yields \mathit{cid}(\mathit{pressure}(\mathit{air}))=\pi(2,1)=7 and \mathit{cid}(\mathit{instrument}(\mathit{pressure}(\mathit{air})))=\pi(3,7)=62. The inverse: for z∈\mathbb{N}, set w=\lfloor(\sqrt{8z+1}-1)/2\rfloor, t=w(w+1)/2, y=z-t, x=w-y; then (x,y) recovers the pair (e.g. z=62⇒(3,7)). The encoding makes the structure machine-readable and supports the following advantages.

The representation with Cantor IDs is beneficial for several reasons. (i) Canonicity and collision-freedom: the same construction yields the same integer everywhere, so concept identity reduces to integer equality. (ii) Full reversibility: from the integer, the full acyclic directed graph and construction history can be recovered. (iii) Structure sensitivity: the Cantor ID reflects the nesting of operators and arguments, so structural comparison and subconcept checks can be implemented on the numeric representation. (iv) Scalable concept space: the encoding supports an \aleph_0 concept space without length limits. (v) Database and system compatibility: integers serve as primary keys, hash keys, and serialization format; they are stable across versions and systems. (vi) Efficient storage and transmission: arbitrarily complex concepts are represented by a single integer, reducing storage and enabling fast comparison without string or graph matching. (vii) Versioning and diffing: different graph instances can be compared and synchronized by comparing Cantor IDs. Together, these properties make the Cantor-ID representation a practical and theoretically well-founded choice for A-LLM.

3A-LLM supports a canonical, reversible numeric encoding for storage, indexing, and versioning. Cantor's bijective pairing \pi:\mathbb{N}×\mathbb{N}→\mathbb{N} is \pi(x,y)=\frac{(x+y)(x+y+1)}{2}+y [HopcroftUllman1979]. Let \mathit{pid} assign natural numbers to atoms and operators; then \mathit{cid}(c)=\mathit{pid}(c) for atoms and \mathit{cid}(a(b))=\pi(\mathit{cid}(a),\mathit{cid}(b)) for compounds. Example: \mathit{pid}(\mathit{air})=1, \mathit{pid}(\mathit{pressure})=2, \mathit{pid}(\mathit{instrument})=3 gives \mathit{cid}(\mathit{pressure}(\mathit{air}))=\pi(2,1)=7 and \mathit{cid}(\mathit{instrument}(\mathit{pressure}(\mathit{air})))=\pi(3,7)=62. The inverse: for z∈\mathbb{N}, set w=\lfloor(\sqrt{8z+1}-1)/2\rfloor, t=w(w+1)/2, y=z-t, x=w-y; then (x,y) recovers the pair (e.g. z=62⇒(3,7)), so the full construction history is recoverable. The representation with Cantor IDs is beneficial for several reasons. (i) Canonicity and collision-freedom: the same construction yields the same integer everywhere. (ii) Full reversibility: from the integer, the full construction history can be recovered. (iii) Structure sensitivity: the Cantor ID reflects the nesting of operators and arguments. (iv) Scalable concept space: the encoding supports an \aleph_0 concept space. (v) Database and system compatibility: integers serve as primary keys, hash keys, and serialization format. (vi) Efficient storage: arbitrarily complex concepts are represented by a single integer. (vii) Versioning and diffing: different graph instances can be compared by comparing Cantor IDs.

Extension: deriver.app

This chapter consolidates material from the allm LaTeX sources (main40.tex, main50.tex, main97.tex). In Deriver documentation, triples, rules, and the Workbench align with the explicit conceptual structure described here.

Source text: parallel project allm/ (LaTeX); HTML generated via taoke/tools/build-3allm-from-tex.php.