Logically:

There is a shared reachable set and a thread-local grey set.

To trace an object in the grey set:
1. Attempt to add it to the reachable set.  If it was already in the reachable set, ignore it, as someone else is already tracing it (possibly mark it as a popular object).  Otherwise:
2. Consider each of the objects it points to.  If any of them is already in the reachable set or the grey set, ignore it.  Otherwise:
3. Add them to the grey set.

In the implementation:

The thread-local sets are sparse bitmaps; probably, a two-level tree will do just fine; in a pinch, maybe go for three levels with a cache.

The grey and reachable sets are divided into granules of, say, 128 bits (should probably be at least 128 bits, as that's as much as we can CAS at a time).
When adding an object to the grey set, we also check if any other bit in its granule is set; if not, add the granule to the grey queue.
(A pointer to a granule can be represented in just 32 bits.)
Sort the grey queue before traversing it
Add elements of the grey set to the reachable set one granule at a time rather than one object at a time to take advantage of bit-parallelism in CAS.

Minor point:
The grey queue doesn't have to be _fully_ sorted, but one should at least make an effort; perhaps make it an LSM.

It can be partially sorted, but an unsorted subsequence should cover at most an L2-block of heap.
(Where an L2-block means either something slightly smaller than L2--in the event that L2 can be convinced to use pure LRU cache replacement--or else the size of a single L2 way, or perhaps a couple of them.)


Quoth hayley:

> The ZGC developers found that queues would grow in rare cases, without marking the bitmap before enqueuing.

I think consulting the grey set solves this.


How to implement the tree?
2x vpermi2d top level -> 64
512 mid level         -> 4096 (= 32kb, amd's auto big page size)
8kb bitmap            -> 8192
8 bits/byte           -> 8
16 bytes marked/bit   -> 16
2^38 byte heap
One bitmap page covers 8192*8*16 = 1mb = l2
Nominally we would like to sort somewhat intra-bitmap (intel/amd way size is 64k), but still good enough to ~eagerly trace what falls inside the current bitmap page.
That allows us to trace a given bitmap page to exhaustion, and then free it perfunctorily -> solved that problem.


Supppose that we _increase_ the min object size to 4 words (except for conses--which go on special cons-only pages anyway;--treat em special), and at the same time _decrease_ alignment to 1 word?
That should decrease fragmentation and at the same time increase bitmap density (since a single bit suffices to mark 4 words).  Some extra bookkeeping is required for the marking (forget--but rather straightforward).  Also alignment for dcas (surely there is a bit to spare; maybe more than one, and we can have more alignment constraints?).