Logically: There is a shared reachable set and a thread-local grey set. To trace an object in the grey set: 1. Attempt to add it to the reachable set. If it was already in the reachable set, ignore it, as someone else is already tracing it (possibly mark it as a popular object). Otherwise: 2. Consider each of the objects it points to. If any of them is already in the reachable set or the grey set, ignore it. Otherwise: 3. Add them to the grey set. In the implementation: The thread-local sets are sparse bitmaps; probably, a two-level tree will do just fine; in a pinch, maybe go for three levels with a cache. The grey and reachable sets are divided into granules of, say, 128 bits (should probably be at least 128 bits, as that's as much as we can CAS at a time). When adding an object to the grey set, we also check if any other bit in its granule is set; if not, add the granule to the grey queue. (A pointer to a granule can be represented in just 32 bits.) Sort the grey queue before traversing it Add elements of the grey set to the reachable set one granule at a time rather than one object at a time to take advantage of bit-parallelism in CAS. Minor point: The grey queue doesn't have to be _fully_ sorted, but one should at least make an effort; perhaps make it an LSM. It can be partially sorted, but an unsorted subsequence should cover at most an L2-block of heap. (Where an L2-block means either something slightly smaller than L2--in the event that L2 can be convinced to use pure LRU cache replacement--or else the size of a single L2 way, or perhaps a couple of them.) Quoth hayley: > The ZGC developers found that queues would grow in rare cases, without marking the bitmap before enqueuing. I think consulting the grey set solves this. How to implement the tree? 2x vpermi2d top level -> 64 512 mid level -> 4096 (= 32kb, amd's auto big page size) 8kb bitmap -> 8192 8 bits/byte -> 8 16 bytes marked/bit -> 16 2^38 byte heap One bitmap page covers 8192*8*16 = 1mb = l2 Nominally we would like to sort somewhat intra-bitmap (intel/amd way size is 64k), but still good enough to ~eagerly trace what falls inside the current bitmap page. That allows us to trace a given bitmap page to exhaustion, and then free it perfunctorily -> solved that problem. Supppose that we _increase_ the min object size to 4 words (except for conses--which go on special cons-only pages anyway;--treat em special), and at the same time _decrease_ alignment to 1 word? That should decrease fragmentation and at the same time increase bitmap density (since a single bit suffices to mark 4 words). Some extra bookkeeping is required for the marking (forget--but rather straightforward). Also alignment for dcas (surely there is a bit to spare; maybe more than one, and we can have more alignment constraints?).