Uploaded image for project: 'Minecraft: Java Edition'
  1. Minecraft: Java Edition
  2. MC-224893

Replacing Chunk Futures causes several issues

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Reopened
    • Resolution: Unresolved
    • Affects Version/s: 21w17a, 21w18a, 1.17 Pre-release 3
    • Fix Version/s: 1.17 Pre-release 3
    • Labels:
      None
    • Confirmation Status:
      Confirmed
    • Category:
      World generation
    • Mojang Priority:
      Normal

      Description

      It turns out that MC-214808, MC-216148 and MC-214793 are all caused by the same code change introduced in the 1.17 snapshots. In the following I will describe this change and explain how it leads to each of these bugs. I post this as a separate report as it affects multiple other reports, so it doesn't really fit anywhere as a comment.

      World generation/loading stores one Future for each generation stage of a chunk in a corresponding ChunkHolder. These Futures are created once something requests the chunk in the respective stage, under the condition that the chunk ticket level is sufficiently large to issue generation into that stage. What changed in the 1.17 snapshots is the handling of demotion of chunk ticket levels.
      In 1.16 demoting the chunk ticket level completed all pending Future s above the new level with a special UNLOADED_CHUNK marker, while leaving already finished Future s untouched.

      net.minecraft.server.world.ChunkHolder.java
      protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
        ...
        Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});
      
        for(int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
          completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
          if (completableFuture != null) {
            completableFuture.complete(either);
          } else {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
          }
        }
        ...
      }
      

      This has two important consequences:

      1. Once a generation stage finishes, the Future stays completed with the correct result until the chunk is unloaded
      2. A Future that is still pending at some point can only be aborted (completed with UNLOADED_CHUNK, which does not abort the actual generation step) if the chunk ticket level drops too low, otherwise it will complete with a valid chunk. This requires some small argument since the Future cannot only be aborted directly by the above code, but also indirectly if any of its dependencies is aborted (completed with UNLOADED_CHUNK). However, the latter cannot happen if the ticket level stays high enough (after observing the pending Future), by construction of the ticket system.

      In the 1.17 snapshots, this handling of demotion fundamentally changed. Instead of directly aborting the pending Future s above the new level, all Future s, including already finished ones, are replaced with a new one that completes with an UNLOADED_CHUNK once the original Future completes.

      net.minecraft.server.world.ChunkHolder.java
      protected void tick(ThreadedAnvilChunkStorage chunkStorage) {
        ...
        Either<Chunk, ChunkHolder.Unloaded> either = Either.right(new ChunkHolder.Unloaded() {...});
      
        for(int i = bl2 ? chunkStatus2.getIndex() + 1 : 0; i <= chunkStatus.getIndex(); ++i) {
          completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
          if (completableFuture != null) {
            this.futuresByStatus.set(i, completableFuture.thenApply(__ -> either));
          } else {
            this.futuresByStatus.set(i, CompletableFuture.completedFuture(either));
          }
        }
        ...
      }
      

      As a result, both of the above properties are no longer valid:

      1. An already finished stage might be replaced with an UNLOADED_CHUNK_FUTURE
      2. If the chunk ticket level drops below, say, t then the corresponding Future is not directly aborted, but instead replaced with a Future that will be lazily aborted. If the ticket level is now raised above t again while the original Future is still running, then no new Future will be issued for this generation stage.
        net.minecraft.server.world.ChunkHolder.java
        public CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> getChunkAt(ChunkStatus targetStatus, ThreadedAnvilChunkStorage chunkStorage) {
          int i = targetStatus.getIndex();
          CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture = (CompletableFuture)this.futuresByStatus.get(i);
          if (completableFuture != null) {
            Either<Chunk, ChunkHolder.Unloaded> either = (Either)completableFuture.getNow((Object)null);
            if (either == null || either.left().isPresent()) {
              return completableFuture;
            }
          }
          if (getTargetStatusForLevel(this.level).isAtLeast(targetStatus)) {
            CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> completableFuture2 = chunkStorage.getChunk(this, targetStatus);
            this.combineSavingFuture(completableFuture2, "schedule " + targetStatus);
            this.futuresByStatus.set(i, completableFuture2);
              return completableFuture2;
          } else {
            return completableFuture == null ? UNLOADED_CHUNK_FUTURE : completableFuture;
          }
        }
        

      This was fine in 1.16 since the Future could only be pending if it was not already (indirectly) aborted (which is exactly point 2 above). However, in 1.17 the Future is already indirectly aborted (it is replaced by a lazily aborted one) so it will always produce an UNLOADED_CHUNK even if the ticket level does not fall again. This breaks property 2.

      It turns out that this code was changed in 21w06a, which is precisely the version where all 3 bugs were reported. (I actually checked the precise version retrospectively after figuring out the cause for these issues).

      So, how does violation of these 2 properties cause the 3 linked bugs?

      1. Ticking-Futures might never complete

      Every time the ticket level is raised high enough, a chunk is promoted to BORDER, TICKING and ENTITY_TICKING state respectively. This will create additional Future s that depend on the generation Future s in some small neighborhood of the chunk. Upon completion, these will execute the respective tasks for promoting the chunk to the respective state, like registering tick schedulers, marking the chunk tickable/entity-tickable and sending the chunk data to players.
      Assuming property 2 above, these Future s will always complete with a valid chunk at some point, unless the ticket level drops too low. In the latter case, the Future s get recreated once the chunk is promoted again, so this is not an issue.
      Now, in the 1.17 snapshots property 2 is violated and hence these special Future s might never complete (with a valid chunk) since they might depend on lazily aborted Future s upon creation. Hence the respective promotion tasks might never execute. In particular, the ticking-Future, which is responsible for sending the chunk data to players, might not execute, hence causing MC-214808.
      Note that all 3 Future s can fail independently. For example, I did observe chunks where the ticking-Future completed but the entity-ticking-Future did not, causing the chunk to load but entities to get stuck in these chunks. Also note that this issue can similarly lead to the "Chunk not there when requested: " error when the ServerChunkManager queries a lazily aborted Future for the FULL stage. I actually observed this crash a few time during debugging.

      In order to provoke this issue, one can try the following steps. Due to the random nature of these bugs, several attempts might be needed. Run

      • /tp ~1000 ~ ~
      • /tp ~-1000 ~ ~
      • /tp ~1000 ~ ~
        in quick succession, e.g., with one tick in-between, or while pausing the server thread through the debugger. The first teleport will create the ChunkHolder s at the target location and create the generation Future s up to FULL stage. The second tp will then lazily abort these Future s again and the third and final tp will create the ticking-Future s on the lazily aborted but still pending generation Future s, hence causing MC-214808.
        I also observed this issue with only a single tp, although this was less reproducible. This might seem strange at first, since the ticket levels should change monotonically in this case and not experience the jitter required for the above explanation. I think this is caused by an interaction of the POST_TELEPORT chunk ticket and the player ticket throttling. The POST_TELEPORT ticket is created right after teleporting, but expires before the player tickets are added because of the throttling, hence causing the required jitter.

      2. Futures might be erased completely

      If the chunk ticket level drops low enough, the corresponding ChunkHolder is scheduled for unloading. It can still be revoked up to the point where it is actually saved to NBT and all the unloading tasks are done. This can be a few ticks from the actual scheduling, depending on server load. By property 1, once the corresponding chunk has passed the first generation/loading stage, all Future s will keep the reference to this chunk, so that it can indeed be revoked from the ChunkHolder.
      However, in 1.17 when the ChunkHolder is scheduled for unloading, all generation Future s are replaced with (lazily) aborted ones. If this ChunkHolder gets then later on revoked (before it could properly save and unload) all generation Future s are recreated and the very first stage then reloads the chunk from disk again. Hence, the chunk object gets replaced with a completely new version that is reloaded from disk, erasing any progress since the last save. This is exactly MC-216148.
      Note that the unloading tasks did not run in this scenario, so any externally stored data is still present. In particular, the chunk is still marked as loaded and will hence skip the block-entity loading step.

      net.minecraft.server.world.ThreadedAnvilChunkStorage.java
      private CompletableFuture<Either<Chunk, ChunkHolder.Unloaded>> convertToFullChunk(ChunkHolder chunkHolder) {
        ...
        worldChunk2.loadToWorld();
        if (this.loadedChunks.add(chunkPos.toLong())) {
          worldChunk2.setLoadedToWorld(true);
          worldChunk2.updateAllBlockEntities();
        }
        ...
      }
      

      As a consequence, block-entity tickers will not be recreated and still reference the old copies, hence block-entities will not be ticked. This is indeed a side effect noted in the bug report.

      In order to provoke this issue, run for example

      • /tp ~1000 ~ ~
      • /tp ~-1000 ~ ~

      in quick succession, so that the ChunkHolder s do not save and unload in-between. This can be achieved by pausing the server thread, for example. The first tp will erase all generation Future s and the second tp will then regenerate them and reload the chunk from disk.

      In the report, it is noted that this seems to happen near nether portals. I'm not really sure how this is correlated. Most certainly, the relevant feature of nether portals are the PORTAL tickets created upon using them. However, I don't know how they enter the equation. They might play a similar role to the POST_TELEPORT_TICKET of the previous section or it might be something else.

      3. Chunks might drop below FEATURES stage during initial lighting

      While MC-214793 is actually caused by another concurrency bug, namely MC-224894, it might still be worthwhile to explain why this issue only showed up recently, even though the other bug is way older than 1.16.
      Chunks below the FEATURES stage, i.e., chunks for which the FEATURES generation Future does return UNLOADED_CHUNK, are considered opaque by the lighting engine. Due to MC-224894, chunks are not kept in this required stage during initial lighting and hence might become opaque to the lighting engine.
      However, due to property 1, in 1.16 the chunk always stayed in FEATURES stage once it was scheduled for initial lighting (and even was kept from unloading due to the pending light task). What could have happened even in 1.16 is that neighbor chunks might be unloaded between starting the initial lighting and finishing, which would cause glitches at the border. However, these neighbor chunks were kept loaded by the light ticket that is only removed (wrongly) at the start of the initial lighting, so the time frame was usually way too small, given that the actual unloading usually requires some extra ticks.
      On the other hand, in 1.17 the chunk will immediately lose its FEATURES status once the light ticket is removed (and the player is sufficiently far away), hence triggering the issue with a relatively small amount of work required by the server thread which thus gives a large enough time frame for the issue to happen.
      Furthermore, note that in 1.16 the light generation stage can be aborted before execution if any pending dependency was aborted due to ticket levels dropping too low. In 1.17 this is no longer the case since generation Future s don't get aborted directly. Hence, 1.17 more often processes initial lighting even if the player is already far away, compared to 1.16, increasing the chance for glitches at he chunk borders.
      MC-224894 already describes how to trigger this lighting issue alone. However, MC-214793 mentions corrupted features in conjunction with this, I am not sure what is happening here, but I think this is a combination of the pure lighting issue and the previous section (progress being reverted). For example, one might argue that chunks which do make it to the LIGHT stage are kept from unloading a bit longer than their neighbors (which already finish after the FEATURES stage), due to the additional FUTURE that is waited upon by the unloading code. Hence, these ChunkHolder s might still be alive while their neighbors are already saved and unloaded and are hence susceptible to the previous issue which then regenerates these chunks completely, erasing all features that were leaking in from neighbors. However, as the light data is stored externally in the lighting engine, it will stay broken and not be regenerated. The neighbors were already saved to disk and will hence not need to regenerate, so they keep their features, causing corruption at the boundary.
      Anyway, this is just a wild guess and probably not the whole story. But I think it's not too important to know precisely what's going on here. Debugging this is a real nightmare due to the rather random nature of everything involved and light tickets interacting with the whole argument.

      Conclusion

      The actual bug should be reasonably easy to understand and fix, e.g., by reverting to the 1.16 behavior. On the other hand, there might be good reasons for this change, so some more sophisticated solution might be necessary.
      In any case, I hope these explanations have made clear the connections to the other 3 bugs and improved understanding of the general concurrency issues.

       

      Best,
      PhiPro

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              fry [Mojang] Georgii Gavrichev
              Reporter:
              PhiPro Philipp Provenzano
              Votes:
              15 Vote for this issue
              Watchers:
              14 Start watching this issue

                Dates

                Created:
                Updated:
                CHK: