Uploaded image for project: 'Minecraft: Java Edition'
  1. Minecraft: Java Edition
  2. MC-169913

Lighting engine is not threadsafe

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open
    • Resolution: Unresolved
    • Affects Version/s: 1.15.1, 1.15.2 Pre-Release 1, 1.16.2 Pre-release 1, 1.16.2, 1.16.4, 20w48a
    • Fix Version/s: None
    • Labels:
      None
    • Confirmation Status:
      Confirmed
    • Category:
      Lighting
    • Mojang Priority:
      Normal

      Description

      The overall issue dealt with in this report is that block changes are not properly synchronized with the lighting engine; in particular they are not properly synchronized with the lightmap handling which determines the region the lighting engine can operate on and the region where skylight optimizations are applicable. As shown below, this can lead to persistent lighting glitches.
      In the following we'll look into 3 variants of this general issue that are relevant for my proposed solution. I attached a world download containing a simple setup for all 3 problems with the labeling as below. Due to the multithreading nature of the problem, the issues are not necessarily reliably reproducible, although they were on my system; it would be nice if someone else could confirm that the setup works on their system. Below I'll try to extract the relevant points.

      A boring solution to these issues would be to employ some copy-on-write data structure to synchronize block changes to the lighting thread in a more controlled fashion. However, that might add quite a lot of overhead in order to fix some bug that might not occur that often in practice. The solution developed below should be much more efficient and in fact not very complicated.

      While the first issue is not directly a multithreading issue, it is still relevant for the solution of the general bug.

      Problem 1: Lightmaps get removed too early

      When the last block of a subchunk gets removed, the lighting engine is told to remove the corresponding lightmap (if there are no nearby blocks). Only afterwards it is told to reevaluate the removed block. However, at this point the lightmap was already removed, so the lighting engine simply ignores the light check; even if it wouldn't be ignored, all the relevant information would be erased.
      This can be visualized as follows:

      • Create a new redstone-ready world
      • /setblock 7 82 7 minecraft:sea_lantern
      • Destroy the sea lantern
      • Observe that the block light value is still 15. This is actually not directly the bug we are after. This part of the issue will disappear upon reloading the world or triggering some block light update. We will come back to this later.
      • Reload the world to fix the previous point.
      • Note that the light is still broken in subchunk 0 4 0

      As soon as we break the sea lantern, the lightmap of that subchunk gets removed, so the lighting engine cannot process the block update to reduce the blocklight level. Hence the light stays in subchunk 0 4 0 which doesn't have its lightmap removed because of nearby blocks. The blocklight level for subchunk 0 5 0 is set to 0 by removing the lightmap, as we see after reloading the world.

      The relevant code pieces are

      net.minecraft.world.chunk.WorldChunk.java
      @Nullable
      public BlockState setBlockState(BlockPos pos, BlockState state, boolean bl) {
           ...
           boolean bl2 = chunkSection.isEmpty();
           BlockState blockState = chunkSection.setBlockState(i, j & 15, k, state);
           ...
           boolean bl3 = chunkSection.isEmpty();
           if (bl2 != bl3) {
               this.world.getChunkManager().getLightingProvider().updateSectionStatus(pos, bl3);
           }
           ...
      }
      
      net.minecraft.world.World.java
      public boolean setBlockState(BlockPos pos, BlockState state, int flags) {
          ....
          BlockState blockState = worldChunk.setBlockState(pos, state, (flags & 64) != 0);
          ...
                  this.profiler.push("queueCheckLight");
                  this.getChunkManager().getLightingProvider().checkBlock(pos);
                  this.profiler.pop();
              }
          ...
      

      Chunk.setBlockState(...) calls updateSectionStatus(...) to issue the creation and removal of lightmaps to the lighting engine. World.setBlockState(...) first calls Chunk.setBlockState(...) and only afterwards checkBlock(pos) which processes the block update for the lighting engine.
      To fix this issue, we should first issue the lighting check and only afterwards the removal of the lightmap. By a similar argument, we should first issue the creation of a lightmap and then the lighting check for a newly added block. As we will see later, we must in fact issue the creation of a lightmap before changing the block via chunkSection.setBlockState(...).
      Hence I propose to move the lightmap handling (updateSectionStatus(...)) to World.setBlockState(...) and change the order to

      • Issue lightmap creations if necessary
      • Call Chunk.setBlockState(...)
      • Issue light checks
      • Issue lightmap removals if necessary
        Also, the lighting engine should process the issued operations in this order. Keeping only a single call to updateSectionStatus(...) as in the current code and adding some priority system to achieve the right order might work in the single threaded case, but not in the multithreaded case, as one operation might be processed before the others get issued.

      There is a similar version for skylight included in the attached world, however it requires some more operations as there is some special handling for adding and removing skylight maps.

      For completeness, let's discuss why in the example above the light stayed in subchunk 0 5 0 until reloading the world. The main thread does not read light values directly from the internal lightmaps, but gets its own immutable copies that are updated inside

      net.minecraft.world.chunk.light.LightStorage.java
      protected void notifyChunkProvider() {
          if (!this.field_15802.isEmpty()) {
              M chunkToNibbleArrayMap = this.lightArrays.copy();
              chunkToNibbleArrayMap.disableCache();
              this.uncachedLightArrays = chunkToNibbleArrayMap;
          ...
      }
      

      field_15802 gets populated whenever a light value changes or a new lightmap is added (which can contain precalculated light values and hence cause a change of light values as well), but not when a lightmap is removed. Hence the immutable copy of the lightmaps does not get updated until reloading the world or triggering some blocklight update, as observed above.
      By itself this isn't really a bug. If everything else worked correctly, lightmap removal shouldn't change light values, hence there shouldn't be a reason to create a new copy of the lightmaps, other than freeing some memory.

      Images produced by the attached world:

      Problem 2: Blocks can be seen too early

      Since the lighting engine reads blocks directly from the main thread world, it can see block changes "immediately". In particular, it can see blocks before all relevant lightmaps have been created; even though lightmap creation should be issued before the actual block change is carried out, as noted above, the lighting engine might encounter a new block while processing other light updates. Although the issued lightmap creation will be visible to the lighting thread at that point (by looking at the queued operations), there is no reason why this operation should have been processed. Even if we processed all pending operations just before the block lookup, it could happen that the block gets placed and the lightmap creation gets issued inbetween those two operations.
      This can lead to the problem that the lighting engine can see and take into account newly placed blocks although the relevant adjacent lightmaps, which that block can affect, have not yet been created, so the light updates will get stuck at those neighbor subchunks. Later on, the relevant lightmaps will be added and a light check for the newly added block will be carried out. However, the light is already correct around this block; the lighting errors occur at the boundary to the neighbor subchunk. Hence, the light check at the newly added block will succeed and no further updates will be carried out, keeping the errors at the chunk boundary.
      For skylight this also causes the skylight optimization to be applied wrongly; the optimization only works 16 tiles away from non-air blocks, so is not applicable around those newly placed blocks. But since the lighting engine has not been told yet about those new non-air blocks, it might apply the optimization nevertheless, causing errors. This is what we will demonstrate in the following.
      The setup is as follows:
      We have some large stone platform high up in the sky (at y=255 in order to cause long light propagation times). Subchunk 1 6 1 has a lightmap that is kept alive by nearby blocks, but subchunk 1 5 1 doesn't. Inside a single tick

      • Punch a small hole in the stone platform
      • Trigger the lighting engine to propagate the light downwards from this hole by causing some more light checks
      • Cause a bunch of other lightmap creations and light updates, so the priority system doesn't accidentally move the next step before processing the updates of the previous one. Don't cause too many updates, as otherwise the light propagation of the previous step will be finished before we can do cool stuff.
      • The lighting thread will now propagate the light downwards from the hole and hence not process any operation caused by the next step until done
      • Place a platform of leave blocks in subchunk 1 6 1 below the hole
      • On its way down, the light will now see the leave blocks and correctly incorporate them into the result. As subchunk {{ 1 5 1}} doesn't have a lightmap, the skylight optimization will kick in and simply copy down the light values from subchunk 1 6 1.
        After passing the leave blocks, the light should decay to 0 after 15 blocks. But because of the wrongly applied optimization, it travels a whole 16 blocks (or arbitrarily many) without any decay.
      • Now the lighting engine adds the lightmap for subchunk 1 5 1 which is initialized by copying down the values from subchunk 1 6 1, ie. the values as produced by the skylight optimization. Also, light checks are carried out for the leave blocks. But as the light was already correct around those, nothing changes and the light in subchunk 1 5 1 stays broken.
      • Add some blocks in subchunk 1 5 1 to better visualize the issue (F3 would work as well)

      What we learn from this is that we should make sure that the lighting engine doesn't see any non-air blocks for subchunks that it has been told to be empty, as this can conflict with the skylight optimization.
      The lighting engine already keeps track of which subchunks should be empty according to the main thread and spawns lightmaps accordingly. I propose to employ this information inside the block lookup used by the lighting engine: First test whether the queried subchunk should be empty and return air in that case. Otherwise, proceed as normal. This prevents the described issue that blocks get seen before all relevant lightmaps have been added.

      Unfortunately, things can get a bit more delicate as Problem 3 shows.

      Problem 3: Lightmap removal must be cancellable

      The basic issue encountered here is that lightmap removal should be canceled if a subchunk gets repopulated before the lightmap is actually removed, rather than removing the lightmap and then recreating it, as this would erase information.
      In order to get a better understanding, let's look at the setup in the attached world. The situation is mainly the same as in Problem 2: we have a stone platform high up in the sky and subchunk 1 6 4 has a light map. This time subchunk 1 5 4 has a light map that is kept alive by some nearby block, as well. Similarly as above, inside a single tick

      • Punch a small hole in the stone platform
      • Trigger the lighting engine to propagate the light downwards from this hole by causing some more light checks
      • Cause a bunch of lightmap creations and light updates to prevent priority reordering
      • The lighting thread will now propagate the light downwards from the hole and hence not process any operation caused by the next steps until done
      • As in Problem 2, we will spawn a leave platform while the lighting engine is propagating the light downwards in order to cause some issues. However, we want the lighting engine to remove the lightmap for subchunk 1 5 4 before it gets kept alive by the leave platform, but after propagating the light downwards, in order to produce the issue outlined above. Hence we need some other steps before spawning the leave platform.
        Now, issue removal for the lightmap of subchunk 1 5 4 by deleting the blocks that keep it alive, but in such a way that the lightmap for subchunk 1 6 4 does not get removed.
      • Again cause some more lightmap creations and light updates to prevent priority reordering
      • Now spawn the leave platform in subchunk 1 6 4 below the hole
      • On its way down, the lighting engine will now see and incorporate the leaves blocks, as in Problem 2. This time, however, subchunk 1 5 4 does have a lightmap, so we don't wrongly apply the skylight optimization, but correctly propagate the skylight to subchunk 1 5 4. Subchunk 1 4 4 now correctly stays dark.
      • Now the lighting engine processes the issued removal of the lightmap at 1 5 4. The issued recreation of the lightmap through the leave platform is not yet processed, as we issued a bunch of other lightmap creations, so it is still stuck in the queue.
      • Now we finally process the issued recreation of the lightmap at 1 5 4. However, at this point the lighting information has already been erased and the new lightmap for 1 5 4 will be initialized by copying down the values from 1 6 4, hence the light below the leave platform does not decay, looking as in Problem 2. However, subchunk 1 4 4 will stay dark, in contrast to Problem 2, showing that we indeed observe a different issue here.

      This issue is a mixture of Problem 1 and 2. In some sense, the issue is that the lighting engine sees the leave blocks before it is told that subchunk 1 5 4 is not empty (i.e. before it processes the lightmap creations). However, the solution for Problem 2 of simply treating that block as air does not apply here, because at the time the leave block is seen, that subchunk is not marked empty, but gets only marked empty later on. In some sense this is also related to Problem 1, as lightmaps that are relevant to the leave blocks get removed after the lighting engine sees the leave blocks but before it processes the removal of the leave blocks.
      The general lesson learned here is that the lightmap creation issued by spawning the leave platform should cancel the previous removals of lightmaps.

      The precise order of operations needed is summarized in the following section.

      Proposed Solution

      As noted above, the main thread should carry out operations in the following order:

      • Issue an operation to mark the subchunk as non-empty, as necessary
      • Call Chunk.setBlockState(...)
      • Issue light checks
      • Issue an operation to mark the subchunk as empty, as necessary

      It will become clear in a moment why the subchunk must be marked non-empty not only before the light check, but also before the block gets written to the chunk.

      For block lookups, the lighting engine should first test whether the queried subchunk should be empty, according to its current knowledge, and in this case return air.

      The lighting engine should carry out operations in the following order:

      • When starting an update cycle, take a snapshot of the operation queue and don't process any later updates. This makes sure that we don't process light checks without processing the corresponding lightmap creation or lightmap removals without the corresponding light checks.
      • Process all operations that mark subchunks as non-empty
      • Process all light checks and propagate all light updates
      • In case that only a limited number of light updates is carried out at a time, finish the current set of updates before processing any new operation
      • After all pending updates have been processed, process the operations that mark subchunks as empty. In this step, we need to take another snapshot of the operation queue and search it for operations that mark subchunks as non-empty, and cancel operations accordingly. This is needed as shown by Problem 3.
        The correctness of this last step requires some small argument, as follows. Suppose some subchunk has been marked as empty. We consider two cases:
        • The lighting engine has not seen any block that was added to the subchunk after the main thread issued the operation to mark it as empty. In this case, we can safely mark this subchunk as empty, since we processed all pending light updates that were caused by the removal of blocks and no new blocks have been observed.
        • The lighting engine does see some block that was added to the subchunk after the main thread issued the operation to mark it as empty. Now we are in the situation of Problem 3. The important point is now that we can use acquire-release semantics on reading this newly added block and the corresponding write on the main thread. Because the operation to mark subchunks as non-empty is issued before writing the block , acquire-release semantics make sure that we will see this operation in the operation queue after observing the block. Hence we will find the operation in the second snapshot of the operation queue and correctly cancel the operation.
          In fact, acquire-release semantics are not precisely guaranteed as reading blocks isn't even threadsafe (MC-162080), however in practice they will hold on x86 architectures as long as MC-162080 doesn't occur.
      • Leave all remaining operations to mark subchunks as non-empty for the next cycle or process them right away.

      Unfortunately, as far as I can tell, this is not so easily achievable with the current design of the multithreading code. That is because all operations are wrapped inside some generic runnable tasks and put into a single task queue, so it is not straightforward to scan this task queue without actually processing its elements.
      I propose to introduce more specialized "task queues" for different types of tasks which then can be scanned more easily. More concretely, I propose to add 3 sets (and possibly more for other stuff not covered here) lightChecks, markEmpty, markNonEmpty. When the lighting engine starts its update cycles, it takes ownership over all 3 sets and gives the main thread new ones to populate. Now we can easily process markNonEmpty and lightChecks. When processing markEmpty, we again take ownership over the new markNonEmpty and cancel any subchunks occurring in both. Now we can either merge the remaining operations of markNonEmpty back into the current set for the main thread to populate or simply keep it and merge it with the next update cycle.
      The main thread then populates the sets as follows:

      • Use some lock, so that we don't write to the sets after the lighting engine has taken ownership
      • Add light checks to lightChecks (obviously )
      • When issuing a new empty subchunk, add it to markEmpty
      • When issuing a new non-empty subchunk, first try to cancel a corresponding entry in markEmpty and otherwise add it to markNonEmpty.
      • Issuing an empty subchunk must not cancel any operation in markNonEmpty, as this operation might be needed to cancel a removal in the currently ongoing cycle. Canceling entries of markEmpty by non-empty subchunk, on the other hand, is unproblematic.

      I hope this report was more or less understandable. If anything is unclear, please don't hesitate to ask
      Best,
      PhiPro

        Attachments

        1. Lighting Bug.zip
          1001 kB
        2. Problem1_1.png
          Problem1_1.png
          303 kB
        3. Problem1_2.png
          Problem1_2.png
          107 kB
        4. Problem1_block.png
          Problem1_block.png
          414 kB
        5. Problem1_sky.png
          Problem1_sky.png
          523 kB
        6. Problem2.png
          Problem2.png
          457 kB
        7. Problem3.png
          Problem3.png
          464 kB

          Issue Links

            Activity

              People

              Assignee:
              Unassigned Unassigned
              Reporter:
              PhiPro Philipp Provenzano
              Votes:
              29 Vote for this issue
              Watchers:
              15 Start watching this issue

                Dates

                Created:
                Updated:
                CHK: