What is the overhead in the engine of dynamically adding and removing TCastleScenes?

DSK · November 13, 2022, 3:59pm

In a hypothetical system which implements a large world by logically dividing it into cells - what approach to dynamically adding and removing TCastleScenes would minimise variation in the frame rate? (I do appreciate this will probably require answering by experimentation - but someone may already have experience, or know a best approach.)

If such a system were to use a combination of occlusion query, distance from viewer etc., to decide which parts of the terrain to dispose of and which parts to load, then would it be better from the engine perspective if scenes were added in batches - or restricted to just one per frame? I.e. does each addition of a scene incur an overhead that would be reduced by adding scenes in batches? Or would it be smoother to add them slowly - possibly obscured by fog in the distance?

The data could come from different sources: procedurally generated locally, loaded from local files, or loaded from an internet server. I’m assuming there are no problems with getting the data by any of these methods in other threads - but that the actual creation of the TCastleScenes and inclusion into the scene must be done in the main thread? Or, is it possible to construct each TCastleScene in the other threads - but then only insert them in the scene in the main thread?

I’m imagining that a thread-safe queue would be required so that “producers” of the data can push into the queue, and then the main thread can pop from the queue and insert the TCastleScenes…

But at what point can the TCastleScenes actually be constructed? Can they be constructed in other threads and then inserted into the queue so that the main thread only has to retrieve them?

Or (as I suspect), must another representation of the data be inserted in the queue - and then the main thread must retrieve this and construct each TCastleScene before inserting in the view?

michalis · November 13, 2022, 9:04pm

All the CGE operations must be done from the same thread, see “Threads usage | Manual | Castle Game Engine”. This is similar to e.g. Unity or Lazarus LCL.

So if you want to utilize threads, you have to operate only on your own structures in threads, and, as you describe, the main thread has to “pop” them from some queue.

I want to add asynchronous loading capability to CGE, see my plans for a “huge city” demo in “Slides, movies and thoughts from my GIC 2022 presentation – Castle Game Engine”. But this is not done yet. Also, the asynchronous API will still mean you always call CGE from a single thread – it will just be able to load and initialize as many resources as possible in a thread, but that will be hidden from you. (The reason is that coding using threads is notoriously hard for users, and allowing parts of the API to be used from multiple threads is also hard for engine development – so we prefer a more limited approach.)

As for the overhead of adding scenes: it depends very much on what the scene contains, so indeed some experimentation will be necessary. In general, a scene can reuse e.g. shaders from other scenes; this happens automatically under the hood. Since adding scenes is synchronous for now, I would indeed recommend adding them “slowly”, not a large number of scenes at once.
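
The pattern discussed above (worker threads prepare plain data, the main thread pops it from a queue and adds scenes a few at a time) could be sketched roughly as follows. This is an engine-neutral illustration in Python, not Pascal and not Castle Game Engine API; `produce_cell`, `drain_some` and the `max_per_frame` limit are hypothetical names, and the "cell data" is a stand-in for whatever the generator or loader produces.

```python
import queue
import threading

# Hypothetical names throughout; none of this is Castle Game Engine API.
cell_queue = queue.Queue()  # thread-safe: workers put, main thread gets


def produce_cell(cell_id):
    """Worker thread: build plain data only, never touch engine objects."""
    heights = [float(cell_id + i) for i in range(4)]  # stand-in for generation/loading
    cell_queue.put((cell_id, heights))


def drain_some(max_per_frame):
    """Main thread, once per frame: take at most max_per_frame finished cells."""
    taken = []
    for _ in range(max_per_frame):
        try:
            taken.append(cell_queue.get_nowait())
        except queue.Empty:
            break
    # The main thread would construct and add one scene per entry here.
    return taken


workers = [threading.Thread(target=produce_cell, args=(i,)) for i in range(5)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Simulate frames: each frame accepts at most two new cells.
frames = []
while True:
    batch = drain_some(max_per_frame=2)
    if not batch:
        break
    frames.append(batch)
```

With five prepared cells and a limit of two per frame, the cells are spread over three frames instead of arriving in one spike. The right limit is something to find by experiment, as noted above.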

DSK · November 14, 2022, 9:14am

Thank you, Michalis. I think I’m on the right track then

DSK · December 31, 2023, 12:30pm

After a break from my project, I now have a lot more available time to get back to it on a permanent basis

Continuing the above topic - with the general question of dynamically adding and removing “cells” from a large environment, I have a few observations that make me wonder if scenes are actually the most optimal level to perform this.

The cells in such an environment would never have transformations done on them (i.e. rotations, translations). The cells themselves would always be composed of indexed face sets. Occlusion is performed in the engine at the level of individual shapes inside each scene, and each shape has a WasVisible flag to see if it was visible in the last frame. Individual shapes can be added or removed from inside a scene - limiting this to adding / removing a small number of shapes per frame.

Would the correct granularity then be to actually store the whole terrain in a single scene object - and modify it by adding / removing individual shapes containing TIndexedFaceSetNode’s based on their visibility?

Or, would scenes still be the recommended granularity - given that each one would only store a single face set?

I’m wondering about this from the perspective of minimising internal overhead in the engine.

eugeneloza · December 31, 2023, 1:59pm

Unfortunately I can’t answer this question too precisely; I hope @michalis can provide you with more insight. But overall this is a question open to experimentation, so I can share what worked for me in the Mazer game (https://www.youtube.com/watch?v=HW0GoyMhWSI - the generated dungeons could be absolutely huge, with practically no limit on their size, and still perform well - mostly limited by the ability to show a huge mini-map on the screen :)). The source code for the project is at EugeneLoza / Mazer · GitLab, however it’s catastrophically obsolete by now (6-7 years ago is a … big time :)). I’m planning to make a new version of the game (including turning it into a more fully-featured game) using a more modern approach; however, it’s not a high-priority task for now.

So, how was the game optimized to support huge dungeons:

1. I didn’t use TIndexedFaceSetNode but instead loaded a TX3DRootNode for every dungeon room designed in Blender.
2. The models were grouped in two ways: into TSwitchNode and into TCastleScene.
2.1. Nearby tiles (the ones found in a single “chunk” by performing a raycast) were grouped into a TSwitchNode, which can turn them on and off extremely performantly by using TSwitchNode.WhichChoice - however this solution “doesn’t scale” too well and starts lagging when the number of switch nodes in a scene increases.
2.2. So, on a “larger scale” the game was setting TCastleScene.Exists to turn whole chunks on and off together in one call. While each such call is less performant than TSwitchNode.WhichChoice, when dealing with hundreds of TSwitchNode a single call to TCastleScene.Exists outperforms them, making it possible to scale the dungeon almost infinitely (of course there are limitations).
3. Note that this did not include loading/unloading chunks/models in real time. The dungeon was constructed at the generation phase and only some elements of it were turned on and off as a kind of smart “occlusion system”.

I have big plans to make a good “world streaming” (smart name for load/unload) system, but again that’s not a high-priority task and the rework of Mazer above will most likely take precedence (hence this system will most likely not be prototyped in 2024).

Apart from that I can share some basic ideas on how to approach it:

The “world state” (which chunks should be loaded and which can/should be unloaded) is monitored by a special management system.
If the management system detects chunks that will need to be loaded in the near future, it looks for an upcoming frame that isn’t too heavy on calculations and tries to fragment the loading into pieces as small as possible, to keep the FPS stable.
It also prioritizes tasks by “how important they are”. E.g. if the Player could potentially see some chunk only if they keep going in that direction for the next minute - loading this chunk is a low-priority task. But if the Player is literally looking at a blank space right now - it’s a critical task, and if we drop far below the target FPS, we’ll just need to show a “Loading…” screen, stop the game for a second and load everything we didn’t manage to load before.
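
A minimal sketch of such a management system's scheduling step, in Python for neutrality (not Castle Game Engine API). Everything here is an assumption made for illustration: the `LoadScheduler` class, the `CRITICAL`/`LOW` priority levels, and especially the per-task `cost_ms` values, which in practice would be rough estimates. Critical tasks run even when they overrun the frame, which corresponds to the "Loading…" stall described above; everything else waits for a frame with enough spare time.

```python
import heapq

CRITICAL = 0  # the player is looking at a missing chunk right now
LOW = 2       # the player might see this chunk some time later


class LoadScheduler:
    """Hypothetical scheduler: small load tasks ordered by importance."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: keeps insertion order within a priority

    def add(self, priority, cost_ms, name):
        heapq.heappush(self._heap, (priority, self._counter, cost_ms, name))
        self._counter += 1

    def run_frame(self, spare_ms):
        """Run the most important tasks that fit into this frame's spare time."""
        done = []
        while self._heap:
            priority, _, cost_ms, name = self._heap[0]
            if cost_ms > spare_ms and priority != CRITICAL:
                break  # would overrun the frame; wait for a lighter one
            heapq.heappop(self._heap)
            spare_ms -= cost_ms  # may go negative for a critical task (a stall)
            done.append(name)  # the real load work would happen here
        return done


scheduler = LoadScheduler()
scheduler.add(LOW, 3, "far-chunk-a")
scheduler.add(CRITICAL, 10, "visible-chunk")
scheduler.add(LOW, 3, "far-chunk-b")

frame1 = scheduler.run_frame(spare_ms=4)
frame2 = scheduler.run_frame(spare_ms=4)
frame3 = scheduler.run_frame(spare_ms=4)
```

In this run the critical chunk is loaded first at the cost of one slow frame, and the two low-priority chunks follow one per frame because only one of them fits in the 4 ms of spare time.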

If you’re ok with showing a “Loading…” screen not as a recovery from a critical situation, but like Morrowind did - on a regular basis - then you can simplify this management system, and “when you need to show a chunk that wasn’t loaded yet” just pause the game, load the chunk and go on. In most situations it shouldn’t take more than a second.

DSK · December 31, 2023, 3:55pm

Thank you for such a detailed answer - much appreciated

What I have in mind is an open world (like a lot of people do…). For me, this is a revival of a project I worked on back in 2010 which was an “infinite” heightmap terrain based on recursive subdivision (mountains, valleys, lakes, oceans), without even using an engine at the time - just directly in OpenGL. It had just about enough performance to allow unlimited movement around the terrain with basic physics - with no 3D models - and it was impractical to turn it into a game, given the further overheads that would add.

It sounds like we have quite similar ideas - and I like your Mazer game!

With what I imagine now - the only time there would be a delay to loading would be at the start when there is nothing loaded. The management system would first of all register cells that may potentially become visible (up to a maximum number of maintainable cells determined dynamically by framerate), and these would be instantiated in the Castle engine containing only a simple axis aligned cube, sorted by 3D distance from the camera. When these are rendered, then an occlusion query will tell when the cube is actually visible. If it only contains the dummy cube, then the cell will need to be populated with the terrain graphics (or other graphics such as buildings or caves etc). Cells will be populated with a level of detail appropriate for their distance from the camera (minimising seam artifacts between different detail levels)…

As the camera moves, then further cells in the distance will need to be added to the potential set, and the system will make space for these by deallocating the furthest away cells currently registered. And cells that are generated at the wrong detail level may need to be regenerated according to distance from the camera - as a gradual process operating over many frames.

The terrain is procedurally generated, but this may be overridden per cell to allow modifiable terrain. In this case the entire cell is instead loaded from some other source either locally stored or loaded from a server.

There will be a delay to the loading / generation process, during which time the dummy cube will be displayed - possibly for several frames. But most of these should occur in the distance and be obscured by e.g. fog. In cases where the cells are hidden behind say a mountain, but otherwise are close to the camera, then these should be generated gradually and only made renderable as the framerate allows - otherwise these should remain as dummy cubes.

The aim is to minimise the number of dummy cubes which get rendered to screen… Overall, the number of cells to render and maintain should be dynamically managed to maintain a minimum framerate - accepting minor artifacts created by delays in loading, but attempting to hide these in the distance etc.
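
The bookkeeping part of this scheme (register the nearest cells as placeholders, evict the furthest ones when the camera moves, populate a cell once its placeholder is reported visible) might look roughly like the sketch below. It is written in Python purely for illustration and uses no Castle Game Engine API; `update_registered`, `on_occlusion_result` and the two state strings are hypothetical, and the occlusion result is simply passed in as a boolean.

```python
import math


def update_registered(registered, candidates, camera, max_cells):
    """Keep only the max_cells candidate cells nearest to the camera.

    registered maps a cell coordinate to "placeholder" or "populated".
    Returns the cells that were evicted to make space.
    """
    nearest = sorted(candidates, key=lambda cell: math.dist(cell, camera))[:max_cells]
    keep = set(nearest)
    evicted = [cell for cell in registered if cell not in keep]
    for cell in evicted:
        del registered[cell]  # the real system would free the cell's graphics here
    for cell in nearest:
        registered.setdefault(cell, "placeholder")  # new cells start as dummy cubes
    return evicted


def on_occlusion_result(registered, cell, visible):
    """Populate a cell only once its placeholder has actually been seen."""
    if visible and registered.get(cell) == "placeholder":
        registered[cell] = "populated"  # stand-in for generating real geometry


registered = {}
candidates = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (9, 0, 0)]

first_evicted = update_registered(registered, candidates, camera=(0, 0, 0), max_cells=2)
on_occlusion_result(registered, (0, 0, 0), visible=True)
state_near_origin = dict(registered)

second_evicted = update_registered(registered, candidates, camera=(9, 0, 0), max_cells=2)
```

Here `max_cells` is fixed, whereas the description above adjusts it dynamically from the frame rate; that feedback loop is left out of the sketch.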

The terrain would be logically divided into same sized chunks. But objects such as trees, buildings etc need not be constrained to this - allowing the graphics for say a building to be designed in Blender without regard to a cellular structure… But otherwise, they would be equivalent to the terrain in terms of occlusion and being loaded / unloaded etc. Perhaps these would exist in the engine as separate scenes, able to be repositioned etc. But I wonder if the terrain should be treated as a single scene, and the cellular structure handled at a lower level or not…

DSK · December 31, 2023, 5:15pm

Looking at this post

https://forum.castle-engine.io/t/clear-and-reuse-a-scene/881

may already answer what I was essentially thinking about. I think this is going to come down to testing both approaches. Perhaps making it parameterised - balancing how much is stored per cell against size of cells, against occlusion benefit - according to performance feedback…

It seems that toggling the “existence” of content - without deallocating - is a fast operation. But unfortunately a large environment is going to require allocation and deallocation. It will have to be a matter of testing to find at what granularity this will be most efficient, and ways of spreading out the delay to minimise variation in the framerate.

Changing part of the X3D tree means rebuilding the tree.

I’m now wondering about a scheme of keeping everything inside one pre-allocated indexed face set for the terrain. Then reallocating cells and regenerating at different resolutions is a matter of changing vertices and indices in the one static buffer. This would essentially amount to creating a memory allocator to operate in the memory space of an indexed buffer…

But that won’t work with the occlusion culling…

So how about having a pool of pre-allocated vertex buffers and reassigning them to different cells by changing the vertices and indices inside the buffers? The buffers remain allocated - but they assume the roles of different cells as required by changing their contents - without changing their size.

I need to convert the slow operations of de-allocation and re-allocation of the scene graph to one of repurposing a non-changing scene graph by only using the fast operations of moving the positions of vertices and possibly indices.
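
The pooling idea can be illustrated with a small engine-neutral sketch in Python. This is not Castle Game Engine API, and whether the engine makes in-place vertex updates as cheap as hoped is exactly what would need testing. `BufferPool`, `assign` and `release` are hypothetical names, and a "buffer" here is just a fixed-length list of vertex tuples that is overwritten in place and never reallocated.

```python
class BufferPool:
    """Fixed set of equally sized vertex buffers, reassigned between cells."""

    def __init__(self, count, vertices_per_buffer):
        self.size = vertices_per_buffer
        # Allocated once up front; only the contents change afterwards.
        self.buffers = [[(0.0, 0.0, 0.0)] * vertices_per_buffer for _ in range(count)]
        self.free = list(range(count))
        self.owner = {}  # cell id -> index of the buffer it currently uses

    def assign(self, cell, vertices):
        """Give the cell a buffer (or reuse its own) and overwrite it in place."""
        if len(vertices) != self.size:
            raise ValueError("a cell must fill exactly one buffer")
        if cell in self.owner:
            index = self.owner[cell]
        elif self.free:
            index = self.free.pop()
        else:
            return None  # pool exhausted: the caller must release a cell first
        self.owner[cell] = index
        self.buffers[index][:] = vertices  # slice assignment keeps the same list
        return index

    def release(self, cell):
        """Return the cell's buffer to the pool without deallocating it."""
        self.free.append(self.owner.pop(cell))


pool = BufferPool(count=2, vertices_per_buffer=3)
buffer_ids_before = [id(buffer) for buffer in pool.buffers]

index_a = pool.assign("A", [(1.0, 0.0, 0.0)] * 3)
index_b = pool.assign("B", [(2.0, 0.0, 0.0)] * 3)
index_c_full = pool.assign("C", [(3.0, 0.0, 0.0)] * 3)  # no free buffer yet

pool.release("A")
index_c = pool.assign("C", [(3.0, 0.0, 0.0)] * 3)  # takes over A's old buffer
```

The fixed buffer size is the main design constraint: cells at different detail levels would either need several pools of different sizes or would have to leave part of a buffer unused.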