Android crash + issue, resolved, but investigation of timer behavior could be important

Here is the commit which fixed the crash (the view file, other fix to CastleDownload issue is not a crash and quite specific to my case, but also a small point). A long debug and tests brought some clearance (bc at first sound lib on android was to blame) and a topic about timers Need Help Understanding How to Reset a Timer gave an idea to try approach with “trigger” instead of timer.

SO, the issue giving a crash revolves around usage of timer and switching view within its tick handler , which lead to a crash without any sort of stacktrace or logs (which I was tracking with adb). What’s funny - disabling switch of music theme (in a target view) was fixing the crash, but obviously I needed switch of music theme. That’s why at first I thought it is a sound lib issue.

Actually , in my case there is not a must to have a timer to switch the view, so I reworked it with trigger approach and it worked. However, I think this is a nasty issue to be addressed at some point, bc it crashes without any evident info (for me it was just starting playing the other theme and just closed app with android app crash report) and definitely not something that is prohibited or bad practice, anyone could just face it at some time. Unfortunately , idk the exact reason and didn’t yet investigate the source code, so can’t provide more clues for now.

The other minor issue (which is also fixed by this commit) is probably a tiny conditional block in android implementation of TCaslteDownload which led to http request status code 400+ be treated as success, and I had to make explicit check for it in app code. This issue might be even introduced by my own code changes, which I did a year or so ago. I will take a look a bit later and debug this, and probably post a PR to it.

The issue here, from what I see, is:

  • The timer components (TimerBlood, TimerGame, TimerHint: TCastleTimer) are loaded from the design. They are destroyed when the view stops.

  • This means that switching views, from the timer event → destroys the instance of the object calling this event.This can cause a lot of problems, in general.

    procedure TCastleTimer.DoTimer;
    begin
      if Assigned(OnTimer) then
        OnTimer(Self);
      SomeField := 123; // <- this will crash if OnTimer freed TCastleTimer instance
    end;
    

    then we will have a crash.

  • We have a loop iterating over UI components and calling their Update, which in case of timer will cause the DoTimer. But if some objects on this list suddenly disappear (or even all of them, if view disappeared) → we will get crashes.

“Crash” here means we access memory we should not access. So it may result in Access Violation (Unix SEGFAULT) exception, but it may also have a ~random effect, as it access memory where something unexpected lies, and this memory could be read/written.

The problem is more general than “switching views from TCastleTimer.OnTimer” to be clear. Switching views e.g. from button’s OnClick event has the same dangers.

For now, what I can offer is a rule: “Do not free the instance of the object from within that object’s event. And remember that this includes freeing the object indirectly, because the view stops → so do not stop view from within a component that is part of this view.”

I realize this sucks. It’s an easy mistake to make. Our API doesn’t prevent you from this in any way.

Ideas:

  1. First of all, we could detect (and then 1. very clearly warn about it, 2. after some time: even ignore?) at runtime this case. Trying to stop a view when we iterate over components of this view is an error.

  2. Another solution would be introducing API to schedule changing views – not doing it instantly. Here it’s a bit uneasy how to define it nicely.
    2.1. A new API, like ScheduleSetView, SchedulePushView etc. would be least invasive, but also don’t look nice.
    2.2. OTOH, changing the behavior of existing things, while it would look nice, could be confusing. E.g. after Container.View:=NewView, what should the immediate value of Container.View be? I guess NewView, but then remember NewView may not be actually yet part of Window.Controls, and effectively we will need to maintain a “future state of views” in addition to “current state of views”.

Hm, I guess 2 things should be done, as non-invasive, ASAP:

  • Make a warning if you try to stop a view from within component’s event that belongs to the view. This would at least make your error → obvious, I mean you would see in the log sthg like Warning: While processing Update of TViewGame children, that view is stopped. This likely means you stop the view from within an event handler of a component that is part of this view.
    • OK, I can likely make this warning better :slight_smile: But you get the idea.
  • Introduce ScheduleXxx methods, which would give you an immediate solution: you could just change your view operations to use ScheduleViewSet / ScheduleViewPop and be be OK.

And in the long-run, we can experiment should we change the current methods to immediately change views → into “schedule” versions. We could introduce a flag like TCastleContainer.ScheduleViewChanges:Boolean, if true → then all view change operations actually call their ScheduleXxx counterparts.

Added to TODO. My “productive April” has started, so I hope you will not have to wait for this long :slight_smile:

1 Like

it is pretty interesting, bc you say the things that are quite ambiguous to me, I mean you have the full authority and trust in my eyes, however, what is more natural than to switch a view by a button click handler ?? it is like everywhere. I agree that Stop can make some operations with view’s members be risky, but it seems that this is not the case, as seemingly no operations were done in handler after the switch of view inside, except for one boolean flag, and flag is a field in view instance, so should exist until destroy, also the switch to that view from the button worked well (with music switch in start and resume methods of target view, I tested both), and also switch to view worked from the timer handler, but with music change commented. So to me it seemed not like usual AV, but some race condition or really some memory corruption from specifics of thread processing on android (bc windows didn’t show such behavior ever).

I’ve read your thoughts and for sure you might see more behind it, compared to me, and I’m ok with whichever proposal from your side (and hooray, productive april ! :sweat_smile: ). I think still that destroyed timer on stop should not technically influence a workflow of a view method, as there is seemingly no dependency on the timer , and the DoTimer method should safely end even if timer itself is freed, as you pointing on source code, where nothing after handler processing is done. I can try to do it with the timer explicitly created (as a member of view class, and not from design) and see if it breaks, bc it is hard to believe (as I tested it in many ways, covered all the places with the logs and so on, commenting on and off blocks of code etc)

I share your surprise. It should not be dangerous to switch views from the button’s OnClick handler. Alas, you found a serious bug in CGE in this thread, that somehow was unnoticed for a long time → I confirmed now it’s a problem. I will fix it ASAP – I have some new ideas how to do this.

The TCastleView.InternalStop, which is called when view stops,

  • calls virtual Stop
  • and then does UnLoadDesign which, by doing FreeAndNil(FDesignLoadedOwner), frees all the design loaded components.

See castle-engine/src/ui/castleuicontrols_view.inc at bbbfc016ed179b2009cf8194dd692b81adf1f6e3 · castle-engine/castle-engine · GitHub .

I just confirmed it is a potential problem even in our platformer example code, that has OnClick handler doing trivial

procedure TViewMenu.ClickPlay(Sender: TObject);
begin
  Container.View := ViewPlay;
end;

Adding 3 Writelns to the TCastleButton code, I can unfortunately confirm that when DoClick finishes, this button was destroyed.

→ results in log like this:

DoClick begin: 134962968364080
Destroying TCastleButton: 134962968367248
Destroying TCastleButton: 134962968366192
Destroying TCastleButton: 134962968365136
Destroying TCastleButton: 134962968364080 // our button is destroyed here
... (various game loading logs)
DoClick end: 134962968364080 // but here we are still in this button's method code

So the button 134962968364080 is really destroyed in the middle of it’s OnClick handler. If the DoClick would do anything more that is accessing the fields → then it’s an undefined behavior. E.g. this is undefined behavior (right now, I mean before the fix; we will fix it to make it OK):

procedure TCastleButton.DoClick;
begin
  Writeln('DoClick begin: ', PtrUInt(Self));
  Writeln('  DoClick begin says: Name is ', Name);
  if Assigned(OnClick) then
    OnClick(Self);
  Writeln('DoClick end: ', PtrUInt(Self));
  Writeln('  DoClick end says: Name is ', Name); // <-- this accesses Self which is now a dangling pointer
end;

Log for above shows, for my system and compiler now (FPC 3.2.2, Linux/x86_64):

DoClick begin: 139979378930736
  DoClick begin says: Name is ButtonPlay
Destroying TCastleButton: 139979378933904
Destroying TCastleButton: 139979378932848
Destroying TCastleButton: 139979378931792
Destroying TCastleButton: 139979378930736
...
DoClick end: 139979378930736
  DoClick end says: Name is 

So it seems like the Name changed to '' . Why? By accident, this memory piece is nil now. But this is an accident, it could as well be a crash, and some memory garbage. Different OS, different program workflow, different compiler / optimization level → could cause a crash.

And even if we would not even do anything from DoClick after calling OnClick, it’s still dangerous because DoClick is done from Release which is done by TCastleContainer.EventRelease ( castle-engine/src/ui/castleuicontrols_container.inc at bbbfc016ed179b2009cf8194dd692b81adf1f6e3 · castle-engine/castle-engine · GitHub ) doing loop for I := C.ControlsCount - 1 downto 0 do if RecursiveRelease(C.Controls[I]) then . So it may work by accident (it did work by accident in all tests until this thread! :slight_smile: ), esp. if the list count drops only by 1, but when C has been freed → we are undefined, and we can get crashes.

If you create / destroy the timer in view’s Start / Stop methods, and OnTimer will change views (stopping the view owning it) → then you should be able to confirm my grim conclusions. I mean, things inside view’s Stop will effectively happen in the middle of OnTimer.

Whether the timer is part of the design file (.castle-user-interface) or it’s created explicitly → will not matter.

The only way you could avoid this problem is by destroying timer later, e.g. in view’s Destroy. (So one idea for the fix is to move destroying loaded components later; alas, since you may also create/destroy stuff explicitly in Start/Stop, we should also introduce the “schedule view change” mentioned in earlier post; I’ll experiment today what is best to do).

Do be clear, right now

  • TCastleTimer.DoTimer does nothing after calling OnTimer.
  • And the caller of DoTimer, which is TCastleTimer.Update, also does nothing after calling DoTimer.
  • By accident, other OnXxx calls also don’t do anything more – I checked TCastleButton.OnClick, TCastleCheckbox.OnChange, TCastleEdit.OnChange.
    So at least we don’t get crash in timer… but
  1. We want to have, in the future, ability to place some Pascal code to be done after OnXxx on that instance is called.
  2. Even if not, they are inside a loop inside a container. That they don’t crash so far → is a lucky accident :slight_smile:
1 Like

I made a testcase showing various ways how it can go wrong, and confirmed in practice the grim prediction above. Indeed the loop itself may enter into a crash if we switch view during OnTimer – this is on Linux/x86_64:

TestStopViewFromEvent with activator MyTimer
TTestCastleUIControls.TestStopViewFromEvent: Failed: Access violation
An unhandled exception occurred at $000000000073F02D:
EAccessViolation: Access violation
  $000000000073F02D  COUNT,  line 169 of ../src/ui/castleuicontrols_internalchildrencontrols.inc
  $0000000000736B39  CONTROLSCOUNT,  line 1864 of ../src/ui/castleuicontrols_userinterface.inc
  $00000000007300D1  RECURSIVEUPDATE,  line 1578 of ../src/ui/castleuicontrols_container.inc
  $00000000007300F4  RECURSIVEUPDATE,  line 1579 of ../src/ui/castleuicontrols_container.inc
  $000000000072FB51  EVENTUPDATE,  line 1739 of ../src/ui/castleuicontrols_container.inc
  $00000000014DA946  FAKETIMER,  line 774 of code/testcases/testcastleuicontrols.pas
  $00000000014DA611  CORETESTSTOPVIEWFROMEVENT,  line 837 of code/testcases/testcastleuicontrols.pas
  $00000000014DAD9A  TESTSTOPVIEWFROMEVENT,  line 866 of code/testcases/testcastleuicontrols.pas
  $0000000001447058  RUN,  line 1652 of code/tester-castle/castletester.pas
  $000000000144043C  RUNTEST,  line 645 of code/tester-castle/castletester.pas
  $000000000144028E  RUN,  line 622 of code/tester-castle/castletester.pas
  $00000000005309A3  RUN,  line 74 of code/tester-castle/castleconsoletester.pas
  $00000000004AB471  main,  line 44 of castle_tester_standalone.dpr

Working on a fix :slight_smile:

1 Like

Almost done – a have a version of the solution on a branch views-scheduled-change . You’re welcome to test it immediately. Or wait for merge to master, which I plan tomorrow (Monday).

Overview:

View changes (setting Container.View:=.., calling Container.PushView, Container.PopView) are no longer performed immediately, they are delayed.

This fixes the crashes / warnings in my tests, and should also fix your issue with timer usage. Please test your application (letting timer code just use OnTimer called automatically) and let me know.

The change is now in a branch views-scheduled-change.

I plan to merge it Monday. I would have merged it today, it passes new automatic tests, it works in CGE examples… but there are regressions in 2 weird placed when I tested it with “Unholy Society” and “Escape from the Universe” code. I’m investigating the cause, and whether this is a bug in CGE / game code / whether this can be really blamed on the change or maybe we had relied on some accidental behavior previously. Of course, ideally 100% things should work exactly the same after the change, but that isn’t really possible… Anyhow, I’ll debug it and let here know what are conclusions.

Details how it works / what changed:

  • We maintain the “scheduled view operations” list and “scheduled view stack” state. Setting Container.View:=.., calling Container.PushView, Container.PopView → adds operation to the list and changes the “scheduled view stack” state immediately, but does not change the current view stack.

    • In particular, this also means that Container.View is X immediately after Container.View:=X, which means this behaves intuitively, even though actual change may be delayed. So code like Container.View := X; if Container.View = X then ... will behave as expected.
  • The functions/properties that still expose the view stack have now 2 versions:

    • CurrentFrontView, CurrentViewStackCount, CurrentViewStack[...]. Correspond to really started / stopped views, so e.g. you know that CurrentFrontView has done it’s Start + Resume. They “lag behind” the calls to View:=..., PushView, PopView.

    • PendingFrontView, PendingViewStackCount, PendingViewStack[...]. Changed immediately by View:=..., PushView, PopView. In contrast to above, you cannot be certain that Start + Resume has been done on PendingFrontView, it may still be in waiting queue.

    • The unprefixed versions, that existed in the past (FrontView, ViewStackCount, ViewStack[...]) refer to the “current” variant (this matches the actual usage in all my examples, including “Unholy Society” and “Escape from the Universe” games). They are also deprecated – better user new versions with longer but more obvious now names.

  • You can still force making immediate changes by:

    • when 1. InternalViewChangeImmediate is true (only temporary hack, for internal usage/testing)
    • or 2. when doing Container.View:=X when view stack is empty, then we do the change automatically immediately. This is useful to have code like Window.Container.View := X; C := X.DesignedComponent(...) ...; keep working as before. And it is safe, since there’s no view stop done in this case, so we can as well do it immediately.

Even more notes, read only if you’re really curious:

  • When working on this, I uncovered why this problem remained hidden for so long, not causing actual crashes: the most common case, changing view from button’s OnClick, didn’t cause crashes by accident of our “capture click” logic, that was doing:

      if (Capture <> nil) and (Capture <> ForceCaptureInput) then
      begin
        Result := Capture.PreviewRelease(Event);
        Result := Capture.Release(Event);
        Exit; // if something is capturing the input, prevent other controls from getting the events
      end;
    

    So when clicking on a button… we quickly Exit, after calling Release (which calls DoClick which calls OnClick which may switch views, freeing the button). And this doesn’t iterate over controls (the Exit skips the recursive iteration) so we avoid any “freeing control while looping over it” problems. So by accident this most-typical-event-usage never caused crashes. I still consider this a “bug waiting to happen”, I never designed that mechanism to allow freeing the Capture instance within Capture.Release call – an unrelated modification of logic there could trigger the bug in all CGE games.

  • If you still manage to make the mistake of freeing the component from it’s even handler, we will now make clear warning:

    Warning: Component "ButtonPlay" (class "TCastleButton") is being destroyed while processing its own event "Release(for Capture)". This can lead to undefined behavior and crashes. Defer destroying it to later, e.g. with ApplicationProperties.FreeDelayed.
    

    Of course, our changes to “schedule view changes” mean that you should not actually see this message anymore in normal usage. But you can still achieve it e.g. by explicitly calling sthg crazy like Sender.Free from button’s OnClick handler :slight_smile:

  • Note that the new warnings themselves may uncover some crashes… because the check itself, doing Dec(C.ProcessingEvents), will now access the memory of the dangling pointer if things went wrong. Avoiding it didn’t seem possible (a dictionary like component->ProcessingEvents would be a performance hit).

    • At least you will get a clear warning about it in the log, now, before crash.
    • And it points to a code that was always invalid, strictly speaking → maybe just worked by accident before the change.
    • And of course the crashes around view switching should never happen thanks to new scheduling mechanism.
    • Still, if you do something crazy explicitly (like Sender.Free from OnClick handler) then now we have, after a clear warning, more chance to crash than to accidentally continue working.
  • Also some old deprecated class properties/methods on TCastleView, like TCastleView.StateStackCount, have been removed. Their counterparts on container are now more complicated, so it would be confusing to have 2 layers of deprecated API.

1 Like

A really good one !

I will test it on windows and android, and post result, pretty sure it would be flawless :+1:

Thanks for taking care about this, despite namely for me I had a workaround.

Ofc I understand that such sleeping bomb is anyway needed to be addressed, but not sure if I’m proud that I found it and somewhat inspected, or ashamed I made you spend time on it :sweat_smile:

1 Like

It was indeed somewhat a “sleeping bomb”. Serious problem in a view API, where using it in a completely natural way could lead to crashes/undefined behavior. I’m very happy you reported it and that it’s fixed now. The sooner → the better :slight_smile:

It is all merged to master now!

Some details what I’ve done, in addition to previous comments:

  • The “delay” mechanism is now done only when we’re processing events. Calling from outside (e.g. setting Container.View:=InitialView; from Application.OnInitialize or doing Container.View:=nil from some unit’s finalization) happens immediately.

    • ( This removes regression in “Escape from the Universe” code that I mentioned in previous post. )
  • For testing, InternalViewChangeImmediate is no longer available (unnecessary and was complicated to support), OTOH we added InternalForceViewChangeDelayed.

  • There are better API docs of this at TCastleContainer.View property, various related methods / properties link to it.

  • I also solved 2 regressions in “Unholy Society”. These deserved 2 fixes in the game – this is unfortunately the kind of regressions that may happen because of this change, and it’s unavoidable.

    1. The code of game was having assumption that PushView is synchronous, having code like
    if ItemToDisplay <> nil then
    begin
      Container.PushView(ItemPopup);
      ItemPopup.LabelItemName.Caption := ItemToDisplay.Name;
    end;
    

    And the above code executed in response to key press or touch (that trigger picking up item). So it was wrong now – Container.PushView is deferred, and ItemPopup.LabelItemName.Caption is not actually initialized now.

    ( Note that this example is a simplified version of what happens, and one can argue we could add PushView to exceptions to do it without delay, though it means we rely that previous view’s Pause doesn’t free any component. But the actual code is also doing PopView before in some situations, which really has to be delayed, following our findings (as it may free components). )

    1. And another case that required adjusting:
    procedure LocationChange;
    begin
      Container.View := nil; // we relied this makes Stop, which makes SaveGame with old Location
      GlobalLocation := NewLocation;
    end;
    

    This was again wrong after engine change, and there’s no magic solution that avoids changing game code. Container.View := nil is now delayed when done from event (and it’s a good security, it frees a button that possibly initiated this action!). And it means that Stop of the view calls SaveGame when GlobalLocation already changed to a new value.

    ( The fix I used here was to make the call to do AutoSave explicit, not from view’s Stop method. )

I will soon make a news post describing this change, reusing some bits of information from this thread, so that developers know about this change and what they (may) need to do to adjust.

1 Like

Get back to the machine where I have setup for Android compilation from sources (as usual flow I rely on github cicd :smiling_face_with_sunglasses: ) and played a little with the latest CGE changes, I reverted the mentioned fix and enabled back the flow where game event (hero death) enables timer, timer fires and in its handler I switch (replace) the view to the lose-view. It went well on Android (where it was failing previously) :ok_hand:

actually the timer approach is a little bit more comfortable, as it allows to setup time for scene switch without recompilation (ofc there are other ways, like configs, but timer in view.c-u-i file is a native instrument for it)

Also noticed small issue with my Java code (it is giving compilation error unfortunately) from last PR and submitted a fix:

1 Like

I extended our manual to document this: section “Beware: View changes are executed with a delay” in the Views chapter. Soon to be announced in news.

1 Like