Skip to content

task: tighten AUTO read-ahead wait-clear to match MDI (#3650)#3971

Draft
grandixximo wants to merge 1 commit intoLinuxCNC:masterfrom
grandixximo:fix/3650-queue-not-empty-probe
Draft

task: tighten AUTO read-ahead wait-clear to match MDI (#3650)#3971
grandixximo wants to merge 1 commit intoLinuxCNC:masterfrom
grandixximo:fix/3650-queue-not-empty-probe

Conversation

@grandixximo
Copy link
Copy Markdown

@grandixximo grandixximo commented Apr 26, 2026

Fixes the long-standing "Queue is not empty after probing" error.

mdi_execute_hook has, since 2012, gated wait-clear on execState==DONE && motion.traj.queue==0 && io.status==DONE. readahead_reading (AUTO mode) checks only execState==DONE, on the assumption that the prior EMC_TASK_PLAN_SYNCH precondition (WAITING_FOR_MOTION_AND_IO) had already drained motion.

That assumption breaks across cycles: emcTaskIssueCommand writes EMC_TRAJ_PROBE to motion via shared memory, then the same task cycle's emcMotionUpdate may snapshot emcmotStatus before the servo loop has added the new segment. motion.status reads DONE (stale), the SYNCH precondition transitions immediately, and the next Interp::_read runs read_inputs against a queue that is now non-empty. CHKS fires.

Patch: add the same two conjuncts to the AUTO check.

Closes #3650, #662, #263.

Complementary to #3537 (already merged): that fix handled probe re-trip during deceleration. This one handles the inter-command snapshot-lag race that lets the same family of bugs leak through.

cc @andypugh, @DauntlessAq, @rmu75, @BsAtHome: would value review on the wait-clear semantics.
cc @tiket18, @Cromaglious: could you test against your reproducing workload? Can use built artifacts for testing

Test plan

  • Cyclic G38.3/G38.5 sim loop (>20k iterations), error gone
  • Single-probe + tool-setter still latch `#5061..#5069` correctly
  • Reporter validation against XXYYZ scan + smartprobe workloads
  • 2.9 backport candidate after master soak

Auto mode cleared the read-ahead wait on execState==DONE alone, so a
G38 issued just before an emcMotionUpdate snapshot could let the next
read fire while motion.traj.queue was still non-zero, tripping
NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING. Add the queue==0 + io.status
==DONE conjuncts already used by mdi_execute_hook.

Closes LinuxCNC#3650, LinuxCNC#662, LinuxCNC#263.
@rmu75 rmu75 added bug v2.9 candidate Things that would be nice to have for 2.9 2.10-candidate would be nice to have fixed in 2.10 labels Apr 26, 2026
@rmu75
Copy link
Copy Markdown
Collaborator

rmu75 commented Apr 26, 2026

Not sure I really qualify to comment, but explanation makes sense.

@grandixximo
Copy link
Copy Markdown
Author

Need testers, I will try to post in the forum as well

@grandixximo
Copy link
Copy Markdown
Author

I'll send a 2.9 PR just so I can have debs built, users on the forum want 2.9 to test the fix

@DauntlessAq
Copy link
Copy Markdown
Contributor

I'm slightly confused, does this issue occur when the probe move starts, or when it ends?

Because the error message implies it is at the end, that is to say that the probe move (EMC_TRAJ_PROBE) is still in the queue when it should have been removed?

Aka, currently it is possible that a state sync check at the end of the probe move passed when it should not do so. This would be because execState==DONE was set before the move was REMOVED from the queue, due to a race condition.
So just checking execState==DONE isn't completely reliable, hence your more stringent checks being a fix.

This makes sense to me as an explanation that also matches the error message (but of course it might not be correct)

 

But when you say that:

emcTaskIssueCommand writes EMC_TRAJ_PROBE to motion via shared memory, then the same task cycle's emcMotionUpdate may snapshot emcmotStatus before the servo loop has added the new segment

Are you saying that the error actually occurs during a state sync BEFORE the probe move takes place, instead of AFTER? Because you're talking about a sync where EMC_TRAJ_PROBE is the move which is written?

Or are you talking about the END of the probe move, and the following move just happens to be EMC_TRAJ_PROBE in your testing because you are using consecutive probe moves in your testing (e.g. both towards and away from the part)?

But either way, you're saying that the race condition is that the move is ADDED prematurely, not that the condition passed prematurely because the move has not been fully REMOVED?

Surely the move wouldn't be added prematurely because the addition of the move to the interpreter queue would itself be gated behind the state sync check of execState==DONE so that the moves aren't added to the queue until the sync has occurred?
Else the next move would always be added during a probe move, so this "Queue is not empty after probing" error would happen every probe move?

 

Or are you saying that is the race condition, that upon completion of the probe move there is a brief period where moves can be added to the queue before the sync is commanded?

And this can allow the next move to be added to the queue before the sync takes place, causing the current sync logic to fail because the next move has been added to the queue despite the machine otherwise being in a state where the sync should pass?

In which case, while tightening up this logic might be a fix, we'd ideally also want to fix the issue closer to its orgin by preventing moves which require a sync beforehand from being added to the queue before a sync is confirmed to have occurred?

 

Just to show where the error message originates:

This is where the "Queue is not empty after probing" message is defined:

#define NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING _("Queue is not empty after probing")

And this is how the message is called:

if (settings->probe_flag) {

if (settings->probe_flag) {
CHKS((GET_EXTERNAL_QUEUE_EMPTY() == 0),
     NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING);
set_probe_data(&_setup);
settings->probe_flag = false;
}

The message is triggered when the interpreter is commanded to read an input, if the probe_flag is true and the queue is not empty then the error occurs. So the error definitely seems to occur AFTER a probe move.

@grandixximo
Copy link
Copy Markdown
Author

Good catch, my wording conflated two phases. Let me restate, with the caveat that this is my theoretical model of the race and still needs validation against the reporter workloads.

The CHKS fires at the END of probe N, on the next read_inputs call when reading line N+1. You're right that execState==DONE was set before motion truly drained.

The mechanism I think is upstream of that. Trace:

  1. Cycle K: task pulls EMC_TRAJ_PROBE from interp_list, issues to motion via shared memory in step 3. End of cycle K, emcMotionUpdate runs usrmotReadEmcmotStatus.
  2. If the servo loop hasn't run since the NML write, emcmotStatus.depth and the INPOS bit still reflect the pre-probe state. Snapshot says queue==0, motion.status==DONE.
  3. Cycle K+1: task pulls TASK_PLAN_SYNCH (queued by the probe's INTERP_EXECUTE_FINISH path). Precondition `WAITING_FOR_MOTION_AND_IO` reads the stale snapshot and transitions to DONE immediately, then issues SYNCH same cycle.
  4. Cycle K+2: `interp_list==0, cmd==0, execState==DONE`. AUTO wait-clear releases. By now motion is processing the probe, queue grows.
  5. Cycle K+3: read line N+1, `read_inputs` CHKS reads queue from latest snapshot, sees probe N still in flight, fires `NCE_QUEUE_IS_NOT_EMPTY_AFTER_PROBING`.

So the proposed race: SYNCH's precondition raced through using stale "motion idle" data, and AUTO wait-clear didn't double-check. MDI's wait-clear has, since 2012, included `motion.traj.queue==0 && io.status==DONE` and doesn't hit this. Patch brings AUTO to the same standard.

Probe N+1 in the user's program is incidental, the race would fire on whatever line N+1 happens to be. User's cyclic G38.x just makes it more visible because every line is queue-busted.

Surely the move wouldn't be added prematurely because the addition of the move to the interpreter queue would itself be gated behind the state sync check of execState==DONE so that the moves aren't added to the queue until the sync has occurred? Else the next move would always be added during a probe move, so this "Queue is not empty after probing" error would happen every probe move?

Right, normally the SYNCH precondition handles it. The hypothesis is that the race only fires when servo hasn't run between NML write and snapshot, which is timing-sensitive on `[TASK]CYCLE_TIME` vs servo period. Most cycles align fine, occasionally don't. Matches the "tens of thousands of probes pass, then random fail" reports across #3650, #662, #263.

To stress: this is theory until someone with a reproducing setup confirms the patch fixes it. If reporters here (#3650 tiket18, #263 Cromaglious) can rerun their failing workloads against this patch and the error stops firing, that's strong evidence the model is right and this is the fix. If it still fires, the model is wrong and we go deeper.

we'd ideally also want to fix the issue closer to its origin by preventing moves which require a sync beforehand from being added to the queue before a sync is confirmed to have occurred

Agreed there's a deeper fix available. Could tighten the SYNCH precondition itself, or the `EMC_TRAJ_PROBE` postcondition, to require explicit queue+io drain rather than just `motion.status==DONE`. This patch is the minimum intervention matching the 2012 MDI gate. Happy to explore the deeper refactor once we have empirical confirmation that this surface fix actually closes the reported failures.

@grandixximo
Copy link
Copy Markdown
Author

cc @mhaberler: you wrote the 2012 MDI wait-clear gate (81f105b477) that this PR mirrors into the AUTO path. Do you remember whether the motion.traj.queue==0 && io.status==DONE conjuncts were added for a specific reproducer, or as defense-in-depth? Any context you have on why AUTO was left looser would be valuable for understanding whether the same fix is sufficient there.

@mhaberler
Copy link
Copy Markdown

sorry, no recollection

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2.10-candidate would be nice to have fixed in 2.10 bug v2.9 candidate Things that would be nice to have for 2.9

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Queue is not empty after probing

4 participants