Skip to content

fix(rpc): derive /raft/node leadership from raft leader ID#3266

Open
auricom wants to merge 4 commits intomainfrom
fix/3265-raft-node-leadership-state
Open

fix(rpc): derive /raft/node leadership from raft leader ID#3266
auricom wants to merge 4 commits intomainfrom
fix/3265-raft-node-leadership-state

Conversation

@auricom
Copy link
Copy Markdown
Contributor

@auricom auricom commented Apr 20, 2026

Fixes #3265

Tested this PR on a 3 members raft cluster, with this fix, the leader detection displayed on /raft/node endpoint is instantaneous instead of taking ~ 30 sedonds.

Overview

Summary by CodeRabbit

  • New Features

    • Added public API for safe leader ID retrieval from raft nodes with null-safety checks.
    • Enhanced leader status detection with fallback mechanism for improved reliability.
  • Tests

    • Added comprehensive HTTP endpoint tests covering leader status, empty leader scenarios, and HTTP method validation.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 20, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed⏩ skipped✅ passed✅ passedApr 20, 2026, 10:04 AM

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Apr 20, 2026

Claude finished @auricom's task in 1m 45s —— View job


Code Review

  • Gather context and read changed files
  • Review pkg/raft/node.go
  • Review pkg/rpc/server/http.go
  • Review pkg/rpc/server/server.go
  • Review pkg/rpc/server/http_test.go
  • Post final review

Summary

This PR correctly fixes the stale leadership state bug by deriving is_leader from the authoritative LeaderID() comparison (leaderID == nodeID) rather than trusting the possibly-lagging IsLeader() state machine field. The implementation is clean and the latest commits have already addressed the initial CodeRabbit feedback (doc comment on LeaderID, consolidating to a table-driven test, adding the critical "leader differs" case).


What's Good

  • Correct root cause fix: Comparing LeaderID() to NodeID() is the right approach — raft.State() can lag after a leadership transition, while LeaderWithID() reflects the current consensus view. (http.go:144-146)
  • Nil safety: LeaderID() guards against both nil receiver and nil inner raft field, matching the pattern of IsLeader() and HasQuorum(). (node.go:265-270)
  • Good doc comment: The stale-value caveat and HasQuorum cross-check guidance on LeaderID() is exactly the right kind of documentation for a concurrent state accessor.
  • Complete test matrix: The table-driven test includes the exact failure scenario this PR targets (leaderID="node-b", nodeID="node-a"is_leader=false) plus the empty-fallback and method-not-allowed cases.

Issues

Suggestion: expose leader_id in the /raft/node response

The response currently only returns is_leader and node_id, but operators troubleshooting a split-brain or stale-leader scenario need to know which node is the leader — not just whether the queried node is it. Adding leader_id to the response body is a low-cost change that would eliminate the need to poll every node.

// pkg/rpc/server/http.go
rsp := struct {
    IsLeader bool   `json:"is_leader"`
    NodeID   string `json:"node_id"`
    LeaderID string `json:"leader_id"`
}{
    IsLeader: isLeader,
    NodeID:   raftNode.NodeID(),
    LeaderID: leaderID,
}

Fix this →


Minor: JSON encode error handling fires after headers are sent

In http.go:155-157:

if err := json.NewEncoder(w).Encode(rsp); err != nil {
    http.Error(w, "Internal Server Error", http.StatusInternalServerError)
}

http.Error is called after Encode has already begun writing to w, so the 500 status can't be sent — this will produce a superfluous response.WriteHeader log warning. The correct pattern is to encode into a bytes.Buffer first, then write atomically. That said, encoding a struct of bool + string should never fail in practice, so the real-world impact is zero. This is also a pre-existing pattern in the file, not introduced by this PR.


Minor: nil store in raft-node tests could mask panics

http_test.go:122 passes nil as the store argument. Since nil is also passed for mux.HandleFunc("/health/live", ...) which calls s.Height(...), an accidental request to that endpoint would panic. The tests are focused enough that this isn't a real risk, but using a mocks.NewMockStore(t) (even without expectations) would be safer.


Minor: LeaderID() nil-guard paths not covered by tests

Codecov flags 4 lines in node.go as uncovered — these are the n == nil and n.raft == nil guards in LeaderID(). The existing IsLeader() and HasQuorum() guards have similar gaps. A unit test like:

func TestLeaderID_nilNode(t *testing.T) {
    var n *Node
    assert.Equal(t, "", n.LeaderID())
}

would cover the nil-receiver path cheaply. Fix this →


Verdict

The core fix is correct and well-tested. The suggestions above are quality-of-life improvements rather than blockers. Good to merge once the PR description is filled in (as flagged by CodeRabbit's pre-merge checks).

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

A new public LeaderID() method is added to the Node type, the RaftNodeSource interface is updated to include this method, the /raft/node HTTP endpoint now derives leader status by comparing LeaderID to NodeID, and comprehensive tests are added for the endpoint.

Changes

Cohort / File(s) Summary
Core API Addition
pkg/raft/node.go
Added public LeaderID() string method that safely returns the current raft leader ID, or empty string if receiver or raft is nil.
Interface Update
pkg/rpc/server/server.go
Extended RaftNodeSource interface with new LeaderID() string method requirement.
HTTP Endpoint Update
pkg/rpc/server/http.go
Modified /raft/node GET handler to derive is_leader by comparing LeaderID() to NodeID(), falling back to IsLeader() when LeaderID is empty.
Test Coverage
pkg/rpc/server/http_test.go
Added test helper type testRaftNodeSource and three new tests verifying /raft/node endpoint behavior for leader status derivation, empty leader ID fallback, and HTTP method validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A leader's beacon, once shrouded and veiled,
Now hops forth in code, where queries prevail! 🐰
LeaderID shines bright through the interface's call,
With tests standing guard to certify it all! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description includes a link to the related issue and mentions testing, but lacks detailed explanation of the problem, solution, and technical rationale. Expand the description to explain why deriving leadership from raft leader ID is more reliable, what the previous approach was, and provide more context about the 30-second delay issue.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: introducing leader ID-based leadership derivation in the /raft/node endpoint.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3265-raft-node-leadership-state

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
pkg/raft/node.go (1)

211-216: Add a doc comment for exported LeaderID.

Other exported methods on Node (e.g., IsLeader, HasQuorum) are documented; LeaderID should be as well.

📝 Proposed doc
+// LeaderID returns the raft leader's server ID, or "" if unknown or the node is not initialized.
 func (n *Node) LeaderID() string {
 	if n == nil || n.raft == nil {
 		return ""
 	}
 	return n.leaderID()
 }

As per coding guidelines: "Document exported types and functions in Go code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/raft/node.go` around lines 211 - 216, Add a Go doc comment for the
exported function LeaderID on type Node: place a comment immediately above func
(n *Node) LeaderID() string that describes what LeaderID returns (the current
cluster leader's ID as a string), its behavior when the receiver is nil or
uninitialized (returns empty string), and any guarantees or semantics (e.g., may
be stale until raft leadership changes or only valid when raft != nil). Use the
style of the existing comments for Node.IsLeader and Node.HasQuorum for
consistency.
pkg/rpc/server/http_test.go (2)

71-75: Test intent would be clearer by removing the misleading isLeader: false.

Setting isLeader: false while asserting body.IsLeader == true is correct (the handler derives from LeaderID), but makes the test read ambiguously — a reader might think the assertion is wrong. Either drop the field (zero value is false anyway, which actually reinforces the point) or add a comment noting that IsLeader() is intentionally stale here to verify LeaderID() takes precedence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rpc/server/http_test.go` around lines 71 - 75, The test uses
testRaftNodeSource with isLeader:false while asserting body.IsLeader == true,
which is confusing; update the test by either removing the isLeader field (let
it use the zero value) or add a short clarifying comment next to the
testRaftNodeSource instantiation stating that IsLeader() is intentionally stale
and the handler should derive leadership from LeaderID() via LeaderID();
reference testRaftNodeSource, IsLeader, LeaderID, and body.IsLeader in the
comment or change so the intent is explicit.

67-147: Consider consolidating the three /raft/node tests into a table-driven test.

The three new tests are largely duplicated setup/teardown with only input and expected output differing. A single table-driven test would be more idiomatic and easier to extend (e.g., leader mismatch case: leaderID="node-b", nodeID="node-a"is_leader=false, which is currently not covered and is arguably the most important new behavior).

♻️ Suggested structure
func TestRegisterCustomHTTPEndpoints_RaftNodeStatus(t *testing.T) {
    specs := map[string]struct {
        src            testRaftNodeSource
        method         string
        expStatus      int
        expIsLeader    bool
        expNodeID      string
    }{
        "leader id matches node id":        {src: testRaftNodeSource{leaderID: "node-a", nodeID: "node-a"}, method: http.MethodGet, expStatus: http.StatusOK, expIsLeader: true, expNodeID: "node-a"},
        "leader id differs from node id":   {src: testRaftNodeSource{leaderID: "node-b", nodeID: "node-a"}, method: http.MethodGet, expStatus: http.StatusOK, expIsLeader: false, expNodeID: "node-a"},
        "empty leader id falls back":       {src: testRaftNodeSource{isLeader: false, leaderID: "", nodeID: "node-a"}, method: http.MethodGet, expStatus: http.StatusOK, expIsLeader: false, expNodeID: "node-a"},
        "non-GET is rejected":              {src: testRaftNodeSource{}, method: http.MethodPost, expStatus: http.StatusMethodNotAllowed},
    }
    for name, tc := range specs {
        t.Run(name, func(t *testing.T) { /* ... */ })
    }
}

Adding the "leader id differs from node id" case is especially worthwhile — it's the exact scenario the fix targets (node that stepped down but previously reported itself as leader). As per coding guidelines: "Use table-driven tests where appropriate".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/rpc/server/http_test.go` around lines 67 - 147, Consolidate the three
near-duplicate tests TestRegisterCustomHTTPEndpoints_RaftNodeStatus,
TestRegisterCustomHTTPEndpoints_RaftNodeStatusFallsBackWithoutLeaderID, and
TestRegisterCustomHTTPEndpoints_RaftNodeStatusMethodNotAllowed into a single
table-driven test that iterates over cases (using testRaftNodeSource entries)
and calls RegisterCustomHTTPEndpoints for each case; include cases for
leader-matches (leaderID == nodeID → is_leader true), leader-differs (leaderID
!= nodeID → is_leader false), empty leaderID fallback, and non-GET method
expecting StatusMethodNotAllowed, and assert expected HTTP status and decoded
response fields (IsLeader, NodeID) per case.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@pkg/raft/node.go`:
- Around line 211-216: Add a Go doc comment for the exported function LeaderID
on type Node: place a comment immediately above func (n *Node) LeaderID() string
that describes what LeaderID returns (the current cluster leader's ID as a
string), its behavior when the receiver is nil or uninitialized (returns empty
string), and any guarantees or semantics (e.g., may be stale until raft
leadership changes or only valid when raft != nil). Use the style of the
existing comments for Node.IsLeader and Node.HasQuorum for consistency.

In `@pkg/rpc/server/http_test.go`:
- Around line 71-75: The test uses testRaftNodeSource with isLeader:false while
asserting body.IsLeader == true, which is confusing; update the test by either
removing the isLeader field (let it use the zero value) or add a short
clarifying comment next to the testRaftNodeSource instantiation stating that
IsLeader() is intentionally stale and the handler should derive leadership from
LeaderID() via LeaderID(); reference testRaftNodeSource, IsLeader, LeaderID, and
body.IsLeader in the comment or change so the intent is explicit.
- Around line 67-147: Consolidate the three near-duplicate tests
TestRegisterCustomHTTPEndpoints_RaftNodeStatus,
TestRegisterCustomHTTPEndpoints_RaftNodeStatusFallsBackWithoutLeaderID, and
TestRegisterCustomHTTPEndpoints_RaftNodeStatusMethodNotAllowed into a single
table-driven test that iterates over cases (using testRaftNodeSource entries)
and calls RegisterCustomHTTPEndpoints for each case; include cases for
leader-matches (leaderID == nodeID → is_leader true), leader-differs (leaderID
!= nodeID → is_leader false), empty leaderID fallback, and non-GET method
expecting StatusMethodNotAllowed, and assert expected HTTP status and decoded
response fields (IsLeader, NodeID) per case.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 02c32518-6288-421f-9aab-a1309d349144

📥 Commits

Reviewing files that changed from the base of the PR and between 4c7323f and 2b3e5f9.

📒 Files selected for processing (4)
  • pkg/raft/node.go
  • pkg/rpc/server/http.go
  • pkg/rpc/server/http_test.go
  • pkg/rpc/server/server.go

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.54%. Comparing base (d9046e6) to head (ed2f57d).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/raft/node.go 0.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3266      +/-   ##
==========================================
+ Coverage   62.52%   62.54%   +0.02%     
==========================================
  Files         122      122              
  Lines       13012    13020       +8     
==========================================
+ Hits         8136     8144       +8     
+ Misses       3992     3990       -2     
- Partials      884      886       +2     
Flag Coverage Δ
combined 62.54% <55.55%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

auricom and others added 3 commits April 20, 2026 11:54
- Add Go doc comment to Node.LeaderID() describing return value, nil-safety,
  and staleness semantics, consistent with IsLeader/HasQuorum style.
- Consolidate three near-duplicate TestRegisterCustomHTTPEndpoints_RaftNodeStatus*
  tests into a single table-driven test covering: leaderID==nodeID (is_leader
  true), leaderID!=nodeID (is_leader false), empty leaderID fallback, and
  non-GET method (405). Clarifies that is_leader is derived from LeaderID(),
  not the IsLeader() field on testRaftNodeSource.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Separate third-party imports from evstack/ev-node-prefixed imports
per project gci custom-order config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@auricom auricom enabled auto-merge April 20, 2026 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/raft/node endpoint lags behind actual raft leadership state

1 participant