Skip to content

KAFKA-20441: Fix handling of cordoned log dirs#22070

Open
mimaison wants to merge 2 commits intoapache:trunkfrom
mimaison:kafka-20441
Open

KAFKA-20441: Fix handling of cordoned log dirs#22070
mimaison wants to merge 2 commits intoapache:trunkfrom
mimaison:kafka-20441

Conversation

@mimaison
Copy link
Copy Markdown
Member

@mimaison mimaison commented Apr 15, 2026

Following feedback from
#21273 (review)

Reviewers: Chia-Ping Tsai chia7712@gmail.com, José Armando García
Sancio jsancio@users.noreply.github.com

@github-actions github-actions bot added kraft small Small PRs labels Apr 15, 2026
@mimaison
Copy link
Copy Markdown
Member Author

cc @chia7712 @jsancio

@mimaison mimaison marked this pull request as ready for review April 16, 2026 07:12
"value for inControlledShutdown field: %x", record, record.inControlledShutdown())));
Optional<List<Uuid>> directoriesChange = Optional.ofNullable(record.logDirs()).filter(list -> !list.isEmpty());
Optional<List<Uuid>> cordonedDirectoriesChange = Optional.ofNullable(record.cordonedLogDirs()).filter(list -> !list.isEmpty());
Optional<List<Uuid>> cordonedDirectoriesChange = Optional.ofNullable(record.cordonedLogDirs());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it should be identical. Updated

Copy link
Copy Markdown
Member

@jsancio jsancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @mimaison . I just have some concerns that there may be a race during startup between the RPCs (registration and heartbeat) and the broker updating the dynamic configuration for cordoned log dir. Can you please check?

Also, in the previous PR the BrokerLifecycleManager has this state: Map<Uuid, Boolean>. I got the impression that we can just make it Set<Uuid>. What do you think?

directories = new ArrayList<>(directories);
directories.sort(Uuid::compareTo);
this.directories = Collections.unmodifiableList(directories);
this.cordonedDirectories = Collections.unmodifiableList(cordonedDirectories);
Copy link
Copy Markdown
Member

@jsancio jsancio Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the controller validate that cordonedDirectory is a subset of directories? Ideally, the controller will fail the registration and heartbeat request if this invariant is violated. What do you think?

Does the broker perform a similar validation when it validates the broker config? How about during dynamic configuration does the broker validate that cordoned is a subset of directories? Looks like the answer is yes but thought I still ask.

"about": "Log directories that are cordoned." }
{ "name": "CordonedLogDirs", "type": "[]uuid", "versions": "3+", "taggedVersions": "3+", "tag": 3,
"nullableVersions": "3+", "default": "null",
"about": "Log directories that are cordoned or null if no change." }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the documentation. The new schema looks good to me.

Optional<Boolean> inControlledShutdownChange,
Optional<List<Uuid>> directoriesChange,
Optional<List<Uuid>> cordonedDirectoriesChange
) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a race with dynamic cordoned directory. Not sure if we have a test for this. Let's assume that log dirs in broker.properties is (X, Y, Z) and that cordoned log dirs in broker.properties is (X). Let's also assume that the dynamic cordon directory was set to (X, Y). This means that the broker registration has the log dir as (X, Y, Z) and the cordoned log dir as (X, Y).

Looking at the broker startup code it looks like the broker will send log dirs (X, Y, Z) and cordoned log dir (X). This will override the metadata's cordon log dir to (X). Now the cordon log dir is (X) and not (X, Y).

While replaying the cluster metadata dynamic configuration it will discover that the cordoned log dir is (X, Y). The next broker heartbeat will set the cordoned log dir to (X, Y). In short there seems to be a race because broker registration and broker heartbeat, and the broker replaying the dynamic broker config. Does this match your understanding?

Take a look at this set of instructions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants