You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
118 lines
5.3 KiB
118 lines
5.3 KiB
From 1b07c7663b6a5c19c9303088d63c39dba7e3bb36 Mon Sep 17 00:00:00 2001
|
|
From: Eric Blake <eblake@redhat.com>
|
|
Date: Fri, 14 Apr 2023 10:33:58 -0500
|
|
Subject: [PATCH 1/5] migration: Handle block device inactivation failures
|
|
better
|
|
|
|
RH-Author: Eric Blake <eblake@redhat.com>
|
|
RH-MergeRequest: 273: migration: prevent source core dump if NFS dies mid-migration
|
|
RH-Bugzilla: 2177957
|
|
RH-Acked-by: Miroslav Rezanina <mrezanin@redhat.com>
|
|
RH-Acked-by: quintela1 <quintela@redhat.com>
|
|
RH-Acked-by: Kevin Wolf <kwolf@redhat.com>
|
|
RH-Commit: [1/3] 5892c17ca0a21d824d176e7398d12f7cf991651d (ebblake/qemu-kvm)
|
|
|
|
Consider what happens when performing a migration between two host
|
|
machines connected to an NFS server serving multiple block devices to
|
|
the guest, when the NFS server becomes unavailable. The migration
|
|
attempts to inactivate all block devices on the source (a necessary
|
|
step before the destination can take over); but if the NFS server is
|
|
non-responsive, the attempt to inactivate can itself fail. When that
|
|
happens, the destination fails to get the migrated guest (good,
|
|
because the source wasn't able to flush everything properly):
|
|
|
|
(qemu) qemu-kvm: load of migration failed: Input/output error
|
|
|
|
at which point, our only hope for the guest is for the source to take
|
|
back control. With the current code base, the host outputs a message, but then appears to resume:
|
|
|
|
(qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
|
|
|
|
(src qemu)info status
|
|
VM status: running
|
|
|
|
but a second migration attempt now asserts:
|
|
|
|
(src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
|
|
|
|
Whether the guest is recoverable on the source after the first failure
|
|
is debatable, but what we do not want is to have qemu itself fail due
|
|
to an assertion. It looks like the problem is as follows:
|
|
|
|
In migration.c:migration_completion(), the source sets 'inactivate' to
|
|
true (since COLO is not enabled), then tries
|
|
savevm.c:qemu_savevm_state_complete_precopy() with a request to
|
|
inactivate block devices. In turn, this calls
|
|
block.c:bdrv_inactivate_all(), which fails when flushing runs up
|
|
against the non-responsive NFS server. With savevm failing, we are
|
|
now left in a state where some, but not all, of the block devices have
|
|
been inactivated; but migration_completion() then jumps to 'fail'
|
|
rather than 'fail_invalidate' and skips an attempt to reclaim those
|
|
those disks by calling bdrv_activate_all(). Even if we do attempt to
|
|
reclaim disks, we aren't taking note of failure there, either.
|
|
|
|
Thus, we have reached a state where the migration engine has forgotten
|
|
all state about whether a block device is inactive, because we did not
|
|
set s->block_inactive in enough places; so migration allows the source
|
|
to reach vm_start() and resume execution, violating the block layer
|
|
invariant that the guest CPUs should not be restarted while a device
|
|
is inactive. Note that the code in migration.c:migrate_fd_cancel()
|
|
will also try to reactivate all block devices if s->block_inactive was
|
|
set, but because we failed to set that flag after the first failure,
|
|
the source assumes it has reclaimed all devices, even though it still
|
|
has remaining inactivated devices and does not try again. Normally,
|
|
qmp_cont() will also try to reactivate all disks (or correctly fail if
|
|
the disks are not reclaimable because NFS is not yet back up), but the
|
|
auto-resumption of the source after a migration failure does not go
|
|
through qmp_cont(). And because we have left the block layer in an
|
|
inconsistent state with devices still inactivated, the later migration
|
|
attempt is hitting the assertion failure.
|
|
|
|
Since it is important to not resume the source with inactive disks,
|
|
this patch marks s->block_inactive before attempting inactivation,
|
|
rather than after succeeding, in order to prevent any vm_start() until
|
|
it has successfully reactivated all devices.
|
|
|
|
See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
|
|
|
|
Signed-off-by: Eric Blake <eblake@redhat.com>
|
|
Reviewed-by: Juan Quintela <quintela@redhat.com>
|
|
Acked-by: Lukas Straub <lukasstraub2@web.de>
|
|
Tested-by: Lukas Straub <lukasstraub2@web.de>
|
|
Signed-off-by: Juan Quintela <quintela@redhat.com>
|
|
(cherry picked from commit 403d18ae384239876764bbfa111d6cc5dcb673d1)
|
|
---
|
|
migration/migration.c | 5 ++---
|
|
1 file changed, 2 insertions(+), 3 deletions(-)
|
|
|
|
diff --git a/migration/migration.c b/migration/migration.c
|
|
index 0885549de0..08e5e8f013 100644
|
|
--- a/migration/migration.c
|
|
+++ b/migration/migration.c
|
|
@@ -3256,13 +3256,11 @@ static void migration_completion(MigrationState *s)
|
|
MIGRATION_STATUS_DEVICE);
|
|
}
|
|
if (ret >= 0) {
|
|
+ s->block_inactive = inactivate;
|
|
qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
|
|
ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
|
|
inactivate);
|
|
}
|
|
- if (inactivate && ret >= 0) {
|
|
- s->block_inactive = true;
|
|
- }
|
|
}
|
|
qemu_mutex_unlock_iothread();
|
|
|
|
@@ -3321,6 +3319,7 @@ fail_invalidate:
|
|
bdrv_invalidate_cache_all(&local_err);
|
|
if (local_err) {
|
|
error_report_err(local_err);
|
|
+ s->block_inactive = true;
|
|
} else {
|
|
s->block_inactive = false;
|
|
}
|
|
--
|
|
2.39.1
|
|
|