From c9eb5f8e86d031060c72aeb9d995844c6f842c58 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 19 Jun 2024 18:30:40 -0400
Subject: [PATCH 05/11] migration/postcopy: Add postcopy-recover-setup phase

RH-Author: Juraj Marcin <None>
RH-MergeRequest: 419: migration: New postcopy state, and some cleanups [rhel-9.5.z]
RH-Jira: RHEL-63874
RH-Acked-by: Peter Xu <peterx@redhat.com>
RH-Acked-by: Miroslav Rezanina <mrezanin@redhat.com>
RH-Commit: [5/11] ce81d3b247b9f9541a75265a07082394ce419f3a

This patch adds a migration state on src called "postcopy-recover-setup".
The new state will describe the intermediate step starting from when the
src QEMU received a postcopy recovery request, until the migration channels
are properly established, but before the recovery process takes place.

The request came from Libvirt, which currently relies on the migration
state events to detect migration state changes. That works for most of the
migration process, except for postcopy recovery failures at the beginning.

Currently postcopy recovery only has two major states:

  - postcopy-paused: this is the state that both sides of QEMU will be in
    for a long time, as long as the migration channel is interrupted.

  - postcopy-recover: this is the state where both sides of QEMU handshake
    with each other, preparing to continue a postcopy that was
    interrupted.

The issue here is that when the recovery port is invalid, the src QEMU will
take the URI/channels, notice the ports are not valid, and silently stay
in the postcopy-paused state, with no event sent to Libvirt. In this case,
the only thing Libvirt can do is to poll the migration status at a proper
interval, which is suboptimal.

Considering that this is the only case where Libvirt won't get a
notification from QEMU on such events, let's add postcopy-recover-setup
state to mimic what we have with the "setup" state of a newly initialized
migration, describing the phase of connection establishment.

With that, postcopy recovery will have two paths to go now, and either path
will guarantee that an event is generated. Now the events will look like
this during a recovery process on src QEMU:

  - Initially when the recovery is initiated on src, QEMU will go from
    "postcopy-paused" -> "postcopy-recover-setup". Old QEMUs don't have
    this event.

  - Depending on whether the channel re-establishment succeeds:

    - On success, src QEMU will move from "postcopy-recover-setup"
      to "postcopy-recover". Old QEMUs also have this event.

    - On failure, src QEMU will move from "postcopy-recover-setup" to
      "postcopy-paused" again. Old QEMUs don't have this event.

This guarantees that Libvirt will always receive a notification for
the recovery process.
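
The src-side event sequence described above can be modelled with a small
standalone sketch. This is not QEMU code: the SketchStatus enum,
sketch_set_state(), sketch_recover() and the events_emitted counter are
hypothetical stand-ins, assuming only that a state change happens (and is
reported) when the expected old state matches, as with migrate_set_state().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the MigrationStatus values. */
typedef enum {
    ST_POSTCOPY_PAUSED,
    ST_POSTCOPY_RECOVER_SETUP,
    ST_POSTCOPY_RECOVER,
} SketchStatus;

/* Counts state changes; stands in for the events a manager would see. */
static int events_emitted;

/* Switch state (and count an event) only if the expected old state matches. */
static void sketch_set_state(SketchStatus *state, SketchStatus old_state,
                             SketchStatus new_state)
{
    if (*state == old_state) {
        *state = new_state;
        events_emitted++;
    }
}

/*
 * Model one recovery attempt on src. channel_ok says whether the
 * channel could be re-established.
 */
static SketchStatus sketch_recover(SketchStatus state, bool channel_ok)
{
    assert(state == ST_POSTCOPY_PAUSED);

    /* Recovery requested: PAUSED -> RECOVER_SETUP (new event). */
    sketch_set_state(&state, ST_POSTCOPY_PAUSED, ST_POSTCOPY_RECOVER_SETUP);

    /* Success moves on to RECOVER; failure falls back to PAUSED. */
    sketch_set_state(&state, ST_POSTCOPY_RECOVER_SETUP,
                     channel_ok ? ST_POSTCOPY_RECOVER : ST_POSTCOPY_PAUSED);
    return state;
}
```

In this model both outcomes produce two state changes, so a listener is
notified whether or not the channel comes back.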

One thing to mention is that this new status is only needed on src QEMU,
not both. On dest QEMU, the state machine doesn't change, hence the events
don't change either. It's done this way because dest QEMU may not have an
explicit point of setup start. E.g., it can happen that dest QEMU
doesn't use the migrate-recover command to set a new URI/channel, but the
old URI/channels are reused in recovery, in which case the old ports simply
work again after the network routes are fixed up.

Add a new helper postcopy_is_paused() detecting whether postcopy is still
paused, taking RECOVER_SETUP into account too. When using it on both
src/dst, a slight change is made at the same time to always wait for the
semaphore before checking the status, because for both sides a sem_post()
will be required for a recovery.
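
One way to read the wait-loop change is sketched below as a
single-threaded toy model. It is an illustration under assumptions, not
the QEMU implementation: SkStatus, SkSem and the sk_* functions are
hypothetical, and the toy semaphore only counts posts instead of blocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant MigrationStatus values. */
typedef enum { SK_PAUSED, SK_RECOVER_SETUP, SK_RECOVER } SkStatus;

/* Same shape as the helper added by this patch, on the stand-in enum. */
static bool sk_is_paused(SkStatus s)
{
    return s == SK_PAUSED || s == SK_RECOVER_SETUP;
}

/* Toy counting semaphore: waiting just consumes one earlier post. */
typedef struct { int posts; } SkSem;

static void sk_sem_post(SkSem *sem)
{
    sem->posts++;
}

static void sk_sem_wait(SkSem *sem)
{
    assert(sem->posts > 0);   /* a real semaphore would block here */
    sem->posts--;
}

/* Old shape: check first, so a post that arrived early is never consumed. */
static void sk_wait_old(SkSem *sem, const SkStatus *state)
{
    while (*state == SK_PAUSED) {
        sk_sem_wait(sem);
    }
}

/* New shape: always consume one post, then re-check the state. */
static void sk_wait_new(SkSem *sem, const SkStatus *state)
{
    do {
        sk_sem_wait(sem);
    } while (sk_is_paused(*state));
}
```

If the state has already moved to SK_RECOVER and one post is pending, the
old loop returns without touching the semaphore, while the new loop
consumes exactly that post.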

Cc: Jiri Denemark <jdenemar@redhat.com>
Cc: Prasad Pandit <ppandit@redhat.com>
Reviewed-by: Fabiano Rosas <farosas@suse.de>
Buglink: https://issues.redhat.com/browse/RHEL-38485
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>

(cherry picked from commit 4146b77ec7640d3c30d42558e13423594b114385)

JIRA: https://issues.redhat.com/browse/RHEL-63874
Y-JIRA: https://issues.redhat.com/browse/RHEL-38485

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c    | 40 ++++++++++++++++++++++++++++++++++------
 migration/postcopy-ram.c |  6 ++++++
 migration/postcopy-ram.h |  3 +++
 migration/savevm.c       |  4 ++--
 qapi/migration.json      |  4 ++++
 5 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 21f20a8e1c..03e151a045 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1100,6 +1100,7 @@ bool migration_is_setup_or_active(void)
     case MIGRATION_STATUS_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
     case MIGRATION_STATUS_POSTCOPY_RECOVER:
     case MIGRATION_STATUS_SETUP:
     case MIGRATION_STATUS_PRE_SWITCHOVER:
@@ -1122,6 +1123,7 @@ bool migration_is_running(void)
     case MIGRATION_STATUS_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
     case MIGRATION_STATUS_POSTCOPY_RECOVER:
     case MIGRATION_STATUS_SETUP:
     case MIGRATION_STATUS_PRE_SWITCHOVER:
@@ -1273,6 +1275,7 @@ static void fill_source_migration_info(MigrationInfo *info)
     case MIGRATION_STATUS_PRE_SWITCHOVER:
     case MIGRATION_STATUS_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
     case MIGRATION_STATUS_POSTCOPY_RECOVER:
         /* TODO add some postcopy stats */
         populate_time_info(info, s);
@@ -1469,10 +1472,31 @@ static void migrate_error_free(MigrationState *s)
 
 static void migrate_fd_error(MigrationState *s, const Error *error)
 {
+    MigrationStatus current = s->state;
+    MigrationStatus next;
+
     trace_migrate_fd_error(error_get_pretty(error));
     assert(s->to_dst_file == NULL);
-    migrate_set_state(&s->state, MIGRATION_STATUS_SETUP,
-                      MIGRATION_STATUS_FAILED);
+
+    switch (current) {
+    case MIGRATION_STATUS_SETUP:
+        next = MIGRATION_STATUS_FAILED;
+        break;
+    case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
+        /* Never fail a postcopy migration; switch back to PAUSED instead */
+        next = MIGRATION_STATUS_POSTCOPY_PAUSED;
+        break;
+    default:
+        /*
+         * This really shouldn't happen. Just be careful to not crash a VM
+         * just for this. Instead, dump something.
+         */
+        error_report("%s: Illegal migration status (%s) detected",
+                     __func__, MigrationStatus_str(current));
+        return;
+    }
+
+    migrate_set_state(&s->state, current, next);
     migrate_set_error(s, error);
 }
 
@@ -1573,6 +1597,7 @@ bool migration_in_postcopy(void)
     switch (s->state) {
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_PAUSED:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP:
     case MIGRATION_STATUS_POSTCOPY_RECOVER:
         return true;
     default:
@@ -1965,6 +1990,9 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
             return false;
         }
 
+        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
+                          MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
+
         /* This is a resume, skip init status */
         return true;
     }
@@ -3020,9 +3048,9 @@ static MigThrError postcopy_pause(MigrationState *s)
          * We wait until things fixed up. Then someone will setup the
          * status back for us.
          */
-        while (s->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+        do {
             qemu_sem_wait(&s->postcopy_pause_sem);
-        }
+        } while (postcopy_is_paused(s->state));
 
         if (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
             /* Woken up by a recover procedure. Give it a shot */
@@ -3687,7 +3715,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
 {
     Error *local_err = NULL;
     uint64_t rate_limit;
-    bool resume = s->state == MIGRATION_STATUS_POSTCOPY_PAUSED;
+    bool resume = (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP);
     int ret;
 
     /*
@@ -3754,7 +3782,7 @@ void migrate_fd_connect(MigrationState *s, Error *error_in)
 
     if (resume) {
         /* Wakeup the main migration thread to do the recovery */
-        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_PAUSED,
+        migrate_set_state(&s->state, MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP,
                           MIGRATION_STATUS_POSTCOPY_RECOVER);
         qemu_sem_post(&s->postcopy_pause_sem);
         return;
diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
index ef184d8d08..be10611048 100644
--- a/migration/postcopy-ram.c
+++ b/migration/postcopy-ram.c
@@ -1770,3 +1770,9 @@ void *postcopy_preempt_thread(void *opaque)
 
     return NULL;
 }
+
+bool postcopy_is_paused(MigrationStatus status)
+{
+    return status == MIGRATION_STATUS_POSTCOPY_PAUSED ||
+        status == MIGRATION_STATUS_POSTCOPY_RECOVER_SETUP;
+}
diff --git a/migration/postcopy-ram.h b/migration/postcopy-ram.h
index ecae941211..a6df1b2811 100644
--- a/migration/postcopy-ram.h
+++ b/migration/postcopy-ram.h
@@ -13,6 +13,8 @@
 #ifndef QEMU_POSTCOPY_RAM_H
 #define QEMU_POSTCOPY_RAM_H
 
+#include "qapi/qapi-types-migration.h"
+
 /* Return true if the host supports everything we need to do postcopy-ram */
 bool postcopy_ram_supported_by_host(MigrationIncomingState *mis,
                                     Error **errp);
@@ -193,5 +195,6 @@ enum PostcopyChannels {
 void postcopy_preempt_new_channel(MigrationIncomingState *mis, QEMUFile *file);
 void postcopy_preempt_setup(MigrationState *s);
 int postcopy_preempt_establish_channel(MigrationState *s);
+bool postcopy_is_paused(MigrationStatus status);
 
 #endif
diff --git a/migration/savevm.c b/migration/savevm.c
index 5aa595e365..a0f7a9dceb 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2860,9 +2860,9 @@ static bool postcopy_pause_incoming(MigrationIncomingState *mis)
     error_report("Detected IO failure for postcopy. "
                  "Migration paused.");
 
-    while (mis->state == MIGRATION_STATUS_POSTCOPY_PAUSED) {
+    do {
         qemu_sem_wait(&mis->postcopy_pause_sem_dst);
-    }
+    } while (postcopy_is_paused(mis->state));
 
     trace_postcopy_pause_incoming_continued();
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 8c65b90328..e518563f67 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -150,6 +150,9 @@
 #
 # @postcopy-paused: during postcopy but paused. (since 3.0)
 #
+# @postcopy-recover-setup: setup phase for a postcopy recovery process,
+#     preparing for a recovery phase to start. (since 9.1)
+#
 # @postcopy-recover: trying to recover from a paused postcopy. (since
 #     3.0)
 #
@@ -174,6 +177,7 @@
 { 'enum': 'MigrationStatus',
   'data': [ 'none', 'setup', 'cancelling', 'cancelled',
             'active', 'postcopy-active', 'postcopy-paused',
+            'postcopy-recover-setup',
             'postcopy-recover', 'completed', 'failed', 'colo',
             'pre-switchover', 'device', 'wait-unplug' ] }
 ##
--
2.39.3