From 3aa5bb0af1051432a83b2f7a9fd5c2763444c937 Mon Sep 17 00:00:00 2001
From: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Date: Fri, 23 Feb 2024 15:51:46 +0100
Subject: [PATCH 19/41] mdadm: move documentation to folder

Move the documentation text files to the documentation/ directory.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 documentation/external-reshape-design.txt | 280 ++++++++++++++++++++++
 documentation/mdadm.conf-example          |  65 +++++
 documentation/mdmon-design.txt            | 146 +++++++++++
 external-reshape-design.txt               | 280 ----------------------
 mdadm.conf-example                        |  65 -----
 mdmon-design.txt                          | 146 -----------
 6 files changed, 491 insertions(+), 491 deletions(-)
 create mode 100644 documentation/external-reshape-design.txt
 create mode 100644 documentation/mdadm.conf-example
 create mode 100644 documentation/mdmon-design.txt
 delete mode 100644 external-reshape-design.txt
 delete mode 100644 mdadm.conf-example
 delete mode 100644 mdmon-design.txt
diff --git a/documentation/external-reshape-design.txt b/documentation/external-reshape-design.txt
new file mode 100644
index 00000000..e4cf4e16
--- /dev/null
+++ b/documentation/external-reshape-design.txt
@@ -0,0 +1,280 @@
+External Reshape
+
+1 Problem statement
+
+External (third-party metadata) reshape differs from native-metadata
+reshape in three key ways:
+
+1.1 Format specific constraints
+
+In the native case reshape is limited by what is implemented in the
+generic reshape routine (Grow_reshape()) and what is supported by the
+kernel. There are exceptional cases where Grow_reshape() may block
+operations when it knows that the kernel implementation is broken, but
+otherwise the kernel is relied upon to be the final arbiter of what
+reshape operations are supported.
+
+In the external case the kernel, and the generic checks in
+Grow_reshape(), become the super-set of what reshapes are possible. The
+metadata format may not support, or may have yet to implement, a given
+reshape type. The implication for Grow_reshape() is that it must query
+the metadata handler and effect changes in the metadata before the new
+geometry is posted to the kernel. The ->reshape_super method allows
+Grow_reshape() to validate the requested operation and post the metadata
+update.
+
+1.2 Scope of reshape
+
+Native metadata reshape is always performed at the array scope (no
+metadata relationship with sibling arrays on the same disks). External
+reshape, depending on the format, may not allow the number of member
+disks to be changed in a subarray unless the change is simultaneously
+applied to all subarrays in the container. For example, the imsm format
+requires all member disks to be a member of all subarrays, so a 4-disk
+raid5 in a container that also houses a 4-disk raid10 array could not be
+reshaped to 5 disks, as the imsm format does not support a 5-disk raid10
+representation. This requires the ->reshape_super method to check the
+contents of the array and ask the user to run the reshape at container
+scope (if all subarrays are agreeable to the change), or report an
+error in the case where one subarray cannot support the change.
+
+1.3 Monitoring / checkpointing
+
+Reshape, unlike rebuild/resync, requires strict checkpointing to survive
+interrupted reshape operations. For example, when expanding a raid5
+array the first few stripes of the array will be overwritten in a
+destructive manner. When restarting the reshape process we need to know
+the exact location of the last successfully written stripe, and we need
+to restore the data in any partially overwritten stripe. Native
+metadata stores this backup data in the unused portion of spares that
+are being promoted to array members, or in an external backup file
+(located on a non-involved block device).
+
+The kernel is in charge of recording checkpoints of reshape progress,
+but mdadm is delegated the task of managing the backup space, which
+involves:
+1/ Identifying what data will be overwritten in the next unit of reshape
+   operation
+2/ Suspending access to that region so that a snapshot of the data can
+   be transferred to the backup space.
+3/ Allowing the kernel to reshape the saved region and setting the
+   boundary for the next backup.
+
+In the external reshape case we want to preserve this mdadm
+'reshape-manager' arrangement, but have a third actor, mdmon, to
+consider. It is tempting to give the role of managing reshape to mdmon,
+but that is counter to its role as a monitor, and conflicts with the
+existing capabilities and role of mdadm to manage the progress of
+reshape. For clarity the external reshape implementation maintains the
+role of mdmon as a (mostly) passive recorder of raid events, and mdadm
+treats it as it would the kernel in the native reshape case (modulo
+needing to send explicit metadata update messages and checking that
+mdmon took the expected action).
+
+External reshape can use the generic md backup file as a fallback, but in the
+optimal/firmware-compatible case the reshape-manager will use the metadata
+specific areas for managing reshape. The implementation also needs to spawn a
+reshape-manager per subarray when the reshape is being carried out at the
+container level. For these two reasons the ->manage_reshape() method is
+introduced. In addition to the base tasks mentioned above, this method:
+1/ Processes each subarray one at a time in series - where appropriate.
+2/ Uses either generic routines in Grow.c for md-style backup file
+   support, or uses the metadata-format specific location for storing
+   recovery data.
+This aims to avoid a "midlayer mistake"[1] and lets the metadata handler
+optionally take advantage of generic infrastructure in Grow.c.
+
+2 Details for specific reshape requests
+
+There are quite a few moving pieces spread out across md, mdadm, and mdmon for
+the support of external reshape, and there are several different types of
+reshape that need to be comprehended by the implementation. A rundown of
+these details follows.
+
+2.0 General provisions:
+
+Obtain an exclusive open on the container to make sure we are not
+running concurrently with a Create() event.
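+
+A minimal sketch of taking such an exclusive open (the device path
+handling and error reporting here are illustrative, not mdadm's actual
+code):
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <unistd.h>
+
+  int open_container_excl(const char *devname)
+  {
+          /* On Linux, O_EXCL on a block device node fails with
+           * EBUSY while another exclusive opener holds the device,
+           * which is what keeps a concurrent Create() out. */
+          int fd = open(devname, O_RDONLY | O_EXCL);
+
+          if (fd < 0)
+                  perror(devname);
+          return fd;      /* hold this fd for the whole operation */
+  }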
+
+2.1 Freezing sync_action
+
+  Before making any attempt at a reshape we 'freeze' every array in
+  the container to ensure no spare assignment or recovery happens.
+  This involves writing 'frozen' to sync_action and changing the '/'
+  after 'external:' in metadata_version to a '-'. mdmon knows that
+  this means not to perform any management.
+
+  Before doing this we check that all sync_actions are 'idle', which
+  is racy but still useful.
+  Afterwards we check that all member arrays have no spares
+  or partial spares (recovery_start != 'none') which would indicate a
+  race. If they do, we unfreeze again.
+
+  Once this completes we know all the arrays are stable. They may
+  still have failed devices, as devices can fail at any time. However
+  we treat those like failures that happen during the reshape.
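+
+A rough sketch of the freeze of one member array via sysfs (the paths
+and the metadata_version value are illustrative; mdadm's real
+implementation lives in Grow.c and sysfs.c):
+
+  #include <fcntl.h>
+  #include <stdio.h>
+  #include <string.h>
+  #include <unistd.h>
+
+  /* write a value to /sys/block/<dev>/md/<attr> */
+  static int md_sysfs_write(const char *dev, const char *attr,
+                            const char *val)
+  {
+          char path[256];
+          int fd, n;
+
+          snprintf(path, sizeof(path), "/sys/block/%s/md/%s",
+                   dev, attr);
+          fd = open(path, O_WRONLY);
+          if (fd < 0)
+                  return -1;
+          n = write(fd, val, strlen(val));
+          close(fd);
+          return n < 0 ? -1 : 0;
+  }
+
+  int freeze_member(const char *dev)
+  {
+          /* stop any resync/recovery/spare activation ... */
+          if (md_sysfs_write(dev, "sync_action", "frozen") < 0)
+                  return -1;
+          /* ... and flip "external:/..." to "external:-..." so that
+           * mdmon stops managing the array. Reading the current
+           * value and rewriting it is elided; the value below is
+           * just an example. */
+          return md_sysfs_write(dev, "metadata_version",
+                                "external:-md127/0");
+  }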
+
+2.2 Reshape size
+
+  1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+     initializes st->update_tail
+  2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the size change
+     is allowed (being performed at subarray scope / enough room) and prepares a
+     metadata update
+  3/ mdadm::Grow_reshape(): flushes the metadata update (via
+     flush_metadata_update(), or ->sync_metadata())
+  4/ mdadm::Grow_reshape(): posts the new size to the kernel
+
+
+2.3 Reshape level (simple-takeover)
+
+"simple-takeover" implies the level change can be satisfied without touching
+sync_action.
+
+  1/ mdadm::Grow_reshape(): checks if mdmon is running and optionally
+     initializes st->update_tail
+  2/ mdadm::Grow_reshape() calls ->reshape_super() to check that the level change
+     is allowed (being performed at subarray scope) and prepares a
+     metadata update
+  2a/ raid10 --> raid0: degrade all mirror legs prior to calling
+      ->reshape_super
+  3/ mdadm::Grow_reshape(): flushes the metadata update (via
+     flush_metadata_update(), or ->sync_metadata())
+  4/ mdadm::Grow_reshape(): posts the new level to the kernel
+
+2.4 Reshape chunk, layout
+
+2.5 Reshape raid disks (grow)
+
+  1/ mdadm::Grow_reshape(): unconditionally initializes st->update_tail
+     because only redundant raid levels can modify the number of raid disks
+  2/ mdadm::Grow_reshape(): calls ->reshape_super() to check that the level
+     change is allowed (being performed at proper scope / permissible
+     geometry / proper spares available in the container), chooses
+     the spares to use, and prepares a metadata update.
+  3/ mdadm::Grow_reshape(): Converts each subarray in the container to the
+     raid level that can perform the reshape and starts mdmon.
+  4/ mdadm::Grow_reshape(): Pushes the update to mdmon.
+  5/ mdadm::Grow_reshape(): uses container_content to find details of
+     the spares and passes them to the kernel.
+  6/ mdadm::Grow_reshape(): gives the raid_disks update to the kernel,
+     sets sync_max, sync_min, suspend_lo, suspend_hi all to zero,
+     and starts the reshape by writing 'reshape' to sync_action.
+  7/ mdmon::monitor notices the sync_action change and tells
+     managemon to check for new devices. managemon notices the new
+     devices, opens the relevant sysfs files, and passes them all to
+     monitor.
+  8/ mdadm::Grow_reshape() calls ->manage_reshape to oversee the
+     rest of the reshape.
+
+  9/ mdadm::<format>->manage_reshape(): saves data that will be overwritten by
+     the kernel to either the backup file or the metadata specific location,
+     advances sync_max, waits for the reshape, pings mdmon, and repeats.
+     Meanwhile mdmon::read_and_act(): records checkpoints.
+     Specifically (see the schematic loop after this section):
+
+     9a/ if the 'next' stripe to be reshaped will over-write
+         itself during reshape then:
+         9a.1/ increase suspend_hi to cover a suitable number of
+               stripes.
+         9a.2/ backup those stripes safely.
+         9a.3/ advance sync_max to allow those stripes to be backed up
+         9a.4/ when sync_completed indicates that those stripes have
+               been reshaped, manage_reshape must call ping_manager
+         9a.5/ when mdmon notices that sync_completed has been updated,
+               it records the new checkpoint in the metadata
+         9a.6/ after the ping_manager, manage_reshape will increase
+               suspend_lo to allow access to those stripes again
+
+     9b/ if the 'next' stripe to be reshaped will over-write unused
+         space during reshape then we apply the same process as above,
+         except that there is no need to back anything up.
+         Note that we *do* need to keep suspend_hi progressing as
+         it is not safe to write to the area-under-reshape. For
+         kernel-managed-metadata this protection is provided by
+         ->reshape_safe, but that does not protect us in the case
+         of user-space-managed-metadata.
+
+  10/ mdadm::<format>->manage_reshape(): Once the reshape completes,
+      changes the raid level back to the nominal raid level (if necessary)
+
+      FIXME: native metadata does not have the capability to record the original
+      raid level in the reshape-restart case, because the kernel always records
+      the current raid level in the metadata, whereas external metadata can
+      masquerade as an alternate level based on the reshape state.
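+
+The 9a sequence above, reduced to a schematic loop. The sysfs
+attribute names are the real ones, but sysfs_set(), backup_stripes(),
+wait_sync_completed() and the stripe arithmetic are placeholders for
+metadata-specific logic:
+
+  /* provided elsewhere by the metadata handler (placeholders) */
+  void sysfs_set(const char *dev, const char *attr,
+                 unsigned long long val);
+  void backup_stripes(const char *dev, unsigned long long lo,
+                      unsigned long long hi);
+  void wait_sync_completed(const char *dev, unsigned long long hi);
+  void ping_manager(const char *dev);
+
+  /* one iteration of the backup/reshape/checkpoint cycle */
+  void reshape_step(const char *dev, unsigned long long next,
+                    unsigned long long stripe_sectors)
+  {
+          unsigned long long hi = next + stripe_sectors;
+
+          sysfs_set(dev, "suspend_hi", hi);   /* 9a.1: block writers  */
+          backup_stripes(dev, next, hi);      /* 9a.2: save the data  */
+          sysfs_set(dev, "sync_max", hi);     /* 9a.3: let md reshape */
+          wait_sync_completed(dev, hi);       /* 9a.4: wait, then     */
+          ping_manager(dev);                  /* poke mdmon, which    */
+                                              /* checkpoints (9a.5)   */
+          sysfs_set(dev, "suspend_lo", hi);   /* 9a.6: resume writers */
+  }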
+
+2.6 Reshape raid disks (shrink)
+
+3 Interaction with the metadata handler
+
+  The following calls are made into the metadata handler to assist
+  with initiating and monitoring a 'reshape'.
+
+  1/ ->reshape_super is called quite early (after only minimal
+     checks) to make sure that the metadata can record the new shape
+     and any necessary transitions. It may be passed a 'container'
+     or an individual array within a container, and it should notice
+     the difference and act accordingly.
+     When a reshape is requested against a container it is expected
+     that it should be applied to every array in the container,
+     however it is up to the metadata handler to determine final
+     policy.
+
+     If the reshape is supportable, the internal copy of the metadata
+     should be updated, and a metadata update suitable for sending
+     to mdmon should be queued.
+
+     If the reshape will involve converting spares into array members,
+     this must be recorded in the metadata too.
+
+  2/ ->container_content will be called to find out the new state
+     of the array, or of all arrays in the container. Any newly
+     added devices (with state==0 and raid_disk >= 0) will be added
+     to the array as spares with the relevant slot number.
+
+     It is likely that the info returned by ->container_content will
+     have ->reshape_active set, ->reshape_progress set to e.g. 0, and
+     new_* set appropriately. mdadm will use this information to
+     cause the correct reshape to start at an appropriate time.
+
+  3/ ->set_array_state will be called by mdmon when reshape has
+     started and again periodically as it progresses. This should
+     record ->last_checkpoint as the point where reshape has
+     progressed to. When the reshape finishes this will be called
+     again and it should notice that ->curr_action is no longer
+     'reshape' and so should record that the reshape has finished,
+     provided 'last_checkpoint' has progressed suitably.
+
+  4/ ->manage_reshape will be called once the reshape has been set
+     up in the kernel but before sync_max has been moved from 0, so
+     no actual reshape will have happened.
+
+     ->manage_reshape should call progress_reshape() to allow the
+     reshape to progress, and should back up any data as indicated
+     by the return value. See the documentation of that function
+     for more details.
+     ->manage_reshape will be called multiple times when a
+     container is being reshaped, once for each member array in
+     the container.
+
+
+  The progress of the metadata is as follows:
+    1/ mdadm sends a metadata update to mdmon which marks the array
+       as undergoing a reshape. This is set up by
+       ->reshape_super and applied by ->process_update.
+       For container-wide reshape, this happens once for the whole
+       container.
+    2/ mdmon notices progress via the sysfs files and calls
+       ->set_array_state to update the state periodically.
+       For container-wide reshape, this happens repeatedly for
+       one array, then repeatedly for the next, etc.
+    3/ mdmon notices when the reshape has finished and calls
+       ->set_array_state to record that the reshape is complete.
+       For container-wide reshape, this happens once for each
+       member array.
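+
+  Taken together, the handler-facing surface is a small method
+  table. A simplified sketch follows - the names mirror the text
+  above, but the signatures are illustrative only; the real ones
+  are declared in mdadm.h:
+
+    struct supertype;
+    struct mdinfo;
+    struct active_array;
+
+    struct reshape_ops {
+            /* validate the request and queue a metadata update */
+            int (*reshape_super)(struct supertype *st);
+            /* report per-array state: reshape_active,
+             * reshape_progress, the new_* geometry, new spares */
+            struct mdinfo *(*container_content)(struct supertype *st);
+            /* called by mdmon to checkpoint and to mark the
+             * reshape finished */
+            int (*set_array_state)(struct active_array *a,
+                                   int consistent);
+            /* drive the backup/advance loop via progress_reshape() */
+            int (*manage_reshape)(struct supertype *st,
+                                  struct mdinfo *array);
+    };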
+
+
+
+...
+
+[1]: Linux kernel design patterns - part 3, Neil Brown https://lwn.net/Articles/336262/
diff --git a/documentation/mdadm.conf-example b/documentation/mdadm.conf-example
new file mode 100644
index 00000000..35a75d12
--- /dev/null
+++ b/documentation/mdadm.conf-example
@@ -0,0 +1,65 @@
+# mdadm configuration file
+#
+# mdadm will function properly without the use of a configuration file,
+# but this file is useful for keeping track of arrays and member disks.
+# In general, an mdadm.conf file is created, and updated, after arrays
+# are created. This is the opposite behavior of /etc/raidtab, which is
+# created prior to array construction.
+#
+#
+# the config file takes two types of lines:
+#
+#   DEVICE lines specify a list of devices to scan for
+#   potential member disks
+#
+#   ARRAY lines specify information about how to identify arrays so
+#   that they can be activated
+#
+# You can have more than one device line and use wild cards. The first
+# example includes the first partition of the SCSI disks /dev/sdb,
+# /dev/sdc, /dev/sdd, /dev/sdj, /dev/sdk, and /dev/sdl. The second
+# line looks for array slices on IDE disks.
+#
+#DEVICE /dev/sd[bcdjkl]1
+#DEVICE /dev/hda1 /dev/hdb1
+#
+# If you mount devfs on /dev, then a suitable way to list all devices is:
+#DEVICE /dev/discs/*/*
+#
+#
+# The AUTO line can control which arrays get assembled by auto-assembly,
+# meaning either "mdadm -As" when there are no 'ARRAY' lines in this file,
+# or "mdadm --incremental" when the array found is not listed in this file.
+# By default, all arrays that are found are assembled.
+# If you want to ignore all DDF arrays (maybe they are managed by dmraid),
+# only assemble 1.x arrays which are marked for 'this' homehost,
+# but assemble all others, then use
+#AUTO -ddf homehost -1.x +all
+#
+# ARRAY lines specify an array to assemble and a method of identification.
+# Arrays can currently be identified by using a UUID, superblock minor number,
+# or a listing of devices.
+#
+#   super-minor is usually the minor number of the metadevice
+#   UUID is the Universally Unique Identifier for the array
+# Each can be obtained using
+#
+#   mdadm -D <md>
+#
+#ARRAY /dev/md0 UUID=3aaa0122:29827cfa:5331ad66:ca767371
+#ARRAY /dev/md1 super-minor=1
+#ARRAY /dev/md2 devices=/dev/hda1,/dev/hdb1
+#
+# ARRAY lines can also specify a "spare-group" for each array. mdadm --monitor
+# will then move a spare between arrays in a spare-group if one array has a
+# failed drive but no spare.
+#ARRAY /dev/md4 uuid=b23f3c6d:aec43a9f:fd65db85:369432df spare-group=group1
+#ARRAY /dev/md5 uuid=19464854:03f71b1b:e0df2edd:246cc977 spare-group=group1
+#
+# When used in --follow (aka --monitor) mode, mdadm needs a
+# mail address and/or a program. This can be given with "mailaddr"
+# and "program" lines so that monitoring can be started using
+#    mdadm --follow --scan & echo $! > /run/mdadm/mon.pid
+# If the lines are not found, mdadm will exit quietly.
+#MAILADDR root@mydomain.tld
+#PROGRAM /usr/sbin/handle-mdadm-events
diff --git a/documentation/mdmon-design.txt b/documentation/mdmon-design.txt
new file mode 100644
index 00000000..f09184a9
--- /dev/null
+++ b/documentation/mdmon-design.txt
@@ -0,0 +1,146 @@
+
+When managing a RAID1 array which uses metadata other than the
+"native" metadata understood by the kernel, mdadm makes use of a
+partner program named 'mdmon' to manage some aspects of updating
+that metadata and synchronising the metadata with the array state.
+
+This document provides some details on how mdmon works.
+
+Containers
+----------
+
+As background: mdadm makes a distinction between an 'array' and a
+'container'. Other sources sometimes use the term 'volume' or
+'device' for an 'array', and may use the term 'array' for a
+'container'.
+
+For our purposes:
+ - a 'container' is a collection of devices which are described by a
+   single set of metadata. The metadata may be stored equally
+   on all devices, or different devices may have quite different
+   subsets of the total metadata. But there is conceptually one set
+   of metadata that unifies the devices.
+
+ - an 'array' is a set of data blocks from various devices which
+   together are used to present the abstraction of a single linear
+   sequence of blocks, which may provide data redundancy or enhanced
+   performance.
+
+So a container has some metadata and provides a number of arrays which
+are described by that metadata.
+
+Sometimes this model doesn't work perfectly. For example, global
+spares may have their own metadata which is quite different from the
+metadata from any device that participates in one or more arrays.
+Such a global spare might still need to belong to some container so
+that it is available to be used should a failure arise. In that case
+we consider the 'metadata' to be the union of the metadata on the
+active devices, which describes the arrays, and the metadata on the
+global spares, which only describes the spares. In this case different
+devices in the one container will have quite different metadata.
+
+
+Purpose
+-------
+
+The main purpose of mdmon is to update the metadata in response to
+changes to the array which need to be reflected in the metadata before
+future writes to the array can safely be performed.
+These include:
+ - transitions from 'clean' to 'dirty'.
+ - recording that devices have failed.
+ - recording the progress of a 'reshape'
+
+This requires mdmon to be running at any time that the array is
+writable (a read-only array does not require mdmon to be running).
+
+Because mdmon must be able to process these metadata updates at any
+time, it must (when running) have exclusive write access to the
+metadata. Any other changes (e.g. reconfiguration of the array) must
+go through mdmon.
+
+A secondary role for mdmon is to activate spares when a device fails.
+This role is much less time-critical than the other metadata updates,
+so it could be performed by a separate process, possibly
+"mdadm --monitor", which has a related role of moving devices between
+arrays. A main reason for including this functionality in mdmon is
+that in the native-metadata case this function is handled in the
+kernel, and mdmon's reason for existence is to provide functionality
+which is otherwise handled by the kernel.
+
+
+Design overview
+---------------
+
+mdmon is structured as two threads with a common address space and
+common data structures. These threads are known as the 'monitor' and
+the 'manager'.
+
+The 'monitor' has the primary role of monitoring the array for
+important state changes and updating the metadata accordingly. As
+writes to the array can be blocked until 'monitor' completes and
+acknowledges the update, it must be very careful not to block itself.
+In particular it must not block waiting for any write to complete, else
+it could deadlock. This means that it must not allocate memory, as
+doing this can require dirty memory to be written out and, if the
+system chooses to write to the array that mdmon is monitoring, the
+memory allocation could deadlock.
+
+So 'monitor' must never allocate memory and must limit the number of
+other system calls it performs. It may:
+ - use select (or poll) to wait for activity on a file descriptor
+ - read from a sysfs file descriptor
+ - write to a sysfs file descriptor
+ - write the metadata out to the block devices using O_DIRECT
+ - send a signal (kill) to the manager thread
+
+It must not e.g. open files or do anything similar that might allocate
+resources.
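+
+A caricature of the monitor loop under these rules - nothing below
+allocates, and it only uses select() and read()/write() on a
+descriptor the manager opened earlier (the attribute choice and
+single-fd structure are simplified for illustration):
+
+  #include <string.h>
+  #include <sys/select.h>
+  #include <unistd.h>
+
+  void monitor_loop(int state_fd /* O_RDWR fd on md/array_state */)
+  {
+          char buf[32];
+          fd_set efds;
+
+          for (;;) {
+                  FD_ZERO(&efds);
+                  FD_SET(state_fd, &efds);
+                  /* sysfs reports attribute changes as "exceptional"
+                   * events, so watch the except set */
+                  if (select(state_fd + 1, NULL, NULL, &efds, NULL) < 0)
+                          continue;
+                  lseek(state_fd, 0, SEEK_SET);
+                  memset(buf, 0, sizeof(buf));
+                  if (read(state_fd, buf, sizeof(buf) - 1) <= 0)
+                          continue;
+                  /* a clean->dirty request: record 'dirty' in the
+                   * external metadata first (O_DIRECT write, not
+                   * shown), then let the writes proceed */
+                  if (strncmp(buf, "write-pending", 13) == 0)
+                          write(state_fd, "active", 6);
+          }
+  }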
+
+The 'manager' thread does everything else that is needed. If any
+files are to be opened (e.g. because a device has been added to the
+array), the manager does that. If any memory needs to be allocated
+(e.g. to hold data about a new array, as can happen when one set of
+metadata describes several arrays), the manager performs that
+allocation.
+
+The 'manager' is also responsible for communicating with mdadm and
+assigning spares to replace failed devices.
+
+
+Handling metadata updates
+-------------------------
+
+There are a number of cases in which mdadm needs to update the
+metadata which mdmon is managing. These include:
+ - creating a new array in an active container
+ - adding a device to a container
+ - reconfiguring an array
+etc.
+
+To complete these updates, mdadm must send a message to mdmon which
+will merge the update into the metadata as it is at that moment.
+
+To achieve this, mdmon creates a Unix Domain Socket which the manager
+thread listens on. mdadm sends a message over this socket. The
+manager thread examines the message to see if it will require
+allocating any memory, and allocates it. This is done in the
+'prepare_update' metadata method.
+
+The update message is then queued for handling by the monitor thread,
+which it will do when convenient. The monitor thread calls
+->process_update, which should atomically make the required changes to
+the metadata, making use of the pre-allocated memory as required. Any
+memory that is no longer needed can be placed back in the request and
+the manager thread will free it.
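+
+On the mdadm side, sending an update then reduces to connecting to
+that socket, writing the message, and waiting for the acknowledgement.
+The length-prefix framing below is invented for the sketch, and the
+socket path is taken as a parameter; the real protocol is implemented
+in msg.c:
+
+  #include <string.h>
+  #include <sys/socket.h>
+  #include <sys/un.h>
+  #include <unistd.h>
+
+  int send_update(const char *sock_path, const void *buf, int len)
+  {
+          struct sockaddr_un addr = { .sun_family = AF_UNIX };
+          char ack;
+          int fd = socket(AF_UNIX, SOCK_STREAM, 0);
+
+          if (fd < 0)
+                  return -1;
+          strncpy(addr.sun_path, sock_path, sizeof(addr.sun_path) - 1);
+          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
+              write(fd, &len, sizeof(len)) != sizeof(len) ||
+              write(fd, buf, len) != len ||
+              read(fd, &ack, 1) != 1) {  /* block until mdmon acks */
+                  close(fd);
+                  return -1;
+          }
+          close(fd);
+          return 0;
+  }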
+
+The exact format of a metadata update is up to the implementer of the
+metadata handlers. It will simply describe a change that needs to be
+made. It will sometimes contain fragments of the metadata to be
+copied into place. However the ->process_update routine must make
+sure not to over-write any field that the monitor thread might have
+updated, such as a 'device failed' or 'array is dirty' state.
+
+When the monitor thread has completed the update and written it to the
+devices, an acknowledgement message is sent back over the socket so
+that mdadm knows it is complete.
--
2.40.1