Day 2 operations for Ceph with Juju storage on MAAS

szeestraten · 5 December 2019 20:17

Hey folks, I’m considering testing out Juju storage on MAAS for our next Ceph deployment and I was looking for some feedback.

Being able to define a storage pool in a bundle instead of device names sounds nice and clean, but are there any day 2 operation issues or drawbacks to consider? As the documentation mentions, storage for MAAS is static, so how do you go about replacing a disk/OSD?

erik-lonroth · 5 December 2019 21:53

@lasse @hallback pinged

jamesbeedy · 7 December 2019 18:32

For me, this is where I have always accepted the fact that maas becomes out of sync with the state of the deployed node (you can change the physical configuration of the node, but not the node state in maas).

When we have needed to replace a disk in the past we have resorted to running the charm actions in the ceph-osd charm.

szeestraten · 7 December 2019 19:44

Do you use juju remove-storage or juju detach-storage to clean up the entries in Juju storage or do you only use the charm actions?
Could you perhaps share which charm actions you are using?

afreiberger · 9 December 2019 15:30

@szeestraten

As a member of Canonical’s Bootstack team, which provides day 2 managed service operations for our cloud products, I have quite a bit of experience with replacing failed disk devices provided to the ceph-osd charm.

While I am not familiar with juju storage beyond the documentation which can be found here, the basic principles of the underlying technology are going to be the same between juju, maas, and the ceph charms.

While the MAAS storage provider can provide you pools of storage to your ceph-osd units based on storage tagging, it does not provide “cloudy” type replacement of failing OSD storage devices such as you may find in the AWS EBS storage provider for juju.

Juju’s basic operating model is such that if a unit fails, you should be able (in a high-availability application configuration) to run juju remove-unit <app-name/unitX> and juju add-unit to replace the failed component. If you only have one or two OSD disks installed in each metal server, this would likely be the ideal replacement scenario.

However, in practice, most ceph-osd nodes have between 3 and 20 disks per node (depending on packing factors, performance requirements, etc) and evacuating (ungracefully) the storage from the entire unit to replace one drive is not ideal.

In this situation, you must turn to the Ceph documentation along with the ceph-mon and ceph-osd charm actions to put together a process for replacing the single failed disk on the running metal.

The high level process is described in one of the bugs filed regarding the entire single-disk-replacement process not being entirely action-based.

For reference: Bug #1813360 “[wishlist] Action to 'purge-osd' and 'set-osd-out'...” : Bugs : OpenStack ceph-mon charm

You will basically be performing ceph operations for the most part via ssh to ceph-mon/$leader and the ceph-osd unit using a couple of actions along the way.

Here is the ceph documentation for the process of cleanly removing the failing disk:

Once those bits are completed, and you have replaced the drive and performed any low-level tasks to provide the disk to the running OS (raid configuration, disk probing, partitioning and adding bcache if you so choose), you can use the add-disk action on the ceph-osd unit to have the disk added back into ceph.

That being said, the add-disk function may come with some limitations if you’re using advanced/multiple ceph pool definitions.

If you are using bcache, there is also no way to re-add the bcache caching device to the replaced disk via charmed actions. This is noted in Bug #1813359 “[wishlist] Action to “bcachify-disk” needed for en...” : Bugs : OpenStack ceph-osd charm and the manual process is outlined as well.
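As a rough illustration of the single-disk flow described above, here is a minimal dry-run sketch in Python that only prints the commands in order. It is not the exact procedure from the linked bug: the Ceph steps follow the upstream manual OSD removal commands (ceph osd purge needs Luminous or later), the function name, unit names, OSD id and device path are hypothetical, and the juju run-action syntax assumes a Juju 2.x client.

```python
from typing import List


def osd_replacement_commands(
    osd_id: int, osd_unit: str, mon_unit: str, new_device: str
) -> List[str]:
    """Return, in order, the commands for swapping one failed OSD disk.

    Dry run only: nothing is executed. All arguments are illustrative.
    """
    return [
        # Mark the OSD out so Ceph re-replicates its data elsewhere.
        # Wait for the cluster to settle before moving on.
        f"juju ssh {mon_unit} sudo ceph osd out {osd_id}",
        # Stop the OSD daemon on the host that owns the failed disk.
        f"juju ssh {osd_unit} sudo systemctl stop ceph-osd@{osd_id}",
        # Remove the OSD from the CRUSH map, auth keys and OSD map.
        f"juju ssh {mon_unit} sudo ceph osd purge {osd_id} --yes-i-really-mean-it",
        # After the physical swap and any RAID/bcache preparation,
        # hand the new device back to the charm (Juju 2.x syntax assumed).
        f"juju run-action --wait {osd_unit} add-disk osd-devices={new_device}",
    ]


if __name__ == "__main__":
    for command in osd_replacement_commands(
        7, "ceph-osd/2", "ceph-mon/0", "/dev/disk/by-dname/bcache3"
    ):
        print(command)
```

Printing the commands rather than running them keeps the destructive purge step behind a human review, which seems prudent given that the bcache step mentioned above still has to be done by hand between the purge and add-disk.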

If you might be interested in our managed cloud services to cover your day 2 operations of this environment, please feel free to click the “Talk to Us” button at this link:
Managed OpenStack on Ubuntu | OpenStack | Ubuntu.

Best of luck,
-Drew

szeestraten · 9 December 2019 19:53

Thank you for your extensive answer @afreiberger! It is great to get some input from you and your team.

We have been running clusters for a couple of years, so the charm and Ceph parts of the maintenance and replacement procedures are clear to us. We don’t mind recommissioning and redeploying where it makes sense, but not for individual disk replacements.

May I ask whether you primarily use Juju storage or the regular device list in the options, and what your reasons are for that choice?

Regarding the Juju storage use case:

Does replacing a disk affect the entries in juju storage in any way?
How does adding additional disks or WAL/DB devices work? Is the add-disk action the only option for disks?
Are there any gotchas with defining WAL/DB devices?

afreiberger · 9 December 2019 22:40

Currently, all Bootstack managed clouds follow the Foundation Cloud architecture, which uses the osd-devices option to point to /dev/disk/by-dname/bcacheX, where bcacheX is the name given to the cache-backed device configured in MAAS. These by-dname paths are managed by udev rules created by curtin from the MAAS configs set in the UI or via CLI/API. I believe the reason we use osd-devices is that our internal tooling had already solved storage templating within MAAS for the singular use case of building an OpenStack cloud before juju storage was an available feature usable by many modeled applications.
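To make the osd-devices approach concrete, here is a small Python sketch that builds the juju config command for a set of by-dname devices. The helper name and the bcache names are hypothetical; it assumes the ceph-osd charm's osd-devices option takes a space-separated string of device paths and that the application is deployed under the name ceph-osd.

```python
import shlex
from typing import Iterable


def osd_devices_config_command(
    dnames: Iterable[str], application: str = "ceph-osd"
) -> str:
    """Build a `juju config` command setting osd-devices to by-dname paths."""
    # MAAS/curtin expose each named device under /dev/disk/by-dname/<name>.
    paths = [f"/dev/disk/by-dname/{name}" for name in dnames]
    # The option value is one space-separated string, so quote it for the shell.
    value = " ".join(paths)
    return f"juju config {application} osd-devices={shlex.quote(value)}"


if __name__ == "__main__":
    print(osd_devices_config_command(["bcache0", "bcache1"]))
```

Because the paths are stable names rather than /dev/sdX entries, the same value can be reused across hosts that share a MAAS storage layout.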

My personal opinion is that juju storage is a perfect solution for providing easy to manage storage to applications that need persistent storage on ephemeral cloud infrastructure (think mysql on AWS or postgres on openstack) to provide a single interface/API that is cloud agnostic (meaning you don’t have to learn both cinder and EBS and know which cloud your application is running in). I don’t know that I’d want another layer of obfuscation between MAAS and Ceph when I’m looking for a bespoke configuration on metal, but that’s merely borne out of my own unfamiliarity with the juju storage tooling as of today. I absolutely grant that being able to tag storage devices and use them based on qualities of the drive (ssd, hdd, size, etc) would allow for a lot less per-host configuration time spent in the MAAS UI, and for that reason, I see juju storage as an absolute win in MAAS clouds as well.

I look forward to more out of this discussion.