Deployment Guidelines
Overview
Data services such as RabbitMQ often have many tunable parameters. Some configurations or practices make a lot of sense for development but are not really suitable for production. No single configuration fits every use case. It is, therefore, important to assess system configuration and have a plan for "day two operations" activities such as upgrades before going into production.
Production systems have concerns that go beyond configuration: system observability, security, application development practices, resource usage, release support timeline, and more.
Monitoring and metrics are the foundation of a production-grade system. Besides helping detect issues, they provide the operator with data that can be used to size and configure both RabbitMQ nodes and applications.
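As a starting point, RabbitMQ can expose metrics by enabling its Prometheus plugin. A minimal sketch of an enabled_plugins file follows; the exact plugin set is an assumption and should be adjusted for your environment:

```erlang
%% enabled_plugins: a single Erlang term listing the plugins to enable,
%% terminated by a period. rabbitmq_prometheus exposes a metrics endpoint
%% (port 15692 by default); rabbitmq_management provides the HTTP API and UI.
[rabbitmq_management,rabbitmq_prometheus].
```

The same result can be achieved at runtime with rabbitmq-plugins enable rabbitmq_prometheus.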
This guide provides recommendations in a few areas:
- Storage considerations for node data directories
- Networking-related recommendations
- Recommendations related to virtual hosts, users and permissions
- Monitoring and resource usage
- Per-virtual host and per-user limits
- Security
- Clustering and multi-node deployments
- Application-level practices and considerations
and more.
Storage Considerations
Use Durable Storage
Modern RabbitMQ 3.x features, most notably quorum queues and streams, are not designed with transient storage in mind.
Data safety features of quorum queues and streams expect node data storage to be durable. Both data structures also assume reasonably stable I/O latency, something that network-attached storage will not always be able to provide in practice.
Quorum queue and stream replicas hosted on restarted nodes that use transient storage will have to perform a full sync of the entire data set from the leader replica. This can result in massive data transfers and network link overload that could have been avoided by using durable storage.
When nodes are restarted, the rest of the cluster expects them to retain the information about their cluster peers. When this is not the case, restarted nodes may be able to rejoin as new nodes but a special peer clean up mechanism would have to be enabled to remove their prior identities.
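When transient storage cannot be avoided and a peer discovery backend is in use, the peer cleanup mechanism mentioned above can be configured in rabbitmq.conf. The values below are illustrative assumptions, not recommendations:

```ini
# Peer discovery node cleanup: periodically check for cluster members
# that are no longer known to the peer discovery backend.
# With only_log_warning = true, unknown nodes are logged but not removed;
# set it to false only after understanding the consequences of automatic removal.
cluster_formation.node_cleanup.interval = 60
cluster_formation.node_cleanup.only_log_warning = true
```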
Transient entities (such as queues) and RAM node support will be removed in RabbitMQ 4.0.
Network-attached Storage (NAS)
Network-attached storage (NAS) can be used for RabbitMQ node data directories, provided that the NAS volume
- offers low I/O latency
- is free of significant latency spikes (for example, due to sharing with other I/O-heavy services)
Quorum queues, streams, and other RabbitMQ features will benefit from fast local SSD and NVMe storage. When possible, prefer local storage to NAS.
Storage Isolation
RabbitMQ nodes must never share their data directories. Ideally, nodes should also not share their disk I/O with other services, for the most predictable latency and throughput.
Choice of a Filesystem
RabbitMQ nodes can use most widely used local filesystems: ext4, btrfs, and so on.
Avoid using distributed filesystems for node data directories:
- RabbitMQ's storage subsystem assumes the standard local filesystem semantics of fsync(2) and other key operations. Distributed filesystems often deviate from these standard guarantees
- Distributed filesystems are usually designed for shared access to a subset of directories. Sharing a data directory between RabbitMQ nodes is an absolute no-no and is guaranteed to result in data corruption, since nodes will not coordinate their writes
Virtual Hosts, Users, Permissions
It is often necessary to seed a cluster with virtual hosts, users, permissions, topologies, policies, and so on. The recommended way of doing this at deployment time is via definition import. Definitions can be imported on node boot or at any point after cluster deployment using rabbitmqadmin or the POST /api/definitions HTTP API endpoint.
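A minimal sketch of a definitions file that seeds one virtual host, one user, and a permission set. The names, patterns, and paths below are illustrative assumptions, not recommendations:

```json
{
  "vhosts": [{"name": "prod"}],
  "users": [{
    "name": "svc-app",
    "password_hash": "GENERATED-HASH",
    "hashing_algorithm": "rabbit_password_hashing_sha256",
    "tags": ""
  }],
  "permissions": [{
    "user": "svc-app",
    "vhost": "prod",
    "configure": "^app\\..*",
    "write": "^app\\..*",
    "read": "^app\\..*"
  }]
}
```

JSON does not allow comments, so a few notes: GENERATED-HASH is a placeholder for a salted hash, which can be produced with rabbitmqctl hash_password; the three permission fields are regular expressions matched against resource names. On modern 3.x releases, such a file can be imported on node boot via rabbitmq.conf settings along the lines of definitions.import_backend = local_filesystem and definitions.local.path = /etc/rabbitmq/definitions.json; older releases use the load_definitions setting instead.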