Scaling P2P messaging to 1M msg/sec with persistent MQTT clients
Point-to-point (P2P) communication is one of the core MQTT patterns, enabling devices to exchange messages directly in a one-to-one manner. This pattern is especially relevant for IoT scenarios requiring reliable, targeted messaging — private messaging, command delivery, and other direct interaction use cases.
This test evaluates how well TBMQ performs with persistent DEVICE clients in a P2P scenario. Persistent DEVICE clients are well-suited for P2P messaging because they use a shared Kafka topic, reducing load on Kafka, while using Redis to ensure reliable delivery even when a client is temporarily offline. The tests scaled infrastructure while progressively increasing load from 200,000 to 1,000,000 messages per second, demonstrating TBMQ’s scalability and consistent performance.
Test methodology
Five tests measured performance, efficiency, and latency, with a maximum throughput of 1M msg/sec. Throughput refers to the total number of messages per second, including both incoming and outgoing. The environment was deployed on an AWS EKS cluster and scaled horizontally as workload increased.
Each test ran for 10 minutes, using an equal number of publishers and subscribers, both operating with QoS 1. Subscribers used clean_session=false so messages were retained during offline periods and delivered on reconnect. Publishers sent 62-byte messages to unique topics europe/ua/kyiv/$number once per second; subscribers subscribed to europe/ua/kyiv/$number/+, where $number is the unique identifier for each publisher–subscriber pair.
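For illustration, here is a minimal sketch of one such pair using the Eclipse Paho Java client. This is not the actual test-agent code: the broker URL and client IDs are assumptions, and the trailing /data level on the publish topic is a hypothetical sub-key added so that the + wildcard filter above actually matches.

```java
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class P2pPairSketch {
    public static void main(String[] args) throws MqttException, InterruptedException {
        String broker = "tcp://localhost:1883"; // assumption: broker URL
        int number = 1;                         // the pair's unique $number

        // Subscriber: persistent session (clean_session=false), so the broker
        // queues QoS 1 messages while the client is offline and delivers them
        // on reconnect.
        MqttClient subscriber = new MqttClient(broker, "sub-" + number, new MemoryPersistence());
        MqttConnectOptions subOpts = new MqttConnectOptions();
        subOpts.setCleanSession(false);
        subscriber.connect(subOpts);
        subscriber.subscribe("europe/ua/kyiv/" + number + "/+", 1,
                (topic, message) -> System.out.println("received on " + topic));

        // Publisher: one 62-byte QoS 1 message per second.
        MqttClient publisher = new MqttClient(broker, "pub-" + number, new MemoryPersistence());
        publisher.connect();
        byte[] payload = new byte[62];
        while (true) {
            // "/data" is a hypothetical extra level so the "+" filter matches
            publisher.publish("europe/ua/kyiv/" + number + "/data", payload, 1, false);
            Thread.sleep(1000);
        }
    }
}
```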
Test agent setup
The test agent simulated publishers and subscribers. It consisted of a cluster of performance test pods (runners) and an orchestrator pod. Two runner types — publishers and subscribers — were deployed with equal pod counts. The table below shows pods per instance and instances deployed for one runner type across throughput levels.
| Throughput (msg/sec) | Pods/Instance | EC2 instances |
|---|---|---|
| 200k | 5 | 1 |
| 400k | 5 | 1 |
| 600k | 10 | 1 |
| 800k | 5 | 2 |
| 1M | 5 | 4 |
The orchestrator ran on a separate EC2 instance that also hosted the Redpanda Console (a Kafka UI) and RedisInsight for monitoring and coordination. This configuration enabled the test agent to adapt to rising traffic demands while addressing infrastructure constraints, such as port limitations.
Infrastructure overview
Hardware specifications:
| Service | TBMQ | Kafka | Redis | AWS RDS (PostgreSQL) |
|---|---|---|---|---|
| Instance type | c7a.4xlarge | c7a.large | c7a.large | db.m6i.large |
| vCPU | 16 | 2 | 2 | 2 |
| Memory (GiB) | 32 | 4 | 4 | 8 |
| Storage (GiB) | 20 | 30 | 8 | 20 |
| Network (Gbps) | 12.5 | 12.5 | 12.5 | 12.5 |
Redis master nodes without replicas were used to reduce costs during load testing.
EC2 instance and pod distribution for the final 1M msg/sec test is shown in the deployment diagram.
Performance tests
Tests ran in phases from 200k to 1M msg/sec, incrementing by 200k. TBMQ brokers and Redis nodes were scaled at each phase; for the 1M test, Kafka nodes were also scaled.
| Throughput (msg/sec) | Publishers / Subscribers (each) | TBMQ nodes | Redis nodes | Kafka nodes |
|---|---|---|---|---|
| 200k | 100k | 1 | 3 | 3 |
| 400k | 200k | 2 | 5 | 3 |
| 600k | 300k | 3 | 7 | 3 |
| 800k | 400k | 4 | 9 | 3 |
| 1M | 500k | 5 | 11 | 5 |
Key takeaways:
- Scalability: TBMQ demonstrated linear scalability by incrementally adding nodes at higher workloads.
- Efficient resource utilization: CPU utilization on TBMQ nodes stayed consistently around 90% across all phases.
- Latency: End-to-end latency remained within double-digit milliseconds — predictable given that QoS 1 was applied to both publishers and persistent subscribers. Publisher ACK latency stayed within single-digit milliseconds across all phases.
- Throughput efficiency: ~8,900 msg/sec per CPU core, calculated by dividing total throughput by total CPU cores.
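As a cross-check against the tables above: 5 TBMQ nodes × 16 vCPU, 5 Kafka nodes × 2 vCPU, and 11 Redis nodes × 2 vCPU give 112 cores in total, and 1,000,000 msg/sec ÷ 112 ≈ 8,930 msg/sec per core, which appears to be how this figure was derived (the RDS instance seemingly excluded).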
Final 1M msg/sec test results:
| QoS | Msg latency avg | Pub ACK avg | TBMQ CPU avg | Payload |
|---|---|---|---|---|
| 1 | ~75 ms | ~8 ms | 91% | 62 bytes |
Where:
- TBMQ CPU avg: average CPU utilization across all TBMQ nodes
- Msg latency avg: average time from publisher transmission to subscriber receipt
- Pub ACK avg: average time from publisher transmission to PUBACK receipt
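These two latencies can be captured in a straightforward way. The sketch below (Paho-based, not necessarily how the actual testing tool measures them) embeds the send timestamp in the first 8 bytes of the 62-byte payload and times the blocking QoS 1 publish call:

```java
import java.nio.ByteBuffer;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;

public class LatencyProbe {

    /** Publisher side: embeds the send time, returns the PUBACK round-trip in ms. */
    static long timedPublish(MqttClient client, String topic) throws MqttException {
        byte[] payload = new byte[62];                   // 62-byte payload as in the test
        ByteBuffer.wrap(payload).putLong(System.currentTimeMillis());
        long start = System.nanoTime();
        client.publish(topic, payload, 1, false);        // synchronous QoS 1: returns after PUBACK
        return (System.nanoTime() - start) / 1_000_000;  // "Pub ACK" latency
    }

    /** Subscriber side, called from the message callback: end-to-end "Msg latency" in ms. */
    static long endToEndLatency(byte[] payload) {
        long sentAt = ByteBuffer.wrap(payload).getLong(); // timestamp written by the publisher
        return System.currentTimeMillis() - sentAt;
    }
}
```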
How to repeat the test
Refer to the AWS cluster installation guide for step-by-step instructions on deploying TBMQ on AWS.
- TBMQ branch with test scripts and parameters
- Performance testing tool for P2P scenarios
- P2P test configuration README
For the P2P scenario, the testing tool was improved to autogenerate publisher and subscriber configurations instead of loading them from a JSON configuration file.
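A minimal sketch of what such autogeneration might look like; the Pair record and naming scheme are illustrative stand-ins, not the tool's actual types:

```java
import java.util.List;
import java.util.stream.IntStream;

public class P2pConfigGenerator {

    /** Illustrative stand-in for one generated publisher/subscriber pairing. */
    record Pair(String publisherClientId, String publishTopic,
                String subscriberClientId, String topicFilter) {}

    /** Generates one config per pair instead of reading them from a JSON file. */
    static List<Pair> generate(int pairCount) {
        return IntStream.range(0, pairCount)
                .mapToObj(n -> new Pair(
                        "pub-" + n, "europe/ua/kyiv/" + n,
                        "sub-" + n, "europe/ua/kyiv/" + n + "/+"))
                .toList();
    }
}
```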
Migrating from Jedis to Lettuce
One significant challenge during testing was the Jedis library's synchronous nature, which became a bottleneck under high throughput. Each Redis command was sent and processed sequentially — the system had to wait for each command to complete before issuing the next, limiting Redis's ability to handle concurrent operations.
The team migrated to the Lettuce client, which uses Netty for asynchronous communication. Unlike Jedis, Lettuce allows multiple commands to be sent in parallel without waiting for completion, enabling non-blocking operations and better resource utilization. This made it possible to fully utilize Redis’s performance capabilities, especially under high message loads. The migration required rewriting a substantial portion of the codebase to transition from synchronous to asynchronous workflows, including restructuring how Redis commands were issued and handled. Careful planning and rigorous testing ensured that these changes maintained system reliability and correctness.
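The difference in a nutshell, as a condensed sketch of the Lettuce pattern (key names and the localhost URI are placeholders, not TBMQ's actual persistence code):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import java.util.ArrayList;
import java.util.List;

public class LettuceSketch {
    public static void main(String[] args) {
        RedisClient client = RedisClient.create("redis://localhost:6379");
        try (StatefulRedisConnection<String, String> conn = client.connect()) {
            RedisAsyncCommands<String, String> async = conn.async();

            // With Jedis, each SET would block until its reply arrived.
            // With Lettuce, commands are pipelined over Netty: all 1000
            // writes go out without waiting for individual replies.
            List<RedisFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                pending.add(async.set("msg:" + i, "payload-" + i));
            }
            // Completion is observed asynchronously, per command or in bulk.
            pending.forEach(f -> f.thenAccept(status -> { /* handle OK */ }));
        } finally {
            client.shutdown();
        }
    }
}
```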
Future optimizations
TBMQ currently uses Lua scripts in Redis to process messages for DEVICE persistent clients, ensuring atomic operations for saving, updating, and deleting messages — crucial for maintaining data consistency. One script is executed per client to comply with Redis Cluster constraints (all keys accessed in a script must reside in the same hash slot).
A potential optimization is adjusting the hashing mechanism for client IDs to intentionally group more clients into the same Redis hash slots, increasing the likelihood of batching operations into a single Lua script execution per hash slot and allowing the scripts to handle multiple clients simultaneously. This would reduce overhead and improve Redis efficiency while adhering to cluster constraints.
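Redis Cluster hashes only the substring enclosed in {…} (the hash tag) when assigning a key to a slot, so such grouping could look roughly like the sketch below; the bucket count and key layout are illustrative assumptions:

```java
public class SlotGrouping {

    /** Current scheme (conceptually): one key, and one Lua call, per client. */
    static String perClientKey(String clientId) {
        return "client:" + clientId + ":messages";
    }

    /** Possible optimization: clients that share a bucket share a hash slot,
     *  because Redis Cluster hashes only the {...} hash tag. One Lua script
     *  execution per slot could then serve the whole group. */
    static String groupedKey(String clientId) {
        int bucket = Math.floorMod(clientId.hashCode(), 1024); // assumed bucket count
        return "{bucket:" + bucket + "}:client:" + clientId + ":messages";
    }
}
```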
Conclusion
Across all five tests, TBMQ demonstrated linear scalability and efficient resource utilization. As workload increased from 200k to 1M msg/sec, the system maintained reliable message delivery and sufficiently low latency for point-to-point messaging scenarios.
These capabilities make TBMQ well-suited for IoT use cases requiring high-performance one-to-one communication with guaranteed delivery. Follow the project on GitHub to stay updated on further improvements.