Reference Documentation

Design docs, concept definitions, and references for APIs and CLIs.

Thingsboard Data Collection Performance

One of the key features of the Thingsboard open-source IoT platform is data collection, and it is a crucial feature that must work reliably under high load. In this article we describe the steps and improvements we made to ensure that a single Thingsboard server instance can constantly handle 20,000+ devices and 30,000+ MQTT publish messages per second, which adds up to roughly 2 million published messages per minute.

Architecture

Thingsboard performance leverages three main projects:

- Netty for a high-performance MQTT server/broker for IoT devices;
- Akka for a high-performance actor system to coordinate messages between devices;
- Cassandra for a scalable, high-performance NoSQL database to store timeseries data.

We also use Zookeeper for coordination and gRPC in cluster mode. See the platform architecture for more details.

Data flow and test tools

IoT devices connect to the Thingsboard server via MQTT and issue "publish" commands with a JSON payload. The size of a single publish message is approximately 100 bytes. MQTT is a lightweight publish/subscribe messaging protocol that offers a number of advantages over the HTTP request/response protocol.
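To make the "approximately 100 bytes" figure concrete, here is a minimal sketch that measures the UTF-8 size of a hypothetical telemetry payload. The field names and values are our own illustration, not the actual device payload format.

```java
import java.nio.charset.StandardCharsets;

public class PayloadSize {
    // Hypothetical telemetry payload; real device payloads will differ.
    static final String SAMPLE =
        "{\"deviceId\":\"sensor-0001\",\"temperature\":22.5,\"humidity\":57.1,\"pressure\":1013.25,\"battery\":86}";

    // UTF-8 byte length is what actually travels in the MQTT publish packet body.
    static int sizeInBytes(String json) {
        return json.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("payload bytes: " + sizeInBytes(SAMPLE)); // prints: payload bytes: 93
    }
}
```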

image

The Thingsboard server processes MQTT publish messages and stores them to Cassandra asynchronously. The server may also push data to websocket subscriptions from the Web UI dashboards (if present). We try to avoid any blocking operations, and this is critical for overall system performance. Thingsboard supports MQTT QoS level 1, which means that the client receives a response to the publish message only after the data is stored to the Cassandra DB. Duplicates, which are possible with QoS level 1, are simply overwrites of the corresponding Cassandra row and thus are not present in the persisted data. This functionality provides reliable data delivery and persistence.
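The flow above can be sketched with JDK futures: a non-blocking save, an acknowledgment sent only after the write completes, and duplicate QoS 1 deliveries collapsing into a single row. This is a simplified in-memory stand-in for the real Cassandra-backed pipeline; the names `saveAsync` and `handlePublish` are ours.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class Qos1Pipeline {
    // (deviceId, ts) -> payload; QoS 1 redeliveries overwrite the same "row".
    static final Map<String, String> store = new ConcurrentHashMap<>();

    // Non-blocking "save": runs on the common pool, never blocks the MQTT handler thread.
    static CompletableFuture<Void> saveAsync(String deviceId, long ts, String payload) {
        return CompletableFuture.runAsync(() -> store.put(deviceId + "/" + ts, payload));
    }

    // The PUBACK is produced only after the write completes, mimicking QoS 1 semantics.
    static CompletableFuture<String> handlePublish(String deviceId, long ts, String payload) {
        return saveAsync(deviceId, ts, payload).thenApply(v -> "PUBACK");
    }

    public static void main(String[] args) {
        handlePublish("sensor-1", 1000L, "{\"t\":21.9}").join();
        handlePublish("sensor-1", 1000L, "{\"t\":21.9}").join(); // duplicate delivery: overwrite
        System.out.println(store.size()); // prints 1
    }
}
```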

We used the Gatling load-testing framework, which is also based on Akka and Netty. Gatling is able to simulate 10K MQTT clients using 5-10% of a 2-core CPU. See our separate article about how we improved the unofficial Gatling MQTT plugin to support our use case.

Performance improvement steps

Step 1. Asynchronous Cassandra Driver API

The results of the first performance tests on a modern 4-core laptop with an SSD were quite poor. The platform was able to process only 200 messages per second. The root cause and main performance bottleneck were quite obvious and easy to find: the processing was not 100% asynchronous, and we were executing a blocking API call of the Cassandra driver inside the Telemetry plugin actor. A quick refactoring of the plugin implementation resulted in a more than 10X performance improvement, and we received approximately 2,500 published messages per second from 1,000 devices. We would also recommend this article about async queries to Cassandra.
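The effect of removing the blocking call from the message-handling path can be illustrated with a small stdlib-only benchmark. The actual fix used the Cassandra driver's async API; here a 20 ms sleep stands in for a database write, and the method names are our own.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncVsBlocking {
    // Simulated database write with fixed latency (stands in for a Cassandra insert).
    static void blockingSave() {
        try { TimeUnit.MILLISECONDS.sleep(20); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Returns {blockedMs, asyncMs} for n writes.
    static long[] compare(int n) {
        // Anti-pattern: the handler thread performs every save inline, so total time is ~n * 20 ms.
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) blockingSave();
        long blockedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);

        // Fix: dispatch saves to a pool and only wait on the combined future; writes overlap.
        ExecutorService pool = Executors.newFixedThreadPool(10);
        long t1 = System.nanoTime();
        CompletableFuture<?>[] fs = new CompletableFuture<?>[n];
        for (int i = 0; i < n; i++) fs[i] = CompletableFuture.runAsync(AsyncVsBlocking::blockingSave, pool);
        CompletableFuture.allOf(fs).join();
        long asyncMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t1);
        pool.shutdown();
        return new long[] {blockedMs, asyncMs};
    }

    public static void main(String[] args) {
        long[] r = compare(50);
        System.out.println("blocking: " + r[0] + " ms, async: " + r[1] + " ms");
    }
}
```

With 50 writes and a pool of 10, the serialized path takes roughly a second while the overlapped path finishes in a fraction of that, which is the same order-of-magnitude gain the refactoring produced.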

Step 2. Connection pooling

We decided to move to AWS EC2 instances so that we could share both the results and the tests we executed. We started running tests on a c4.xlarge instance (4 vCPUs and 7.5 GB of RAM) with the Cassandra and Thingsboard services co-located.

image

Test specification:

The first test results were obviously unacceptable:

image

The huge response times above were caused by the fact that the server was simply not able to process 10K messages per second, so they were getting queued.

We started our investigation by monitoring memory and CPU load on the testing instance. Our initial guess was that the poor performance was caused by heavy load on the CPU or RAM. In fact, during load testing we saw that at particular moments the CPU was idle for a couple of seconds. This 'pause' event was happening every 3-7 seconds; see the chart below.

image

As a next step, we decided to take thread dumps during these pauses. We expected to see blocked threads that could give us a clue about what was happening during the pauses. So we opened one console to monitor CPU load and another to take thread dumps during the stress tests, using the following command:


kill -3 THINGSBOARD_PID

We identified that during each pause there was always one thread in the TIMED_WAITING state, and the root cause was the awaitAvailableConnection method of the Cassandra driver:

java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0000000092d9d390> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
at com.datastax.driver.core.HostConnectionPool.awaitAvailableConnection(HostConnectionPool.java:287)
at com.datastax.driver.core.HostConnectionPool.waitForConnection(HostConnectionPool.java:328)
at com.datastax.driver.core.HostConnectionPool.borrowConnection(HostConnectionPool.java:251)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.query(RequestHandler.java:301)
at com.datastax.driver.core.RequestHandler$SpeculativeExecution.sendRequest(RequestHandler.java:281)
at com.datastax.driver.core.RequestHandler.startNewExecution(RequestHandler.java:115)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:91)
at com.datastax.driver.core.SessionManager.executeAsync(SessionManager.java:132)
at org.thingsboard.server.dao.AbstractDao.executeAsync(AbstractDao.java:91)
at org.thingsboard.server.dao.AbstractDao.executeAsyncWrite(AbstractDao.java:75)
at org.thingsboard.server.dao.timeseries.BaseTimeseriesDao.savePartition(BaseTimeseriesDao.java:135)
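The same check the `kill -3` dump revealed can also be scripted in-process. Below is a minimal sketch using the JDK's ThreadMXBean to list threads currently in TIMED_WAITING; the `demo-sleeper` thread is our own stand-in for the parked driver thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class TimedWaitingFinder {
    // Names of threads currently in TIMED_WAITING, like the one the thread dump exposed.
    static List<String> timedWaitingThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> names = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.TIMED_WAITING) {
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Park a helper thread so there is something to find.
        Thread sleeper = new Thread(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
        }, "demo-sleeper");
        sleeper.start();
        Thread.sleep(200); // give it time to enter TIMED_WAITING
        System.out.println(timedWaitingThreads());
        sleeper.interrupt();
    }
}
```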

As a result, we realized that the default connection pool configuration of the Cassandra driver caused poor results in our use case.

The official documentation for the connection pooling feature contains a special option, 'Simultaneous requests per connection', that allows you to tune the number of concurrent requests per single connection. We use Cassandra driver protocol v3, which by default uses the following values:

- 1024 simultaneous requests for LOCAL hosts;
- 256 simultaneous requests for REMOTE hosts.

Considering the fact that we are actually collecting data from 10,000 devices, the default values are definitely not enough. So we changed the code and updated the values for LOCAL and REMOTE hosts, setting them to the maximum possible values:

PoolingOptions poolingOptions = new PoolingOptions();
poolingOptions
    .setMaxRequestsPerConnection(HostDistance.LOCAL, 32768)
    .setMaxRequestsPerConnection(HostDistance.REMOTE, 32768);
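A quick back-of-the-envelope check shows why the default limit stalls under this load. Assuming the driver's documented protocol v3 defaults of one core connection per LOCAL host and 1024 simultaneous requests per connection (verify against your driver version), the in-flight request budget looks like this:

```java
public class PoolCapacity {
    // Maximum in-flight requests = connections per host * requests per connection * hosts.
    static int capacity(int connectionsPerHost, int requestsPerConnection, int hosts) {
        return connectionsPerHost * requestsPerConnection * hosts;
    }

    public static void main(String[] args) {
        // Assumed v3 defaults: 1 connection per LOCAL host, 1024 requests per connection.
        System.out.println("default: " + capacity(1, 1024, 1));  // prints default: 1024
        // After raising the limit to the protocol v3 maximum:
        System.out.println("tuned:   " + capacity(1, 32768, 1)); // prints tuned:   32768
        // With 10,000 devices publishing concurrently, 1024 slots saturate quickly,
        // forcing callers to park in awaitAvailableConnection, the pause in the thread dump.
    }
}
```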

Test results after the applied changes are listed below.

image

image

The results were much better, but still far from even 1 million messages per minute. We no longer saw pauses in CPU load during our tests on c4.xlarge; CPU load was high (80-95%) during the entire test. We took a couple more thread dumps to verify that the Cassandra driver was no longer awaiting available connections, and indeed we did not see this issue anymore.

Step 3: Vertical scaling

We decided to run the same tests on a twice more powerful node, c4.2xlarge, with 8 vCPUs and 15 GB of RAM. The performance increase was not linear, and the CPU was still loaded (80-90%).

image

We noticed a significant improvement in response time. After a significant peak at the start of the test, the maximum response time was within 200 ms and the average response time was ~50 ms.

image

The number of requests per second was around 10K:

image

We also executed the test on c4.4xlarge with 16 vCPUs and 30 GB of RAM, but did not notice significant improvements, so we decided to separate the Thingsboard server and move Cassandra to a three-node cluster.

Step 4: Horizontal scaling

Our main goal was to identify how many MQTT messages we could handle using a single Thingsboard server running on c4.2xlarge. We will cover the horizontal scalability of a Thingsboard cluster in a separate article. So we decided to move Cassandra to three separate c4.xlarge instances with the default configuration, and to launch the Gatling stress-test tool from two separate c4.xlarge instances simultaneously to minimize any third-party effect on latency and throughput.

image

Test specification:

The statistics of the two simultaneous test runs launched on different client machines are listed below.

image image image

Based on the data from the two simultaneous test runs, we reached 30,000 published messages per second, which equals 1.8 million per minute.
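The headline numbers follow directly from the sustained rate; a trivial check (our own helper names):

```java
public class ThroughputMath {
    static long perMinute(long perSecond) { return perSecond * 60; }
    static long perHour(long perSecond) { return perSecond * 3600L; }

    public static void main(String[] args) {
        long rate = 30_000; // combined rate observed across the two Gatling clients
        System.out.println(perMinute(rate)); // prints 1800000
        System.out.println(perHour(rate));   // prints 108000000
    }
}
```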

How to repeat the tests

We have prepared several AWS AMIs for anyone who is interested in replicating these tests. See the separate documentation page with detailed instructions.

Conclusion

This performance test demonstrates how a small Thingsboard cluster, costing approximately $1 per hour, can easily receive, store, and visualize more than 100 million messages from your devices. We will continue our work on performance improvements and plan to publish performance results for a cluster of Thingsboard servers in our next blog post.
We hope this article will be useful for people who are evaluating the platform and want to execute performance tests on their own. We also hope that the performance improvement steps will be useful for any engineers who use similar technologies.

Please let us know your feedback and follow our project on GitHub or Twitter.