Troubleshooting Tools and Tips
Kafka Queue: Consumer Group Message Lag
You can use the logs shown below to identify issues with message processing or with other parts of the TBMQ infrastructure. Since Kafka is used for MQTT message processing and for other major parts of the system, such as client sessions, client subscriptions, retained messages, etc., you can analyze the overall state of the broker.
TBMQ provides the ability to monitor whether messages are produced to Kafka faster than they are consumed and processed. When that happens, you will experience growing message-processing latency. To enable this functionality, ensure that Kafka consumer-stats are enabled (see the queue.kafka.consumer-stats section of the Configuration properties).
Once Kafka consumer-stats are enabled, logs (see Troubleshooting) about offset lag for consumer groups will be generated.
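A quick way to check and enable the setting is sketched below; the configuration file name and the environment variable name are assumptions, so consult the queue.kafka.consumer-stats section of the Configuration properties for the authoritative names:

```bash
# Inspect the consumer-stats block of the broker configuration
# (file name is an assumption; adjust to your installation).
grep -A 3 'consumer-stats' /usr/share/thingsboard-mqtt-broker/conf/thingsboard-mqtt-broker.yml

# Enable stats via an environment variable before starting TBMQ
# (variable name is an assumption based on the property path above).
export TB_KAFKA_CONSUMER_STATS_ENABLED=true
```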
Here is an example of the log message:
2022-11-27 02:33:23,625 [kafka-consumer-stats-1-thread-1] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService - [msg-all-consumer-group] Topic partitions with lag: [[topic=[tbmq.msg.all], partition=[2], lag=[5]]].
From this message, we can see that five messages were pushed to partition 2 of the tbmq.msg.all topic but have not yet been processed.
In general, the logs have the following structure:
TIME [STATS_PRINTING_THREAD_NAME] INFO o.t.m.b.q.k.s.TbKafkaConsumerStatsService - [CONSUMER_GROUP_NAME] Topic partitions with lag: [[topic=[KAFKA_TOPIC], partition=[KAFKA_TOPIC_PARTITION], lag=[LAG]],[topic=[ANOTHER_TOPIC], partition=[], lag=[]],...].
Where:
- CONSUMER_GROUP_NAME - the name of the consumer group that is processing messages.
- KAFKA_TOPIC - the name of the exact Kafka topic.
- KAFKA_TOPIC_PARTITION - the number of the topic's partition.
- LAG - the number of unprocessed messages.
NOTE: Logs about consumer lag are printed only if there is a lag for this consumer group.
CPU/Memory Usage
Sometimes, a problem arises due to a lack of resources for a particular service.
You can view CPU and memory usage by logging into your server/container/pod and executing the top Linux command.
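For instance, the following standard commands give a quick view of resource usage across deployment types:

```bash
top               # per-process CPU and memory on the host or inside a container
docker stats      # per-container CPU/memory usage for Docker deployments
kubectl top pod   # per-pod usage for Kubernetes (requires metrics-server)
```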
For more convenient monitoring, it is better to configure Prometheus and Grafana.
If you see that some services occasionally use 100% of the CPU, you should either scale the service horizontally by creating new nodes in the cluster, or scale it vertically by increasing the amount of CPU available to it.
Logs
Reading Logs
Regardless of the deployment type, TBMQ logs are stored in the following directory:
/var/log/thingsboard-mqtt-broker
Different deployment tools provide different ways to view logs:
For a docker-compose deployment, view the latest logs at runtime:
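A minimal sketch, assuming a docker-compose service named tb-mqtt-broker-1 (adjust to the service names in your docker-compose.yml):

```bash
docker-compose logs -f tb-mqtt-broker-1
```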
You can use the grep command to show only the output containing a desired string. For example, you can use the following command to check whether there are any errors on the backend side:
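```bash
# Service name is an assumption; adjust to your deployment.
docker-compose logs tb-mqtt-broker-1 | grep ERROR
```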
Tip: you can redirect logs to file and then analyze with any text editor:
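```bash
# Dump the logs to a file for offline analysis (service name assumed).
docker-compose logs tb-mqtt-broker-1 > tb-mqtt-broker.log
```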
Note: you can always log into TBMQ container and view logs there:
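```bash
# Open a shell inside the container (service name assumed)...
docker-compose exec tb-mqtt-broker-1 bash
# ...and follow the log file; the directory under /var/log/thingsboard-mqtt-broker
# is named after the TB_SERVICE_ID of the node.
tail -f /var/log/thingsboard-mqtt-broker/*/thingsboard-mqtt-broker.log
```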
For a Kubernetes deployment, view all pods of the cluster:
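```bash
kubectl get pods
```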
View last logs for the desired pod:
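```bash
kubectl logs -f <pod-name>
```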
To view the TBMQ logs, use the command:
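A sketch, assuming the broker pods follow the tb-broker-N StatefulSet naming:

```bash
kubectl logs -f tb-broker-0
```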
You can use the grep command to show only the output containing a desired string. For example, you can use the following command to check whether there are any errors on the backend side:
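```bash
# Pod name is an assumption; adjust to your cluster.
kubectl logs tb-broker-0 | grep ERROR
```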
If you have multiple nodes, you can redirect the logs from all nodes to files on your machine and then analyze them:
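```bash
# Pod names are assumptions; repeat for each broker node.
kubectl logs tb-broker-0 > tb-broker-0.log
kubectl logs tb-broker-1 > tb-broker-1.log
```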
Note: you can always log into TBMQ container and view logs there:
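```bash
# Pod name is an assumption; adjust to your cluster.
kubectl exec -it tb-broker-0 -- bash
tail -f /var/log/thingsboard-mqtt-broker/*/thingsboard-mqtt-broker.log
```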
Enabling Certain Logs
To facilitate troubleshooting, TBMQ allows users to enable or disable logging for specific parts of the system. This can be achieved by modifying the logback.xml file, which is located in the following directory:
/usr/share/thingsboard-mqtt-broker/conf
Please note that there are separate files for k8s and Docker deployments.
Here’s an example of the logback.xml configuration:
<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">
<appender name="fileLogAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.log</file>
<rollingPolicy
class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>/var/log/thingsboard-mqtt-broker/${TB_SERVICE_ID}/thingsboard-mqtt-broker.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<maxHistory>30</maxHistory>
<totalSizeCap>3GB</totalSizeCap>
</rollingPolicy>
<encoder>
<pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<logger name="org.thingsboard.mqtt.broker.actors.client.service.connect" level="TRACE"/>
<logger name="org.thingsboard.mqtt.broker.actors.client.service.disconnect.DisconnectServiceImpl" level="INFO"/>
<logger name="org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem" level="OFF"/>
<root level="INFO">
<appender-ref ref="fileLogAppender"/>
</root>
</configuration>
The configuration files contain the loggers that are most useful for troubleshooting, as they allow you to enable or disable logging for a certain class or group of classes.
In the example above, the default logging level is set to INFO, which means that the logs will contain general information, warnings, and errors. However, for the org.thingsboard.mqtt.broker.actors.client.service.connect package, the most detailed level of logging (TRACE) is enabled. You can also completely disable logs for a part of the system, as is done for the org.thingsboard.mqtt.broker.actors.DefaultTbActorSystem class using the OFF log level.
To enable or disable logging for a certain part of the system, add the appropriate <logger> configuration and wait up to 10 seconds for the change to take effect.
Different deployment tools have different ways to update the logging configuration:
For a docker-compose deployment, the logback.xml file is mapped from the host into the container, so you can edit it directly on the host.
For a Kubernetes deployment, a ConfigMap entity is used to provide the tb-broker nodes with the logback configuration, so in order to update logback.xml you need to edit that ConfigMap.
After up to 10 seconds, the changes are applied to the logging configuration.
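A sketch of the Kubernetes update flow; the ConfigMap name is a placeholder, so look it up first:

```bash
kubectl get configmaps                        # find the ConfigMap holding logback.xml
kubectl edit configmap <logback-configmap>    # edit the logger levels, then save
```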
Metrics
To enable Prometheus metrics in TBMQ, you must:
- Set the STATS_ENABLED environment variable to true.
- Set the METRICS_ENDPOINTS_EXPOSE environment variable to prometheus in the configuration file.
The metrics can then be accessed via the following path: https://<yourhostname>/actuator/prometheus, and scraped by Prometheus (authentication is not required).
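A minimal sketch for verifying the endpoint; the two environment variables are the ones named above, while the port (8083) and local hostname are assumptions:

```bash
# Set before starting TBMQ (e.g. in the service environment).
export STATS_ENABLED=true
export METRICS_ENDPOINTS_EXPOSE=prometheus

# After startup, the Prometheus endpoint should respond (port assumed).
curl -s http://localhost:8083/actuator/prometheus | head
```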
Prometheus metrics
The Spring Actuator in TBMQ can expose some internal state metrics through Prometheus.
Here is a list of the metrics that TBMQ pushes to Prometheus:
TBMQ-specific metrics:
- incomingPublishMsg_published (statsNames - totalMsgs, successfulMsgs, failedMsgs): stats about incoming Publish messages to be persisted in the general queue.
- incomingPublishMsg_consumed (statsNames - totalMsgs, successfulMsgs, timeoutMsgs, failedMsgs, tmpTimeout, tmpFailed, successfulIterations, failedIterations): stats about incoming Publish messages processing from general queue.
- deviceProcessor (statsNames - successfulMsgs, failedMsgs, tmpFailed, successfulIterations, failedIterations):
stats about DEVICE client messages processing.
Some stats descriptions:
- failedMsgs: the number of messages that failed to be persisted in the database and were discarded afterwards
- tmpFailed: the number of messages that failed to be persisted in the database and were reprocessed later
- appProcessor (statsNames - successfulPublishMsgs, successfulPubRelMsgs, tmpTimeoutPublish, tmpTimeoutPubRel, timeoutPublishMsgs, timeoutPubRelMsgs, successfulIterations, failedIterations): stats about APPLICATION client messages processing.
Some stats descriptions:
- tmpTimeoutPubRel: number of PubRel messages that timed out and got reprocessed later
- tmpTimeoutPublish: number of Publish messages that timed out and got reprocessed later
- timeoutPubRelMsgs: number of PubRel messages that timed out and were discarded afterwards
- timeoutPublishMsgs: number of Publish messages that timed out and were discarded afterwards
- failedIterations: iterations of processing a message pack in which at least one message was not processed successfully
- appProcessor_latency (statsNames - puback, pubrec, pubcomp): stats about APPLICATION processor latency of different message types.
- actors_processing (statsNames - MQTT_CONNECT_MSG, MQTT_PUBLISH_MSG, MQTT_PUBACK_MSG, etc.): stats about actors processing average time of different message types.
- clientSubscriptionsConsumer (statsNames - totalSubscriptions, acceptedSubscriptions, ignoredSubscriptions):
stats about the client subscriptions read from Kafka by the broker node.
Some stats descriptions:
- totalSubscriptions: total number of new subscriptions added to the broker cluster
- acceptedSubscriptions: number of new subscriptions persisted by the broker node
- ignoredSubscriptions: number of ignored subscriptions since they were already initially processed by the broker node
- retainedMsgConsumer (statsNames - totalRetainedMsgs, newRetainedMsgs, clearedRetainedMsgs): stats about retained messages processing.
- subscriptionLookup: stats about the average time of client subscriptions lookup in the trie data structure.
- retainedMsgLookup: stats about the average time of retained messages lookup in the trie data structure.
- clientSessionsLookup: stats about the average time of client sessions lookup from the client subscriptions found for a publish message.
- notPersistentMessagesProcessing: stats about the average time to process message delivery for non-persistent clients.
- persistentMessagesProcessing: stats about the average time to process message delivery for persistent clients.
- delivery: stats about the average time of message delivery to clients.
- subscriptionTopicTrieSize: stats about the count of client subscriptions in the trie data structure.
- subscriptionTrieNodes: stats about the count of client subscription nodes in the trie data structure.
- retainMsgTrieSize: stats about the count of retained messages in the trie data structure.
- retainMsgTrieNodes: stats about the count of retained message nodes in the trie data structure.
- lastWillClients: stats about last will clients count.
- connectedSessions: stats about connected sessions count.
- connectedSslSessions: stats about the count of sessions connected via TLS.
- allClientSessions: stats about all client sessions count.
- clientSubscriptions: stats about client subscriptions count in the in-memory map.
- retainedMessages: stats about retain messages count in the in-memory map.
- activeAppProcessors: stats about active APPLICATION processors count.
- activeSharedAppProcessors: stats about active APPLICATION processors count for shared subscriptions.
- runningActors: stats about running actors count.
PostgreSQL-specific metrics:
- sqlQueue_InsertUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about inserting unauthorized clients into the database.
- sqlQueue_DeleteUnauthorizedClientQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about removing unauthorized clients from the database.
- sqlQueue_LatestTimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting the latest historical stats to the database.
- sqlQueue_TimeseriesQueue_${index_of_queue} (statsNames - totalMsgs, failedMsgs, successfulMsgs): stats about persisting historical stats to the database.
Please note that in order to achieve maximum performance, TBMQ uses several queues (threads) for each of the queues specified above.
Getting help
The best way to contact our engineers and share your ideas with them is through our Gitter channel.
Q&A forumFor community support, we recommend visiting our user forum. It's a great place to connect with other users and find solutions to common issues.
Stack OverflowThe ThingsBoard team actively monitors posts that are tagged with "thingsboard" on the user forum. If you can't find an existing question that addresses your issue, feel free to ask a new one. Our team will be happy to assist you.
If you are unable to find a solution to your problem from any of the guides provided above, please do not hesitate to contact the ThingsBoard team for further assistance.