Possible performance issues
Below we describe different scenarios of what may go wrong.
Here is what you can do to find the reason for the issue:
Tip: Separate unstable and test use cases from the rest of the rule-engine by creating a separate queue. In this case a failure will only affect the processing of this separate queue, and not the whole system. You can configure this logic to happen automatically for the device by using the Device Profile feature.
Tip: Handle Failure events for all rule-nodes that connect to an external service (REST API call, Kafka, MQTT, etc.). This way you guarantee that your rule-engine processing won't stop in case of a failure on the side of the external system. You can store failed messages in the DB, send a notification, or just log the message.
Sometimes you can experience growing latency of message processing inside the rule-engine. The troubleshooting instruments described below can help you discover the reason for the issue.
Troubleshooting instruments and tips
Rule Engine Statistics Dashboard
You can see if there are any Failures, Timeouts or Exceptions during the processing of your rule-chain. You can find more detailed information here.
Consumer group message lag for Kafka Queue
Note: This method can be used only if Kafka is selected as a queue.
With this log you can identify whether there is an issue with the processing of your messages (since the Queue is used for all messaging inside the system, you can analyze not only rule-engine queues but also transport, core, etc.). For more detailed information about troubleshooting rule-engine processing using consumer-group lag, click here.
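As a quick sketch, if you have access to the Kafka CLI tools you can inspect consumer-group lag directly; the bootstrap server address and the group name placeholder below are assumptions:
# List all consumer groups used by the platform
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# Describe a group and check the LAG column for each partition
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <consumer-group-name>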
CPU/Memory Usage
Sometimes the problem is that you don't have enough resources for some service. You can view CPU and Memory usage by logging into your server/container/pod and executing the top Linux command.
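As a minimal sketch on a standalone Linux installation, you can narrow top down to the ThingsBoard process; the process-name filter below is an assumption:
# Show CPU/Memory usage only for processes whose command line mentions "thingsboard"
top -p "$(pgrep -d',' -f thingsboard)"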
For more convenient monitoring, it is better to configure Prometheus and Grafana.
If you see that some services sometimes use 100% of the CPU, you should either scale the service horizontally by creating new nodes in the cluster, or scale it vertically by increasing the amount of CPU available to the service.
Message Pack Processing Log
You can enable logging of the slowest and most frequently called rule-nodes. To do this you need to update your logging configuration with the following logger:
<logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />
After this you can find the following messages in your logs:
2021-03-24 17:01:21,023 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by max execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1102. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by avg execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 604.0. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 1.0. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by execution count:
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] execution count: 2. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,028 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] execution count: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
Clearing Redis Cache
Note: This can be used only if Redis is selected as a cache.
It is possible that the data inside the cache somehow got corrupted. Regardless of the reason, it is always safe to clear the cache; ThingsBoard will simply refill it at runtime.
To clear the Redis cache you need to log into the server/container/pod with Redis on it and run the redis-cli FLUSHALL command.
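If Redis runs in Docker, a minimal sketch might look like this; the container name redis is an assumption:
# Flush all keys inside the Redis container (container name is an assumption)
docker exec -it redis redis-cli FLUSHALL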
So if you are struggling to identify the reason for some problem, you can safely clear the Redis cache to make sure it isn't the cause of the issue.
Logs
Read logs
Regardless of the deployment type, ThingsBoard logs are stored on the same server/container as ThingsBoard Server/Node itself in the following directory:
/var/log/thingsboard
Different deployment tools provide different ways to view logs:
View the latest logs in runtime:
You can use the grep command to show only the output containing the desired string. For example, you can use the following command to check whether there are any errors on the backend side:
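A minimal sketch for a standalone installation, assuming the default log location:
# Follow the log in real time
tail -f /var/log/thingsboard/thingsboard.log
# Show only the lines that contain errors
grep ERROR /var/log/thingsboard/thingsboard.log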
View the latest logs in runtime:
If you suspect the issue is related only to the rule-engine, you can filter and view only the rule-engine logs:
You can use the grep command to show only the output containing the desired string. For example, you can use the following command to check whether there are any errors on the backend side:
Tip: you can redirect logs to a file and then analyze them with any text editor:
Note: you can always log into the ThingsBoard container and view the logs there:
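A minimal sketch for a docker-compose deployment; the service and container names (tb-core1, tb-rule-engine1) are assumptions that depend on your compose file:
# Follow the logs of the selected services
docker compose logs -f tb-core1 tb-rule-engine1
# Show only the error lines
docker compose logs tb-core1 | grep ERROR
# Redirect the logs to a file and analyze them with any text editor
docker compose logs tb-core1 > tb-core1.log
# Or log into the container and read the log files directly
docker exec -it tb-core1 bash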
View all pods of the cluster:
View the latest logs for the desired pod:
To view ThingsBoard node logs use the command:
You can use the grep command to show only the output containing the desired string. For example, you can use the following command to check whether there are any errors on the backend side:
If you have multiple nodes, you can redirect logs from all nodes to files on your machine and then analyze them:
Note: you can always log into the ThingsBoard container and view the logs there:
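A minimal sketch for a kubernetes deployment; the pod names tb-node-0 and tb-node-1 are illustrative:
# View all pods of the cluster
kubectl get pods
# Follow the latest logs of a ThingsBoard node pod
kubectl logs -f tb-node-0
# Show only the error lines
kubectl logs tb-node-0 | grep ERROR
# Redirect logs from several nodes to files on your machine
kubectl logs tb-node-0 > tb-node-0.log
kubectl logs tb-node-1 > tb-node-1.log
# Or log into the container and read the log files directly
kubectl exec -it tb-node-0 -- bash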
Enable certain logs
ThingsBoard provides the ability to enable/disable logging for certain parts of the system depending on what information you need for troubleshooting.
You can do this by modifying the logback.xml file. Like the logs themselves, it is stored on the same server/container as the ThingsBoard Server/Node in the following directory:
/usr/share/thingsboard/conf
Here’s an example of the logback.xml configuration:
<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">

    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard/thingsboard.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard/thingsboard.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.server" level="INFO" />
    <logger name="org.thingsboard.js.api" level="TRACE" />
    <logger name="com.microsoft.azure.servicebus.primitives.CoreMessageReceiver" level="OFF" />

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>

</configuration>
The parts of the config file that are most useful for troubleshooting are the loggers.
They allow you to enable/disable logging for a certain class or group of classes.
In the example above the default logging level is INFO (meaning that logs will contain only general information, warnings and errors), but for the package org.thingsboard.js.api
we enabled the most detailed level of logging (TRACE).
It is also possible to completely disable logs for some part of the system; in the example above we did this for the com.microsoft.azure.servicebus.primitives.CoreMessageReceiver
class using the OFF log level.
To enable/disable logging for some part of the system you need to add the proper <logger>
configuration and wait up to 10 seconds.
Different deployment tools provide different ways to update the logging configuration:
For standalone deployment you need to update the logback.xml file located in /usr/share/thingsboard/conf.
For docker-compose deployment the configuration directory is mapped from the container to the host machine, so you can edit logback.xml there.
For kubernetes deployment we are using the ConfigMap kubernetes entity to provide tb-node pods with the logback configuration. So in order to update logback.xml you need to update the corresponding ConfigMap; see the sketch below. After 10 seconds the changes should be applied to the logging configuration.
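A minimal sketch with kubectl; the ConfigMap name below is an assumption, so list the ConfigMaps first to find the one that contains logback.xml:
# Find the ConfigMap that holds the logback configuration (the name varies between deployments)
kubectl get configmaps
# Edit it in place; the name below is an assumption
kubectl edit configmap tb-node-logback-config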
Metrics
You may enable Prometheus metrics by setting the environment variable METRICS_ENDPOINTS_EXPOSE
to the value prometheus
in the configuration file.
These metrics are exposed at the path https://<yourhostname>/actuator/prometheus
and can be scraped by Prometheus (no authentication required).
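As a quick sanity check that the endpoint is reachable, you can query it directly; the host and port below are assumptions for a default local installation:
# Fetch the first few exposed metrics (host/port are assumptions)
curl -s http://localhost:8080/actuator/prometheus | head -n 20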
Getting help
Community chat
Our Gitter channel is the best way to contact our engineers and share your ideas with them.
Q&A forum
Our user forum is a great place to go for community support.
Stack Overflow
The ThingsBoard team will also monitor posts tagged thingsboard. If there aren’t any existing questions that help, please ask a new one!
If your problem isn’t answered by any of the guides above, feel free to contact ThingsBoard team.