
Troubleshooting

If you are using RETRY_ALL, RETRY_FAILED, RETRY_TIMED_OUT, or RETRY_FAILED_AND_TIMED_OUT strategy for some rule-engine queue, it is possible that some failed node could block the whole processing of the messages in this queue.

Here's what you can do to find the reason for the issue:

  • Analyze the Rule Engine Statistics Dashboard. Here you can find out whether some of your messages failed or timed out. At the bottom of the dashboard you can also find descriptions of the exceptions raised inside the rule engine, along with the name of the problematic rule-node.

  • After finding out which rule-node is failing, you can enable DEBUG mode for it to see which messages cause the failure and look at the detailed error description.

Tip: Separate unstable and test use-cases from the rest of the rule engine by creating a separate queue. In this case a failure will only affect the processing of that separate queue, not the whole system. You can configure this logic to happen automatically for a device by using the Device Profile feature.

Tip: Handle Failure events for all rule-nodes that connect to an external service (REST API call, Kafka, MQTT, etc.). This way you guarantee that your rule-engine processing won’t stop in case of a failure on the side of the external system. You can store the failed message in a database, send a notification, or just log the message.

Sometimes you can experience growing latency of message processing inside the rule-engine. Here are the steps you can take to discover the reason for the issue:

  • Check if there are timeouts in the Rule Engine Statistics Dashboard. Timeouts in rule-nodes slow down the processing of the queue and can lead to latency.

  • Check CPU usage for the following services:

    • ThingsBoard services (tb-nodes, tb-rule-engine and tb-core nodes, transport nodes). High CPU load on some services means that you need to scale up that part of the system.
    • PostgreSQL and pgpool (if you are in high-availability mode). High load on Postgres can lead to slow processing of all Postgres-related rule-nodes (saving attributes, reading attributes etc), and the system in general.
    • Cassandra (if you are using Cassandra as storage for timeseries data). High load on Cassandra can lead to slow processing of all Cassandra-related rule-nodes (saving timeseries etc).
    • Queue. Regardless of the queue type, make sure that it always has enough resources.
  • Check consumer-group lag (if you are using Kafka as the queue); see the Consumer group message lag section below.

  • Enable Message Pack Processing Log. It will allow you to see the name of the slowest rule-node.

  • Separate your use-cases by different queues. If some group of your devices should be processed separately from other devices, you should configure a separate rule-engine queue for this group. Also, you can just separate messages based on some logic to different queues inside of the Root rule-engine. By doing this you guarantee that slow processing of one use-case will not affect the processing of the other use-case.

You can see if there are any Failures, Timeouts or Exceptions during the processing of your rule-chain. More detailed information can be found in the Rule Engine Statistics section.

Consumer group message lag for Kafka Queue


With this metric you can identify if there is an issue with processing of your messages. Since the Queue is used for all messaging inside the system, you can analyze not only rule-engine queues but also transport, core, etc. For more detailed information about troubleshooting rule-engine processing using consumer-group lag, see the Rule Engine Monitoring page.
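
If you have shell access to a Kafka broker, a quick way to inspect the lag is the standard Kafka CLI. This is only a sketch; the bootstrap server address, the path to the script, and the group names depend on your deployment:

Terminal window
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

The LAG column shows how many messages are waiting in each partition; a lag that keeps growing for a rule-engine topic usually points to a queue that cannot keep up.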

Sometimes the problem is that you don’t have enough resources for some service. You can view CPU and Memory usage by logging into your server/container/pod and executing the top Linux command.
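
For example, after logging into the machine you can simply run:

Terminal window
top

Within top you can press 1 to see per-core CPU load and Shift+M to sort processes by memory usage.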

For more convenient monitoring, it is better to have Prometheus and Grafana configured.

If you see that some services sometimes use 100% of the CPU, you should either scale the service horizontally by creating new nodes in the cluster or scale it vertically by increasing the total amount of CPU.

You can enable logging of the slowest and most frequently called rule-nodes. To do this you need to update your logging configuration with the following logger:

<logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />

After this you can find the following messages in your logs:

2021-03-24 17:01:21,023 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by max execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1102. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by avg execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 604.0. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 1.0. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by execution count:
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] execution count: 2. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,028 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] execution count: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]

It is possible that the data inside the cache has become corrupted. Regardless of the reason, it is always safe to clear the cache — ThingsBoard will simply refill it at runtime. To clear the cache, you need to log into the server/container/pod where it is deployed, open the application command-line tool (redis-cli for Redis and valkey-cli for Valkey), and run the FLUSHALL command. To clear the cache in Sentinel mode, access the master container and execute the cache-clearing command.
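
For example, for a standalone Redis instance (a minimal sketch; add the -h, -p, or -a options if your instance requires a specific host, port, or password):

Terminal window
redis-cli FLUSHALL

For Valkey, the command is the same; just use valkey-cli instead of redis-cli.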

If you are struggling to identify the reason for some problem, you can safely clear the cache to make sure it isn’t the cause of the issue.

Regardless of the deployment type, ThingsBoard logs are stored on the same server/container as the ThingsBoard Server/Node in the following directory:

Terminal window
/var/log/thingsboard

Different deployment types provide different ways to view logs:

View the latest log entries at runtime:

Terminal window
tail -f /var/log/thingsboard/thingsboard.log

You can use the grep command to show only the output containing the desired string. For example, you can use the following command to check whether there are any errors on the backend side:

Terminal window
cat /var/log/thingsboard/thingsboard.log | grep ERROR

ThingsBoard provides the ability to enable/disable logging for certain parts of the system depending on what information you need for troubleshooting.

You can do this by modifying the logback.xml file. Like the logs themselves, it is stored on the same server/container as the ThingsBoard Server/Node in the following directory:

Terminal window
/usr/share/thingsboard/conf

Here’s an example of the logback.xml configuration:

<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">

    <appender name="fileLogAppender"
              class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>/var/log/thingsboard/thingsboard.log</file>
        <rollingPolicy
                class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>/var/log/thingsboard/thingsboard.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
            <totalSizeCap>3GB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <logger name="org.thingsboard.server" level="INFO" />
    <logger name="org.thingsboard.js.api" level="TRACE" />
    <logger name="com.microsoft.azure.servicebus.primitives.CoreMessageReceiver" level="OFF" />

    <root level="INFO">
        <appender-ref ref="fileLogAppender"/>
    </root>

</configuration>

The most useful parts of the config file for troubleshooting are the loggers. They allow you to enable or disable logging for a certain class or group of classes. In the example above the default logging level is INFO (meaning the logs will contain only general information, warnings, and errors), but for the org.thingsboard.js.api package we enabled the most detailed level of logging. It is also possible to completely disable logs for some part of the system; in the example above we did this for the com.microsoft.azure.servicebus.primitives.CoreMessageReceiver class by using the OFF log level.

To enable/disable logging for some part of the system you need to add the appropriate <logger> entry to the configuration and wait up to 10 seconds.

Different deployment types require different steps to apply the updated configuration:

Update /usr/share/thingsboard/conf/logback.xml to change the logging configuration.

You may enable Prometheus metrics by setting the environment variables METRICS_ENABLED to true and METRICS_ENDPOINTS_EXPOSE to prometheus in the configuration file.

If you are running ThingsBoard as microservices with separate services for MQTT and CoAP transport, you also need to set WEB_APPLICATION_ENABLE to true, WEB_APPLICATION_TYPE to servlet, and HTTP_BIND_PORT to 8081 for MQTT and CoAP services in order to enable the web server with Prometheus metrics.
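
For example, on a standard Linux installation you could add the variables to the ThingsBoard configuration file (a sketch assuming the default /usr/share/thingsboard/conf/thingsboard.conf location; in Docker-based setups, set the same variables in the container environment instead):

export METRICS_ENABLED=true
export METRICS_ENDPOINTS_EXPOSE=prometheus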

These metrics are exposed at: https://<yourhostname>/actuator/prometheus (no authentication required).
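
You can quickly verify that the endpoint is reachable and returning data, for example:

Terminal window
curl https://<yourhostname>/actuator/prometheus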

Some internal state metrics can be exposed by the Spring Actuator using Prometheus. Here’s the list of metrics ThingsBoard pushes to Prometheus.

  • attributes_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about writing attributes to the database. Note that there are several queues (threads) for persisting attributes in order to reach maximum performance.
  • ruleEngine_{name_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs, tmpFailed, failedIterations, successfulIterations, timeoutMsgs, tmpTimeout): stats about processing of the messages inside of the Rule Engine. They are persisted for each queue (e.g. Main, HighPriority, SequentialByOriginator etc). Some stats descriptions:
    • tmpFailed: number of messages that failed and got reprocessed later
    • tmpTimeout: number of messages that timed out and got reprocessed later
    • timeoutMsgs: number of messages that timed out and were discarded afterwards
    • failedIterations: iterations of processing a message pack where at least one message wasn’t processed successfully
  • ruleEngine_{name_of_queue}_seconds (for each present tenantId): stats about the time message processing took for different queues.
  • core (statsNames — totalMsgs, toDevRpc, coreNfs, sessionEvents, subInfo, subToAttr, subToRpc, deviceState, getAttr, claimDevice, subMsgs): stats about processing of the internal system messages. Some stats descriptions:
    • toDevRpc: number of processed RPC responses from Transport services
    • sessionEvents: number of session events from Transport services
    • subInfo: number of subscription infos from Transport services
    • subToAttr: number of subscribes to attribute updates from Transport services
    • subToRpc: number of subscribes to RPC from Transport services
    • getAttr: number of ‘get attributes’ requests from Transport services
    • claimDevice: number of Device claims from Transport services
    • deviceState: number of processed changes to Device State
    • subMsgs: number of processed subscriptions
    • coreNfs: number of processed specific ‘system’ messages
  • jsInvoke (statsNames — requests, responses, failures): stats about total, successful and failed requests to the JS executors
  • attributes_cache (results — hit, miss): stats about how many attribute requests went to the cache
  • transport (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about requests received by Transport from TB nodes
  • ruleEngine_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about pushing messages from Transport to the Rule Engine.
  • core_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about pushing messages from Transport to the TB node Device actor.
  • transport_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about requests from Transport to TB nodes.

Some metrics depend on the type of database you are using to persist timeseries data.

  • ts_latest_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about writing latest telemetry to the database. Note that there are several queues (threads) in order to reach maximum performance.
  • ts_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats about writing telemetry to the database. Note that there are several queues (threads) in order to reach maximum performance.
  • rateExecutor_currBuffer: number of messages that are currently being persisted inside Cassandra.
  • rateExecutor_tenant (for each present tenantId): number of requests that got rate-limited
  • rateExecutor (statsNames — totalAdded, totalRejected, totalLaunched, totalReleased, totalFailed, totalExpired, totalRateLimited). Stats descriptions:
    • totalAdded: number of messages that were submitted for persisting
    • totalRejected: number of messages that were rejected while trying to submit for persisting
    • totalLaunched: number of messages sent to Cassandra
    • totalReleased: number of successfully persisted messages
    • totalFailed: number of messages that were not persisted
    • totalExpired: number of expired messages that were not sent to Cassandra
    • totalRateLimited: number of messages that were not processed because of the Tenant’s rate-limits

You can import preconfigured Grafana dashboards from this repository.

You can also view Grafana dashboards after deploying the ThingsBoard Docker Compose cluster configuration (for more information, follow the Docker Compose cluster setup guide). Make sure that the MONITORING_ENABLED environment variable is set to true. Once deployed, you can access Prometheus at http://localhost:9090 and Grafana at http://localhost:3000 (by default, the username is admin and the password is foobar).
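
For example, you can export the variable in the shell session used to run the Docker Compose scripts before deploying (a sketch; the Docker Compose cluster setup guide describes the exact startup commands):

Terminal window
export MONITORING_ENABLED=true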

Sometimes after configuring OAuth you cannot see the button for logging in with an OAuth provider. This usually happens when the Domain name and Redirect URI Template contain incorrect values: they need to match the URL you use to access your ThingsBoard web page.

Base URL                  | Domain name        | Redirect URI Template
http://mycompany.com:8080 | mycompany.com:8080 | http://mycompany.com:8080/login/oauth2/code
https://mycompany.com     | mycompany.com      | https://mycompany.com/login/oauth2/code

For OAuth2 configuration, see OAuth 2.0 Support.

  • GitHub Project — check out the project and consider contributing.
  • Stack Overflow — ask questions tagged with thingsboard; the ThingsBoard team monitors this tag.
  • Contact us — if your problem isn’t answered by any of the guides above, feel free to contact the ThingsBoard team.