Troubleshooting

If a rule-engine queue uses the RETRY_ALL, RETRY_FAILED, RETRY_TIMED_OUT, or RETRY_FAILED_AND_TIMED_OUT processing strategy, a single failing rule node can block all message processing in that queue.

Here is what you can do to identify the cause:

  • Analyze the Rule Engine Statistics Dashboard. Check whether any messages failed or timed out. Exception details, including the failing rule node’s name, appear at the bottom of the dashboard.

  • After identifying the failing rule node, enable debug mode on it to see which messages trigger the failure and to examine the detailed error.

Tip: Separate unstable and test use cases from production by creating a dedicated queue. Failures then affect only that queue, not the whole system. You can assign devices to this queue automatically through the Device Profile feature.

Tip: Handle Failure events for all rule nodes that connect to external services (REST API, Kafka, MQTT, etc.) to prevent rule-engine processing from stopping when an external system fails. You can store the failed message in the database, send a notification, or log it.

You may experience growing message processing latency in the rule-engine. Here are the steps to diagnose the cause:

  • Check if there are timeouts in the Rule Engine Statistics Dashboard. Timeouts in rule-nodes slow down the processing of the queue and can lead to latency.

  • Check CPU usage for the following services:

    • ThingsBoard services (tb-nodes, tb-rule-engine and tb-core nodes, transport nodes). High CPU load on some services means that you need to scale up that part of the system.
    • PostgreSQL and pgpool (if you are in high-availability mode). High load on Postgres can slow down all Postgres-related rule-nodes (saving attributes, reading attributes, etc.) and the system in general.
    • Cassandra (if you are using Cassandra as storage for timeseries data). High load on Cassandra can slow down all Cassandra-related rule-nodes (saving timeseries, etc.).
    • Queue. Regardless of the queue type, make sure that it always has enough resources.
  • Check consumer-group lag (if you are using Kafka as queue).

  • Enable Message Pack Processing Log. It will allow you to see the name of the slowest rule-node.

  • Separate use cases with dedicated queues. If a group of devices requires isolated processing, configure a separate rule-engine queue for that group. You can also route messages to different queues using logic in the Root rule chain. This ensures slow processing of one use case does not affect others.

Check for Failures, Timeouts, and Exceptions during rule-chain processing. For more details, see the Rule Engine Statistics section.

Consumer Group Message Lag for Kafka Queue

Use this metric to identify message processing issues. Since the queue handles all system messaging, you can monitor not only rule-engine queues but also transport, core, and others. For details on troubleshooting rule-engine processing with consumer-group lag, see the Rule Engine Monitoring page.
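
As a rough illustration of reading this metric from the command line, the sketch below sums the LAG column of the output of Kafka's kafka-consumer-groups.sh tool. The helper name total_lag is hypothetical, and the bootstrap address and group name in the usage line are placeholders you would replace with your own.

```shell
# Hypothetical helper: sum the LAG column of `kafka-consumer-groups.sh --describe` output.
total_lag() {
  awk '
    # Find the LAG column in the header row, because its position varies by Kafka version.
    lag_col == 0 { for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i; next }
    # Add up numeric lag values; "-" means the partition has no committed offset.
    $lag_col ~ /^[0-9]+$/ { sum += $lag_col }
    END { print sum + 0 }
  '
}

# Usage (placeholder address and group name):
#   kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group <group> | total_lag
```

A total that keeps growing between runs suggests that consumers are not keeping up with producers.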

To check whether a service lacks resources, log into the server/container/pod and inspect CPU and memory usage with the top command.

For continuous monitoring, configure Prometheus and Grafana.

If a service consistently reaches 100% CPU, scale it horizontally by adding cluster nodes or vertically by increasing CPU allocation.

Enable logging of the slowest and most frequently called rule nodes by adding the following logger to your logging configuration:

<logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />

The following entries will then appear in your logs:

2021-03-24 17:01:21,023 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by max execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1102. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] DEBUG o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] max execution time: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by avg execution time:
2021-03-24 17:01:21,024 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 604.0. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] avg execution time: 1.0. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - Top Rule Nodes by execution count:
2021-03-24 17:01:21,025 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f740670-8cc0-11eb-bcd9-d343878c0c7f] execution count: 2. [RuleChain: Thermostat|RuleNode: Device Profile Node(3f740670-8cc0-11eb-bcd9-d343878c0c7f)]
2021-03-24 17:01:21,028 [tb-rule-engine-consumer-24-thread-3] INFO o.t.s.s.q.TbMsgPackProcessingContext - [Main][3f6debf0-8cc0-11eb-bcd9-d343878c0c7f] execution count: 1. [RuleChain: Thermostat|RuleNode: Message Type Switch(3f6debf0-8cc0-11eb-bcd9-d343878c0c7f)]
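
When the log contains many such entries, a small filter can rank them. The sketch below extracts the "max execution time" lines and sorts them by duration, slowest first. The helper name slowest_rule_nodes is hypothetical, and the pattern assumes the line layout shown in the sample above.

```shell
# Hypothetical helper: rank rule nodes by max execution time, slowest first.
# Reads the named log files, or standard input when no file is given.
slowest_rule_nodes() {
  # Emit "<time> <rule chain and node>" for each "max execution time" entry,
  # then sort numerically in descending order.
  sed -n 's/.*max execution time: \([0-9][0-9]*\)\. \[\(RuleChain: .*\)\]$/\1 \2/p' "$@" |
    sort -rn
}

# Usage:
#   slowest_rule_nodes /var/log/thingsboard/thingsboard.log | head -n 10
```

The same node appears once per logged message pack, so repeated entries near the top point to a consistently slow node rather than a single spike.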

Cached data can become corrupted. Clearing the cache is always safe — ThingsBoard repopulates it at runtime. To clear it, log into the server/container/pod, open the command-line tool (redis-cli for Redis or valkey-cli for Valkey), and run FLUSHALL. In Sentinel mode, access the master container and run the same command.

If you cannot identify the cause of a problem, clear the cache to rule it out.

Regardless of the deployment type, ThingsBoard logs are stored on the same server/container as the ThingsBoard Server/Node in the following directory:

/var/log/thingsboard

Different deployment types provide different ways to view logs:

View the latest log entries in real time:

tail -f /var/log/thingsboard/thingsboard.log

Use grep to filter output by a specific string. For example, to check for backend errors:

grep ERROR /var/log/thingsboard/thingsboard.log
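
If the error output is long, grouping it by logger can show where the errors concentrate. The following is a minimal sketch with a hypothetical helper name, errors_by_logger. It assumes the log pattern from the logback.xml example on this page (timestamp, thread in square brackets, level, logger, then " - " and the message).

```shell
# Hypothetical helper: count ERROR lines per logger, most frequent first.
# Reads the named log files, or standard input when no file is given.
errors_by_logger() {
  # Keep the logger name from lines shaped like "<timestamp> [<thread>] ERROR <logger> - <msg>".
  sed -n 's/^[^]]*\] *ERROR  *\([^ ]*\) - .*/\1/p' "$@" |
    awk '{ count[$0]++ } END { for (name in count) print count[name], name }' |
    sort -rn
}

# Usage:
#   errors_by_logger /var/log/thingsboard/thingsboard.log
```

Stack-trace lines do not match the pattern, so each error is counted once.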

ThingsBoard lets you enable or disable logging for specific components depending on what you need for troubleshooting.

Modify the logback.xml file, located on the same server/container as ThingsBoard, in the following directory:

/usr/share/thingsboard/conf

Here’s an example of the logback.xml configuration:

<!DOCTYPE configuration>
<configuration scan="true" scanPeriod="10 seconds">
<appender name="fileLogAppender"
class="ch.qos.logback.core.rolling.RollingFileAppender">
<file>/var/log/thingsboard/thingsboard.log</file>
<rollingPolicy
class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<fileNamePattern>/var/log/thingsboard/thingsboard.%d{yyyy-MM-dd}.%i.log</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<maxHistory>30</maxHistory>
<totalSizeCap>3GB</totalSizeCap>
</rollingPolicy>
<encoder>
<pattern>%d{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<logger name="org.thingsboard.server" level="INFO" />
<logger name="org.thingsboard.js.api" level="TRACE" />
<logger name="com.microsoft.azure.servicebus.primitives.CoreMessageReceiver" level="OFF" />
<root level="INFO">
<appender-ref ref="fileLogAppender"/>
</root>
</configuration>

The most useful config elements for troubleshooting are the loggers, which enable or disable logging per class or package. In the example above, the default level is INFO (general info, warnings, and errors), while org.thingsboard.js.api is set to TRACE, the most detailed logging level. Logging can also be disabled completely for a component, as shown for com.microsoft.azure.servicebus.primitives.CoreMessageReceiver with the OFF level.

To change logging for a component, add or update the <logger> entry and wait up to 10 seconds for the change to take effect.
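
To confirm which levels are currently configured, you could list the logger entries in the file. The sketch below uses a hypothetical helper, list_loggers, and assumes each <logger> element sits on one line with name before level, as in the example above.

```shell
# Hypothetical helper: print "<level> <logger name>" for each single-line
# <logger name="..." level="..."/> entry in the given logback.xml file.
list_loggers() {
  sed -n 's/.*<logger name="\([^"]*\)" level="\([^"]*\)".*/\2 \1/p' "$1"
}

# Usage:
#   list_loggers /usr/share/thingsboard/conf/logback.xml
```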

Different deployment types require different steps to apply the updated configuration:

Update /usr/share/thingsboard/conf/logback.xml to change the logging configuration.

Enable Prometheus metrics by setting METRICS_ENABLED to true and METRICS_ENDPOINTS_EXPOSE to prometheus in the configuration file.

When running ThingsBoard as microservices with separate MQTT and CoAP transport services, also set WEB_APPLICATION_ENABLE to true, WEB_APPLICATION_TYPE to servlet, and HTTP_BIND_PORT to 8081 for those services.

Metrics are available at https://<yourhostname>/actuator/prometheus (no authentication required).
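
For a quick look without Prometheus, the endpoint output can be filtered on the command line. This sketch assumes the endpoint returns the standard Prometheus text exposition format. The helper name metrics_matching is hypothetical, and the exact metric names exposed by your version may differ from the prefix used in the usage line.

```shell
# Hypothetical helper: keep metric samples whose name starts with the given prefix.
# Reads Prometheus text exposition format on standard input.
metrics_matching() {
  awk -v prefix="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    index($0, prefix) == 1 { print }   # keep samples that start with the prefix
  '
}

# Usage (the prefix is an example only):
#   curl -s "https://<yourhostname>/actuator/prometheus" | metrics_matching ruleEngine_
```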

The following internal state metrics are exposed via Spring Actuator to Prometheus.

  • attributes_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing attributes to the database. Several queues (threads) handle attribute persistence for maximum throughput.
  • ruleEngine_{name_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs, tmpFailed, failedIterations, successfulIterations, timeoutMsgs, tmpTimeout): stats for Rule Engine message processing, per queue (e.g., Main, HighPriority, SequentialByOriginator). Stat descriptions:
    • tmpFailed: number of messages that failed and got reprocessed later
    • tmpTimeout: number of messages that timed out and got reprocessed later
    • timeoutMsgs: number of messages that timed out and were discarded afterwards
    • failedIterations: number of message-pack processing iterations in which at least one message was not processed successfully
  • ruleEngine_{name_of_queue}_seconds (for each present tenantId): stats about the time message processing took for different queues.
  • core (statsNames — totalMsgs, toDevRpc, coreNfs, sessionEvents, subInfo, subToAttr, subToRpc, deviceState, getAttr, claimDevice, subMsgs): stats for internal system message processing:
    • toDevRpc: number of processed RPC responses from Transport services
    • sessionEvents: number of session events from Transport services
    • subInfo: number of subscription infos from Transport services
    • subToAttr: number of subscriptions to attribute updates from Transport services
    • subToRpc: number of subscriptions to RPC from Transport services
    • getAttr: number of ‘get attributes’ requests from Transport services
    • claimDevice: number of Device claims from Transport services
    • deviceState: number of processed changes to Device State
    • subMsgs: number of processed subscriptions
    • coreNfs: number of processed specific ‘system’ messages
  • jsInvoke (statsNames — requests, responses, failures): stats for total, successful, and failed requests to JS executors
  • attributes_cache (results — hit, miss): stats about how many attribute requests went to the cache
  • transport (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for requests received by Transport from TB nodes
  • ruleEngine_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the Rule Engine
  • core_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for messages pushed from Transport to the TB node Device actor
  • transport_producer (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for requests from Transport to TB

Some metrics depend on the type of database you are using to persist timeseries data.

  • ts_latest_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing latest telemetry to the database. Several queues (threads) maximize write throughput.
  • ts_queue_{index_of_queue} (statsNames — totalMsgs, failedMsgs, successfulMsgs): stats for writing telemetry to the database. Several queues (threads) maximize write throughput.
  • rateExecutor_currBuffer: number of messages that are currently being persisted inside Cassandra.
  • rateExecutor_tenant (for each present tenantId): number of requests that got rate-limited
  • rateExecutor (statsNames — totalAdded, totalRejected, totalLaunched, totalReleased, totalFailed, totalExpired, totalRateLimited). Stats descriptions:
    • totalAdded: number of messages that were submitted for persisting
    • totalRejected: number of messages that were rejected while trying to submit for persisting
    • totalLaunched: number of messages sent to Cassandra
    • totalReleased: number of successfully persisted messages
    • totalFailed: number of messages that were not persisted
    • totalExpired: number of expired messages that were not sent to Cassandra
    • totalRateLimited: number of messages that were not processed because of the Tenant’s rate-limits

You can import preconfigured Grafana dashboards from this repository.

Grafana dashboards are also available when deploying the ThingsBoard Docker Compose cluster. See the Docker Compose cluster setup guide for details. Set MONITORING_ENABLED to true before deployment. Once running, Prometheus is available at http://localhost:9090 and Grafana at http://localhost:3000 (default credentials: admin / foobar).

After configuring OAuth, the button for logging in with the OAuth provider may not appear. This happens when the Domain name and Redirect URI Template values are incorrect: they must match the URL you use to access your ThingsBoard web page.

Base URL                  | Domain name        | Redirect URI Template
http://mycompany.com:8080 | mycompany.com:8080 | http://mycompany.com:8080/login/oauth2/code
https://mycompany.com     | mycompany.com      | https://mycompany.com/login/oauth2/code
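
The pattern in the table can be expressed as a small helper that derives both values from the base URL. This is a sketch only: oauth_values is a hypothetical name, and it does nothing beyond the string handling shown in the table.

```shell
# Hypothetical helper: derive the Domain name and Redirect URI Template
# from the base URL used to open the ThingsBoard web page.
oauth_values() {
  oauth_base_url=${1%/}                # drop a trailing slash, if any
  oauth_domain=${oauth_base_url#*://}  # strip the scheme, keeping host[:port]
  printf 'Domain name: %s\n' "$oauth_domain"
  printf 'Redirect URI Template: %s/login/oauth2/code\n' "$oauth_base_url"
}

# Usage:
#   oauth_values "http://mycompany.com:8080"
```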

For OAuth2 configuration, see OAuth 2.0 Support.

There are cases when the “Forgot my password” email is not sent. This can happen for one of the following reasons:

  • User is not activated: If the user account is not activated, the “Forgot my password” email will not be sent. You must activate the user first.
  • SMTP is not configured: If SMTP is not set up on your ThingsBoard instance, password recovery emails cannot be delivered. Make sure SMTP is configured correctly.
  • Case sensitivity of email login: By default, email login in ThingsBoard is case-sensitive for security reasons. To disable case sensitivity, update the SECURITY_USER_LOGIN_CASE_SENSITIVE parameter to false in your configuration file. After that, you must restart your ThingsBoard service to apply the changes.

Dashboard Export (tb-web-report)

The Export Dashboard feature relies on tb-web-report — a service that runs a headless Chromium browser to render dashboards and capture them as PDF or PNG. Most issues with the Dashboard export process are caused by timeout values that are too low for the complexity of the dashboard being rendered.

Reading tb-web-report logs

Before diagnosing an issue, check where logs are written for your deployment.

Log files are written to:

/var/log/tb-web-report/tb-web-report-<YYYY-MM-DD-HH>.log

Files rotate hourly and are compressed after rotation. To view recent entries:

tail -200 /var/log/tb-web-report/tb-web-report-*.log

The default log level is info. To increase verbosity, set LOGGER_LEVEL to debug:

Edit /etc/tb-web-report/conf/tb-web-report.conf and set:

export LOGGER_LEVEL=debug

Available levels from least to most verbose: error, warn, info, http, verbose, debug, silly

Remember to set the level back to info after diagnosing, then restart tb-web-report.

For the full reference of Report Service configuration parameters, see the Report Service configuration reference.

Symptom: “Failed to load dashboard page: page.goto: Timeout 10000ms exceeded”

This is the most common Dashboard Export error.

In the tb-web-report log:

Failed to load dashboard page: page.goto: Timeout 10000ms exceeded

In the ThingsBoard UI or API response: The export request fails with a 503 error.

This error typically means tb-web-report did not have enough time to render the dashboard before the timeout expired.
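
To see how often this happens, you could count the matching lines in the current logs. The helper name count_goto_timeouts below is hypothetical. It reads plain-text logs only, so rotated files that have already been compressed would need to be decompressed first.

```shell
# Hypothetical helper: count page.goto timeout errors.
# Reads the named log files, or standard input when no file is given.
count_goto_timeouts() {
  # grep -c prints the number of matching lines; "|| true" keeps a zero count
  # from being reported as a failed command.
  cat "$@" | grep -cF 'page.goto: Timeout' || true
}

# Usage:
#   count_goto_timeouts /var/log/tb-web-report/tb-web-report-*.log
```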

Common causes:

  • The dashboard has many widgets, or widgets that fetch large amounts of data.
  • The configured time range covers a long period.
  • LOAD_DASHBOARD_RESOURCES_TIMEOUT is set too low (default: 10,000 ms).

How to resolve:

1. Increase the tb-web-report timeouts.

Edit /etc/tb-web-report/conf/tb-web-report.conf and add or update:

export LOAD_DASHBOARD_RESOURCES_TIMEOUT=180000
export DASHBOARD_IDLE_WAIT_TIME=20000
export GENERATE_REPORT_TIMEOUT=180000

2. Increase the ThingsBoard server-side timeout.

Edit /etc/thingsboard/conf/thingsboard.conf and add or update:

export SPRING_MVC_ASYNC_REQUEST_TIMEOUT=60000

3. Restart both services (tb-web-report and ThingsBoard).

4. If the timeout persists, review the dashboard complexity. The more widgets and data a dashboard loads, the longer rendering takes. Consider:

  • Reducing the number of widgets on dashboards used for reports.
  • Narrowing the default time range.
  • Using aggregated data rather than raw telemetry where possible.

Symptom: Report generates but is empty or shows no data

The PDF or image file is produced, but all widgets appear blank.

In the tb-web-report log: No error is logged — the export completes successfully from the service’s perspective.

In the exported file: Widgets are visible but show no data, or the dashboard appears empty.

Cause: tb-web-report takes a snapshot immediately after the page finishes loading. If widgets fetch telemetry asynchronously, the data may not have arrived yet when the snapshot is taken.

How to resolve:

Set DASHBOARD_IDLE_WAIT_TIME to add an extra delay after page load before the snapshot is taken. Start with 5,000 ms (5 seconds) and increase if widgets are still blank.

Edit /etc/tb-web-report/conf/tb-web-report.conf:

export DASHBOARD_IDLE_WAIT_TIME=5000

Restart tb-web-report after changing this value.

Symptom: Export request times out at the ThingsBoard level

The export request times out before tb-web-report finishes rendering.

In the ThingsBoard log (not the tb-web-report log): An async request timeout error appears. There is no corresponding error in the tb-web-report log.

In the UI or API response: “504 Gateway Timeout” (if behind a reverse proxy), or a timeout error returned from ThingsBoard without any error in the tb-web-report log.

Cause: SPRING_MVC_ASYNC_REQUEST_TIMEOUT on the ThingsBoard side is shorter than the time tb-web-report needs to generate the export.

How to resolve:

Set SPRING_MVC_ASYNC_REQUEST_TIMEOUT to match or exceed your GENERATE_REPORT_TIMEOUT value in tb-web-report (default: 30,000 ms). The value is in milliseconds.

Edit /etc/thingsboard/conf/thingsboard.conf and add or update:

export SPRING_MVC_ASYNC_REQUEST_TIMEOUT=60000

Restart ThingsBoard after changing this value.
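
The relationship between the two timeouts can be checked with a short script. The sketch below uses hypothetical helper names (conf_value, check_export_timeouts) and assumes both values are set explicitly as unquoted integers on export lines in the two conf files named on this page.

```shell
# Hypothetical helper: print the last value assigned to variable $1 in conf file $2.
conf_value() {
  sed -n "s/^export $1=//p" "$2" | tail -n 1
}

# Hypothetical check: $1 is the ThingsBoard conf file, $2 is the tb-web-report conf file.
check_export_timeouts() {
  spring_timeout=$(conf_value SPRING_MVC_ASYNC_REQUEST_TIMEOUT "$1")
  report_timeout=$(conf_value GENERATE_REPORT_TIMEOUT "$2")
  if [ -z "$spring_timeout" ] || [ -z "$report_timeout" ]; then
    echo "UNKNOWN: one of the values is not set explicitly"
    return 2
  fi
  if [ "$spring_timeout" -ge "$report_timeout" ]; then
    echo "OK: $spring_timeout >= $report_timeout"
  else
    echo "TOO LOW: $spring_timeout < $report_timeout"
    return 1
  fi
}

# Usage:
#   check_export_timeouts /etc/thingsboard/conf/thingsboard.conf /etc/tb-web-report/conf/tb-web-report.conf
```

The script only compares values found in the files. When a variable is absent, it reports UNKNOWN rather than guessing a default.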

All environment variable names are case-sensitive. An incorrectly spelled variable name is silently ignored — the service falls back to the default value without any warning or error message.
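
Because a misspelled name produces no warning, a quick scan of the conf file can help. The sketch below flags exported names that are not in a known list. The helper name unknown_exports is hypothetical, and KNOWN_VARS contains only the tb-web-report variables mentioned on this page, so extend it from the configuration reference before relying on it.

```shell
# Variables named on this page; an assumption, not the full list of supported parameters.
KNOWN_VARS="LOGGER_LEVEL LOAD_DASHBOARD_RESOURCES_TIMEOUT DASHBOARD_IDLE_WAIT_TIME GENERATE_REPORT_TIMEOUT"

# Hypothetical helper: print each exported name in conf file $1 that is not in KNOWN_VARS.
unknown_exports() {
  sed -n 's/^export \([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" |
    while read -r name; do
      case " $KNOWN_VARS " in
        *" $name "*) ;;           # recognized name, nothing to report
        *) echo "$name" ;;        # unrecognized or wrongly cased name
      esac
    done
}

# Usage:
#   unknown_exports /etc/tb-web-report/conf/tb-web-report.conf
```

The comparison is case-sensitive, so a name such as Dashboard_Idle_Wait_Time is reported as unknown.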

  • GitHub Project — check out the project and consider contributing.
  • Stack Overflow — ask questions tagged with thingsboard; the ThingsBoard team monitors this tag.
  • Contact us — if your problem isn’t answered by any of the guides above, contact the ThingsBoard team directly.