by Danielle Vansia and Emily Dunenfeld
At Vantage, we process thousands of background jobs daily, from complex ETL workflows to simple task automation. Initially, we relied on Sidekiq to handle these jobs efficiently, but as our scale and complexity grew, we encountered challenges with reliability, debugging, and orchestration. To address these issues, we transitioned to Temporal, a workflow orchestration engine that provides durable, stateful execution with automatic retries.
Sidekiq and Temporal are both used for background job processing, but they have different architectures and different approaches to job execution and durability.
Because of these differences, a direct one-to-one comparison isn’t entirely fair. Sidekiq is focused on fast, asynchronous job execution, while Temporal is built for durable, stateful Workflows. The feature comparison below highlights key distinctions rather than suggesting they are interchangeable.
Sidekiq vs Temporal features.
Vantage is a Rails application. We started off using Sidekiq for background job processing, primarily to handle complex ETL (Extract-Transform-Load) tasks in our cloud cost optimization workflows. ETL processes, which are fundamental to data processing, extract cost data from various sources, transform it into meaningful insights, and load it into our system for analysis. As our operations grew, so did the need for scalability and debugging. Vantage helps both small and large corporations optimize cloud spending across platforms like AWS, GCP, and Azure, and provides tools for tracking and forecasting cloud costs. We even recently announced an integration for ingesting Temporal Cloud costs. Our internal workflows process vast amounts of data: currently, roughly 25 million jobs per day across hundreds of worker processes, ranging from complex data processing pipelines (ETLs) to simple email notifications.
That’s when we turned to Temporal; however, we recognize that Temporal is not a direct replacement for Sidekiq. Each tool has its strengths, and they cater to different needs. Yet, this shift allows us to track every step of our workflows and improves visibility for our growing engineering team.
Temporal’s execution is based on “durable, reliable, and scalable function execution.” This means that Temporal ensures that functions continue to run reliably, even in the event of failures. Durable means that Workflows will persist, regardless of a time limit. Reliable means that Temporal automatically retries failed tasks until they succeed. Scalable means that Temporal is able to distribute workloads across workers and handle high loads. Therefore, Temporal is ideal for orchestrating long-running, complex processes and does not get hung up on network failures.
Temporal’s state management and automatic retries make it exceptionally resilient. Since the Workflow execution state is durably stored, failures don’t disrupt progress: Workflows can resume exactly where they left off, or another worker can pick up execution from the persisted state. With our old Sidekiq setup, we ran into instances where jobs were canceled; since they were not automatically resumed, we had to write recovery code ourselves rather than relying on the tool.
In addition, our setup involves complex job dependencies. We have many jobs that depend on other jobs that depend on other jobs, and so on. This level of orchestration is less common in Rails due to its traditional background job frameworks, which lack built-in state management. With Temporal, Durable Execution ensures reliability in job dependencies, making it a big factor in our decision to switch.
We use ID-based deduplication to guarantee that only one instance runs for a given input. This prevents parallel executions, eliminating redundant processing. In cases where ETL processes start based on customer actions, such as adding a Virtual Tag, duplication prevention ensures efficiency and consistency.
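The deduplication idea can be sketched in plain Ruby, with no Temporal server required. In Temporal, a Workflow ID derived deterministically from the input plays this role: attempting to start a second Workflow with the same ID while one is open is rejected. The `EtlStarter` class and `workflow_id_for` helper below are hypothetical names for illustration only.

```ruby
# Sketch of ID-based deduplication. With Temporal, the Workflow ID itself
# enforces this: only one open execution per ID is allowed.
class EtlStarter
  def initialize
    @running = {} # workflow_id => true while an execution is open
  end

  # Deterministic ID: the same input always maps to the same Workflow ID.
  def workflow_id_for(virtual_tag_id)
    "virtual-tag-etl-#{virtual_tag_id}"
  end

  # Returns the ID when a new run starts, or nil for a duplicate request.
  def start(virtual_tag_id)
    id = workflow_id_for(virtual_tag_id)
    return nil if @running[id] # duplicate: an execution is already open
    @running[id] = true
    id
  end

  def finish(virtual_tag_id)
    @running.delete(workflow_id_for(virtual_tag_id))
  end
end
```

A second `start` for the same Virtual Tag is a no-op until the first run finishes, which is the consistency guarantee we rely on.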
We also have numerous long-running processes that can take over a day to complete. With Temporal’s ability to support Workflows that last for weeks or even indefinitely, timeouts are no longer a concern, ensuring that even the longest-running processes complete reliably without intervention.
The Temporal architecture is designed to scale, making it well-suited for handling increasing workloads and complex distributed systems. One major advantage is the fan-out pattern, which was cumbersome to manage in Sidekiq due to its lack of a first-class implementation. With Temporal, running 10 or 10,000 parallel Workflows follows the same structure, simplifying engineering efforts and reducing operational complexity.
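The fan-out shape can be illustrated with a minimal plain-Ruby sketch, using threads as a stand-in for parallel Workflow executions. The structure is identical whether `inputs` holds 10 items or 10,000; `process_resource` is a hypothetical per-resource step.

```ruby
# Placeholder for a real per-resource ETL step.
def process_resource(resource_id)
  resource_id * 2
end

# Fan out one branch per input, then wait for all branches and
# collect their results in order.
def fan_out(inputs)
  threads = inputs.map do |input|
    Thread.new { process_resource(input) }
  end
  threads.map(&:value)
end
```

With Temporal, each branch would instead be a child Workflow or Activity, and the server, not the caller, tracks every branch's state and retries.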
Temporal supports multiple programming languages, providing flexibility if we ever decide to rewrite or expand our stack beyond Ruby.
One of the main factors in making the switch was debuggability, not just for engineering but also for customer success and sales engineers who need visibility into state and ETL processes for customer tasks. As our team grew, we didn’t have time to dig through logs to understand what happened with a job.
Sidekiq has a Web UI that lets you monitor job execution in real time. The dashboard displays running jobs and shows their duration and queue placement; however, it does not offer an out-of-the-box way to trace job hierarchies, meaning if one job enqueues another, there’s no direct visibility into their relationship. Another issue is that there is no record of a job’s history once complete.
Sidekiq dashboard. Image courtesy of Sidekiq documentation.
Errors and job execution details are stored in log/sidekiq.log. You can also retry failed jobs on the Web UI. Note that Sidekiq Enterprise offers additional debugging features, but the free version relies on logs, retries, and queue management for issue resolution. There are several open-source tools available to help debug Sidekiq workers beyond its built-in Web UI. For example, the pry-remote Ruby gem lets you attach an interactive Pry session to a running worker so that you can inspect state and variables in real time.
Temporal offers a detailed user interface with the entire hierarchical Workflow history and the ability to scope debugging. This was a key advantage we recognized, as it also maintains a comprehensive job history and state tracking. Jobs are stateless: they take an input and produce an output, and the complexity of processing lives in the Workflow rather than in application state.
Temporal dashboard. Image courtesy of Temporal documentation.
Temporal has a configurable 1–90-day retention period, which refers to how long Temporal stores data for closed Workflow executions within a namespace. During this time, Workflows remain accessible for inspection and debugging. This debuggability helps to give our other teams visibility into imports and Workflows through the ETL process.
Another factor, especially as the team grows and we bring on more junior engineers, is simplicity and the ability to quickly train other engineers. Sidekiq does have some advantage in this area: it’s straightforward, easy to reason about, and ideal for one-off jobs. Deploying is also relatively straightforward. On the other hand, Temporal requires more setup for deployment; however, Temporal Cloud does help to mitigate that complexity.
Where Temporal excels is in its ability to establish reusable patterns for complex Workflows, like ETL. Because Temporal manages input/output storage natively, there’s no need to write intermediate states to a database. This makes multi-step processes easier for us to track and restart without any manual intervention. We recognize that while it takes some time to fully grasp Temporal’s execution model, the long-term benefits for handling stateful Workflows and easy job restarts outweigh the initial learning curve.
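The reusable ETL shape can be sketched as a chain of steps where each step's output feeds the next, with nothing written to a database in between. The step names below are hypothetical; with Temporal, each would be an Activity whose result is persisted in the Workflow's event history rather than in our own tables.

```ruby
# Hypothetical ETL steps; each is a stand-in for a Temporal Activity.
module EtlSteps
  def self.extract(source)
    source.fetch(:rows)
  end

  def self.transform(rows)
    rows.map { |r| r.merge(cost_usd: r[:cost_cents] / 100.0) }
  end

  def self.load(rows)
    rows.size # placeholder: number of rows "loaded"
  end
end

# The Workflow body: intermediate results flow step to step in memory,
# so a restart replays from durable history instead of a checkpoint table.
def run_etl(source)
  rows = EtlSteps.extract(source)
  enriched = EtlSteps.transform(rows)
  EtlSteps.load(enriched)
end
```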
We started experimenting with Temporal over two years ago, with the first hello world commit happening in 2023. However, it wasn’t until July 2024 that we actually started migrating over.
At that time, Temporal didn’t officially support Ruby. Fortunately, Coinbase had released a Ruby library for Temporal, which became the unofficial, de facto Ruby client. Temporal has since created its own SDK.
We broke the migration process into two phases. The first phase focused on learning how to write Temporal-acceptable code and understanding its Workflow model. That meant learning to make Workflows deterministic, so that re-running a Workflow executes the same sequence of steps, and to make Activities idempotent, so that they can safely be re-executed if they fail. This phase also consisted of scoping and planning, such as identifying the specific tasks and Workflows that needed to be migrated.
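Idempotency is the property that re-running a step leaves the system in the same state as running it once. A minimal sketch, using a hypothetical in-memory `CostRecordStore` as a stand-in for a database:

```ruby
# An idempotent write: the record id is the key, so an upsert executed
# twice (e.g., after a Temporal retry) is the same as executing it once.
class CostRecordStore
  def initialize
    @records = {}
  end

  def write(record)
    @records[record[:id]] = record # keyed upsert, not a blind append
  end

  def count
    @records.size
  end
end
```

By contrast, an append-only write would double-count on retry; keying every write this way is what made our Activities safe for Temporal's automatic retries. Determinism is the complementary rule on the Workflow side: anything non-repeatable (current time, random values, network calls) moves into Activities so a replayed Workflow takes the same path.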
Once we had a clearer picture of Temporal and the scope, we moved on to the second phase: the actual migration of Workflows. This part of the process was particularly time-consuming because we needed to make sure there was no downtime for our customers by thoroughly testing and monitoring.
Regarding the conversion strategy, we opted for an incremental approach. We migrated Workflows one by one, based on domain, in order to verify the functionality of each Workflow independently and pinpoint any issues quickly. We began with lower-risk domains such as cache invalidation and resource syncing, operations that are easy to validate and quick to perform, allowing for fast iteration.
Once we had more experience and confidence writing Temporal Workflows and Activities, and had established patterns around concurrency and error handling, we started to migrate jobs that process cloud usage data. These jobs vary by customer, can run from seconds to hours, and can produce result sets of millions of rows.
Data consistency was the top priority, so we validated by processing the jobs in parallel and comparing outputs at each step to ensure we arrived at the same outcomes. Since our background processing can be initiated by multiple sources (e.g., regular cost updates, provider notifications, and customer configurations) we needed a consistent path to invoke both the existing Sidekiq Jobs and the new Temporal Workflows. In practice, this meant adopting a standard code path for enqueuing an update for a resource. In this path, we could handle feature flags, so we could roll out new Workflows in batches to limit any potential issues.
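The shared enqueue path can be sketched as a small dispatcher: one entry point for resource updates, with a feature flag deciding per resource whether the legacy Sidekiq job or the new Temporal Workflow handles it. The class name, `flag_enabled` check, and both backends are hypothetical placeholders.

```ruby
# One code path for enqueuing a resource update; a feature flag routes
# each resource to the old or new backend, enabling batched rollout.
class ResourceUpdateDispatcher
  def initialize(flag_enabled)
    @flag_enabled = flag_enabled # callable: resource_id -> true/false
  end

  def enqueue(resource_id)
    if @flag_enabled.call(resource_id)
      start_temporal_workflow(resource_id)
    else
      enqueue_sidekiq_job(resource_id)
    end
  end

  private

  def start_temporal_workflow(resource_id)
    [:temporal, resource_id] # placeholder for a Temporal client call
  end

  def enqueue_sidekiq_job(resource_id)
    [:sidekiq, resource_id] # placeholder for a Sidekiq perform_async call
  end
end
```

Because every caller goes through the same dispatcher, flipping the flag for a batch of resources moves them to Temporal without touching any of the call sites.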
Finally, we continuously monitored using Datadog dashboards and Temporal Cloud Metrics to track state transitions, service requests, and Temporal Actions. Since Temporal enforces API rate limits based on Actions and requests per second, we paid close attention to these parameters. We also monitored key performance indicators like latency, worker polling success, and overall Workflow metrics, including successes and failures.
We added more granular metrics by leveraging middleware, Workflow, and Activity metadata to get more insight into individual failures by Workflow, domain, customer, and more. Additionally, we used Custom Search Attributes to query Workflows directly in the UI, helping us quickly identify individual failures.
While experimenting, we initially created all new Workflows within the same production namespace. Temporal allows up to 400 Actions per second (APS) per namespace by default, and we approached that limit in our single namespace, which was further exacerbated by aggressive worker retry behavior. Eventually, we moved to a domain-based namespace design: we broke our workloads into isolated domains, estimated the number of Activities each namespace would perform to approximate its APS, and migrated the relevant Workflows accordingly. In hindsight, we would have prioritized namespace design earlier in the process to avoid rate-limiting issues.
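The per-namespace sizing exercise is just arithmetic: expected daily Workflow volume times actions per Workflow, spread over a day, compared against the 400 APS default. A sketch with hypothetical numbers:

```ruby
# Rough APS estimate for one namespace, assuming load is spread evenly
# across the day (bursty workloads need headroom below the limit).
def estimated_aps(workflows_per_day:, actions_per_workflow:)
  (workflows_per_day * actions_per_workflow) / 86_400.0
end

# e.g. 1,000,000 workflows/day at ~30 actions each:
# estimated_aps(workflows_per_day: 1_000_000, actions_per_workflow: 30)
# comes out to roughly 347 APS, uncomfortably close to the 400 default.
```

Estimates like this, per domain, told us which domains could share a namespace and which needed their own.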
In the end, we found that Temporal provides stronger reliability, scalability, and debuggability, compared to Sidekiq, which made it the right choice for our infrastructure requirements. While Sidekiq remains useful for simple background jobs, Temporal’s Workflow orchestration gave us better visibility and helped us to ensure that critical ETL processes run efficiently. This transition has improved both our internal engineering operations and our ability to support our customer base.