by Danielle Vansia and Emily Dunenfeld
At Vantage, we process thousands of background jobs daily, from complex ETL workflows to simple task automation. Initially, we relied on Sidekiq to handle these jobs efficiently, but as our scale and complexity grew, we encountered challenges with reliability, debugging, and orchestration. To address these issues, we transitioned to Temporal, a workflow orchestration engine that provides durable, stateful execution with automatic retries.
Sidekiq and Temporal are both used for background job processing, but they have different architectures and different approaches to job execution and durability.
Because of these differences, a direct one-to-one comparison isn’t entirely fair. Sidekiq is focused on fast, asynchronous job execution, while Temporal is built for durable, stateful Workflows. The feature comparison below highlights key distinctions rather than suggesting they are interchangeable.
Sidekiq vs Temporal features.
Vantage is a Rails application. We started off using Sidekiq for background job processing, primarily to handle complex ETL (Extract-Transform-Load) tasks in our cloud cost optimization workflows. ETL processes, which are fundamental to data processing, extract cost data from various sources, transform it into meaningful insights, and load it into our system for analysis. As our operations grew, so did the need for scalability and debugging. Vantage helps both small and large corporations optimize cloud spending across platforms like AWS, GCP, and Azure, and provides tools for tracking and forecasting cloud costs. We even recently announced an integration for ingesting Temporal Cloud costs. Our internal workflows process vast amounts of data: currently, roughly 25 million jobs per day across hundreds of worker processes, ranging from complex data processing pipelines (ETLs) to simple email notifications.
That’s when we turned to Temporal; however, we recognize that Temporal is not a direct replacement for Sidekiq. Each tool has its strengths, and they cater to different needs. Yet, this shift allows us to track every step of our workflows and improves visibility for our growing engineering team.
Temporal’s execution is based on “durable, reliable, and scalable function execution.” This means that Temporal ensures that functions continue to run reliably, even in the event of failures. Durable means that Workflows will persist, regardless of a time limit. Reliable means that Temporal automatically retries failed tasks until they succeed. Scalable means that Temporal is able to distribute workloads across workers and handle high loads. Therefore, Temporal is ideal for orchestrating long-running, complex processes and does not get hung up on network failures.
Temporal’s state management and automatic retries make it exceptionally resilient. Since the Workflow execution state is durably stored, failures don’t disrupt progress: Workflows can resume exactly where they left off, or another worker can pick up execution from the persisted state. With our old Sidekiq setup, we ran into instances where jobs were canceled; since they were not automatically resumed, we had to write recovery code ourselves rather than relying on the tool.
In addition, our setup involves complex job dependencies. We have many jobs that depend on other jobs that depend on other jobs, and so on. This level of orchestration is less common in Rails due to its traditional background job frameworks, which lack built-in state management. With Temporal, Durable Execution ensures reliability in job dependencies, making it a big factor in our decision to switch.
We use ID-based deduplication to guarantee that only one instance runs for a given input. This prevents parallel executions, eliminating redundant processing. In cases where ETL processes start based on customer actions, such as adding a Virtual Tag, duplication prevention ensures efficiency and consistency.
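The deduplication idea can be sketched in plain Ruby, with no Temporal server required. In Temporal, a Workflow ID derived deterministically from the input plays this role: attempting to start a second Workflow with the same ID while one is open is rejected. The `EtlStarter` class and `workflow_id_for` helper below are hypothetical names for illustration only.

```ruby
# Sketch of ID-based deduplication. With Temporal, the Workflow ID itself
# enforces this: only one open execution per ID is allowed.
class EtlStarter
  def initialize
    @running = {} # workflow_id => true while an execution is open
  end

  # Deterministic ID: the same input always maps to the same Workflow ID.
  def workflow_id_for(virtual_tag_id)
    "virtual-tag-etl-#{virtual_tag_id}"
  end

  # Returns the ID when a new run starts, or nil for a duplicate request.
  def start(virtual_tag_id)
    id = workflow_id_for(virtual_tag_id)
    return nil if @running[id] # duplicate: an execution is already open
    @running[id] = true
    id
  end

  def finish(virtual_tag_id)
    @running.delete(workflow_id_for(virtual_tag_id))
  end
end
```

A second `start` for the same Virtual Tag is a no-op until the first run finishes, which is the consistency guarantee we rely on.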
We also have numerous long-running processes that can take over a day to complete. With Temporal’s ability to support Workflows that last for weeks or even indefinitely, timeouts are no longer a concern, ensuring that even the longest-running processes complete reliably without intervention.
The Temporal architecture is designed to scale, making it well-suited for handling increasing workloads and complex distributed systems. One major advantage is the fan-out pattern, which was cumbersome to manage in Sidekiq due to its lack of a first-class implementation. With Temporal, running 10 or 10,000 parallel Workflows follows the same structure, simplifying engineering efforts and reducing operational complexity.
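The fan-out shape can be illustrated with a minimal plain-Ruby sketch, using threads as a stand-in for parallel Workflow executions. The structure is identical whether `inputs` holds 10 items or 10,000; `process_resource` is a hypothetical per-resource step.

```ruby
# Placeholder for a real per-resource ETL step.
def process_resource(resource_id)
  resource_id * 2
end

# Fan out one branch per input, then wait for all branches and
# collect their results in order.
def fan_out(inputs)
  threads = inputs.map do |input|
    Thread.new { process_resource(input) }
  end
  threads.map(&:value)
end
```

With Temporal, each branch would instead be a child Workflow or Activity, and the server, not the caller, tracks every branch's state and retries.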
Temporal supports multiple programming languages, providing flexibility if we ever decide to rewrite or expand our stack beyond Ruby.
One of the main factors in making the switch was debuggability, not just for engineering but also for customer success and sales engineers who need visibility into state and ETL processes for customer tasks. As our team grew, we didn’t have time to dig through logs to understand what happened with a job.
Sidekiq has a Web UI that lets you monitor job execution in real time. The dashboard displays running jobs and shows their duration and queue placement; however, it does not offer an out-of-the-box way to trace job hierarchies, meaning if one job enqueues another, there’s no direct visibility into their relationship. Another issue is that there is no record of a job’s history once complete.
Sidekiq dashboard. Image courtesy of Sidekiq documentation.
Errors and job execution details are stored in log/sidekiq.log. You can also retry failed jobs on the Web UI. Note that Sidekiq Enterprise offers additional debugging features, but the free version relies on logs, retries, and queue management for issue resolution. There are several open-source tools available to help debug Sidekiq workers beyond its built-in Web UI. For example, the pry-remote Ruby gem lets you attach an interactive Pry session to a running worker so that you can inspect state and variables in real time.
Temporal offers a detailed user interface with the entire hierarchical Workflow history and the ability to scope debugging. This was a key advantage we recognized, as it also maintains a comprehensive job history and state tracking. Jobs are stateless: they take an input and produce an output, and the complexity of processing lives in the Workflow rather than in application state.
Temporal dashboard. Image courtesy of Temporal documentation.
Temporal has a configurable 1–90-day retention period, which refers to how long Temporal stores data for closed Workflow executions within a namespace. During this time, Workflows remain accessible for inspection and debugging. This debuggability helps to give our other teams visibility into imports and Workflows through the ETL process.
Another factor, especially as the team grows and we bring on more junior engineers, is simplicity and the ability to quickly train other engineers. Sidekiq does have some advantage in this area: it’s straightforward, easy to reason about, and ideal for one-off jobs. Deploying is also relatively straightforward. On the other hand, Temporal requires more setup for deployment; however, Temporal Cloud does help to mitigate that complexity.
Where Temporal excels is in its ability to establish reusable patterns for complex Workflows, like ETL. Because Temporal manages input/output storage natively, there’s no need to write intermediate states to a database. This makes multi-step processes easier for us to track and restart without any manual intervention. We recognize that while it takes some time to fully grasp Temporal’s execution model, the long-term benefits for handling stateful Workflows and easy job restarts outweigh the initial learning curve.
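The reusable ETL shape can be sketched as a chain of steps where each step's output feeds the next, with nothing written to a database in between. The step names below are hypothetical; with Temporal, each would be an Activity whose result is persisted in the Workflow's event history rather than in our own tables.

```ruby
# Hypothetical ETL steps; each is a stand-in for a Temporal Activity.
module EtlSteps
  def self.extract(source)
    source.fetch(:rows)
  end

  def self.transform(rows)
    rows.map { |r| r.merge(cost_usd: r[:cost_cents] / 100.0) }
  end

  def self.load(rows)
    rows.size # placeholder: number of rows "loaded"
  end
end

# The Workflow body: intermediate results flow step to step in memory,
# so a restart replays from durable history instead of a checkpoint table.
def run_etl(source)
  rows = EtlSteps.extract(source)
  enriched = EtlSteps.transform(rows)
  EtlSteps.load(enriched)
end
```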
We started experimenting with Temporal over two years ago, with the first hello world commit happening in 2023. However, it wasn’t until July 2024 that we actually started migrating over.
At that time, Temporal didn’t officially support Ruby. Fortunately, Coinbase had released a Ruby library for Temporal, which became the unofficial, de facto Ruby client. Temporal has since created its own SDK.
We broke the migration process into two phases. The first phase focused on learning how to write Temporal-acceptable code and understanding its Workflow model. That meant learning to make Workflows deterministic, so that re-running a Workflow executes the same sequence of steps, and to make Activities idempotent, so that they can safely be re-executed if they fail. This phase also consisted of scoping and planning, such as identifying the specific tasks and Workflows that needed to be migrated.
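Idempotency is the property that re-running a step leaves the system in the same state as running it once. A minimal sketch, using a hypothetical in-memory `CostRecordStore` as a stand-in for a database:

```ruby
# An idempotent write: the record id is the key, so an upsert executed
# twice (e.g., after a Temporal retry) is the same as executing it once.
class CostRecordStore
  def initialize
    @records = {}
  end

  def write(record)
    @records[record[:id]] = record # keyed upsert, not a blind append
  end

  def count
    @records.size
  end
end
```

By contrast, an append-only write would double-count on retry; keying every write this way is what made our Activities safe for Temporal's automatic retries. Determinism is the complementary rule on the Workflow side: anything non-repeatable (current time, random values, network calls) moves into Activities so a replayed Workflow takes the same path.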
Once we had a clearer picture of Temporal and the scope, we moved on to the second phase: the actual migration of Workflows. This part of the process was particularly time-consuming because we needed to make sure there was no downtime for our customers by thoroughly testing and monitoring.
Regarding the conversion strategy, we opted for an incremental approach. We migrated Workflows one by one, based on domain, in order to verify the functionality of each Workflow independently and pinpoint any issues quickly. We began with lower-risk domains such as cache invalidation and resource syncing, operations that are easy to validate and quick to perform, allowing for fast iteration.
Once we had more experience and confidence writing Temporal Workflows and Activities, and had established patterns around concurrency and error handling, we started to migrate jobs that process cloud usage data. These jobs vary by customer, can run from seconds to hours, and can produce result sets of millions of rows.
Data consistency was the top priority, so we validated by processing the jobs in parallel and comparing outputs at each step to ensure we arrived at the same outcomes. Since our background processing can be initiated by multiple sources (e.g., regular cost updates, provider notifications, and customer configurations) we needed a consistent path to invoke both the existing Sidekiq Jobs and the new Temporal Workflows. In practice, this meant adopting a standard code path for enqueuing an update for a resource. In this path, we could handle feature flags, so we could roll out new Workflows in batches to limit any potential issues.
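The shared enqueue path can be sketched as a small dispatcher: one entry point for resource updates, with a feature flag deciding per resource whether the legacy Sidekiq job or the new Temporal Workflow handles it. The class name, `flag_enabled` check, and both backends are hypothetical placeholders.

```ruby
# One code path for enqueuing a resource update; a feature flag routes
# each resource to the old or new backend, enabling batched rollout.
class ResourceUpdateDispatcher
  def initialize(flag_enabled)
    @flag_enabled = flag_enabled # callable: resource_id -> true/false
  end

  def enqueue(resource_id)
    if @flag_enabled.call(resource_id)
      start_temporal_workflow(resource_id)
    else
      enqueue_sidekiq_job(resource_id)
    end
  end

  private

  def start_temporal_workflow(resource_id)
    [:temporal, resource_id] # placeholder for a Temporal client call
  end

  def enqueue_sidekiq_job(resource_id)
    [:sidekiq, resource_id] # placeholder for a Sidekiq perform_async call
  end
end
```

Because every caller goes through the same dispatcher, flipping the flag for a batch of resources moves them to Temporal without touching any of the call sites.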
Finally, we continuously monitored using Datadog dashboards and Temporal Cloud Metrics to track state transitions, service requests, and Temporal Actions. Since Temporal enforces API rate limits based on Actions and requests per second, we paid close attention to these parameters. We also monitored key performance indicators like latency, worker polling success, and overall Workflow metrics, including successes and failures.
We added more granular metrics by leveraging middleware, Workflow, and Activity metadata to get more insight into individual failures by Workflow, domain, customer, and more. Additionally, we used Custom Search Attributes to query Workflows directly in the UI, helping us quickly identify individual failures.
While experimenting, we initially created all new Workflows within the same production namespace. Temporal allows up to 400 Actions per second (APS) per namespace by default, and we approached that limit in our single namespace, which was further exacerbated by aggressive worker retry behavior. Eventually, we moved to a domain-based namespace design: we broke our workloads into isolated domains, estimated the number of Activities each namespace would perform to approximate its APS, and migrated the relevant Workflows accordingly. In hindsight, we would have prioritized namespace design earlier in the process to avoid rate-limiting issues.
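The per-namespace sizing exercise is just arithmetic: expected daily Workflow volume times actions per Workflow, spread over a day, compared against the 400 APS default. A sketch with hypothetical numbers:

```ruby
# Rough APS estimate for one namespace, assuming load is spread evenly
# across the day (bursty workloads need headroom below the limit).
def estimated_aps(workflows_per_day:, actions_per_workflow:)
  (workflows_per_day * actions_per_workflow) / 86_400.0
end

# e.g. 1,000,000 workflows/day at ~30 actions each:
# estimated_aps(workflows_per_day: 1_000_000, actions_per_workflow: 30)
# comes out to roughly 347 APS, uncomfortably close to the 400 default.
```

Estimates like this, per domain, told us which domains could share a namespace and which needed their own.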
In the end, we found that Temporal provides stronger reliability, scalability, and debuggability, compared to Sidekiq, which made it the right choice for our infrastructure requirements. While Sidekiq remains useful for simple background jobs, Temporal’s Workflow orchestration gave us better visibility and helped us to ensure that critical ETL processes run efficiently. This transition has improved both our internal engineering operations and our ability to support our customer base.