At Scripbox, we're in the midst of an exciting architectural transition. As we build new services, we're embracing Elixir for its fault-tolerance and concurrency capabilities. However, this transition brings new challenges—particularly around background job processing.

Coming from a Ruby ecosystem where Sidekiq handles our background jobs beautifully, we need something equally robust for our Elixir services. That's how Flume was born—our answer to distributed job processing in the Elixir world.

The Challenge: Email at Scale

Our first real test case is email delivery. We're using AWS SES, which allows up to 450 API calls per second for sending emails. This sounds straightforward, but managing this rate limit while ensuring reliable delivery is more complex than it initially appears.

We need a system that can:

• Rate limit our requests to stay within AWS SES limits

• Handle back pressure when downstream systems are overwhelmed

• Batch jobs efficiently as our use cases expand

• Provide reliability and fault tolerance

Why Not Just Use Existing Solutions?

The Elixir ecosystem has several job processing libraries, but none quite fit our specific needs. We need fine-grained rate limiting, sophisticated back pressure handling, and the flexibility to batch jobs dynamically. Most importantly, we want something that leverages Elixir's strengths—particularly GenStage for flow control.

Taking inspiration from Sidekiq's elegant design, we're building Flume to be the Elixir equivalent, but with features tailored to our distributed, high-throughput environment.

Architecture: GenStage + Redis

Flume's architecture centers around two key technologies:

GenStage provides the foundation for our processing pipeline. It gives us built-in back pressure handling—when consumers can't keep up with producers, the system automatically slows down job generation. This prevents memory bloat and ensures stable performance under varying loads.

Redis serves as our job store and coordination layer. Jobs are persisted in Redis before processing, ensuring durability. Redis also powers our distributed rate limiting using sliding window counters.
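
To make that concrete, here is a minimal sketch of persisting a job with the Redix client. The key name and JSON payload shape are illustrative rather than Flume's actual wire format, and Jason is assumed for serialization:

{:ok, conn} = Redix.start_link("redis://localhost:6379")

# Serialize the job and push it onto a per-pipeline Redis list, so it
# survives node restarts before a worker ever picks it up.
job = Jason.encode!(%{worker: "EmailWorker", args: [42, "welcome"]})
{:ok, _queue_length} = Redix.command(conn, ["LPUSH", "flume:queue:email", job])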

The pipeline flows like this:

1. Producer: Pulls jobs from Redis queues

2. Rate Limiter: Controls job flow based on configured limits

3. Processor: Executes jobs with configurable concurrency

4. Batcher: Groups jobs for efficient bulk processing
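
To show how demand wires these stages together, here is a stripped-down sketch of the first and last stages. The module names and the counter standing in for a Redis queue are illustrative, not Flume's internals:

defmodule Sketch.Producer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:producer, 0}

  # GenStage calls this only when downstream stages ask for events, so when
  # consumers are busy, no demand arrives and production simply pauses.
  # The counter stands in for jobs popped from a Redis queue.
  @impl true
  def handle_demand(demand, counter) do
    jobs = Enum.to_list(counter..(counter + demand - 1))
    {:noreply, jobs, counter + demand}
  end
end

defmodule Sketch.Processor do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok) do
    # max_demand bounds how many unprocessed jobs this stage can hold,
    # which is what keeps memory flat under load.
    {:consumer, :ok, subscribe_to: [{Sketch.Producer, max_demand: 10}]}
  end

  @impl true
  def handle_events(jobs, _from, state) do
    Enum.each(jobs, &IO.inspect(&1, label: "processing"))
    {:noreply, [], state}
  end
end

Start the producer, then the processor, and events flow downstream in bounded chunks; the rate limiter and batcher slot in between as producer_consumer stages.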

Rate Limiting: The Foundation

Our rate limiting implementation is crucial for the AWS SES integration. We're using a sliding window approach with Redis:
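
Here is a sketch of the idea using the Redix client and a sorted set scored by timestamp. The module and key names are illustrative, not Flume's actual code:

defmodule RateLimitSketch do
  # Sliding-window check: count requests recorded in the last `window_ms`
  # milliseconds and admit the call only if the count is within the limit.
  def check(conn, key, limit, window_ms) do
    now = System.system_time(:millisecond)

    {:ok, [_removed, _added, count, _ttl]} =
      Redix.pipeline(conn, [
        # 1. Evict entries older than the window.
        ["ZREMRANGEBYSCORE", key, "0", to_string(now - window_ms)],
        # 2. Record this request, scored by timestamp (member made unique).
        ["ZADD", key, to_string(now), "#{now}:#{System.unique_integer([:positive])}"],
        # 3. Count what is left inside the window.
        ["ZCARD", key],
        # 4. Let Redis clean up the key if the pipeline goes idle.
        ["PEXPIRE", key, to_string(window_ms)]
      ])

    if count <= limit, do: :ok, else: :rate_limited
  end
end

# Usage: RateLimitSketch.check(conn, "flume:limit:email", 400, 1_000)

Because every node talks to the same sorted set, the window is shared across the cluster rather than tracked per process.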

Each pipeline can have its own rate limit configuration. For our email service, we set it to 400 requests per second (leaving some headroom below the 450 limit). The rate limiter tracks requests across all nodes in our cluster, ensuring we never exceed limits even as we scale horizontally.

The beauty of integrating this with GenStage is that when we hit rate limits, back pressure naturally flows upstream, pausing job consumption until capacity becomes available again.

Back Pressure: Protecting Our Resources

Back pressure is where Elixir and GenStage truly shine. When our email service can't process jobs fast enough—perhaps AWS SES is responding slowly—GenStage automatically reduces the flow of new jobs into our pipeline.

This prevents several problems we've seen in other systems:

• Memory exhaustion from queued jobs

• Cascading failures when downstream services slow down

• Lost jobs due to system overload

The system self-regulates, maintaining stability even when external services have performance issues.

Batching: Efficiency at Scale

As our use cases expand beyond email to internal system integrations, batching becomes essential. Some operations are much more efficient when performed in groups—database inserts, API calls to external services, or file processing operations.

Flume's batching works by accumulating jobs until either:

• The batch size limit is reached

• A timeout expires

This gives us the best of both worlds: efficiency through batching and low latency for time-sensitive jobs. Workers can process individual jobs or entire batches, depending on their implementation.
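
A condensed sketch of that accumulate-until-size-or-timeout loop as a GenServer. The module name, limits, and flush callback are illustrative; Flume runs this logic inside its GenStage pipeline:

defmodule BatcherSketch do
  use GenServer

  @max_size 50
  @timeout_ms 1_000

  def start_link(flush_fun), do: GenServer.start_link(__MODULE__, flush_fun, name: __MODULE__)
  def add(job), do: GenServer.cast(__MODULE__, {:add, job})

  @impl true
  def init(flush_fun), do: {:ok, %{jobs: [], flush: flush_fun, timer: nil}}

  @impl true
  def handle_cast({:add, job}, state) do
    # Start the timeout clock on the first job of a fresh batch.
    state = %{state | jobs: [job | state.jobs], timer: state.timer || start_timer()}

    if length(state.jobs) >= @max_size do
      {:noreply, flush(state)}   # size limit reached: flush immediately
    else
      {:noreply, state}
    end
  end

  @impl true
  def handle_info(:flush, state), do: {:noreply, flush(state)}  # timeout expired

  defp start_timer, do: Process.send_after(self(), :flush, @timeout_ms)

  defp flush(%{jobs: []} = state), do: %{state | timer: nil}

  defp flush(state) do
    if state.timer, do: Process.cancel_timer(state.timer)
    state.flush.(Enum.reverse(state.jobs))
    %{state | jobs: [], timer: nil}
  end
end

# Usage: flush here just prints; in Flume the batch would go to a worker.
# {:ok, _} = BatcherSketch.start_link(&IO.inspect(&1, label: "batch"))
# BatcherSketch.add(%{to: "user@example.com"})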

Durability and Fault Tolerance

In a distributed system, failures are inevitable. Flume handles this through several mechanisms:

Job Backup: Before processing, jobs are moved to a backup queue. If a worker crashes, jobs can be restored and retried.
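
This is conceptually the classic Redis reliable-queue pattern; a sketch with Redix (key names illustrative):

{:ok, conn} = Redix.start_link("redis://localhost:6379")

# RPOPLPUSH atomically moves the job onto a backup list as it is dequeued.
{:ok, job} = Redix.command(conn, ["RPOPLPUSH", "flume:queue:email", "flume:backup:email"])
# ... process the job; if the node crashes here, the job survives in the backup ...
{:ok, _} = Redix.command(conn, ["LREM", "flume:backup:email", "1", job])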

Exponential Backoff: Failed jobs are retried with increasing delays, preventing system overload from repeatedly failing operations.
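
The shape of the delay curve looks roughly like this sketch; the base, cap, and jitter values are illustrative, not Flume's defaults:

defmodule BackoffSketch do
  # Delay doubles with each attempt, is capped, and gets random jitter so
  # many failing jobs don't all retry at the same instant.
  def delay_ms(attempt, base \\ 500, cap \\ 60_000) do
    min(base * Integer.pow(2, attempt), cap) + :rand.uniform(250)
  end
end

# BackoffSketch.delay_ms(0) is roughly 500ms, delay_ms(3) roughly 4s,
# and from delay_ms(7) onward the cap holds it near 60s.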

Dead Letter Queues: Jobs that fail repeatedly are moved to dead letter queues for manual inspection.

Real-World Performance

Our email pipeline is now processing thousands of jobs per hour while maintaining strict rate limits. The system automatically adapts to varying loads—during high email volume periods, back pressure keeps everything stable.

What's particularly impressive is the observability. Through Telemetry events, we can see exactly how the system behaves:

• Queue depths and processing rates

• Rate limiting effectiveness

• Back pressure events and their resolution
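
For example, a handler for a queue-depth event can be attached in a few lines. The event name and measurement keys below are assumptions for illustration; Flume's docs list the events it actually emits:

:telemetry.attach(
  "flume-queue-depth-logger",
  [:flume, :queue, :depth],
  fn _event, measurements, metadata, _config ->
    # Called synchronously every time the event fires.
    IO.puts("queue #{metadata.queue}: depth #{measurements.depth}")
  end,
  nil
)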

Configuration and Usage

Flume is designed to be simple to use despite its sophisticated internals. Configuration happens in config/flume.exs:

config :flume,
  pipelines: [
    email_service: [
      producer: [redis_url: "redis://localhost:6379"],
      rate_limiter: [allowed: 400, period: 1000],
      processor: [stages: 5]
    ]
  ]
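
Here `rate_limiter: [allowed: 400, period: 1000]` encodes our 400-requests-per-second budget (assuming the period is in milliseconds), and `stages: 5` controls how many processor stages run concurrently.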

Enqueueing jobs is straightforward:

Flume.enqueue(:email_service, EmailWorker, [user_id, template])
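
A worker, in turn, might look like the following sketch. The perform/2 callback name is our assumption about Flume's worker contract; check the repo for the exact convention:

defmodule EmailWorker do
  # Hypothetical worker: invoked with the args that were enqueued above.
  def perform(user_id, template) do
    # e.g. render the template and hand the message to AWS SES here
    IO.puts("sending #{template} email to user ##{user_id}")
  end
end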

Lessons Learned

Building Flume is teaching us valuable lessons about distributed systems and Elixir's capabilities:

GenStage's Power: The built-in back pressure handling eliminates entire classes of problems we've dealt with in other systems.

Redis as Coordination Layer: Using Redis for both job storage and rate limiting creates a simple, effective distributed coordination mechanism.

Observability from Day One: Instrumenting with Telemetry from the beginning gives us insights into system behavior that would be hard to add later.

Gradual Feature Addition: Starting with rate limiting, then adding back pressure, and finally batching allowed us to validate each feature before adding complexity.

Looking Forward

Flume is evolving with our needs. We're exploring features like job prioritization, more sophisticated scheduling, and improved monitoring. The modular architecture makes these additions straightforward.

Most importantly, Flume is proving that Elixir's concurrency model isn't just great for web applications—it's excellent for building robust, scalable infrastructure systems too.

As we continue our transition to Elixir-based services, Flume is becoming a crucial piece of our infrastructure puzzle, handling background processing with the reliability and performance our growing platform demands.

Check out the complete Flume implementation: Flume on GitHub