Overview
Conduit-lite refers to a conduit setup in cloud environments where producers typically generate low-volume data streams, so a dedicated Hadoop cluster may not be cost-effective. Instead, the Amazon S3 service is used to store data. S3 is designed to provide extremely high durability (99.999999999%, i.e. eleven 9's). Also note that there is no JobTracker to execute conduit-lite MR jobs; these jobs are executed in Hadoop's local runner (LocalJobRunner) instead.
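For reference, local execution in a Hadoop 1.x client is controlled by the mapred.job.tracker property. The minimal sketch below shows the standard Hadoop setting; conduit-lite's own wiring may differ:

<!-- mapred-site.xml: with mapred.job.tracker set to "local", MR jobs run
     in-process via the LocalJobRunner instead of being submitted to a
     JobTracker. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>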
Topology
The recommended topology to set up conduit-lite in a cluster is:
- Configure an instance of the scribe agent to run locally on the same box as the message publisher. The publisher writes messages to the local scribe agent, which provides high write throughput and isolation from any downstream failure. The scribe agent should be configured to write to the scribe collector(s); it is capable of handling various downstream failure scenarios and ensures there is no data loss by spooling data to its local disk and periodically de-spooling. A sample agent configuration is sketched after this list.
- Configure the scribe collector(s) to write to the S3 bucket used by the conduit-lite setup below.
- Conduit typically supports message sizes of ~2-3 KB; however, it has also been used successfully with messages of up to 10 KB.
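For reference, here is a minimal sketch of a scribe agent configuration that forwards to a collector and spools to local disk on downstream failure; the host name, path, and tuning values are illustrative assumptions, not recommended settings:

port=1463
check_interval=3

# Buffer store: forward to the collector; spool to local disk when it is
# down and de-spool automatically once it recovers.
<store>
category=default
type=buffer
retry_interval=30
retry_interval_range=10

<primary>
type=network
remote_host=collector-vip.example.com
remote_port=1463
</primary>

<secondary>
type=file
fs_type=std
file_path=/var/spool/scribe
base_filename=agent_buffer
max_size=100000000
</secondary>
</store>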
Setup on S3/S3N
Conduit-lite setup
The following link describes how to configure conduit-lite: Setup
Using S3NativeCopyMapper
The S3NativeCopyMapper class performs server-side copies within S3 buckets. This removes the need to stream the data from the S3 bucket to the EC2 client on which the copy mapper runs.
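For illustration only, the shape of such a mapper might look like the sketch below. This is not the actual S3NCopyMapper source: the class name, the key/value layout (source key in, destination key as value), and the s3.copy.bucket property are assumptions, and the AWS SDK client stands in for whatever S3 library the real implementation uses.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical sketch of a server-side S3 copy mapper (not the real class).
public class S3ServerSideCopyMapperSketch extends Mapper<Text, Text, Text, Text> {

  private AmazonS3 s3;
  private String bucket;

  @Override
  protected void setup(Context context) {
    // Credentials and region are resolved from the environment/instance profile.
    s3 = AmazonS3ClientBuilder.defaultClient();
    // "s3.copy.bucket" is an assumed job property naming the target bucket.
    bucket = context.getConfiguration().get("s3.copy.bucket");
  }

  @Override
  protected void map(Text srcKey, Text dstKey, Context context)
      throws IOException, InterruptedException {
    // copyObject is executed server side by S3: the object bytes are
    // duplicated inside S3 and never streamed through this EC2 client.
    s3.copyObject(bucket, srcKey.toString(), bucket, dstKey.toString());
    context.write(srcKey, dstKey);
  }
}

The design point is the same as the real mapper's: because the copy happens inside S3, the MR job's network cost is limited to the copy requests themselves rather than the object data.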
To use S3NativeCopyMapper within conduit-lite, configure the cluster in conduit.xml to use the copy mapper implementation as follows:
<cluster name="" hdfsurl="s3n:///"
         jturl=""
         jobqueuename=""
         copyMapperClass="com.inmobi.conduit.local.S3NCopyMapper"/>
Please note that the S3NativeCopyMapper works with the S3N filesystem.
Benchmarking
Here are the results of various benchmarking tests done on an EC2 cluster:
Latency
See Run benchmarks for how to reproduce these numbers.

Mean latency:
- Collector benchmark: 44900 ms
- Local benchmark: 206291 ms
- Merge benchmark: 293522 ms
- Mirror benchmark: 327832 ms
Notes:
- The above test covers the scenario of the scribe collector writing to an S3 bucket, with the conduit local/merge/mirror stream services processing files in S3. Latency is calculated by consumers that read the data written by the collector/local/merge/mirror stream writers.
- The merge/mirror streams are configured across different regions.
- This test measures mean latency (the average latency to read all messages). In general, the mean latencies for collector/local/merge/mirror are <1 min, 3-4 min, 4-5 min, and 5-6 min respectively.
- This test does not cover percentile latency. However, it should be possible to compute percentile latency using the conduit audit feature exposed through the audit dashboard. Please refer to Conduit visualization for more details.
Throughput
Refer to the producer/agent/collector throughput results.
Notes:
- This test finds the maximum acceptable write throughput of the producer/agent/collector that ensures steady state, i.e. no message loss at the producer and no spooling at the agent/collector.
- The observed throughput is highly dependent on the EC2 instance type used. In general, an m1.medium instance runs into a high load average/CPU %wa and gives inconsistent results when co-hosting the publisher and scribe agent. At least an m1.large instance is recommended.
Capacity planning
Capacity planning should be based on the numbers in the Throughput section above. The aim is to have a conduit-lite setup where the system runs in steady state, i.e. there is no message loss or spooling at any of the publisher, agent, or collector tiers.
- In order to write a total of 'N' MB/s through the conduit-lite layer, the minimum number of collector nodes needed = N / (max collector write throughput). If more than one collector is needed, you will need to configure a VIP layer between the scribe agents and the collectors (see the worked example after this list).
- Similarly, the minimum number of (publisher + scribe agent) nodes needed = N / (max agent write throughput) in steady state. Note that the publisher message queue size and scribe agent settings need to be configured appropriately.
- Scribe agent and collector nodes need to have sufficient disk space available in order to spool/de-spool data in case of a rate mismatch.
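Worked example, with illustrative numbers only (not measured results): suppose the benchmarks show a maximum collector write throughput of 20 MB/s and a maximum agent write throughput of 10 MB/s, and the target aggregate write rate is N = 50 MB/s. Then at minimum ceil(50 / 20) = 3 collector nodes are needed (behind a VIP, since more than one collector is involved), and ceil(50 / 10) = 5 (publisher + scribe agent) nodes are needed to stay in steady state, with spool headroom on disk as noted above.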