Data collection is the starting point of the Big Data lifecycle: capture, curate, store, analyze, and visualize.
Data is generated by online serving systems and can consist of many kinds of events. This data is used by different business applications such as reporting, business intelligence, machine learning, real-time analytics, and real-time feedback loops.
What is needed is a system that collects the huge volume of event data from online systems and makes it available to consumers as a real-time stream in a seamless fashion.
Earlier, we were using an Rsync-based file push mechanism to transfer data directly to the consuming systems.
The granularity of the files was one minute. Every minute, data was collected from the online systems and pushed to one machine or a set of machines, which in turn uploaded the data to HDFS. This was a very simple solution and worked well for a long time, while data volumes were low and the number of consuming systems was small.
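As a rough illustration of that legacy flow (not the actual scripts), the sketch below shows the HDFS upload step an intermediate machine might perform for one minute's worth of files. The local path, HDFS path, and topic name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MinuteFileUploader {
    public static void main(String[] args) throws Exception {
        // Connect to the default HDFS configured on this machine.
        FileSystem hdfs = FileSystem.get(new Configuration());

        // Per-minute directory that producers rsync'd to this box (hypothetical layout).
        Path localMinuteDir = new Path("file:///data/incoming/clickEvents/2015/06/01/10/05");
        // Hour directory in HDFS under which the minute directory is placed.
        Path hdfsHourDir = new Path("/data/clickEvents/2015/06/01/10");

        hdfs.mkdirs(hdfsHourDir);
        // Copy the whole minute directory into HDFS (delSrc=false, overwrite=true),
        // producing /data/clickEvents/2015/06/01/10/05 on the cluster.
        hdfs.copyFromLocalFile(false, true, localMinuteDir, hdfsHourDir);
    }
}
```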
Over time, data volumes grew and the online serving systems spread across multiple geographies. The data consumers also became very demanding, with complex and disparate needs. The consumer systems could be different Hadoop grids, HBase, or real-time consumers such as fraud detection systems, streaming analytics systems, etc.
The point-to-point, per-minute file push model from producers to end consumers started running into trouble. There was no easy way to monitor or control the flow; if WAN links became flaky, a huge backlog would build up on the producer machines. Because the systems were very tightly coupled, adding a new consumer required a configuration change on the producer boxes.
All destination IP addresses had to be known on the producer machines. The operations teams managing the producer systems and the consumer systems were often different, which made this very difficult to manage. There were many operability challenges: for example, if a box to which Rsync was pushing files went down, the data flow could get choked. There was also no easy visibility into late-arriving events.
This was also expensive in terms of infrastructure. If more than one consumer in a remote data center required the same data stream, a lot of WAN bandwidth was wasted on duplicate data transfers.
What we were looking for was a data collection system. We defined certain goals and characteristics of this system:
- Decouples producers and consumers: consumers need not know about the producer systems at all, and vice versa.
- Reliable and highly scalable
- Suits both batch and low-latency streaming consumers, with no separate data paths; should be Map-Reduce friendly
- Allows consumers to process the data at their own pace
- Efficient: no duplicate data transfers, uses compression
- Works with distributed producers over WAN, with consumers sitting in local or remote datacenters.
Data for a given topic is published into minute directories in HDFS. A topic has three views for consumers: local, merged, and mirror.
The merged view contains data collected from all the data centers. It is not configured by default for every topic; it is set up on demand in a cluster co-located with the consumer application.
The merged view can also be replicated to other clusters; such a replica is called a mirror. The mirror stream can be used for BCP (business continuity planning) purposes.
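The sketch below illustrates, under a purely hypothetical path convention, how a consumer might resolve the HDFS root directory of a topic for each of the three views; the actual layout of the local, merged, and mirror streams is deployment-specific.

```java
import org.apache.hadoop.fs.Path;

public class StreamViews {
    enum View { LOCAL, MERGED, MIRROR }

    // Resolve the HDFS root of a topic for a given view (hypothetical convention):
    // the local view is per data center, while merged and mirror carry all data centers.
    static Path topicRoot(View view, String dataCenter, String topic) {
        switch (view) {
            case LOCAL:  return new Path("/streams/local/" + dataCenter + "/" + topic);
            case MERGED: return new Path("/streams/merged/" + topic);
            default:     return new Path("/streams/mirror/" + topic);
        }
    }

    public static void main(String[] args) {
        System.out.println(topicRoot(View.LOCAL, "dc1", "clickEvents"));   // this data center only
        System.out.println(topicRoot(View.MERGED, "dc1", "clickEvents"));  // all data centers
        System.out.println(topicRoot(View.MIRROR, "dc1", "clickEvents"));  // replica, e.g. for BCP
    }
}
```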
The data is published in HDFS in minute directories of the form [topicName]/yyyy/MM/dd/HH/mm. This layout enables consumer systems (which could be workflows) to consume data at different granularities.
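As a sketch of consuming at a coarser granularity under the layout described above, the snippet below lists all minute directories under one hour of a topic so that a workflow could feed them to an hourly Map-Reduce job. The topic name and root path are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HourlyConsumer {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());

        // One hour of the (hypothetical) clickEvents topic: up to 60 minute directories.
        Path hourDir = new Path("/streams/merged/clickEvents/2015/06/01/10");

        for (FileStatus minute : hdfs.listStatus(hourDir)) {
            // Each entry is a minute directory such as .../10/00, .../10/01, ...
            // whose files could be added as input paths to an hourly Map-Reduce job.
            System.out.println("input: " + minute.getPath());
        }
    }
}
```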
The latency from the time an event is generated to the time it is available for consumption varies between 2 and 5 minutes, depending on the view, i.e., local, merged, or mirror.