DISCLAIMER:
Scenario | How to capture | Extent of Data Loss |
---|---|---|
Application bug causes it to crash | received_good count at the agent isn't going up | In-flight messages will be lost, as communication between the producer library and the DatabusAgent is asynchronous
Conduit Agent is down | monitoring of the agent; success count going up | Applications won't be able to connect to the agent and messages are lost. The amount of message loss can be estimated from the application throughput and the agent's downtime (see the loss-estimate sketch after the table)
Agent is running under capacity and traffic spikes to 50-100x | monitor denied_for_rate and denied_for_queue_size from the agent | The application throughput and the interval during which the agent was sending TRY_LATER can be used to measure the extent of loss (see the loss-estimate sketch after the table)
Non-graceful shutdown/crash of Agent | process/education for the Operations team | At most 1 minute of data, which can be cached in memory, can get lost
Non-graceful shutdown/crash of Collector | process/education for the Operations team | At most 1 second of data can get lost
No space on Agent to spool data and all Conduit Collectors are down | spool-space alerting | Data loss can be measured from how long all collectors were down and when the agent started failing to spool data
No space on Collector to spool and HDFS is down | spool-space alerting | The collector pushes back to agents once its in-memory cache is full; the agent then spools until it runs out of disk. To measure data loss here we need to find when the agent stopped spooling and what the application throughput was
All three HDFS datanodes holding a particular file go down before they have flushed everything to disk | HDFS monitoring | Since HDFS doesn't have a POSIX-compliant fsync() API this scenario is possible, but the probability is low as we have a replication factor of 3. At most 1 minute of data can be lost in this scenario
Data is spooled at the agent/collector and the box's hard disk goes bad before it is despooled | monitoring disks for bad sectors/other issues | Data loss is equal to the amount of data spooled
HDFS write() doesn't throw an exception, but sync() fails | | To avoid data loss we call sync() again; in this case at most 1 second worth of data can be replayed (see the sync-retry sketch after the table)
Conduit worker failures due to HDFS errors | | If a run publishes a set of files and fails before committing the transaction, the same set is replayed in the next run to avoid data loss (see the publish/commit sketch after the table). In most cases the number of files equals the number of collectors for LocalStream, and the number of collectors multiplied by the number of clusters for MergedStream and MirrorStream. However, if we are processing a backlog, publish a large number of files, and fail to commit the transaction due to HDFS unavailability, the number could be higher.
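
For the rows above where loss is measured from application throughput and an outage or TRY_LATER window, the arithmetic is simply rate times duration. A minimal sketch, assuming a roughly constant send rate over the window (class, method, and parameter names are illustrative, not part of Conduit):

```java
/**
 * Rough loss estimate for the "agent down" and "TRY_LATER" rows above.
 * Assumes the application's send rate is roughly constant over the window;
 * names and units here are illustrative, not Conduit metrics.
 */
public final class LossEstimate {

    // messagesPerSecond: application throughput from its own metrics
    // outageSeconds:     agent downtime, or the TRY_LATER interval
    public static long estimateLostMessages(long messagesPerSecond, long outageSeconds) {
        return messagesPerSecond * outageSeconds;
    }

    public static void main(String[] args) {
        // e.g. an application pushing 5,000 msg/s into an agent that was down for 2 minutes
        System.out.println(estimateLostMessages(5_000, 120)); // prints 600000
    }
}
```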
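The sync()-failure row says the remedy is to call sync() again, with at most one second of data replayed. A minimal sketch of that retry, assuming Hadoop's FSDataOutputStream (hflush() stands in for the older sync() call; the class name and retry policy are assumptions, not Conduit's actual code):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

/**
 * Sketch of the "call sync() again" idea: retry the flush a bounded number
 * of times, and only surface the failure if every attempt fails, at which
 * point the writer replays at most ~1 second of buffered data.
 */
public class SyncRetrier {

    public static void syncWithRetry(FSDataOutputStream out, int maxAttempts)
            throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                out.hflush();   // push buffered bytes to the datanode pipeline
                return;         // success: nothing needs to be replayed
            } catch (IOException e) {
                last = e;       // remember the failure and try again
            }
        }
        // Every attempt failed: surface the error so the caller replays the
        // last second of buffered data instead of silently dropping it.
        throw last;
    }
}
```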
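The last row relies on the worker publishing files first and committing the transaction afterwards, so an interrupted run is replayed rather than lost. A hedged sketch of that pattern on HDFS, using the FileSystem rename API; the directory layout, class, and method names are assumptions, not Conduit's implementation:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative publish-then-commit sketch: files are first moved into a
 * temporary publish area and only made visible by a final rename (the
 * commit). If the worker dies before the commit, the next run finds the
 * same files and replays them, so a failed run duplicates work but never
 * loses data.
 */
public class PublishCommitSketch {

    private final FileSystem fs;
    private final Path publishTmp;   // e.g. a per-run tmp dir (assumed layout)
    private final Path committedDir; // e.g. the stream's final directory (assumed layout)

    public PublishCommitSketch(FileSystem fs, Path publishTmp, Path committedDir) {
        this.fs = fs;
        this.publishTmp = publishTmp;
        this.committedDir = committedDir;
    }

    public void run(List<Path> collectorFiles) throws IOException {
        // Stage 1: publish. Move collector files into a temp area invisible to
        // readers. A crash here leaves each file either at its source or in
        // publishTmp, and the next run picks it up again (the replay).
        for (Path src : collectorFiles) {
            fs.rename(src, new Path(publishTmp, src.getName()));
        }

        // Stage 2: commit. A single rename makes the whole batch visible at
        // once. A crash before this line means the batch is replayed next run.
        Path finalDir = new Path(committedDir, publishTmp.getName());
        if (!fs.rename(publishTmp, finalDir)) {
            throw new IOException("Commit failed; batch will be replayed in the next run");
        }
    }
}
```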