Apache Kafka with Change Data Capture events
The recent release of the Change Data Capture (CDC) events capability in Salesforce provides much-needed functionality for propagating data synchronization changes in a more deterministic and scalable manner.
The CDC capability leverages the streaming APIs to send data updates occurring on Salesforce standard and custom objects to interested subscribers via events and channels.
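Subscribing to a CDC channel is typically done through a CometD (Bayeux) client over the Streaming API. Below is a minimal Java sketch of such a subscriber; the instance URL, the API version in the /cometd path, the access-token handling, and the class name are placeholders for illustration, and a production setup would normally use Salesforce's EMP Connector or equivalent with proper replay and reconnection handling.

```java
import java.util.HashMap;

import org.cometd.bayeux.Message;
import org.cometd.bayeux.client.ClientSessionChannel;
import org.cometd.client.BayeuxClient;
import org.cometd.client.transport.LongPollingTransport;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.Request;
import org.eclipse.jetty.util.ssl.SslContextFactory;

public class CdcSubscriber {
    public static void main(String[] args) throws Exception {
        // Placeholders: obtain these via your preferred OAuth flow
        String instanceUrl = "https://yourInstance.my.salesforce.com";
        String accessToken = System.getenv("SF_ACCESS_TOKEN");

        // Jetty HTTP client with TLS support for the long-polling connection
        HttpClient httpClient = new HttpClient(new SslContextFactory.Client());
        httpClient.start();

        // Attach the OAuth token to every request sent by the transport
        LongPollingTransport transport =
                new LongPollingTransport(new HashMap<String, Object>(), httpClient) {
                    @Override
                    protected void customize(Request request) {
                        request.header("Authorization", "Bearer " + accessToken);
                    }
                };

        BayeuxClient client = new BayeuxClient(instanceUrl + "/cometd/45.0", transport);
        client.handshake();
        client.waitFor(10_000, BayeuxClient.State.CONNECTED);

        // "/data/ChangeEvents" carries change events for all enabled objects;
        // a single-object channel looks like "/data/AccountChangeEvent"
        client.getChannel("/data/ChangeEvents").subscribe(
                (ClientSessionChannel channel, Message message) ->
                        System.out.println("CDC event: " + message.getDataAsMap()));
    }
}
```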
Typical use cases include full replication of Salesforce data as well as incremental and partial updates sent to external systems. This is generally needed for external reporting, archiving, and compliance, as well as for transforming and storing data in a canonical form, among other applications. The API limits for CDC are separate from the SOAP/REST API limits and hence offer more flexibility.
Apache Kafka is the de facto industry standard for large-scale event processing applications. It was developed at LinkedIn before becoming an Apache project. It is primarily targeted at enterprise-class applications that need to process enormous amounts of data with high throughput and horizontal scalability. Apache Kafka is generally deployed as a cluster and supports built-in replication and fault tolerance.
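To bridge the two systems, each CDC event received from Salesforce can be forwarded to a Kafka topic. The following is a minimal sketch of such a producer using the standard Kafka client; the topic name salesforce.cdc.events, the class name, and the choice of keying by record Id are assumptions for illustration, not a prescribed design.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CdcToKafkaBridge {
    private final Producer<String, String> producer;

    public CdcToKafkaBridge(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "all"); // wait for the in-sync replica set to acknowledge each write
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Called from the CDC subscriber for each event; the JSON payload is passed through as-is.
    // Keying by record Id keeps all changes to the same record in one partition, preserving order.
    public void publish(String recordId, String cdcEventJson) {
        producer.send(new ProducerRecord<>("salesforce.cdc.events", recordId, cdcEventJson));
    }

    public void close() {
        producer.close();
    }
}
```

Keying by record Id is one simple way to preserve per-record ordering; other keys (for example, the object name) trade ordering guarantees for different partitioning behavior.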
As depicted above, we can now subscribe to CDC events via Kafka topics and further process them in downstream systems such as Hadoop. Apache Kafka stores every event in its log, by default for up to 7 days. New events are appended to the log, and each event's position is identified by an ID field called the ReplayId. As the name implies, this allows reading data from an arbitrary point in time, which makes it possible to recreate the entire state of the data on Salesforce from scratch if desired.
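On the consuming side, a downstream job can read the forwarded events from the topic and process or archive them. The sketch below assumes the same hypothetical topic name and plain JSON string values; note that Kafka tracks a consumer's position with partition offsets, while the Salesforce ReplayId travels inside the event payload itself.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CdcEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
        props.put("group.id", "cdc-replication");          // consumer group for the downstream job
        props.put("auto.offset.reset", "earliest");        // start from the oldest retained event
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("salesforce.cdc.events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // record.offset() is Kafka's position in the partition log;
                    // the Salesforce ReplayId is carried inside the event JSON.
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```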
Heroku supports Apache Kafka natively through the Apache Kafka on Heroku add-on, and more information can be found here.