
Cross Data Center support #189

Open
mohkar opened this issue Aug 11, 2016 · 2 comments


mohkar commented Aug 11, 2016

Hi antirez,

I would like to know whether Disque supports cross-data-center replication.

@mathieulongtin

It does, as long as the different Disque instances can talk to each other. But you can't control which instances receive a replica, so there are scenarios with no guarantee that a job is replicated to two separate DCs.


vaizki commented Apr 26, 2017

Adding some information I learned while working on PR #200 for eventual replication (my use case is two DCs, each with one Disque instance and some latency between them), plus what I changed in that PR.

Assuming a RTT of 100ms between the DCs:

  • ADDJOB .. REPLICATE 2 will block until confirmation is received from both DCs, so in practice it takes at least one full RTT (~100ms).
  • While the client is blocked waiting for replication, every 50ms the origin node tries to add new nodes for replication and also rebroadcasts to already chosen nodes. This 50ms interval is currently not configurable, but I will put in a PR for that soon.
  • With a 100ms RTT before the GOTJOB arrives from the other DC, at least 2 resends of REPLJOB will happen with EVERY ADDJOB, and each resend triggers an extra GOTJOB as well. This means 3x the normal amount of replication traffic.
  • Also, if the other DC is down, ADDJOB will eventually fail (as it should). The only way to avoid this is to use REPLICATE 2 ASYNC, in which case the job might reach both DCs, but you will never really know.
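A rough back-of-the-envelope sketch of that resend math (plain Python, not Disque source; the function name and the simple timer model are mine):

```python
# Hypothetical model (not Disque internals): estimate how many REPLJOB
# messages a single ADDJOB .. REPLICATE 2 sends to the peer DC when the
# GOTJOB confirmation takes one full RTT to come back and the origin
# rebroadcasts every 50 ms until it sees that confirmation.

def repljob_sends(rtt_ms: int, rebroadcast_ms: int = 50) -> int:
    """Initial REPLJOB plus one resend per rebroadcast period that
    elapses before the GOTJOB (arriving after one RTT) is seen."""
    resends = rtt_ms // rebroadcast_ms  # timer fires while still waiting
    return 1 + resends

# With a 100 ms RTT the timer fires twice before confirmation arrives,
# so 3 REPLJOBs (and 3 matching GOTJOBs) cross the link for every job.
print(repljob_sends(100))  # -> 3
```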

This is why PR #200 adds the SYNC N parameter.
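For reference, here is roughly what the two forms look like from the Disque CLI (the SYNC parameter is the one proposed in PR #200; the port-less invocation, queue name, and job body are just examples):

```shell
# Standard synchronous replication: blocks until both copies are
# confirmed, so across DCs it waits at least one full RTT
# (and eventually fails if the peer DC is down).
disque ADDJOB myqueue "job body" 0 REPLICATE 2

# With PR #200: unblock after 1 synchronous copy, then replicate the
# second copy (the other DC) eventually, retrying every RETRY seconds.
disque ADDJOB myqueue "job body" 0 REPLICATE 2 SYNC 1 RETRY 10
```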

Again with 100ms RTT latency:

  • ADDJOB .. REPLICATE 2 SYNC 1 will block until confirmation is received from ONE DC, which happens immediately on the node receiving the job.
    • If that node is under memory pressure, ADDJOB will wait for the GOTJOB from the other DC (and the 50ms timer will cause unnecessary resends).
  • Because of REPLICATE 2, the job is also sent to the other DC. When a GOTJOB is received from there, the origin node updates its list of confirmed nodes.
  • When the list of confirmed nodes contains REPLICATE entries, the job is considered fully replicated.
  • Every job has a RETRY period that controls things like requeueing, but with PR #200 (Eventual replication, including partially async initial confirmations) it now also checks whether full replication has been reached and, if not, tries to add nodes. I've used RETRY 10, so that happens every 10 seconds.
    • Note: this state is lost if the disque-server is restarted (because we can't change a job once it's in the AOF), so in that case the job will never end up in the other DC. Only the origin node (the one that received the ADDJOB) performs this eventual replication, so even adding more nodes won't help.
  • So, in summary: with REPLICATE 2 SYNC 1 RETRY 10, if the other DC is down, the DC you are connected to will immediately store the job and unblock the client (one synchronous replication achieved). It will then check every 10 seconds whether additional nodes (the other DC) are reachable and, if so, replicate the job there as well.

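The retry behaviour described above can be sketched as a small simulation (plain Python; node names and the function are illustrative, not Disque internals):

```python
# Hypothetical sketch of the eventual-replication pass described above:
# every RETRY seconds the origin node sends the job to reachable nodes
# that have not yet confirmed it, until REPLICATE copies are confirmed.

def eventual_replicate(confirmed: set, reachable: set, replicate: int) -> set:
    """One RETRY-period pass on the origin node. Returns the updated
    set of nodes that have confirmed the job (via GOTJOB)."""
    for node in sorted(reachable - confirmed):
        if len(confirmed) >= replicate:
            break  # full replication already achieved
        confirmed = confirmed | {node}  # node receives the job and ACKs
    return confirmed

# DC-B is down at ADDJOB time: only the origin confirms, SYNC 1 unblocks.
state = {"origin"}
# On a later RETRY pass DC-B is reachable again and gets its copy:
state = eventual_replicate(state, {"origin", "dc-b"}, replicate=2)
print(state)  # -> {'dc-b', 'origin'}
```

Note that, as the comment above says, this loop lives only on the origin node and its progress is not persisted across restarts.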
It's not perfect (mainly because a restart stops the eventual replication attempts), but it works for me and hopefully for someone else as well.
