rougth draft (v2) about a disk mirrored via network: ---------------------------------------------------- General: - project name: online disk replicator (aka odr) - Aim: odr is able to share (read/write) a block device among several nodes. if the fs does the locking, several nodes can mount the device at the time. if nodes fail, the remaining ones try not to interupt the service. - this doc is about the protocol part. not the integration in linux kernel. im a protocol guy and need more work to fully understand the possible issues of a kernel implementation. - the main loop will be close to the following steps: 1. build: copy the whole disk in the mirrors disk. 2. init: switch on all the nodes and start the life monitor. 3. run: the writers write on the disk and the mirrors mirror. 4. fail: the life monitor detects the master death and trigger the recovery. 5. recovery: sync it and go to 'run' ---------------------------------------------------------------- Requirement: - it should be as flexible as possible because the feature needed in odr change a lot according to the applications. - possible to have several mirrors and several writers. - possible to mount read/write the device on several cpu assuming the upper layer (i.e. fs) does the proper locking. - the design criteria is the safety. a fs corruption MUST be avoided when possible. efficiency comes after (even if the flexibility of the options may allow to change that order) - must not assume that packet are received in the sent order. required to mirror across internet. ---------------------------------------------------------------- Limitations: - in case of several writers, the fs coherency can't be assured at the block device level only. the lock mecanism required is assumed to be handled by another protocol and is out of the scope of this document. - in 'odr shares a block device across a network', network mostly means LAN, even if the protocol will send routable ip packets, so to mirror across internet is fully supported but not very practical. ---------------------------------------------------------------- Terminologie: - writer: a sender of data. possibly several in the odr system. a writer can be insync or outofsync. a newly elected master is said outofsync. it become insync as soon as it's local image of the block device is syncrohnize with the 'common image'. - mirror: the receiver of the data. one or more in the odr system. nMirror is the number of mirror available. - pool: the set of the writers and the mirrors - node: a member of the pool. it can be either a writer or a mirror. - resyncher: a node currently in resynch phase - takeover: when the writer die and a former mirror become a writer. - life monitor: the software/protocol which monitor the node's life in the pool. if the writer die, a take over is triggered. - local disk: the disk physically in the node box - network disk: the kind of virtual disk shared through the network - nodeid: a uniq identificator for each node. currently it is the ip address. advantage it is easy to gather (the source ip address of each packet). ---------------------------------------------------------------- How to build a mirror from scratch: - the mirror disk is initialized from scratch. copy the whole block device by requesting each and every block from a disk insync. - if the source disk is busy by new operations during the copy, the copy can virtually take forever (e.g. the writer renew the blocks faster than mirror copy them). solution: to slow down the master with a source quench. - to copy the whole disk can be long. it should be avoided when possible. Sometime it is required. e.g. the crash last long and require write operations no more in the log. Moreover, as we arent aware of the fs layout, even the unused blocks will be requested. possible 'outofband' solution: to do a special script for this. freeze all writers, build and exchange a .gz of the device, will be more effective because of the compression (especially in case of 'almost empty' disk) ----------------------------------- way to log the write operations: - a log record is a nodeid(32bits for ipv4), a wseq(64bits, wrap around handled via 'tcp sequence number arithmetic'), a blknb(32bits), a log state(1bit: intend/comitted). - a single log file in which the log records are recorded in the order the blocks are written on the local drive. the current wseq of all the writers are periodically written/flushed in the log file. - IMPORTANT: the log records may be in different orders depending on the hosts. - the flush of all the wseq is used in case of resynch, the resyncher send a request describing all the current writers wseq, the source find the position in the log (all the missing nodes are above this position) and roll the log from here. if the resyncher receive a wseq operation already done, just discard it, it may happen because the log order isnt the same on each hosts. - the block is written in the log with in a 'intend' state before the all the confirmations (from network and local disk), once all confirmations are received, update the log record to a 'commited' state. - the log is a 'normal' file stored on a fs, so to avoid recursive loop, it must not stored in odr device. - the log doesnt store the block data itself, only metadata about the block so a block may have been rewritten. e.g. log (wseq:1, blknb:40) and (wseq:2, blknb:40). you can't have the block used for wseq1. when you replay the log, wseq:1 will give you the block written in wseq:2. a direct consequence: if you replay the log, you must replay it all, you can't stop in the middle of the log and mount the disk. ----------------------------------- how to resync after being off: - the resyncher read its own log file to know what are the last wseq for each writer. chose a peer among the poll (WORK: how ? see read from network) and send him the list of all the wseq. the amount of block to send is likely to be large. - the resynchee is the node which provide the block to the resyncher. - The resynchee find its oldest log record which match the request sent by the resyncher. It is important to get the oldest one because hosts may have log records in different orders. Some block might be transmitted even if the resyncher already got them, but it will discard them because of their older wseq. it is believe that there will be only a 'out of order' blocks. WORK: duplicated with 'way to log' - the resynchee read a given number of block from the log and send them to the resyncher. - the resyncher is in charge to retransmit the request if there is a packet loss. a packet may be a resquest or the one of the replied blocks. - during the resynch, the resyncher doesnt listen to current write operations. - if a read request occurs for a wseq too old to be in the log, 2 possibles actions copy the whole block device or just declare it dead. in any case, we want to avoid it so tune the log size accordingly. WORK: explain the criteria to choose a good log size ----------------------------------- How to write a block on the block device: - the higher layer is assumed to do the proper locks to assure fs coherency. - the writer send the block on the network, and wait for an ack from all nodes. the block request is completed when all the ack has been received. if, after a delay, at least one ack is missing, the block is retransmited. - the block is written on the local disk after sending it to the network to reduce the latency. the local write is likely to be completed before the writer receive all the ack. - there is a delay to wait before to complete the write even if all the ack are received. this delay is used to slow down the writer when needed. the default value is 0. - a ack packet is replied in unicast. ----------------------------------- How to read a block from the block device: - the higher layer is assumed to do the proper lock to assure fs coherency. - if the disk is insync, the block is read on the local drive. - if the local disk is outofsync, the read is made on the network without local cache, so slower but still possible (ha spirit). even a disk less station can use the network disk doing read and write through the network. this feature is used mainly during the disk resync. - if the disk is outofsync, the reader send a request (WORK: to whom ?) with the blknb and wait for the result i.e. block or error code. if, after a timeout, no result has been received, reemit the request. if the destination of the request is no more alive, choose another one. - NOTE: to read a block here is different than to get a block from the network to resynch. here the block is requested by blknb, to resynch the block are requested by wseq. - WORK: to whom send the request ? obviously not to everybody (answer storm). if i know the source, request it from the source, else pick a node at random ? an anycast mecanism ? ----------------------------------- How to receive a block from the network: - each node got a list of all the current wseq for each writter. - if the received wseq is <= than the current the wseq for this nodeid, the block is already there, simply discard it (it is a rexmit for another mirror which hasnt acknowledged) - the ack is sent when the block has been successfully written on the disk. the ack must not be sent when the block is received from the mirror. if the mirror do so, the block write may failed (disk error or crash of the mirror before the write is completed), the writer wont be aware of it. - WORK: if the wseq isnt exactly the next, what to do ? discard ? queue it for latter use (close to ip reassembly problem) ? can it still happen if all block are acked, yes because it doesnt mean there isnt several block unacked at the same time even from a single source. ----------------------------------- How to slow down a writer: - in some cases, it is usefull to slow down a writer. for example: one node try to resync and the writer renew the blocks faster that the resyncer is able to copy them. - each writer got a minimal delay to perform a write operation. this delay is 0 by default. each time a writer receive a source quench it increase this delay and set a timer. if this timer expire the delay is decreased. - the packet is send in unicast to the writer which triggered the source quench. - WORK: how the delay is increased/decreased? how long is the timer? which condition triggers a source quench ? maybe to use a better metric to slow down (more global, not per block...). would be tunned by experience. ------------------------------------------------------------ Transport: - read/write operations require to retransmit packets after a timeout. how to tune the timeout ? a automatic round time trip estimation ala tcp ? a manual configuration ? automatic is better but harder. - the usual ethernet mtu is 1500bytes. so a block greater will be fragmented across ethernet. packet fragmentations wastes bandwidth(specially in case of rexmit), increase the overhead and packet loss probability, it should be avoided when possible. set up a link with with jumbo packets when possible. - either plain IP or UDP. both to be routable. UDP is easily handled from userspace (register a well known port). IP doesnt have the overhead of the udp headers (is it significant when most packets are around 4k?) and require priviledged right to acceed from userspace. if the packet is bigger than the MTU, do the fragmentation at the ip level. - some communications will use multicast (e.g. writer to mirrors). the multicast ip address will be INADDR_ODR_GROUP (register a nonroutable multicast address). In case of mirroring across internet, the address will be configured by the administrator (out of scope of this document). - IPSec is able to provide encryption, authentification and integrity. no security is integrated in the protocol itself. ---------------------------------------------------------------- Difference between the draft v1 to v2: - ability to have several writers at the same time. - ability to read/write the block device even if it is a resync phase, suggested by Marcelo Tosatti. - bugs in log fixed based on Stephen Tweedie's suggestions. - strip all the fancy mecanisms to acknowledge blocks to keep only Wait_N_Ack i.e. each mirror acknowledge each block. the performance isn't that bad because there are several pending write at the same time. Moreover it became too complexe with several writers. (WORK: blabla write by chunk in ext3 WORK: blabla race between fs lock and possible packet loss) - the scope of odr is more clearly defined. odr doesnt handle the quorum problem and the lifemonitor. --------------------------------------------------------------- Credits: - idea of a online disk replicator comes from drbd made by Philipp Reisner. - thanks to Stephen Tweedie and Marcelo Tosatti for their comments. - design of the odr protocol by jerome etienne.