WAPBL(9)                   Kernel Developer's Manual                  WAPBL(9)

NAME
     WAPBL, wapbl_start, wapbl_stop, wapbl_begin, wapbl_end, wapbl_flush,
     wapbl_discard, wapbl_add_buf, wapbl_remove_buf, wapbl_resize_buf,
     wapbl_register_inode, wapbl_unregister_inode,
     wapbl_register_deallocation, wapbl_jlock_assert, wapbl_junlock_assert -
     write-ahead physical block logging for file systems

SYNOPSIS
     #include <sys/wapbl.h>

     typedef void (*wapbl_flush_fn_t)(struct mount *, daddr_t *, int *, int);

     int
     wapbl_start(struct wapbl **wlp, struct mount *mp, struct vnode *devvp,
         daddr_t off, size_t count, size_t blksize, struct wapbl_replay *wr,
         wapbl_flush_fn_t flushfn, wapbl_flush_fn_t flushabortfn);

     int
     wapbl_stop(struct wapbl *wl, int force);

     int
     wapbl_begin(struct wapbl *wl, const char *file, int line);

     void
     wapbl_end(struct wapbl *wl);

     int
     wapbl_flush(struct wapbl *wl, int wait);

     void
     wapbl_discard(struct wapbl *wl);

     void
     wapbl_add_buf(struct wapbl *wl, struct buf *bp);

     void
     wapbl_remove_buf(struct wapbl *wl, struct buf *bp);

     void
     wapbl_resize_buf(struct wapbl *wl, struct buf *bp, long oldsz,
         long oldcnt);

     void
     wapbl_register_inode(struct wapbl *wl, ino_t ino, mode_t mode);

     void
     wapbl_unregister_inode(struct wapbl *wl, ino_t ino, mode_t mode);

     void
     wapbl_register_deallocation(struct wapbl *wl, daddr_t blk, int len);

     void
     wapbl_jlock_assert(struct wapbl *wl);

     void
     wapbl_junlock_assert(struct wapbl *wl);

DESCRIPTION
     WAPBL, or write-ahead physical block logging, is an abstraction for file
     systems to write physical blocks in the buffercache(9) to a bounded-size
     log first before their real destinations on disk.  The name means:

         logging         batches of writes are issued atomically via a log

         physical block  only physical blocks, not logical file system
                         operations, are stored in the log

         write-ahead     before writing a block to disk, its new content,
                         rather than its old content for roll-back, is
                         recorded in the log

     When a file system using WAPBL issues writes (as in bwrite(9) or
     bdwrite(9)), they are grouped in batches called transactions in memory,
     which are serialized to be consistent with program order before WAPBL
     submits them to disk atomically.

     Thus, within a transaction, after one write, another write need not wait
     for disk I/O, and if the system is interrupted, e.g. by a crash or by
     power failure, either both writes will appear on disk, or neither will.

     When a transaction is full, it is written to a circular buffer on disk
     called the log.  When the transaction has been written to disk, every
     write in the transaction is submitted to disk asynchronously.  Finally,
     the file system may issue new writes via WAPBL once enough writes
     submitted to disk have completed.

     After interruption, such as a crash or power failure, some writes issued
     by the file system may not have completed.  However, the log is written
     consistently with program order and before file system writes are
     submitted to disk.  Hence a consistent program-order view of the file
     system can be attained by resubmitting the writes that were successfully
     stored in the log using wapbl_replay(9).  This may not be the state the
     file system was in just before the interruption -- writes in
     transactions that did not reach the disk will be excluded.
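
     The commit-then-replay behaviour described above can be illustrated with
     a small userland model.  This is only a sketch: every toy_* name is
     hypothetical, the log here is a flat array rather than a circular buffer,
     and nothing in it reflects the real on-disk format or the code in
     sys/kern/vfs_wapbl.c.

```c
#include <assert.h>
#include <string.h>

#define TOY_NBLOCKS 8	/* blocks on the toy "disk" (assumption) */
#define TOY_LOGCAP  16	/* records the toy log can hold (assumption) */

struct toy_rec { int blkno; int value; };

struct toy_disk {
	int blocks[TOY_NBLOCKS];		/* real destinations */
	struct toy_rec log[TOY_LOGCAP];		/* log records on "disk" */
	int committed;				/* records made durable */
};

struct toy_txn {
	struct toy_rec recs[TOY_LOGCAP];	/* writes batched in memory */
	int n;
};

/* Queue a write in the in-memory transaction; the disk is untouched. */
static void
toy_write(struct toy_txn *t, int blkno, int value)
{
	assert(t->n < TOY_LOGCAP && blkno >= 0 && blkno < TOY_NBLOCKS);
	t->recs[t->n].blkno = blkno;
	t->recs[t->n].value = value;
	t->n++;
}

/* Copy the whole batch into the log, then publish it in one step. */
static int
toy_commit(struct toy_disk *d, struct toy_txn *t)
{
	if (d->committed + t->n > TOY_LOGCAP)
		return -1;			/* log full in this model */
	memcpy(&d->log[d->committed], t->recs, t->n * sizeof(t->recs[0]));
	d->committed += t->n;
	t->n = 0;
	return 0;
}

/* After a crash: reapply every committed record in log order. */
static void
toy_replay(struct toy_disk *d)
{
	for (int i = 0; i < d->committed; i++)
		d->blocks[d->log[i].blkno] = d->log[i].value;
}
```

     In this model, writes queued with toy_write() but never passed to
     toy_commit() leave no trace after toy_replay(), which corresponds to the
     excluded writes mentioned above.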

     For a file system to use WAPBL, its VFS_MOUNT(9) method should first
     replay any journal on disk using wapbl_replay(9), and then, if the mount
     is read/write, initialize WAPBL for the mount by calling wapbl_start().
     The VFS_UNMOUNT(9) method should call wapbl_stop().

     Before issuing any buffercache(9) writes, the file system must acquire a
     shared lock on the current WAPBL transaction with wapbl_begin(), which
     may sleep until there is room in the transaction for new writes.  After
     issuing the writes, the file system must release its shared lock on the
     transaction with wapbl_end().  Either all writes issued between
     wapbl_begin() and wapbl_end() will complete, or none of them will.

     File systems may also witness an exclusive lock on the current
     transaction: WAPBL holds one while it invokes a file system's callback
     when flushing the transaction to disk or aborting a flush.  File systems
     can assert that the transaction is locked with wapbl_jlock_assert(), or
     not exclusively locked with wapbl_junlock_assert().
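
     The interplay of the shared and exclusive locks can be modelled in a few
     lines of userland C.  This is a single-threaded sketch with hypothetical
     toy_* names; where the real routines would sleep, the model simply
     reports failure, and each successful toy_begin() stands for a different
     thread.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_jlock {
	int  readers;	/* holders of the shared lock */
	bool writer;	/* exclusive lock held, i.e. a flush is running */
};

/* Models wapbl_begin(): take a shared lock unless a flush is running. */
static bool
toy_begin(struct toy_jlock *l)
{
	if (l->writer)
		return false;	/* the real routine would sleep here */
	l->readers++;
	return true;
}

/* Models wapbl_end(): drop one shared lock. */
static void
toy_end(struct toy_jlock *l)
{
	assert(l->readers > 0);
	l->readers--;
}

/* Models the start of a flush: needs the lock exclusively. */
static bool
toy_flush_enter(struct toy_jlock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

/* Models the end of a flush or of an aborted flush. */
static void
toy_flush_exit(struct toy_jlock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Condition checked by wapbl_jlock_assert() in this model. */
static bool
toy_is_locked(const struct toy_jlock *l)
{
	return l->writer || l->readers > 0;
}

/* Condition checked by wapbl_junlock_assert() in this model. */
static bool
toy_not_exclusive(const struct toy_jlock *l)
{
	return !l->writer;
}
```

     Note that toy_not_exclusive() stays true while shared locks are held, so
     in this model, as in the description above, it cannot tell a caller
     whether the transaction is entirely unlocked.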

     If a file system requires multiple transactions to initialize an inode,
     and needs to destroy partially initialized inodes during replay, it can
     register them by ino_t inode number before initialization with
     wapbl_register_inode() and unregister them with wapbl_unregister_inode()
     once initialization is complete.  WAPBL does not actually concern itself
     whether the objects identified by ino_t values are `inodes' or `quaggas'
     or anything else -- file systems may use this to list any objects keyed
     by ino_t value in the log.
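
     The bookkeeping implied by registration can be pictured as a small set
     keyed by number and mode.  The sketch below is a userland model with
     hypothetical toy_* names and a fixed capacity chosen for illustration;
     unlike the real routines, which return void, these report failure so
     that the model can be exercised.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAXINO 8	/* capacity of the toy registry (assumption) */

struct toy_inoreg {
	uint64_t ino[TOY_MAXINO];	/* stand-in for ino_t values */
	unsigned mode[TOY_MAXINO];	/* stand-in for mode_t values */
	int n;				/* entries currently registered */
};

/* Record that initialization of ino has begun. */
static int
toy_register_inode(struct toy_inoreg *r, uint64_t ino, unsigned mode)
{
	if (r->n == TOY_MAXINO)
		return -1;
	r->ino[r->n] = ino;
	r->mode[r->n] = mode;
	r->n++;
	return 0;
}

/* Forget ino once initialization is complete; the mode must match. */
static int
toy_unregister_inode(struct toy_inoreg *r, uint64_t ino, unsigned mode)
{
	for (int i = 0; i < r->n; i++) {
		if (r->ino[i] == ino && r->mode[i] == mode) {
			r->n--;
			r->ino[i] = r->ino[r->n];	/* fill the hole */
			r->mode[i] = r->mode[r->n];
			return 0;
		}
	}
	return -1;
}
```

     Whatever is still in the registry at the time of a crash corresponds to
     the partially initialized objects that the file system would want to
     destroy during replay.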

     When a file system frees resources on disk and issues writes to reflect
     the fact, it cannot then reuse the resources until the writes have
     reached the disk.  However, as far as the buffercache(9) is concerned, as
     soon as the file system issues the writes, they will appear to have been
     written.  So the file system must not attempt to reuse the resource until
     the current WAPBL transaction has been flushed to disk.

     The file system can defer freeing a resource by calling
     wapbl_register_deallocation() to record the disk address of the resource
     and length in bytes of the resource.  Then, when WAPBL next flushes the
     transaction to disk, it will pass an array of the disk addresses and
     lengths in bytes to a file-system-supplied callback.  (Again, WAPBL does
     not care whether the `disk address' or `length in bytes' is actually
     that; it will pass along daddr_t and int values.)
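
     Deferred deallocation amounts to queueing address and length pairs and
     handing them to a callback at flush time.  The following userland sketch
     shows that shape only; the toy_* names, the fixed queue size, and the
     void * cookie standing in for struct mount * are all assumptions made
     for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAXDEALLOC 4	/* queue size in this model (assumption) */

typedef int64_t toy_daddr_t;	/* stand-in for daddr_t */
typedef void (*toy_flush_fn_t)(void *, toy_daddr_t *, int *, int);

struct toy_wl {
	toy_daddr_t blks[TOY_MAXDEALLOC];	/* queued addresses */
	int lens[TOY_MAXDEALLOC];		/* queued lengths */
	int cnt;				/* pairs queued so far */
	toy_flush_fn_t flushfn;			/* file system callback */
	void *cookie;				/* passed back to flushfn */
};

/* Queue a resource; it is not reusable until the next flush. */
static int
toy_register_deallocation(struct toy_wl *wl, toy_daddr_t blk, int len)
{
	if (wl->cnt == TOY_MAXDEALLOC)
		return -1;
	wl->blks[wl->cnt] = blk;
	wl->lens[wl->cnt] = len;
	wl->cnt++;
	return 0;
}

/* At flush time, hand every queued pair to the callback, then reset. */
static void
toy_flush(struct toy_wl *wl)
{
	wl->flushfn(wl->cookie, wl->blks, wl->lens, wl->cnt);
	wl->cnt = 0;
}

/* Example callback: add up the bytes that may now be reused. */
static void
toy_count_freed(void *cookie, toy_daddr_t *blks, int *lens, int cnt)
{
	long *total = cookie;

	(void)blks;
	for (int i = 0; i < cnt; i++)
		*total += lens[i];
}
```

     The callback sees nothing until toy_flush() runs, which mirrors the rule
     that a freed resource must not be reused before the transaction has been
     flushed.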

FUNCTIONS
     wapbl_start(wlp, mp, devvp, off, count, blksize, wr, flushfn,
           flushabortfn)
           Start using WAPBL for the file system mounted at mp, storing a log
           of count disk sectors at disk address off on the block device
           devvp, writing blocks in units of blksize bytes.  On success,
           stores an opaque struct wapbl * cookie in *wlp for use with the
           other WAPBL routines and returns zero.  On failure, returns an
           error number.

           If the file system had replayed the log with wapbl_replay(9), then
           wr must be the struct wapbl_replay * cookie used to replay it, and
           wapbl_start() will register any inodes that were in the log as if
           with wapbl_register_inode(); otherwise wr must be NULL.

           flushfn is a callback that WAPBL will invoke as flushfn (mp,
           deallocblks, dealloclens, dealloccnt) just before it flushes a
           transaction to disk, with an exclusive lock held on the
           transaction, where mp is the mount point passed to wapbl_start(),
           deallocblks is an array of dealloccnt disk addresses, and
           dealloclens is an array of dealloccnt lengths, corresponding to the
           addresses and lengths the file system passed to
           wapbl_register_deallocation().  If flushing the transaction to disk
           fails, WAPBL will call flushabortfn with the same arguments to undo
           any effects that flushfn had.

     wapbl_stop(wl, force)
           Flush the current transaction to disk and stop using WAPBL.  If
           flushing the transaction fails and force is zero, return an error
           number.  If flushing the transaction fails and force is nonzero,
           discard the transaction, permanently losing any writes in it.  If
           flushing the transaction is successful or if force is nonzero,
           free memory associated with wl and return zero.
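
           The three outcomes can be written out as a small decision
           function.  This is a userland restatement of the paragraph above,
           not the kernel code; toy_stop() and struct toy_stop_result are
           hypothetical names, and flush_error stands for whatever the
           internal flush attempt reported.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_stop_result {
	int  error;	/* value the stop routine would return */
	bool discarded;	/* transaction thrown away, writes lost */
	bool freed;	/* memory associated with wl released */
};

/* Decide the outcome of stopping, given the flush result and force. */
static struct toy_stop_result
toy_stop(int flush_error, int force)
{
	struct toy_stop_result r = { 0, false, false };

	if (flush_error != 0 && !force) {
		r.error = flush_error;	/* wl stays allocated */
		return r;
	}
	if (flush_error != 0)
		r.discarded = true;	/* forced: unflushed writes are lost */
	r.freed = true;
	return r;
}
```

           Only the combination of a failed flush and a zero force leaves wl
           in place; every other path ends with wl freed and a zero return.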

     wapbl_begin(wl, file, line)
           Wait for space in the current transaction for new writes, flushing
           it if necessary, and acquire a shared lock on it.

           The lock is not exclusive: other threads may acquire shared locks
           on the transaction too.  The lock is not recursive: a thread may
           not acquire it again without calling wapbl_end() first.

           May sleep.

           file and line are the file name and line number of the caller for
           debugging purposes.

     wapbl_end(wl)
           Release a shared lock on the transaction acquired with
           wapbl_begin().

     wapbl_flush(wl, wait)
           Flush the current transaction to disk.  If wait is nonzero, wait
           for all writes in the current transaction to complete.

           The current transaction must not be locked.

     wapbl_discard(wl)
           Discard the current transaction, permanently losing any writes in
           it.

           The current transaction must not be locked.

     wapbl_add_buf(wl, bp)
           Add the buffer bp to the current transaction, which must be locked,
           because someone has asked to write it.

           This is meant to be called from within buffercache(9), not by file
           systems directly.

     wapbl_remove_buf(wl, bp)
           Remove the buffer bp, which must have been added using
           wapbl_add_buf(), from the current transaction, which must be
           locked, because it has been invalidated (or XXX ???).

           This is meant to be called from within buffercache(9), not by file
           systems directly.

     wapbl_resize_buf(wl, bp, oldsz, oldcnt)
           Note that the buffer bp, which must have been added using
           wapbl_add_buf(), has changed size, where oldsz is the previous
           allocated size in bytes and oldcnt is the previous number of valid
           bytes in bp.

           This is meant to be called from within buffercache(9), not by file
           systems directly.

     wapbl_register_inode(wl, ino, mode)
           Register ino with the mode mode as commencing initialization.

     wapbl_unregister_inode(wl, ino, mode)
           Unregister ino, which must have previously been registered with
           wapbl_register_inode() using the same mode, now that its
           initialization has completed.

     wapbl_register_deallocation(wl, blk, len)
           Register len bytes at the disk address blk as ready for
           deallocation, so that they will be passed to the flushfn that was
           given to wapbl_start().

     wapbl_jlock_assert(wl)
           Assert that the current transaction is locked.

           Note that it might not be locked by the current thread: this
           assertion passes if any thread has it locked.

     wapbl_junlock_assert(wl)
           Assert that the current transaction is not exclusively locked by
           the current thread.

           Users of WAPBL observe exclusive locks only in the flushfn and
           flushabortfn callbacks to wapbl_start().  Outside of such contexts,
           the transaction is never exclusively locked, even between
           wapbl_begin() and wapbl_end().

           There is no way to assert that the current transaction is not
           locked at all -- i.e., that the caller may acquire a shared lock on
           the transaction with wapbl_begin() without danger of deadlock.

CODE REFERENCES
     The WAPBL subsystem is implemented in sys/kern/vfs_wapbl.c, with hooks in
     sys/kern/vfs_bio.c.

SEE ALSO
     buffercache(9), vfsops(9), wapbl_replay(9)

BUGS
     WAPBL works only for file system metadata managed via the buffercache(9),
     and provides no way to log writes via the page cache, as in
     VOP_GETPAGES(9), VOP_PUTPAGES(9), and ubc_uiomove(9), which is normally
     used for file data.

     Not only is WAPBL unable to log writes via the page cache, it is also
     unable to defer buffercache(9) writes until cached pages have been
     written.  This manifests as the well-known garbage-data-appended-after-
     crash bug in FFS: when appending to a file, the pages containing new data
     may not reach the disk before the inode update reporting its new size.
     After a crash, the inode update will be on disk, but the new data will
     not be -- instead, whatever garbage data was in the free space will
     appear to have been appended to the file.  WAPBL exacerbates the problem
     by increasing the throughput of metadata writes, because it can issue
     many metadata writes asynchronously that FFS without WAPBL would need to
     issue synchronously in order for fsck(8) to work.

     The criteria for when the transaction must be flushed to disk before
     wapbl_begin() returns are heuristic, i.e. wrong.  There is no way for a
     file system to communicate to wapbl_begin() how many buffers, inodes, and
     deallocations it will issue via WAPBL in the transaction.

     WAPBL mainly supports write-ahead, and has only limited support for
     rolling back operations, in the form of wapbl_register_inode() and
     wapbl_unregister_inode().  Consequently, for example, a large write
     appending to a file, which requires multiple disk block allocations and
     an inode update, must occur in a single transaction -- there is no way to
     roll back the disk block allocations if the write fails in the middle,
     e.g. because of a fault in the middle of the user buffer.

     wapbl_jlock_assert() does not guarantee that the current thread has the
     current transaction locked.  wapbl_junlock_assert() does not guarantee
     that the current thread does not have the current transaction locked at
     all.

     There is only one WAPBL transaction for each file system at any given
     time, and only one WAPBL log on disk.  Consequently, all writes are
     serialized.  Extending WAPBL to support multiple logs per file system,
     partitioned according to an appropriate scheme, is left as an exercise
     for the reader.

     There is no reason for WAPBL to require its own hooks in buffercache(9).

     The on-disk format used by WAPBL is undocumented.

NetBSD 8.0                      March 26, 2015                      NetBSD 8.0