All users' email is now on replicated servers. This means that every email delivered or deleted, and every other email action performed, is replicated within a second to a completely separate server holding its own complete copy of all users' email.

We now have at least three levels of redundancy: there are three copies of every email, and each of those copies is itself on redundant RAID storage.

  1. All users now have their email stored on a system with RAID disks
    and all servers and RAID arrays have dual power supplies.

    This means a single drive or power supply failure should cause no
    interruption to service at all; we just replace the drive or power
    supply while the system is live and online. Hard drives and power
    supplies are the most commonly failing hardware components in
    computers.

  2. All users now have their email replicated to an identical replica
    system (RAID drives, dual power supplies, etc). Each system is
    completely separate, with its own operating system, filesystem,
    drives, power, connections, etc. The replication is performed at the
    semantic email level, not at the filesystem level, so filesystem
    corruption on the source server will not be replicated. This means
    that if there is disk or filesystem corruption on a single machine,
    we can just switch to the replica (fail over) and it won’t cause a
    multi-day outage.

    Failover is manual, not automatic. Depending on the actual problem
    that occurs and our ability to analyse and respond, it should take
    on the order of minutes to an hour to fail over to a replica if we
    decide it’s needed. In some cases, we may decide it’s easier and
    safer to reboot a frozen or crashed machine than to fail over to the
    replica, so outages of up to an hour are still possible. If we
    believe an outage is going to last longer than that, we will most
    likely fail over to the replica.

    We can also use failover to make maintenance on machines easier. If
    we decide a machine needs servicing (kernel upgrade, hardware
    change, etc), we can safely fail over to a replica machine, do the
    work, start the machine up again, wait for replication to catch up,
    then fail back to the original machine. For users, the only visible
    downtime will be the controlled failover itself, which usually takes
    about 1 minute.

  3. All users have their email store backed up incrementally each night
    to a separate system and RAID array. Backups of an email are kept
    for 1 week after the email is deleted, to allow a restore in case of
    an accident. In an emergency, if both a master and its replica
    server should fail catastrophically, we can still perform a restore
    from this backup.
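
To illustrate why replicating at the semantic email level does not carry filesystem corruption across, here is a minimal Python sketch. It is not our actual replication code: the `Action`, `MailStore` and `replicate` names are hypothetical, and the in-memory dictionaries are only stand-ins for real mail stores. The idea it shows is that the replica receives a log of email actions (deliveries, deletions) and applies them itself, rather than receiving a copy of the master's disk blocks.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """One logical mailbox change (hypothetical model, for illustration only)."""
    kind: str            # "deliver" or "delete"
    user: str
    message_id: str
    body: str = ""


@dataclass
class MailStore:
    """In-memory stand-in for one server's email store."""
    mailboxes: dict = field(default_factory=dict)

    def apply(self, action: Action) -> None:
        box = self.mailboxes.setdefault(action.user, {})
        if action.kind == "deliver":
            box[action.message_id] = action.body
        elif action.kind == "delete":
            box.pop(action.message_id, None)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def replicate(log: list, replica: MailStore, start: int = 0) -> int:
    """Replay log entries from `start` onto the replica; return the new position."""
    for action in log[start:]:
        replica.apply(action)
    return len(log)


master, replica = MailStore(), MailStore()
log = []

for action in [
    Action("deliver", "alice", "m1", "hello"),
    Action("deliver", "alice", "m2", "meeting at 3"),
    Action("delete", "alice", "m1"),
]:
    master.apply(action)   # the master performs the action...
    log.append(action)     # ...and records it for the replica

position = replicate(log, replica)

# Simulate on-disk corruption on the master. It is not an email action,
# so it never enters the log and never reaches the replica.
master.mailboxes["alice"]["m2"] = "\x00\x00garbage"

position = replicate(log, replica, position)
print(replica.mailboxes)   # {'alice': {'m2': 'meeting at 3'}}
```

A filesystem-level mirror would have copied the damaged data along with everything else; here the replica only ever sees the deliveries and deletions, so its copy stays intact and is safe to fail over to.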

We believe this gives us the highest possible reliability while still allowing us to keep growing our user base.