Tuesday, December 6, 2011

HBase Backup to Amazon S3


HSearch is our open-source, distributed, real-time NoSQL search engine built on Hadoop and HBase. You can find more about it at http://hadoopsearch.net

We evaluated various options for backing up data stored in HBase and built a solution. This post explains the options we considered and provides the tool for anyone to download and use with their own HBase installation.

Option: Backup the Hadoop DFS
Pros: Block data files are backed up quickly.
Cons: Even when there is no visible external load on HBase, internal processes such as region balancing and compaction keep updating the HDFS blocks, so a raw copy may end up in an inconsistent state. In addition, both HBase and HDFS buffer data in memory and flush it at periodic intervals, which can also leave a raw copy inconsistent.

Option: HBase Import and Export tools
Pros: A Map-Reduce job exports the data to the given output path.
Cons: Providing a path like s3://backupbucket/ fails the program with exceptions such as: Jets3tFileSystemStore failed with AWSCredentials.

Option: HBase Table Copy tools (replication to a parallel cluster)
Pros: Provides another parallel, replicated setup to switch over to.
Cons: A huge investment to keep another parallel environment running just to replicate production data.

After considering these options we developed a simple tool that backs up data to Amazon S3 and restores it when needed. Another requirement was to take a full backup over the weekend and daily incremental backups.

In case of failure, restoring starts from a clean environment: all tables are created and populated with the latest full backup, and then all incremental backups are applied sequentially. However, deletes are not captured in this method, which may leave some unnecessary data in the tables. This is a known disadvantage of this approach to backup and restore.
Internally, the backup program uses the HBase Export and Import tools and runs them as Map-Reduce jobs.
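For reference, the stock HBase Export and Import invocations that get wrapped look roughly like this (table and path arguments are placeholders); the optional start/end time arguments are what make time-windowed incremental exports possible:

$ hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
$ hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>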

Top 10 Features of the backup tool
  1. Export the complete data for a given set of tables to an S3 bucket.
  2. Export data incrementally for a given set of tables to an S3 bucket.
  3. List all complete as well as incremental backup repositories.
  4. Restore a table from a backup, based on the given backup repository.
  5. Runs as Map-Reduce jobs.
  6. In case of connection failure, retries with increasing delays.
  7. Handles special characters such as _ which otherwise cause trouble in the export and import activities.
  8. Enhances the existing Export and Import tools with detailed logging to report failures, rather than just exiting with a program status of 1.
  9. Works with a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLISECOND TIMEZONE) for taking, listing and restoring backups, instead of system tick time or Unix epoch time.
  10. All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals (see the sample crontab at the end of this post).

Setting up the tool

Step # 1 : Download the package from http://hsearch0.94.s3.amazonaws.com/hbackup.install.tar
This package includes the necessary jar files and the source code.
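For example, to fetch and unpack it (the /mnt/hbackup install directory is only an assumption, chosen to match the sample scripts later in this post):

$ wget http://hsearch0.94.s3.amazonaws.com/hbackup.install.tar
$ mkdir -p /mnt/hbackup
$ tar -xf hbackup.install.tar -C /mnt/hbackup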

Step # 2 : Set up a configuration file. Download (or copy) your hbase-site.xml file into a configuration directory and
add the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties to it, as shown below.
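The added section of hbase-site.xml would look roughly like this (the values are placeholders for your own AWS credentials):

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>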

Step # 3 : Set up the classpath with all the jars inside the hbase/lib directory, the hbase.jar file, and the java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar files bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add to the CLASSPATH the configuration directory that holds the hbase-site.xml file (see the sample setenv.sh script below).

Running the tool

Usage: The tool runs in 4 modes: [backup.full], [backup.incremental], [backup.history] and [restore]
----------------------------------------
mode=backup.full tables="comma separated tables" backup.folder=S3-Path  date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLISECOND TIMEZONE"

Ex. mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
Ex. If the date is omitted, the default is the current time:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/

----------------------------------------

mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=In Minutes
            Ex. mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3

This will back up the changes that happened in the last 30 minutes.

----------------------------------------

mode=backup.history backup.folder=S3-Path

Ex. mode=backup.history backup.folder=s3://S3BucketABC/
This will list all past archives. Incremental ones end with .incr

----------------------------------------

mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"

Ex. mode=restore backup.folder=s3://S3BucketABC/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
This will restore the rows archived in that repository. Apply the latest full backup first and then apply the incremental backups in order, as in the sequence below.
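For example, a hypothetical restore sequence for three tables (the archive folder names are placeholders in the format listed by backup.history; repeat the second command for each incremental archive, oldest first):

mode=restore backup.folder=s3://S3BucketABC/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
mode=restore backup.folder=s3://S3BucketABC/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY.incr tables=tab1,tab2,tab3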

-------------------------------------

Some sample scripts to run the backup tool.

$ cat setenv.sh
for file in `ls /mnt/hbase/lib`
do
export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
done

export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH

export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH



$ cat backup_full.sh
. /mnt/hbackup/bin/setenv.sh

dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
echo Backing up for date $dd
for table in `echo table1 table2 table3`
do
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
sleep 10
done
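
A minimal incremental backup script following the same pattern could look like this (table names, bucket and the 30-minute window are placeholders, meant to be run from cron every 30 minutes):

$ cat backup_incremental.sh
. /mnt/hbackup/bin/setenv.sh

# back up only the changes from the last 30 minutes, one table at a time
for table in `echo table1 table2 table3`
do
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.incremental backup.folder=s3://mybucket/ tables=$table duration.mins=30
sleep 10
done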

$ cat list.sh
. /mnt/hbackup/bin/setenv.sh
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket 
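
Since all parameters come from the command line, these scripts can be scheduled with cron. A sample crontab (paths and timings are assumptions: a weekly full backup on Sunday night plus the incremental sketch above every 30 minutes):

$ crontab -l
# weekly full backup, Sunday 01:00
0 1 * * 0 /mnt/hbackup/bin/backup_full.sh >> /var/log/hbackup_full.log 2>&1
# incremental backup every 30 minutes
*/30 * * * * /mnt/hbackup/bin/backup_incremental.sh >> /var/log/hbackup_incr.log 2>&1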

8 comments:

  1. Sunil:

    This tool looks great. Why write about it over here in this ghetto where only a few will learn of your work. Please post on the apache hbase mailing lists.

    Good stuff,
    St.Ack

  2. @Stack: Thank you for the encouraging comment. I appreciate it. Valid point on sharing.

  3. Sunil,

    Backing up HBase is something that is seeing a lot of interest and I'm excited to see more people trying to solve the problem. Can you write up a small post or design document explaining how you do a consistent backup (and restore)? I'd be curious to learn about your approach.

    -Amandeep

  4. Amandeep, sure we will do that. Maybe this week when we get some time, we will publish one.

    -Sunil

  5. I'm curious whether it's possible to use this tool to back up my HBase to
    local storage, or does this tool only support people who run their HBase on Amazon EC2/S3?

    Replies
    1. @Strategist922: Yes, it is possible to back up HBase to local storage. In this case you can make a simple copy:
      flush the table from the HBase shell and then do a hadoop copyToLocal. Hope that helps.

  6. @Bizosys: You did a pretty decent job. Are these jars compatible with the CDH4.* versions? Thank you!

    Replies
    1. @Anonymous: Thank you for your encouraging words! I am afraid we haven't gotten down to checking compatibility with CDH for this tool. OT, we have tested our search engine HSearch (www.hadoopsearch.net) on CDH 3 update 3. We will surely update here when we get to it. Cheers!
