Thursday, December 29, 2011

Starting the HBase Server from Eclipse

Why start the HBase server inside Eclipse?
  1. HBase custom filters are a powerful feature that helps move processing close to the data. However, deploying a custom filter requires compiling the filter and its dependent classes, packaging them in a jar and making the jar available to the region server. Every change to the filter code repeats the complete cycle of stopping the server, packaging a new jar, copying it to the hbase lib folder and restarting. (A sketch of such a filter is shown after this list.)
  2. There is no easy way to debug the code written inside these custom filters by putting a breakpoint.
  3. To see the code execution status, you have to look at the region server log file, which means running Eclipse + Cygwin + Notepad just to view the logs.
    Because of these shortfalls, I decided to run HBase from Eclipse: Run as a Java Application or Debug as a Java Application, set breakpoints on my filter classes, and watch the execution path along with the stack.
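    Here is an illustrative sketch (my own example, not the filter shipped in the download) of the kind of custom filter this post is about: it keeps only rows whose key starts with a given prefix. It assumes the HBase 0.90.x filter API, where filters travel to the region server as Writables.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.hbase.filter.FilterBase;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixMatchFilter extends FilterBase {
        private byte[] prefix;

        public PrefixMatchFilter() { }                      // needed for deserialization
        public PrefixMatchFilter(byte[] prefix) { this.prefix = prefix; }

        @Override
        public boolean filterRowKey(byte[] buffer, int offset, int length) {
            // returning true filters the row out
            if (length < prefix.length) return true;
            return Bytes.compareTo(buffer, offset, prefix.length,
                                   prefix, 0, prefix.length) != 0;
        }

        // The filter is shipped to the region server as a Writable
        public void write(DataOutput out) throws IOException {
            Bytes.writeByteArray(out, prefix);
        }
        public void readFields(DataInput in) throws IOException {
            prefix = Bytes.readByteArray(in);
        }
    }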

    Steps to configure HBase for Eclipse
    What we need (All are included in the deployment package - Download HStartup.zip)
    1. chmod.exe program (32 bit is included)
    2. favicon.gif  
    3. HBaseLuncher.class
    4. An unzipped HBase release


    Setting things up
    1. Unzip the attached Eclipse project folder.
    2. Add the jars in the hbase/lib/ folder to the project build path libraries.
    3. Add the hbase/conf folder to the project build path libraries.
    4. Add hbase.jar to the project build path libraries.
    5. Add your project to the Required projects in the Java build path.

    Starting HBase in debug mode
    1. Run HBaseLuncher in debug/run mode (a minimal launcher sketch is shown after this list).
    2. In the Windows system tray, you will see an HBase tray icon.
    3. Right-click the tray icon to start/stop the HBase server.
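    For reference, here is a minimal launcher sketch, assuming the HBase 0.90.x jars and the hbase/conf directory are on the Eclipse build path and hbase.cluster.distributed is false (local mode). The bundled HBaseLuncher additionally adds the tray icon; this stripped-down version just starts a local master, which in local mode also spins up a region server and ZooKeeper, so breakpoints in filter classes are hit inside Eclipse.

    public class LocalHBaseRunner {
        public static void main(String[] args) throws Exception {
            // "start" asks the master command line to launch a local cluster
            org.apache.hadoop.hbase.master.HMaster.main(new String[] { "start" });
        }
    }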

        
    For any issues, please write to me: abinash [at] bizosys [dot] com


    Tuesday, December 27, 2011

    Why Code

    On the Interaction Design Association's LinkedIn group (IXDA), a member posted the query "Do designers need to be able to code?". Some very good points have been raised and debated on how knowledge of HTML, CSS, even XAML, etc. can help designers understand how their designs are translated into production, and how it improves designer-developer communication. The counter-argument is that a designer brings in multiple perspectives (read: generalist) and hence should not get bogged down in code-level details, but instead remain focused on the overall design of the user experience.

    Without taking sides, my view is that there is a larger question here - "Why code at all".

    "Why code" is not a question restricted to designers alone. In the world of desktop apps, for every formal business application there are a large number of informal, Excel-spreadsheet-based applications out there, with some level of code in them too, put in place by analysts and others. There was Coghead, a company with the grand vision of drag-and-drop building of entire applications, databases included, for "tech savvy businesses" (mostly SMBs). To quote from its Crunchbase profile: "Coghead is a WYSIWYG database driven application service aimed at enabling non-developers to solve problems traditionally requiring programming knowledge".

    We are entering a paradigm where code is likely to get pushed deeper under the hood, simultaneously with the rise of a "tech savvy" non-developer lot who use powerful, browser-based, drag-and-drop convenience to cobble together apps - the UI, and the server-side backend using easy API calls. "Why code" is a pertinent question for these consumers.

    At the same time, this is not a magic solution for everything; it may apply only to a narrow class of applications that are mostly standalone, with limited integration with other systems. Yet it could replace many of those local spreadsheet-based apps that are used to save and organize data locally. If you have an idea for such creative uses of browser-based online app building, do share your comments.

    So today HTML, CSS, JavaScript, etc. may appeal for the ability to craft, but tomorrow they may be replaced by the need to know what APIs are available and how to use them. With the proliferation of great tools for non-developers, the "why code" argument is not so much about replacing the need to master yet another skill or art, but about using the limited time available to focus on what needs building - the user experience, better visualization of all that data one is collecting, and so on. With the advent of PaaS (platform as a service) players such as Microsoft Azure, Google App Engine and Salesforce's Force.com, tools that allow "tech savvy" non-developers to play with APIs, make server-side calls and create their own personal apps may become more convenient and easier. I welcome views and comments on what that future could be.

    Wednesday, December 7, 2011

    Three Reasons for another Prototyping Tool

    I share the view of Gartner analyst Mark McDonald and many other analysts that there is a business-IT gap in the typical software development process, especially at the requirements capture stage, which "is the basis for business and IT conversation". The fact is that often, as a project progresses from requirements to detailed design and coding, team members change and project locations shift. Not that there is a dearth of PMI best practices in place or CMM models to oversee governance. Projects still slip!

    I presume there are two reasons here. The obvious one is requirements rigor: design specs are outlined in less time than is desirable. Secondly, information is captured and shared at different granularities across the team and over the project life cycle using a variety of artifacts, with conversations loosely bound together by long chains of emails and telecons. Still, the various relevant perspectives of the analysts, the sponsor, the developers and the UI designers don't come together into a single big picture that addresses the original project charter. Requirements lie scattered across documents, Visio flows, PowerPoint slides, UI prototypes (which require coding), UI wireframes and sketches (of differing fidelity), technical requirements documents, etc., and each follows different conventions and representation styles.

    10Screens was conceptualized to make it easy to create a consistent, high-fidelity view within a short span of time using a drag-and-drop interface. It allows teams to illustrate highly finished-looking screen designs and process flow charts in a single space, bringing all stakeholders onto the same page, and to comment in-page. Being entirely online, 10Screens facilitates sharing and collaboration.

    Another thing about most prototypes is that they remain specs and almost never make it into production code. So the third reason for another prototyping tool is the opportunity to use the iterated prototype as the final, finished production UI! The impact is on overall effort: it saves precious developer time, since developers need not code the UI and can instead focus only on the business logic and on putting together the server side - making calls to the database, retrieving data and serving client requests, and so on. We have been trying this with simple apps and it seems to work. We are excited about the possibilities and promise to keep you posted here. We plan to launch a 'Backend as a Service' to which the prototype can connect directly. Of course, all of this is Cloud based.

    10Screens - UPDATE

    We are happy to announce that 10Screens - Powerpoint for Prototyping, is now free!

    If you are one of our registered users, you must have noticed the 'regular changes' we keep making to the website. Please rest assured that the actual tool (it launches in a new window) and your saved work are intact. We also thank our many users worldwide for taking time out to share views and suggestions since we launched 10Screens in March 2011. We plan to take up your suggestions and improve the product soon. Please keep checking here for updates, and follow us on Twitter @bizosys where we also share announcements and updates.

    Meanwhile, happy prototyping with 10Screens!

    Tuesday, December 6, 2011

    HBase Backup to Amazon S3


    HSearch is our open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase. You can find more about it at http://hadoopsearch.net

    We evaluated various options for backing up data stored in HBase and built a solution. This post explains the options and also provides the solution for anyone to download and implement for their own HBase installation.

    Option 1: Back up the Hadoop DFS
    Pros: Block data files are backed up quickly.
    Cons: Even when there is no visible external load on HBase, internal processes such as region balancing and compaction keep updating HDFS blocks, so a raw copy may end up in an inconsistent state. Secondly, HBase and Hadoop HDFS keep data in memory and flush it at periodic intervals, so a raw copy may again be inconsistent.

    Option 2: HBase Export and Import tools
    Pros: The Map-Reduce job downloads data to the given output path.
    Cons: Providing a path like s3://backupbucket/ fails the program with exceptions such as Jets3tFileSystemStore failed with AWSCredentials.

    Option 3: HBase table copy tools
    Pros: Another parallel, replicated setup to switch to.
    Cons: A huge investment to keep another parallel environment running just to replicate production data.

    After considering these options we developed a simple tool which backs up data to Amazon S3 and restores it when needed. The other requirements were to take a full backup over the weekend and a daily incremental backup.

    In case of failure, the restore first initializes a clean environment with all tables created and populated from the latest full backup, and then applies all incremental backups sequentially. However, deletes are not captured in this method, which may leave some unnecessary data in the tables. This is a known disadvantage of this style of backup and restore.
    Internally, the backup program uses the HBase Import and Export tools and runs the work as Map-Reduce jobs.

    Top 10 Features of the backup tool
    1. Exports the complete data for a given set of tables to an S3 bucket.
    2. Exports data incrementally for a given set of tables to an S3 bucket.
    3. Lists all complete as well as incremental backup repositories.
    4. Restores a table from backup, based on the given backup repository.
    5. Runs as Map-Reduce jobs.
    6. In case of connection failure, retries with increasing delays.
    7. Handles special characters such as _ that otherwise break the export and import activities.
    8. Enhances the existing Export and Import tools with detailed logging to report failures, rather than just exiting with a program status of 1.
    9. Works with a human-readable time format (YYYY.MM.DD 24HH:MINUTE:SECOND:MILLISECOND TIMEZONE) for taking, listing and restoring backups, instead of system tick time or Unix epoch time.
    10. All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals (see the sample crontab entry below).
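    As an illustration of item 10, a crontab entry along these lines could drive a weekly full backup. The script path and log file here are assumptions based on the sample scripts shown further below; adjust them to your installation.

    # min hour day-of-month month day-of-week  command
    0 2 * * 0  /mnt/hbackup/bin/backup_full.sh >> /var/log/hbackup_full.log 2>&1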

    Setting up the tool

    Step # 1 : Download the package from http://hsearch0.94.s3.amazonaws.com/hbackup.install.tar
    This package includes the necessary jar files and the source code.

    Step # 2 : Set up a configuration file. Download the hbase-site.xml file.
    Add the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties to it (a sample snippet is shown below).
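    For example, the added properties in hbase-site.xml would look like the following (the values shown are placeholders, not real credentials):

    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_AWS_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
    </property>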

    Step # 3 : Set up the classpath with all jars inside the hbase/lib directory, the hbase.jar file, and the java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar files bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add the configuration directory that holds the hbase-site.xml file to the CLASSPATH.

    Running the tool

    Usage: It runs in 4 modes as [backup.full], [backup.incremental], [backup.history] and [restore]
    ----------------------------------------
    mode=backup.full tables="comma separated tables" backup.folder=S3-Path  date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"

    Ex. mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
    Ex. If the date is omitted, it defaults to now:
    mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/

    ----------------------------------------

    mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=In Minutes
    Ex. mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3

    This will back up changes that happened in the last 30 mins.

    ----------------------------------------

    mode=backup.history backup.folder=S3-Path

    Ex. mode=backup.history backup.folder=s3://S3BucketABC/
    This will list all past archives. Incremental ones end with .incr

    ----------------------------------------

    mode=restore  backup.folder=S3-Path/ArchiveDate tables="comma separated tables"

    Ex. mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
    This will restore the rows archived for that date. First apply a full backup and then apply the incremental backups in sequence.

    -------------------------------------

    Some sample scripts to run the backup tool.

    $ cat setenv.sh
    for file in `ls /mnt/hbase/lib`
    do
    export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
    done

    export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH

    export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH



    $ cat backup_full.sh
    . /mnt/hbackup/bin/setenv.sh

    dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
    echo Backing up for date $dd
    for table in `echo table1 table2 table3`
    do
    /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
    sleep 10
    done

    $ cat list.sh
    . /mnt/hbackup/bin/setenv.sh
    /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket 

    Monday, November 28, 2011

    Mapping an app on Google AppEngine (GAE) to a custom domain

    After fiddling for an hour or so trying to figure out this seemingly simple stuff, I finally got to map both root and sub-domains from GoDaddy to my app on Google App Engine. So here goes.


    1. Register the domain on Google Apps (this is required as App Engine uses Google Apps to handle redirection)

    2. Add the Google Apps site admin mail id as an admin for the app on GAE. When you do this, an email is sent to the site admin mail id.
    3. Add the domain in Application Settings / Domain Setup. It is important to specify the domain as mydomain.com without the www. Otherwise it will error out.
    4. To link the GoDaddy domain server to Google Apps, the A record has to be set up on GoDaddy; the instructions are available on the Domain Settings page in Google Apps.
    5. Map the naked URL of mydomain.com to www using the Google instructions on the domain page.
    6. Add www or any other sub-domain mapping to the GAE app inside Google Apps. This also requires a www CNAME record pointing to ghs.google.com.
    7. Wait for the domain settings to trickle down and it should work.



    Sunday, November 27, 2011

    No Schema NoSQL database


    There are two parts to scale out:
    • Distributed processing
    • Distributed Storage
    Long ago, Grid technology promised this and failed to deliver because the network becomes the bottleneck when heavy data flows over the wire in a Grid. Hadoop HDFS addressed this by intelligently moving processing near to the data.

    NOSQL Database Journey
    Using this technology, pure NOSQL databases like HyperTable and HBase were designed. These databases transparently break up the data for distributed storage, replication and processing. The question is: why can't they use regular databases, hosting them on multiple machines, firing queries at each of them and assimilating the results? Yes, it happened, and many companies took the easy path of distributed processing using the Hadoop Map-Reduce framework while arranging data in traditional databases (Postgres, MySQL) - refer to the HadoopDB and Aster Data products for details. This works, but availability becomes an issue. If one server's availability is 90%, the overall availability of 2 servers is 81% (0.90 * 0.90), and this falls drastically as more servers are added to scale out. Replication is a solution, but it breaks the memory caching that many products rely on heavily for vertical scaling.
    In the same fashion, Katta and many other products provide distributed processing using the Hadoop Map-Reduce framework over open source search engines (Lucene and Solr). These also fail to address the high-availability requirement.
    Still No Freedom
    However, the rigidity that comes with data structure stays, as all these databases need a schema. Early on we envisioned a schema-free design which would allow us to keep all data in a single table and query it on demand. We knew search engines usually allow this flexibility. It helps users start their journey by typing the context they want to find rather than browsing a rigid menu system; the menu system is often tied to the underlying data structure -
     "Freedom comes by arranging data by value than structure"
     But most search engines fail in the enterprise, where data and permissions change very frequently. The search-heavy design fails in write-heavy transaction environments.
     We modified and extended HBase to build a distributed, scalable search engine. It allowed us a schema-free design that scales out to handle processing load and supports a huge volume of writes. We tested this engine by indexing the complete Wikipedia documents - 180 million small records - with 100 concurrent users on only 4 machines, to prove linear horizontal scalability at the Intel Innovation Lab.

    J2EE application server deployment for HBase File Locality


    "The data node that shares the same physical host has a copy of all data the region server requires. If you are running a scan or get or any other use-case you can be sure to get the best performance." (Reference: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
    Now let's consider a mail server application which stores all mails of users.
    Usually, to handle a large concurrent user base, when a user logs on he is routed to one of the available servers. The routing logic could be round robin or load-based. Once the session is created on the allocated machine, the user starts accessing his mails from HBase. But the data may not be served from the local machine: the user might have logged on to a different machine earlier, and there is a high chance the record was created on the region server co-located with that machine. Since the machine serving the request is not fixed, the information flows over the wire before it is served.


    Co-locating the client and the original region server would minimize this network trip. This is possible by sticky-routing the user to the same machine again and again across days, as long as that node is available. This ensures local data access: same region server, same data node, local disk. But most load balancers are not designed like that; in reality they route based on the number of active connections. That model works fine for balancing CPU and memory. A hybrid model would work best for balancing CPU, memory and network together. (A deliberately naive sketch of the sticky part alone follows.)
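    The sketch below is my own illustration, not part of any product mentioned here: it routes a user to the same application server every time by hashing the user id over a fixed server list. A real load balancer would also have to handle server failures and rebalance.

    import java.util.List;

    public class StickyRouter {
        private final List<String> servers;    // e.g. application server host names

        public StickyRouter(List<String> servers) {
            this.servers = servers;
        }

        // Same user id -> same server, as long as the server list is unchanged
        public String route(String userId) {
            int bucket = (userId.hashCode() & 0x7fffffff) % servers.size();
            return servers.get(bucket);
        }
    }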
    This way of co-locating the application server, HBase region server and HDFS data node may impose a security risk for credit card transactional systems. Such systems may want one more firewall between the database and the application server, which under high traffic will primarily choke the network. In the best interest of security and scalability, information architects need to divide their application's sensitive data (e.g. credit card information) from the "low risk data" when creating the threat model. Based on this, a dedicated remote HBase cluster behind a firewall can be created for serving the sensitive information.

    Value-Name Columnar Structure


    The Lucene search engine lets you define one data structure, and Nutch and other search engines have a predefined structure (e.g. title, url, body). On the other side, in an RDBMS we build different tables for different data structures.
    How can we store various XML formats and documents together and search them? For Nutch and Lucene, we would need to remove field types from the filtration criteria. In a database, each table and each column would need to be examined to find the data, which would be very slow.
    These constraints pushed us to develop a schema that allows search while preserving the structure and supporting write operations. A small sketch of this value-name layout follows.
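    The snippet below is an illustrative sketch of the idea (my own example using the HBase 0.90.x client API, with a hypothetical table named "content" and column family "f"): heterogeneous records live in a single table, and each field is written as a (name, value) cell instead of being fixed in a per-table schema.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ValueNameWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "content");   // one table for all structures
            Put put = new Put(Bytes.toBytes("doc-001"));
            // every field becomes a cell: qualifier = field name, cell value = field value
            put.add(Bytes.toBytes("f"), Bytes.toBytes("title"), Bytes.toBytes("HBase backup to S3"));
            put.add(Bytes.toBytes("f"), Bytes.toBytes("body"), Bytes.toBytes("We evaluated various options..."));
            table.put(put);
            table.close();
        }
    }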