Monday, November 28, 2011

Mapping an app on Google AppEngine (GAE) to a custom domain

After fiddling for an hour or so trying to figure out this seemingly simple stuff, I finally managed to map both the root and sub-domains from GoDaddy to my app on Google App Engine. So here goes.

1. Register the domain on Google Apps (this is required as App Engine uses Google Apps to handle redirection)

2. Add the Google Apps site admin mail id as an admin for the app on GAE. When you do this, an email is sent to the site admin mail id.
3. Add the domain in Application Settings / Domain Setup. It is important to specify the domain without the www; otherwise it will error out.
4. To link the GoDaddy domain server to Google Apps, the A records have to be set up on GoDaddy; these instructions are available on the Domain Settings page in Google Apps.
5. Map the naked URL to www using Google's instructions on the domain page.
6. Add the www (or any sub-domain) mapping to the GAE app inside Google Apps. This will also require the www CNAME record to point to Google's server; the target is given on the same page.
7. Wait for the DNS settings to propagate, and it should work.
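Put together, the records from steps 4–6 end up looking roughly like the zone fragment below. `example.com` is a placeholder for your own domain; the A-record IPs and the `ghs.google.com` CNAME target are the values Google documented at the time, so verify them against what your own Domain Settings page shows before copying anything.

```
; Naked domain -> Google (A records from the Google Apps Domain Settings page)
example.com.       A      216.239.32.21
example.com.       A      216.239.34.21
example.com.       A      216.239.36.21
example.com.       A      216.239.38.21

; Sub-domain mapped to the GAE app via Google's hosting service
www.example.com.   CNAME  ghs.google.com.
```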

Sunday, November 27, 2011

No Schema NoSQL database

There are two parts to scaling out:
  • Distributed processing
  • Distributed Storage
Long ago, Grid technology promised this and failed to deliver, because the network becomes the bottleneck in a Grid when heavy data flows over the wire. Hadoop HDFS addressed this by intelligently moving the processing near the data.
NoSQL Database Journey
Using this technology, pure NoSQL databases like HyperTable and HBase were designed. These databases transparently split the data for distributed storage, replication and processing. The question is: why can't we use regular databases, hosting them on multiple machines, firing queries at each of them and assimilating the results? Yes, that happened, and many companies took the easy path of distributed processing using the Hadoop Map-Reduce framework over data arranged in traditional databases (Postgres, MySQL); refer to the HadoopDB and AsterData products for details. This works, but availability becomes an issue. If one server's availability is 90%, the overall availability for 2 servers is 81% (90% * 90%), and this falls drastically as more servers are added to scale out. Replication is a solution to this, but it breaks the memory caching that many products rely on heavily for vertical scaling.
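The availability arithmetic above can be sketched in a few lines. A toy calculation, assuming server failures are independent:

```python
# Availability of a sharded system where EVERY shard must be up for the
# system to answer, and the effect of adding replicas per shard.

def shard_availability(per_server: float, replicas: int) -> float:
    # A shard is down only if all of its replicas are down at once.
    return 1.0 - (1.0 - per_server) ** replicas

def system_availability(per_server: float, shards: int, replicas: int = 1) -> float:
    # The system is up only if every shard is up.
    return shard_availability(per_server, replicas) ** shards

print(round(system_availability(0.90, 2), 2))               # 2 servers -> 0.81
print(round(system_availability(0.90, 10), 3))              # 10 servers -> 0.349
print(round(system_availability(0.90, 10, replicas=3), 3))  # 3 replicas per shard
```

The last line shows why replication is the standard answer: three 90% replicas per shard make each shard 99.9% available, pulling the ten-shard system back toward ~99% at the cost of the cache-friendliness mentioned above.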
In the same fashion, Katta and many other products provided distributed processing using the Hadoop Map-Reduce framework over open-source search engines (Lucene and Solr). These also fail to address the high-availability requirement.
Still No Freedom
However, the rigidity that comes with the data structure stays, as all these databases still need a schema. Early on we envisioned a schema-free design which would allow us to keep all data in a single table and query it on demand. We knew search engines usually allow this flexibility. It helps users start their journey by typing the context they want to find rather than browsing a rigid menu system; the menu system is often tied to the underlying data structure –
 "Freedom comes by arranging data by value than structure"
 But most search engines have failed in the enterprise, where data and permissions change very frequently. The search-heavy design fails in write-heavy transactional environments.
 We modified and extended HBase to build a distributed, scalable search engine. It gave us a schema-free design, scale-out processing, and support for a huge volume of writes. We tested this engine by indexing the complete set of Wikipedia documents, 180 million small records, with 100 concurrent users on only 4 machines, to prove the linear horizontal scalability at the Intel Innovation Lab.
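As a rough illustration of what "schema-free, single table" means, here is a toy in-memory model of an HBase-style wide table. The real thing persists row key → {column qualifier: value} maps across region servers; the `put`/`get` helpers here are simplifications of that idea, not the HBase API:

```python
# Toy model of a schema-free wide table: every record lives in one table,
# and each row carries only the columns it actually has.
table = {}  # row key -> {column qualifier: value}

def put(row_key, **columns):
    # Upsert arbitrary columns on a row; no schema to declare or migrate.
    table.setdefault(row_key, {}).update(columns)

def get(row_key, column=None):
    # Fetch a whole row, or a single column of it.
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

# Heterogeneous records coexist in the same table:
put("doc:1", title="NoSQL databases", body="schema free design")
put("invoice:42", amount="120.50", currency="USD")

print(get("doc:1", "title"))
print(get("invoice:42"))
```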

J2EE application server deployment for HBase File Locality

"The data node that shares the same physical host has a copy of all data the region server requires. If you are running a scan or get or any other use-case, you can be sure to get the best performance." (Reference: )
Now let's consider a mail server application which stores all the mails of its users.
Usually, to handle a large concurrent user base, when a user logs on he is routed to one of the available servers. This logic could be round robin or load-based routing. Once the session is created on the allocated machine, the user starts accessing his mails from HBase. But the data may not be served from the local machine. This is because the user might have logged on to a different machine earlier, and there is a high chance that the record was created on the region server node of that machine. As the requesting machine is not fixed, the information flows over the wire before it is served.

Co-locating the client and the original region server would minimize this network trip. This is possible by sticky-routing the user to the same machine again and again, across days, as long as the node is available. This ensures local data access: the same region server, to the same data node, to the local disk. But most load balancers are not designed like that; in reality they are designed to route based on the number of active connections. That model works OK for balancing CPU and memory. A hybrid model would work best for balancing CPU, memory and network together.
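A minimal sketch of the sticky-routing idea, with hypothetical server names: hash the user id to pick a node, so the same user keeps landing on the same co-located app server (and hence region server), with a simple fallback when that node is down.

```python
# Sticky routing sketch: deterministic user -> server mapping.
import hashlib

SERVERS = ["app-node-1", "app-node-2", "app-node-3"]  # hypothetical nodes

def route(user_id: str, servers=SERVERS) -> str:
    # Use a stable hash (Python's builtin hash() is salted per process,
    # so it would break stickiness across restarts).
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

def route_with_failover(user_id: str, alive, servers=SERVERS) -> str:
    # Stay sticky while the preferred node is up; otherwise walk to the
    # next alive node so the user can still log in.
    start = servers.index(route(user_id, servers))
    for i in range(len(servers)):
        candidate = servers[(start + i) % len(servers)]
        if candidate in alive:
            return candidate
    raise RuntimeError("no servers alive")

# The same user is always routed to the same node:
assert route("alice") == route("alice")
```

A real deployment would use consistent hashing instead of plain modulo (so adding or removing a node remaps only a small fraction of users), layered on top of connection-count balancing to get the hybrid CPU/memory/network model described above.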
This way of co-locating the application server, HBase region server and HDFS data node may impose a security risk for credit card transactional systems. Those kinds of systems may want one more firewall between the database and the application server, and under high traffic that firewall will primarily choke the network. In the best interest of both security and scalability, information architects need to divide their application's sensitive data (e.g. credit card information) from the "low risk data" when creating the threat model. Based on this, a dedicated remote HBase cluster behind a firewall can be created for serving the sensitive information.

Value-Name Columnar Structure

The Lucene search engine lets you define a single data structure, and Nutch and other search engines come with a predefined structure (e.g. title, url, body). On the other side, in an RDBMS we build different tables for different data structures.
How can I store various XML formats and documents together and search them? For Nutch and Lucene, we would need to remove the field types from the filtering criteria. In a database, each table and each column would need to be examined to find the data, which would be very slow.
These constraints pushed us to develop a schema which allows search while preserving the structure and supporting write operations.
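One way to picture the value-name idea: index every (field, value) pair of a record by its value tokens, so a query needs no knowledge of which "table" or "column" holds the data, yet each hit comes back with the field name that held it. A toy sketch of that structure, not the actual engine:

```python
# Value-name index sketch: search by value, recover the name (field).
from collections import defaultdict

index = defaultdict(set)  # token -> {(doc_id, field_name)}

def index_record(doc_id, record):
    # `record` is any flat dict; different records may have different fields,
    # so heterogeneous XML/documents can share one index.
    for field, value in record.items():
        for token in str(value).lower().split():
            index[token].add((doc_id, field))

def search(token):
    # Returns (doc_id, field) pairs regardless of which field matched.
    return sorted(index.get(token.lower(), set()))

index_record("doc1", {"title": "HBase search engine", "author": "anon"})
index_record("po7",  {"vendor": "Acme", "item": "search appliance"})

print(search("search"))  # hits in different fields of different records
```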