There are two parts to scaling out:
- Distributed processing
- Distributed storage
Long ago, Grid technology promised this but failed to deliver, because moving heavy data flows over the wire makes the network the bottleneck. Hadoop HDFS addressed this by intelligently moving the processing near the data.
NOSQL Database Journey

Using this technology, pure NOSQL databases such as HyperTable and HBase were designed. These databases transparently partition data for distributed storage, replication, and processing. The question is: why not use regular databases instead, hosting them on multiple machines, firing queries at each of them, and assimilating the results? Yes, that happened, and many companies took the easy path of distributed processing using the Hadoop Map-Reduce framework over data arranged in traditional databases (Postgres, MySQL); refer to the HadoopDB and AsterData products for details. This works, but availability becomes an issue. If a single server's availability is 90%, the overall availability of 2 servers is 81% (0.90 × 0.90), and this falls drastically as more servers are added to scale out. Replication is a solution, but it breaks memory caching, which many products rely on heavily for vertical scaling.
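The availability falloff described above is simple arithmetic and can be checked directly. A minimal sketch (the 90% per-server figure is the example from the text; the function names are ours):

```python
def availability_without_replication(per_server, n_servers):
    """Every server must be up for a query spanning all of them to
    succeed, so availabilities multiply and shrink as servers are added."""
    return per_server ** n_servers

def availability_with_replication(per_server, n_replicas):
    """With n replicas of the same data, the system fails only when
    every replica is down at once."""
    return 1 - (1 - per_server) ** n_replicas

print(availability_without_replication(0.90, 2))   # ≈ 0.81, matching the text
print(availability_without_replication(0.90, 10))  # ≈ 0.35: scale-out hurts availability
print(availability_with_replication(0.90, 3))      # ≈ 0.999: replication restores it
```

This is why replication is the standard answer to the availability problem, even with the caching cost noted above.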
In the same fashion, KATA and many other products provided distributed processing using the Hadoop Map-Reduce framework over open-source search engines (Lucene and Solr). These also failed to address the high-availability requirement.
Still No Freedom

However, the rigidity that comes with a data structure remains: all these databases require a schema. Early on we envisioned a schema-free design that would let us keep all data in a single table and query it as needed. We knew that search engines usually allow this flexibility. It helps users start their journey by typing the context they want to find rather than browsing a rigid menu system; a menu system is often tied to the underlying data structure:
"Freedom comes by arranging data by value than structure"
But most search engines fail in the enterprise, where data and permissions change very frequently. A search-heavy design breaks down in write-heavy, transactional environments.
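The "arrange data by value" idea can be sketched with a toy inverted index: records of any shape live in one table with no schema enforced, and lookups go by content rather than by column. This is an illustrative sketch, not our engine's implementation; the record names and helper functions are hypothetical:

```python
from collections import defaultdict

# One "table" holds records of any shape; no schema is enforced.
records = {
    1: {"type": "employee", "name": "Asha", "dept": "sales"},
    2: {"type": "invoice", "amount": 120, "dept": "sales"},
    3: {"type": "memo", "text": "quarterly sales review"},
}

# Index each value's tokens back to the record ids containing them,
# so data is reachable by value rather than by structure.
index = defaultdict(set)
for rid, rec in records.items():
    for value in rec.values():
        for token in str(value).lower().split():
            index[token].add(rid)

def search(term):
    """Return all records, of any type, whose values mention the term."""
    return [records[rid] for rid in sorted(index.get(term.lower(), set()))]

print(search("sales"))  # employee, invoice, and memo records all match
```

A user typing "sales" reaches every relevant record without knowing which table or menu it lives under; that is the freedom the quote above describes.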
We modified and extended HBase to build a distributed, scalable search engine. It gave us a schema-free design, the ability to scale out under processing load, and support for a huge volume of writes. We tested this engine at the Intel Innovation Lab by indexing the complete set of Wikipedia documents, 180 million small records, with 100 concurrent users on only 4 machines, demonstrating linear horizontal scalability.