Thursday, May 1, 2014

Details of 10 of the World's Big HBase Database Clusters

Over the last couple of years there have been many conference sessions on big data databases. HBase has emerged as a database closely integrated with the Hadoop ecosystem.

At Facebook especially, hundreds of terabytes of data flow into HBase clusters every month. I have compiled these sessions to analyze the similarities among the various implementations and configurations, and to apply the lessons learned when productionizing HBase.


      Facebook Adobe Ebay Groupon OCLC Gap Inc Pinterest Magnify causata Experian
Use Case Social Media
Web Traffic
Search Index
Business use cases Facebook Message Infrastructure
(SMSes, Messages, IM/Chat, Email)
Web Traffic
Business Events
User Interactions
Infra Data
> Data storage for listed items, drives eBay search indexing
> Data storage for ranking data in the future
Deal Personalization System
User clicks
Service Logs
Delivers single-search-box access to 943 million items.
It hosts all of this content
Serving Apparel Catalog from HBase for Live Website

Inventory Updates
String PINs Internet Memory Research Customer Experience Management (Real-time Offer)  
Data Volume Data in cluster (Compression as specified) Records as well as Disk size > 300TB/Month ( Compressed, Unreplicated)
> Search indexes (Extra)
1.9B Records
Billion Rows
1.8 Billion Ownerships
663Million Articles
      1 Billion rows
15 bytes each, compressed Events (Type, timestamp, identifier, attributes)
500+ TB
Why HBase Motivation Architectural constraints High write throughput
Horizontal scalability
Automatic Failover
Atomic read-modify-write operations
Bulk importing
                Limitation of SQL
Schema Flexibility
Key lookup
Cluster Servers Components used in the solution User Directory Service
Application Server
    Email System
Online System
    Scrapers, Parsers, Cleansers, Validators, pyres, HBase, Python web servers      
Hardware Data Center               Amazon
h1.4xlarge + SSD Backed
  # Racks   5+ (Per Capsule) 1     3          
  Network           10GB Interconnect         10GigE 
  #Slaves   15 7 225   44 16 10 to 50     20
  Slaves CPU   16 Cores 16 Cores 24 cores (HT) 24 Virtual Core 8 CPU     Dual Core CPU 2*6 core
Intel X5650
24 core 
  Slaves Memory   48GB 32 GB 72GB RAM 96GB RAM 32 GB RAM 8-16GB RAM   8GB RAM 48GB RAM  
  HBase Memory   24GB   15GB 25GB            
  Slaves Disk   12 * 1 TB 12 2TB  12 * 2TB 8 * 2TB Disks 8TB Disk     15TB/Node 4 x 10K SAS Disk  
  # Masters (NN+HM+ZK)   5 ZK
2 (NN + Backup)
2 (HM + Backup)
1 JT
  5 ZK   6
3 Controls and 3 Edges
3 3 ZK      
  Masters CPU                      
  Masters Memory                      
  Masters Disk                      
Configuration OS OS Flavour             Ubuntu      
    File System             ext4/nodiratime/nodelalloc/lazy_itable_init      
    OS Scheduling             noop      
  Master JVM                      
  Region Server JVM XX:MaxNewSize    
  XX:NewSize     100m              
  MSLAB           Enabled        
  CMSInitiatingOccupancyFraction           Yes        
  XX:MaxNewSize     512m              
  HBase Settings Pre Split Tables       Yes            
  hbase.regionserver.handler.count     50            300000 Increased            
  hbase.hregion.max.filesize     53687091200 10GB            
  hbase.hregion.majorcompaction     0              
  hfile.block.cache.size     0.65     0.6      0.1            0.09              
  hbase.client.scanner.caching     200              
  hbase.hregion.memstore.block.multiplier     4 4            
  hbase.rpc.timeout     600000 Increased   Less        
  hbase.client.pause     3000              
  hbase.hregion.memstore.flush.size       134217728            
  hbase.hstore.blockingStoreFiles       100            
  Zookeeper Settings     5000              
  zookeeper.session.timeout           Less        
  hbase.zookeeper.blockingStoreFiles       False ( 0.94.2)            
  HDFS  Setting dfs.block.size     134217728              
  dfs.heartbeat.recheck.interval           Modified        
  dfs.datanode.max.xcievers     131072              
  Jobtracker     8              
  mapred.tasktracker.reduce.tasks.maximum     6              
Administrative Backup   Uses Scribe
Double Logging
(Application Server and Region Server)
          Synced to S3 with s3-apt plugin
S3 + HBase snapshot + Export Snapshot
  Manage Splitting               w/presplit tables Custom    
  Compaction               Manual Rolling      
  Reverse DNS               Yes on EC2      
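
Several settings in the table are given as raw byte counts, which are hard to compare at a glance. The short Python sketch below converts the values quoted above into binary units. The helper name `human_size` is made up for this illustration, and pairing the flush size with a block multiplier of 4 assumes both values come from the same cluster, which the flattened table does not make certain.

```python
def human_size(num_bytes: int) -> str:
    """Convert a byte count to a binary-unit string (KiB, MiB, GiB, TiB)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value drops below 1024,
        # or at the largest unit we know about.
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


# Raw values copied from the table above.
settings = {
    "hbase.hregion.max.filesize": 53687091200,
    "hbase.hregion.memstore.flush.size": 134217728,
    "dfs.block.size": 134217728,
}

for name, raw in settings.items():
    print(f"{name} = {raw} bytes = {human_size(raw)}")

# HBase blocks updates to a region once its memstore reaches
# flush.size * block.multiplier. Assumption: the multiplier of 4 and the
# 134217728-byte flush size belong to the same cluster.
blocking_threshold = 134217728 * 4
print(f"memstore blocking threshold = {human_size(blocking_threshold)}")
```

Read this way, the 53687091200 figure for hbase.hregion.max.filesize is a 50 GiB region limit, noticeably larger than the 10GB listed in the neighbouring column.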

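The table notes that some clusters pre-split their tables, but the sessions summarized here do not say how the split points were chosen. As a rough illustration only, the Python sketch below computes evenly spaced split keys, assuming row keys start with a uniformly distributed fixed-width hex prefix (for example, a hash of the natural key). The function name `hex_split_points` and the `width` parameter are hypothetical and not taken from any of the talks.

```python
from typing import List


def hex_split_points(num_regions: int, width: int = 8) -> List[str]:
    """Return num_regions - 1 evenly spaced hex split keys.

    Assumption: row keys begin with a uniformly distributed hex prefix
    of `width` characters, so equal key ranges carry roughly equal load.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** width
    return [
        format(i * keyspace // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


# Four regions over a 2-character hex prefix need three boundaries.
print(hex_split_points(4, width=2))
```

The resulting keys could then be passed as split points when the table is created; whether even spacing fits depends entirely on the real key distribution.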