September 20th, 2011
Live Blogging from the Big Data DC Meetup #4 – Chris Burroughs is presenting on Kafka.
Essentially, Kafka is a distributed publish-subscribe messaging system. It provides persistent messaging (that is, protection against restarts and shutdowns) with O(1) disk structures that deliver constant-time performance even with many terabytes of stored messages. (That alone is already fairly impressive.)
The main advantage of Kafka seems to be its high throughput: even on commodity hardware, it can support hundreds of thousands of messages per second.
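The O(1) claim comes from Kafka's append-only commit log: writes always go to the end of a file, and reads seek directly to a stored byte offset, so neither operation depends on how much data is already on disk. Here's a minimal, stdlib-only sketch of that idea – the class and file layout are illustrative, not Kafka's actual on-disk format:

```python
import os
import tempfile

class CommitLog:
    """Toy append-only log: messages appended to a file, read back by offset."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # starting byte offset of each message

    def append(self, message: bytes) -> int:
        """Append a message; O(1) regardless of how large the log already is."""
        offset = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        with open(self.path, "ab") as f:
            # Length-prefixed record: 4-byte big-endian size, then the payload.
            f.write(len(message).to_bytes(4, "big") + message)
        self.offsets.append(offset)
        return len(self.offsets) - 1  # message index

    def read(self, index: int) -> bytes:
        """Seek straight to the stored offset; no scan over earlier messages."""
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            size = int.from_bytes(f.read(4), "big")
            return f.read(size)

log = CommitLog(os.path.join(tempfile.mkdtemp(), "topic.log"))
log.append(b"first")
log.append(b"second")
print(log.read(1))  # → b'second'
```

Because consumers track a position in the log rather than acknowledging individual messages, this layout also makes sequential replay cheap.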
August 3rd, 2011
OK, here at Big Data DC Meetup #3 – the third meetup in the Big Data meetup group – and I will be live blogging. Two main presentations: Joey Echevarria from Cloudera presenting on HBase, and Ted Dunning presenting on MapR.
Due to the explosion in analytical requirements and the limitations of traditional RDBMS-based solutions, big data is the direction most systems are moving in, and HBase and MapReduce are key components to grasp.
Key points from Joey’s presentation
- Column families
- Table regions
Reasons for using HBase – a variable schema in each record, and row-level access to each column family.
- LILY, OpenTSDB
- Real-time ad optimization – capturing impressions and serving ads, with HBase on both the front end and the back end. The user model has about 40 attributes.
- Click stream sessionization
- Mozilla Socorro – gathers Firefox crash reports (which, going by my recent experience, happens a lot)
- Navteq – Location based content serving
- Cloudera – Gathers data about customer clusters, where each customer node is a key with Avro values
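To make the "variable schema in each record" point concrete, here is a stdlib-only sketch of HBase's logical data model: a table maps row keys to cells keyed by (column family, qualifier), each cell holding timestamped versions, and different rows can carry entirely different qualifiers. All the names here are hypothetical illustrations – real access goes through the HBase client API, not this class:

```python
from collections import defaultdict

class SketchTable:
    """Toy model of HBase's logical layout: rows -> (family, qualifier) -> versions."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        # row_key -> {(family, qualifier): [(timestamp, value), ...]}
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value, ts):
        assert family in self.families, "unknown column family"
        cell = self.rows[row_key].setdefault((family, qualifier), [])
        cell.append((ts, value))

    def get(self, row_key, family, qualifier):
        versions = self.rows[row_key].get((family, qualifier), [])
        # Return the newest version, as HBase does by default.
        return max(versions)[1] if versions else None

t = SketchTable(families={"profile", "events"})
t.put("user1", "profile", "name", "Ada", ts=1)
t.put("user1", "events", "click:42", "ad-7", ts=2)  # a qualifier only this row has
t.put("user2", "profile", "name", "Bob", ts=1)
print(t.get("user1", "profile", "name"))  # → Ada
```

Note that only the column families are fixed per table; qualifiers within a family are free-form per row, which is what makes use cases like per-user impression logs fit naturally.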
Key Points from Ted Dunning’s talk
Some Motivations for MapR system
- The assumption that files are read-only (write-once), which doesn't hold in enterprise settings
- Shuffle was based on HTTP
MapR Improvements (Changes to things that exist in Hadoop)
- Faster file system, with fewer copies, multiple NICs, and no file-descriptor or page-buffer contention
- Faster map reduce – Direct RPC to receiver, and very wide merges
MapR Innovations (Things that don’t exist in Hadoop)
- Read/write random access file system that allows distributed meta-data
- Application (framework)-level NIC bonding, instead of switch-level bonding. (Q&A: I asked what the real benefit is, considering that performance is not likely to change. Per Ted, the benefit is in the virtualization of the RPC receivers, so the main innovation here is the abstraction. This idea is very similar to how NTELX's RTS transaction-handler engines scale in the PREDICT system.)
- MapR containers – containers are about 16–32 GB each, and each container can hold up to 1B files and directories. 100 M containers ≈ 2 exabytes, and about 25 GB suffices to cache the container metadata for a 2 EB cluster.
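The container numbers check out as back-of-the-envelope arithmetic, assuming an average container size of ~20 GB (my assumption, taken as roughly the midpoint of the quoted 16–32 GB range):

```python
# Sanity-checking the container figures from the talk.
GB = 10**9
EB = 10**18

containers = 100_000_000           # 100 M containers
avg_container_size = 20 * GB       # assumed midpoint of the 16-32 GB range

total_capacity = containers * avg_container_size
print(total_capacity / EB)         # → 2.0 exabytes

# If 25 GB of cache covers metadata for all 100 M containers,
# that works out to ~250 bytes of cached metadata per container.
cache = 25 * GB
print(cache / containers)          # → 250.0 bytes per container
```

A few hundred bytes of metadata per container is what makes keeping the whole map in memory plausible at exabyte scale.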
MapR’s Streaming Performance
It seems to be about twice as fast for both reads and writes, and about twice as fast on Terasort.
November 9th, 2010
Going to the next Hadoop meetup on Thursday, November 17th. Brendan McAdams will be presenting tips and tricks for analytics and ETL on large datasets, including loading and saving data against MongoDB. Of particular interest to Java and Scala programmers: using Hadoop MapReduce as a native way to process data stored in MongoDB.
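The pattern being described is the classic MapReduce flow applied to MongoDB-style documents: a map phase emits key/value pairs per document, a shuffle groups them by key, and a reduce phase aggregates each group. A minimal, stdlib-only sketch of that flow – the documents, field names, and function names are all hypothetical illustrations, not the connector's actual API:

```python
from collections import defaultdict

# MongoDB-style documents, here counting events per user.
docs = [
    {"_id": 1, "user": "alice", "event": "click"},
    {"_id": 2, "user": "bob",   "event": "view"},
    {"_id": 3, "user": "alice", "event": "view"},
]

def map_phase(doc):
    # Emit (key, value) pairs, as a Hadoop Mapper would.
    yield doc["user"], 1

def reduce_phase(key, values):
    # Aggregate all values for one key, as a Hadoop Reducer would.
    return key, sum(values)

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_phase(doc):
        grouped[key].append(value)

results = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(results)  # → {'alice': 2, 'bob': 1}
```

In the real setup, Hadoop handles the shuffle and parallelism, and the connector reads input splits from (and writes results back to) MongoDB collections instead of in-memory lists.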
Looking forward to it!
September 20th, 2010
Just received from Amazon – “Hadoop: The Definitive Guide”. Now on to the difficult part of actually reading it.