
Monday, 30 May 2011

My GSoC project for this year: Apache James mailbox over Hadoop

The guys at Google were kind enough to select my application for the Google Summer of Code program and turn it into a project, which is kind of cool :B.
This is old news, but I just started blogging and I am planning to make a habit of it.

The Project: Mailbox over Hadoop HDFS for Apache James

My project is with the Apache Software Foundation. I will work on Apache James (Java Apache Mail Enterprise Server). By the end of this summer I will have to deliver mailbox storage over Hadoop HDFS. Storing email on a distributed file system such as HDFS will enable James to handle many email accounts and many emails.

The details of the project are tracked in Jira.

The people

My mentor is Eric Charles. He has been with the project for some time and knows his way around it. He helped me pick up some inside knowledge. I also met other people on the list (Robert Burrell Donkin and Norman Maurer), and they seem very passionate too.

I think things are going to go well.

The status

It's been a week now since the project officially started, and it's been quite busy. I had a lot of new things to learn. The video presentations about Hadoop, HDFS and HBase from Cloudera on Vimeo were very helpful.

After discussing it on the list, we decided to implement the mailbox storage on HBase, to avoid the performance issues that come from HDFS not supporting random file writes. HDFS only supports appends, so once you create a file you can only add data to the end of it. This may seem weird for a file system, but keep in mind that HDFS was designed to store huge amounts of data while providing high-throughput access to it. Random reads and writes are not that easy to offer if you want to keep the performance high.
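To make the append-only limitation more concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not the HDFS API: the class `AppendOnlyFile` and its methods are names I made up for illustration, and characters stand in for bytes. It only shows why changing a small piece of an existing file is expensive when appending is the only write you have.

```java
/**
 * Toy in-memory model of an append-only file.
 * Hypothetical illustration only; this is NOT the HDFS API.
 */
final class AppendOnlyFile {
    private final StringBuilder data = new StringBuilder();
    private long charsWritten = 0;

    /** Appending costs only the size of the new chunk. */
    void append(String chunk) {
        data.append(chunk);
        charsWritten += chunk.length();
    }

    /**
     * There is no in-place write, so "updating" a region means
     * building a brand new file and copying everything into it.
     */
    AppendOnlyFile rewriteWith(int offset, String replacement) {
        String old = data.toString();
        int end = offset + replacement.length();
        if (offset < 0 || end > old.length()) {
            throw new IllegalArgumentException("replacement does not fit in the file");
        }
        AppendOnlyFile copy = new AppendOnlyFile();
        copy.append(old.substring(0, offset)); // unchanged prefix
        copy.append(replacement);              // the actual change
        copy.append(old.substring(end));       // unchanged suffix
        return copy;
    }

    String contents() {
        return data.toString();
    }

    /** Total characters written to this file since it was created. */
    long charsWritten() {
        return charsWritten;
    }
}
```

With a file holding `flag=N;body=hello`, flipping the single flag character through `rewriteWith(5, "Y")` writes the whole content again into the new file. For a mailbox, where flags on messages change all the time, that cost adds up quickly.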

HBase is a NoSQL database implementation. It avoids these issues by appending the changes to a log file, much as journaling file systems do. When the log file gets too large, the changes are committed to the files.
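The idea described above can be sketched in a few lines of plain Java. This is a toy model under my own assumptions, not HBase code and not its API: `LogStructuredStore`, `maxLogEntries` and the key names are hypothetical. Every change is appended to a log, and when the log reaches a size limit the pending changes are written out as a new read-only sorted snapshot. Nothing is ever modified in place.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy model of a store that only appends.
 * Hypothetical illustration only; this is NOT the HBase API.
 */
final class LogStructuredStore {
    private final int maxLogEntries;
    // Pending changes in arrival order: the "log".
    private final List<String[]> log = new ArrayList<>();
    // Read-only sorted snapshots produced by earlier flushes: the "files".
    private final List<Map<String, String>> files = new ArrayList<>();

    LogStructuredStore(int maxLogEntries) {
        this.maxLogEntries = maxLogEntries;
    }

    /** A write is always an append to the log. */
    void put(String key, String value) {
        log.add(new String[] {key, value});
        if (log.size() >= maxLogEntries) {
            flush();
        }
    }

    /** Commit the pending changes as one new sorted snapshot. */
    private void flush() {
        TreeMap<String, String> snapshot = new TreeMap<>();
        for (String[] entry : log) {
            snapshot.put(entry[0], entry[1]); // later entries win
        }
        files.add(Collections.unmodifiableMap(snapshot));
        log.clear();
    }

    /** Look in the newest data first: the log, then newest file to oldest. */
    String get(String key) {
        for (int i = log.size() - 1; i >= 0; i--) {
            if (log.get(i)[0].equals(key)) {
                return log.get(i)[1];
            }
        }
        for (int i = files.size() - 1; i >= 0; i--) {
            String value = files.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int fileCount() {
        return files.size();
    }

    int pendingChanges() {
        return log.size();
    }
}
```

An "update" here is just another append that shadows the older value at read time, which is why this style of storage fits on top of an append-only file system. A real system does much more than this sketch (it has to survive crashes and merge old files, for example).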

There is a similar mailbox implementation based on Cassandra (another NoSQL store), developed by IBM. You can find more about it here.
