Software for humans: NIO Iterator over messages in mbox file

I've started working on a small project called mbox-iterator that I wish to integrate with mime4j later, when it's more usable.

The idea is to provide an Iterator over all the messages in a mbox file and provide access to the raw data. You can then use mime4j to parse the message and do whatever. It's very good for use cases when you don't need the data to be processed and would like to do your own processing or just need access to the raw data.

The project uses java NIO Memory Mapped files to map the file into memory. This will use the OS to manage the memory and file loading for you you get a ByteBuffer that you can use to accces the date. We are not using this ByteBuffer directly, because it's bytes and we need character sets so we need to decode the bytes. We get use a Charset to get a CharsetDecoder for the encoding we need. CharsetDecoder returns a CharBuffer instance and because it implements CharSequence we can use a regex to determine boundaries between messages.

I was hoping that CharBuffer would share the memory with the MappedByteBuffer that we get initially but it seems that this is not the case. We have some memory copying here because the CharBuffer we get is an instance of java.nio.HeapCharBuffer. I was thinking we could have zero Java heap memory and use just the O.S. pages for holding the mbox data but as it turns out, the process of translating bytes to chars needs some memory to keep the chars.

It would have been nice to use asCharBuffer() and provide zero-copy access, but darn, maybe with a future Java version.

Let's return to our cattle:

When we find such a boundary, we return a slice()ed CharBuffer to that message. We also set position and limit so that we get just the message. This has the advantage of using the same memory as the CharBuffer ww do matching on and avoids unnecessary memory copy operations.

I have tested it on a small mbox (135kb) and performs ok. I'm planning more tests with a 2gb mbox. Using NIO memory mapped files we can map very large files and use the OS cache and buffer memory so we can avoid GC activity and unnecessary memory copy operations.

The following areas need improvements before the project is usable:

- testing for different types of mbox files
- better regex to match different mbox formats (mboxcl, mboxrd and the likes)
- better regex for matching From_ lines (now we miss some From_ lines that have MAILER_DAEMON instead of email address)
- support for mboxes larger than 2GB - use more ByteBuffers to map portions of the files, watch out for mails that spill over boundaries.

I'm pretty new to NIO so if you have suggestions of how to do this better I'm open to suggestions and pull requests.

You will fins a simple example that splits one mbox into individual messages in the project sources.

Happy hacking.

References and links:

[1] http://www.kdgregory.com/index.php?page=java.byteBuffer
[2] http://en.wikipedia.org/wiki/Mbox
[3] http://qmail.org/man/man5/mbox.html
[4] http://james.apache.org/mime4j/index.html
[5] https://github.com/ieugen/mbox-iterator

Software for humans

Faceți căutări pe acest blog

miercuri, 8 februarie 2012

NIO Iterator over messages in mbox file

4 comentarii: