This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.
In the early days of email, widely used conventions for indicating quoted reply content and email signatures made it easy to segment email messages into their functional parts. Today, the explosion of different email formats and styles, coupled with the ad hoc ways in which people vary the structure and layout of their messages, means that simple techniques for identifying quoted replies, which used to yield 95% accuracy, now find less than 10% of such content.
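To make the "simple techniques" concrete, here is a minimal sketch of a convention-based segmenter. It assumes only the two classic plain-text conventions: a leading ">" marks quoted reply text, and a line consisting of "-- " starts the signature. The function name and the three labels are hypothetical, chosen for illustration; this is not the Zebra system.

```python
def naive_zones(body: str) -> list[tuple[str, str]]:
    """Label each line 'quoted', 'signature' or 'author' using only
    the classic plain-text email conventions (illustrative sketch)."""
    labelled = []
    in_signature = False
    for line in body.splitlines():
        if in_signature:
            # Everything after the signature delimiter is treated as signature.
            label = "signature"
        elif line in ("-- ", "--"):
            # Conventional delimiter; some clients strip the trailing space.
            in_signature = True
            label = "signature"
        elif line.lstrip().startswith(">"):
            label = "quoted"
        else:
            label = "author"
        labelled.append((label, line))
    return labelled


if __name__ == "__main__":
    message = "Hi Bob,\n> see you at 3?\nYes.\n-- \nAlice"
    for label, line in naive_zones(message):
        print(f"{label:10} {line}")
```

A reply quoted under a header such as "-----Original Message-----", with no ">" prefixes, is labelled entirely as author text by this sketch, which is the kind of failure described above.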
Zebra is an SVM-based system for segmenting the body text of email messages into nine zone types based on graphic, orthographic and lexical cues. Zebra performs this task with an accuracy of 87.01%; when the number of zones is abstracted to two or three zone classes, this increases to 93.60% and 91.53% respectively.
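As a rough illustration of what graphic, orthographic and lexical cues might look like for a single line, the sketch below turns a line of text into a small feature dictionary. The specific features, the greeting word list and the function name are all assumptions made for this example; they are not Zebra's actual feature set. Vectors like these could then be passed to a classifier such as an SVM, which is not shown here.

```python
import string

# Hypothetical lexical cue list, for illustration only.
GREETING_WORDS = {"hi", "hello", "dear"}


def line_features(line: str) -> dict[str, float]:
    """Compute a few illustrative per-line cues (not Zebra's real features)."""
    stripped = line.strip()
    n = len(stripped)
    # Lower-cased tokens with surrounding punctuation removed.
    tokens = [t.strip(string.punctuation).lower() for t in stripped.split()]
    return {
        # Graphic cues: layout-level properties of the line.
        "length": float(n),
        "starts_with_quote_marker": float(stripped.startswith(">")),
        # Orthographic cues: character-level properties.
        "punct_ratio": sum(c in string.punctuation for c in stripped) / n if n else 0.0,
        "upper_ratio": sum(c.isupper() for c in stripped) / n if n else 0.0,
        # Lexical cue: presence of a word from a small cue list.
        "has_greeting_word": float(any(t in GREETING_WORDS for t in tokens)),
    }


if __name__ == "__main__":
    print(line_features("Hi Bob,"))
    print(line_features("> see you at 3?"))
```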
Our dataset consists of 11,881 annotated lines from almost 400 email messages drawn at random from the Enron email corpus. We use the database dump of the Enron corpus (219 MB) released by Andrew Fiore and Jeff Heer. This version of the corpus has been processed to remove duplicate messages and to normalise sender and recipient names, resulting in just over 250,000 email messages. No attachments are included. All annotations were made by a single annotator.
Our annotated dataset is available as a set of database tables that link back to the message data in the Fiore and Heer dataset. Find out more about the structure of our dataset or go ahead and download it.
The Zebra Email Segmentation Tool will be available for download. Right now, I'm working through some licensing issues before I can make it available. If you're keen to get a status update, please contact me.
Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.
If you make use of any of these resources, please cite the following paper:
Andrew Lampert, Robert Dale and Cécile Paris (2009) - Segmenting Email Message Text into Zones, In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2009), pp. 919-928, Singapore, August 6-7.