This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.
Our dataset consists of 11,881 annotated lines from almost 400 email messages drawn at random from the Enron email corpus. We use the database dump of the Enron corpus (219 MB) released by Andrew Fiore and Jeff Heer. This version of the corpus has been processed to remove duplicate messages and to normalise sender and recipient names, resulting in just over 250,000 email messages. No attachments are included. All annotations were made by a single annotator.
You can download our dataset by filling out the dataset request form.
Our annotated dataset is available as a set of database tables that link back to the message data in the Fiore and Heer dataset. It includes the following tables:
- zoneannotations table
  - id: A unique id for each annotation record.
  - messageid: The id of the message from which this annotated line is extracted.
  - lineid: The id of the exact line to which this annotation refers.
  - userid: The id of the annotator. In this dataset, all annotations are made by the same annotator (userid=1).
  - datetime: Timestamp of when the annotation was recorded.
  - annvalue: The zone annotation, encoded as an integer. The label corresponding to the annotation value can be found in the zonetypes table.
  - errorid: The error annotation, encoded as an integer. The label corresponding to the error value can be found in the errortypes table.
- zonetypes table
  - id: A unique id for each email zone. These ids are the values stored in the annvalue field in the zoneannotations table.
  - name: The name of each email zone.
- email lines table
  - id: A unique id for each line of email text. These ids are the values stored in the lineid field in the zoneannotations table.
  - messageid: The id of the message from which this line is extracted.
  - linetext: The text contained in this line of email text.
  - lineorder: The position of this line in the email body. The first line in a message is given the value 1. This value is incremented for each following line in the message.
- errortypes table
  - id: A unique id for each error class. These ids are the values stored in the errorid field in the zoneannotations table.
  - name: The name of each error class.
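The field descriptions above imply how the tables join: each annotation record points at a line through lineid, at a zone label through annvalue, and at an error label through errorid. The following is a minimal sketch of that join, not the dataset's actual schema. It uses an in-memory SQLite database purely so the example is self-contained; the table name `emaillines`, the column types, the helper function names, and every sample row (including the placeholder zone and error names) are assumptions made for illustration.

```python
import sqlite3

# ASSUMPTIONS: "emaillines" is a hypothetical name for the table of email
# lines. Column types and all sample rows below are invented for this demo.
SCHEMA = """
CREATE TABLE zonetypes  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE errortypes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emaillines (id INTEGER PRIMARY KEY, messageid INTEGER,
                         linetext TEXT, lineorder INTEGER);
CREATE TABLE zoneannotations (id INTEGER PRIMARY KEY, messageid INTEGER,
                              lineid INTEGER, userid INTEGER, datetime TEXT,
                              annvalue INTEGER, errorid INTEGER);
"""


def build_demo_db():
    """Create an in-memory database holding a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder label names, not the real zone or error classes.
    conn.executemany("INSERT INTO zonetypes VALUES (?, ?)",
                     [(1, "zone-a"), (2, "zone-b")])
    conn.executemany("INSERT INTO errortypes VALUES (?, ?)",
                     [(1, "error-placeholder")])
    conn.executemany("INSERT INTO emaillines VALUES (?, ?, ?, ?)",
                     [(10, 7, "Hi Sam,", 1),
                      (11, 7, "See you at noon.", 2)])
    # Annotations are inserted out of line order on purpose.
    conn.executemany("INSERT INTO zoneannotations VALUES (?, ?, ?, ?, ?, ?, ?)",
                     [(100, 7, 11, 1, "2009-01-01 00:00:00", 2, 1),
                      (101, 7, 10, 1, "2009-01-01 00:00:01", 1, 1)])
    return conn


def labelled_lines(conn, messageid):
    """Return (lineorder, linetext, zone name, error name) for one message,
    sorted by the line's position in the email body."""
    query = """
        SELECT l.lineorder, l.linetext, z.name, e.name
        FROM zoneannotations AS a
        JOIN emaillines AS l ON l.id = a.lineid
        JOIN zonetypes  AS z ON z.id = a.annvalue
        JOIN errortypes AS e ON e.id = a.errorid
        WHERE a.messageid = ?
        ORDER BY l.lineorder
    """
    return conn.execute(query, (messageid,)).fetchall()


if __name__ == "__main__":
    for row in labelled_lines(build_demo_db(), 7):
        print(row)
```

Sorting by lineorder restores the original order of lines within a message, which matters when the annotations are used as a sequence of zones and not as isolated lines.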
Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.
If you make use of any of these resources, please cite the following paper:
Andrew Lampert, Robert Dale and Cécile Paris (2009). Segmenting Email Message Text into Zones. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2009), pp. 919-928, Singapore, August 6-7.