Email Zoning Dataset

This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.


Zebra Image by arnolouise, licensed under Creative Commons

Annotated Email Dataset

Our dataset consists of 11,881 annotated lines from almost 400 email messages drawn at random from the Enron email corpus. We use the database dump of the Enron corpus (219 MB) released by Andrew Fiore and Jeff Heer. This version of the corpus has been processed to remove duplicate messages and to normalise sender and recipient names, resulting in just over 250,000 email messages. No attachments are included. All annotations were made by a single annotator.

You can download our dataset by filling out the dataset request form.

Dataset Structure

Our annotated dataset is available as a set of database tables that link back to the message data in the Fiore and Heer dataset. It includes the following tables:

Zoneannotations

This table contains the zone annotations for the lines of email text. It includes the following fields:
id
A unique id for each annotation record.
messageid
The id of the message from which this annotated line is extracted.
lineid
The id of the exact line to which this annotation refers.
userid
The id of the annotator. In this dataset, all annotations are made by the same annotator (userid=1).
datetime
Timestamp of when the annotation was recorded.
annvalue
The zone annotation, encoded as an integer. The label corresponding to the annotation value can be found in the zonetypes table.
errorid
The error annotation, encoded as an integer. The label corresponding to the error value can be found in the errortypes table.
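
As a rough illustration of how these fields fit together, the sketch below recreates the table in an in-memory SQLite database. The column names come from the field list above; the column types, the choice of SQLite, and the sample row are assumptions made for the example, and the distributed tables may be declared differently.

```python
import sqlite3

# Column names follow the field list above; the types are guesses.
SCHEMA = """
CREATE TABLE zoneannotations (
    id         INTEGER PRIMARY KEY,
    messageid  INTEGER NOT NULL,
    lineid     INTEGER NOT NULL,
    userid     INTEGER NOT NULL,
    "datetime" TEXT,
    annvalue   INTEGER NOT NULL,
    errorid    INTEGER
)
"""

def make_db():
    """Create an empty in-memory database holding the sketched table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn

conn = make_db()
# Invented row for illustration only; it is not taken from the dataset.
conn.execute(
    "INSERT INTO zoneannotations VALUES (?, ?, ?, ?, ?, ?, ?)",
    (1, 42, 1001, 1, "2009-01-01 12:00:00", 3, 0),
)
# All annotations in the dataset share userid=1, so this filter keeps everything.
row = conn.execute(
    "SELECT messageid, lineid, annvalue FROM zoneannotations WHERE userid = 1"
).fetchone()
print(row)
```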

Zonetypes

This table contains the labels for the email zones that are assigned to lines of email text. It includes the following fields:
id
A unique id for each email zone. These ids are the values stored in the annvalue field in the zoneannotations table.
name
The name of each email zone.
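
Resolving an integer annvalue to its zone name amounts to a join between zoneannotations and zonetypes. The sketch below shows one way to do this, assuming SQLite and cut-down versions of both tables; the zone names `zone_a` and `zone_b` and all rows are placeholders, not labels or data from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zonetypes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE zoneannotations (
    id INTEGER PRIMARY KEY, lineid INTEGER, annvalue INTEGER
);
-- Placeholder labels and rows, invented for illustration.
INSERT INTO zonetypes VALUES (1, 'zone_a'), (2, 'zone_b');
INSERT INTO zoneannotations VALUES (10, 100, 1), (11, 101, 2), (12, 102, 1);
""")

def zone_labels(conn):
    """Return (lineid, zone name) pairs by joining annvalue to zonetypes.id."""
    return conn.execute("""
        SELECT a.lineid, t.name
        FROM zoneannotations AS a
        JOIN zonetypes AS t ON t.id = a.annvalue
        ORDER BY a.lineid
    """).fetchall()

labels = zone_labels(conn)
print(labels)
```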

Zonelines

This table contains the lines of email text extracted from the body text of our randomly selected email messages. It includes the following fields:
id
A unique id for each line of email text. These ids are the values stored in the lineid field in the zoneannotations table.
messageid
The id of the message from which this line is extracted.
linetext
The text contained in this line of email text.
lineorder
The position of this line in the email body. The first line in a message is given the value 1. This value is incremented for each following line in the message.
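
Because lineorder starts at 1 and increases by one per line, the annotated lines of a message can be put back in reading order by sorting on that field. The following sketch assumes SQLite and uses invented rows; the helper name `message_lines` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zonelines (
    id INTEGER PRIMARY KEY, messageid INTEGER, linetext TEXT, lineorder INTEGER
);
-- Invented rows, deliberately inserted out of order.
INSERT INTO zonelines VALUES
    (7, 42, 'second line', 2),
    (5, 42, 'first line', 1),
    (9, 43, 'other message', 1);
""")

def message_lines(conn, messageid):
    """Return the line texts of one message, sorted by lineorder."""
    rows = conn.execute(
        "SELECT linetext FROM zonelines WHERE messageid = ? ORDER BY lineorder",
        (messageid,),
    )
    return [text for (text,) in rows]

body = "\n".join(message_lines(conn, 42))
print(body)
```

Sorting on lineorder rather than id avoids relying on the order in which rows happened to be inserted.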

Errortypes

This table contains the labels for the error classes that are assigned to lines of email text during annotation. Errors are intended to capture systematic problems in the underlying email data. The table includes the following fields:
id
A unique id for each error class. These ids are the values stored in the errorid field in the zoneannotations table.
name
The name of each error class.
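
One possible use of this table is to tally how many annotated lines fall into each error class. The sketch below assumes SQLite and cut-down tables; the class names `error_x` and `error_y` and all rows are placeholders rather than values from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE errortypes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE zoneannotations (
    id INTEGER PRIMARY KEY, lineid INTEGER, errorid INTEGER
);
-- Placeholder classes and rows, invented for illustration.
INSERT INTO errortypes VALUES (1, 'error_x'), (2, 'error_y');
INSERT INTO zoneannotations VALUES (1, 100, 1), (2, 101, 1), (3, 102, 2);
""")

def error_counts(conn):
    """Map each error class name to the number of annotations that use it."""
    rows = conn.execute("""
        SELECT e.name, COUNT(a.id)
        FROM errortypes AS e
        LEFT JOIN zoneannotations AS a ON a.errorid = e.id
        GROUP BY e.id, e.name
        ORDER BY e.id
    """)
    return dict(rows.fetchall())

counts = error_counts(conn)
print(counts)
```

The LEFT JOIN keeps error classes that have no annotations, which then appear with a count of zero.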

Bodies, Headers, Messages, People, Recipients

These tables contain the body text, recipient and other header information related to the annotated messages. Note that these tables are not included in our annotated dataset, but may be obtained from the Fiore and Heer version of the Enron corpus.

Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.


If you make use of any of these resources, please cite the following paper:

Andrew Lampert, Robert Dale and Cécile Paris (2009). Segmenting Email Message Text into Zones. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2009), pp. 919-928, Singapore, August 6-7.