Email Request and Commitment Dataset

This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.



Annotated Email Dataset

Our dataset consists of text spans annotated in 1000 email messages drawn at random from the Enron email corpus. We use the database dump of the Enron corpus (219 MB) released by Andrew Fiore and Jeff Heer. This version of the corpus has been processed to remove duplicate messages and to normalise sender and recipient names, resulting in just over 250,000 email messages. No attachments are included. Our annotations are made by a single annotator.

You can download our dataset by filling out the dataset request form.

Dataset Structure

Our annotated dataset is available as a set of database tables that link back to the message data in the Fiore and Heer dataset. Our annotated dataset includes the following tables:

Annotationspans

This table contains the span annotations for the email text. It includes the following fields:
id
A unique id for each annotation record.
messageid
The id of the message from which this annotated span is extracted.
startoffset
The character offset from the beginning of the message body text at which the marked span begins.
endoffset
The character offset from the beginning of the message body text at which the marked span ends.
userid
The id of the annotator. In this dataset, all annotations are made by two different annotators (userid=1 and 9).
datetime
Timestamp of when the annotation was recorded.
annvalue
The request or commitment category assigned to the span, encoded as an integer. The label corresponding to the annotation value can be found in the spantypes table.
selectedtext
The text of the span that was highlighted and marked by the annotator.
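
The offset fields can be used to recover a span from its message body. The following is a minimal Python sketch, assuming zero-based offsets with an exclusive endoffset; the dataset description does not state either convention, so it is worth checking the result against selectedtext before relying on it. The function names and the sample message body are hypothetical and are not part of the dataset.

```python
def extract_span(body: str, startoffset: int, endoffset: int) -> str:
    """Return the part of a message body covered by an annotation span.

    Assumes zero-based offsets and an exclusive end offset. This is an
    assumption, not something the dataset documentation confirms.
    """
    if not 0 <= startoffset <= endoffset <= len(body):
        raise ValueError("offsets fall outside the message body")
    return body[startoffset:endoffset]


def span_matches(body: str, startoffset: int, endoffset: int,
                 selectedtext: str) -> bool:
    """Check whether the offsets reproduce the stored selectedtext value."""
    return extract_span(body, startoffset, endoffset) == selectedtext


# Made-up message body for illustration only (not taken from the corpus).
body = "Hi Tom,\nPlease send me the report by Friday.\nThanks"
request = "Please send me the report by Friday."
start = body.index(request)
end = start + len(request)

print(extract_span(body, start, end))
print(span_matches(body, start, end, request))
```

If span_matches returns False for real records, the offsets may follow a different convention (for example, an inclusive end offset) or the body text may have been normalised differently.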

Spantypes

This table contains the labels for each of the request and commitment categories that are applied to the highlighted email spans. It includes the following fields:
id
A unique id for each annotation type. These ids are the values stored in the annvalue field in the annotationspans table.
name
The name of each annotation type (e.g., Request Only, Request and Commitment, Phatic Commitment).
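
Because annvalue stores only an integer, reading the annotations usually means joining annotationspans to spantypes. The sketch below illustrates that join using an in-memory SQLite database as a stand-in for the distributed database tables. The table and field names follow the descriptions above, but the column types, the id values assigned to each label, and all sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spantypes (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE annotationspans (
    id INTEGER PRIMARY KEY,
    messageid INTEGER,
    startoffset INTEGER,
    endoffset INTEGER,
    userid INTEGER,
    datetime TEXT,
    annvalue INTEGER,
    selectedtext TEXT
);
""")

# Hypothetical label ids; the real mapping lives in the spantypes table.
conn.executemany(
    "INSERT INTO spantypes (id, name) VALUES (?, ?)",
    [(1, "Request Only"), (2, "Request and Commitment")],
)

# Invented annotation records, not taken from the dataset.
conn.executemany(
    "INSERT INTO annotationspans VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 101, 8, 44, 1, "2009-01-01 10:00:00", 1,
         "Please send me the report by Friday."),
        (2, 102, 0, 35, 9, "2009-01-02 11:30:00", 2,
         "Can you call me? I will send notes."),
    ],
)

# Resolve each integer annvalue to its human-readable label.
rows = conn.execute("""
    SELECT a.messageid, t.name, a.selectedtext
    FROM annotationspans AS a
    JOIN spantypes AS t ON t.id = a.annvalue
    ORDER BY a.id
""").fetchall()

for messageid, label, text in rows:
    print(messageid, label, text)
```

The same SELECT statement should carry over to other SQL databases with little or no change, since it uses only a plain inner join.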

Bodies, Headers, Messages, People, Recipients

These tables contain the body text, recipients and other header information for the annotated messages. Note that these tables are not included in our annotated dataset, but may be obtained from the Fiore and Heer version of the Enron corpus.

Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.


If you make use of any of these resources, please cite one of the following papers:

Andrew Lampert, Robert Dale and Cécile Paris (2010) - Detecting emails containing requests for action, In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2010), pp. 984-992, Los Angeles, USA.
Andrew Lampert, Robert Dale and Cécile Paris (2008) - Requests and Commitments in Email are More Complex Than You Think: Eight Reasons to be Cautious, In Proceedings of the Australasian Language Technology Association Workshop, pp. 64-72, Hobart, Australia.
Andrew Lampert, Robert Dale and Cécile Paris (2008) - The Nature of Requests and Commitments in Email Messages, In Proceedings of EMAIL-08: the AAAI Workshop on Enhanced Messaging, pp. 42-47, Chicago, USA.