This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.
Our dataset consists of text spans annotated in 1000 email messages drawn at random from the Enron email corpus. We use the database dump of the Enron corpus (219 MB) released by Andrew Fiore and Jeff Heer. This version of the corpus has been processed to remove duplicate messages and to normalise sender and recipient names, resulting in just over 250,000 email messages. No attachments are included. Our annotations are made by a single annotator.
You can download our dataset by filling out the dataset request form.
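Our annotated dataset is available as a set of database tables that link back to the message data in the Fiore and Heer dataset. It includes two tables, annotationspans and spantypes, with the fields described below. The first eight fields belong to the annotationspans table and the last two belong to the spantypes table: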
- A unique id for each annotation record
- The id of the message from which this annotated line is extracted
- The character offset from the beginning of the message body text at which the marked span begins.
- The character offset from the beginning of the message body text at which the marked span ends.
- The id of the annotator. In this dataset, all annotations are made by two different annotators (userid=1 and 9).
- Timestamp of when the annotation was recorded.
- The zone annotation, encoded as an integer. The label corresponding to the annotation value can be found in the spantypes table.
- The text of the span that was highlighted and marked by the annotator.
- A unique id for each annotation type. These ids are the values stored in the annvalue field in the annotationspans table.
- The name of each annotation type (e.g., Request Only, Request and Commitment, Phatic Commitment).
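The way these fields fit together can be sketched with a small in-memory example. This is only an illustration under stated assumptions, not the released schema: the table names annotationspans and spantypes and the field name annvalue come from the description above, while every other column name, the messages table, the sample message and the choice of SQLite are hypothetical. It also assumes the end offset is exclusive, which the description does not state, so check this against the real data before relying on it.

```python
import sqlite3

# Hypothetical schema: only annotationspans, spantypes and annvalue are named
# in the dataset description. All other column names, the messages table and
# the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (mid INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE spantypes (typeid INTEGER PRIMARY KEY, typename TEXT);
    CREATE TABLE annotationspans (
        annid INTEGER PRIMARY KEY,
        mid INTEGER,
        startoffset INTEGER,
        endoffset INTEGER,
        userid INTEGER,
        created TEXT,
        annvalue INTEGER,
        spantext TEXT
    );
""")

# A made-up message and one made-up annotation over it.
sample_body = "Hi Bob,\nPlease send me the report.\nThanks"
request = "Please send me the report."
start = sample_body.index(request)
end = start + len(request)  # assumption: end offset is exclusive

conn.execute("INSERT INTO messages VALUES (1, ?)", (sample_body,))
conn.execute("INSERT INTO spantypes VALUES (1, 'Request Only')")
conn.execute(
    "INSERT INTO annotationspans VALUES (1, 1, ?, ?, 1, '2009-01-01 00:00:00', 1, ?)",
    (start, end, request),
)


def load_spans(connection):
    """Return (label, stored span text, text sliced from the body) per annotation."""
    rows = connection.execute("""
        SELECT t.typename, a.spantext, m.body, a.startoffset, a.endoffset
        FROM annotationspans AS a
        JOIN spantypes AS t ON t.typeid = a.annvalue
        JOIN messages AS m ON m.mid = a.mid
        ORDER BY a.annid
    """)
    return [
        (label, stored, msg_body[s:e])
        for label, stored, msg_body, s, e in rows
    ]


spans = load_spans(conn)
for label, stored, sliced in spans:
    # Comparing the stored text with the slice is a quick offset sanity check.
    print(label, stored == sliced)
```

Comparing the stored span text with the text recovered from the offsets is a cheap way to confirm the offset convention (inclusive or exclusive end, and any newline normalisation) before building on the annotations.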
Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.
If you make use of any of these resources, please cite one of the following papers:
Andrew Lampert, Robert Dale and Cécile Paris (2010) - Detecting emails containing requests for action, In Proceedings of Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL 2010), pp. 984-992, Los Angeles, USA.
Andrew Lampert, Robert Dale and Cécile Paris (2008) - Requests and Commitments in Email are More Complex Than You Think: Eight Reasons to be Cautious, In Proceedings of the Australasian Language Technology Association Workshop, pp. 64-72, Hobart, Australia.
Andrew Lampert, Robert Dale and Cécile Paris (2008) - The Nature of Requests and Commitments in Email Messages, In Proceedings of EMAIL-08: the AAAI Workshop on Enhanced Messaging, pp. 42-47, Chicago, USA.