Email Zone Classes

This site contains information and resources related to Andrew Lampert's email text segmentation and classification research.

Zebra Image by arnolouise, licensed under Creative Commons

Email Zones

The nine email zones we use for classification are:

Author: New content from the current email sender. This specifically excludes any text authored by the sender that is included from previous messages.
Greeting: Terms of address and recipient names at the beginning of a message (e.g., Dear/Hi/Hey Noam).
Signoff: The message closing (e.g., Thanks/Cheers/Regards, John).
Reply: Content quoted from a previous message in the same conversation thread, including any embedded signatures, attachments, advertising, disclaimers, author content and forwarded content. Content in a reply content zone may include previously sent content authored by the current sender.
Forward: Content from an email message outside the current conversation thread that has been forwarded by the current email sender, including any embedded signatures, attachments, advertising, disclaimers, author content and reply content.
Signature: Content containing contact or other information that is automatically inserted in a message. In contrast to disclaimer or advertising content, signature content is usually templated content written once by the email author, and automatically or semiautomatically included in email messages. A user may also use a Signature in place of a Signoff; in such cases, we still mark the text as a Signature.
Advertising: Advertising material in an email message. Such material often appears at the end of a message (e.g., Do you Yahoo!?), but may also appear prefixed to or inline with the content of the message (e.g., in sponsored mailing lists).
Disclaimer: Legal disclaimers and privacy statements, often automatically appended.
Attachment: Automated text indicating or referring to attached documents, such as that shown in line 16 of Figure 1. Note that this zone does not apply to manually authored references to attachments, nor to the actual content of attachments (which we do not classify).

Note that while we recognise the need for the Quoted Text zone proposed by Estival et al. (2007), no such data occurs in our collection of annotated email messages. We therefore omit this zone from our current set of zone types.
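
To make the zone set concrete, the following is a minimal Python sketch of how the nine zones might be represented as line-level labels. It is an illustration only, not the annotation format or classifier used in this research: the identifiers (`Zone`, `ZonedLine`, `new_content`) and the sample message are hypothetical, and the zone names are inferred from the terms used in the descriptions above (e.g., Signature, Signoff, reply content).

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    """The nine zone labels. Identifier names are assumptions for this sketch."""
    AUTHOR = "author"
    GREETING = "greeting"
    SIGNOFF = "signoff"
    REPLY = "reply"
    FORWARD = "forward"
    SIGNATURE = "signature"
    ADVERTISING = "advertising"
    DISCLAIMER = "disclaimer"
    ATTACHMENT = "attachment"


@dataclass(frozen=True)
class ZonedLine:
    """One line of message text paired with its zone label."""
    text: str
    zone: Zone


def new_content(lines):
    """Return only the text newly written by the current sender (Author zone)."""
    return [line.text for line in lines if line.zone is Zone.AUTHOR]


# Hypothetical, hand-labelled message used purely for illustration.
message = [
    ZonedLine("Hi Noam,", Zone.GREETING),
    # A manually written mention of an attachment stays in the Author zone.
    ZonedLine("The draft is attached.", Zone.AUTHOR),
    ZonedLine("Cheers, John", Zone.SIGNOFF),
    ZonedLine("> When can I see the draft?", Zone.REPLY),
    ZonedLine("Do you Yahoo!?", Zone.ADVERTISING),
]

print(new_content(message))
```

Labelling at the line level is a simplifying assumption here; it keeps the example short and makes it easy to filter a message down to the sender's new content.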

Our annotated data is licensed under a Creative Commons Attribution-Noncommercial 2.0 Generic License.

If you make use of any of these resources, please cite the following paper:

Andrew Lampert, Robert Dale and Cécile Paris (2009). Segmenting Email Message Text into Zones. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2009), pp. 919-928, Singapore, August 6-7.