This site contains information and resources related to Andrew Lampert's
email text segmentation and classification research.
Email Zones
The nine email zones we use for classification are:
- Author: New content from the current email sender. This specifically excludes any text authored by the sender that is included from previous messages.
- Greeting: Terms of address and recipient names at the beginning of a message (e.g., Dear/Hi/Hey Noam).
- Signoff: The message closing (e.g., Thanks/Cheers/Regards, John).
- Reply: Content quoted from a previous message in the same conversation thread, including any embedded signatures, attachments, advertising, disclaimers, author content and forwarded content. A Reply zone may include previously sent content authored by the current sender.
- Forward: Content from an email message outside the current conversation thread that has been forwarded by the current email sender, including any embedded signatures, attachments, advertising, disclaimers, author content and reply content.
- Signature: Content containing contact or other information that is automatically inserted in a message. In contrast to disclaimer or advertising content, signature content is usually templated content written once by the email author, and automatically or semi-automatically included in email messages. A user may also use a Signature in place of a Signoff; in such cases, we still mark the text as a Signature.
- Advertising: Advertising material in an email message. Such material often appears at the end of a message (e.g., Do you Yahoo!?), but may also appear prefixed or inline with the content of the message (e.g., in sponsored mailing lists).
- Disclaimer: Legal disclaimers and privacy statements, often automatically appended.
- Attachment: Automated text indicating or referring to attached documents, such as that shown in line 16 of Figure 1. Note that this zone does not apply to manually authored references to attachments, nor to the actual content of attachments (which we do not classify).
Note that while we recognise the need for the Quoted Text zone
proposed by Estival et al. (2007), no such data occurs in our
collection of annotated email messages. We therefore omit
this zone from our current set of zone types.
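As a rough illustration of how the zone types above might be represented in code, the following minimal sketch defines the nine zones as a Python enumeration and merges consecutive lines that share a label into segments. The names EmailZone and group_into_zones, the one-label-per-line representation, and the sample message are all assumptions made for this example; they are not taken from the annotated data or from the paper's implementation.

```python
from enum import Enum
from itertools import groupby


class EmailZone(Enum):
    """The nine zone types listed above (Quoted Text is omitted)."""
    AUTHOR = "Author"
    GREETING = "Greeting"
    SIGNOFF = "Signoff"
    REPLY = "Reply"
    FORWARD = "Forward"
    SIGNATURE = "Signature"
    ADVERTISING = "Advertising"
    DISCLAIMER = "Disclaimer"
    ATTACHMENT = "Attachment"


def group_into_zones(labelled_lines):
    """Merge consecutive lines that share a label into (zone, text) segments.

    `labelled_lines` is an iterable of (EmailZone, str) pairs, one per line.
    This per-line representation is an assumption made for illustration.
    """
    segments = []
    for zone, group in groupby(labelled_lines, key=lambda pair: pair[0]):
        segments.append((zone, "\n".join(text for _, text in group)))
    return segments


# Hypothetical hand-labelled message, following the definitions above.
# A manually written mention of an attachment stays in the Author zone.
example = [
    (EmailZone.GREETING, "Hi Noam,"),
    (EmailZone.AUTHOR, "The draft is attached."),
    (EmailZone.AUTHOR, "Let me know what you think."),
    (EmailZone.SIGNOFF, "Cheers, John"),
    (EmailZone.REPLY, "> Can you send the draft?"),
]

segments = group_into_zones(example)
```

Here the two adjacent Author lines collapse into a single segment, so the five labelled lines yield four zone segments.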
Our annotated data is licensed under a Creative Commons
Attribution-Noncommercial 2.0 Generic License.
If you make use of any of these resources, please cite the following paper:
Andrew Lampert, Robert Dale and Cécile Paris (2009). Segmenting Email Message Text into
Zones. In Proceedings of Empirical Methods in Natural Language
Processing (EMNLP 2009), pp. 919-928, Singapore, August 6-7.