2023-06-06 14:44:11 +00:00
+++
title = "Datasets"
weight = 20
+++
To debug and fuzz Aerogramme, we are looking for email datasets.

## Email datasets
- [stalwartlabs/mail-parser](https://github.com/stalwartlabs/mail-parser/tree/main/tests)
- [basecamp/mail](https://github.com/basecamp/mail/tree/master/spec/fixtures)
- [Enron dataset - 500k entries](https://www.cs.cmu.edu/~enron/)
- [Jeb Bush dataset - 290k entries](https://ab21www.s3.amazonaws.com/JebBushEmails-Text.7z)
- [spambase dataset](https://archive.ics.uci.edu/ml/datasets/spambase) (also contains legit emails)
- mailing lists
  - [W3C](https://lists.w3.org/Archives/Public/)
  - [Wikimedia](https://lists.wikimedia.org/hyperkitty/)
  - [Apache](https://commons.apache.org/mail-lists.html) - [tomcat](https://lists.apache.org/list.html?dev@tomcat.apache.org), [kafka](https://lists.apache.org/list.html?dev@kafka.apache.org).
  - [Linux](https://marc.info/?l=linux-kernel)
- your own inbox
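
Before feeding one of these datasets to a parser or fuzzer, it can help to get a rough idea of how many malformed messages it contains. The following is a minimal sketch and not part of Aerogramme's tooling: it assumes the dataset has been extracted to a directory tree with one raw message per file, and it uses Python's standard `email` package as a stand-in parser. The function name `scan_corpus` is hypothetical.

```python
import email
import email.policy
from pathlib import Path


def scan_corpus(root):
    """Parse every regular file under `root` as an email message.

    Returns (parsed, flagged): the number of files parsed, and a list of
    (path, defect names) pairs for messages where the parser recorded
    at least one structural defect.
    """
    parsed = 0
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        # The default policy records defects instead of raising on them.
        msg = email.message_from_bytes(raw, policy=email.policy.default)
        parsed += 1
        # Collect defects from the message and all of its MIME parts.
        defects = [type(d).__name__ for part in msg.walk() for d in part.defects]
        if defects:
            flagged.append((str(path), defects))
    return parsed, flagged
```

Calling `scan_corpus` on the extracted directory returns the message count and the list of flagged files. The flagged messages are often the most interesting inputs for debugging, since they exercise the error paths of a parser.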