# eml-codec
**⚠️ This is currently only a decoder (i.e. a parser); encoding is not yet implemented.**
## Example
```rust
// A raw message: header fields, a blank line, then the body.
let input = br#"Date: 7 Mar 2023 08:00:00 +0200
From: deuxfleurs@example.com
To: someone_else@example.com
Subject: An RFC 822 formatted message
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

This is the plain text body of the message. Note the blank line
between the header information and the body of the message."#;

// Parse the whole message, then read the decoded header fields.
let email = eml_codec::email(input).unwrap();
println!(
    "{} just sent you an email with subject \"{}\"",
    email.imf.from[0].to_string(),
    email.imf.subject.unwrap().to_string(),
);
```
## About the name
This library does not aim to implement a specific RFC, but to be a Swiss Army knife to decode and encode ("codec") what is generally considered an email (commonly abbreviated "eml"), hence the name: **eml-codec**.
## Goals
- Maintainability - modifying the code does not introduce regressions and is feasible for someone outside the project. Keep cyclomatic complexity low.
- Composability - build your own parser by picking the relevant passes, and avoid work that is not needed.
- Compatibility - always try to parse something; do not panic or return an error.
- Exhaustiveness - serve as a common project to encode knowledge about emails (existing MIME types, existing headers, etc.).
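
The "Compatibility" goal can be illustrated with a small, self-contained sketch. It is not taken from eml-codec: the `Field` type and the `parse_mime_version` function are hypothetical names invented for this example. The idea shown is that a field which cannot be understood is kept verbatim instead of turning the whole parse into an error.

```rust
// Illustrative sketch only: `Field` and `parse_mime_version` are
// hypothetical names, not part of eml-codec's API.

/// A header field is either understood or kept verbatim.
#[derive(Debug, PartialEq)]
enum Field<'a> {
    MimeVersion { major: u32, minor: u32 },
    Unparsed(&'a str),
}

/// Try to read `MIME-Version: <major>.<minor>`; never fail, never panic.
/// (Simplification: the header name is matched case-sensitively here.)
fn parse_mime_version(line: &str) -> Field<'_> {
    let parsed = line
        .strip_prefix("MIME-Version:")
        .and_then(|value| value.trim().split_once('.'))
        .and_then(|(major, minor)| {
            Some((major.parse::<u32>().ok()?, minor.parse::<u32>().ok()?))
        });

    match parsed {
        Some((major, minor)) => Field::MimeVersion { major, minor },
        // Fall back to the raw text so the caller still gets something.
        None => Field::Unparsed(line),
    }
}
```

Calling `parse_mime_version("MIME-Version: 1.0")` yields the structured variant, while a malformed line such as `"MIME-Version: one"` comes back as `Field::Unparsed` with the original text.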
## Non-goals
- Parsing optimizations that would make the logic harder to understand.
- Optimization for a specific use case, to the detriment of other use cases.
- Pipelining/streaming/buffering: the parser can backtrack arbitrarily and the result contains references into the whole buffer, so eml-codec must keep the whole buffer in memory. Avoiding the sequential approach would certainly speed up parsing a little, but it is too much work to implement currently.
## Missing / known bugs
Current known limitations/bugs:
- Resent Header Fields are not implemented
- Return-Path/Received headers might be hard to use: their order is important, and it is currently lost in the final data structure.
- Datetime parsing of an invalid date might return `None` instead of falling back to the `bad_body` field
- Comments contained in the email headers are dropped during parsing
- No support is provided for message/external-body (read data from the local computer) and message/partial (aggregate multiple fragmented emails), as they seem obsolete and dangerous to implement.
## Design
Speak about parser combinators.
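
As a starting point, here is a minimal sketch of the parser-combinator style using only the standard library. It does not reflect eml-codec's actual internals: every name below (`PResult`, `tag`, `field_name`, `line`, `pair`, `header_field`) is hypothetical and chosen for illustration. Each small parser takes the input bytes and returns the remaining input plus a value; combinators such as `pair` glue small parsers into bigger ones.

```rust
// Illustrative sketch only: these names are hypothetical, not eml-codec's API.

/// On success, a parser returns the unconsumed input and the parsed value.
type PResult<'a, T> = Option<(&'a [u8], T)>;

/// Match an exact byte string at the start of the input.
fn tag<'a>(expected: &'static [u8]) -> impl Fn(&'a [u8]) -> PResult<'a, &'a [u8]> {
    move |input: &'a [u8]| {
        if input.starts_with(expected) {
            Some((&input[expected.len()..], &input[..expected.len()]))
        } else {
            None
        }
    }
}

/// Recognize a field name: one or more printable ASCII bytes other than ':'.
fn field_name<'a>(input: &'a [u8]) -> PResult<'a, &'a [u8]> {
    let end = input
        .iter()
        .position(|&b| !(33..=126).contains(&b) || b == b':')
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((&input[end..], &input[..end]))
}

/// Take everything up to the end of the line and consume the LF or CRLF.
fn line<'a>(input: &'a [u8]) -> PResult<'a, &'a [u8]> {
    let end = input.iter().position(|&b| b == b'\n').unwrap_or(input.len());
    let value_end = if end > 0 && input[end - 1] == b'\r' { end - 1 } else { end };
    let rest_start = if end < input.len() { end + 1 } else { end };
    Some((&input[rest_start..], &input[..value_end]))
}

/// Combinator: run `first`, then `second` on what is left, and return both values.
fn pair<'a, A, B>(
    first: impl Fn(&'a [u8]) -> PResult<'a, A>,
    second: impl Fn(&'a [u8]) -> PResult<'a, B>,
) -> impl Fn(&'a [u8]) -> PResult<'a, (A, B)> {
    move |input: &'a [u8]| {
        let (rest, a) = first(input)?;
        let (rest, b) = second(rest)?;
        Some((rest, (a, b)))
    }
}

/// Drop leading ASCII whitespace.
fn trim_start(bytes: &[u8]) -> &[u8] {
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    &bytes[start..]
}

/// A simplified (unfolded) header field: `name ":" value EOL`.
fn header_field<'a>(input: &'a [u8]) -> PResult<'a, (&'a [u8], &'a [u8])> {
    let (rest, (name, _colon)) = pair(field_name, tag(b":"))(input)?;
    let (rest, value) = line(rest)?;
    Some((rest, (name, trim_start(value))))
}
```

Applied to `b"Subject: Hello\r\nFrom: a@b\r\n"`, `header_field` returns the name `Subject`, the value `Hello`, and the remaining input starting at `From:`, so the same parser can be applied again to the rest. Note that the returned slices borrow from the input buffer, which is one reason a parser written in this style keeps the whole buffer in memory.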
## Testing strategy
eml-codec aims to be tested as thoroughly as possible against real-world data.
### Unit testing: parser combinator independently (done)
### Selected full emails (expected)
### Existing datasets
**Enron 500k** - Parsing ~517k emails and checking that the
RFC 5322 headers (From, To, Cc, etc.) are correctly parsed took 20 minutes.
We had to exclude ~50 emails whose From/To/Cc fields
were completely wrong. On those emails some fields failed to parse,
but the parser did not crash and
parsed the other fields of the email correctly.
Run it on your machine:
```bash
cargo test -- --ignored --nocapture enron500k
```
Planned: jpbush, my inbox, etc.
### Fuzzing (expected)
### Across reference IMAP servers (Dovecot, Cyrus) (expected)
## Targeted RFC and IANA references
| 🚩 | RFC | Name |
|----|-----|------|
| 🟩 | 822 | ARPA INTERNET TEXT MESSAGES |
| 🟩 | 2822 | Internet Message Format (2001) |
| 🟩 | 5322 | Internet Message Format (2008) |
| 🟩 | 2045 | ↳ Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies |
| 🟩 | 2046 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types |
| 🟩 | 2047 | ↳ MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text |
| 🟩 | 2048 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures |
| 🟩 | 2049 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples |
| | | **Header extensions** |
| 🔴 | 2183 | ↳ Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field |
| 🔴 | 6532 | ↳ Internationalized Email Headers |
| 🔴 | 9228 | ↳ Delivered-To Email Header Field |
| | | **MIME extensions** |
| 🔴 | 1847 | ↳ Security Multiparts for MIME: Multipart/Signed and Multipart/Encrypted |
| 🔴 | 2387 | ↳ The MIME Multipart/Related Content-type |
| 🔴 | 3462 | ↳ The Multipart/Report Content Type for the Reporting of Mail System Administrative Messages |
| 🔴 | 3798 | ↳ Message Disposition Notification |
| 🔴 | 6838 | ↳ Media Type Specifications and Registration Procedures |
IANA references:
- (tbd) MIME subtypes
- [IANA character sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml)
## State of the art / alternatives
`stalwartlabs/mail-parser`
## Support
`eml-codec`, as part of the [Aerogramme project](https://nlnet.nl/project/Aerogramme/), was funded through the NGI Assure Fund, a fund established by NLnet with financial support from the European Commission's Next Generation Internet programme, under the aegis of DG Communications Networks, Content and Technology under grant agreement No 957073.
![NLnet logo](https://aerogramme.deuxfleurs.fr/images/nlnet.svg)
## License
```
eml-codec
Copyright (C) The eml-codec Contributors
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
```