# imf-codec

**Work in progress, do not use in production**
**This is currently only a decoder (parser), encoding is not supported.**

## Goals

- Maintainability - modifying the code should not introduce regressions and should be possible for someone outside the project. Keep cyclomatic complexity low.
- Composability - build your own parser by picking the relevant passes and avoid work that is not needed.
- Compatibility - always try to parse something; do not panic or return an error.
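
As an illustration of the Compatibility goal, here is a minimal sketch of a parser whose result type has no error case: when the value cannot be understood, the raw text is kept instead. The names `MimeVersion` and `parse_version` are hypothetical and are not taken from the crate.

```rust
// Hypothetical sketch: these names are illustrative, not the crate's API.

/// Result of parsing a version-like header value.
/// There is deliberately no error variant: bad input is kept as raw text.
#[derive(Debug, PartialEq)]
enum MimeVersion<'a> {
    Known { major: u32, minor: u32 },
    Unparsed(&'a str),
}

/// Parses values such as "1.0"; anything else falls back to `Unparsed`.
fn parse_version(value: &str) -> MimeVersion<'_> {
    let parsed = value.trim().split_once('.').and_then(|(maj, min)| {
        let major = maj.trim().parse::<u32>().ok()?;
        let minor = min.trim().parse::<u32>().ok()?;
        Some((major, minor))
    });
    match parsed {
        Some((major, minor)) => MimeVersion::Known { major, minor },
        None => MimeVersion::Unparsed(value),
    }
}

fn main() {
    // A well-formed value yields a structured result...
    println!("{:?}", parse_version("1.0"));
    // ...and a malformed one is returned untouched instead of failing.
    println!("{:?}", parse_version("one point oh"));
}
```

The caller always gets a value back and decides what to do with the fallback variant.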

## Non goals

  - Parsing optimizations that would make the logic harder to understand.
  - Optimization for a specific use case, to the detriment of other use cases.
  - Pipelining/streaming/buffering: because the parser can backtrack arbitrarily and the result contains references to the whole buffer, imf-codec must keep the whole buffer in memory. Avoiding the sequential approach would certainly speed up parsing a little, but it is too much work to implement currently.
  - Zero-copy. It might be implementable in the future, but to bootstrap this project quickly, I avoided it for now.

## Missing / known bugs

Current known limitations/bugs:

 - Resent header fields are not implemented.
 - Return-Path/Received headers might be hard to use: their order matters, and it is currently lost in the final data structure.
 - Datetime parsing of an invalid date might return `None` instead of falling back to the `bad_body` field.
 - Comments contained in the email headers are dropped during parsing.
 - No support is provided for message/external-body (read data from the local computer) and message/partial (aggregate multiple fragmented emails), as they seem obsolete and dangerous to implement.

## Design

Multipass design: each pass is in charge of one specific task.
*Having multiple passes does not necessarily lead to abysmal performance.
For example, the [Chez Scheme compiler](https://legacy.cs.indiana.edu/~dyb/pubs/commercial-nanopass.pdf)
pioneered the "Nanopass" concept and shows excellent performance.*

Currently, you can use the following passes:
 - `segment.rs` - Extract the header section by finding the `CRLFCRLF` token.
 - `guess_charset.rs` - Find the header section encoding (should be ASCII or UTF-8, but some corpora contain ISO-8859-1 headers).
 - `extract_fields.rs` - Extract the headers line by line, taking folding white space into account.
 - `field_lazy.rs` - Try to recognize the header fields (`From`, `To`, `Date`, etc.) but do not parse their values.
 - `field_eager.rs` - Parse the value of each known header field.
 - `header_section.rs` - Aggregate the various fields into a single structure.
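
To give an idea of how such passes chain together, here is a rough sketch of the first four of them. Every function name, signature and type below is an assumption made for illustration; the real modules are likely to differ.

```rust
// Hypothetical sketch of chained passes; not the crate's actual API.

/// Pass 1 (cf. `segment.rs`): split the message at the first CRLFCRLF.
/// Without a separator, everything is treated as the header section.
fn segment(raw: &[u8]) -> (&[u8], &[u8]) {
    match raw.windows(4).position(|w| w == b"\r\n\r\n") {
        // Keep the CRLF that terminates the last header line.
        Some(i) => (&raw[..i + 2], &raw[i + 4..]),
        None => (raw, &raw[raw.len()..]),
    }
}

#[derive(Debug, PartialEq)]
enum Charset {
    Ascii,
    Utf8,
    Latin1Fallback,
}

/// Pass 2 (cf. `guess_charset.rs`): ASCII, else valid UTF-8, else assume
/// a legacy 8-bit encoding such as ISO-8859-1.
fn guess_charset(headers: &[u8]) -> Charset {
    if headers.is_ascii() {
        Charset::Ascii
    } else if std::str::from_utf8(headers).is_ok() {
        Charset::Utf8
    } else {
        Charset::Latin1Fallback
    }
}

/// Pass 3 (cf. `extract_fields.rs`): one String per logical header line.
/// A line starting with a space or tab continues the previous field.
fn extract_fields(headers: &str) -> Vec<String> {
    let mut fields: Vec<String> = Vec::new();
    for line in headers.split("\r\n").filter(|l| !l.is_empty()) {
        if line.starts_with(' ') || line.starts_with('\t') {
            if let Some(prev) = fields.last_mut() {
                prev.push_str(line);
                continue;
            }
        }
        fields.push(line.to_string());
    }
    fields
}

/// Pass 4 (cf. `field_lazy.rs`): split name from value, leave the value raw.
fn field_lazy(field: &str) -> Option<(&str, &str)> {
    let (name, value) = field.split_once(':')?;
    Some((name.trim(), value.trim()))
}

fn main() {
    let raw = b"From: alice@example.com\r\nSubject: hello\r\n world\r\n\r\nBody text\r\n";
    let (headers, body) = segment(raw);
    println!("charset: {:?}", guess_charset(headers));
    // Lossy decoding is a simplification; a real pass would use the guess.
    let text = String::from_utf8_lossy(headers);
    for field in extract_fields(&text) {
        println!("{:?}", field_lazy(&field));
    }
    println!("body: {} bytes", body.len());
}
```

Because each pass only consumes the output of the previous one, a caller who only needs the raw header lines could stop after `extract_fields` and skip the rest.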


## Testing strategy

imf-codec aims to be tested as thoroughly as possible against real data.

### Unit testing: each parser combinator independently (done)

### Selected full emails (expected)

### Existing datasets

**Enron 500k** - It took 20 minutes to parse ~517k emails and check that
RFC 5322 headers (From, To, Cc, etc.) are correctly parsed.
From this list, we had to exclude ~50 emails whose
From/To/Cc fields were simply completely wrong. Even when
some fields failed to parse, the parser did not crash and
parsed the other fields of the email correctly.

Run it on your machine:

```bash
cargo test -- --ignored --nocapture enron500k
```
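
The command above relies on the test harness only running `#[ignore]`d tests when `--ignored` is passed, with `enron500k` acting as a name filter and `--nocapture` letting the test output through. A minimal sketch of how such a test could be declared follows; `parse_header_names` is a toy stand-in for the real parser, and the corpus is inlined here rather than read from the Enron files.

```rust
// Hypothetical sketch: not the crate's real test code.

/// Toy parser: returns the header names found before the first empty line.
fn parse_header_names(raw: &str) -> Vec<&str> {
    raw.lines()
        .take_while(|line| !line.is_empty())
        .filter_map(|line| line.split_once(':'))
        .map(|(name, _)| name.trim())
        .collect()
}

/// Counts the messages for which the parser did not find a `From` header.
fn count_missing_from(corpus: &[&str]) -> usize {
    corpus
        .iter()
        .filter(|msg| !parse_header_names(msg).contains(&"From"))
        .count()
}

#[test]
#[ignore] // skipped by a plain `cargo test`, run with `cargo test -- --ignored`
fn enron500k() {
    // A real test would iterate over the corpus on disk instead.
    let corpus = ["From: a@example.com\r\nTo: b@example.com\r\n\r\nhi"];
    assert_eq!(count_missing_from(&corpus), 0);
}

fn main() {
    let corpus = ["From: a@example.com\r\n\r\nhi", "no headers here"];
    println!("{} message(s) without From", count_missing_from(&corpus));
}
```

Marking the long-running test as ignored keeps the default `cargo test` run fast while still making the full corpus check available on demand.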

Planned: jpbush, my inbox, etc.

### Fuzzing (expected)

### Across reference IMAP servers (Dovecot, Cyrus) (expected)

## Development status

Early development. Not ready.
Do not use it in production or any software at all.

Todo:
 - [X] test over the enron dataset
 - [X] convert to multipass parser
 - [X] fix warnings, put examples, refactor the code
 - [ ] implement mime part 3 (encoded headers)
 - [ ] implement mime part 1 (new headers)
 - [ ] review part 2 (media types) and part 4 (registration procedure) but might be out of scope
 - [ ] implement some targeted testing as part of mime part 5
 - [ ] implement fuzzing through cargo fuzz
 - [ ] test over other datasets (jpbush, ml, my inbox)
 - [ ] backport to aerogramme

## Targeted RFC and IANA references

| 🚩 | # | Name |
|----|---|------|
| 🟩 | 822 | Standard for the Format of ARPA Internet Text Messages |
| 🟩 | 2822 | Internet Message Format (2001) |
| 🟩 | 5322 | Internet Message Format (2008) |
| 🔴 | 2045 | ↳ Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies |
| 🔴 | 2046 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types |
| 🔴 | 2047 | ↳ MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text |
| 🔴 | 2048 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures |
| 🔴 | 2049 | ↳ Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples |
| 🔴 | 2183 | Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field |
| 🟩 | 6532 | Internationalized Email Headers |
| 🔴 | 9228 | Delivered-To Email Header Field |

IANA references:
 - (tbd) MIME subtypes
 - [IANA character sets](https://www.iana.org/assignments/character-sets/character-sets.xhtml)

## Alternatives

`stalwartlabs/mail-parser`