[racket] windows-1252 charset decoding

Discussion:

John Clements

2015-03-04 00:22:26 UTC

I'm trying to process a bunch of e-mail, and I've discovered that lots of
it is encoded using the "windows-1252" charset. It looks pretty
straightforward to map this to unicode, but I thought I'd check: has anyone
written this code already?

John Clements

Matthew Flatt

2015-03-04 00:31:13 UTC

You can use "windows-1252" as an encoding name with, for example,

(read-line (reencode-input-port (open-input-bytes #"\xA3")
                                "windows-1252"))
"£"

For handling e-mail, see also `generalize-encoding` from `net/unihead`.
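
A minimal sketch of how the two suggestions above might be combined to decode a message body. `decode-body` is a hypothetical helper name, not part of any library, and the sketch assumes the platform's converter supports the requested charset (otherwise `reencode-input-port` raises an exception).

```racket
#lang racket/base
(require racket/port   ; reencode-input-port, port->string
         net/unihead)  ; generalize-encoding

;; decode-body : bytes? string? -> string?
;; Hypothetical helper: decodes `bstr`, which is labeled with `charset`,
;; into a Racket string. `generalize-encoding` widens the declared
;; charset first, to compensate for mislabeled mail.
(define (decode-body bstr charset)
  (define in (reencode-input-port (open-input-bytes bstr)
                                  (generalize-encoding charset)))
  (port->string in))

;; The byte from the example above, decoded in one call.
(decode-body #"\xA3" "windows-1252")
```

Reading the whole port with `port->string` rather than `read-line` keeps multi-line bodies intact.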

____________________
Racket Users list:
h

John Clements

2015-03-04 20:04:15 UTC

I see that the documentation suggests that (entity-charset) is supposed to
return a symbol. However, it nearly always returns a string. In particular,
it appears to me that it returns a symbol only when it returns its default,
'us-ascii.

I feel compelled to repair this, but there are two ways to fix it:
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.

It looks to me like #2 will break (less) code, though it's certainly
possible that people depend on the default value's being a string.

Opinions? In my tree, I've added contract checks on the structure exports
and changed the documentation and default to always return a string. If
people like this, I can just submit it as a pull request.

John

Post by Matthew Flatt
You can use "windows-1252" as an encoding name with, for example,

(read-line (reencode-input-port (open-input-bytes #"\xA3")
                                "windows-1252"))
"£"

Perfect!
I went looking for a place where I might add a "windows-1252" search term,
but it looks like it might be hard, since the list of supported encodings
is apparently platform dependent. Would it make sense simply to attach a
free-floating search tag of "windows-1252" to this part of the
documentation?
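
Since the set of supported encodings varies by platform, a program can probe for one at run time. The following is only a sketch: `encoding-supported?` is a hypothetical helper name, and it relies on `bytes-open-converter` returning `#f` when no converter is available for the requested pair of encodings.

```racket
#lang racket/base

;; encoding-supported? : string? -> boolean?
;; Hypothetical helper: #t if this platform can convert from `name`
;; to UTF-8, #f otherwise.
(define (encoding-supported? name)
  (define conv (bytes-open-converter name "UTF-8"))
  (cond [conv (bytes-close-converter conv) #t]
        [else #f]))

;; Report on a few encodings; results depend on the platform.
(for ([name '("windows-1252" "iso-8859-1" "UTF-8")])
  (printf "~a: ~a\n" name (encoding-supported? name)))
```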

Post by Matthew Flatt
For handling e-mail, see also `generalize-encoding` from `net/unihead`.

That probably saved me another half-hour of searching and head-scratching.
Thanks!
John
(p.s.: no one whose mailer checks DMARC records will get this e-mail,
sadly. Can't wait to change to Google Groups.)

____________________
http://lists.racket-lang.org/users

Sam Tobin-Hochstadt

2015-03-04 20:13:44 UTC

Post by John Clements
I see that the documentation suggests that (entity-charset) is supposed to
return a symbol. However, it nearly always returns a string. In particular,
it appears to me that it returns a symbol only when it returns its default,
'us-ascii.
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.
It looks to me like #2 will break (less) code, though it's certainly
possible that people depend on the default value's being a string.

It seems like option #3, document the current behavior, will break the
least code, and that we should do that.

Sam
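
Under option #3, callers have to accept either a symbol or a string. A minimal sketch of a normalizing wrapper follows; `charset->string` is a hypothetical name, not part of `net/mime`, and the contract string in the error message is an assumption about the possible return values described in this thread.

```racket
#lang racket/base

;; charset->string : (or/c symbol? string?) -> string?
;; Hypothetical helper: normalizes a value such as the result of
;; `entity-charset`, which may be a symbol (the 'us-ascii default)
;; or a string, to a string.
(define (charset->string cs)
  (cond [(string? cs) cs]
        [(symbol? cs) (symbol->string cs)]
        [else (raise-argument-error 'charset->string
                                    "(or/c symbol? string?)"
                                    cs)]))

(charset->string 'us-ascii)
(charset->string "windows-1252")
```

The result can then be passed directly to functions that expect an encoding name as a string, such as `reencode-input-port`.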

John Clements

2015-03-05 18:37:49 UTC

Urghh.... really? The existing behavior is clearly broken, and this library
is--to the best of my knowledge--used by a relatively small number of
people. Francisco, as the original author of this code, do you have an
opinion?
