Discussion:
[racket] windows-1252 charset decoding
John Clements
2015-03-04 00:22:26 UTC
I'm trying to process a bunch of e-mail, and I've discovered that lots of
it is encoded using the "windows-1252" charset. It looks pretty
straightforward to map this to Unicode, but I thought I'd check: has anyone
written this code already?

John Clements
Matthew Flatt
2015-03-04 00:31:13 UTC
You can use "windows-1252" as an encoding name with, for example,
  > (read-line (reencode-input-port (open-input-bytes #"\xA3")
                                    "windows-1252"))
  "£"

For handling e-mail, see also `generalize-encoding` from `net/unihead`.
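Editorial sketch: the two suggestions above can be combined into one small helper for decoding a message body. This is only an illustration under stated assumptions: `decode-bytes` is a hypothetical name, not part of any Racket library, and whether "windows-1252" is accepted depends on the converters available on the platform (`reencode-input-port` raises an exception if the encoding is unsupported).

```racket
#lang racket/base
(require racket/port   ; reencode-input-port, port->string
         net/unihead)  ; generalize-encoding

;; decode-bytes : bytes? string? -> string?
;; Hypothetical helper: decode raw bytes that are labeled with `charset`.
;; The charset name is passed through generalize-encoding first, as
;; suggested above for e-mail handling.
(define (decode-bytes bs charset)
  (define in
    (reencode-input-port (open-input-bytes bs)
                         (generalize-encoding charset)))
  (port->string in))

;; Byte #xE9 is "é" and byte #xA3 is "£" in windows-1252.
(decode-bytes #"caf\xE9 \xA3" "windows-1252")
```

If the header names a charset the platform cannot convert, wrapping the call in `with-handlers` and falling back to a lossy decoding may be preferable to failing on the whole message.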
John Clements
2015-03-04 20:04:15 UTC
I see that the documentation suggests that (entity-charset) is supposed to
return a symbol. However, it nearly always returns a string. In particular,
it appears to me that it returns a symbol only when it returns its default,
'us-ascii.

I feel compelled to repair this, but there are two ways to fix it:
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.

It looks to me like #2 will break less code, though it's certainly
possible that people depend on the default value's being a symbol.

Opinions? In my tree, I've added contract checks on the structure exports
and changed the documentation and default to always return a string. If
people like this, I can just submit it as a pull request.

John
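Editorial sketch: until the behavior is settled, calling code can defend itself by normalizing whatever `entity-charset` returns. The helper below is hypothetical (`charset->string` is not part of `net/mime`) and assumes only what the message above reports: the value is either a symbol or a string.

```racket
#lang racket/base

;; charset->string : (or/c symbol? string?) -> string?
;; Hypothetical helper, not part of net/mime: accept either
;; representation of a charset and always hand back a string.
(define (charset->string cs)
  (cond
    [(symbol? cs) (symbol->string cs)]
    [(string? cs) cs]
    [else (raise-argument-error 'charset->string
                                "(or/c symbol? string?)"
                                cs)]))

(charset->string 'us-ascii)       ; the reported default, a symbol
(charset->string "windows-1252")  ; the usual case, a string
```

Code written this way keeps working whichever of the proposed fixes is adopted.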
Post by Matthew Flatt
You can use "windows-1252" as an encoding name with, for example,
  > (read-line (reencode-input-port (open-input-bytes #"\xA3")
                                    "windows-1252"))
  "£"
Perfect!
I went looking for a place where I might add a “windows-1252” search term,
but it looks like it might be hard, since the list of supported encodings
is apparently platform dependent. Would it make sense simply to attach a
free-floating search tag of “windows-1252” to this part of the
documentation?
Post by Matthew Flatt
For handling e-mail, see also `generalize-encoding` from `net/unihead`.
That probably saved me another half-hour of searching and head-scratching.
Thanks!
John
(p.s.: no one whose mailer checks DMARC records will get this e-mail,
sadly. Can’t wait to change to google groups.)
Sam Tobin-Hochstadt
2015-03-04 20:13:44 UTC
Post by John Clements
I see that the documentation suggests that (entity-charset) is supposed to
return a symbol. However, it nearly always returns a string. In particular,
it appears to me that it returns a symbol only when it returns its default,
'us-ascii.
1) make it match the docs and always return a symbol, or
2) change the docs and the default to return a string.
It looks to me like #2 will break less code, though it's certainly
possible that people depend on the default value's being a symbol.
It seems like option #3, document the current behavior, will break the
least code, and that we should do that.

Sam
John Clements
2015-03-05 18:37:49 UTC
Urghh.... really? The existing behavior is clearly broken, and this library
is--to the best of my knowledge--used by a relatively small number of
people. Francisco, as the original author of this code, do you have an
opinion?