TL;DR: Browser encoding can change from one request to the next. When you set a Content-Type, be sure to also specify a charset.

Being in the US, that great, wide birthplace of the Internet, it's very easy to take encoding for granted. Anyone who's not a web developer can happily inhabit ASCII for their entire life, and at most raise a few eyebrows at funny mis-encodings like "fiancÃ©e" in a few Latin Extended words here and there.
But We've Taken the Red Pill
Of course, if we're making a product for the web, we're making it for the world. And if we're responsible, we're very careful to be familiar with Unicode in general (and, for a lot of us, UTF-8 in particular). So, with that in mind, when one of our customers in China started getting messages like this from their customers, it was a bit of a surprise:
¨¨¡¥¡¤ ¨¦?????¨¨?¡è?3???1???¨¨????¡è?13??¡¥??£¤???¨¦?¡è¨¨?????????o¡é¨¨?2?????¡ã??? ??��??��???��?��?????1?????????
We've engineered Olark with Unicode in mind throughout our system, and do what we can to get messages delivered from all across the globe. But sometimes, in the middle of some of this customer's conversations, all of a visitor's messages would start showing up as complete gibberish. It was like a switch had been flipped.
This is where the trace game began, and I spent a while looking through all the parts of our stack where encoding matters. I had some clues--since it would show up garbled to the operator in any client, we knew it was before our transport layer.
Since it was also garbled in the transcripts, we knew it was somewhere in our pre-transport layer (we call it NRPC; it's where the embedded chatbox event requests terminate). Logs might have had it, but they rolled pretty fast (a few million chat events a week will do that). We had no way to reproduce the issue locally, and to complicate things, this user lived on the opposite side of the clock, timezone-wise -- if he got a sample conversation and told us about it, twelve hours of logs would have rolled past before we were even awake to take a look.
So, I did some experimentation to try to eliminate some chaff. I started filtering based on bytes that were pretty uncommon in our system. In narrowing it down, I got some pretty cool stats (we have customers using Devanagari, Batak, and Tagbanwa. SO COOL), but no luck with the customer's issue.
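The filtering idea above can be sketched roughly like this in Python -- the function names and the choice of "common" blocks are assumptions for illustration, not our actual tooling:

```python
import unicodedata

def scripts_in(text):
    """Collect the Unicode block names seen in a string, based on the
    first word of each character's Unicode name (e.g. 'DEVANAGARI')."""
    scripts = set()
    for ch in text:
        try:
            scripts.add(unicodedata.name(ch).split()[0])
        except ValueError:
            pass  # unnamed code point (e.g. some control characters)
    return scripts

def is_interesting(line, common=("LATIN", "DIGIT", "SPACE", "CJK")):
    """Flag log lines containing characters outside the blocks we see constantly."""
    return any(s not in common for s in scripts_in(line))
```

Running a filter like this over the logs surfaces the rare-script traffic quickly, which is how stats like the ones above fall out as a side effect.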
Eventually, I had worked my way up the stack. There was no earlier place to log--every place encoding was a factor, things came through as they went in. This was very peculiar--a browser would of course always submit its requests in UTF-8, right? And even if it didn't, it wouldn't change that mid-conversation, right?
So I took a different approach: I logged EVERY request by the particular account having the problem, and every part of the request. And then I found something interesting:
    sendmessage got these parameters from process filter: ...some uninteresting garbled content... 'accept-charset': '**GB2312**,UTF-8;q=0.7,*;q=0.7'
    sendmessage got these parameters from process filter: ...some uninteresting ungarbled content... 'accept-charset': '**GBK**,UTF-8;q=0.7,*;q=0.7'
Wait, what was that? Requests from the same end user, but with different accept-charset headers? And a 1-to-1 relationship between requests with an accept-charset of GB2312 and garbled text?!
Even though this wasn't exactly a smoking gun (UTF-8 was in both, and even the requests with GBK in the accept-charset were actually being submitted in UTF-8), it was definitely a hint that, somewhere along the way, the browser was picking up on something and simply changing the encoding it used to submit requests.
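For context, an accept-charset value is a comma-separated preference list, where each entry can carry a q weight (defaulting to 1.0). A rough parser sketch -- not code from our stack -- makes the ordering in those headers explicit:

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset-style value into (charset, weight) pairs,
    highest preference first. Entries without a q parameter default to q=1.0."""
    prefs = []
    for entry in header.split(","):
        parts = entry.strip().split(";")
        charset = parts[0].strip()
        q = 1.0
        for param in parts[1:]:
            key, _, value = param.strip().partition("=")
            if key == "q":
                q = float(value)
        prefs.append((charset, q))
    # sorted() is stable, so equal weights keep their listed order
    return sorted(prefs, key=lambda p: -p[1])

parse_accept_charset("GBK,UTF-8;q=0.7,*;q=0.7")
# → [('GBK', 1.0), ('UTF-8', 0.7), ('*', 0.7)]
```

So in both headers above, the Chinese encoding is the browser's first preference and UTF-8 is only a fallback -- which is why "UTF-8 was in both" wasn't reassuring.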
"Well that's easy," I thought, "We should just provide an encoding for the submission". Except, for various reasons, we use JSONP for all of these requests, meaning it has to be a GET request, meaning we cannot use form attributes to set request headers (more here).
And so, the fix: an extra string added on the server side:
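The extra string boils down to appending an explicit charset to the Content-Type of the JSONP response. A minimal sketch of that idea in Python -- the function and its shape are illustrative, not Olark's actual code:

```python
def jsonp_response(callback_body):
    """Build JSONP response headers and body with an explicit charset.
    The '; charset=UTF-8' suffix is the extra string the fix adds: it
    tells the browser what encoding we speak, so it stops guessing."""
    body = callback_body.encode("utf-8")
    headers = {
        "Content-Type": "application/javascript; charset=UTF-8",
        "Content-Length": str(len(body)),
    }
    return headers, body
```

Because the charset rides on the response header, it works even for JSONP GET requests where we have no control over the request headers.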
That's it. Once this was in production, the customer never saw the problem again (well, almost never, but that was a whole other can of worms).
The Moral of the Story
It is always the responsibility of the server to specify its charset when it returns content where encoding might matter. The browser's charset detection may be a bit magical, but it tries to treat the server as authoritative -- even when that authority comes from a response with no direct relationship to the request in question. It's our job to let the browser know what we expect.