I wish to download a web page, which may be in any possible text encoding, and save it as UTF-16LE. Assuming I can determine the text's encoding (by examining the HTTP header, HTML header, and/or BOM), how do I convert the text?
I am using Delphi 2009. Unfortunately, the help files do not explain how to get from an arbitrary encoding to a Unicode (UTF-16LE) string. Specific questions:
Can I complete the conversion, simply by setting the correct encoding on an AnsiString and assigning that to a UnicodeString?
If so, how do I translate the various "charset" descriptions that may label the web page (Big5, Shift-JIS, UTF-32, etc) into the right format to initialize the AnsiString?
Thanks for your suggestions.
I have a preference for straight Win32 and VCL, but answers involving ActiveX controls may also be helpful.
How are you going to access the page? Embedded Internet Explorer, Indy, a third-party tool, ...? That might influence the answer, because it determines the format of the input string.
Part 1: Getting the page
If you use the Embedded Internet Explorer (TWebBrowser) to access the page, things are pretty straightforward:
var
  htmlElement: IHTMLElement;
  myText: string;
begin
  // Get access to the HTML element of the document:
  htmlElement := (WebBrowserControl.DefaultInterface.Document as IHTMLDocument3).documentElement;
  // Retrieve the full HTML of the web page:
  myText := htmlElement.outerHTML;
end;
The encoding of the web page should be handled properly by IE and by Delphi, and you end up with a UnicodeString containing the result (myText in the example above).
Part 2: Saving in UTF-16LE
Regardless of where your string came from, you can save it in the desired encoding like this:
var
  s: TStringStream;
begin
  s := TStringStream.Create(myText, TEncoding.Unicode, False);
  try
    s.SaveToFile('yourFileToSaveTo.txt');
  finally
    FreeAndNil(s);
  end;
end;
TEncoding.Unicode is UTF-16LE, but you could also use any other encoding.
Hope this helps.
In D2009 and later, Indy 10's TIdHTTP component automatically decodes a received webpage to UTF-16 for you.
Doing a charset-to-Unicode conversion on Windows requires the use of codepages (unless you use the ICONV library), so you first have to map the charset name to a suitable codepage. Then you can do the conversion either with TEncoding.GetEncoding() and TEncoding.GetString(), or by calling SetCodePage() on a RawByteString (not an AnsiString) and then assigning it to a UnicodeString. (Internally, Indy uses TEncoding and has its own charset-to-codepage lookup tables.)
We want to rewrite a large web project. To make the work safer, we want to cover it with numerous API tests that will be extracted from watching the real web calls (and, let's be honest, from code analysis too).
So I am trying to extract the JSON strings sent by the different requests. The problem is that the tool provided by the browser (it is practically the same for both FF and Chrome) gives me the JSON in a structured form, and I need to use it as strings.
Rewriting all the large, deeply structured strings from more than a hundred requests by hand would be a horror. How can I copy-paste the string representation of the request parameters?
I have found that in Chrome, near the "Request Payload" header, there is a switch: view source <-> view parsed. The first variant shows the JSON string. BTW, IE has buttons for that, and FF... has nothing?
In Firefox: Right click > Copy > Copy POST Data.
You can also "Copy All As HAR" to get the raw body (of every request and response in the list), and "Edit and Resend" will show you the raw body in the UI.
I am trying to use the following command to get the page's source code
requests.get("website").text
But I get an error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xe6' in position 356: illegal multibyte sequence
Then I tried to encode the page text as UTF-8:
requests.get("website").text.encode('utf-8')
But then everything apart from the English text turns into the following form:
\xe6°\xb8\xe4\xb9\x85\xe6\x8f\x90\xe4\xbe\x9b\xe5\x85\x8dè\xb4\xb9VPN\xe5\xb8\x90\xe5\x8f·\xe5\x92\x8c\xe5\x85\x8dè\xb4\xb
How can I fix this?
Thank you for your help
You can work out the encoding of a requests.Response object by inspecting its Response.content attribute, i.e. the raw bytes of the page.
Whenever you access requests.Response.text, the response object uses requests.Response.encoding to decode those bytes.
This may not always be the correct encoding, however, so you sometimes have to set it manually after looking it up in the content attribute; websites usually declare the encoding there if it is not utf-8 or similar (this is from experience, I'm not sure whether this is actually standard behaviour).
See more on requests.Response contents here
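For example, here is a minimal sketch of that workflow (the URL is a placeholder, and the fallback to apparent_encoding is just a heuristic based on requests' own charset detection):

import requests

resp = requests.get("https://example.com/page")  # placeholder URL

# requests guesses the text encoding from the HTTP headers. If nothing useful
# was declared (requests falls back to ISO-8859-1 for text responses), use
# requests' body-based guess instead, before reading .text.
if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
    resp.encoding = resp.apparent_encoding

text = resp.text                    # str, decoded with the chosen encoding
utf8_bytes = text.encode("utf-8")   # re-encode as UTF-8 bytes if you need bytes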
Is it necessary to percent encode a URI before using it in the browser i.e when we write a URI in a browser should it already be percent encoded or it is the responsibility of the browser to encode the URI and send the request to the server?
You'll find that most modern browsers will accept a non-encoded URL, and they will generally be able to encode reserved characters themselves.
However, it is bad practice to rely on this, because you can end up with unpredictable results. For instance, if you were sending form data to a server using a GET request and someone had typed in a # symbol, the browser will interpret the request differently depending on whether it was encoded or not.
In short, it's always best to encode data yourself to get predictable results if you're expecting reserved characters in a request. Fortunately, most programming languages used on the web have built-in functions for this.
Just to add, you don't need to encode the whole URL - it's usually only the data you're sending in a GET request that gets encoded. For example:
http://www.foo.com?data=This%20is%20my%20encoded%20string%20%23
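As an illustration, here's roughly what that looks like in Python (a small sketch; the URL and the data parameter are just the ones from the example above):

from urllib.parse import quote, urlencode

# Percent-encode a single value; '#' becomes %23 instead of starting a fragment:
quote("This is my encoded string #", safe="")
# -> 'This%20is%20my%20encoded%20string%20%23'

# Or build a whole query string from form data:
params = {"data": "This is my encoded string #"}
"http://www.foo.com?" + urlencode(params)
# -> 'http://www.foo.com?data=This+is+my+encoded+string+%23'

(urlencode encodes spaces as '+', which is equally valid in a query string.)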
Recently I came across the term JSONC in a YouTube API. I browsed the Web, but found nothing much about it. Can someone explain whether JSON and JSONC are the same or different?
There is also jsonc aka "JSON with comments", created by Microsoft and used by Visual Studio Code. The logic for it can be found here, alas without exhaustive specification (though I'd like to be proven wrong on this).
On top of that there is this project with an actual specification which is also called jsonc, but also does far more than just adding comments.
While there definitely is a use for these technologies, some critical thinking is advised. JSON containing comments is not JSON.
JSON-C seems to just be a variation of JSON mainly targeted at C development. I.e., from the open source docs, "JSON-C implements a reference counting object model that allows you to easily construct JSON objects in C, output them as JSON formatted strings and parse JSON formatted strings back into the C representation of JSON objects."ref^1
From the YouTube API perspective (specifically, version 2, not the new version 3), The JSON-C response is just a condensed version of the JSON response (removing "duplicate, irrelevant or easily calculated values").ref^2
Why would the JSON response have "duplicate, irrelevant or easily calculated" values anyway? Because it is converting the original ATOM XML format directly to JSON in a lossless conversion. You can find out more details here.
However, I would suggest using version 3 of the YouTube Data API. It is much easier to use. =)
JSONC is an open-source JavaScript API created by Tomás Corral Casas for reducing the amount of JSON data that is transported between clients and servers. It uses two different approaches to achieve this, JSONC.compress and JSONC.pack. More information can be found on the JSONC Github page:
https://github.com/tcorral/JSONC
I would like to send erlang terms (erlang-based back end) to the web browser. It is easy enough to encode a term on the erlang side using something like:
term_to_binary(Term)
or:
binary_to_list(term_to_binary(Term))
The problem of course is that scrambled garbage shows up on the browser end.
Question: Is there either some encoding I can use on the browser end, or more likely, some Content-Type I can accept on the browser end to unscramble this?
Thanks.
Use io_lib:format("~p", [Term]). It will produce a string representation of the Erlang term, which can be shown on a web page. Consider also checking out this question and its answer.
There is piqi, which provides extensive mapping mechanisms between .piqi (its record definition language), JSON, XML and protobuf. It's a really cool tool that we use all the time to map between all of these formats.
Typically, when I build something (in Erlang) that needs to provide some sort of data to something else, I start with a piqi definition file that defines the structure. The piqic compiler generates Erlang record definitions and the conversion code to do the conversions easily.
Highly recommended, but it might be overkill for what you're doing.
Encode it with Base64. Get it via Ajax, then decode it either with the native window.atob or with any of the numerous available libraries.
If it is for a web browser, I would go for a JSON string; it's Unicode and browsers support it natively.
Maybe consider JSON and do something like this for strings:
1> HelloJerome = "Hello Jérôme".
"Hello Jérôme"
2> HelloJeromeBin = list_to_binary(HelloJerome).
<<"Hello Jérôme">>
3> HelloJeromeJson = << <<"{\"helloJerome\":\"">>/bits, HelloJeromeBin/bits, $\", $} >>.
<<"{\"helloJerome\":\"Hello Jérôme\"}">>
In the browser console:
jerome = JSON.parse('{"helloJerome":"Hello Jérôme"}')
Now
jerome.helloJerome == "Hello Jérôme"
There are some good libs out there: ejson and mochijson2 are the classic ones, but there are others too, such as ktuo.