Best and most efficient way to parse both XML and HTML in C

Hi folks!
I'm looking for the most efficient way to parse server responses that contain both HTML and XML. The responses come from servers that I need to poll every 5 minutes (there are about five hundred of them in the list right now, but that number will double very soon). Each response is stored in a buffer as plain text (read from a socket). I need to parse the HTML part and, if that succeeds (the mandatory items are found), parse the XML part to extract statistics to store in a DB. The responses look like this:
HTTP/1.0 200 OK
Connection: close
Content-Length: 682
Content-Type: text/xml; charset=utf-8
Date: Sun, 09 Mar 2014 15:44:52 GMT
Last-Modified: Sun, 09 Mar 2014 15:44:52 GMT
Server: DrWebAV-DeskServer/REL-610-AV-6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.0e
<?xml version="1.0" encoding="utf-8"?><avdesk-xml-api API='2.1.0' API_BUILD='20130709' branch='REL-610-AV' oper='get-server-info' rc='true' timestamp='20140309154452987' version='6.02.0.201311040'><server><id>00c1d140-d21d-b211-a828-b62919c4250d</id><platform>Linux 2.6.39-gentoo-r3 x86_64 (4 SMP Mon Oct 24 11:04:40 YEKT 2011)</platform><version>6.02.0.201311040</version><statistics from='20140301000000000' till='20140309235959999'><noviruses/><stations total='101'><online>5</online><deinstalled>21</deinstalled><blocked>0</blocked><expired>81</expired><offline>96</offline><activated>74</activated><unactivated>27</unactivated></stations></statistics></server></avdesk-xml-api>
Or it could look something like this:
HTTP/1.0 401 Authorization Required
Cache-Control: post-check=0, pre-check=0
Connection: close
Content-Length: 421
Content-Type: text/html; charset=utf-8
Date: Sun, 09 Mar 2014 15:44:22 GMT
Expires: Date: Sat, 27 Nov 2004 10:18:15 GMT
Last-Modified: Date: Sat, 27 Nov 2004 10:18:15 GMT
Pragma: no-cahe
Server: DrWebAV-DeskServer/REL-610-AV-6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.1
WWW-Authenticate: Basic realm="Dr.Web XML API area"
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><TITLE>Unauthorized</TITLE><BODY><STRONG>Unauthorized</STRONG><P>The error "401 Unauthorized" occured while processing request you had sent.<P><BR><BR><I>Access denied or your browser does not support HTTP authentication!</I><BR><P><BR><BR><HR><P>Dr.Web ® AV-Desk Server REL-610-AV 6.02.0.201311040 Linux/x86_64 Lua/5.1.4 OpenSSL/1.0.1</BODY></HTML>
From the HTML part I'm basically interested in the HTTP/1.0 STRING and Server: STRING lines; after that I need per-tag XML parsing, if authorization succeeded.
I have found that libxml2 is suitable for parsing both HTML and XML, but I can't find any real examples of how to use it, only a high-level description of the interface. So, help needed.

Code examples for libxml2 are here
The mailing list is friendly, and the code is mature and good quality.
However, nothing in your example suggests you need to parse HTML. You need to parse (I think) HTTP to process the headers (and detect the 401 error from the HTTP response), then parse the XML content. Parsing HTTP headers to the level you require it is trivial (just strtok the response separating on line breaks and the first line has the answer you need). The body of the response starts after a double line break (I think your second example has a paste error). This reduces your task to simply processing HTTP headers and XML (no HTML parsing).
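The split described above can be sketched as follows. This is an illustration of the logic in Python, not a C implementation; in C the same steps would roughly map to strstr for the blank line and libxml2's xmlReadMemory for the body. The function names split_response and station_stats are made up for this sketch, the sample is a trimmed copy of the first response from the question, and it assumes the server separates lines with CRLF.

```python
import xml.etree.ElementTree as ET


def split_response(raw: bytes):
    """Split a raw HTTP response into (status_code, headers, body)."""
    # The body starts after the first blank line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.0 200 OK
    _version, code, *_reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


def station_stats(body: bytes):
    """Collect the <stations> counters from the XML body into a dict."""
    root = ET.fromstring(body)
    stations = root.find("./server/statistics/stations")
    stats = {child.tag: int(child.text) for child in stations}
    stats["total"] = int(stations.get("total"))
    return stats


# Trimmed version of the first sample response in the question.
sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/xml; charset=utf-8\r\n"
    b"Server: DrWebAV-DeskServer/REL-610-AV-6.02.0.201311040\r\n"
    b"\r\n"
    b"<?xml version=\"1.0\" encoding=\"utf-8\"?>"
    b"<avdesk-xml-api rc='true'><server><statistics>"
    b"<stations total='101'><online>5</online><offline>96</offline></stations>"
    b"</statistics></server></avdesk-xml-api>"
)

code, headers, body = split_response(sample)
# Only attempt XML parsing when the request succeeded and the body is XML.
if code == 200 and headers.get("content-type", "").startswith("text/xml"):
    print(headers["server"], station_stats(body))
```

A 401 response never reaches the XML step, because the status code check fails first, so its HTML error page does not need to be parsed at all.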

Related

How to force Servant return JSON errors instead of plain strings?

By default, Servant returns error responses as plain strings even if the requested endpoint returns JSON:
$ http $S/signup email=mail@domain.com
HTTP/1.1 400 Bad Request
Connection: keep-alive
Date: Tue, 14 Apr 2020 15:59:32 GMT
Server: nginx/1.17.9 (Ubuntu)
Transfer-Encoding: chunked
Error in $: parsing Credentials.Credentials(Credentials) failed, key "password" not found
I am trying to wrap such strings into simple JSON dictionaries:
$ http $S/signup email=mail@domain.com
HTTP/1.1 400 Bad Request
Connection: keep-alive
Date: Tue, 14 Apr 2020 15:59:32 GMT
Server: nginx/1.17.9 (Ubuntu)
Transfer-Encoding: chunked
{"error": "Error in $: parsing Credentials.Credentials(Credentials) failed, key \"password\" not found"}
But it looks like it's not that easy.
The question "Custom JSON errors for Servant-server" lists possible solutions, but I can't get them to work today.
Another approach is discussed in this thread https://github.com/haskell-servant/servant/issues/732 but it looks like overkill for such a simple task.
I wonder if there is a simple and robust solution in 2020?
There is a library called servant-errors. It provides a middleware that does exactly what you are looking for – transforms error responses to have a uniform structure of your choice, JSON being one of the built-in options.
See the documentation for details, but the basic usage is as straightforward as wrapping
errorMw @JSON @["error", "status"]
around your application.

"Last-modified" date of a document through a web-browser

If you inspect a document (e.g. a PDF) in a web browser, you can obtain a "last-modified date" for the document itself:
<embed id="plugin" type="application/x-google-chrome-pdf" src="http://mywebsite.org/mydocument.pdf" headers="Connection: Keep-Alive Content-Length: 144303 Content-Type: application/pdf Date: Thu, 22 Nov 2018 09:09:44 GMT; Keep-Alive: timeout=6, max=70 Last-Modified: Fri, 9 Nov 2018 09:43:03 GMT Server: Apache X-Content-Type-Option: nosniff " background-color="0xFF525659" top-toolbar-height="56" javascript="allow" full-frame="">
My question is: does this "Last-Modified" date refer to the time the document itself was last changed, before it was uploaded to the website, or to the time the document was uploaded to the website?
Thank you,
best
From Mozilla:
The Last-Modified response HTTP header contains the date and time at
which the origin server believes the resource was last modified.

Why are my plain text + HTML emails being displayed as plain text in Gmail? [duplicate]

My Rails 3 application sends out emails in both plain text and HTML formats. I have tested it locally using RoundCube and Squirrel Mail clients and they both display HTML version with images, links, etc. GMail on the other hand chooses plain text format. Any idea what's causing this?
Delivered-To: test@gmail.com
Received: by 10.42.166.2 with SMTP id m2cs16081icy;
Thu, 3 Mar 2011 17:01:48 -0800 (PST)
Received: by 10.229.211.138 with SMTP id go10mr1544841qcb.195.1299200507499;
Thu, 03 Mar 2011 17:01:47 -0800 (PST)
Return-Path: <info@example.com>
Received: from beta.example.com (testtest.test.com [69.123.123.123])
by mx.google.com with ESMTP id j14si1690118qcu.136.2011.03.03.17.01.46;
Thu, 03 Mar 2011 17:01:46 -0800 (PST)
Received-SPF: neutral (google.com: 69.123.123.123 is neither permitted nor denied by best guess record for domain of info@example.com) client-ip=69.123.123.123;
Authentication-Results: mx.google.com; spf=neutral (google.com: 69.123.123.123 is neither permitted nor denied by best guess record for domain of info@example.com) smtp.mail=info@example.com
Received: from localhost.localdomain (localhost [127.0.0.1])
by beta.example.com (Postfix) with ESMTP id F3C273A3EC
for <test@gmail.com>; Fri, 4 Mar 2011 01:01:45 +0000 (UTC)
Date: Fri, 04 Mar 2011 01:01:45 +0000
From: info@example.com
To: test@gmail.com
Message-ID: <4d7039f9e9d3e_3449482ab7831658@test.mail>
Subject: Your example account was activated.
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_4d7039f9e6967_3449482ab7831370";
charset=UTF-8
Content-Transfer-Encoding: 7bit
----==_mimepart_4d7039f9e6967_3449482ab7831370
Date: Fri, 04 Mar 2011 01:01:45 +0000
Mime-Version: 1.0
Content-Type: text/html;
charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-ID: <4d7039f9e95ed_3449482ab7831519@test.mail>
<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type" />
</head>
<body>
<p><img border="0" src="http://example.com/images/logo.png" alt="example logo" /></p>
<p>Congratulations, Test!</p>
<p>
Your <a style="text-decoration:none;color:#ef4923;" href="http://example.com/">example</a> account was activated.
</p>
</body>
</html>
----==_mimepart_4d7039f9e6967_3449482ab7831370
Date: Fri, 04 Mar 2011 01:01:45 +0000
Mime-Version: 1.0
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 7bit
Content-ID: <4d7039f9e8b0e_3449482ab78314b7@test.mail>
Congratulations, Test!
Your example.com account was activated.
----==_mimepart_4d7039f9e6967_3449482ab7831370--
Try switching the order of the parts of the message, putting the HTML part after the plain-text part. It might work :).
NOTE: I cannot remember now where I read this (or if I for sure even
did), but the reason switching might help is because I think the
preferred part of the message may be the last part.
Update: I found a place where it says that parts in a multipart MIME message should be in order of increasing preference: section 7.2.3 (edit: latest version here; thanks @ALEXintlsos!), starting with the third-to-last paragraph.
Update: Here is a quote of section 7.2.3, (see https://stackoverflow.com/help/referencing):
7.2.3 The Multipart/alternative subtype
The multipart/alternative type is syntactically identical to multipart/mixed,
but the semantics are different. In particular, each of the parts is an
"alternative" version of the same information. User agents should recognize
that the content of the various parts are interchangeable. The user agent
should either choose the "best" type based on the user's environment and
preferences, or offer the user the available alternatives. In general, choosing
the best type means displaying only the LAST part that can be displayed. This
may be used, for example, to send mail in a fancy text format in such a way
that it can easily be displayed anywhere:
From: Nathaniel Borenstein <nsb@bellcore.com>
To: Ned Freed <ned@innosoft.com>
Subject: Formatted text mail
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=boundary42
--boundary42
Content-Type: text/plain; charset=us-ascii
...plain text version of message goes here....
--boundary42
Content-Type: text/richtext
.... richtext version of same message goes here ...
--boundary42
Content-Type: text/x-whatever
.... fanciest formatted version of same message goes here
...
--boundary42--
In this example, users whose mail system understood the "text/x-whatever"
format would see only the fancy version, while other users would see only the
richtext or plain text version, depending on the capabilities of their system.
In general, user agents that compose multipart/alternative entities should
place the body parts in increasing order of preference, that is, with the
preferred format last. For fancy text, the sending user agent should put the
plainest format first and the richest format last. Receiving user agents should
pick and display the last format they are capable of displaying. In the case
where one of the alternatives is itself of type "multipart" and contains
unrecognized sub-parts, the user agent may choose either to show that
alternative, an earlier alternative, or both.
NOTE: From an implementor's perspective, it might seem more sensible to reverse
this ordering, and have the plainest alternative last. However, placing the
plainest alternative first is the friendliest possible option when
multipart/alternative entities are viewed using a non-MIME- compliant mail
reader. While this approach does impose some burden on compliant mail readers,
interoperability with older mail readers was deemed to be more important in
this case.
It may be the case that some user agents, if they can recognize more than one
of the formats, will prefer to offer the user the choice of which format to
view. This makes sense, for example, if mail includes both a nicely-formatted
image version and an easily-edited text version. What is most critical, however,
is that the user not automatically be shown multiple versions of the same data.
Either the user should be shown the last recognized version or should
explicitly be given the choice.
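To make the ordering rule from the quoted section concrete, here is a minimal sketch that builds a multipart/alternative message with the plain-text part first and the HTML part last. It uses Python's standard email package purely for illustration (the question is about Rails, whose mailer API is not shown here); the helper name build_activation_email is made up, and the addresses and body text are placeholders based on the question.

```python
from email.message import EmailMessage


def build_activation_email(sender, recipient):
    """Build a multipart/alternative message: plain text first, HTML last."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your example account was activated."
    # First part: the plainest alternative.
    msg.set_content(
        "Congratulations, Test!\nYour example.com account was activated.\n"
    )
    # Last part: the richest alternative, which clients should prefer.
    msg.add_alternative(
        "<html><body><p>Congratulations, Test!</p></body></html>",
        subtype="html",
    )
    return msg


msg = build_activation_email("info@example.com", "test@gmail.com")
print(msg.get_content_type())
print([part.get_content_type() for part in msg.iter_parts()])
```

The message in the question has the parts in the opposite order (text/html first, text/plain last), which is consistent with Gmail showing the plain-text version.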

Service Stack Json Response Contains Extra Characters

I'm converting a Web API project to ServiceStack, and in JSON responses I'm getting an extra line of text before and after the JSON content. I'm using Fiddler to capture the response.
Edited for brevity, here is an example:
18d
[{"id": ... }]
0
What are these lines? I can't find any configuration option that would keep this from happening.
Edit
I went back and started with the basic Hello ServiceStack example, and here's what I got for a response:
HTTP/1.1 200 OK
Transfer-Encoding: chunked
Content-Type: application/json; charset=utf-8
Server: Microsoft-HTTPAPI/1.0
X-Powered-By: ServiceStack/3.943 Win32NT/.NET
Date: Thu, 18 Apr 2013 15:48:49 GMT
1b
{"Result":"Hello, JRandom"}
0
I'm assuming the extra response lines are the result of the Transfer-Encoding: chunked header.
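That assumption matches how chunked transfer encoding works: each chunk is preceded by a line giving its size in hexadecimal, and a zero-size chunk ends the body. A minimal sketch of a decoder, in Python for illustration (decode_chunked is a made-up name, and trailer headers after the last chunk are ignored), shows that 1b is simply the length of the JSON that follows:

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body into the plain payload."""
    out = bytearray()
    pos = 0
    while True:
        # Each chunk starts with its size in hex, terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            # A zero-size chunk marks the end of the body.
            break
        out += body[pos:pos + size]
        # Skip the chunk data and the CRLF that follows it.
        pos += size + 2
    return bytes(out)


raw = b'1b\r\n{"Result":"Hello, JRandom"}\r\n0\r\n\r\n'
payload = decode_chunked(raw)
print(len(payload), payload.decode("utf-8"))
```

HTTP clients normally remove this framing themselves; it is visible here only because the raw response is being inspected.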

PDF links getting stuck while loading in Chrome PDF Viewer

On a page of a website we're building http://ovsd.nutrislice.com/wellness/ , pdf download links ("Download the Issue") get stuck while loading in Chrome's PDF Viewer but work in all other browsers by triggering a download. Right click + "Save as" works in Chrome. I realize Chrome is the only browser with a built-in, default pdf viewer.
I figure we can instruct people to right click and then "save as", but I wanted to see if anyone can see a problem with either the html, or in the server response, which would cause chrome to fail like that.
It's not a traditional pass-through file download sitting on a server somewhere. We use Heroku, and I'm currently storing the PDFs in the DB (I realize the downsides of this, but it was a simpler system than managing off-site files on S3 for now). I'm generating the response dynamically via a Django view, so I wonder if there's something I'm missing in the response headers.
Thanks!
Looks like a bad content-type:
Content-Type:('application/pdf', None)
Check your code where you are assigning a content-type to the response. Looks like you're sending a tuple instead of just application/pdf.
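One common way to end up with exactly this tuple in Python is passing the return value of mimetypes.guess_type straight into the header, since that function returns a (type, encoding) pair. Whether the view in question does this is an assumption; the sketch below only shows the unpacking, and content_type_for is a hypothetical helper name.

```python
import mimetypes


def content_type_for(filename, default="application/octet-stream"):
    """Return only the MIME type string for a filename."""
    # guess_type returns a (type, encoding) tuple, e.g. ('application/pdf', None).
    guessed_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to a generic type when the extension is unknown.
    return guessed_type or default


print(mimetypes.guess_type("issue.pdf"))
print(content_type_for("issue.pdf"))
```

The string returned by the helper is what belongs in the Content-Type header, not the tuple.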
Like @dgel mentioned, your content type is incorrect:
$ curl -I http://ovsd.nutrislice.com/dbfiles/cms/resources/Vol5_Issue1_5_Dos_and_Donts_for_Supermarket_Survival.pdf
HTTP/1.1 200 OK
Access-Control-Allow-Methods: POST,GET,OPTIONS,PUT,DELETE
Access-Control-Allow-Origin: *
Cache-Control: max-age=90000
Content-Type: ('application/pdf', None) # <- Incorrect
Date: Fri, 09 Nov 2012 19:25:06 GMT
Expires: Fri, 09 Nov 2012 23:20:28 GMT
Last-Modified: Thu, 08 Nov 2012 22:20:28 GMT
Server: gunicorn/0.14.6
Connection: keep-alive
Also, it might be a good idea to add a Content-Length header.