how to parse an html source code

I'm trying to parse HTML source code; the address of the web page I'm trying to parse is in the request below. I have written the code below, but it doesn't work at the last step, where I want to pull out the content of a meta tag:
int main(int argc, char *argv[])
{
QApplication a(argc, argv);
QNetworkAccessManager manager;
QNetworkReply *reply = manager.get(QNetworkRequest(QUrl("https://www.instagram.com/p/BTwnRykl6EM/")));
QEventLoop event;
QObject::connect(reply, SIGNAL(finished()), &event, SLOT(quit()));
event.exec();
QString me = reply->readAll();
QString x;
//-------------------------------------------------------------------------------------------------------
//qDebug()<<me;
//-------------------------------------------------------------------------------------------------------
QXmlStreamReader reader(me);
if(reader.readNextStartElement()){
if(reader.name()=="html"){
while (reader.readNextStartElement()) {
if(reader.name()=="head"){
while (reader.readNextStartElement()) {
if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
x = reader.attributes().value("content").toString();
else{
qDebug()<<"why?";
reader.skipCurrentElement();
}
}
}
else
reader.skipCurrentElement();
}
}
else
reader.skipCurrentElement();
}
qDebug()<<x;
return 0;
}
This is the part that doesn't work:
if(reader.name()=="meta" && reader.attributes().hasAttribute("property") && reader.attributes().value("property").toString()=="og:image")
x = reader.attributes().value("content").toString();
else{
qDebug()<<"why?";
reader.skipCurrentElement();
}
It prints:
why?
What is wrong with my code?

HTML is not valid XML, so you can't parse it with an XML parser. Options for HTML are listed on this wiki page.
In short, you can use Qt's Scribe framework or QtWebKit to parse and render HTML automatically, or an external library to parse it manually:
libxml2 (Win, Mac, Linux)
htmlcxx (Win, Linux)
libhtml (Linux)
libxml2 and libhtml are C libraries; htmlcxx is a C++ library that lets you build a DOM tree and iterate through it.
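For a single, well-known tag such as og:image, a full HTML parser is sometimes more than is needed. The following is a minimal sketch in standard C++ (no Qt) that pulls the content attribute out with std::regex. The function name findOgImage is hypothetical, and the pattern assumes that property appears before content and that both values are double-quoted; pages that break those assumptions need one of the real HTML parsers listed above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the content attribute of the first
// <meta property="og:image" content="..."> tag, or an empty string if
// no such tag is found.
// Assumptions: "property" comes before "content" inside the tag and both
// attribute values are double-quoted.
std::string findOgImage(const std::string &html)
{
    static const std::regex pattern(
        R"re(<meta\s+[^>]*property\s*=\s*"og:image"[^>]*content\s*=\s*"([^"]*)")re",
        std::regex::icase);

    std::smatch match;
    if (std::regex_search(html, match, pattern))
        return match[1].str();   // first capture group = content value
    return std::string();
}
```

In the Qt program from the question, the downloaded body could be passed in as reply->readAll().toStdString() instead of being handed to QXmlStreamReader.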

Related

How to pass a JSON object inside a QTextEdit

I have a small GUI that I use to load/save json configuration files, the most important parameters are in the gui below:
![conf]
The problem I have been trying to solve is that I am not able to create an object inside a QTextEdit, and I am not sure why, even though I am following the official documentation on how to do that.
Below a snippet of code both for the load and save button.
Also for the sake of brevity I only kept how I did the spinbox and, of course, the textedit:
void SettingsForm::on_loadBtn_clicked()
{
// Opening file dialog....
if(listDocksConfig.isEmpty())
{
QMessageBox::information(this, tr("Message"), tr("Please Open Configuration"));
}
else
{
QJsonDocument doc;
QJsonObject obj;
QByteArray data_json;
QFile input(listDocksConfig);
if(input.open(QIODevice::ReadOnly | QIODevice::Text))
{
data_json = input.readAll();
doc = doc.fromJson(data_json);
obj = doc.object();
const double xposValue = obj["X Pos"].toDouble();
QTextEdit textEdit = QTextEdit::setText(obj["comments"]); // <- Error Here
ui->doubleSpinBox_XPos->setValue(xposValue);
ui->textEdit->setText(textEdit); // <- Error Here
}
else
{
// do something
}
}
}
void SettingsForm::on_saveBtn_clicked()
{
// saving configuration with file dialog....
if(listDocksConfig.isEmpty())
{
// do something...
}
else
{
const double xposValue = ui->doubleSpinBox_XPos->value();
QTextEdit textEdit = ui->textEdit->setPlainText(); // <- Error Here
QJsonDocument doc;
QJsonObject obj;
obj["X Pos"] = xposValue;
obj["comments"] = textEdit.toString(); // <- Error Here
doc.setObject(obj);
QByteArray data_json = doc.toJson();
QFile output(listDocksConfig);
}
}
What I have done so far:
I consulted the official documentation on how to solve this problem, but could not figure out why it was not working. I also tried an alternative such as setText, but still no luck.
I came across this source, which I used as guidance for my example; it solved almost all problems except the QTextEdit one.
This additional post was useful, but I still couldn't solve the problem.
Thanks for pointing me in the right direction.

This line is wrong:
QTextEdit textEdit = ui->textEdit->setPlainText();
setPlainText() takes a const QString &text parameter, so you can't call it without an argument (see the official documentation).
The method also returns void, i.e. nothing, so you cannot use its result to initialize a QTextEdit object.
Update:
You already have a textEdit in the layout, so there is no reason to define another one.
You can do:
ui->textEdit->setPlainText(obj["comments"].toString());

Sphinx Integration in Qt

I would like to integrate Sphinx documentation functionality to help with my Qt project. However, when including the HTML files for Sphinx, the formatting appears differently and no file links work. For example:
QFile file("/home/user1/project/Sphinx/build/html/intro.html");
if (!file.open(QIODevice::ReadOnly))
qDebug() << "Didn't open file";
QTextStream in(&file);
ui->textBrowser->setText(in.readAll());
Error: QTextBrowser: No document for _sources/intro.txt
This will cause the textBrowser to open the correct file, but will not end up displaying the page with the correct HTML coding, and will not follow the links even though those HTML files are contained in the same path (as I have copied the entire Sphinx project into the Qt project).
Is there some way to package the entire Sphinx project so that inclusion of multiple files is unnecessary or is the multiple file inclusion the way to go and I'm just handling it incorrectly?

Instead of reading all the text and setting it with setText(), use the setSource() method and pass it a QUrl built with QUrl::fromLocalFile().
main.cpp
#include <QtWidgets>
class Widget: public QWidget
{
Q_OBJECT
public:
Widget(QWidget *parent=nullptr):
QWidget(parent),
m_text_browser(new QTextBrowser)
{
m_lineedit = new QLineEdit;
auto button = new QPushButton("Load");
auto lay = new QVBoxLayout{this};
auto hlay = new QHBoxLayout;
lay->addLayout(hlay);
hlay->addWidget(m_lineedit);
hlay->addWidget(button);
lay->addWidget(m_text_browser);
connect(button, &QPushButton::clicked, this, &Widget::on_clicked);
}
private slots:
void on_clicked(){
QString fileName = QFileDialog::getOpenFileName(this,
tr("Open Image"),
QDir::homePath(),
tr("HTML Files (*.html)"));
m_lineedit->setText(fileName);
m_text_browser->setSource(QUrl::fromLocalFile(fileName));
}
private:
QTextBrowser *m_text_browser;
QLineEdit *m_lineedit;
};
int main(int argc, char *argv[])
{
QApplication a(argc, argv);
Widget w;
w.showMaximized();
return a.exec();
}
#include "main.moc"

Writing a full website to socket with microcontroller

I'm using a web server to control devices in the house with a microcontroller running .netMF (netduino plus 2). The code below writes a simple html page to a device that connects to the microcontroller over the internet.
while (true)
{
Socket clientSocket = listenerSocket.Accept();
bool dataReady = clientSocket.Poll(5000000, SelectMode.SelectRead);
if (dataReady && clientSocket.Available > 0)
{
byte[] buffer = new byte[clientSocket.Available];
int bytesRead = clientSocket.Receive(buffer);
string request =
new string(System.Text.Encoding.UTF8.GetChars(buffer));
if (request.IndexOf("ON") >= 0)
{
outD7.Write(true);
}
else if (request.IndexOf("OFF") >= 0)
{
outD7.Write(false);
}
string statusText = "Light is " + (outD7.Read() ? "ON" : "OFF") + ".";
string response = WebPage.startHTML(statusText, ip);
clientSocket.Send(System.Text.Encoding.UTF8.GetBytes(response));
}
clientSocket.Close();
}
public static string startHTML(string ledStatus, string ip)
{
string code = "<html><head><title>Netduino Home Automation</title></head><body> <div class=\"status\"><p>" + ledStatus + " </p></div> <div class=\"switch\"><p>On</p><p>Off</p></div></body></html>";
return code;
}
This works great, so I wrote a full jquery mobile website to use instead of the simple html. This website is stored on the SD card of the device and using the code below, should write the full website in place of the simple html above.
However, my problem is the netduino only writes the single HTML page to the browser, with none of the JS/CSS style files that are referenced in the HTML. How can I make sure the browser reads all of these files, as a full website?
The code I wrote to read the website from the SD is:
private static string getWebsite()
{
string text; // contents of the page read from the SD card
try
{
using (StreamReader reader = new StreamReader(@"\SD\index.html"))
{
text = reader.ReadToEnd();
}
}
catch (Exception e)
{
throw new Exception("Failed to read " + e.Message);
}
return text;
}
I replaced the string code = "..." assignment with
string code = getWebsite();

How can I make sure the browser reads all of these files, as a full website?
Isn't it already? Use an HTTP debugging tool like Fiddler. As I read from your code, your listenerSocket is supposed to listen on port 80. Your browser will first retrieve the results of the getWebsite call and parse the HTML.
Then it'll fire more requests, as it finds CSS and JS references in your HTML (none shown). These requests will, as far as we can see from your code, again receive the results of the getWebsite call.
You'll need to parse the incoming HTTP request to see what resource is being requested. It'll become a lot easier if the .NET implementation you run supports the HttpListener class (and it seems to).
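Parsing the request line, as suggested above, could look roughly like the sketch below. It is written in standard C++ purely for illustration (the Netduino code itself is C#, but the string handling ports directly). requestedPath and contentTypeFor are hypothetical helper names, and mapping "/" to "/index.html" is an assumption about how the site is laid out on the SD card.

```cpp
#include <string>

// Hypothetical helper: extracts the path from the first line of an HTTP
// request, e.g. "GET /css/site.css HTTP/1.1" -> "/css/site.css".
// Returns an empty string if the text does not look like a request line.
std::string requestedPath(const std::string &request)
{
    const std::string::size_type firstSpace = request.find(' ');
    if (firstSpace == std::string::npos)
        return std::string();
    const std::string::size_type secondSpace = request.find(' ', firstSpace + 1);
    if (secondSpace == std::string::npos)
        return std::string();

    std::string path = request.substr(firstSpace + 1, secondSpace - firstSpace - 1);

    // Drop any query string such as "?light=ON".
    const std::string::size_type query = path.find('?');
    if (query != std::string::npos)
        path.erase(query);

    // Assumption: the site root is served from index.html.
    if (path == "/")
        path = "/index.html";
    return path;
}

// Hypothetical helper: picks a Content-Type header value from the file
// extension so the browser treats CSS and JS files correctly.
std::string contentTypeFor(const std::string &path)
{
    const std::string::size_type dot = path.rfind('.');
    const std::string ext = (dot == std::string::npos) ? std::string() : path.substr(dot);
    if (ext == ".html" || ext == ".htm") return "text/html";
    if (ext == ".css") return "text/css";
    if (ext == ".js") return "application/javascript";
    return "application/octet-stream";
}
```

The server loop would then read the file that matches the returned path from the SD card and send it with the matching content type, instead of answering every request with index.html.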

Render HTML on headless server to produce screenshots

I would like to create screenshots of web pages from a given URL. While it's possible to use tools like Selenium RC, that requires a graphical environment. I am running a headless Gentoo server.
This will be part of a tool chain that works like:
Fetch URL
Render HTML
Export render as image file
Store image file

You can run an application with framebuffer X-Server like xvfb - one simple approach is a Qt based app to render the page in a webkit widget and save as an image. Here's a blog post outlining how this can be done with Python.
Here's a quick command line tool I've used with Qt. It's a while since I used it but it should still work!
#include <QtCore/QCoreApplication>
#include <QtGui>
#include <QtWebKit>
#include <QTextStream>
#include <QSize>
QWebView *view;
QString outfile;
void QWebView::loadFinished(bool ok)
{
QTextStream out(stdout);
if (!ok) {
out << "Page loading failed\n";
return;
}
view->page()->setViewportSize(view->page()->currentFrame()->contentsSize());
QImage *img = new QImage(view->page()->viewportSize(), QImage::Format_ARGB32);
QPainter *paint = new QPainter(img);
view->page()->currentFrame()->render(paint);
paint->end();
if(!img->save(outfile, "png"))
out << "Save failure\n";
QApplication::quit();
return;
}
int main(int argc, char *argv[])
{
QTextStream out(stdout);
if(argc < 3) {
out << "USAGE: " << argv[0] << " <url> <outfile>\n";
return -1;
}
outfile = argv[2];
QApplication app(argc, argv);
view = new QWebView();
view->load(QUrl(argv[1]));
return app.exec();
}

Calling wkhtmltopdf to generate PDF from HTML

I'm attempting to create a PDF file from an HTML file. After looking around a little, I've found wkhtmltopdf to be perfect. I need to call this .exe from the ASP.NET server. I've attempted:
Process p = new Process();
p.StartInfo.UseShellExecute = false;
p.StartInfo.FileName = HttpContext.Current.Server.MapPath("wkhtmltopdf.exe");
p.StartInfo.Arguments = "TestPDF.htm TestPDF.pdf";
p.Start();
p.WaitForExit();
With no success of any files being created on the server. Can anyone give me a pointer in the right direction? I put the wkhtmltopdf.exe file at the top level directory of the site. Is there anywhere else it should be held?
Edit: If anyone has better solutions to dynamically create pdf files from html, please let me know.

Update:
My answer below creates the PDF file on disk. I then streamed that file to the user's browser as a download. Consider using something like Hath's answer below to get wkhtmltopdf to output to a stream instead and then send that directly to the user; that will bypass lots of issues with file permissions etc.
My original answer:
Make sure you've specified an output path for the PDF that is writeable by the ASP.NET process of IIS running on your server (usually NETWORK_SERVICE I think).
Mine looks like this (and it works):
/// <summary>
/// Convert Html page at a given URL to a PDF file using open-source tool wkhtml2pdf
/// </summary>
/// <param name="Url"></param>
/// <param name="outputFilename"></param>
/// <returns></returns>
public static bool HtmlToPdf(string Url, string outputFilename)
{
// assemble destination PDF file name
string filename = ConfigurationManager.AppSettings["ExportFilePath"] + "\\" + outputFilename + ".pdf";
// get proj no for header
Project project = new Project(int.Parse(outputFilename));
var p = new System.Diagnostics.Process();
p.StartInfo.FileName = ConfigurationManager.AppSettings["HtmlToPdfExePath"];
string switches = "--print-media-type ";
switches += "--margin-top 4mm --margin-bottom 4mm --margin-right 0mm --margin-left 0mm ";
switches += "--page-size A4 ";
switches += "--no-background ";
switches += "--redirect-delay 100";
p.StartInfo.Arguments = switches + " " + Url + " " + filename;
p.StartInfo.UseShellExecute = false; // needs to be false in order to redirect output
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.RedirectStandardError = true;
p.StartInfo.RedirectStandardInput = true; // redirect all 3, as it should be all 3 or none
p.StartInfo.WorkingDirectory = StripFilenameFromFullPath(p.StartInfo.FileName);
p.Start();
// read the output here...
string output = p.StandardOutput.ReadToEnd();
// ...then wait n milliseconds for exit (as after exit, it can't read the output)
p.WaitForExit(60000);
// read the exit code, close process
int returnCode = p.ExitCode;
p.Close();
// if 0 or 2, it worked (not sure about other values, I want a better way to confirm this)
return (returnCode == 0 || returnCode == 2);
}

I had the same problem when I tried using MSMQ with a Windows service, but it was very slow for some reason (the process part).
This is what finally worked:
private void DoDownload()
{
var url = Request.Url.GetLeftPart(UriPartial.Authority) + "/CPCDownload.aspx?IsPDF=False&UserID=" + this.CurrentUser.UserID.ToString();
var file = WKHtmlToPdf(url);
if (file != null)
{
Response.ContentType = "Application/pdf";
Response.BinaryWrite(file);
Response.End();
}
}
public byte[] WKHtmlToPdf(string url)
{
var fileName = " - ";
var wkhtmlDir = "C:\\Program Files\\wkhtmltopdf\\";
var wkhtml = "C:\\Program Files\\wkhtmltopdf\\wkhtmltopdf.exe";
var p = new Process();
p.StartInfo.CreateNoWindow = true;
p.StartInfo.RedirectStandardOutput = true;
p.StartInfo.RedirectStandardError = true;
p.StartInfo.RedirectStandardInput = true;
p.StartInfo.UseShellExecute = false;
p.StartInfo.FileName = wkhtml;
p.StartInfo.WorkingDirectory = wkhtmlDir;
string switches = "";
switches += "--print-media-type ";
switches += "--margin-top 10mm --margin-bottom 10mm --margin-right 10mm --margin-left 10mm ";
switches += "--page-size Letter ";
p.StartInfo.Arguments = switches + " " + url + " " + fileName;
p.Start();
//read output
byte[] buffer = new byte[32768];
byte[] file;
using(var ms = new MemoryStream())
{
while(true)
{
int read = p.StandardOutput.BaseStream.Read(buffer, 0,buffer.Length);
if(read <=0)
{
break;
}
ms.Write(buffer, 0, read);
}
file = ms.ToArray();
}
// wait or exit
p.WaitForExit(60000);
// read the exit code, close process
int returnCode = p.ExitCode;
p.Close();
return returnCode == 0 ? file : null;
}
Thanks Graham Ambrose and everyone else.

OK, so this is an old question, but an excellent one. And since I did not find a good answer, I made my own :) Also, I've posted this super simple project to GitHub.
Here is some sample code:
var pdfData = HtmlToXConverter.ConvertToPdf("<h1>SOO COOL!</h1>");
Here are some key points:
No P/Invoke
No creating of a new process
No file-system (all in RAM)
Native .NET DLL with intellisense, etc.
Ability to generate a PDF or PNG (HtmlToXConverter.ConvertToPng)

Check out the C# wrapper library (using P/Invoke) for the wkhtmltopdf library: https://github.com/pruiz/WkHtmlToXSharp

There are many reasons why this is generally a bad idea. How are you going to control the executables that get spawned off but end up living on in memory if there is a crash? What about denial-of-service attacks, or if something malicious gets into TestPDF.htm?
My understanding is that the ASP.NET user account will not have the rights to logon locally. It also needs to have the correct file permissions to access the executable and to write to the file system. You need to edit the local security policy and let the ASP.NET user account (maybe ASPNET) logon locally (it may be in the deny list by default). Then you need to edit the permissions on the NTFS filesystem for the other files. If you are in a shared hosting environment it may be impossible to apply the configuration you need.
The best way to use an external executable like this is to queue jobs from the ASP.NET code and have some sort of service monitor the queue. If you do this you will protect yourself from all sorts of bad things happening. The maintenance issues with changing the user account are not worth the effort in my opinion, and whilst setting up a service or scheduled job is a pain, it's just a better design. The ASP.NET page should poll a result queue for the output and you can present the user with a wait page. This is acceptable in most cases.

You can tell wkhtmltopdf to send its output to stdout by specifying "-" as the output file.
You can then read the output from the process into the response stream and avoid the permissions issues with writing to the file system.
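As a rough illustration of reading the PDF from standard output, here is a small POSIX C++ sketch that runs a command with popen and collects everything it writes to stdout. readCommandOutput is a hypothetical helper; popen is not available under that name on Windows (where _popen, or the Process class used in the other answers, would take its place), and the command string is passed to the shell unchanged, so it must never be built from untrusted input.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: runs `command` through the shell and returns all
// bytes the child process writes to its standard output.
// POSIX only (popen/pclose). The command is NOT sanitized.
std::vector<unsigned char> readCommandOutput(const std::string &command)
{
    FILE *pipe = popen(command.c_str(), "r");
    if (!pipe)
        throw std::runtime_error("popen failed for: " + command);

    std::vector<unsigned char> data;
    unsigned char buffer[4096];
    std::size_t bytesRead = 0;
    // Read until the child closes its end of the pipe.
    while ((bytesRead = std::fread(buffer, 1, sizeof buffer, pipe)) > 0)
        data.insert(data.end(), buffer, buffer + bytesRead);

    pclose(pipe);   // exit status is ignored in this sketch
    return data;
}
```

A call such as readCommandOutput("wkhtmltopdf -q page.html -") would then return the PDF bytes without touching the file system for the output, assuming wkhtmltopdf is on the PATH and page.html exists.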

My take on this with 2018 stuff.
I am using async. I am streaming to and from wkhtmltopdf. I created a new StreamWriter because wkhtmltopdf is expecting utf-8 by default but it is set to something else when the process starts.
I didn't include a lot of arguments since those vary from user to user. You can add what you need using additionalArgs.
I removed p.WaitForExit(...) since I wasn't handling if it fails and it would hang anyway on await tStandardOutput. If timeout is needed, then you would have to call Wait(...) on the different tasks with a cancellationtoken or timeout and handle accordingly.
public async Task<byte[]> GeneratePdf(string html, string additionalArgs)
{
ProcessStartInfo psi = new ProcessStartInfo
{
FileName = @"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe",
UseShellExecute = false,
CreateNoWindow = true,
RedirectStandardInput = true,
RedirectStandardOutput = true,
RedirectStandardError = true,
// "- -" means: read the HTML from stdin, write the PDF to stdout
Arguments = "-q -n " + additionalArgs + " - -"
};
using (var p = Process.Start(psi))
using (var pdfStream = new MemoryStream())
using (var utf8Writer = new StreamWriter(p.StandardInput.BaseStream,
Encoding.UTF8))
{
await utf8Writer.WriteAsync(html);
utf8Writer.Close();
var tStandardOutput = p.StandardOutput.BaseStream.CopyToAsync(pdfStream);
var tStandardError = p.StandardError.ReadToEndAsync();
await tStandardOutput;
string errors = await tStandardError;
if (!string.IsNullOrEmpty(errors)) { /* deal/log with errors */ }
return pdfStream.ToArray();
}
}
Things I haven't included in there but could be useful if you have images, css or other stuff that wkhtmltopdf will have to load when rendering the html page:
you can pass the authentication cookie using --cookie
in the header of the html page, you can set the base tag with href pointing to the server and wkhtmltopdf will use that if need be

Thanks for the question / answer / all the comments above. I came upon this when I was writing my own C# wrapper for WKHTMLtoPDF and it answered a couple of the problems I had. I ended up writing about this in a blog post - which also contains my wrapper (you'll no doubt see the "inspiration" from the entries above seeping into my code...)
Making PDFs from HTML in C# using WKHTMLtoPDF
Thanks again guys!

The ASP.NET process probably doesn't have write access to the directory.
Try telling it to write to %TEMP%, and see if it works.
Also, make your ASP.NET page echo the process's stdout and stderr, and check for error messages.

Generally, the return code is 0 if the PDF file was created correctly. If it was not created, the value is negative.

using System;
using System.Diagnostics;
using System.Web;
public partial class pdftest : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
}
private void fn_test()
{
try
{
string url = HttpContext.Current.Request.Url.AbsoluteUri;
Response.Write(url);
ProcessStartInfo startInfo = new ProcessStartInfo();
startInfo.FileName =
@"C:\PROGRA~1\WKHTML~1\wkhtmltopdf.exe";//"wkhtmltopdf.exe";
startInfo.Arguments = url + @" C:\test"
+ Guid.NewGuid().ToString() + ".pdf";
Process.Start(startInfo);
}
catch (Exception ex)
{
string xx = ex.Message.ToString();
Response.Write("<br>" + xx);
}
}
protected void btn_test_Click(object sender, EventArgs e)
{
fn_test();
}
}
