How to read a specific HTML part in a C++ program? - html

I am a complete beginner. I have a C++ command-line program that reads an HTML/XML input file through an ifstream file("C:/.......").
The main problem is that I want to take some text from this file and put it into a variable. The difficulty is that I only want part of the file: for instance, in var1 I want the text between the <name> tags, or between whichever tags I choose.
I have already tried getline with a condition and moving the file cursor, but I get either all the text or nothing.
// here is the part of the code that I'm sure of
string info(""), line(""), system("");
ifstream file("C:/Users/[...]/file.xml");
if (file.is_open())
{
    while (getline(file, line))
    {
        cout << line << endl;
    }
    file.close();
}
else
    cout << "file is not open" << endl;
Then I use the variable that holds the text. Sorry for any English or code mistakes, and thanks in advance for any clues.

Related

How to load .txt file into separate .html files? c++

I need to read an HTML file and then separate specific parts of it into individual HTML files.
For example:
<html lang="en">
<head></head>
<body>
<ul>something 123</ul>
<p>something else 123</p>
<p>blabla</p>
<table>example</table>
</body>
</html>
Everything between <ul> and </ul> should be saved in another HTML file, same with everything between <p> and </p>.
I need to use the <fstream> library, and I do not know how to use vectors, so I probably need to do this without them unless there's a simple solution.
The main problem, for now, is: how do I read a file until a string is found?
For example, once string table = "<table>" is found, the program saves everything after <table> until it finds string end_table = "</table>".
Thanks for your help.
You can use find to locate the opening and closing body tags, as in the following:
#include <iostream>
#include <string>
using namespace std;
int main(int argc, char* argv[]) {
    string line = "some line with <body> in it";
    string bodytag = "<body>";
    if(line.find(bodytag) != string::npos) {
        cout << "found" << endl;
    }
    return 0;
}
Then just read lines in from the file until you find the <body> tag and output them until you find the </body> tag. You might need to modify this if content that needs to be saved appears after the opening body tag or before the closing body tag on the same line. Your input doesn't contain this, so this isn't likely a problem.

How to parse an HTML file with Qt?

The goal is to obtain a QDomDocument or something similar with the content of an HTML (not XML) document.
The problem is that some tags, especially script, trigger errors:
<!DOCTYPE html>
<html>
<head>
<script type="text/javascript">
var a = [1,2,3];
var b = (2<a.length);
</script>
</head>
<body/>
</html>
Not well formed: Element type "a.length" must be followed by either attribute specifications, ">" or "/>".
I understand that HTML is not the same as XML, but it seems reasonable that Qt has a solution for this:
Setting the parser to accept HTML
Another class for HTML
A way to treat the content of some tags as CDATA.
My current attempt only achieves normal XML parsing:
QString mainHtml;
{
    QFile file("main.html");
    if (!file.open(QIODevice::ReadOnly)) qDebug() << "Error reading file main.html";
    QTextStream stream(&file);
    mainHtml = stream.readAll();
    file.close();
}
QDomDocument doc;
QString errStr;
int errLine=0, errCol=0;
doc.setContent( mainHtml, false, &errStr, &errLine, &errCol);
if (!errStr.isEmpty())
{
    qDebug() << errStr << "L:" << errLine << ":" << errCol;
}
std::function<void(const QDomElement&, int)> printTags=
[&printTags](const QDomElement& elem, int tab)
{
    QString space(3*tab, ' ');
    QDomNode n = elem.firstChild();
    for( ;!n.isNull(); n=n.nextSibling())
    {
        QDomElement e = n.toElement();
        if(e.isNull()) continue;
        qDebug() << space + e.tagName();
        printTags( e, tab+1);
    }
};
printTags(doc.documentElement(), 0);
Note: I would like to avoid including the full webkit for this.
I recommend using htmlcxx. It is licensed under the LGPL and works on Linux and Windows. On Windows, compile it with MSYS.
To compile it just extract the files and run
./configure --prefix=/usr/local/htmlcxx
make
make install
In your .pro file add the include and library directory.
INCLUDEPATH += /usr/local/htmlcxx/include
LIBS += -L/usr/local/htmlcxx/lib -lhtmlcxx
Usage example
#include <iostream>
#include "htmlcxx/html/ParserDom.h"
#include <stdlib.h>
#include <strings.h> // strcasecmp (POSIX)
int main (int argc, char *argv[])
{
    using namespace std;
    using namespace htmlcxx;
    //Parse some html code
    string html = "<html><body>heymyhome</body></html>";
    HTML::ParserDom parser;
    tree<HTML::Node> dom = parser.parseTree(html);
    //Print whole DOM tree
    cout << dom << endl;
    //Dump all links in the tree
    tree<HTML::Node>::iterator it = dom.begin();
    tree<HTML::Node>::iterator end = dom.end();
    for (; it != end; ++it)
    {
        if (strcasecmp(it->tagName().c_str(), "A") == 0)
        {
            it->parseAttributes();
            cout << it->attribute("href").second << endl;
        }
    }
    //Dump all text of the document
    it = dom.begin();
    end = dom.end();
    for (; it != end; ++it)
    {
        if ((!it->isTag()) && (!it->isComment()))
        {
            cout << it->text() << " ";
        }
    }
    cout << endl;
    return 0;
}
Credits for the example:
https://github.com/bbxyard/sdk/blob/master/examples/htmlcxx/htmlcxx-demo.cpp
You can't use an XML parser for HTML. Either use htmlcxx or convert the HTML to valid XML first; then you are free to use QDomDocument, the Qt XML parsers, etc.
QWebEngine also has parsing functionality, but it adds a large overhead to the application.

How to remove duplicate WORDS in Notepad++?

I have a big text file that looks like:
Mitchel-2
Anna-2
Witold-4
Serena-3
Serena-9
Witros-3
I need the first word before "-" to never repeat. Is there any way to remove all lines for a word except the first one? So if I have 3000 lines starting with "Serena", each with a different number after "-", is there a way to remove 2999 of them and keep just the first?
Also, Serena is just an example; I have over 200 other words that are duplicated.
I don't think you can do it with Notepad++. You could use a regex for every name, but since you have over 200, that would be impractical.
But you can write a program that does it for you. Basically, it takes 2 steps:
1) Search for every unique name and save it in a set (which doesn't allow duplicate entries).
2) For every unique name in the set, search for its duplicates in the file.
I've written a simple C++ program that finds the duplicates in a string variable. You can adapt it to a language of your preference. I compiled it with Microsoft Visual Studio Community 2015 (it doesn't work in cpp.sh).
#include "stdafx.h"
#include <regex>
#include <string>
#include <iostream>
#include <set>
using namespace std;
int main()
{
    typedef match_results<const char*> cmatch;
    set<string> names;
    string notepad_text = "Serena-1\nSerena-2\nSerena-3\nSerena-4\nAna-1\nSerena-5\nWilson-1\nAna-2\nJohn-1\nAna-3\nJohn-2\nWilson-2";
    regex regex_find_names("^\\w+"); //double slashes are needed because this is in a string
    // 1) Let's find every name
    //sregex_iterator it_beg(notepad_text.begin(), notepad_text.end(), regex_find_names);
    sregex_iterator find_names_itit(notepad_text.begin(), notepad_text.end(), regex_find_names);
    sregex_iterator it_end; //defaults to the end condition
    while (find_names_itit != it_end) {
        names.insert(find_names_itit->str()); //the set silently ignores duplicates
        ++find_names_itit;
    }
    // 2) For demonstration purposes, let's print what we've found
    cout << "---printing the names we've found:\n\n";
    set<string>::const_iterator names_it; // declare an iterator
    names_it = names.begin(); // assign it to the start of the set
    while (names_it != names.end()) // while it hasn't reached the end
    {
        cout << *names_it << " ";
        ++names_it;
    }
    // 3) Let's find the duplicates
    cout << "\n\n---printing the regex matches:\n";
    string current_name;
    set<string>::const_iterator current_name_it; //this iterates over every name we've found
    current_name_it = names.begin();
    while (current_name_it != names.end())
    {
        // we're building something like "^Serena.*"
        current_name = "^";
        current_name += *current_name_it;
        current_name += ".*";
        cout << "\n-Lets find duplicates of: " << *current_name_it << endl;
        ++current_name_it;
        // let's iterate through the matches
        regex regex_obj(current_name); //compile the pattern we've just built
        sregex_iterator it_beg(notepad_text.begin(), notepad_text.end(), regex_obj); //stays on the first match
        sregex_iterator it(notepad_text.begin(), notepad_text.end(), regex_obj); //this iterates over the match results
        sregex_iterator it_end;
        //string res = *it;
        while (it != it_end) {
            if (it != it_beg) //skip the first match, print only the duplicates
            {
                cout << it->str() << endl;
            }
            ++it;
        }
    }
    int i; //depending on the compiler, reading this extra input is necessary to keep the console window open
    cin >> i;
    return 0;
}
Input string is:
Serena-1
Serena-2
Serena-3
Serena-4
Ana-1
Serena-5
Wilson-1
Ana-2
John-1
Ana-3
John-2
Wilson-2
Here it prints
---printing the names we've found:
Ana John Serena Wilson
---printing the regex matches:
-Lets find duplicates of: Ana
Ana-2
Ana-3
-Lets find duplicates of: John
John-2
-Lets find duplicates of: Serena
Serena-2
Serena-3
Serena-4
Serena-5
-Lets find duplicates of: Wilson
Wilson-2

Using QWebkit to retrieve divs with a specific class

I posted the question below, trying to use the QDomDocument classes. I was advised to use the QWebkit instead, but I'm very confused how to do what I need to do with QWebkit. I've never used it before so I'm rather unsure with it. Could anyone please offer any advice? Thanks!
For the record, the function is using a QByteArray that when translated to text is a standard HTML file.
ORIGINAL QUESTION:
I have several divs in an HTML file with different classes, like this:
<div class='A'>...</div>
<div class='B'>...</div>
<div class='C'>...</div>
I have a Qt (4.7) program where I need to be able to get a certain div out of this based on its class. I need to use QDomDocument in this program. I know from the documentation that the class has a function elementById(), but I can't get that to work with classes, just ids. This isn't an HTML file I made, so I don't have any control over whether it uses class or id. Is there a way to do this that I'm missing? Thanks!
.pro
QT += webkitwidgets
main.cpp
#include <QApplication>
#include <QDebug>
#include <QWebView>
#include <QWebFrame>
#include <QWebElement>
int main( int argc, char *argv[] ) {
    QApplication a(argc, argv);
    QString l_html( "<html><body>"
                    "<div class='A'>div with class A</div>"
                    "<div class='B'>div with class B</div>"
                    "<div class='C'>div with class C</div>"
                    "<span class='A'>span with class A</span>"
                    "</body></html>" );
    QWebView l_webView; // you can skip the QWebView if you don't want to show any widget
    l_webView.page()->mainFrame()->setHtml( l_html );
    QWebElement l_root( l_webView.page()->mainFrame()->documentElement() );
    QWebElementCollection l_elements( l_root.findAll( ".A" ) );
    foreach ( QWebElement l_e, l_elements ) {
        // do what you want here
    }
    return a.exec();
}

Scraped HTML is not written at the beginning of text file

Currently, I'm scraping the HTML code of a page, and writing it to a text file.
My problem is: why are there empty spaces or empty lines at the beginning? The HTML written to the text file does not start at the beginning of the file, meaning the '<' is not at position 0.
After a few runs, my HTML is always written a few lines down in the text file.
Can anyone tell me why?
Below is my code. I'm using Visual C++.
UINT32 LOG(wstring log, UINT32 flag)
{
    wfstream file (LOG_FILE, ios_base::app);
    file << log;
    file.close();
    return 1;
}
My problem is that the HTML code copied to my text file always starts a couple of lines down, and only then comes the first '<'. What I want is for the HTML's first '<' to be written at position 0 of my text file :)
Below is the fuller version of the function, with the flag check:
UINT32 LOG(wstring log, UINT32 flag)
{
    if(flag == 0)
    {
        wfstream file (LOG_FILE, ios_base::app);
        if (file.is_open())
        {
            file << log <<endl;
            file.close();
            wcout << endl << log << endl;
            return 0;
        }
        else wcout << "\nUnable to open LOG file\n";
    }
    return 1;
}