Get full href list from a QWebPage - html

I am trying to use a QWebPage (from QWebKit) to list all the href attributes from A tags with the full URL. At the moment, I do this:
QWebElementCollection collection = webPage->mainFrame()->findAllElements("a");
foreach (QWebElement element, collection)
{
QString href = element.attribute("href");
if (!href.isEmpty())
{
// Process
}
}
But the problem is that href could be a full URL, just a page, a URL with / at the front, or a URL with ../ at the front. Is there a way to parse all these different URLs to produce the full URL in a QString or a QUrl?

QWebFrame has a baseUrl function that returns a QUrl you can use to resolve the URLs in the page.
Call resolved on that base URL, passing a separate QUrl built from the href. If the href is relative, resolved returns the corresponding absolute URL; if it is already absolute, it is returned unmodified.
Here's an (untested) example based on the code you provided:
QUrl baseUrl = webPage->mainFrame()->baseUrl();
QWebElementCollection collection = webPage->mainFrame()->findAllElements("a");
foreach (QWebElement element, collection)
{
QString href = element.attribute("href");
if (!href.isEmpty())
{
QUrl relativeUrl(href);
QUrl absoluteUrl = baseUrl.resolved(relativeUrl);
// Process
}
}
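To make the four cases from the question concrete, here is a minimal sketch of the same kind of base-plus-relative resolution. It is an illustration only: it uses the standard URL class available in browsers and Node.js rather than Qt, and the base URL and hrefs are made-up examples. QUrl::resolved should behave equivalently for these inputs, but that is an assumption worth checking against your Qt version.

```typescript
// Illustration only: standard URL class, not Qt. All URLs are made-up examples.
function resolveHref(href: string, base: string): string {
  // new URL(reference, base) resolves a relative reference against the base
  // and leaves an already-absolute URL unchanged.
  return new URL(href, base).href;
}

const base = "http://example.com/docs/guide/index.html";

const resolved = [
  "http://other.example/page.html", // full URL    -> http://other.example/page.html
  "page.html",                      // just a page -> http://example.com/docs/guide/page.html
  "/top.html",                      // leading /   -> http://example.com/top.html
  "../up.html",                     // leading ../ -> http://example.com/docs/up.html
].map((href) => resolveHref(href, base));
```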

Related

Can I load data referenced by a Web Component dynamically, with caching?

I'm currently learning Web Components and I wonder if it is possible to have a Component load its own data dynamically, similar to how <img> does from its src attribute, i.e. something like this:
<my-fancy-thingy src='/stuff.json'></my-fancy-thingy>
Obviously this functionality would be useful if stuff.json could be rather large, so it should also be possible to make use of the browser's caching mechanism so the referenced file doesn't get reloaded every time we request the page, unless changed.
Can this be done?
Sure, take inspiration from <load-file> (see the Dev.to post):
/*
defining the <load-file> Web Component,
yes! the documentation is longer than the code
License: https://unlicense.org/
*/
customElements.define("load-file", class extends HTMLElement {
// declare default connectedCallback as sync so await can be used
async connectedCallback(
// attach a shadowRoot if none exists (prevents displaying error when moving Nodes)
// declare as parameter to save 4 Bytes: 'let '
shadowRoot = this.shadowRoot || this.attachShadow({mode:"open"})
) {
// load the file from src="" async, parse to text, add to shadowRoot.innerHTML
shadowRoot.innerHTML = await (await fetch(this.getAttribute("src"))).text()
// append optional <tag [shadowRoot]> Elements from inside <load-file> after the parsed content
shadowRoot.append(...this.querySelectorAll("[shadowRoot]"))
// if "replaceWith" attribute
// then replace <load-file> with the loaded content
// childNodes instead of children to include #textNodes also
this.hasAttribute("replaceWith") && this.replaceWith(...shadowRoot.childNodes)
}
})
Change .text() to .json() and it parses JSON files.
Caching can be done by storing the string in localStorage (but there is a total limit of about 5 MB, I think):
https://en.wikipedia.org/wiki/Web_storage
https://developer.mozilla.org/en-US/docs/Web/API/Window/localStorage
You need to come up with a "data has changed" strategy, because the client has no way of knowing when the data actually changed. One option is an extra semaphore file or endpoint that reports whether the (large) JSON file has changed.
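A minimal sketch of that "semaphore" idea, under stated assumptions: every name here (KeyValueStore, CachedEntry, loadWithCache, fetchVersion, fetchBody) is hypothetical, and the version check is assumed to be a cheap request that returns some string which changes whenever the large file changes. In a browser, localStorage could be passed as the store, since it has getItem and setItem. Note also that fetch goes through the browser's normal HTTP cache, so suitable response headers from the server may already avoid some re-downloads.

```typescript
// Hypothetical helper types and names; not part of any library.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

interface CachedEntry {
  version: string; // value reported by the small "has it changed?" endpoint
  body: string;    // the large payload, kept as a string
}

async function loadWithCache(
  url: string,
  store: KeyValueStore,
  fetchVersion: (url: string) => Promise<string>,
  fetchBody: (url: string) => Promise<string>,
): Promise<string> {
  // Cheap request first: ask which version the server currently has.
  const version = await fetchVersion(url);

  const raw = store.getItem(url);
  if (raw !== null) {
    const cached = JSON.parse(raw) as CachedEntry;
    if (cached.version === version) {
      return cached.body; // unchanged, so skip the large download
    }
  }

  // Missing or stale: download the payload and remember it with its version.
  const body = await fetchBody(url);
  const entry: CachedEntry = { version, body };
  store.setItem(url, JSON.stringify(entry));
  return body;
}
```

Inside a component, fetchBody could be something like `(u) => fetch(u).then((r) => r.text())`. What fetchVersion requests depends entirely on what the server offers.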
This works like a charm:
export class MonElement extends HTMLElement {
constructor(){
super();
this.attachShadow({mode:'open'});
(...)
this.shadowRoot.appendChild(atemplate);
}
connectedCallback(){...}
static get observedAttributes(){
return ['src'];
}
attributeChangedCallback(nameattr,oldval,newval)
{
if (nameattr==='src') {
this[nameattr]=newval;
// here, do the fetch for the src value (newval), then update what you have in the inner DOM
}
(...)

How to remove (anonymise) all links in dom

Is there any way to remove (anonymise, i.e. just put an empty string in place of the URL) all links in a document?
That means A tags, and any JavaScript actions that can open a new URL.
I have the HTML in a document and I am using Node.js; I can use puppeteer or some DOM tool.
With JavaScript you can do this:
var anchors = document.getElementsByTagName('a');
for (var anchor of anchors) {
anchor.href = 'javascript:void(0)';
}
I've read that 'javascript:void(0)' is the best substitute.
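The loop above only covers A tags. As a rough sketch of also handling inline handlers such as onclick, the following blanks href and strips any on* attribute. The AttributeHolder interface and the function names are hypothetical; the interface lists only the standard Element methods the code needs, so real DOM elements fit it. It does not touch handlers attached with addEventListener or code inside script elements, so treat it as a partial measure.

```typescript
// Hypothetical minimal shape; a DOM Element provides all four methods.
interface AttributeHolder {
  getAttributeNames(): string[];
  hasAttribute(name: string): boolean;
  setAttribute(name: string, value: string): void;
  removeAttribute(name: string): void;
}

function anonymiseElement(el: AttributeHolder): void {
  // Replace the link target with an empty string, as asked in the question.
  if (el.hasAttribute("href")) {
    el.setAttribute("href", "");
  }
  // Drop inline event handlers (onclick, onmouseover, ...) that could navigate.
  for (const name of el.getAttributeNames()) {
    if (/^on/i.test(name)) {
      el.removeAttribute(name);
    }
  }
}

function anonymiseAll(elements: ArrayLike<AttributeHolder>): void {
  for (let i = 0; i < elements.length; i++) {
    anonymiseElement(elements[i]);
  }
}

// In a page context (for example inside puppeteer's page.evaluate):
//   anonymiseAll(document.querySelectorAll("*"));
```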

How to create a string map "location" and "location Value Text" of this string text?

I developed a project using Angular with the ngRx framework. I used TypeScript with HTML to develop the front end. My DB stores HTML-format texts like the one below.
"<html><body>A.txt
B.txt
D.txt
www.facebook.com
</body></html>"
Previously, I rendered this text directly in the HTML file using something like <div innerHTML={{stringText}} />.
But my project uses JXBrowser, and because of its configuration a link can't be opened in the default browser just by clicking it.
To make that work, I need to take the href location as a URL and pass it to the .ts file when the link is clicked.
I thought of changing it to something like <a role="button" click='getLink(myText)'> {{getLink(value)}} </a>. To create this, I need to put the text into an array that contains the 'location' and the value, and then iterate over that array in the HTML file.
I need some expert help to do this. I am struggling to map the above text to that kind of string array (e.g. array[hrefLink][value]). I hope some expert can help me.
------------Updated---------------
Following the comment, I tried it this way and I can get the link location, but I still can't get the value.
let parser = new DOMParser();
let doc = parser.parseFromString(info, "text/html");
let x = doc.getElementsByTagName('a');
for (let i = 0; i < x.length ; i++) {
console.log(x[i].getAttribute('href'));
}
What is the value that you want? Is it the anchor text of the link?
We create an interface Link with the properties that we want from each link
interface Link {
location: string;
value: string;
}
Then we create a function that extracts all links from an html string and converts them to an array of Link objects.
function parseLinks( stringHTML: string ): Link[] {
// create a parser object
const parser = new DOMParser();
// turn the string into a Document
const doc = parser.parseFromString( stringHTML, "text/html" );
// get all links
const linkNodes = doc.getElementsByTagName('a');
// convert from HTMLCollection to array to use .map()
const linksArray = [...linkNodes];
// map from HTMLAnchorElement to Link object
return linksArray.map( element => ({
location: element.href,
value: element.innerText,
}))
}
Now you can do whatever with the links from your text
const text = `<html><body>A.txt
B.txt
D.txt
www.facebook.com
</body></html>`;
const links: Link[] = parseLinks( text );
// can use like this
links.map( ({location, value}) => {
// do something here
})

Parsing html page content without using selector

I am going to parse some web pages using a Java program. For this purpose I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that this requires an operator to find the right XPath for you (for example, using the FirePath Firefox add-on). Suppose you don't know in advance which page you will parse, or the number of sites gets too big for an operator to find the right XPath for each. In that case you need a way to parse pages without using any selector (the same scenario exists for CSS selectors), or there should be a way to find the XPath automatically. What is the method for parsing web pages in this way?
Here is the small piece of code I wrote for this purpose; please feel free to extend it when presenting your solutions.
public void downloadHTML(String url) throws IOException {
CleanerProperties props = new CleanerProperties();
// set some properties to non-default values
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);
// do parsing
TagNode tagNode = new HtmlCleaner(props).clean(
new URL(url)
);
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
tagNode, "c:\\TEMP\\clean.xml", "utf-8"
);
}
public static void testJavaxXpath(String pattern)
throws ParserConfigurationException, SAXException, IOException,
FileNotFoundException, XPathExpressionException {
DocumentBuilder b = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream(
"c:\\TEMP\\clean.xml"));
// Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate(pattern,
doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getFirstChild().getTextContent());
}
}

Use local files with Browser extensions (kango framework)

I'm working on a "browser extension" using "Kango Framework" (http://kangoextensions.com/).
When I want to link a CSS file I have to use an external source (href='http://mysite.com/folder/mysite.css'). How should I change the href to make it load from the plugin folder (e.g. href='mylocalpluginfolder/localfile.css')?
I've tried 'localfile.css' and putting the file in the same folder as the JS file.
$("head").append("");
How should I change the JSON file to make it work? Should I declare the files as "extended_scripts" or "content_scripts"?
I'm having a hard time finding support for this framework, even though the admins are awesome!
Thanks for your help. (Please do not suggest other solutions: I would not be able to code plugins for IE without it, so Kango is my only option.) I didn't find any samples matching my need, as the only example available on their site links to outside content (the Christmas tree).
If you want to add CSS to a page from a content script, you should:
1. Get the CSS file contents
2. Inject the CSS code into the page
function addStyle(cssCode, id) {
if (id && document.getElementById(id))
return;
var styleElement = document.createElement("style");
styleElement.type = "text/css";
if (id)
styleElement.id = id;
if (styleElement.styleSheet){
styleElement.styleSheet.cssText = cssCode;
}else{
styleElement.appendChild(document.createTextNode(cssCode));
}
var father = null;
var heads = document.getElementsByTagName("head");
if (heads.length>0){
father = heads[0];
}else{
if (typeof document.documentElement!='undefined'){
father = document.documentElement
}else{
var bodies = document.getElementsByTagName("body");
if (bodies.length>0){
father = bodies[0];
}
}
}
if (father!=null)
father.appendChild(styleElement);
}
var details = {
url: 'styles.css',
method: 'GET',
async: true,
contentType: 'text'
};
kango.xhr.send(details, function(data) {
var content = data.response;
kango.console.log(content);
addStyle(content);
});
I do it another way.
I have a JSON file containing the styling for the specific web sites whose CSS I want to change.
Applying CSS with jQuery's css() has an advantage: as you may know, css() applies inline styles, and inline styles take priority over the rules for classes and IDs defined in the page's own stylesheets, so they override them. I find it fine for my needs; you should try it.
Using jQuery:
// i keep info in window so making it globally accessible
function SetCSS(){
$.each(window.skinInfo.css, function(tagName, cssProps){
$(tagName).css(cssProps);
});
return;
}
// json format
{
"css":{
"body":{"backgroundColor":"#f0f0f0"},
"#main_feed .post":{"borderBottom":"1px solid #000000"}
}
}
As per the Kango framework structure, resources must be placed in the common/res directory.
Create a 'res' folder under the src/common folder.
Add your CSS file to it, and then access that file using
kango.io.getResourceUrl("res/style.css");
You must add this file to the head element of the DOM.
This is done in the following way.
// Common function to load local css into head element.
function addToHead (element) {
'use strict';
var head = document.getElementsByTagName('head')[0];
if (head === undefined) {
head = document.createElement('head');
document.getElementsByTagName('html')[0].appendChild(head);
}
head.appendChild(element);
}
// Common function to create css link element dynamically.
function addCss(url) {
var css_tag = document.createElement('link');
css_tag.setAttribute('type', 'text/css');
css_tag.setAttribute('rel', 'stylesheet');
css_tag.setAttribute('href', url);
addToHead(css_tag);
}
Then you can call the common function to add your local CSS file with the Kango API:
// Add css.
addCss(kango.io.getResourceUrl('res/style.css'));