How to parse div > main > div - html

[Image of the source code I want to parse]
How to parse div class="flight-selector-listing"?
How to open "main[ui-view]" and go further down?
But all I have is
Element masthead = doc.select("div.FR>main[ui-view]").first();
and the output is:
<main ui-view="mainView"></main>

How to parse div class="flight-selector-listing"?
Use this CSS query:
div.flight-selector-listing
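For example, a minimal Jsoup sketch using that query (the URL below is a placeholder, not the asker's real page):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FlightListingExample {
    public static void main(String[] args) throws Exception {
        // placeholder URL; replace with the page you are actually scraping
        Document doc = Jsoup.connect("https://example.com/flights").get();
        for (Element listing : doc.select("div.flight-selector-listing")) {
            System.out.println(listing.text());
        }
    }
}

This only finds the element if it is present in the raw HTML the server returns; see the next point.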
How to open "main[ui-view]" and go further down?
Jsoup is an HTML parser; it won't be able to "open" anything. If you want to "open" main[ui-view] and interact with the page, use a tool like HtmlUnit, Selenium, or ui4j.
(...) and the output is:
I bet div.FR>main[ui-view] is populated by some JavaScript code running on the page. If that is the case, Jsoup can't help here.
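If it really is JavaScript-rendered, one option (just a sketch, not the only way) is to let Selenium load the page and then hand the rendered HTML back to Jsoup. The URL below is a placeholder and the wait call assumes the Selenium 4 API:

import java.time.Duration;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RenderedPageExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();                 // requires chromedriver to be installed
        try {
            driver.get("https://example.com/flights");         // placeholder URL
            // wait until the JavaScript has filled in the listing (Selenium 4 wait API)
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("div.flight-selector-listing")));
            // hand the rendered DOM back to Jsoup for the usual selector work
            Document doc = Jsoup.parse(driver.getPageSource());
            System.out.println(doc.select("div.flight-selector-listing").size() + " listings found");
        } finally {
            driver.quit();
        }
    }
}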

Related

Scraping HTML elements between ::before and ::after with scrapy and xpath

I am trying to scrape some links from a webpage in Python with Scrapy and XPath, but the elements I want to scrape sit between ::before and ::after, so XPath can't see them: they do not exist in the HTML but are created dynamically with JavaScript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc
I can't replicate your exact document state.
However, if you load the page you can see that some template markup is loaded in the same format as your example data.
Also, if you check the XHR tab of the network inspector, you can see that AJAX requests for JSON data are being made.
So you can download all of the data you are looking for in handy JSON format over here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.body_as_unicode())
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
Very very easy!
You just use "absolute XPath" and "relative XPath" (https://www.guru99.com/xpath-selenium.html) together. With this trick you can get past ::before (and maybe ::after). For example, in your case (I assumed that //div[@id='"+FindField+"']//following::td[@class='KKKK'] comes before your "div"):
FindField = 'the "id" associated with the "div"'
driver.find_element_by_xpath("//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
NOTE: only a single "/" must be used before the final div.
You can also use only "absolute XPath" for the whole address (note: it must start with "//").
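For what it's worth, the same //following:: trick also works from Selenium's Java bindings. This is only a sketch; the URL, the id value and the td class are placeholders carried over from the example above:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class XPathFollowingExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com");   // placeholder URL
            String findField = "your-div-id";    // placeholder: the id associated with the div
            // anchor on the div with the known id, then step forward to the td and its child div
            WebElement cell = driver.findElement(By.xpath(
                    "//div[@id='" + findField + "']//following::td[@class='KKKK']/div"));
            System.out.println(cell.getText());
        } finally {
            driver.quit();
        }
    }
}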

How do I get Mithril.js v0.2.5 to render raw HTML extracted from json? [duplicate]

Suppose I have a string <span class="msg">Text goes here</span>. I need to use this string as an HTML element in my webpage. Any ideas on how to do it?
Mithril provides the m.trust method for this. At the place in your view where you want the HTML output, write m.trust( '<span class="msg">Text goes here</span>' ) and you should be sorted.
Mithril is powerful thanks to its virtual DOM. In the view, if you want to create an HTML element you use:
m("tagname.cssClass", "value");
So in your case:
m("span.msg" , "Text goes here");
Try creating a container you wish to hold your span in.
1. Use jQuery to select it.
2. On that selection, call the jQuery .html() method, and pass in your HTML string.
($('.container').html(/* string goes here */), for example)
You should be able to assign the inner HTML of the container with the string, resulting in the HTML element you want.
See the jQuery .html() documentation for details.

Meteor {{#markdown}}

I am making a forum with markdown support.
I've been using Meteor's markdown parser {{#markdown}} and have found something disturbing that I can't seem to figure out.
I am using {{#markdown}}{{content}}{{/markdown}} to render the content inserted into database.
The disturbing thing is that if someone writes HTML in the content without putting it inside a code block, for example:
<div class = "col-md-12">
Content Here
</div>
This will render as a column. They could also make buttons and etc through writing the HTML for it.
How do I disable this behaviour so that written HTML is not rendered, but simply shown as text?
You can write a global helper which will strip all HTML tags:
function stripHTML(string) {
    var s = string.replace(/(<([^>]+)>)/ig, '');
    return s;
}
Template.registerHelper('stripHTML', stripHTML)
Usage :
{{#markdown}}{{stripHTML content}}{{/markdown}}
Test it in console:
stripHTML("<div>Inside dive</div> Text outside")

How can I use HTML code in a message in messages.properties in Grails?

My code (a line in a Grails GSP):
<h3><g:message code="view.hello"/><span style="color:orange"><g:message code="view.world"/></span></h3>
Output:
HelloWorld (with "World" in orange)
But I don't like how I wrote this line of code. I wish to do something like this instead.
The code I want:
messages.properties (string and HTML together):
view.helloworld=hello <span style="color:orange">World</span>
But the output is:
hello <span style="color:orange">World</span>
The HTML code is not interpreted. How can I resolve this?
Can be done as below:
//messages.properties
view.helloworld=Hello <span style="color:orange">World</span>
//gsp
<h3><g:message code="view.helloworld"/></h3>
You can add any HTML to messages and render the message in the view quite easily. The answer below works in Grails 3.0.1:
${raw(message(code:"view.hello"))}
You can use the same approach for any HTML-like string.

How do you parse a web page and extract all the href links?

I want to parse a web page in Groovy and extract all of the href links and the text associated with them.
If the page contained these links:
<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>
the output would be:
Google, http://www.google.com<br />
Apple, http://www.apple.com
I'm looking for a Groovy answer. AKA. The easy way!
Assuming well-formed XHTML, slurp the xml, collect up all the tags, find the 'a' tags, and print out the href and text.
input = """<html><body>
<a href="http://example.com/john">John</a>
<a href="http://www.google.com">Google</a>
<a href="http://stackoverflow.com">StackOverflow</a>
</body></html>"""   // hrefs are placeholders for illustration
doc = new XmlSlurper().parseText(input)
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}
A quick Google search turned up a nice-looking possibility: TagSoup.
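TagSoup is a Java SAX-style parser that tolerates broken HTML (it can also be plugged into Groovy's XmlSlurper). A minimal Java sketch, assuming the TagSoup jar is on the classpath, that prints the href of every anchor it finds:

import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupLinks {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href=\"http://www.google.com\">Google</body></html>"; // note: <a> never closed
        XMLReader reader = new Parser();   // TagSoup's lenient SAX parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("a".equalsIgnoreCase(local)) {
                    System.out.println(attrs.getValue("href"));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(html)));
    }
}

Collecting the link text as well would need a characters() override in the handler.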
I don't know Java, but I think that XPath is far better than classic regular expressions for getting one (or more) HTML elements.
It is also easier to write and to read.
<html>
  <body>
    <a href="...">1</a>
    <a href="...">2</a>
    <a href="...">3</a>
  </body>
</html>
With the HTML above, the expression "/html/body/a" will list all the a elements, from which you can read the href attributes.
Here's a good step by step tutorial http://www.zvon.org/xxl/XPathTutorial/General/examples.html
Use XmlSlurper to parse the HTML as an XML document, then use the find method with an appropriate closure to select the a tags, and then use the list method on GPathResult to get a list of them. You should then be able to extract the text as children of the GPathResult.
Try a regular expression. Something like this should work:
(html =~ /<a.*?href='(.*?)'.*?>(.*?)<\/a>/).each { match, url, text ->
    // do something with url and text
}
Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.
Parsing with XmlSlurper only works if the HTML is well-formed.
If your HTML page has non-well-formed tags, then use a regex to parse the page.
Example: <a href="www.google.com">
Here, the 'a' tag is not closed and thus the markup is not well-formed.
new URL(url).eachLine { line ->
    (line =~ /.*<A HREF="(.*?)">/).each { match, href ->
        // process each href
    }
}
HTML parser + regular expressions
Any language would do it, though I'd say Perl is the fastest solution.