HTML parsing in Clojure - html

I'm looking for a good way to parse HTML in Clojure.
Exactly what I'm trying to do is get content of a web page with crawler and then get content of some HTML tags or their attributes.
So I have URL to the page, and I get html as String, but how do get data I need?

Use https://github.com/cgrand/enlive
It allows you to select and retrieve with CSS-alike selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with tag-soup but I can tell that enlive works well for most scraping.

Related

Storing content to display on page

In my current application we have a service which responds with a XML.With the XML we do XSLT transformation to a HTML and display it on our web page(review,Tutorials).
My Question here is this the only way to store the content of the webpage and display it as required.
We are going to migrate to a Angular app.So do i still continue using the XSLT way or is there other better way to store the content of the page and display it.
I suggest you use native JSON communication, if possible. Or try to convert your data to a native JSON object - Angular (and javascript, really) can work much better and faster with the native JSON itself.
Although, f you want to display a html content, you can use the innerHTML tag.
If you manage to get the data in a JSON, read more and learn about the angular tehcniques of displaying data. I suggest you do the tutorial on angular.io :)

Recovering HTML rendered in browser using JSoup

Can somebody kindly suggest the proper way to use JSoup on a website like "https://network.axial.net/a/company/business-team-san-francisco/"?
This website has a lot of Javascripting, and no matter what I do {documentObj.body().data(), documentObj.html(), connectionObj.response().body(), Jsoup.connect(urlStr).userAgent("Mozilla").data("name", "jsoup") etc.}, I am not able to recover the html that is rendered in a browser.
This is not possible with JSoup. The intent of JSoup is to parse HTML only.
If you are looking for something that can evaluate Javascript to return the resulting DOM, you might want to look at either Selenium or HtmlUnit.

parsing wikipedia page content

I'm looking for a library to parse html pages, specifically wikipedia articles for example: http://en.wikipedia.org/wiki/Railgun, I want to extract the article's text and images (full scale or original image not the thumb).
Is there an html parser out there ?
I would prefer not to use the wikimedia api since I can't seem to figure out how to extract an article's text and the fullsize images with them.
Thanks and sorry for my english.
EDIT: I forgot to say that the ending result should be valid html
EDIT: I got the json string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the json.
I know that in javascript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of html/javascript and python, how can I make that http request and parse the json in python 3 ?
I think you should be able to get everything with the webapi,
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
or you could download the whole wikipedia
https://meta.wikimedia.org/wiki/Research:Data
You can get the html from the api too, check the info on https://www.mediawiki.org/wiki/Extension:TextExtracts/pt, it's like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
Depending on how many pages you'll need, you should consider using public dumps if the volume of pages is high.
I made a Node.js module called wikipedia-to-json (written in javascript) that parses the HTML in wikipedia articles and gives you back structed JSON objects that describe the layout of the article in-order. (titles, paragraphs, images, lists, sub-titles...)
That might be useful if you just want to do a quick extractions of text and sections and understand how things look like.

Objective-C event-driven HTML parsing

I need to be able to parse HTML snippets in an event-driven way. For example, if the parser finds a HTML tag, it should notify me and pass the HTML tag, value, attributes etc. to a delegate. I cannot use NSXMLParser because I have messy HTML. Is there a useful library for that?
What I want to do is parse the HTML and create a NSAttributedArray and display it in a UITextView.
YES you can parse HTML content of file.
If you want to get specific value from HTML content you need to Parce HTML content by using Hpple. Also This is documentation with exmple that are is for parse HTML. Another way is rexeg but it is more complicated so this is best way in your case.

Automate Web Applications -parsing HTML Data

I just want to automate a web application, where that application parses the HTML page and pulls all the HTML Tags inner text based on some condition like if we have a tag called Span Example has given whose class="spanclass_1"
This is span tag...
which has particular class id. so that app parses and pulls that span into it.
And here the main pain area is, I should not use the developer code to automate that same parsing the HTML.
I want to automate that parsing done correctly, simply by using the parsed data which is shown in UI.
Any help, would be great.
Appreciating your time reading this.
(Note span tag is not shown)
Thanks buddies.
not enough details.
is this html page just a file in local filesystem on it is internet webpage?
do u have access to pages? can u modify it ? if answer yes, that just add javascript to page which will extract data and post to server.
if answer not, than it depends on language u use to programm.
Find good framework to parse html. load page parse it and extract data. Several situation can be there.
Worse scenario - page generated on client side using js.
Best scenario - page is in xhtml mode( u are lucky. any xml parser will help to build dom and extract data)
So so - page is simple html format (try several html parser to find most suitable for u)