I'm looking for a library to parse HTML pages, specifically Wikipedia articles, for example: http://en.wikipedia.org/wiki/Railgun. I want to extract the article's text and images (the full-scale or original images, not the thumbnails).
Is there an HTML parser out there?
I would prefer not to use the MediaWiki API, since I can't figure out how to extract an article's text and the full-size images with it.
Thanks, and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the JSON string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the JSON.
I know that in JavaScript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
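A minimal Python 3 sketch of that request, using only the standard library (the URL is the one from the edit above; `extract_html` and `fetch_page_html` are just illustrative names):

```python
import json
from urllib.request import urlopen

API_URL = ("https://en.wikipedia.org/w/api.php"
           "?action=parse&pageid=218930&prop=text&format=json")

def extract_html(json_string):
    # Same field access as the JavaScript example: parse.text["*"]
    return json.loads(json_string)["parse"]["text"]["*"]

def fetch_page_html(url=API_URL):
    # Make the HTTP request and pull the article HTML out of the JSON body
    with urlopen(url) as resp:
        return extract_html(resp.read().decode("utf-8"))
```

Since the API returns the rendered article body, the result is already HTML.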
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
Or you could download all of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too; check the info at https://www.mediawiki.org/wiki/Extension:TextExtracts/pt. It works like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
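If you build that query in code rather than by hand, urlencode keeps the parameters readable and handles the escaping (a Python 3 sketch; the parameters mirror the example link):

```python
from urllib.parse import urlencode

params = {
    "action": "query",
    "prop": "extracts",
    "exchars": 175,
    "titles": "hello world",  # urlencode escapes the space for us
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
```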
If you need a high volume of pages, you should consider using the public dumps instead.
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML of Wikipedia articles and gives you back structured JSON objects describing the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do quick extraction of text and sections and see what things look like.
I use an API that returns JSON with prefetched data from a CMS. Part of the JSON looks like this:
"content": "<p><em>We're looking for a pro-active, analytical, commercially minded and ambitious deal-closer who loves to work - and play - hard to join our Company.
I then pass this data to a child component and render it using v-html. I expected this to output the HTML with its styling and semantics. However, it renders the HTML tags as plain text:
<p><em/>We're looking for a pro-active, analytical, commercially minded and ambitious deal-closer who loves to work - and play - hard to join our Company.
Does anyone know what I am doing wrong? Should I have parsed the JSON? Should I have decoded the raw JSON to HTML tags first?
Nothing to do with JSON; everything to do with your web service unhelpfully giving you unparsed HTML.
You're going to have to decode these HTML entities yourself.
One common trick is to feed the unparsed HTML to an off-DOM element, then read it back via textContent, which will give you the parsed version.
let p = document.createElement('p');
p.innerHTML = '&lt;p&gt;&lt;em&gt;Hello&lt;/em&gt;&lt;/p&gt;'; // entity-encoded input
console.log(p.textContent); // "<p><em>Hello</em></p>" (decoded, ready for v-html)
I would like to parse the content of a Wikipedia page, but I'm missing something that I don't understand. Can someone help me?
Example:
I have a wikipedia page:
https://it.wikipedia.org/wiki/Anni_690_a.C.
This page mentions a Chinese politician, "Jin Wen Gong".
I tried using the following web service to get the content, but the JSON contains no data about "Jin Wen Gong":
https://it.wikipedia.org/w/api.php?action=query&prop=revisions&rvlimit=1&titles=Anni_690_a.C.&rvprop=content&format=json
How do I parse Wikipedia correctly?
The part you are looking for is not directly in the contents of that page, which you can verify by starting to edit the page: you will not find any mention of Jin Wen Gong there either.
The text where you see it is generated from this piece of wiki markup:
{{Bio decennio a.C.|Morti|69}}
This template call is what appears in the JSON.
On Wikipedia, that expands to a list of people (probably people who died in the period in question, if my Italian is right).
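One way to get that expanded text is to request the rendered HTML (action=parse with prop=text) instead of the raw wikitext. A Python 3 sketch of building the request URL (`parse_url` is an illustrative name):

```python
from urllib.parse import urlencode

def parse_url(title, api="https://it.wikipedia.org/w/api.php"):
    # action=parse returns rendered HTML, in which templates like
    # {{Bio decennio a.C.|Morti|69}} have already been expanded,
    # so the generated names should appear in the result.
    query = {"action": "parse", "page": title,
             "prop": "text", "format": "json"}
    return api + "?" + urlencode(query)

url = parse_url("Anni_690_a.C.")
```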
I've been going through the docs for the past few hours and simply can't figure this out, though it's probably simple.
I have this link:
http://en.wikipedia.org/w/api.php?format=xml&action=expandtemplates&titles=Arabinose&text={{Chembox%20Elements}}&prop=wikitext
Which obviously gives me the schema of the template (Chembox | Chembox Elements, in this case).
All I want is to retrieve the molecular formula value for the given page/title, without having to parse the entire wiki content on my end.
I understand that prop=wikitext returns wikitext in the example above, and that there's no prop=text option in expandtemplates. I've been back and forth with action=query, expandtemplates, etc., with no joy.
MediaWiki's API won't do the work for you. You'll have to parse your own results.
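A rough sketch of what "parse your own results" could look like in Python 3: pull a single template parameter out of the returned wikitext with a regex. The parameter name Formula and the sample wikitext are made up for illustration (check the real template's parameter names), and nested or multi-line values would need a proper parser such as mwparserfromhell:

```python
import re

def template_param(wikitext, name):
    # Match a flat "| name = value" entry; nested templates and
    # multi-line values are beyond this simple regex.
    m = re.search(r"\|\s*%s\s*=\s*([^\n|]+)" % re.escape(name), wikitext)
    return m.group(1).strip() if m else None

# Hypothetical flat Chembox-style wikitext
sample = "{{Chembox\n| Formula = C5H10O5\n| MolarMass = 150.13\n}}"
formula = template_param(sample, "Formula")  # "C5H10O5"
```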
I'm looking for a good way to parse HTML in Clojure.
Specifically, I'm trying to fetch the content of a web page with a crawler and then extract the content of certain HTML tags or their attributes.
So I have the URL of the page and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It allows you to select and retrieve content with CSS-like selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with clj-tagsoup, but I can say that Enlive works well for most scraping tasks.
I have looked at the Readability API, which is useful for displaying data in a clean format on an HTML web page. I am passing a URL to http://www.readability.com/read?url= to display the data. Initially I am directed to a page where I can choose to view the info using Readability. Is there any way to view the content directly, in a neat fashion, without going through that redirect?
Take a look at Readability's API: http://www.readability.com/developers/api
Before you implement your code, you have to create an API key on their website.