I use an API that returns JSON with prefetched data from a CMS. Part of the JSON looks like this:
"content": "&lt;p&gt;&lt;em&gt;We're looking for a pro-active, analytical, commercially minded and ambitious deal-closer who loves to work - and play - hard to join our Company.
I then pass this data to a child component and render it using v-html. I expected this to output the HTML with its styling and semantics. However, the HTML tags are rendered as plain text:
<p><em/>We're looking for a pro-active, analytical, commercially minded and ambitious deal-closer who loves to work - and play - hard to join our Company.
Does anyone know what I am doing wrong? Should I have parsed the JSON? Should I have decoded the raw JSON to HTML tags first?
Nothing to do with JSON; everything to do with your web service unhelpfully giving you unparsed HTML.
You're going to have to decode these HTML entities yourself.
One common trick is to feed the unparsed HTML to an off-DOM element, then read it back via textContent, which will give you the parsed version.
let p = document.createElement('p');
p.innerHTML = '&lt;p&gt;'; // entity-encoded input
console.log(p.textContent); // "<p>"
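The DOM trick above needs a browser. Where no `document` is available (for example, when rendering on a server), the same decoding can be approximated by hand. The following is a minimal sketch, not a complete decoder: `decodeEntities` is a hypothetical helper name, the sample string is made up to resemble the payload in the question, and the code assumes that only the five predefined entities plus numeric character references need handling. Any other named entity is left untouched.

```javascript
// Hypothetical helper: decodes a small, fixed set of HTML entities without a DOM.
// Assumption: the input only uses the five predefined entities and numeric references.
const NAMED_ENTITIES = new Map([
  ['lt', '<'],
  ['gt', '>'],
  ['amp', '&'],
  ['quot', '"'],
  ['apos', "'"],
]);

function decodeEntities(text) {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" rather than "<".
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave references outside the Unicode range as they were.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown named entities are returned unchanged.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

// Made-up sample resembling the "content" field from the question.
const encoded = '&lt;p&gt;&lt;em&gt;We&#39;re looking for a deal-closer.&lt;/em&gt;&lt;/p&gt;';
console.log(decodeEntities(encoded)); // <p><em>We're looking for a deal-closer.</em></p>
```

The decoded string can then be passed to v-html. Since v-html inserts markup verbatim, this is only appropriate for content from a source you trust.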
I'm consuming an API that returns encoded HTML entities in the response (I don't have access to the code).
This response contains text that will later be used in the HTML, such as inside
<p>, <h1> ...
I would like to know the best way to decode all the HTML entities on the document.
All the solutions I found tell me to decode each API response, but that isn't scalable in my case.
Btw, I'm using React.
I'm looking for a library to parse HTML pages, specifically Wikipedia articles (for example: http://en.wikipedia.org/wiki/Railgun). I want to extract the article's text and images (the full-scale or original images, not the thumbnails).
Is there an HTML parser out there?
I would prefer not to use the Wikimedia API, since I can't figure out how to extract an article's text and the full-size images with it.
Thanks, and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the JSON string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the JSON.
I know that in javascript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
Or you could download the whole of Wikipedia:
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too. See https://www.mediawiki.org/wiki/Extension:TextExtracts/pt for details. It works like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world
Depending on how many pages you'll need, you should consider using public dumps if the volume of pages is high.
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML in Wikipedia articles and gives you back structured JSON objects that describe the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do a quick extraction of text and sections and understand what the layout looks like.
Classic problem: I want to see HTML rendered, but I'm seeing text in the browser. Whether or not I tell Handlebars.js to decode it in the template (three curly braces vs. two: {{{myHtmlData}}} vs. {{myHtmlData}}) doesn't get me there. Something about the JSON returned via model.fetch() has this HTML data wrapped up in such a way that it resists being displayed as HTML. It's always treated as a string, whether encoded or decoded, so it always displays as text.
Is this just something Backbone isn't meant to do?
The technologies involved here are:
backbone.marionette
handlebars.js
.NET Web API
Your data is being escaped automatically. That's a good thing, but since you're sure the data is safe HTML, use {{{}}}, as in this other question: Insert html in a handlebar template without escaping.
I'm looking for a good way to parse HTML in Clojure.
Specifically, I'm trying to fetch the content of a web page with a crawler and then get the content of some HTML tags or their attributes.
So I have the URL of the page, and I get the HTML as a String, but how do I get the data I need?
Use https://github.com/cgrand/enlive
It allows you to select and retrieve elements with CSS-like selectors.
Or https://github.com/nathell/clj-tagsoup
I am not experienced with clj-tagsoup, but I can tell you that enlive works well for most scraping.
I have a couple of websites that I want to extract data from and, based on previous experience, this isn't as easy as it sounds. Why? Simply because the HTML pages I have to parse aren't properly formatted (missing closing tags, etc.).
Considering that I have no constraints regarding the technology, language or tool that I can use, what are your suggestions to easily parse and extract data from HTML pages? I have tried HTML Agility Pack and BeautifulSoup, and even these tools aren't perfect (HTML Agility Pack is buggy, and BeautifulSoup's parsing engine doesn't work with the pages I am passing to it).
You can use pretty much any language you like; just don't try to parse HTML with regular expressions.
So let me rephrase that and say: you can use any language you like that has an HTML parser, which is pretty much everything invented in the last 15-20 years.
If you're having issues with particular pages I suggest you look into repairing them with HTML Tidy.
I think hpricot (linked by Colin Pickard) is ace. Add scrubyt to the mix and you get a great HTML scraping and browsing interface with the text-matching power of Ruby: http://scrubyt.org/
Here is some example code from http://github.com/scrubber/scrubyt_examples/blob/7a219b58a67138da046aa7c1e221988a9e96c30e/twitter.rb
require 'rubygems'
require 'scrubyt'
# Simple example for scraping basic
# information from a public Twitter
# account.
# Scrubyt.logger = Scrubyt::Logger.new
twitter_data = Scrubyt::Extractor.define do
  fetch 'http://www.twitter.com/scobleizer'
  # XPath attribute tests use @, e.g. [@class='fn'].
  profile_info '//ul[@class="about vcard entry-author"]' do
    full_name "//li//span[@class='fn']"
    location "//li//span[@class='adr']"
    website "//li//a[@class='url']/@href"
    bio "//li//span[@class='bio']"
  end
end
puts twitter_data.to_xml
With Java as the language, the open-source library Jsoup would be a good solution for you.
hpricot may be what you are looking for.
You may try PHP's DOMDocument class. It has a couple of methods for loading HTML content, and I usually use it. My advice is to prepend a DOCTYPE declaration to the HTML if it doesn't have one, and to inspect in Firebug the HTML that results after parsing. In some cases, where invalid markup is encountered, DOMDocument rearranges the HTML elements a bit. Also, if there's a meta tag specifying the charset inside the source, be aware that libxml will use it internally when parsing the markup. Here's a little example:
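// Fetch the raw markup.
$html = file_get_contents('http://example.com');
$dom = new DOMDocument;
// Collect libxml parse errors internally instead of emitting warnings for invalid markup.
$oldValue = libxml_use_internal_errors(true);
$dom->loadHTML($html);
// Restore the previous error-handling setting.
libxml_use_internal_errors($oldValue);
// Output the parsed (and possibly rearranged) HTML.
echo $dom->saveHTML();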
Any language that works with HTML at the DOM level is good.
For Perl, it is the HTML::TreeBuilder module.