I have an application that is pulling articles from the web and I need to retrieve the URL for the first image in an article. Here's an example of the code for these images:
<img alt="Twitter (zpower)" src="http://www.example.com/image.png" width="630" height="420">
I need to get just the value for the src. How would I do this?
You'll need to parse the HTML and extract the src attribute. You could do it by hand, but a better way is to rely on someone else's parsing library (for instance, ElementParser).
I'd like to second @ravuya's response, and add that you can also use the built-in NSXMLParser to do the parsing for you.
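To make the "parse the HTML and read the src attribute" idea concrete, here is a minimal sketch in Python using the standard library's html.parser. It is not the ElementParser or NSXMLParser API; the class and function names (FirstImageFinder, first_image_src) are made up for illustration, and the same pattern would need to be translated to whichever parser you use on iOS.

```python
from html.parser import HTMLParser


class FirstImageFinder(HTMLParser):
    """Records the src of the first <img> tag that carries a src attribute."""

    def __init__(self):
        super().__init__()
        self.first_src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img" and self.first_src is None:
            self.first_src = dict(attrs).get("src")


def first_image_src(html_text):
    """Return the src of the first <img> in html_text, or None if none is found."""
    finder = FirstImageFinder()
    finder.feed(html_text)
    return finder.first_src
```

Calling first_image_src() on the article's HTML should give back the URL string from the first image's src, or None when the article has no images.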
Related
I am learning HTML, and I want to put an image from my phone's gallery on my web page. How do I do that?
Please correct this code if it's wrong, because the image from my gallery is not displaying on my web page.
Code:
<img src="/storage/emulated/0/DCIM/Camera/IMG_20190717_104912.jpg" alt=" "/ >
/storage/emulated/0/DCIM/Camera/IMG_20190717_104912.jpg looks like a file path on an Android phone; it's probably not right. If you end up deploying your HTML file at http://site.example/index.html then the browser will try to fetch http://site.example/storage/emulated/0/DCIM/Camera/IMG_20190717_104912.jpg which probably isn't your Android phone.
The best thing to do is to copy the image to the same place as your HTML file and then refer to it this way:
<img src="IMG_20190717_104912.jpg">
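If it helps to see how the two src values get resolved, here is a small sketch using Python's urllib.parse.urljoin, which follows the same URL resolution rules a browser applies. The page URL is the hypothetical http://site.example/index.html from above.

```python
from urllib.parse import urljoin

page_url = "http://site.example/index.html"

# A src that starts with "/" is resolved against the root of the site.
phone_path = urljoin(page_url, "/storage/emulated/0/DCIM/Camera/IMG_20190717_104912.jpg")

# A bare filename is resolved against the directory that contains the page.
local_copy = urljoin(page_url, "IMG_20190717_104912.jpg")

print(phone_path)  # http://site.example/storage/emulated/0/DCIM/Camera/IMG_20190717_104912.jpg
print(local_copy)  # http://site.example/IMG_20190717_104912.jpg
```

The second form only works once the image file has actually been copied next to the HTML file on the server.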
Here are a couple of other tips about your markup:
First, img is what's called a void element: it never has a closing tag, and you do not need to write /> at the end. You might see that sometimes in books from a period in the early 2000s when some people thought HTML might become an application of XML; that did not pan out. It is harmless to write /> but it is not useful, so don't do it. (Note that if you do want to make your HTML be XML, I think you need to write /> with no space between them.) If you're learning HTML and see /> in it, try to find a more up-to-date resource to learn from.
Second, "alt" text is very useful to humans and machines, so use it; but if you have no useful content, don't put a space in there, because that's not useful.
I am trying to send an email from an iOS device (using Xamarin) in HTML format with images embedded in the body of the email.
Some solutions that I found online suggest to use an approach similar to the one shown here:
NSData ImgData = UIImage.FromFile(FileName).AsJPEG();
string img64baseStr = ImgData.GetBase64EncodedString(NSDataBase64EncodingOptions.None);
string srcStr = string.Format("data:image/jpeg;base64,{0}", img64baseStr);
Using the code above I can see the pictures properly in the iOS Email client. However, when the email is sent I can't see the images on the receiving side.
There are other drawbacks to this approach, but I won't go into them in detail here.
I have also tried using the images as resources in the project. However, when I reference the pictures directly in the HTML in this form:
<img src="Pic1.png" width="700" height="500" alt=""/>
the linkage is broken and the email is missing the images.
How can I properly reference resource images in an HTML email?
So it seems that the approach described above, converting the data object to a base64 string, is rejected by most email clients for security reasons: the receiving client will block data URIs that arrive this way.
I found the question posted in the link below to be helpful for understanding why things weren't working for me:
base64 encoded images in email signatures
Specifically, refer to the answer posted by @Shadow2531 and the discussion that followed it.
Finally, I was able to achieve what I wanted using the MailKit package that is available on NuGet.
The package has pretty comprehensive documentation. For the specific problem I was trying to solve, take a look at this page:
http://www.mimekit.net/docs/html/CreatingMessages.htm
Good luck.
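For readers who want to see the general shape of the fix, the usual alternative to data URIs is to attach the image as a related MIME part and reference it from the HTML with a cid: URL. The sketch below shows that structure with Python's standard email package rather than MailKit, so none of these calls are MailKit's API; the addresses, subject, and the build_html_email name are placeholders.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_html_email(image_bytes, subtype="png"):
    """Build a message whose HTML body shows an inline image via a cid: URL."""
    msg = EmailMessage()
    msg["Subject"] = "Report with inline image"  # placeholder
    msg["From"] = "sender@example.com"           # placeholder
    msg["To"] = "recipient@example.com"          # placeholder
    msg.set_content("This message contains an inline image.")

    # make_msgid returns "<id@domain>"; the HTML uses it without the brackets.
    image_cid = make_msgid(domain="example.com")
    html_body = (
        '<html><body><img src="cid:{}" width="700" height="500" alt="">'
        "</body></html>"
    ).format(image_cid[1:-1])
    msg.add_alternative(html_body, subtype="html")

    # Attach the image to the HTML part so the two travel as multipart/related.
    msg.get_payload()[1].add_related(image_bytes, "image", subtype, cid=image_cid)
    return msg, image_cid
```

The key point is that the img tag's src and the attachment's Content-ID header carry the same identifier, which is what lets the receiving client match them up.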
I'm using Jsoup's parseBodyFragment() and parse() methods to work with blocks of code made up of script, noscript, and style tags. The goal isn't to clean them - just to select(), analyze, and output them. The select() portion works really well.
However, the issue is that it's automatically encoding the url parameters of src attributes. So, when the input is this:
<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&p_id=123"/>
</noscript>
I end up with this, returned from Jsoup, via the outerHTML() method:
<noscript>
<img height="1" width="1" style="display:none;" alt="" src="https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123"/>
</noscript>
The issue is that the standard ampersand (&) in the URL parameter is being encoded and output as &amp;. Is there a way to disable this?
I'm looking for a way to get the html of the selected element without modification. Thanks!
Update (2/23/2016): Clarified problem. Also, found an issue on the Github repo describing the problem: https://github.com/jhy/jsoup/issues/372. Looks like this might not be possible.
The original HTML is invalid. An & which doesn't start a character reference must be expressed as &amp; in an HTML attribute value.
HTML parsers are expected to perform error recovery and generate a valid DOM.
Jsoup works by parsing the HTML into a DOM, letting you run queries on it, then exporting the DOM back to HTML afterwards.
You can't avoid white space normalisation, error recovery, or any of the other things that parsers do. The approach used by Jsoup to extract data is not designed to support the preservation of errors.
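The same parse-then-serialize behaviour can be illustrated outside Jsoup. The sketch below uses Python's standard html.parser and html.escape, not Jsoup's API, and the SrcCollector and parsed_src names are made up for the example. It shows that the raw and the escaped markup yield the same attribute value once parsed, and that the escaping only reappears when the value is written back out as HTML.

```python
from html import escape
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collects the decoded value of every src attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "src":
                self.sources.append(value)


def parsed_src(markup):
    collector = SrcCollector()
    collector.feed(markup)
    return collector.sources[0]


raw = '<img src="https://something.orother.com/i/cnt?txn_id=123&p_id=123"/>'
escaped = '<img src="https://something.orother.com/i/cnt?txn_id=123&amp;p_id=123"/>'

# Both inputs parse to the same URL, containing a plain "&".
url_from_raw = parsed_src(raw)
url_from_escaped = parsed_src(escaped)

# Writing the value back into an attribute escapes the "&" again.
serialized = 'src="{}"'.format(escape(url_from_raw, quote=True))
```

So if you need the URL itself, read the attribute value from the parsed element rather than its serialized HTML; if you need the original bytes untouched, a DOM round trip is the wrong tool.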
I'm looking for a library to parse HTML pages, specifically Wikipedia articles (for example http://en.wikipedia.org/wiki/Railgun). I want to extract the article's text and images (the full-scale or original images, not the thumbnails).
Is there an HTML parser out there?
I would prefer not to use the Wikimedia API, since I can't figure out how to extract an article's text and the full-size images with it.
Thanks, and sorry for my English.
EDIT: I forgot to say that the end result should be valid HTML.
EDIT: I got the json string with this: https://en.wikipedia.org/w/api.php?action=parse&pageid=218930&prop=text&format=json so now I need to parse the json.
I know that in javascript I can do something like this:
var pageHTML = JSON.parse("the json string").parse.text["*"];
Since I know a bit of HTML/JavaScript and Python, how can I make that HTTP request and parse the JSON in Python 3?
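A minimal sketch of one way to do this in Python 3 with only the standard library, assuming the response has the same shape the JavaScript line above relies on (parse, then text, then "*"). The function names and the User-Agent string are placeholders chosen for the example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def extract_page_html(json_text):
    """Return the rendered HTML from an action=parse JSON response."""
    return json.loads(json_text)["parse"]["text"]["*"]


def fetch_page_html(page_id):
    """Request the parsed page from the API and return its HTML."""
    query = urllib.parse.urlencode(
        {"action": "parse", "pageid": page_id, "prop": "text", "format": "json"}
    )
    # Placeholder User-Agent; replace it with something that identifies your script.
    request = urllib.request.Request(
        API_URL + "?" + query,
        headers={"User-Agent": "example-script/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_page_html(response.read().decode("utf-8"))
```

Calling fetch_page_html(218930) would request the same page as the URL above; extract_page_html can also be used on its own with a JSON string you already have.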
I think you should be able to get everything with the web API:
https://www.mediawiki.org/wiki/API:Main_page
https://www.mediawiki.org/wiki/API:Parsing_wikitext
or you could download the whole wikipedia
https://meta.wikimedia.org/wiki/Research:Data
You can get the HTML from the API too; check the info at https://www.mediawiki.org/wiki/Extension:TextExtracts/pt. It works like this example: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exchars=175&titles=hello%20world .
Depending on how many pages you'll need, you should consider using public dumps if the volume of pages is high.
I made a Node.js module called wikipedia-to-json (written in JavaScript) that parses the HTML of Wikipedia articles and gives you back structured JSON objects that describe the layout of the article in order (titles, paragraphs, images, lists, sub-titles...).
That might be useful if you just want to do a quick extraction of text and sections and get a sense of how the article is laid out.
I have an HTML document and I want to extract the URL of every video file in it. What's the best way to do this, given that there are different HTML versions and different ways to embed a video file in an HTML document? For this purpose I'd use the Html Agility Pack (C#).
You should parse the HTML with a regular expression to get the video URLs.
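As a rough sketch of how that could look, here is a Python version that combines the two ideas: a parser walks the tags (as the Html Agility Pack would in C#), and a regular expression is used only to recognise URLs that end in a video file extension. The extension list, the set of attributes checked, and the names VideoUrlCollector and video_urls are assumptions for illustration; pages that embed video through scripts or player iframes would need more than this.

```python
import re
from html.parser import HTMLParser

# Assumed list of video extensions; an optional query string or fragment may follow.
VIDEO_URL = re.compile(r"\.(mp4|webm|ogv|mov|avi|mkv|flv)([?#].*)?$", re.IGNORECASE)

# Attributes that commonly hold a URL on video, source, embed, object and a tags.
URL_ATTRS = {"src", "href", "data"}


class VideoUrlCollector(HTMLParser):
    """Collects attribute values that look like video file URLs, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value and VIDEO_URL.search(value):
                if value not in self.urls:  # skip duplicates
                    self.urls.append(value)


def video_urls(html_text):
    collector = VideoUrlCollector()
    collector.feed(html_text)
    return collector.urls
```

Checking attributes on every tag, rather than only on video elements, is what lets the same code cope with older embed and object markup as well as HTML5 video and source.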