How to extract a certain block of text from a web site

I have to extract useful information from the web. How can I do this using C#?
For example, given:
title: abc
I need to get only "abc".

As Oded recommended, the Html Agility Pack will be useful.
This is an example using the Html Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");

If you need to extract text from a website, you need to use an HTML parser such as the HTML Agility Pack.

Using a DOM parser you can extract the required elements. If you know the block's id in advance, or can arrange for one, the extraction is quite simple.
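As a hedged illustration of that id-based approach (in Python's stdlib html.parser rather than C#; the class name and the "title" id are invented for the example), the idea looks like:

```python
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    """Collects the text inside the element with a given id (illustrative sketch).

    Note: void tags such as <br> are ignored here for simplicity; a production
    extractor would need to handle them when tracking nesting depth."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # > 0 while inside the target block
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)

parser = BlockExtractor("title")
parser.feed('<html><body><div id="title">abc</div><div>other</div></body></html>')
print("".join(parser.text))  # -> abc
```

The same pattern (find the element by id, then take its text) carries over directly to Html Agility Pack's `GetElementbyId` in C#.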

Related

Trying to pull through P tags from render section to place elsewhere

Hi, I'm using RenderSection in MVC. Is it possible to store any paragraph tags separately in a separate variable? My current code is this:
var menuText1 = RenderSection("text", false).ToHtmlString();
and I push the content on the front end like this:
@Html.Raw(menuText1)
The actual content in menuText1 consists of several anchor tags and one paragraph tag. Is it possible to pull through both sets of content separately?
You could consider using a DOM parser. You can install the HTML Agility Pack from the HTML Agility Pack NuGet package.
Once installed, you can load your HTML into an HtmlDocument object. Then you can get the tags as shown below.
var doc = new HtmlDocument();
doc.LoadHtml("Your HTML");
var pTags = doc.DocumentNode.Descendants("p").ToList();
var aTags = doc.DocumentNode.Descendants("a").ToList();
Hope that can help!
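The same Descendants idea can be sketched with a stream parser; this hedged Python example (stdlib html.parser, with illustrative names, not the Html Agility Pack itself) collects every p and a start tag:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records the attributes of every <p> and <a> start tag seen (illustrative)."""
    def __init__(self):
        super().__init__()
        self.p_tags = []
        self.a_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "p":
            self.p_tags.append(attrs)
        elif tag == "a":
            self.a_tags.append(attrs)

c = TagCollector()
c.feed('<p>hello</p><a href="http://example.com">link</a>')
print(len(c.p_tags), len(c.a_tags))  # -> 1 1
```

Once the two lists are separated like this, each set of tags can be rendered into its own variable, which is what the question asks for.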

Processing JSON string as HTML in Angular

I have to work with a series of JSON files that have HTML inside them. I need to use that HTML to format the body text of a series of pages and cannot use templates for this information.
I know jQuery has the .html() method which can be used to parse the JSON string as HTML.
Is there a better way using Angular? I would rather not use jQuery for my app.
Thanks.
Use the ng-bind-html directive in your HTML...
HTML
<div ng-controller="ngBindHtmlCtrl">
<p ng-bind-html="myHTML"></p>
</div>
where myHTML is your html string...
CONTROLLER
$scope.myHTML = 'I am an <code>HTML</code>string with links! and other <em>stuff</em>';

How to extract HTML text from XML? (Parser: Rapture XML, language: Objective-C)

Is it possible to extract HTML text with an XML parser?
Explaining in detail:
I have this simple xml
<?xml version="1.0" encoding="iso-8859-1"?>
<eventi>
<evento><id_evento>4553</id_evento><descrizione>Lorem Ipsum<a href='http://www.yea.it/yea.asp' target='_blank'><span class='U'>Vai alla pagina di gioco</span></a></descrizione></evento>
</eventi>
and I'm parsing it with Rapture XML while developing an app for iOS. When I do
rootXML = [RXMLElement elementFromURL:[NSURL URLWithString:[NSString stringWithFormat:@"%@%@", indXMLdettaglioEvento, idElemento]]];
[rootXML iterateWithRootXPath:@"//evento" usingBlock: ^(RXMLElement *datiXML) {
    NSLog(@"%@", [datiXML child:@"descrizione"].text);
}];
the NSLog of [datiXML child:@"descrizione"].text returns the text without the HTML tags. Is it possible to make it return the entire HTML?
[datiXML child:@"descrizione"]
returns a parsed XML item which has a text of Lorem Ipsum, but it also has children itself! The first child, I think you'll find, is going to be an XML item for your link anchor:
[[datiXML child:@"descrizione"] child:@"a"] => XML item for the link
[[[datiXML child:@"descrizione"] child:@"a"] child:@"span"] => XML item for the span
So you'll need to traverse the whole tree to parse your xHTML -- but I think you'll find it's all there.
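That traversal can be sketched outside Objective-C; this is a hedged Python illustration using the stdlib xml.etree (standing in for RaptureXML), which rebuilds the inner markup of the descrizione element from its leading text plus its serialized children:

```python
import xml.etree.ElementTree as ET

xml = ("<eventi><evento><id_evento>4553</id_evento>"
       "<descrizione>Lorem Ipsum<a href='http://www.yea.it/yea.asp'>"
       "<span>Vai alla pagina di gioco</span></a></descrizione></evento></eventi>")

root = ET.fromstring(xml)
descrizione = root.find(".//descrizione")

# Rebuild the inner markup: the element's leading text, then each child
# subtree serialized back to a string.
inner = (descrizione.text or "") + "".join(
    ET.tostring(child, encoding="unicode") for child in descrizione
)
print(inner)
```

The equivalent in RaptureXML would be a recursive walk over each child's tag name, attributes, and text, concatenating the pieces back into markup.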
As previous commenters have said, lots of valid HTML pages are not valid XML. And lots of HTML pages that "work" aren't valid! So this wouldn't be a good strategy for writing a Web browser. But that's not what we're doing here; if the service you're talking to delivers XML, it makes perfect sense to use an XML parser to parse it!
You can use some open source libraries like TinyXML, TouchXML etc for parsing XML documents.
Otherwise you can write your own parser using NSXMLParser.
Hope this helps!

Jsoup parse error (tag table within tag p)

When I parse this code with Jsoup:
<p>
<table>[...]</table>
</p>
Jsoup returns:
<p></p>
<table>[...]</table>
Is this a mistake? How can I fix this?
I think it has to do with your example not being valid HTML. A table cannot exist within a p tag, so Jsoup is probably enforcing correct HTML.
jsoup is quite intelligent: if you use its default parsing method, it will rewrite your input into valid HTML content.
Document doc = Jsoup.parse(html);
Actually, jsoup can also handle XML-like text (including both HTML and XML). You can try the following method to parse XML-like text; it will not rewrite your input, and parses the input as it is.
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
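The difference is easy to check with any strict XML parser; here is a small, purely illustrative Python check (stdlib xml.etree standing in for jsoup's Parser.xmlParser()):

```python
import xml.etree.ElementTree as ET

# An XML parser treats the input literally, so the <table> stays nested
# inside <p>, exactly as written -- no restructuring is applied.
root = ET.fromstring("<p><table><tr><td>x</td></tr></table></p>")
print(root.tag, root[0].tag)  # -> p table
```

An HTML-spec parser, by contrast, is required to close the open p element when it sees a table start tag, which is why jsoup's default parser hoists the table out of the paragraph.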

How do you parse a web page and extract all the href links?

I want to parse a web page in Groovy and extract all of the href links and the associated text with it.
If the page contained these links:
<a href="http://www.google.com">Google</a><br />
<a href="http://www.apple.com">Apple</a>
the output would be:
Google, http://www.google.com
Apple, http://www.apple.com
I'm looking for a Groovy answer. AKA. The easy way!
Assuming well-formed XHTML, slurp the xml, collect up all the tags, find the 'a' tags, and print out the href and text.
input = """<html><body>
<a href="http://example.com/john">John</a>
<a href="http://www.google.com">Google</a>
<a href="http://stackoverflow.com">StackOverflow</a>
</body></html>"""
doc = new XmlSlurper().parseText(input)
doc.depthFirst().collect { it }.findAll { it.name() == "a" }.each {
    println "${it.text()}, ${it.@href.text()}"
}
A quick google search turned up a nice looking possibility, TagSoup.
I don't know Java, but I think XPath is far better than classic regular expressions for getting one (or more) HTML elements.
It is also easier to write and to read.
<html>
<body>
<a href="1.html">1</a>
<a href="2.html">2</a>
<a href="3.html">3</a>
</body>
</html>
With the HTML above, the expression "/html/body/a" will list all the a elements, from which you can read each href.
Here's a good step by step tutorial http://www.zvon.org/xxl/XPathTutorial/General/examples.html
Use XmlSlurper to parse the HTML as an XML document, then use the find method with an appropriate closure to select the a tags, and use the list method on the GPathResult to get a list of them. You should then be able to extract the text as children of the GPathResult.
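That parse-as-XML approach maps onto most XML libraries; a hedged Python sketch with the stdlib xml.etree (illustrative URLs, not Groovy's XmlSlurper itself) looks like:

```python
import xml.etree.ElementTree as ET

html = ('<html><body>'
        '<a href="http://www.google.com">Google</a>'
        '<a href="http://www.apple.com">Apple</a>'
        '</body></html>')

# Parse the (well-formed) page as XML, then walk every <a> element.
doc = ET.fromstring(html)
links = [(a.text, a.get("href")) for a in doc.iter("a")]
for text, href in links:
    print(f"{text}, {href}")  # -> Google, http://www.google.com  etc.
```

As with the XmlSlurper answer, this only works when the page is well-formed XHTML; real-world HTML usually needs a lenient HTML parser first.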
Try a regular expression. Something like this should work:
(html =~ /<a.*?href='(.*?)'.*?>(.*?)<\/a>/).each { match, url, text ->
    // do something with url and text
}
Take a look at Groovy - Tutorial 4 - Regular expressions basics and Anchor Tag Regular Expression Breaking.
Parsing with XmlSlurper only works if the HTML is well-formed.
If your HTML page has non-well-formed tags, then use a regex to parse the page instead.
For example: <a href="www.google.com">
Here 'a' is not closed, and thus is not well-formed.
new URL(url).eachLine{
(it =~ /.*<A HREF="(.*?)">/).each{
// process hrefs
}
}
An HTML parser + regular expressions.
Any language would do it, though I'd say Perl is the fastest solution.