Parsing html page content without using selector - html

I am going to parse some web pages using a Java program. For this purpose I wrote a small piece of code that parses page content using XPath as the selector. To parse different sites, you need to find the appropriate XPath for each site. The problem is that you need an operator to find the right XPath for you (for example, using the FirePath Firefox add-on). Suppose you don't know which page you will have to parse, or the number of sites gets too big for an operator to find the right XPath for each one. In that case you need a way to parse pages without using any selector (the same scenario exists for CSS selectors), or there should be a way to find the XPath automatically. What is the method for parsing web pages in this way?
Here is the small code which I wrote for this purpose, please feel free to extend that in presenting your solutions.
public void downloadHTML(String url) throws IOException {
CleanerProperties props = new CleanerProperties();
// set some properties to non-default values
props.setTranslateSpecialEntities(true);
props.setTransResCharsToNCR(true);
props.setOmitComments(true);
// do parsing
TagNode tagNode = new HtmlCleaner(props).clean(
new URL(url)
);
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
tagNode, "c:\\TEMP\\clean.xml", "utf-8"
);
}
public static void testJavaxXpath(String pattern)
throws ParserConfigurationException, SAXException, IOException,
FileNotFoundException, XPathExpressionException {
DocumentBuilder b = DocumentBuilderFactory.newInstance()
.newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream(
"c:\\TEMP\\clean.xml"));
// Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate(pattern,
doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
Element e = (Element) nodes.item(i);
System.out.println(e.getFirstChild().getTextContent());
}
}

Related

How to get all of the headlines from a google news search using Jsoup

public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://www.google.com/search?q=tesla&oq=tesla&aqs=chrome.0.69i59l3j0l3.494j0j9&sourceid=chrome&ie=UTF-8#q=tesla&tbm=nws").userAgent("Mozilla").get();
Elements links = doc.select("div[class=_cnc]");
for (Element link : links) {
Elements titles = link.select("h3.r_U6c");
String title = titles.text();
System.out.println(title);
System.out.println("Headline: " + link.text());
System.out.println("Link: " + link.attr("data-href"));
}
}}
Here is the HTML layout. I want to extract the title for each of the links. I am just not sure how to format the CSS selector portions of my code. I tried to look through some old threads but couldn't get anything to work. I am just looking for the text of the headlines, not the actual links. The print-link statements were just for some testing that I couldn't get running.
Thanks guys
Picture of HTML
The page you're trying to fetch is rendered with JavaScript. Jsoup doesn't execute JavaScript.
Instead, use a tool such as Selenium or ui4j.

Allow using some html tags in MVC 4

How can I allow clients to use HTML tags in MVC 4?
I would like to save records to the database and, when they are displayed in the view, allow only some HTML tags (<b>, <i>, <img>); all other tags must be rendered as text.
My Controller:
[ValidateInput(false)]
[HttpPost]
public ActionResult Rep(String a)
{
var dbreader = new DataBaseReader();
var text = Request["report_text"];
dbreader.SendReport(text, uid, secret).ToString();
...
}
My View:
@{
var dbreader = new DataBaseReader();
var reports = dbreader.GetReports();
foreach (var report in reports)
{
<div class="report_content">@Html.Raw(report.content)</div>
...
}
}
You can replace all < characters with the HTML entity:
tags = tags.Replace("<", "&lt;");
Now restore only the allowed tags:
tags = tags
.Replace("&lt;b>", "<b>")
.Replace("&lt;/b>", "</b>")
.Replace("&lt;i>", "<i>")
.Replace("&lt;/i>", "</i>")
.Replace("&lt;img ", "<img ");
And render it to the page using @Html.Raw(tags)
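The escape-then-restore idea above is not specific to C#. Here is a minimal sketch of the same technique in Java, assuming only the b, i, and img tags are allowed; the class and method names (TagWhitelist, allowBasicTags) are made up for illustration.

```java
import java.util.List;

class TagWhitelist {
    // Tags to restore after escaping; mirrors the b / i / img list from the question.
    private static final List<String> ALLOWED =
        List.of("<b>", "</b>", "<i>", "</i>", "<img ");

    static String allowBasicTags(String input) {
        // Step 1: neutralise every tag by escaping the opening angle bracket.
        String result = input.replace("<", "&lt;");
        // Step 2: turn the escaped form of each allowed tag back into real markup.
        for (String tag : ALLOWED) {
            // tag.substring(1) drops the leading '<', which now appears as "&lt;".
            result = result.replace("&lt;" + tag.substring(1), tag);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(allowBasicTags("<b>bold</b> <script>alert(1)</script>"));
    }
}
```

Note that the `<img ` rule lets every attribute through, including event handlers such as onerror, so a real filter would also need to validate attributes.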
If you want a property of your view model object to accept HTML text, use AllowHtmlAttribute
[AllowHtml]
public string UserComment{ get; set; }
and before binding to the view
model.UserComment=model.UserComment.Replace("<othertagstart/end>",""); //hard
Turn off validation for report_text (1) and write custom HTML encoder (2):
Step 1:
Request.Unvalidated().Form["report_text"]
More info here. You don't need to turn off validation for entire controller action.
Step 2:
Write a custom HTML encoder that converts all tags except b, i, and img (e.g. <script> becomes &lt;script&gt;), since you are customizing the default behaviour of request validation and HTML tag filtering. Also consider safeguarding yourself against SQL injection attacks by checking the SQL parameters passed to stored procedures, functions, etc.
You may want to check out BBCode (see BBCode on Wikipedia). This way you have some control over what is allowed and what is not, and can prevent illegal usage.
This would work like this:
A user submits something like 'the meeting will now be on [b]monday![/b]'
Before saving it to your database you remove all real html tags ('< ... >') to avoid the use of illegal tags or code injection, but leave the pseudo tags as they are.
When viewed you convert only the allowed pseudo html tags into real html
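The BBCode workflow described above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the names BbCodeRenderer, sanitizeForStorage, and renderForView are hypothetical, only [b] and [i] are supported, and the render step additionally escapes any leftover angle brackets as a precaution that the answer does not mention.

```java
class BbCodeRenderer {
    // Before saving: drop anything that looks like a real HTML tag, keep pseudo tags.
    static String sanitizeForStorage(String userInput) {
        return userInput.replaceAll("<[^>]*>", "");
    }

    // When viewing: escape leftovers, then convert only the allowed pseudo tags.
    static String renderForView(String stored) {
        String safe = stored.replace("&", "&amp;")
                            .replace("<", "&lt;")
                            .replace(">", "&gt;");
        // (?i) ignores case, (?s) lets '.' match line breaks inside a tag pair.
        safe = safe.replaceAll("(?is)\\[b\\](.*?)\\[/b\\]", "<b>$1</b>");
        safe = safe.replaceAll("(?is)\\[i\\](.*?)\\[/i\\]", "<i>$1</i>");
        return safe;
    }

    public static void main(String[] args) {
        String stored = sanitizeForStorage(
            "the meeting will now be on [b]monday![/b]<script>x</script>");
        System.out.println(renderForView(stored));
    }
}
```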
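I found a solution to my problem:
// html is assumed to be HTML-encoded already (e.g. via HttpUtility.HtmlEncode)
html = Regex.Replace(html, "&lt;b&gt;(.*?)&lt;/b&gt;", "<b>$1</b>");
html = Regex.Replace(html, "&lt;i&gt;(.*?)&lt;/i&gt;", "<i>$1</i>");
html = Regex.Replace(html, "&lt;img(?:.*?)src=&quot;(.*?)&quot;(?:.*?)/&gt;", "<img src=\"$1\"/>");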

HTML parsing individual tables/not all data being parsed?

I'm rather new when it comes to Windows Phone 8 development and I've been toying around with a few things as part of the application I'm developing.
Right now I'm trying to parse information from a website such as the RuneScape 07 High Scores - http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima
I'm using HTML Agility Pack and I'm able to parse some data (down to Woodcutting), but anything past that doesn't appear. (Is that down to the size of my ListBox?)
Ideally, I'd like to be able to parse the table information individually rather than in one block like so:
public MainPage()
{
InitializeComponent();
HtmlWeb.LoadAsync("http://services.runescape.com/m=hiscore_oldschool/hiscorepersonal.ws?user1=zezima", DownLoadCompleted);
}
void DownLoadCompleted(object sender, HtmlDocumentLoadCompleted e)
{
if(e.Error == null)
{
HtmlDocument doc = e.Document;
if (doc != null)
{
var result = doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']");
foreach (var htmlNode in result)
{
lBox.Items.Add(htmlNode.InnerText);
}
}
}
}
But if I try and access an individual table such as this one using
var result = doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']/table/tbody/tr[5]/td[2]");
I get a NullReferenceException.
Is this possible or am I doing something exceptionally wrong?
You probably relied on a developer tool such as Firebug or the Chrome developer tools to determine the XPath for the nodes you're after.
You can't really do this, because the XPath given by such tools corresponds to the in-memory HTML DOM, while the Html Agility Pack only knows about the raw HTML sent back by the server.
What you need to do is look at what's sent back (or just do a view source). You'll see there is no TBODY element, for example. So you want to find something that discriminates the nodes you need, and use XPath axes, for example.
Here is a code that seems to work:
// get all TD nodes with ALIGN attribute set to left
foreach (var node in doc.DocumentNode.SelectNodes("//div[@id='contentHiscores']//td[@align='left']"))
{
var item = lBox.Items.Add(node.InnerText.Trim());
// use an XPath axis: get all sibling TD nodes with ALIGN attribute set to 'right'
foreach (var sibling in node.SelectNodes("following-sibling::td[@align='right']"))
{
item.SubItems.Add(sibling.InnerText.Trim());
}
}
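The TBODY point above can be checked without any HTML library. Below is a small Java sketch using the standard javax.xml XPath API on a hypothetical, well-formed fragment that mimics what a server might send (a table with no tbody). The markup and the class name RawMarkupXPath are assumptions for illustration. Also note a difference in behaviour: javax.xml returns an empty NodeList when nothing matches, whereas Html Agility Pack's SelectNodes returns null, which is what leads to the NullReferenceException.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class RawMarkupXPath {
    // Hypothetical raw markup: the table rows sit directly under <table>.
    static final String RAW =
        "<div id=\"contentHiscores\"><table>"
        + "<tr><td align=\"left\">Attack</td><td align=\"right\">99</td></tr>"
        + "<tr><td align=\"left\">Woodcutting</td><td align=\"right\">85</td></tr>"
        + "</table></div>";

    // Returns how many nodes the expression selects in the raw markup.
    static int count(String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(RAW.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Path copied from a browser's DOM view: assumes a tbody that is not there.
        System.out.println(count("//div[@id='contentHiscores']/table/tbody/tr"));
        // Path that does not depend on intermediate elements.
        System.out.println(count("//div[@id='contentHiscores']//tr"));
    }
}
```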

Remove attributes using HtmlAgilityPack

I'm trying to create a code snippet to remove all style attributes regardless of tag using HtmlAgilityPack.
Here's my code:
var elements = htmlDoc.DocumentNode.SelectNodes("//*");
if (elements!=null)
{
foreach (var element in elements)
{
element.Attributes.Remove("style");
}
}
However, I'm not getting it to stick? If I look at the element object immediately after Remove("style"). I can see that the style attribute has been removed, but it still appears in the DocumentNode object. :/
I'm feeling a bit stupid, but it seems off to me? Anyone done this using HtmlAgilityPack? Thanks!
Update
I changed my code to the following, and it works properly:
public static void RemoveStyleAttributes(this HtmlDocument html)
{
var elementsWithStyleAttribute = html.DocumentNode.SelectNodes("//@style");
if (elementsWithStyleAttribute!=null)
{
foreach (var element in elementsWithStyleAttribute)
{
element.Attributes["style"].Remove();
}
}
}
Your code snippet seems to be correct: it removes the attributes. The thing is, DocumentNode.InnerHtml (I assume you monitored this property) is a complex property; it may only get updated under certain circumstances, and you shouldn't use it to get the document as a string. Instead, use the HtmlDocument.Save method for this:
string result = null;
using (StringWriter writer = new StringWriter())
{
htmlDoc.Save(writer);
result = writer.ToString();
}
now result variable holds the string representation of your document.
One more thing: your code may be improved by changing your expression to "//*[@style]", which gets you only the elements that have a style attribute.
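For comparison, the same "//*[@style]" expression can be tried with the standard Java XML APIs. This is only a sketch on well-formed markup, not Html Agility Pack code; the class name StyleStripper and its helper methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class StyleStripper {
    // Parses a well-formed markup string into a DOM document.
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Selects only the elements that carry a style attribute.
    static NodeList styledElements(Document doc) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@style]", doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Removes the style attribute from every matching element; returns how many were changed.
    static int stripStyles(Document doc) {
        NodeList styled = styledElements(doc);
        for (int i = 0; i < styled.getLength(); i++) {
            ((Element) styled.item(i)).removeAttribute("style");
        }
        return styled.getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<div style=\"a\"><p style=\"b\">x</p><span>y</span></div>");
        System.out.println(stripStyles(doc));                  // elements changed
        System.out.println(styledElements(doc).getLength());   // elements still styled
    }
}
```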
Here is a very simple solution
VB.net
element.Attributes.Remove(element.Attributes("style"))
c#
element.Attributes.Remove(element.Attributes["style"])

call code behind functions with html controls

I have a simple function named Move in the code-behind file that I want to call, and I was trying to see how this can be done. I'm not using an ASP image button because I'm trying to avoid ASP server-side controls, since they tend not to work well with ASP.NET MVC. The way it is set up now, it will look for a JavaScript function named Move, but I want it to call a function named Move in the code-behind of the same view.
<img alt='move' id="Move" src="/Content/img/hPrevious.png" onclick="Move()"/>
protected void Move(){
}
//based on Search criteria update a new table
protected void Search(object sender, EventArgs e)
{
for (int i = 0; i < data.Count; i++){
HtmlTableRow row = new HtmlTableRow();
HtmlTableCell CheckCell = new HtmlTableCell();
HtmlTableCell firstCell = new HtmlTableCell();
HtmlTableCell SecondCell = new HtmlTableCell();
CheckBox Check = new CheckBox();
Check.ID = data[i].ID;
CheckCell.Controls.Add(Check);
lbl1.Text = data[i].Date;
lbl2.Text = data[i].Name;
row.Cells.Add(CheckCell);
row.Cells.Add(firstCell);
row.Cells.Add(SecondCell);
Table.Rows.Add(row);
}
}
Scott Guthrie has a very good example on how to do this using routing rules.
This would give you the ability to have the user navigate to a URL in the format /Search/[Query]/[PageNumber] like http://site/Search/Hippopotamus/3 and it would show page 3 of the search results for hippopotamus.
Then in your view just make the next button point to "http://site/Search/Hippopotamus/4", no javascript required.
Of course if you wanted to use javascript you could do something like this:
function Move() {
var href = 'http://blah/Search/Hippopotamus/2';
var slashPos = href.lastIndexOf('/');
var page = parseInt(href.substring(slashPos + 1, href.length));
href = href.substring(0, slashPos + 1);
window.location = href + (++page);
}
But that is much more convoluted than just incrementing the page number parameter in the controller and setting the URL of the next button.
You cannot do postbacks or call anything in a view from JavaScript in an ASP.NET MVC application. Anything you want to call from JavaScript must be an action on a controller. It's hard to say more without having more details about what you're trying to do, but if you want to call some method "Move" in your web application from JavaScript, then "Move" must be an action on a controller.
Based on comments, I'm going to update this answer with a more complete description of how you might implement what I understand as the problem described in the question. However, there's quite a bit of information missing from the question so I'm speculating here. Hopefully, the general idea will get through, even if some of the details do not match TStamper's exact code.
Let's start with a Controller action:
public ActionResult ShowMyPage()
{
return View();
}
Now I know that I want to re-display this page, and do so using an argument passed from a JavaScript function in the page. Since I'll be displaying the same page again, I'll just alter the action to take an argument. String arguments are nullable, so I can continue to do the initial display of the page as I always have, without having to worry about specifying some kind of default value for the argument. Here's the new version:
public ActionResult ShowMyPage(string searchQuery)
{
ViewData["SearchQuery"] = searchQuery;
return View();
}
Now I need to call this page again in JavaScript. So I use the same URL I used to display the page initially, but I append a query string parameter with the table name:
http://example.com/MyControllerName/ShowMyPage?searchQuery=tableName
Finally, in my aspx I can call a code behind function, passing the searchQuery from the view data. Once again, I have strong reservations about using code behind in an MVC application, but this will work.
How to call a code-behind function in aspx:
<% Search(ViewData["searchQuery"]); %>
I've changed the arguments. Since you're not handling an event (with a few exceptions, such as Page_Load, there aren't any in MVC), the Search function doesn't need the signature of an event handler. But I did add the "tablename" argument so that you can pass that from the aspx.
Once more, I'll express my reservations about doing this in code behind. It strikes me that you are trying to use standard ASP.NET techniques inside of the MVC framework, when MVC works differently. I'd strongly suggest going through the MVC tutorials to see examples of more standard ways of doing this sort of thing.