Scraping using Html Agility Pack

I am trying to scrape data from a news article using HtmlAgilityPack. The link is: http://www.ndtv.com/india-news/vyapam-scam-documents-show-chief-minister-shivraj-chouhan-delayed-probe-780528
I have written the code below to extract all the comments in this article, but for some reason my variable aTags comes back null.
Code:
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(txtinputurl.Text);
var aTags = document.DocumentNode.SelectNodes("//div[@class='com_user_text']");
int counter = 1;
if (aTags != null)
{
foreach (var aTag in aTags)
{
lbloutput.Text += counter + ". " + aTag.InnerHtml + "\t" + "<br />";
counter++;
}
}
I have also tried this XPath, with the same result: //div[@class='newcomment_list']/ul/li/div[@class='headerwrap']/div[@class='com_user_text']
Please help me with the correct XPath to extract all the comments.
I have searched all over the net but found no solution.

Do a 'View Source' on the page and search for com_user_text. The user comments don't appear at all. They are loaded via JavaScript after the page is loaded, so when you load the page content via getHtmlWeb.Load(), you don't get the user comments.
As this answer says, Html Agility Pack is not a tool capable of emulating a browser and running JavaScript. Instead, you need something like WatiN that "allows programmatic access to web pages through a given browser engine and will load the full document."
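The 'View Source' check can also be automated: before tuning an XPath, test whether the class name occurs anywhere in the raw markup the server returned. Below is a minimal sketch in JavaScript (not C#, so it only illustrates the idea). hasClassInMarkup is a hypothetical helper and the two sample strings are made up. If it returns false for the downloaded HTML, no XPath will find the nodes.

```javascript
// Hypothetical helper: reports whether any class attribute in the raw
// markup contains the given class name as a whole token.
function hasClassInMarkup(html, className) {
  const classAttr = /class\s*=\s*(["'])(.*?)\1/gi;
  let m;
  while ((m = classAttr.exec(html)) !== null) {
    if (m[2].split(/\s+/).includes(className)) {
      return true;
    }
  }
  return false;
}

// Made-up samples: markup as served, and markup after scripts have run.
const served = '<div class="newcomment_list"></div>';
const rendered = '<div class="newcomment_list"><div class="com_user_text">Hi</div></div>';
console.log(hasClassInMarkup(served, "com_user_text"));   // false
console.log(hasClassInMarkup(rendered, "com_user_text")); // true
```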

Related

innerHTML call to receive a url

I am trying to make it so that when the title of a video in my playlist is clicked, that video's URL is shown in the metadata field box I have created.
So far I am getting results, but the function below is giving me rtmp URLs like this:
(rtmp://brightcove.fcod.llnwd.net/a500/d16/&mp4:media/1978114949001/1978114949001_2073371902001_How-to-Fish-the-Ice-Worm.mp4&1358870400000&7b1c5b2e65a7c051419c7f50bd712b1b)
Brightcove has said to use (FLVURL&media_delivery=http).
I have tried every way I know of to put a media delivery setting in my function, but I always end up with either the rtmp URL or a blank.
Can you please help, given the small amount of code I have shown? If I need to show more, that is not a problem. Thanks.
function showMetaData(idx) {
$("tr.select").removeClass("select");
$("#tbData>tr:eq("+idx+")").addClass("select");
var v = oCurrentVideoList[idx];
//URL Metadata
document.getElementById('divMeta.FLVURL').innerHTML = v.FLVURL;
}
Here is the call that populates my list.
//For PlayList by ID
function buildMAinVideoList() {
//Wipe out the old results
$("#tbData").empty();
console.log(oCurrentMainVideoList);
oCurrentVideoList = oCurrentMainVideoList;
// Display video count
document.getElementById('divVideoCount').innerHTML = oCurrentMainVideoList.length + " videos";
document.getElementById('nameCol').innerHTML = "Video Name";
//document.getElementById('headTitle').innerHTML = title;
document.getElementById('search').value = "Search Videos";
document.getElementById('tdMeta').style.display = "block";
document.getElementById('searchDiv').style.display = "inline";
document.getElementById('checkToggle').style.display = "inline";
$("span[name=buttonRow]").show();
$(":button[name=delFromPlstButton]").hide();
//For each retrieved video, add a row to the table
var modDate = new Date();
$.each(oCurrentMainVideoList, function(i,n){
modDate.setTime(n.lastModifiedDate);
$("#tbData").append(
"<tr style=\"cursor:pointer;\" id=\""+(i)+"\"> \
<td>\
<input type=\"checkbox\" value=\""+(i)+"\" id=\""+(i)+"\" onclick=\"checkCheck()\">\
</td><td>"
+n.name +
"</td><td>"
+(modDate.getMonth()+1)+"/"+modDate.getDate()+"/"+modDate.getFullYear()+"\
</td><td>"
+n.id+
"</td><td>"
+((n.referenceId)?n.referenceId:'')+
"</td></tr>"
).children("tr").bind('click', function(){
showMetaData(this.id);
})
});
//Zebra stripe the table
$("#tbData>tr:even").addClass("oddLine");
//And add a hover effect
$("#tbData>tr").hover(function(){
$(this).addClass("hover");
}, function(){
$(this).removeClass("hover");
});
//if there are videos, show the metadata window, else hide it
if(oCurrentMainVideoList.length > 1){showMetaData(0);}
else{closeBox("tdMeta");}
}
If you are looking for HTTP paths: when the API call to Brightcove is correct, you won't see the rtmp:// URLs.
Since you're getting the rtmp URLs, this verifies you're using an API token with URL access, which is good. A request like this should return the playlist and the http URLs (insert your token and playlist ID).
http://api.brightcove.com/services/library?command=find_playlist_by_id&token={yourToken}&playlist_id={yourPlaylist}&video_fields=FLVURL&media_delivery=http
This API test tool can help build the queries for you, and show the expected results:
http://opensource.brightcove.com/tool/api-test-tool
I'm not seeing what would be wrong in your code, but in case you haven't tried this already, debugging in the browser can help you confirm the API results being returned without having to access them via code. This helps you separate issues in the code you're using to access the values from problems with the values themselves. This is an overview of step-debugging in Chrome, if you haven't used it before:
https://developers.google.com/chrome-developer-tools/docs/scripts-breakpoints
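The request above can be assembled in code so that media_delivery=http is never left out. This is only a sketch: buildPlaylistUrl and isRtmp are hypothetical helper names, the token and playlist ID are placeholders, and the endpoint and parameter names are copied from the URL quoted in this answer.

```javascript
// Builds the find_playlist_by_id request quoted in the answer.
// token and playlistId are placeholders supplied by the caller.
function buildPlaylistUrl(token, playlistId) {
  const params = new URLSearchParams({
    command: "find_playlist_by_id",
    token: token,
    playlist_id: String(playlistId),
    video_fields: "FLVURL",
    media_delivery: "http" // the setting Brightcove said to use
  });
  return "http://api.brightcove.com/services/library?" + params.toString();
}

// Returns true when a URL uses the rtmp:// scheme, which is a quick way
// to check what the API actually handed back.
function isRtmp(url) {
  return /^rtmp:\/\//i.test(url);
}

console.log(buildPlaylistUrl("TOKEN", 123));
```

Calling isRtmp(v.FLVURL) inside showMetaData would show at a glance whether the media_delivery setting took effect.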

GoogleAppsScript: How do I trim strings after parsing HTML?

What I'm trying to do is parse & extract the movies title, without all the HTML gunk, from the webpage which will eventually get saved into a spreadsheet. My code:
function myFunction() {
var url = UrlFetchApp.fetch("http://boxofficemojo.com/movies/?id=clashofthetitans2.htm")
var doc = url.getContentText()
var patt1 = doc.match(/<font face\=\"Verdana\"\ssize\=\"6\"><b>.*?<\/b>/i);
//var cleaned = patt1.replace(/^<font face\=\"Verdana\" size\=\"6\"><b>/,"");
//Logger.log(cleaned); Didn't work, get "cannot find function in object" error.
//so tried making a function below:
String.trim = function() {
return this.replace(/^\W<font face\=\"Verdana\"\ssize\=\"6\"><b>/,""); }
Logger.log(patt1.trim());
}
I'm very new to all of this (programming and Google Apps Script in general). I've been referencing the JavaScript section of w3schools.com, but many things on there just don't work with Google Apps Script. I'm just not sure what's missing here. Is my RegEx wrong? Is there a better or faster way to extract this data than RegEx? Any help would be great. Thanks for reading!
While trying to parse information out of HTML that's not under your control is always a bit of a challenge, there is a way you could make this easier on yourself.
I noticed that the title element of each movie page also contains the movie title, like this:
<title>Wrath of the Titans (2012) - Box Office Mojo</title>
You might have more success parsing the title out of this, as it is probably more stable.
var url = UrlFetchApp.fetch("http://boxofficemojo.com/movies/?id=clashofthetitans2.htm");
var doc = url.getContentText();
var match = doc.match(/<title>(.+) \([0-9]{4}\) -/);
Logger.log("Movie title is " + match[1]);
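One detail worth spelling out: for a non-global regex, String.prototype.match returns an array (the whole match followed by the capture groups) or null, not a string. That is why calling replace on the result in the question fails. A minimal sketch of the title approach with a guard for the no-match case follows; extractMovieTitle is a hypothetical name, and the sample markup is the title element quoted above.

```javascript
// Returns the movie title from a page's <title> element, or null when
// the markup does not have the expected "Name (YYYY) - ..." shape.
function extractMovieTitle(html) {
  const match = html.match(/<title>(.+) \([0-9]{4}\) -/);
  // match[0] is the whole matched text, match[1] is the first capture group
  return match ? match[1] : null;
}

const sample = "<title>Wrath of the Titans (2012) - Box Office Mojo</title>";
console.log(extractMovieTitle(sample)); // Wrath of the Titans
console.log(extractMovieTitle("<p>no title here</p>")); // null
```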

can we display other html page information in same page

I am building a website with several pages. I don't want to use links to go to those pages. I have put page numbers at the bottom of the page, and when I click a page number, the content of that page should be shown in the same page. How can I achieve this?
If you don't want to redirect to another page, you have to use a frame (the easier way, but much uglier) or AJAX. The AJAX code is easy; if you need it, I'll post it in a comment :)
Chris Coyier at CSS-tricks has a great article explaining a non-frame SEO friendly technique for doing just that.
var oXHR = new XMLHttpRequest();
oXHR.open("get", "page.php?num=1", true); // request the page you need
oXHR.onreadystatechange = function ()
{
// wait until the response has been fully received
if (oXHR.readyState != 4)
return;
if (oXHR.status != 200)
document.getElementById('page_displayed').innerHTML = "Error: " + oXHR.status + " " + oXHR.statusText;
else
document.getElementById('page_displayed').innerHTML = oXHR.responseText; // your content is displayed here
}
oXHR.send(null);
This is the AJAX code. Then in "page.php" you would have to write something like (I'll write in pseudo-code):
<?php
// assume 10 posts are shown per page
$post = 10;
$page_num = (int) $_GET['num'];
// select from the database the content you need for this page
$offset = ($page_num - 1) * $post;
$sql = "SELECT content FROM pages LIMIT ".$offset.", ".$post;
// OR (if you have separate HTML content for each page)
if ($page_num == 1)
{
?>
<html code here>
<?php
}
// in each case you must return some text, it will be displayed on your page
?>
Ask if you don't understand :)
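The same request can also be sketched with fetch instead of XMLHttpRequest. This is an alternative, not what the answer above uses. showPage is a hypothetical name, page.php and its num parameter are taken from the answer, and target can be any object with an innerHTML property, such as the page_displayed element.

```javascript
// Hypothetical helper: loads one page fragment and writes it into target.
// fetchFn defaults to the global fetch and can be replaced for testing.
async function showPage(target, num, fetchFn = fetch) {
  const response = await fetchFn("page.php?num=" + encodeURIComponent(num));
  if (!response.ok) {
    // same error text as the XMLHttpRequest version
    target.innerHTML = "Error: " + response.status + " " + response.statusText;
    return false;
  }
  target.innerHTML = await response.text();
  return true;
}
```

In a page, each page-number link would call something like showPage(document.getElementById('page_displayed'), 2).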
Yes, frames are going to be the best thing for you.
Here is a link with details: http://www.w3schools.com/html/html_frames.asp

Using Phonegap, Json and jQuery mobile, how to make a list of titles linking to the individual articles

I used Json to get data off a site built in WordPress (using the Json API plugin). I'm using jQuery mobile for the layout of the application in Phonegap. Getting the data to display in Phonegap wasn't the hardest thing to figure out (code below). But is it possible to make a list of the titles of different posts, link each one to its specific article, and load the content in a page? In PHP you could just use an argument, but is there a way to make something like this work in jQuery mobile?
Here's the code I used. It may also be handy if someone happens to come across this post via Google.
<script>
$(document).ready(function(){
var url="http://127.0.0.1:8888/wp/api/get_recent_posts";
$.getJSON(url,function(json){
$.each(json.posts,function(i,post){
$("#content").append(
'<div class="post">'+
'<h1>'+post.title+'</h1>'+
'<p>'+post.content+'</p>'+
'</div>'
);
});
});
});
</script>
EDIT:
I'd like to thank shanabus again for helping me with this. This is the code I got it to work with:
$(document).ready(function() {
var url="http://127.0.0.1:8888/wpjson/api/get_recent_posts";
var buttonHtmlString = "", pageHtmlString = "";
var jsonResults;
$.getJSON(url,function(data){
jsonResults = data.posts;
displayResults();
});
function displayResults() {
for (var i = 0; i < jsonResults.length; i++) {
buttonHtmlString += '<a href="#' + $.trim(jsonResults[i].title).toLowerCase().replace(/ /g,'') + '">' + jsonResults[i].title + '</a>';
pageHtmlString += '<div data-role="page" id="' + $.trim(jsonResults[i].title).toLowerCase().replace(/ /g,'') + '">';
pageHtmlString += '<div data-role="header"><h1>' + jsonResults[i].title + '</h1></div>';
pageHtmlString += '<div data-role="content"><p>' + jsonResults[i].content + '</p></div>';
pageHtmlString += '</div>';
}
$("#buttonGroup").append(buttonHtmlString);
$("#buttonGroup a").button();
$("#buttonGroup").controlgroup();
$("#main").after(pageHtmlString);
}
});
Yes, this is possible. Check out this example: http://jsfiddle.net/shanabus/nuWay/1/
There you will see that we take an object array, cycle through it and append new buttons (and jqm styling). Does this do what you are looking to do?
I would also recommend improving your JavaScript by replacing the $.each with a basic for loop:
for (var i = 0; i < json.posts.length; i++)
This loop structure is known to perform better. The same goes for the append method: I've heard time and time again that it's more efficient to build up a string variable and append it once rather than call append multiple times.
UPDATE
In response to your comment, I have posted a new solution that simulates loading a Json collection of content objects to dynamically add page elements to your application. It also dynamically generates the buttons to link to them.
This works if you do it in $(document).ready() and probably a few other jQM events, but you may have to check the documentation on that or call one of the refresh content methods to make the pages valid.
http://jsfiddle.net/nuWay/4/
Hope this helps!
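The build-a-string-then-append-once idea can be sketched without jQuery as two plain functions. toPageId and buildMarkup are hypothetical names. The id rule copies the trim, lowercase, and strip-spaces steps from the question's edit, and it assumes titles contain only letters, digits, and spaces (anything else would need extra escaping).

```javascript
// Turns a post title into an id usable in "#id" links.
function toPageId(title) {
  return title.trim().toLowerCase().replace(/ /g, "");
}

// Builds the button markup and the page markup as two strings so each
// can be inserted into the document with a single call.
function buildMarkup(posts) {
  let buttons = "";
  let pages = "";
  for (let i = 0; i < posts.length; i++) {
    const id = toPageId(posts[i].title);
    buttons += '<a href="#' + id + '">' + posts[i].title + "</a>";
    pages += '<div data-role="page" id="' + id + '">' +
      '<div data-role="header"><h1>' + posts[i].title + "</h1></div>" +
      '<div data-role="content"><p>' + posts[i].content + "</p></div>" +
      "</div>";
  }
  return { buttons: buttons, pages: pages };
}

// Made-up sample data standing in for the posts array from the API.
const markup = buildMarkup([{ title: "Hello World", content: "First post" }]);
console.log(markup.buttons); // <a href="#helloworld">Hello World</a>
```

The two strings would then be handed to $("#buttonGroup").append(...) and $("#main").after(...) once each, as in the code above.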

Sending values through links

Here is the situation: I have 2 pages.
What I want is to have a number of text links(<a href="">) on page 1 all directing to page 2, but I want each link to send a different value.
On page 2 I want to show that value like this:
Hello you clicked {value}
Another point to take into account is that I can't use any php in this situation, just html.
Can you use any scripting, such as JavaScript? If you can, pass the values along in the query string (just add "?ValueName=Value" to the end of your links). Then retrieve the query string value on the target page. The following site shows how to parse it out: Parsing the Query String.
Here's the JavaScript code you would need (Querystring is not built in; it comes from the script on that site):
var qs = new Querystring();
var v1 = qs.get("ValueName");
From there you should be able to work with the passed value.
Javascript can get it. Say, you're trying to get the querystring value from this url: http://foo.com/default.html?foo=bar
var tabvalue = getQueryVariable("foo");
function getQueryVariable(variable)
{
var query = window.location.search.substring(1);
var vars = query.split("&");
for (var i=0;i<vars.length;i++)
{
var pair = vars[i].split("=");
if (pair[0] == variable)
{
return pair[1];
}
}
}
** Not 100% certain if my JS code here is correct, as I didn't test it.
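In current browsers the built-in URLSearchParams class does this parsing and also decodes escaped characters such as %20. A small sketch follows; getQueryValue is a hypothetical helper name and the sample strings are made up.

```javascript
// Reads one value from a query string such as "?foo=bar".
// Returns null when the parameter is absent.
function getQueryValue(search, name) {
  return new URLSearchParams(search).get(name);
}

console.log(getQueryValue("?foo=bar", "foo")); // bar
console.log(getQueryValue("?ValueName=Some%20Value", "ValueName")); // Some Value
console.log(getQueryValue("?foo=bar", "missing")); // null
```

On the second page it would be called as getQueryValue(window.location.search, "ValueName").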
You might be able to accomplish this using HTML Anchors.
http://www.w3schools.com/HTML/html_links.asp
Append your data to the href attribute of your links and use JavaScript on the second page to parse the URL and display whatever you want.
http://java-programming.suite101.com/article.cfm/how_to_get_url_parts_in_javascript
It's not clean, but it should work.
Use document.location.search and split()
http://www.example.com/example.html?argument=value
var queryString = document.location.search.substring(1); // search is a property, not a function; drop the leading "?"
var parts = queryString.split('=');
document.write(parts[0]); // The argument name
document.write(parts[1]); // The value
Hope it helps
Well this is pretty basic with javascript, but if you want more of this and more advanced stuff you should really look into php for instance. Using php it's easy to get variables from one page to another, here's an example:
the url:
localhost/index.php?myvar=Hello World
You can then access myvar in index.php using this bit of code:
$myvar =$_GET['myvar'];
OK, thanks for all your replies. I'll take a look and see if I can find a way to use the scripts.
It's really annoying, since I have to work around a CMS: all pages in the CMS are created with a WYSIWYG editor, which tends to filter out unrecognized tags and scripts.
Edit: OK, it seems that the damn WYSIWYG editor only recognizes HTML tags... (as expected)
Using php
<?
$passthis = "See you on the other side";
echo '<form action="whereyouwantittogo.php" target="_blank" method="post">'.
'<input type="text" name="passthis1" value="'.
$passthis .' " /> '.
'<button type="Submit" value="Submit" >Submit</button>'.
'</form>';
?>
The script for the page you would like to pass the info to:
<?
$thispassed = $_POST['passthis1'];
echo '<textarea>'. $thispassed .'</textarea>';
echo $thispassed;
?>
Use these two snippets on separate pages, with the latter at whereyouwantittogo.php, and you should be in business.