C# Get full html document from site

I tried GetStringAsync:
using (var client = new HttpClient())
{
    var html = await client.GetStringAsync(url);
    richTextBox1.Text = html;
}
and DownloadString:
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString(url);
richTextBox1.Text = webData;
But neither returns the full HTML document that Google Chrome shows in its developer tools (F12). How can I get the full HTML of a URL using C#?
I need it for this URL: http://poeplanner.com/. The downloaded HTML doesn't contain even a single table, while Chrome's F12 view does.

My guess is that the markup you don't see is added by JavaScript after the page loads. GetStringAsync and DownloadString only return the HTML the server sends; they don't execute scripts. So you need a browser engine to get that markup too.
A headless browser runs the JavaScript as well, and you can then ask it for the final HTML.
If I'm right, try PhantomJS.
Related question on PhantomJS

Related

Cannot transform XML file to html using XSLT stylesheet

client side - react js
server side - dot net
XSLT version - 2.0
Hi, the requirement is to transform an XML file into an HTML file using an XSLT stylesheet and display it to the user on the client side. The problem is that I could not find a way to transform it properly.
What I tried so far:
I tried linking the stylesheet in the XML file and opening it in the browser so that the browser performs the transformation automatically, but this did not work as expected. Chrome shows just a blank window, and Firefox displays the text with no styling. I also found out that browsers still do not support XSLT 2.0, so I assume that is the issue.
----------------------xml data--------------------------------
The above shows how I linked it. I tried both type="text/xslt" and type="text/xsl".
I tried transforming on the server side (.NET 7 / C#):
XslCompiledTransform transform = new XslCompiledTransform();
using (XmlReader reader = XmlReader.Create(new StringReader(xsltString))) {
    transform.Load(reader);
}
StringWriter results = new StringWriter();
using (XmlReader reader = XmlReader.Create(new StringReader(inputXml))) {
    transform.Transform(reader, null, results);
}
return results.ToString();
The above method did not give any error, but the resulting file had no content. I later found out that XslCompiledTransform does not support XSLT 2.0; it only supports 1.0. So I tried a third-party library, Saxon-HE:
var xslt = new FileInfo(@"E:\xmltesting\stylesheet-ubl.xslt");
var input = new FileInfo(@"E:\xmltesting\invoice32.xml");
var output = new FileInfo(@"E:\xmltesting\test.html");
var processor = new Processor();
var compiler = processor.NewXsltCompiler();
var executable = compiler.Compile(new Uri(xslt.FullName));
var destination = new DomDestination();
using (var inputStream = input.OpenRead())
{
    var transformer = executable.Load();
    transformer.SetInputStream(inputStream, new Uri(input.DirectoryName));
    transformer.Run(destination);
}
destination.XmlDocument.Save(output.FullName);
The above method throws an exception at this line:
var executable = compiler.Compile(new Uri(xslt.FullName));
System.TypeInitializationException: 'The type initializer for 'sun.util.calendar.ZoneInfoFile' threw an exception.'
Inner Exception
MissingMethodException: Method not found: 'Void System.IO.FileStream..ctor(System.String, System.IO.FileMode, System.Security.AccessControl.FileSystemRights, System.IO.FileShare, Int32, System.IO.FileOptions)'.
I could not find much related to this exception.
Since server-side transformation doesn't look that promising at the moment, I moved back to client-side transformation. I am currently looking into SaxonJS, but still no luck.
Does anyone have an idea how to go about this? Thanks.

Martin's answer has shown you the options for running the transformation server-side using Saxon on .NET.
But you also asked about the options for running the transformation client-side in the browser; for that, please take a look at SaxonJS.

If you want to run XSLT 2 or 3 stylesheets with .NET 7, you can use either the commercial SaxonCS package (https://www.nuget.org/packages/SaxonCS; latest versions are 11.5 and 12.0) or the IKVM cross-compiled version of Saxon HE 11.5 (https://www.nuget.org/packages/SaxonHE11s9apiExtensions). The following code uses the IKVM cross-compiled Saxon HE 11.5 in .NET 7:
using net.liberty_development.SaxonHE11s9apiExtensions;
using net.sf.saxon.s9api;
var processor = new Processor(false);
var xsltCompiler = processor.newXsltCompiler();
var xsltExecutable = xsltCompiler.Compile(new FileInfo("ubl.xslt"));
var xslt30Transformer = xsltExecutable.load30();
xslt30Transformer.Transform(new FileInfo("invoice-sample.xml"), processor.NewSerializer(new FileInfo("invoice-sample.html")));
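
Since the empty output in the question came from feeding an XSLT 2.0 stylesheet to a 1.0-only processor, it can help to check the declared version before choosing a processor. The following is only a sketch in Java using the JDK's built-in XML parser; the class and method names are hypothetical, and it assumes a regular xsl:stylesheet or xsl:transform root element with a version attribute (simplified stylesheets that use xsl:version on a literal result element are rejected).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: reads the version a stylesheet declares on its root element.
class XsltVersionCheck {
    private static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

    /** Returns the value of the version attribute on the stylesheet's root element. */
    static String declaredVersion(String stylesheet) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getNamespaceURI() to work
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(stylesheet)))
                    .getDocumentElement();
            if (!XSLT_NS.equals(root.getNamespaceURI())
                    || root.getAttribute("version").isEmpty()) {
                throw new IllegalArgumentException(
                        "Root element is not an XSLT stylesheet with a version attribute");
            }
            return root.getAttribute("version").trim();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            // Parser configuration, I/O and well-formedness errors end up here.
            throw new IllegalArgumentException("Could not parse stylesheet", e);
        }
    }

    /** True when the stylesheet declares version 2.0 or later. */
    static boolean needsXslt2OrLater(String stylesheet) {
        return Double.parseDouble(declaredVersion(stylesheet)) >= 2.0;
    }

    public static void main(String[] args) {
        String sheet = "<xsl:stylesheet version=\"2.0\" "
                + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"/>";
        System.out.println(declaredVersion(sheet) + " -> " + needsXslt2OrLater(sheet));
    }
}
```

A stylesheet for which needsXslt2OrLater returns true would need an XSLT 2.0+ processor such as the Saxon options mentioned above.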

How to programmatically download website sources?

I need to download a data feed from this website:
http://www.oddsportal.com/soccer/argentina/copa-argentina/rosario-central-racing-club-hnmq7gEQ/
Using the developer tools in Chrome, I was able to find this link:
http://fb.oddsportal.com/feed/match/1-1-hnmq7gEQ-1-2-yj45f.dat
which contains everything I need. The question is how to get to the second link programmatically (preferably in Java) when I know the first.
Thanks in advance for any useful help.

This is quite similar to this issue. You can use that approach to get a String listing all the resources the page loaded. Then you just search the string for what you're looking for. It can look like this.
First start ChromeDriver and navigate to the page you wish to scrape.
WebDriver driver = new ChromeDriver();
driver.get("http://www.oddsportal.com/soccer/argentina/copa-argentina/rosario-central-racing-club-hnmq7gEQ/");
Then collect the page's network resource entries into a string:
String scriptToExecute = "var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return network;";
String netData = ((JavascriptExecutor) driver).executeScript(scriptToExecute).toString();
And finally search the string for the desired link:
netData = netData.substring(netData.indexOf("fb.oddsportal"), netData.indexOf(".dat")+4);
System.out.println(netData);
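
The substring call above assumes that "fb.oddsportal" appears before the first ".dat" in the string; if another resource name containing ".dat" comes first, substring throws. A minimal sketch of a more defensive extraction follows. It assumes the feed is served from fb.oddsportal.com and ends in .dat, as in the question; the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds the feed URL inside the stringified performance entries.
class FeedUrlFinder {
    // First URL on the assumed feed host that ends in ".dat".
    // The character class stops at whitespace, commas, quotes and closing braces,
    // which delimit values in the stringified entry list.
    private static final Pattern FEED_URL =
            Pattern.compile("https?://fb\\.oddsportal\\.com/[^\\s,\"'}]+?\\.dat");

    static Optional<String> findFeedUrl(String netData) {
        Matcher m = FEED_URL.matcher(netData);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample where an unrelated ".dat" occurs before the feed URL.
        String sample = "[{name=http://www.oddsportal.com/res/x/global.dat.js}, "
                + "{name=http://fb.oddsportal.com/feed/match/1-1-hnmq7gEQ-1-2-yj45f.dat, duration=12.5}]";
        System.out.println(findFeedUrl(sample).orElse("not found"));
    }
}
```

Returning an Optional makes the "no feed request seen yet" case explicit instead of failing with an index error.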

You can use a library such as jsoup in Java to scrape a page.
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Once you have the document, you can query the links on that page and collect them:
Elements links = doc.select("a[href]");
Then iterate over the collection and follow the links.
for (Element link : links) {
    // "abs:href" resolves relative links against the page URL
    Document linkedDoc = Jsoup.connect(link.attr("abs:href")).get();
}

Enabling Chrome Extension in Incognito Mode via CLI flags?

I'm using Selenium to test a Chrome extension, and part of the extension requires the user to be in incognito mode. So far, the only way I've found to allow the extension in incognito mode on startup is to add the argument user-data-dir=/path/to/directory.
The problem with this is that it loads the extension from the depths of my file system, rather than in a way I can check into git.
I've also tried navigating Selenium to the Chrome extensions settings page, but it seems that Selenium can't drive chrome:// pages.
Any ideas on how to allow the extension in incognito when the Chrome driver boots?

Here is a solution that works with Chrome 74, the latest version at the time of writing.
Navigate to chrome://extensions
Click the Details button for your desired extension
Copy the URL (it contains your extension ID)
Now navigate to that URL and click the "Allow in incognito" toggle.
Java:
driver.get("chrome://extensions/?id=bhghoamapcdpbohphigoooaddinpkbai");
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("document.querySelector('extensions-manager').shadowRoot.querySelector('#viewManager > extensions-detail-view.active').shadowRoot.querySelector('div#container.page-container > div.page-content > div#options-section extensions-toggle-row#allow-incognito').shadowRoot.querySelector('label#label input').click()");
Python:
driver.get("chrome://extensions/?id=bhghoamapcdpbohphigoooaddinpkbai")
driver.execute_script("return document.querySelector('extensions-manager').shadowRoot.querySelector('#viewManager > extensions-detail-view.active').shadowRoot.querySelector('div#container.page-container > div.page-content > div#options-section extensions-toggle-row#allow-incognito').shadowRoot.querySelector('label#label input').click()");
Continue reading if you want to know how and why.
Root Cause:
As part of enhancements to the Chrome browser, Google moved the elements of its settings pages into shadow DOM. Selenium's find_element methods search only the page's original DOM, so they cannot reach the "Allow in incognito" toggle. We have to step into the shadow DOM and access the element in the shadow tree.
Details:
Shadow DOM:
Note: We will be referring to the terms shown in the picture. So please go through the picture for better understanding.
Solution:
To work with a shadow element, we first have to find the shadow host to which the shadow DOM is attached. Here is a simple method that returns the shadow root for a given shadow host.
private static WebElement getShadowRoot(WebDriver driver, WebElement shadowHost) {
    JavascriptExecutor js = (JavascriptExecutor) driver;
    return (WebElement) js.executeScript("return arguments[0].shadowRoot", shadowHost);
}
Then you can access an element in the shadow tree through the shadow root.
// get the shadowHost in the original dom using findElement
WebElement shadowHost = driver.findElement(By.cssSelector("shadowHost_CSS"));
// get the shadow root
WebElement shadowRoot = getShadowRoot(driver,shadowHost);
// access shadow tree element
WebElement shadowTreeElement = shadowRoot.findElement(By.cssSelector("shadow_tree_element_css"));
To simplify the steps above, I created the method below.
public static WebElement getShadowElement(WebDriver driver, WebElement shadowHost, String cssOfShadowElement) {
    WebElement shadowRoot = getShadowRoot(driver, shadowHost);
    return shadowRoot.findElement(By.cssSelector(cssOfShadowElement));
}
Now you can get the shadow tree element with a single method call:
WebElement shadowHost = driver.findElement(By.cssSelector("shadowHost_CSS_Goes_here"));
WebElement shadowTreeElement = getShadowElement(driver, shadowHost, "shadow_tree_element_css");
And perform operations as usual, like .click() or .getText().
shadowTreeElement.click();
This looks simple when there is only one level of shadow DOM. In this case, however, there are multiple levels, so we have to reach the element by going through each shadow host and its root.
Below is a snippet using the methods shown above (getShadowElement and getShadowRoot):
// Locate the shadow host in the current DOM
WebElement shadowHostL1 = driver.findElement(By.cssSelector("extensions-manager"));
// Now locate the shadow elements by traversing all shadow levels
WebElement shadowElementL1 = getShadowElement(driver, shadowHostL1, "#viewManager > extensions-detail-view.active");
WebElement shadowElementL2 = getShadowElement(driver, shadowElementL1, "div#container.page-container > div.page-content > div#options-section extensions-toggle-row#allow-incognito");
// The input sits inside the toggle row's own shadow root (the third level), matching the single JS call
WebElement allowToggle = getShadowElement(driver, shadowElementL2, "label#label input");
allowToggle.click();
You can achieve all of the above steps in a single JS call, as mentioned at the beginning of the answer (repeated below to avoid confusion).
WebElement allowToggle = (WebElement) js.executeScript("return document.querySelector('extensions-manager').shadowRoot.querySelector('#viewManager > extensions-detail-view.active').shadowRoot.querySelector('div#container.page-container > div.page-content > div#options-section extensions-toggle-row#allow-incognito').shadowRoot.querySelector('label#label input')");
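
The single JS call is just the same pattern repeated per level: querySelector on the host, then .shadowRoot, then the next querySelector. As an illustration only, here is a small sketch that builds such a script from a list of selectors; the class and method names are hypothetical and not part of Selenium, and the resulting string would be passed to executeScript as above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: composes the chained shadowRoot lookup as a JavaScript string.
class ShadowQueryBuilder {
    /**
     * Every selector except the last names a shadow host;
     * the last selector names the target element inside the innermost shadow tree.
     */
    static String buildScript(List<String> selectors) {
        if (selectors.isEmpty()) {
            throw new IllegalArgumentException("At least one selector is required");
        }
        StringBuilder js = new StringBuilder("return document");
        for (int i = 0; i < selectors.size(); i++) {
            if (i > 0) {
                js.append(".shadowRoot"); // step into the previous host's shadow tree
            }
            // Escape backslashes and single quotes so the selector stays one JS string literal.
            String escaped = selectors.get(i).replace("\\", "\\\\").replace("'", "\\'");
            js.append(".querySelector('").append(escaped).append("')");
        }
        return js.toString();
    }

    public static void main(String[] args) {
        // Selectors taken from the answer above.
        System.out.println(buildScript(Arrays.asList(
                "extensions-manager",
                "#viewManager > extensions-detail-view.active",
                "div#container.page-container > div.page-content > div#options-section extensions-toggle-row#allow-incognito",
                "label#label input")));
    }
}
```

Keeping the selectors in a list makes it easier to adjust one level when Chrome changes the structure of its extensions page between versions.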

In chrome version 69 this code works (Python version):
driver.get('chrome://extensions')
go_to_extension_js_code = '''
var extensionName = 'TestRevolution';
var extensionsManager = document.querySelector('extensions-manager');
var extensionsItemList = extensionsManager.shadowRoot.querySelector(
'extensions-item-list');
var extensions = extensionsItemList.shadowRoot.querySelectorAll(
'extensions-item');
for (var i = 0; i < extensions.length; i += 1) {
var extensionItem = extensions[i].shadowRoot;
if (extensionItem.textContent.indexOf(extensionName) > -1) {
extensionItem.querySelector('#detailsButton').click();
}
}
'''
enable_incognito_mode_js_code = '''
var extensionsManager = document.querySelector('extensions-manager');
var extensionsDetailView = extensionsManager.shadowRoot.querySelector(
'extensions-detail-view');
var allowIncognitoRow = extensionsDetailView.shadowRoot.querySelector(
'#allow-incognito');
allowIncognitoRow.shadowRoot.querySelector('#crToggle').click();
'''
driver.execute_script(go_to_extension_js_code)
driver.execute_script(enable_incognito_mode_js_code)
Just remember to change the line var extensionName = 'TestRevolution'; to your extension's name.

If you are trying to enable an already installed extension in incognito, try the code below. It should work with Chrome.
driver.get("chrome://extensions-frame");
WebElement checkbox = driver.findElement(By.xpath("//label[@class='incognito-control']/input[@type='checkbox']"));
if (!checkbox.isSelected()) {
    checkbox.click();
}

I'm still a newbie at coding, but I figured out another method after looking at Chrome's crisper.js at chrome://extensions/.
First you need to know the extension ID. You can hard-code the ID as a constant, or use pako's method for obtaining it. Mine is "lmpekldgmhemmmbllpdmafmlofflampm".
Then launch Chrome with --incognito and addExtensions, and execute the JavaScript that enables the extension in incognito.
Example:
public class test2 {
    static String dir = System.getProperty("user.dir");
    static WebDriver driver;
    static JavascriptExecutor js;

    public static void main(String[] args) throws InterruptedException, IOException {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--incognito");
        options.addExtensions(new File(dir + "\\randua.crx"));
        System.setProperty("webdriver.chrome.driver", dir + "\\chromedriver73.exe");
        driver = new ChromeDriver(options);
        js = (JavascriptExecutor) driver;
        String extID = "lmpekldgmhemmmbllpdmafmlofflampm";
        driver.get("chrome://extensions-frame/");
        new WebDriverWait(driver, 60).until(webDriver -> js.executeScript("return document.readyState").equals("complete"));
        js.executeScript("chrome.developerPrivate.updateExtensionConfiguration({extensionId: \"" + extID + "\",incognitoAccess: true})");
        Thread.sleep(1000);
    }
}
Hope it helps :)

Get HTML from Frame using WebBrowser control - unauthorizedaccessexception

I'm looking for a free tool or DLLs that I can use to write my own .NET code to process some web requests.
Let's say I have a URL with some query string parameters, similar to http://www.example.com?param=1. When I open it in a browser, several redirects occur, and eventually HTML is rendered that has a frameset; the inner HTML of one frame contains a table with the data I need. I want to store this data in an external file in CSV format. The data differs depending on the query string parameter param. Let's say I want to run the application and generate 1000 CSV files for param values from 1 to 1000.
I have good knowledge of .NET, JavaScript, and HTML, but the main problem is how to get the final HTML in server code.
What I tried: I created a new Windows Forms application, added a WebBrowser control, and used code like this:
private void FormMain_Shown(object sender, EventArgs e)
{
    var param = 1; //test
    var url = string.Format(Constants.URL_PATTERN, param);
    WebBrowserMain.Navigated += WebBrowserMain_Navigated;
    WebBrowserMain.Navigate(url);
}
void WebBrowserMain_Navigated(object sender, WebBrowserNavigatedEventArgs e)
{
    if (e.Url.OriginalString == Constants.FINAL_URL)
    {
        var document = WebBrowserMain.Document.Window.Frames[0].Document;
    }
}
Unfortunately, I receive an UnauthorizedAccessException, probably because the frame and the document are in different domains. Does anybody have an idea how to work around this, or maybe a completely different approach to implementing functionality like this?

Thanks to Noseratio's comments, I managed to do it with the WebBrowser control. Here are some major points that might help others who have similar questions:
1) The DocumentCompleted event should be used. In the Navigated event, the body of the document is null.
2) The following answer helped a lot: WebBrowserControl: UnauthorizedAccessException when accessing property of a Frame
3) I was not aware of IHTMLWindow2 and similar interfaces. For them to work correctly, I added references to the following COM libraries: Microsoft Internet Controls (SHDocVw) and Microsoft HTML Object Library (MSHTML).
4) I grabbed the HTML of the frame with the following code:
void WebBrowserMain_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    if (e.Url.OriginalString == Constants.FINAL_URL)
    {
        try
        {
            var doc = (IHTMLDocument2) WebBrowserMain.Document.DomDocument;
            var frame = (IHTMLWindow2) doc.frames.item(0);
            var document = CrossFrameIE.GetDocumentFromWindow(frame);
            var html = document.body.outerHTML;
            var dataParser = new DataParser(html);
            //my logic here
        }
        catch (Exception ex)
        {
            // The original post was cut off before the catch block; log or handle as needed.
            System.Diagnostics.Debug.WriteLine(ex);
        }
    }
}
5) To work with the HTML, I used the HTML Agility Pack, which has pretty good XPath search.
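
Once the table cells have been extracted from the frame's HTML, writing them out as CSV mostly comes down to quoting cells correctly. The following is a rough sketch in Java of that last step only, following the usual RFC 4180 conventions; the class name is hypothetical, and it assumes the cell values are already available as strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: turns rows of already-extracted cell values into CSV text.
class CsvRows {
    /** Quotes a cell if it contains a comma, a quote or a line break; doubles embedded quotes. */
    static String escape(String cell) {
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Joins cells with commas and rows with CRLF. */
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream().map(CsvRows::escape).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }

    public static void main(String[] args) {
        // Made-up sample rows standing in for the parsed table.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("id", "name"),
                Arrays.asList("1", "Doe, Jane"));
        System.out.println(toCsv(rows));
    }
}
```

One such string per param value could then be written to its own file.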

How to create a WinRT Book reader application

I'd like to create an application that receives formatted text (RTF) or HTML, renders it, and shows it page by page.
Is there any control that aims to do that?
I tried to use the RichEditBox control to load a file, but it hangs during the operation:
var file = await Windows.ApplicationModel.Package.Current.InstalledLocation.GetFileAsync(@"myFile.rtf");
using (var memstream = await file.OpenReadAsync())
{
    MainText.Document.LoadFromStream(Windows.UI.Text.TextSetOptions.ApplyRtfDocumentDefaults, memstream);
}
I tried to load an HTML file this way:
var file = await Windows.ApplicationModel.Package.Current.InstalledLocation.GetFileAsync(@"myFile.htm");
var stream = await file.OpenAsync(FileAccessMode.Read);
string app;
using (StreamReader rStream = new StreamReader(stream.AsStream()))
{
    app = rStream.ReadToEnd();
}
myWebView.NavigateToString(app);
But I cannot find a way to "count" the length of the parsed text to chunk it into pages.
Is there any other way or library to do that? Any examples online?

If you want to show your HTML content in pages, you can use RichTextBlock with RichTextBlockOverflow. RichTextBlock does not support RTF.
how to inject RTF file to RichTextBlock in c#/xaml Windows store app
Showing Html in WinRT with RichTextBlock or other component
XAML text display sample
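
For comparison, the "count the length and chunk it" idea from the question can be sketched independently of any UI control. The Java sketch below splits plain text into pages by a character budget at word boundaries. This is not what RichTextBlockOverflow does (that control paginates by actual layout), the class name and the budget are hypothetical, and it assumes the markup has already been stripped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: naive pagination by character count, breaking only between words.
class TextPaginator {
    static List<String> paginate(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> pages = new ArrayList<>();
        StringBuilder page = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for empty or all-whitespace input
            }
            // Start a new page when this word (plus a separating space) would exceed the budget.
            if (page.length() > 0 && page.length() + 1 + word.length() > maxChars) {
                pages.add(page.toString());
                page.setLength(0);
            }
            if (page.length() > 0) {
                page.append(' ');
            }
            page.append(word); // a single word longer than maxChars gets a page of its own
        }
        if (page.length() > 0) {
            pages.add(page.toString());
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(paginate("one two three four", 9));
    }
}
```

A character budget ignores fonts, line wrapping and images, so it only approximates real pages; layout-based overflow as suggested in the answer avoids that problem.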
