Selenium python : block third-party - selenium-chromedriver

I would like to block third-party websites when I access a website using Selenium, to avoid being slowed down while scraping (the same way uMatrix does).
I've seen I could do something like this (I'm using Chrome):
driver.execute_cdp_cmd('Network.setBlockedURLs', {"urls": ["www.baidu.com"]})
driver.execute_cdp_cmd('Network.enable', {})
Instead of listing each URL, can I specify "third-party" in one way or another?
Thanks

You can add an ad blocker extension to Chrome and use the default Chrome profile; it should block the ads.
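If you would rather stay with Network.setBlockedURLs, one possible approach is to derive the block list yourself. The sketch below is only an illustration under stated assumptions: the helper names (site_of, is_third_party, third_party_block_patterns) are hypothetical, the list of request URLs is assumed to have been collected by you beforehand (for example on a first, unblocked visit), and "same site" is approximated naively as "same last two host labels", which is wrong for suffixes such as .co.uk.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Return a naive 'site' for a URL: the last two labels of its host name.

    Assumption: this ignores the public suffix list, so hosts under
    suffixes like co.uk are grouped too coarsely.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def is_third_party(page_url, request_url):
    """True when the request goes to a different naive site than the page."""
    return site_of(page_url) != site_of(request_url)


def third_party_block_patterns(page_url, request_urls):
    """Build a sorted, de-duplicated list of wildcard host patterns to block."""
    hosts = {
        urlsplit(u).hostname
        for u in request_urls
        if urlsplit(u).hostname and is_third_party(page_url, u)
    }
    return sorted(f"*{h}*" for h in hosts)
```

The resulting list could then be passed as the "urls" value of the Network.setBlockedURLs call shown in the question.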

Related

Where to find entire HTML content in Chromium source code

I am currently trying to do this: once the webpage loads, find out whether the URL matches a certain pattern (say www.wikipedia.com/*); if so, parse the HTML content of that webpage, as one can with BeautifulSoup, and check whether the page has a div with class foo and id boo. Any idea where I can write this code? That is, where can I get access to the URL, what do I need to listen to in order to know that the webpage has finished loading, and where and how can I parse the HTML?
I tried going through the code in src/chrome/browser/tab_contents, but I could not find any reasonable place where I can do all this.
Take a look at the following conceptual application layers which represent how Chromium displays web pages:
Image Source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit
The different layers are described as:
WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.
Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).
Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.
WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.
Browser: Represents the browser window, it contains multiple WebContentses.
Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).
Since your goal is to access and interpret the HTML content of a web page by element and/or class, you can look to the rendering process which uses Blink:
The renderers use the Blink open-source layout engine for interpreting and laying out HTML.
Blink has a WebDocument class which allows you to access the HTML content and other properties of a web page:
WebDocument document = GetMainFrame()->GetDocument();
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// document.Url();
The cleanest approach would be via the Chrome remote debugging protocol.
Use the DOM methods to get the root DOM node and walk, search, or query the DOM.
This would make testing simpler as well: you can implement the logic in your favourite scripting language using an existing client library (there are many) and, once that works, implement it in C++.
If this for some reason has to run in-process within Chromium, then as a next step start a thread that connects to this and performs the operations.
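As a rough illustration of prototyping the logic in a scripting language first, here is a minimal Python sketch of the check itself. It assumes the page URL and HTML string have already been obtained (for instance through the debugging protocol); the names DivFinder and page_matches are hypothetical, and the standard-library parser stands in for whatever DOM query you end up using.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser


class DivFinder(HTMLParser):
    """Records whether a <div> with the wanted class and id was seen."""

    def __init__(self, wanted_class, wanted_id):
        super().__init__()
        self.wanted_class = wanted_class
        self.wanted_id = wanted_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names.
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("id") == self.wanted_id:
            self.found = True


def page_matches(url, html, url_pattern="*wikipedia.com/*",
                 wanted_class="foo", wanted_id="boo"):
    """True if the URL fits the pattern and the HTML has the wanted div."""
    if not fnmatchcase(url, url_pattern):
        return False
    finder = DivFinder(wanted_class, wanted_id)
    finder.feed(html)
    return finder.found
```

Once this behaves as expected on saved pages, the same two steps (pattern check, then element lookup) can be ported to C++.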
You need to use a server-side library to parse the contents of a requested HTML page. In Java, for example, there is the library "jsoup"; there may be alternatives for other server-side languages. The main problem you could run into is "forbidden access" due to security restrictions, but since you are not trying to access REST services or similar things, only parsing plain HTML to find string patterns, it should be easy to do with "jsoup". I worked on a project where similar things were programmed to access website pages and parse the response HTML string.
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
See: https://jsoup.org/

How to use chrome dev tools to find elements based on css class or id?

Long time automation developer here (just for context).
It's been bugging me for quite a while that the dev tools in chrome used to find elements just don't seem to work as I expect. Hopefully someone can point out what I'm doing wrong.
Looking at, say, this Sauce Labs page: https://saucelabs.com/blog/selenium-tips-finding-elements-by-their-inner-text-using-contains-a-css-pseudo-class
That page has divs and anchors, and indeed I can do find('a') or find('div').
But why do I have a problem using classes or ids?
The find() method refers to window.find(), a non-standard API for the browser's built-in Find function. It does not find web elements the same way Selenium or Capybara do, and so it does not parse the input as a selector.
You find elements with selectors in Chrome DevTools using document.querySelector() or document.querySelectorAll(). There are no special methods in Chrome DevTools for this, however it does provide the $() and $$() aliases (respectively) to save you time and keystrokes.
You can use jQuery-style code in the Chrome console. For example, to find something with a class of "foo" you can write $('.foo'), and for an id of "bar" you can write $('#bar').
You can read all about it here
You can also just google what you want, e.g. "jQuery how to find a div with id".

How are DOM/rendered html and Coded-Ui are related, can coded-ui test a web application without even considering how that page is rendered in DOM?

I want to know how Coded UI uses the DOM of a page when testing a web application. Or is it related to how that page's HTML is rendered?
Edit: Suppose I have a grid with rows and columns and I want to capture a particular column in it. Does Coded UI rely on the rendered HTML (id, tag name, etc.) in this process?
You can use the HtmlControls listed at the URL below:
https://msdn.microsoft.com/en-us/library/microsoft.visualstudio.testtools.uitesting.htmlcontrols.aspx
I used the Coded UI jQuery extensions available on NuGet here. Once you add this DLL as a reference, you can use the ExecuteScript() method to run a jQuery script inside Coded UI. Similarly, you can make use of the other built-in members.

GWT and autofill

I've noticed that browsers don't recognize my password field as a potential auto-complete target. I'm assuming this has something to do with the fact that the password field isn't in the original HTML - it's created by my GWT script after the page has loaded.
Is there a way to tell a browser, "hey, here's this form, treat it like usual?" How can I let browsers hook into my app for autofill?
There are some workarounds to get the browser to auto-complete your login like the one described here.
After struggling some time with it I strongly suggest you simply wrap an existing form of your host page (do not generate the inputs with GWT), do a form.submit() on it and have a servlet listen to the request.
I believe that password fields (<input> tags with type="password") are not auto-filled for fairly obvious security reasons. It doesn't matter that the field is added after page load by your GWT script.
Try mimicking the field in regular HTML and compare that to how your GWT app creates the DOM structure. Perhaps your GWT app is putting the page together differently?

Any tools to identify undefined CSS/HTML classes?

Are there any tools out there that can look at my website HTML and tell me that (for example) "there is an HTML element at mysite.com/example.html using a class of SOMECLASS but SOMECLASS is not defined in any included CSS files"?
I've created a snippet that does exactly that: https://gist.github.com/kdzwinel/426a0f76f113643fa285
You can run it in the DevTools console and the sample output will look like this:
You could try out a Firefox plugin like Dust-me-selectors
You could try inspecting with Firebug
There is a free Windows desktop tool that can scan a local web project folder and output undefined css classes, i.e. classes that are used in html but are not defined in any css. It also takes JavaScript into account to some degree.
https://sourceforge.net/projects/cssscanner/
All the other answers either didn't work for me or didn't understand the question (including the accepted answer). This one I just tested myself and it works surprisingly well, though it won't catch every edge case.
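For readers who want to see the basic idea behind such tools, here is a minimal Python sketch; it is not how any of the tools above are implemented. It collects the class names used in an HTML string and subtracts the class names that appear in selectors of a CSS string. The names ClassCollector, defined_classes, and undefined_classes are hypothetical, and the regex-based CSS handling is a deliberate simplification (it ignores @import, classes added by JavaScript, and other edge cases).

```python
import re
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name used in class="..." attributes."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


# Naive class-selector pattern: a dot followed by an identifier.
CLASS_SELECTOR = re.compile(r"\.(-?[_a-zA-Z][-_a-zA-Z0-9]*)")


def defined_classes(css):
    """Return class names that appear in selectors of the given CSS text."""
    # Drop comments, then empty the innermost declaration blocks so that
    # values such as url(a.png) are not mistaken for class selectors.
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)
    css = re.sub(r"\{[^{}]*\}", "{}", css)
    return set(CLASS_SELECTOR.findall(css))


def undefined_classes(html, css):
    """Return the sorted class names used in html but not defined in css."""
    collector = ClassCollector()
    collector.feed(html)
    return sorted(collector.classes - defined_classes(css))
```

For example, calling undefined_classes on a page's HTML and the concatenated text of its stylesheets gives a first list of candidates to review by hand.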