I know that Puppeteer is a simple and great tool which can easily scrape website data.
As far as I know, in headless mode many browser properties differ from a normal browser, so headless automation can be fingerprinted.
But if I use the following method to connect Puppeteer to an already open browser, can it still be detected?
First: modify the desktop Google Chrome shortcut properties so it opens with a remote debugging port, and start the browser:
C:\Users\13632\AppData\Local\Google\Chrome\Application\chrome.exe --remote-debugging-port=9222
const axios = require('axios')
const puppeteer = require('puppeteer')

async function main() {
  // Ask the running Chrome instance for its DevTools websocket endpoint
  const response = await axios.get(`http://127.0.0.1:9222/json/version`);
  const webSocketDebuggerUrl = response.data.webSocketDebuggerUrl;

  // Attach Puppeteer to the already running browser
  const browser = await puppeteer.connect({
    browserWSEndpoint: webSocketDebuggerUrl,
    ignoreDefaultArgs: ["--enable-automation"],
    slowMo: 100,
    defaultViewport: { width: 1280, height: 600 },
  });

  // Wait for the tab whose URL matches the page you want to drive
  const target = await browser.waitForTarget(t => t.url().includes("your url"));
  const page = await target.page();
}
main()
The above method connects to an already opened, normal Chrome browser. It seems impossible to detect that it is being driven by an automation tool. Is there any other way for me to judge whether the other party is a human or a machine?
Browser profiling and automation detection (and beating it) is an entire subfield of its own. Some drivers (chromedriver; I've not used puppeteer) set flags to indicate automated use, but these are easily defeated. (See for instance undetected chromedriver for a package which tries not to be detectable.)
Then there's user profiling (bots tend to click in predictable ways), running JS in the browser to try to detect the environment, blacklisting IPs (most bots sit behind proxies), and so on.
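To illustrate the in-page approach, here is a minimal sketch of the kind of signals such scripts look at; every one of them can be spoofed, which is rather the point:

// Minimal in-page sketch: a few well-known automation signals.
// None of these are reliable on their own, and all can be faked.
function automationSignals() {
  const signals = [];
  // navigator.webdriver is true when the browser reports it is controlled by automation
  if (navigator.webdriver) signals.push('navigator.webdriver');
  // Headless Chrome historically exposed no plugins
  if (!navigator.plugins || navigator.plugins.length === 0) signals.push('no plugins');
  // Headless builds identify themselves in the user agent string
  if (/HeadlessChrome/.test(navigator.userAgent)) signals.push('HeadlessChrome UA');
  return signals;
}
console.log(automationSignals());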
Ask yourself: what are you afraid of? And then defend against that. Anything you put on the Internet can and will be crawled, but you can make it hard to do disruptive things like booking all the concert tickets and then reselling them with a 500% markup. Specific threats like that have specific answers; but there is no foolproof way to detect an automated browser, and chasing one is a waste of effort.
Is there a way to start the performance profiling programmatically in Chrome?
I want to run a performance test of my web app several times to get a better estimate of the FPS, but manually starting the performance profiling in Chrome is tricky because I'd have to manually align the frame models. (I am using this technique to extract the frames.)
CMD + Shift + E reloads the page and immediately starts profiling, which alleviates the alignment problem, but it only runs for 3 seconds, as explained here. So this doesn't work.
Ideally, I'd like to click on a button to start my test script and also starts the profiling. Is there a way to achieve that?
In case you're still interested, or in case someone else finds this helpful, there's an easy way to achieve this using Puppeteer's tracing class.
Puppeteer uses the Chrome DevTools Protocol's Tracing domain under the hood and writes a JSON file to your system that can be loaded in the DevTools Performance panel.
To get a profile trace of your page's loading time you can implement the following:
const puppeteer = require('puppeteer');

(async () => {
  // launch puppeteer browser in headful mode
  const browser = await puppeteer.launch({
    headless: false,
    devtools: true
  });

  // start a page instance in the browser
  const page = await browser.newPage();

  // start the profiling, with a path to the out file and screenshots collected
  await page.tracing.start({
    path: `tests/logs/trace-${new Date().getTime()}.json`,
    screenshots: true
  });

  // go to the page
  await page.goto('http://localhost:8080');

  // wait for as long as you want
  await page.waitFor(4000);
  // or you can wait for an element to appear with:
  // await page.waitForSelector('some-css-selector');

  // stop the tracing
  await page.tracing.stop();

  // close the browser
  await browser.close();
})();
Of course, you'll have to install Puppeteer first (npm i puppeteer). If you don't want to use Puppeteer, you can interact with the Chrome DevTools Protocol directly (see the link above). I didn't investigate that option very much, since Puppeteer provides a high-level, easy-to-use API on top of CDP. You can also talk to CDP directly through Puppeteer's CDPSession API.
Hope this helps. Good luck!
You can use the Chrome DevTools Protocol directly and pick any of the driver libraries listed at https://github.com/ChromeDevTools/awesome-chrome-devtools#protocol-driver-libraries to create a profile programmatically.
Use the Profiler.start method - https://chromedevtools.github.io/devtools-protocol/tot/Profiler#method-start - to start a profile.
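For example, if you are driving Chrome with Puppeteer anyway, a rough sketch of starting and stopping a CPU profile through a raw CDP session might look like this (the URL and output file name are placeholders):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Open a raw CDP session for this page and enable the Profiler domain
  const client = await page.target().createCDPSession();
  await client.send('Profiler.enable');
  await client.send('Profiler.start');

  await page.goto('http://localhost:8080'); // placeholder URL

  // Stop profiling; Profiler.stop returns the collected CPU profile
  const { profile } = await client.send('Profiler.stop');
  fs.writeFileSync('profile.cpuprofile', JSON.stringify(profile));

  await browser.close();
})();

The saved .cpuprofile file can then be imported into the DevTools profiler for inspection.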
I am using the puppeteer-extra package with its stealth plugin. With the default puppeteer package the incognito window shows up, but with puppeteer-extra, even when initializing the incognito context, the incognito window doesn't open. Any idea whether it's a compatibility issue, or has someone already come across this problem?
I have tried passing the "--incognito" flag in args and also using the context method.
When using the --incognito parameter the parent window opens in incognito, but newPage() opens a second window which is not in the incognito flow.
Here are the two approaches I have used.
Importing the puppeteer-extra package:
import puppeteer from 'puppeteer-extra';
import pluginStealth from 'puppeteer-extra-plugin-stealth';
Method 1:
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
Method 2:
const browser = await puppeteer.launch({ args: ['--incognito'] });
I expect that when using the puppeteer-extra package the behaviour would be the same as with plain puppeteer.
The problem
This appears to be caused by a bug in the puppeteer-extra library. When you open a puppeteer instance with puppeteer-extra, the browser instance is hotpatched to better integrate newly opened pages with plugins.
Unfortunately the current implementation of browser._createPageInContext (as of version 2.1.3) doesn't correctly handle which browser context the new page should belong to once it's opened.
The fix
The fix is this pull request.
Specifically, you need to change this line
return async (contextId) => {
to this
return async function (contextId) {
so that arguments on the next line refers to the wrapper's own call arguments (arrow functions do not have their own arguments binding)
const page = await originalMethod.apply(context, arguments)
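For context, the reason this one-word change matters is that arrow functions never get their own arguments object; a tiny standalone sketch of the difference:

// Arrow functions do not bind their own `arguments`, so inside an arrow the
// name resolves to the enclosing function's arguments instead.
function makeArrow() {
  return (contextId) => arguments.length; // counts makeArrow's arguments
}
function makeRegular() {
  return function (contextId) { return arguments.length; }; // counts the inner call's arguments
}
console.log(makeArrow('a')('b', 'c'));   // 1 - the enclosing call had one argument
console.log(makeRegular('a')('b', 'c')); // 2 - the inner call received two arguments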
I have a Chrome container (deployed using this Dockerfile) that renders pages on request from an App container.
The basic flow is:
App sends an http request to Chrome and in response receives a websocket url to use (e.g. ws://chrome.example.com:9222/devtools/browser/13400ef6-648b-4618-8e4c-b5c73db2a122)
App then uses that websocket url to communicate further with Chrome, and to receive the rendered page. I am using the puppeteer library to connect to and communicate with the Chrome instance, using puppeteer.connect({ browserWSEndpoint: webSocketUrl });
For a single Chrome container this works really well.
But I'm trying to scale things up to have multiple Chrome containers in a Docker swarm.
The problem, I think, is that the websocket url received by App is specific to the instance running in that particular Chrome container, so now that there are multiple Chrome containers, the websocket requests from App will not necessarily be routed to the right one.
What is the best way of dealing with this?
You've got the basic design correct, but the issue you're experiencing is with session "stickiness". However, instead of trying to re-route subsequent requests back to the appropriate machine, we should look for a way to avoid the "pre" request altogether.
The best way to do that is to have your Chrome Docker image man-in-the-middle all HTTP "upgrade" requests. The upgrade is what every WebSocket client, including the puppeteer library (which is just a WebSocket client under the hood), sends before switching protocols. Doing this also obviates the need for a pre-connect call, since the proxying to Chrome happens on upgrade instead of exposing a URL for the app to use. Here's a pretty basic example of doing this with the http-proxy module:
const http = require('http');
const httpProxy = require('http-proxy');
const puppeteer = require('puppeteer');

const proxy = httpProxy.createProxyServer();

http
  .createServer()
  .on('upgrade', async (req, socket, head) => {
    // Launch (or otherwise obtain) a Chrome instance for this connection
    const browser = await puppeteer.launch();
    const target = browser.wsEndpoint();
    // Proxy the WebSocket upgrade straight through to Chrome's endpoint
    proxy.ws(req, socket, head, { target });
  })
  .listen(3000);
There are other benefits to this approach as well: you can limit things like concurrency and even inject scripts to be run at a later time. Those require a little more thought and preparation, but the overall idea remains the same. This also makes load balancing trivial since there's no need to make routing sticky.
If this is something you're interested in implementing, most of that work is already done for you in the browserless repo. It even allows for things like concurrency limitations and session time limitations, and includes a feature-rich IDE. You can find more docs on that project here.
I have an MvvmCross-based iOS project which is having problems with image downloads.
I have a couple of screens which contain UICollectionViews and the UICollectionViewCells use MvxDynamicImageHelpers to set the Image of their UIImageViews to images hosted on the internet (Azure blob storage via Azure CDN in actual fact). I have noticed that the images sometimes do not appear and that this is more common on a slow connection and if I scroll through the whole UICollectionView while the images are loading - presumably as it initiates a large number of simultaneous downloads. Restarting the app causes some, but not all, of the images to be shown.
Looking in the Caches/Pictures.MvvmCross folder I see there are a number of files with .tmp extensions and some without .tmp extensions but a 0 byte file size. I presume that the .tmp files are the ones that are re-downloaded following an app restart and that an invalid in-memory cache entry is causing them not to be re-downloaded until this happens.
I have implemented my versions of MvxDownloadRequest and MvxHttpFileDownloader and registered my IMvxHttpFileDownloader. The only modification in MvxHttpFileDownloader is to use my MvxDownloadRequest instead of the standard Mvx one.
As far as I can see, there are no exceptions being thrown in MvxDownloadRequest.Start or MvxDownloadRequest.ProcessResponse and MvxDownloadRequest.FileDownloadFailed is not being called. Having replaced MvxDownloadRequest.Start with the following, all images are always downloaded and displayed successfully:
try
{
    ThreadPool.QueueUserWorkItem((state) => {
        try
        {
            var fileService = this.GetService<IMvxSimpleFileStoreService>();
            var tempFilePath = DownloadPath + ".tmp";
            var imageData = NSData.FromUrl(NSUrl.FromString(Url));
            var image = UIImage.LoadFromData(imageData);
            NSError nsError;
            image.AsPNG().Save(tempFilePath, true, out nsError);
            fileService.TryMove(tempFilePath, DownloadPath, true);
        }
        catch (Exception exception)
        {
            FireDownloadFailed(exception);
            return;
        }
        FireDownloadComplete();
    });
}
catch (Exception e)
{
    FireDownloadFailed(e);
}
So, what could be causing the problems with the standard WebRequest that does not affect the above version? I'm guessing it's something to do with GC and will do further debugging when I get time, but this won't be for a while, unfortunately. It would be very much appreciated if someone can answer this or provide pointers for when I do look at it.
Thanks,
J
From the description of your investigations so far, it sounds like you have isolated the problem down to the level that HttpWebRequest sometimes fails, but that the NSData methods are 100% reliable.
If this is the case, then it would suggest that the problem is somewhere in the Xamarin.iOS network stack or in the way it is being used.
It might be worth checking the Xamarin Bugzilla repository and also asking their support team if they are aware of any issues in this area. I believe they did make some announcements about changes to iOS networking at Evolve - see the CFNetworkHandler part late in the video and slides at http://xamarin.com/evolve/2013#session-b3mx6e6rmb - and there are worrying questions on here like "iPhone app gets into a state where network requests never complete".
Beyond that, I'd guess the first step in any debugging would be to isolate the issue in a simple test app - e.g. an app which just downloads one image at a time and demonstrates a simple pass/fail for each technique. If you can replicate the issue in a small test app, then it'll be much quicker to work out what the issue is.
I'm trying to get sound working on my iPhone game using the Web Audio API. The problem is that this app is entirely client side. I want to store my mp3s in a local folder (and without being user input driven) so I can't use XMLHttpRequest to read the data. I was looking into using FileSystem but Safari doesn't support it.
Is there any alternative?
Edit: Thanks for the responses below. Unfortunately the Audio element is horribly slow for games; I had it working and the latency just makes the user experience unacceptable. To clarify, what I need is something like -
var request = new XMLHttpRequest();
request.open('GET', 'file:///./../sounds/beep-1.mp3', true);
request.responseType = 'arraybuffer';

request.onload = function() {
  context.decodeAudioData(request.response, function(buffer) {
    dogBarkingBuffer = buffer;
  }, onError);
};
request.send();
But this gives me the errors -
XMLHttpRequest cannot load file:///sounds/beep-1.mp3. Cross origin requests are only supported for HTTP.
Uncaught Error: NETWORK_ERR: XMLHttpRequest Exception 101
I understand the security risks of reading local files, but surely reading within your own domain should be OK?
I had the same problem and I found this very simple solution.
// Grab the input and audio elements defined below
var audio_file = document.getElementById('audio_file');
var audio_player = document.getElementById('audio_player');

audio_file.onchange = function() {
  var files = this.files;
  var file = URL.createObjectURL(files[0]);
  audio_player.src = file;
  audio_player.play();
};
<input id="audio_file" type="file" accept="audio/*" />
<audio id="audio_player"></audio>
You can test here:
http://jsfiddle.net/Tv8Cm/
OK, it's taken me two days of prototyping different solutions and I've finally figured out how I can do this without storing my resources on a server. There are a few blogs that detail this, but I couldn't find the full solution in one place, so I'm adding it here. Seasoned programmers may consider this a bit hacky, but it's the only way I can see this working, so if anyone has a more elegant solution I'd love to hear it.
The solution was to store my sound files as a Base64 encoded string. The sound files are relatively small (less than 30kb) so I'm hoping performance won't be too much of an issue. Note that I put 'xxx' in front of some of the hyperlinks as my n00b status means I can't post more than two links.
Step 1: create Base 64 sound font
First I need to convert my mp3 to a Base64 encoded string and store it as JSON. I found a website that does this conversion for me here - xxxhttp://www.mobilefish.com/services/base64/base64.php
You may need to remove return characters using a text editor but for anyone that needs an example I found some piano tones here - xxxhttps://raw.github.com/mudcube/MIDI.js/master/soundfont/acoustic_grand_piano-mp3.js
Note that in order to work with my example you'll need to remove the header part data:audio/mpeg;base64,
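If you'd rather not depend on a website for the conversion, a short Node.js sketch (the file names here are just examples) produces the same Base64 string without the data: header:

const fs = require('fs');
// Read the mp3 and encode it as a plain Base64 string (no data:audio/mpeg;base64, prefix),
// which matches the format the decoding step below expects.
const base64 = fs.readFileSync('beep-1.mp3').toString('base64');
fs.writeFileSync('beep-1.base64.txt', base64);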
Step 2: decode sound font to ArrayBuffer
You could implement this yourself but I found an API that does this perfectly (why re-invent the wheel, right?) - https://github.com/danguer/blog-examples/blob/master/js/base64-binary.js
Resource taken from - here
Step 3: Adding the rest of the code
Fairly straightforward:
var cNote = acoustic_grand_piano.C2;
var byteArray = Base64Binary.decodeArrayBuffer(cNote);
var context = new webkitAudioContext();

context.decodeAudioData(byteArray, function(buffer) {
  var source = context.createBufferSource(); // creates a sound source
  source.buffer = buffer;
  source.connect(context.destination); // connect the source to the context's destination (the speakers)
  source.noteOn(0); // play immediately (older Web Audio API; newer versions use start(0))
}, function(err) { console.log("err(decodeAudioData): " + err); });
And that's it! I have this working in my desktop version of Chrome and also on mobile Safari (iOS 6 only, of course, as Web Audio is not supported in older versions). It takes a couple of seconds to load on mobile Safari (vs. less than 1 second on desktop Chrome), but this might be because it spends time downloading the sound fonts. It might also be that iOS prevents any sound from playing until a user interaction event has occurred. I need to do more work looking at how it performs.
Hope this saves someone else the grief I went through.
Because iOS apps are sandboxed, the web view (basically Safari wrapped in PhoneGap) allows you to store your mp3 file locally; i.e. there is no "cross domain" security issue.
This is as of iOS 6, as previous iOS versions didn't support the Web Audio API.
Use the HTML5 audio tag to play the audio file in the browser.
Ajax requests work over the http protocol, so when you try to fetch an audio file using file://, the browser marks the request as cross-origin. Set the following in the response header -
header('Access-Control-Allow-Origin: *');
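That header has to be set server-side. As a rough illustration, not tied to any particular framework, a minimal Node.js static server that serves a sounds folder over http and adds the header might look like this (port, folder, and content type are assumptions):

const http = require('http');
const fs = require('fs');
const path = require('path');

http.createServer((req, res) => {
  // Serve files from ./sounds and allow cross-origin XHR access to them
  const filePath = path.join(__dirname, 'sounds', path.basename(req.url));
  fs.readFile(filePath, (err, data) => {
    if (err) { res.writeHead(404); return res.end(); }
    res.writeHead(200, {
      'Content-Type': 'audio/mpeg',            // simplification: assumes only mp3 files
      'Access-Control-Allow-Origin': '*'
    });
    res.end(data);
  });
}).listen(8000);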