Crawl URLs from sitemap.xml using Apify Puppeteer and requestQueue

Apify can crawl links from sitemap.xml:
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);
            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
https://sdk.apify.com/docs/examples/puppeteersitemap#docsNav
However, I am not sure how to crawl links from sitemap.xml if I am using requestQueue. For example:
const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest({ url: "https://google.com" });

// This is not working. Apify simply crawls sitemap.xml itself
// and does not add the URLs listed in sitemap.xml to the requestQueue.
await requestQueue.addRequest({ url: "https://google.com/sitemap.xml" });

const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    // This function is called for every page the crawler visits
    handlePageFunction: async (context) => {
        const { request, page } = context;
        const title = await page.title();
        let page_url = request.url;
        console.log(`Title of ${page_url}: ${title}`);
        await Apify.utils.enqueueLinks({
            page, selector: 'a', pseudoUrls, requestQueue, // pseudoUrls is assumed to be defined elsewhere
        });
    },
});
await crawler.run();

The great thing about Apify is that you can use both RequestList and RequestQueue together. In that case, items are taken from the list into the queue as you scrape (so the queue is not overloaded). By using both, you get the best of both worlds.
Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ requestsFromUrl: 'https://edition.cnn.com/sitemaps/cnn/news.xml' }],
    });
    await requestList.initialize();
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        requestQueue,
        handlePageFunction: async ({ page, request }) => {
            console.log(`Processing ${request.url}...`);

            // This is just an example, define your logic
            await Apify.utils.enqueueLinks({
                page, selector: 'a', pseudoUrls: null, requestQueue,
            });

            await Apify.pushData({
                url: request.url,
                title: await page.title(),
                html: await page.content(),
            });
        },
    });

    await crawler.run();
    console.log('Done.');
});
If you want to use just the queue, you will need to parse the XML yourself. Of course, this is not a big issue. You can parse it easily with Cheerio, either before the crawler runs or by using Apify.CheerioCrawler; a minimal sketch of the first option follows below.
Anyway, we recommend using RequestList for bulk URLs because it is created in memory almost instantly, whereas the queue is actually a database (or JSON files locally).
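Here is a minimal sketch of that queue-only approach (my own illustration, not from the SDK docs; it assumes the cheerio package is installed and that the SDK's Apify.utils.requestAsBrowser helper is available):
const Apify = require('apify');
const cheerio = require('cheerio');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    // Download the sitemap XML directly.
    const { body } = await Apify.utils.requestAsBrowser({
        url: 'https://edition.cnn.com/sitemaps/cnn/news.xml',
    });

    // Parse it with Cheerio in XML mode and enqueue every <loc> entry.
    const $ = cheerio.load(body, { xmlMode: true });
    const urls = $('loc').map((i, el) => $(el).text().trim()).get();
    for (const url of urls) {
        await requestQueue.addRequest({ url });
    }

    // ...then pass requestQueue to a PuppeteerCrawler exactly as in the examples above.
});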

Related

Why am I missing data when using a proxy with puppeteer?

When using a proxy for my puppeteer request to scrape a page, some data is missing, such as the price, but the rest is there. When I remove the proxy, everything loads correctly.
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

const getData = async () => {
    const browser = await puppeteer.launch({
        headless: true,
        args: ["--no-sandbox", "--disable-setuid-sandbox", "--ignore-certificate-errors", "--proxy-server=myproxy"],
    });
    const page = await browser.newPage();
    await page.authenticate({ username: "username", password: "password" });
    // 'url' is defined elsewhere in the original code
    await page.goto(url, {
        timeout: 300000,
        waitUntil: "load",
    });
    const html = await page.evaluate(() => document.body.innerHTML);
    const $ = cheerio.load(html);
    const products = [];
    $("[data-automation=product-results] > div").each((index, el) => {
        const product = $(el);
        const id = product.attr("data-product-id");
        const name = product.find("[data-automation=name]").text();
        const image = product.find("[data-automation=image]").attr("src");
        const price = product.find("[data-automation=current-price]").text();
        const data = {
            id,
            name,
            image,
            price,
        };
        products.push(data);
    });
    console.log(products);
    await browser.close();
};
I have tried waitUntil: 'domcontentloaded' (same result), and both waitUntil: 'networkidle0' and waitUntil: 'networkidle2' time out (5 minutes).
I don't quite understand why I am able to get all the data without using a proxy but only partial data when using one.
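One thing worth trying (my own guess, not something from the original post): with a slower proxy the price block may only be injected after the load event, so waiting explicitly for the price selector used in the code above, instead of relying on waitUntil alone, would show whether it ever arrives.
// Hedged sketch: reuse the question's own selector and wait for the price node
// to exist before grabbing the HTML.
await page.goto(url, { timeout: 300000, waitUntil: "load" });
await page.waitForSelector("[data-automation=current-price]", { timeout: 60000 });
const html = await page.evaluate(() => document.body.innerHTML);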

puppeteer function not returning array

Hi guys, can you please point out my mistake in this code?
console.log(urls) is printing undefined.
Thanks in advance.
const puppeteer = require('puppeteer');

async function GetUrls() {
    const browser = await puppeteer.launch({
        headless: false,
        executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'
    })
    const page = await browser.newPage();
    await page.goto("https://some page");
    await page.waitForSelector('a.review.exclick');
    let urls = await page.evaluate(() => {
        let results = [];
        let items = document.querySelectorAll('a.review.exclick');
        items.forEach((item) => {
            results.push({
                url: item.getAttribute('href'),
            });
        });
        return results;
        browser.close();
    });
}

(async () => {
    let URLS = await GetUrls();
    console.log(URLS);
    process.exit(1);
})();
Here is a list:
you don't have a return statement in your GetUrls() function
you close the browser after a return statement AND inside the page.evaluate() method
Keep in mind that anything that is executed within the page.evaluate() will relate to the browser context. To quickly test this, add a console.log("test") before let results = []; and you will notice that nothing appears in your Node.js console, it will appear in your browser console instead.
Therefore, the browser variable is visible within the GetUrls() function but NOT visible within the page.evaluate() method.
Here is the corrected code sample:
const puppeteer = require('puppeteer');

async function GetUrls() {
    const browser = await puppeteer.launch({
        headless: false,
        executablePath: 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe'
    })
    const page = await browser.newPage();
    await page.goto("https://some page");
    await page.waitForSelector('a.review.exclick');
    let urls = await page.evaluate(() => {
        let results = [];
        let items = document.querySelectorAll('a.review.exclick');
        items.forEach((item) => {
            results.push({
                url: item.getAttribute('href'),
            });
        });
        return results;
    });
    await browser.close();
    return urls;
}

(async () => {
    let URLS = await GetUrls();
    console.log(URLS);
    process.exit(1);
})();

Scraping carousell with puppeteer

I am currently doing a project that needs to scrape data from the search results on carousell.ph.
I basically made a sample HTML page replicating Carousell's output HTML, and so far the JavaScript works, except that when I tried to migrate it to Puppeteer it always gives me an error.
The task is basically to get the whole product list from the search URL "https://www.carousell.ph/search/iphone".
Here's the code I made.
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    let url = 'https://www.carousell.ph/search/iphone';
    await page.goto(url, {waitUntil: 'load', timeout: 10000});
    await page.setViewport({ width: 2195, height: 1093 });
    await page.screenshot({ fullPage: true, path: 'carousell.png' });
    document.querySelectorAll('main').forEach(main => {
        main.querySelectorAll('a').forEach(product => {
            const product_details = product.querySelectorAll('p');
            const productName = product.textContent;
            const productHref = product.getAttribute('href');
            console.log(product_details[0].textContent + " - " + product_details[1].textContent);
        });
    });
    await browser.close()
})()
As @hardkoded stated, document is not something that exists out of the box in Puppeteer's Node.js code; it belongs to the browser, not to Node.js. You also do not need to forEach in Node.js. The map technique outlined in this video is very helpful and quick. I would also make sure to keep await on your loop or map technique, because the function is asynchronous, so you want to make sure the promise comes back resolved.
Map technique
An extremely fast way to get many elements from a page into an array is to use a function like the one below. Instead of getting an array of the elements and then looping over them for their properties, you can create a function like this using $$eval and map. The result is a formatted JSON array that takes all the looping out of the equation.
// first_state_list is an element handle from the answerer's own example page.
const links = await first_state_list.$$eval("li.stateList__item", links =>
    links.map(ele2 => ({
        State_nme: ele2.querySelector("a").innerText.trim(), // get the inner text
        State_url: ele2.querySelector("a").getAttribute("href") // get the href
    }))
);
I already made it work:
const puppeteer = require('puppeteer');

async function timeout(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

(async () => {
    const browser = await puppeteer.launch()
    const page = await browser.newPage();
    let searchItem = 'k20&20pro';
    let carousellURL = 'https://www.carousell.ph/search/' + searchItem;
    await page.goto(carousellURL, {waitUntil: 'load', timeout: 100000});
    //await page.goto(carousellURL, {waitUntil: 'networkidle0'});
    await page.setViewport({
        width: 2195,
        height: 1093
    });
    await page.evaluate(() => {
        window.scrollBy(0, window.innerHeight);
    })
    await timeout(15000);
    await page.screenshot({
        fullPage: true,
        path: 'carousell.png'
    });
    var data = await page.evaluate(() =>
        Array.from(
            document.querySelectorAll('main div a:nth-child(2)')).map(products => products.href
        )
    )
    var i;
    for (i = 0; i < data.length; i++) {
        console.log(data[i]);
        // comment this section but this will open the page to get product details.
        //await page.goto(data[1], {"waitUntil" : "networkidle0"});
        // inner product page details
        // this will get the title
        // document.querySelectorAll('h1')[0].innerText;
        // this will get the amount
        // document.querySelectorAll('h2')[0].innerText;
        // this will get the description
        // document.querySelectorAll('section div div:nth-child(4) p')[0].innerText;
        // this will get sellers name
        // document.querySelectorAll('div div:nth-child(2) a p')[0].innerText;
        let ss_filename = 'carousellph_'+searchItem+'_'+i+'.png';
        console.log(ss_filename);
        console.log("\r\n");
        //await page.screenshot({ fullPage: false, path: ss_filename });
    }
    await browser.close()
})()

Want to measure the response time of a web page using a puppeteer script

I have a webpage automated through the Puppeteer script below, and I want to measure the response time and other metrics, such as the time taken to load the DOM content, the time to first byte, etc.
const puppeteer = require('puppeteer');

const myURL = "http://localhost:8080/#/login";
const Username = "jmallick";
const BlockName = "TEST99";
const Password = "Test1234";

async function loginHandler() {
    return new Promise(async (resolve, reject) => {
        let browserLaunch = await puppeteer.launch({
            headless: false,
            args: ['--window-size=1920,1040']
        });
        try {
            const pageRender = await browserLaunch.newPage();
            await pageRender.goto(myURL, {waitUntil: 'networkidle0'});
            // pageRender.setViewport() --> didn't work
            // <want to measure the details here>
            await pageRender.type('#txtUserName', Username);
            await pageRender.type('#txtPassword', Password);
            await pageRender.type('#txtBlockName', BlockName);
            await Promise.all([
                pageRender.click('#fw-login-btn'),
                pageRender.waitForNavigation({ waitUntil: 'networkidle0' }),
                // pageRender.setViewport() --> didn't work
                // <want to measure the details here>
            ]);
            const homePage = await pageRender.$('[href="#/home"]');
            await homePage.click({waitUntil: 'networkidle0'});
            // <want to measure the details here>
        }
        catch (e) {
            return reject(e);
        }
        finally {
            browserLaunch.close();
        }
    });
}

loginHandler().then(console.log).catch(console.error);
I tried pageRender.setViewport() but it didn't work. I checked other questions but couldn't find a matching case.
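For reference, one common way to read such metrics (a sketch of my own, not from the original question) is to pull the Navigation Timing data out of the page after a goto() resolves, e.g. right after await pageRender.goto(myURL, ...):
// Sketch: read the browser's Navigation Timing entries via page.evaluate().
const timing = await pageRender.evaluate(() => {
    const t = window.performance.timing;
    return {
        timeToFirstByte: t.responseStart - t.navigationStart,
        domContentLoaded: t.domContentLoadedEventEnd - t.navigationStart,
        fullLoad: t.loadEventEnd - t.navigationStart,
    };
});
console.log(timing); // values are in milliseconds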

How to use Proxy with puppeteer

I'm new to puppeteer and node, trying to use a proxy with puppeteer in order to collect requests & responses, hopefully also websocket communication, but so far I couldn't get anything to work.
I'm trying the following code:
const puppeteer = require('puppeteer');
const httpProxy = require('http-proxy');
const url = require('url');

let runProxy = async () => {
    // raise a proxy and start collecting req.url/response.statusCode
};

let run = async () => {
    await runProxy();
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--start-fullscreen',
            '--proxy-server=localhost:8096']
    });
    page = await browser.newPage();
    await page.setViewport({ width: 1920, height: 1080 });
    await page.goto('http://www.google.com',
        {waitUntil: 'networkidle2', timeout: 120000});
};

run();
I've tried some variations from https://github.com/nodejitsu/node-http-proxy but nothing seems to work for me; some guidance is needed, thanks.
Try this: use https-proxy-agent or http-proxy-agent to proxy requests on a per-page basis:
import {Job, Launcher, OnStart, PuppeteerUtil, PuppeteerWorkerFactory} from "../..";
import {Page} from "puppeteer";

class TestTask {

    @OnStart({
        urls: [
            "https://www.google.com",
            "https://www.baidu.com",
            "https://www.bilibili.com",
        ],
        workerFactory: PuppeteerWorkerFactory
    })
    async onStart(page: Page, job: Job) {
        await PuppeteerUtil.defaultViewPort(page);
        await PuppeteerUtil.useProxy(page, "http://127.0.0.1:2007");
        await page.goto(job.url);
        console.log(await page.evaluate(() => document.title));
    }
}

@Launcher({
    workplace: __dirname + "/workplace",
    tasks: [
        TestTask
    ],
    workerFactorys: [
        new PuppeteerWorkerFactory({
            headless: false,
            devtools: true
        })
    ]
})
class App {}
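For completeness, here is a plain-Puppeteer sketch (my own illustration, without any extra proxy library; it assumes the proxy from the question is listening on localhost:8096) that logs request URLs and response status codes:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--proxy-server=localhost:8096'],
    });
    const page = await browser.newPage();

    // Log every outgoing request URL and every response status code.
    page.on('request', (req) => console.log('REQUEST', req.url()));
    page.on('response', (res) => console.log('RESPONSE', res.status(), res.url()));

    await page.goto('http://www.google.com', { waitUntil: 'networkidle2', timeout: 120000 });
    await browser.close();
})();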