Cheerio scraper won't find any links in a sitemap

I'm trying to fetch URLs from an XML sitemap that I want to scrape.
I tried using the standard Cheerio template for this, but it keeps reporting that no URLs are found.
Any idea why this happens?
const Apify = require("apify");
const cheerio = require("cheerio");

Apify.main(async () => {
    const input = await Apify.getInput();

    // Download sitemap
    const xml = await Apify.utils.requestAsBrowser({
        url: input?.url || "https://www.example.com/product-sitemap2.xml",
        headers: {
            "User-Agent": "curl/7.54.0",
        },
    });

    // Parse sitemap and create RequestList from it
    // const $ = cheerio.load(xml.toString());
    const $ = cheerio.load(xml);
    const sources = [];
    $("loc").each(function (val) {
        const url = $(this).text().trim();
        sources.push({
            url,
            headers: {
                // NOTE: Otherwise the target doesn't allow to download the page!
                "User-Agent": "curl/7.54.0",
            },
        });
    });
});

It seems you need to use xml.body instead of xml: requestAsBrowser resolves to a response object, not the raw body string.
See the docs for the Apify.utils.requestAsBrowser function.
const $ = cheerio.load(xml.body);
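Since a sitemap is XML rather than HTML, loading it in cheerio's XML mode can also help. A minimal sketch of the fix in context, reusing the sitemap URL from the question (an untested sketch, not a drop-in actor):

const Apify = require("apify");
const cheerio = require("cheerio");

Apify.main(async () => {
    const response = await Apify.utils.requestAsBrowser({
        url: "https://www.example.com/product-sitemap2.xml",
        headers: { "User-Agent": "curl/7.54.0" },
    });

    // response.body holds the XML string; xmlMode switches cheerio to its XML parser
    const $ = cheerio.load(response.body, { xmlMode: true });
    const urls = $("loc").map((i, el) => $(el).text().trim()).get();
    console.log(urls); // should list the <loc> entries from the sitemap
});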

You're using an outdated SDK; the current approach is documented in Crawlee's sitemap example (https://crawlee.dev/docs/examples/crawl-sitemap); see the sketch below.
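For reference, a minimal sketch along the lines of that example, assuming Crawlee 3.x, where Sitemap and CheerioCrawler are exported from the crawlee package:

const { CheerioCrawler, Sitemap } = require("crawlee");

(async () => {
    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            log.info(request.url); // scrape each page here
        },
    });

    // Sitemap.load fetches and parses the sitemap, returning the listed URLs
    const { urls } = await Sitemap.load("https://www.example.com/product-sitemap2.xml");
    await crawler.addRequests(urls);
    await crawler.run();
})();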

Related

How to do web scraping into a web that has the app-root element? [duplicate]

I am trying to scrape a website, but I can't get some of the elements, because they are created dynamically.
I use cheerio in Node.js, and my code is below.
var request = require('request');
var cheerio = require('cheerio');

var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});
This code returns an empty result, because when the page is loaded, the <ul id="store_list" class="listMain"> is empty.
The content has not been appended yet.
How can I get these elements using Node.js? How can I scrape pages with dynamic content?
Here you go:
var phantom = require('phantom');

phantom.create(function (ph) {
    ph.createPage(function (page) {
        var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
        page.open(url, function () {
            page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function () {
                page.evaluate(function () {
                    $('.listMain > li').each(function () {
                        console.log($(this).find('a').attr('href'));
                    });
                }, function () {
                    ph.exit();
                });
            });
        });
    });
});
Check out GoogleChrome/puppeteer (Headless Chrome Node API).
It makes scraping pretty trivial. The following example will scrape the headline over at npmjs.com (assuming .npm-expansions remains):
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://www.npmjs.com/');

    const textContent = await page.evaluate(() => {
        return document.querySelector('.npm-expansions').textContent;
    });

    console.log(textContent); /* No Problem Mate */
    await browser.close();
})();
evaluate allows for the inspection of dynamic elements, as it runs your script in the context of the page.
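If the list is injected some time after the page loads, you may also need to wait for it before evaluating. A minimal sketch, reusing the URL and selector from the question:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('http://www.bdtong.co.kr/index.php?c_category=C02');

    // Wait until the dynamically created list items exist in the DOM
    await page.waitForSelector('.listMain > li a');

    const hrefs = await page.evaluate(() =>
        Array.from(document.querySelectorAll('.listMain > li a'), a => a.href)
    );
    console.log(hrefs);

    await browser.close();
})();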
Use the npm module x-ray with its pluggable web driver x-ray-phantom.
There are examples on the pages above, but here's how to do dynamic scraping:
var assert = require('assert');
var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray().driver(phantom());

// Note: the original snippet called a test harness's done() callback
x('http://google.com', 'title')(function (err, str) {
    if (err) throw err;
    assert.equal('Google', str);
});
Answering this as a canonical: an alternative to Puppeteer for scraping dynamic sites, which is also well-supported as of 2023, is Playwright. Here's a simple example:
const playwright = require("playwright"); // ^1.28.1

let browser;
(async () => {
    browser = await playwright.chromium.launch();
    const page = await browser.newPage();
    await page.goto("https://example.com");
    const text = await page.locator('h1:text("Example")').textContent();
    console.log(text); // => Example Domain
})()
    .catch(err => console.error(err))
    .finally(() => browser?.close());
The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which is suitable for both static and dynamic scraping.
The only change the tutorial suggests is raising the timeout from 300000 to 3000000 in Browser.js, TimeoutSettings.js, and Launcher.js.
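Rather than patching Puppeteer's own files, newer Puppeteer versions let you raise timeouts through the API; a minimal sketch:

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Raise the default timeout for waits, and pass a per-navigation timeout explicitly
    page.setDefaultTimeout(3000000);
    await page.goto('https://example.com', { timeout: 3000000, waitUntil: 'networkidle2' });

    await browser.close();
})();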

How do I recursively call multiple URLs using Puppeteer and Headless Chrome?

I am trying to write a program to scan multiple URLs at the same time (parallelization). I have extracted a sitemap and stored it as an array in a variable, as shown below, but I am unable to open the URLs with Puppeteer. I am getting the error below:
originalMessage: 'Cannot navigate to invalid URL'
My code is below. Can someone please help me out?
const sitemapper = require('#mastixmc/sitemapper');
const SitemapXMLParser = require('sitemap-xml-parser');

const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml';

/* If a sitemapindex (a link to an xml or gz file) appears in the sitemap, that URL will be accessed too.
   You can optionally specify the number of concurrent accesses and a delay in milliseconds before processing resumes. */
const options = {
    delay: 3000,
    limit: 50000
};

const sitemapXMLParser = new SitemapXMLParser(url, options);
sitemapXMLParser.fetch().then(result => {
    var locs = result.map(value => value.loc);
    var locsFiltered = locs.toString().replace("[", '<br>');
    const urls = locsFiltered;
    console.log(locsFiltered);

    const puppeteer = require("puppeteer");
    async function scrapeProduct(url) {
        const urls = locsFiltered;
        const browser = await puppeteer.launch({
            headless: false
        });
        for (i = 0; i < urls.length; i++) {
            const page = await browser.newPage();
            const url = urls[i];
            const promise = page.waitForNavigation({
                waitUntil: "networkidle2"
            });
            await page.goto(`${url}`);
        }
    }
    scrapeProduct();
});
You see an invalid URL because you converted the array into a URL string the wrong way.
These lines show the problem and a better approach:
// var locsFiltered = locs.toString().replace("[", '<br>') // This is wrong
// const urls = locsFiltered // So the value is invalid
// console.log(locsFiltered)
const urls = locs.map(value => value[0]) // This is better
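You can sanity-check the mapping before launching the browser; a tiny sketch (the logged values are hypothetical):

// urls should now be an array of plain string URLs
console.log(Array.isArray(urls), urls[0]);
// e.g. true 'https://edition.cnn.com/...'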
To scrape the CNN pages, I've added puppeteer-cluster for speed:
const { Cluster } = require('puppeteer-cluster');
const sitemapper = require('#mastixmc/sitemapper');
const SitemapXMLParser = require('sitemap-xml-parser');

const url = 'https://edition.cnn.com/sitemaps/sitemap-section.xml';

async function scrapeProduct(locs) {
    const urls = locs.map(value => value[0]);

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2, // You can set this to any number you like
        puppeteerOptions: {
            headless: false,
            devtools: false,
            args: [],
        }
    });

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url, { timeout: 0, waitUntil: 'networkidle2' });
        const screen = await page.screenshot();
        // Store the screenshot, do something else
    });

    for (let i = 0; i < urls.length; i++) {
        console.log(urls[i]);
        await cluster.queue(urls[i]);
    }
    await cluster.idle();
    await cluster.close();
}

/******
If a sitemapindex (a link to an xml or gz file) appears in the sitemap, that URL will be accessed too.
You can optionally specify the number of concurrent accesses and a delay in milliseconds before processing resumes.
*******/
const options = {
    delay: 3000,
    limit: 50000
};

const sitemapXMLParser = new SitemapXMLParser(url, options);
sitemapXMLParser.fetch().then(async result => {
    var locs = result.map(value => value.loc);
    await scrapeProduct(locs);
});

CheerioJS to parse data in a script tag

I've been trying to parse the data in a script tag using cheerio; however, it's been difficult for the following reason:
I can't parse the generated string into JSON because of HTML entities.
More info:
Also, what is strange to me is that you have to re-load the content into cheerio a second time to get the text.
You're welcome to fork this Repl or copy and paste the code to try it yourself:
https://replit.com/#Graciasc/Cheerio-Script-Parse
const cheerio = require('cheerio');
const { decode } = require('html-entities');

const html = `
<body>
<script type="text/javascript" src="/data/common.0e95a19724a68c79df7b.js"></script>
<script>require("dynamic-module-registry").set("from-server-context", JSON.parse("\x7B\x22data\x22\x3A\x7B\x22available\x22\x3Atrue,\x22name\x22\x3A"Gracias"\x7D\x7D"));</script>
</body>
`;

const $ = cheerio.load(html, {
    decodeEntities: false,
});

const text = $('body').find('script:not([type="text/javascript"])');
const cheerioText = text.eq(0).html();

// implement a better way to grab the string
const scriptInfo = cheerio.load(text.eq(0).html()).text();

const regex = new RegExp(/^.*?JSON.parse\(((?:(?!\)\);).)*)/);
const testing = regex.exec(scriptInfo)[1];
// real output when logged:
// \x7B\x22data\x22\x3A\x7B\x22available\x22\x3Atrue,\x22name\x22\x3A"Gracias"\x7D\x7D
console.log(testing);

// Not working
const json = JSON.parse(testing);

const decoding = decode(testing);
// same output as testing
console.log(decoding);
// Not working
console.log('decode', JSON.parse(decoding));

// Desired JSON:
// { data: { available: true, name: 'Gracias' } }
A clean solution is to use JSDOM.
Repl.it link: https://replit.com/#Graciasc/Cheerio-Script-Parse#index.js
const { JSDOM } = require('jsdom');

const dom = new JSDOM(`<body>
<script type="text/javascript" src="/data/common.0e95a19724a68c79df7b.js"></script>
<script>require("dynamic-module-registry").set("from-server-context", JSON.parse("\x7B\x22data\x22\x3A\x7B\x22available\x22\x3Atrue,\x22name\x22\x3A"Gracias"\x7D\x7D"));</script>
</body>`);

const serializedDom = dom.serialize();
const regex = new RegExp(/^.*?JSON.parse\("((?:(?!"\)\);).)*)/gm);
const jsonString = regex.exec(serializedDom)[1];

console.log(JSON.parse(jsonString));
// output: { data: { available: true, name: 'Gracias' } }
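A variant of the same idea: instead of serializing the whole document, you can pull the script's text out of JSDOM's parsed DOM and run the same regex on it. A small sketch, assuming html is the same string as in the question:

const { JSDOM } = require('jsdom');

const dom = new JSDOM(html); // html: the markup from the question
const script = dom.window.document.querySelector('script:not([type])');
const match = /JSON\.parse\("((?:(?!"\)\);).)*)/.exec(script.textContent);
console.log(JSON.parse(match[1]));
// => { data: { available: true, name: 'Gracias' } }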

How to render REST API JSON data to an HTML page using Node.js without jQuery

I am trying to consume a RESTful API for JSON data and display it on an HTML page.
Here is the code that calls the API and parses the response into JSON:
var https = require('https');

var schema;
var optionsget = {
    host: 'host name', // here only the domain name
    port: 443,
    path: 'your url', // the rest of the url with parameters if needed
    method: 'GET' // do GET
};

var reqGet = https.request(optionsget, function (res) {
    console.log("statusCode: ", res.statusCode);
    var chunks = [];
    res.on('data', function (data) {
        chunks.push(data);
    }).on('end', function () {
        var data = Buffer.concat(chunks);
        schema = JSON.parse(data);
        console.log(schema);
    });
});

reqGet.end();
reqGet.on('error', function (e) {
    console.error(e);
});

var express = require('express');
var app = express();

app.get('/getData', function (request, response) {
    //console.log( data );
    response.end(schema);
});

var server = app.listen(8081, function () {
    var host = server.address().address;
    var port = server.address().port;
    console.log("Example app listening at http://%s:%s", host, port);
});
I am able to get the data in JSON format, but how do I display it on the HTML page? I am trying to do it with Node.js, as I don't want to use jQuery for this. Any help would be greatly appreciated. Thanks.
Your JSON can be rendered through EJS:
npm i ejs
var app = express();
app.set('view engine', 'ejs');
app.use(express.static(__dirname + '/public'));

app.get('/view/getData', function (request, response) {
    // here /view/getData renders the getData.ejs file in the view folder
    response.render(__dirname + '/public/view/getData', { schema: schema });
    // schema is the object referenced in the EJS template
});
Your view file here is getData.ejs:
<!-- store this in view/getData.ejs -->
<h1><%= schema[0] %></h1> <!-- your JSON data here -->
See the EJS docs for a basic reference and the Express res.render docs for a brief explanation.
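If schema is an array of objects, you can also iterate over it in the template. A small sketch, where name is a hypothetical field of your data:

<!-- view/getData.ejs -->
<ul>
  <% schema.forEach(function (item) { %>
    <li><%= item.name %></li> <!-- "name" is a hypothetical field -->
  <% }); %>
</ul>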

Export a JSON object from a .json file to Vue through Express and assign it to a variable

I would like to display on my page some data which I have in a dsa.json file. I am using Express with Vue.
Here's the code from my server.js:
var data;
fs.readFile('./dsa.json', 'utf8', (err, data) => {
    if (err) throw err;
    exports.data = data;
});
Here's the code from between the <script> tags in index.html:
var server = require(['../server']);
var data = server.data;

var scoreboards = new Vue({
    el: '#scoreboard',
    data: {
        students: data
    }
});
I am using RequireJS (from a CDN) to require server between the <script> tags in index.html.
index.html is in the public directory, whereas dsa.json and server.js are in the main directory.
Here are the errors I get in the client:
require.min.js:1 GET http://localhost:3000/server.js
require.min.js:1 Uncaught Error: Script error for "../server"
I think it has something to do with context and scope, but I don't know what exactly.
I am using Chrome.
Your approach is completely wrong: you can't include the server script on your page. Also, I'm not a NodeJS ninja, yet I don't think that exporting the data inside the callback (exports.data = data) will work.
The workaround:
Server side:
const fs = require('fs');
const express = require('express');
const app = express();

// sync is ok in this case, because it runs once when the server starts;
// try to use the async version in other cases when possible
const data = fs.readFileSync('./dsa.json', 'utf8');

app.get('/json', function (req, res) {
    res.send(data);
});
Client side:
var xhr = new XMLHttpRequest();
xhr.open('GET', '/json', true);

xhr.addEventListener('load', function () {
    var scoreboards = new Vue({
        el: '#scoreboard',
        data: {
            students: JSON.parse(xhr.response)
        }
    });
});

xhr.addEventListener('error', function () {
    // handle error
});

xhr.send();
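A more modern client-side alternative, assuming the same /json route, is fetch:

fetch('/json')
    .then(function (res) { return res.json(); })
    .then(function (students) {
        new Vue({
            el: '#scoreboard',
            data: { students: students }
        });
    })
    .catch(console.error);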