Not able to scrape data - google-apps-script

I am just starting out in Google Apps Script. Since best practice recommends using as few sheet formulas as possible, I am trying to do my web scraping with the GAS Parser library and then push the data over to my spreadsheet.
In my sheet, the formula below returns a table of data, which is exactly what I want to get from GAS.
=IMPORTHTML("https://finance.yahoo.com/quote/BOO.L/history?p=BOO.L", "table", 1)
The two questions here & here are similar, but trying those methods also fails. It almost seems like I am not getting the full page content: when I pass the result of the call below to Logger.log(), I get nothing that resembles the page I need.
UrlFetchApp.fetch(url).getContentText();
Since the formula gets the data perfectly, I can only assume the problem is in my own code, but I can't figure out where. Here is the code I have tried so far:
function scrapeData() {
  var url = "https://finance.yahoo.com/quote/BARC.L/history?p=BARC.L";
  var fromText = '<td class="Py(10px) Ta(start) Pend(10px)"><span>';
  var toText = '</span></td>';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .iterate();
  Logger.log(scraped)
}
Any guidance much appreciated.

You want to retrieve the values from the URL and put them into a Spreadsheet using Google Apps Script.
If my understanding is correct, how about this modification? There are several possible answers for your situation, so please think of this as just one of them.
Modification points:
To retrieve the table, I used Parser and XmlService.
Retrieve the table as a string value using Parser.
Parse the table using XmlService, which makes it easy to walk through the table.
XmlService is a strong parsing tool for XML, so when it can be applied to HTML, it makes retrieving values from the HTML much easier. However, most HTML pages these days cannot be parsed directly by XmlService. So I always use this flow: cut out the fragment with Parser first, then parse only that fragment.
Modified script:
function scrapeData() {
  // Retrieve table as a string using Parser.
  var url = "https://finance.yahoo.com/quote/BOO.L/history?p=BOO.L";
  // var url = "https://finance.yahoo.com/quote/BARC.L/history?p=BARC.L";
  var fromText = '<div class="Pb(10px) Ovx(a) W(100%)" data-reactid="30">';
  var toText = '<div class="Mstart(30px) Pt(10px)"';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser.data(content).from(fromText).to(toText).build();
  // Parse table using XmlService.
  var root = XmlService.parse(scraped).getRootElement();
  // Retrieve header
  var headerTr = root.getChild("thead").getChildren();
  var res = headerTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})});
  var len = res[0].length;
  // Retrieve values
  var valuesTr = root.getChild("tbody").getChildren();
  var values = valuesTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})})
    .map(function(e) {return e.length == len ? e : e.concat(Array.apply(null, new Array(len - e.length)).map(String.prototype.valueOf,""))});
  Array.prototype.push.apply(res, values);
  // Put the result to the active spreadsheet.
  var ss = SpreadsheetApp.getActiveSheet();
  ss.getRange(1, 1, res.length, res[0].length).setValues(res);
}
Note:
Before you run this modified script, please install the Parser library for GAS.
In my environment, I could confirm that the modified script works for both p=BOO.L and p=BARC.L. I couldn't confirm others, so if an error occurs when you try another symbol, please modify the script.
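The densest line in the modified script is the one that pads short rows so that setValues() receives a rectangular array. As a rough sketch of that one step in plain JavaScript (padRows is a hypothetical helper name, not part of the script above or of any library):

```javascript
// Pad every short row with empty strings so that it has `width` cells.
// Rows that already have `width` cells (or more) are copied unchanged.
function padRows(rows, width) {
  return rows.map(function (row) {
    var padded = row.slice();
    while (padded.length < width) {
      padded.push("");
    }
    return padded;
  });
}
```

In the script it would stand in for the second .map(...) call, for example values = padRows(values, len). Rows such as dividend entries, which have fewer cells than the header, are the ones that get padded.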
Reference:
Parser
XmlService
If this was not what you want, I'm sorry.

Related

Google App Script Randomly Stopped Working - TypeError: Cannot read properties of undefined (reading 'streams') (line 6)

This Google Apps Script code to scrape press release news from Yahoo Finance randomly stopped working today.
It suddenly gives the following error -
TypeError: Cannot read properties of undefined (reading 'streams') (line 6)
function pressReleases(code) {
  var url = 'https://finance.yahoo.com/quote/'+code+'/press-releases'
  var html = UrlFetchApp.fetch(url).getContentText().match(/root.App.main = ([\s\S\w]+?);\n/);
  if (!html || html.length == 1) return;
  var obj = JSON.parse(html[1].trim());
  var res = obj.context.dispatcher.stores.StreamStore.streams["YFINANCE:"+code+".mega"].data.stream_items[0].title;
  return res || "No value";
}
Code in Cell (with the stock symbol in cell A6)
=pressReleases(A6)
I can still retrieve the JSON using Python, and the format of the data in the JSON is exactly the same, so I'm guessing it's a problem with Google Apps Script, but I'm having no luck fixing it.
The JSON output is here: https://privatebin.net/?4064ef5520f5b445#FDiJS868e3xSsgzh3y8LsF72LsefyoZ635kqCx62ZtwH
Any help as to why it suddenly stopped working would be appreciated.

Is the script you are showing from this answer? When I looked at the HTML, it seems that, at present, the data is stored as salted base64. In this case, I would like to propose an answer that builds on the method of this answer.
Usage:
1. Get crypto-js.
Open https://cdnjs.cloudflare.com/ajax/libs/crypto-js/4.1.1/crypto-js.min.js, copy the script, paste it into the Google Apps Script editor, and save it.
2. Modify script.
function pressReleases(code) {
  var url = 'https://finance.yahoo.com/quote/' + code + '/press-releases'
  var html = UrlFetchApp.fetch(url).getContentText().match(/root.App.main = ([\s\S\w]+?);\n/);
  if (!html || html.length == 1) return;
  var obj = JSON.parse(html[1].trim());
  // --- I modified the below script.
  const { _cs, _cr } = obj;
  if (!_cs || !_cr) return;
  const key = CryptoJS.algo.PBKDF2.create({ keySize: 8 }).compute(_cs, JSON.parse(_cr)).toString();
  const obj2 = JSON.parse(CryptoJS.enc.Utf8.stringify(CryptoJS.AES.decrypt(obj.context.dispatcher.stores, key)));
  var res = obj2.StreamStore.streams["YFINANCE:" + code + ".mega"].data.stream_items[0].title;
  // ---
  return res || "No value";
}
When this script is used with code set to PGEN, the value "Precigen to Present at the 41st Annual J.P. Morgan Healthcare Conference" is obtained.
Note:
If you want to load crypto-js directly, you can also use the following script. In this case, however, the process cost is higher than with the flow above. Please be careful about this.
const cdnjs = "https://cdnjs.cloudflare.com/ajax/libs/crypto-js/4.1.1/crypto-js.min.js";
eval(UrlFetchApp.fetch(cdnjs).getContentText());
I can confirm that this method works for the current situation (December 21, 2022). However, if the data or the HTML specification is changed by a future update on the server side, this script might stop working. Please be careful about this.
Reference:
crypto-js

I believe match returns an array, so perhaps you need
var html = UrlFetchApp.fetch(url).getContentText().match(/root.App.main = ([\s\S\w]+?);\n/)[index];
Oh, I missed the fact that you're checking it in a conditional on the very next line, and then using index 1 on the line after that.
var obj = JSON.parse(html[1].trim());
So the problem must be either a variation in the data or this line: var res = obj.context.dispatcher.stores.StreamStore.streams["YFINANCE:"+code+".mega"].data.stream_items[0].title;
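One way to find out which part of that long property chain has changed is to walk it one key at a time and report the first key that is missing. This is only a sketch; probePath is a hypothetical helper, and the sample object in the usage note is invented for illustration.

```javascript
// Walk `path` (an array of keys) through `obj`, one key at a time.
// On success returns { value: <found value>, missingAt: null }.
// Otherwise returns { value: undefined, missingAt: "a.b.c" }, naming the
// first key that could not be read.
function probePath(obj, path) {
  var current = obj;
  for (var i = 0; i < path.length; i++) {
    var readable = current !== null && typeof current === "object";
    if (!readable || !(path[i] in current)) {
      return { value: undefined, missingAt: path.slice(0, i + 1).join(".") };
    }
    current = current[path[i]];
  }
  return { value: current, missingAt: null };
}
```

For example, if stores has become a string instead of an object, probePath(obj, ["context", "dispatcher", "stores", "StreamStore", "streams"]) reports missingAt as "context.dispatcher.stores.StreamStore", which matches the "reading 'streams'" error from the question.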

Why is this Importxml formula not working?

The following formula does work for some, but not for others:
=IFNA(VALUE(IMPORTXML("https://finance.yahoo.com/quote/C2PU.SI", "//*[@class=""D(ib) Mend(20px)""]/span[1]")))
If used without IFNA, it says 'Resource at url not found'.
Here's the value I'm trying to pull in:
I'd appreciate it if you could point me in the right direction.
Thank you!

It does not return any values, even for a simple IMPORTXML.
It seems the site is generated by JavaScript or is protected, so it can't be scraped by IMPORTXML.

Don't use the "inspect" tool, as it shows the DOM as rendered by the web browser, including modifications made to the source by client-side JavaScript. Instead, look at the page source.
Resources
How to know if Google Sheets IMPORTDATA, IMPORTFEED, IMPORTHTML or IMPORTXML functions are able to get data from a resource hosted on a website?

The structure of the DOM is generated by JavaScript. Nevertheless, all the information you need is contained in a JSON string assigned to root.App.main. You can get all the data this way:
function extract(url){
  var source = UrlFetchApp.fetch(url).getContentText()
  return source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}'
}
and then retrieve the data by conventional JSON parsing. This will give you the value:
function marketPrice() {
  var code = 'C2PU.SI'
  var url='https://finance.yahoo.com/quote/' + code
  var source = UrlFetchApp.fetch(url).getContentText()
  var jsonString = source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}'
  var data = JSON.parse(jsonString)
  var regularMarketPrice = data.context.dispatcher.stores.StreamDataStore.quoteData.item(code).regularMarketPrice.raw
  Logger.log(regularMarketPrice)
}
Object.prototype.item=function(i){return this[i]};
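The regular expression above relies on the JSON ending in exactly four closing braces. A possible alternative, sketched here under the assumption that the assignment is followed by an ordinary JSON object literal, is to count braces while skipping over string contents. extractJsonAfter is a hypothetical helper name, not an existing API.

```javascript
// Return the JSON object literal that starts at the first "{" after `marker`.
// Braces inside JSON strings are ignored. Returns null if the marker or a
// balanced object cannot be found.
function extractJsonAfter(source, marker) {
  var markerAt = source.indexOf(marker);
  if (markerAt === -1) return null;
  var start = source.indexOf("{", markerAt + marker.length);
  if (start === -1) return null;
  var depth = 0;
  var inString = false;
  var escaped = false;
  for (var i = start; i < source.length; i++) {
    var ch = source.charAt(i);
    if (inString) {
      if (escaped) {
        escaped = false;        // character after a backslash is literal
      } else if (ch === "\\") {
        escaped = true;
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
      if (depth === 0) return source.substring(start, i + 1);
    }
  }
  return null;
}
```

In Apps Script it could be used as JSON.parse(extractJsonAfter(source, "root.App.main = ")), after checking that the result is not null.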

How to get the currency information from this site

I'm trying to bring to my google sheets the currency information from the site:
https://www.bbva.mx/personas/informacion-financiera-al-dia.html
I've tried IMPORTHTML and IMPORTXML, but neither is working for me.
The information I need is this
Any help with this, please?
Maybe using Apps Script?
Edit:
This is the code I'm using:
function fetchData() {
  var url = 'https://www.bbva.mx/personas/informacion-financiera-al-dia.html';
  var dolarTable = UrlFetchApp.fetch(url).getContentText();
  Logger.log(dolarTable)
  var match = dolarTable.match(/Dólar(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(<\/tr>)/);
  var string = match[0].replace(/(\r\n|\n|\r)/gm," ");
  string = string.replace(/\s/g, "");
  var dollar = string.search("\\$");
  var value = string.indexOf("$", dollar + 1);
  var substrings = string.substring(value);
  var almostThere = substrings.substring(0).indexOf("<");
  var number = substrings.substring(0, almostThere);
  return SpreadsheetApp.getActiveSpreadsheet().getSheets[0].getRange('A1').setValue(number);
}
I'm getting this error:
Regular expression operation exceeded execution time limit (line 5, file "Code")

Okay, so the problem you're running into is that, while IMPORTHTML and IMPORTXML in Sheets import data from a table or list within an HTML page, the webpage you're trying to access uses active server scripts to generate its HTML content.
In Apps Script, there is a built-in UrlFetchApp class which you can use to get HTML data - it has its own limitations, but allows you to get the data from a page into your script for usage.
The page you're trying to get uses a frame that contains an .aspx file, and it's this generated content that has the information you're trying to retrieve. Honestly, this solution is a little ad-hoc as I've used UrlFetchApp.fetch() to get the data, then used regular expressions and built-in JavaScript string functions to get the information out as generically as I can:
function fetchData() {
  var dolarTable = UrlFetchApp.fetch('https://bbv.infosel.com/bancomerindicators/indexv8.aspx').getContentText();
  var match = dolarTable.match(/Dólar(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(.*)\s+(<\/tr>)/);
  var string = match[0].replace(/(\r\n|\n|\r)/gm," ");
  string = string.replace(/\s/g, "");
  var dollar = string.search("\\$");
  var value = string.indexOf("$", dollar + 1);
  var substrings = string.substring(value);
  var almostThere = substrings.substring(0).indexOf("<");
  var number = substrings.substring(0, almostThere);
  SpreadsheetApp.getActiveSpreadsheet().getSheets()[0].getRange('A1').setValue(number);
}
This will fetch the HTML data of the page, then narrow it down to the value you're looking for by substring filtering. I've kept it generic, so as long as the structure of the page doesn't change too much, it should still work even if the amount changes.
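The chained (.*)\s+ groups are the kind of pattern that can backtrack heavily, and when the text is not in the page at all, match returns null and match[0] throws. A minimal sketch that avoids the regular expression by using indexOf only is shown below. amountAfterLabel is a hypothetical helper, and unlike the code above it returns the amount without the leading "$" and does not check that it stays inside the same table row.

```javascript
// Find `label`, then return the text between the `occurrence`-th "$" after it
// and the next "<". Returns null when any of the steps fails.
function amountAfterLabel(html, label, occurrence) {
  var pos = html.indexOf(label);
  if (pos === -1) return null;
  for (var n = 0; n < occurrence; n++) {
    pos = html.indexOf("$", pos + 1);
    if (pos === -1) return null;
  }
  var end = html.indexOf("<", pos);
  if (end === -1) return null;
  return html.substring(pos + 1, end).trim();
}
```

With the fetched page in dolarTable, amountAfterLabel(dolarTable, "Dólar", 2) corresponds to the second "$" that the original code looks for. A null result means the label or the amount was not found.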

Efficient Way of sending Spreadsheet over email using GAS function?

I am creating an addon for Google Sheets that my local High School's volunteer clubs can use to keep track of their member's volunteer hours. Most of the code is done and works very nicely, and I am currently working on a system that will send a member a spreadsheet listing all of the volunteer events that they have logged. I have GAS create a separate spreadsheet, and then send an email with that separate spreadsheet attached in PDF. When the email is received, the PDF is empty except for a singular empty cell at the top left of the page.
I am pretty new to GAS but have been able to grasp the content pretty easily. I have only tried one method of sending the Spreadsheet and that is by using the .getAs(MimeType.PDF). When I changed the "PDF" to "GOOGLE_SHEETS," GAS returned the error: "Blob object must have non-null data for this operation." I am not entirely sure what a Blob object is, and have not found any website or video that has fully explained it, so I am not sure how to go about troubleshooting that error.
I think I'm having a problem grabbing the file because it either sends an empty PDF or it returns an error claiming it needs "non-null data."
function TigerMail()
{
  var Drive = DriveApp;
  var app = SpreadsheetApp;
  var LOOKUP = app.getActiveSpreadsheet().getSheetByName("Student Lookup");
  var Name = LOOKUP.getRange("E1").getValue();
  Name = Name + "'s Hours";
  //app.openById(Name+"'s Hours");
  var HOURS = app.create(Name);
  var ESheet = HOURS.getSheets()[0];
  var ROW = LOOKUP.getLastRow();
  var arr = LOOKUP.getRange("D1:J"+ROW).getValues();
  var cell = ESheet.getRange("A1:G"+ROW);
  cell.setValues(arr);
  ////////////////////////////////////////////////////
  var LOOKUP = app.getActiveSpreadsheet().getSheetByName("Student Lookup");
  var cell = LOOKUP.getRange("D1");
  var Addr = cell.getValue();
  var ROW = LOOKUP.getLastRow();
  var file = Drive.getFilesByName(Name);
  var file = file.next();
  var FORMAT = file.getAs(MimeType.GOOGLE_SHEETS);
  TigerMail.sendEmail(Addr, "Hours",
    "Attached is a list of all of the events you have volunteered at:", {attachments: [FORMAT]} );
}
The final four lines are where the errors occur. I believe I am misunderstanding how .next() and .getFilesByName() work.
(above the comment line: creating a spreadsheet of hours)
(below the comment line: grabbing the spreadsheet and attaching it to an email)
Here is the link to the Google Sheet:
https://docs.google.com/spreadsheets/d/1qlUfTWaj-VyBD2M45F63BtHaqF0UOVkwi04XwZFJ4vg/edit?usp=sharing

In your script, a new Spreadsheet is created and values are put into it.
You want to send an email with the created Spreadsheet attached, converted to PDF format.
If my understanding is correct, how about this modification? Please think of this as just one of several answers.
Modification points:
About Drive.getFilesByName(Name): since the created Spreadsheet is already available in the script, searching Drive for it by name is not necessary.
To use the created Spreadsheet, HOURS from var HOURS = app.create(Name) can be used directly.
About var FORMAT = file.getAs(MimeType.GOOGLE_SHEETS): for Google Docs files, when the blob is retrieved, it is automatically converted to PDF format. This also applies to your situation.
To save the values put into the created Spreadsheet before it is exported, SpreadsheetApp.flush() is used.
When the above points are reflected in your script, it becomes as follows.
Modified script:
Please modify as follows.
From:
var file = Drive.getFilesByName(Name);
var file = file.next();
var FORMAT = file.getAs(MimeType.GOOGLE_SHEETS);
To:
SpreadsheetApp.flush();
var FORMAT = HOURS.getBlob();
Note:
In your script, it seems that the second var ROW = LOOKUP.getLastRow() (below the comment line) is not used.
References:
flush()
getBlob()
If I misunderstood your question and this was not the result you want, I apologize.
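Putting the modification together, the flush, export, and send steps might look like the sketch below. emailSpreadsheetAsPdf and the services parameter are hypothetical names. The two Apps Script calls are passed in so that the function can also be tried with stand-in objects, and MailApp as the mail service is an assumption, since the question does not show where sendEmail comes from.

```javascript
// `spreadsheet` is a Spreadsheet object, e.g. HOURS from app.create(Name).
// `services` is { flush: function, mail: object with sendEmail() }.
function emailSpreadsheetAsPdf(spreadsheet, address, services) {
  // Commit pending setValues() calls before the file is exported.
  services.flush();
  // Per the answer above, the blob of a Google Sheets file comes back as PDF.
  var pdf = spreadsheet.getBlob().setName(spreadsheet.getName() + ".pdf");
  services.mail.sendEmail(
    address,
    "Hours",
    "Attached is a list of all of the events you have volunteered at:",
    { attachments: [pdf] }
  );
  return pdf;
}

// In Apps Script this might be called as:
// emailSpreadsheetAsPdf(HOURS, Addr, {
//   flush: function () { SpreadsheetApp.flush(); },
//   mail: MailApp
// });
```

Passing the HOURS object directly avoids looking the file up again by name, which could match more than one file in Drive.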

How to parse an HTML string using CSS selectors? [duplicate]

var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error when used, however if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the html properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the lack of CDATA in the JavaScript part of the HTML, so the parser complains with the following error.
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing in its Xml String into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.

In 2021, the best way to parse HTML on the .gs side that I know of is...
Click + next to Library
Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
Click "Look up"
Click Add
Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs

As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.
Returns the content of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());
Returns the content of the first paragraph <p> of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());
To add to your project:
Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".

I found that the best way to parse HTML in Google Apps Script is to avoid XmlService.parse and Xml.parse altogether. XmlService.parse doesn't work well with the malformed HTML found on certain websites.
Here is a basic example of how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, I retrieve a list of presidents from "wikipedia.org/wiki/President_of_the_United_States" with a regular JavaScript document.getElementsByTagName() call and paste the values into my Google spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu("Parse Menu")
    .addItem("Parse", "parserMenuItem")
    .addToUi();
}
function parserMenuItem() {
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}
function getUrlData(url) {
  var doc = UrlFetchApp.fetch(url).getContentText()
  return doc
}
function writeToSpreadSheet(data) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = 1
  for (var i = 0; i < data.length; i++) {
    var x = data[i];
    var range = sheet.getRange(row, 1)
    range.setValue(x);
    row = row + 1
  }
}
3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
  <input id= "mButon" type="button" value="Click here to get list" onclick="parse()">
  <div hidden id="mOutput"></div>
</body>
<script>
  window.onload = onOpen;
  function onOpen() {
    var url = "https://en.wikipedia.org/wiki/President_of_the_United_States"
    google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url)
    document.getElementById("mButon").style.visibility = "visible";
  }
  function writeHtmlOutput(x) {
    document.getElementById('mOutput').innerHTML = x;
  }
  function parse() {
    var list = document.getElementsByTagName("area");
    var data = [];
    for (var i = 0; i < list.length; i++) {
      var x = list[i];
      data.push(x.getAttribute("title"))
    }
    google.script.run.writeToSpreadSheet(data);
  }
</script>
</html>
4- Save your gs and html files, go back to your spreadsheet, and reload it. Click on "Parse Menu" - "Parse". Then click on "Click here to get list" in the sidebar.

Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.
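Without a lenient mode, one workaround is to clean up the most common problems before handing the string to XmlService.parse. The sketch below covers only the cases mentioned in this thread (script blocks, bare "&" characters, and unclosed void tags such as <br>). toXmlFriendly is a hypothetical helper, and real pages can still fail for other reasons, such as unclosed <p> or <li> tags or unquoted attributes.

```javascript
// Best-effort cleanup of an HTML fragment so that a strict XML parser has a
// better chance of accepting it. This does not repair arbitrary HTML.
function toXmlFriendly(html) {
  return html
    // Drop <script> blocks, whose contents are rarely valid XML.
    .replace(/<script\b[\s\S]*?<\/script>/gi, "")
    // Escape "&" unless it starts a numeric reference or one of the five
    // entities predefined in XML (amp, lt, gt, quot, apos).
    .replace(/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-fA-F]+);)/g, "&amp;")
    // Self-close common void elements such as <br> and <img ...>.
    .replace(/<(br|hr|img|input|meta|link)\b([^>]*?)\s*\/?>/gi, "<$1$2/>");
}
```

A cleaned fragment could then be passed on as XmlService.parse(toXmlFriendly(fragment)). Note that HTML-only entities such as &nbsp; come out as literal text, because XML does not define them.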

For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML, as there are all sorts of weird cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the match group when using parenthesis in the pattern
var value = result ? result[1] : 'No title found';
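One of the weird cases applies to the pattern above: (.*) is greedy and does not match line breaks, so a title spread over several lines is missed, and a second </title> later on the same line would be swallowed. A slightly more defensive variant, as a sketch (extractTitle is a hypothetical name):

```javascript
// Return the text of the first <title> element, or null if there is none.
function extractTitle(html) {
  // [\s\S]*? also matches line breaks and stops at the first </title>.
  var match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}
```

With the page from the example above it would be called as extractTitle(page.getContentText()).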

I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.
There is an easy-to-use library for text parsing. It's useful if you want to get only one piece of information from the HTML (XML) code.
EDIT 2021: The script library id is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
It works like this:
function getData() {
  var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
  var fromText = '<span class="e-f-ih" title="';
  var toText = '">';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .build();
  Logger.log(scraped);
  return scraped;
}

If you are using
Cheerio library for Google Apps Script
Source code
Library page (⭐ star it!)
Installation by library ID:
1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
A function to get current emojis from unicode.org:
function getEmojis() {
  var t = new Date();
  var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
  var fetch = UrlFetchApp.fetch(url);
  var contentText = fetch.getContentText();
  //console.log(new Date() - t);
  // Cheerio
  var $ = Cheerio.load(contentText);
  var data = [];
  $("table > tbody > tr").each((index, element) => {
    var row = [];
    $(element).find("td").each((index, child) => {
      row.push($(child).text());
    });
    if (row.length > 0) {
      data.push(row);
    }
  });
  //console.log(data);
  //console.log(new Date() - t);
  // Result
  return data;
}
↑ Sample code shows how to parse table and put it into [[array]]
May be used as a custom function:
Bonus
Parsing the site may be a time-consuming operation, and you may reach the quota limit.
Here's a test file with a full version of the script:
https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395
↑ It uses CacheService to reduce the number of calls.
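The caching idea can be sketched roughly as follows. This is not the code from the linked test file. cachedJson is a hypothetical helper, and the cache is passed in as a parameter so that anything with get(key) and put(key, value, seconds), such as the object returned by CacheService.getScriptCache(), can be used.

```javascript
// Return the cached value for `key` if present. Otherwise call `compute`,
// store its result as JSON for `ttlSeconds`, and return it.
function cachedJson(cache, key, ttlSeconds, compute) {
  var hit = cache.get(key);
  if (hit !== null && hit !== undefined) {
    return JSON.parse(hit);
  }
  var value = compute();
  cache.put(key, JSON.stringify(value), ttlSeconds);
  return value;
}

// In Apps Script this might be called as (the key name is made up):
// var price = cachedJson(CacheService.getScriptCache(), "price-BOO.L", 600, fetchPrice);
```

CacheService limits the size of each cached value (100 KB per key), so a large result such as a full parsed table may need to be trimmed or split across several keys before it is stored.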

Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.

There are two options
a) One is to use JavaScript's string functions. First locate your tag using string.indexOf() and then extract the data you want using string.substring().
b) The other option is to make use of the Xml Service.
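Option (a) could look something like the sketch below. textBetween and allTextBetween are hypothetical helper names, and the markers passed to them are whatever text surrounds the value in the page you are scraping.

```javascript
// Return the text between the first `fromText` and the next `toText`,
// or null if either marker is missing.
function textBetween(content, fromText, toText) {
  var start = content.indexOf(fromText);
  if (start === -1) return null;
  start += fromText.length;
  var end = content.indexOf(toText, start);
  if (end === -1) return null;
  return content.substring(start, end);
}

// Return every non-overlapping match as an array (empty if there are none).
function allTextBetween(content, fromText, toText) {
  var results = [];
  if (!fromText || !toText) return results; // empty markers would never advance
  var cursor = 0;
  while (true) {
    var start = content.indexOf(fromText, cursor);
    if (start === -1) break;
    start += fromText.length;
    var end = content.indexOf(toText, start);
    if (end === -1) break;
    results.push(content.substring(start, end));
    cursor = end + toText.length;
  }
  return results;
}
```

Because the markers are matched literally, any change in the page's markup (a new attribute, different whitespace) makes the lookup return null or an empty array, so the result should be checked before it is used.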

It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.
