I am trying to scrape a table of price data from this website using the following code;
function scrapeData() {
// Retrieve table as a string using Parser.
var url = "https://stooq.com/q/d/?s=barc.uk&i=d";
var fromText = '<td align="center" id="t03">';
var toText = '</td>';
var content = UrlFetchApp.fetch(url).getContentText();
var scraped = Parser.data(content).from(fromText).to(toText).build();
//Parse table using XmlService.
var root = XmlService.parse(scraped).getRootElement();
}
I have taken this method from an approach I used in a similar question here however its failing on this particular url and giving me the error;
Error on line 1: Content is not allowed in prolog. (line 12, file "Stooq")
In related questions here and here they talk of textual content that is not accepted being submitted to the parser however, I am unable to apply the solutions in these questions to my own problem. Any help would be much appreciated.
How about this modification?
Modification points:
In this case, it is required to modify the retrieved HTML values. For example, when var content = UrlFetchApp.fetch(url).getContentText() is run, each attribute value is not enclosed. These are required to be modified.
There is a merged column in the header.
When above points are reflected to the script, it becomes as follows.
Modified script:
function scrapeData() {
// Retrieve table as a string using Parser.
var url = "https://stooq.com/q/d/?s=barc.uk&i=d";
var fromText = '#d9d9d9}</style>';
var toText = '<table';
var content = UrlFetchApp.fetch(url).getContentText();
var scraped = Parser.data(content).from(fromText).to(toText).build();
// Modify values
scraped = scraped.replace(/=([a-zA-Z0-9\%-:]+)/g, "=\"$1\"").replace(/nowrap/g, "");
// Parse table using XmlService.
var root = XmlService.parse(scraped).getRootElement();
// Retrieve header and modify it.
var headerTr = root.getChild("thead").getChildren();
var res = headerTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})});
res[0].splice(7, 0, "Change");
// Retrieve values.
var valuesTr = root.getChild("tbody").getChildren();
var values = valuesTr.map(function(e) {return e.getChildren().map(function(f) {return f.getValue()})});
Array.prototype.push.apply(res, values);
// Put the result to the active spreadsheet.
var ss = SpreadsheetApp.getActiveSheet();
ss.getRange(1, 1, res.length, res[0].length).setValues(res);
}
Note:
Before you run this modified script, please install the GAS library of Parser.
This modified script is not corresponding to various URL. This can be used for the URL in your question. If you want to retrieve values from other URL, please modify the script.
Reference:
Parser
XmlService
If this was not what you want, I'm sorry.
Related
I have built a simple custom function in Apps Script using URLFetchApp to get the follower count for TikTok accounts.
function tiktok_fans() {
var raw_data = new RegExp(/("followerCount":)([0-9]+)/g);
var handle = '#charlidamelio';
var web_content = UrlFetchApp.fetch('https://www.tiktok.com/'+ handle + '?lang=en').getContentText();
var match_text = raw_data.exec(web_content);
var result = (match_text[2]);
Logger.log(result)
return result
}
The Log comes back with the correct number for followers.
However, when I change the code to;
function tiktok_fans(handle) {
var raw_data = new RegExp(/("followerCount":)([0-9]+)/g);
//var handle = '#charlidamelio';
var web_content = UrlFetchApp.fetch('https://www.tiktok.com/'+ handle + '?lang=en').getContentText();
var match_text = raw_data.exec(web_content);
var result = (match_text[2]);
Logger.log(result)
return result
}
and use it in a spreadsheet for example =tiktok_fans(A1), where A1 has #charlidamelio I get an #ERROR response in the cell
TypeError: Cannot read property '2' of null (line 6).
Why does it work in the logs but not in the spreadsheet?
--additional info--
Still getting the same error after testing #Tanaike answer below, "TypeError: Cannot read property '2' of null (line 6)."
Have mapped out manually to see the error, each time the below runs, a different log returns "null". I believe this is to do with the ContentText size/in the cache. I have tried utilising Utilities.sleep() in between functions with no luck, I still get null's.
code
var raw_data = new RegExp(/("followerCount":)([0-9]+)/g);
//tiktok urls
var qld = UrlFetchApp.fetch('https://www.tiktok.com/#thisisqueensland?lang=en').getContentText();
var nsw = UrlFetchApp.fetch('https://www.tiktok.com/#visitnsw?lang=en').getContentText();
var syd = UrlFetchApp.fetch('https://www.tiktok.com/#sydney?lang=en').getContentText();
var tas = UrlFetchApp.fetch('https://www.tiktok.com/#tasmania?lang=en').getContentText();
var nt = UrlFetchApp.fetch('https://www.tiktok.com/#ntaustralia?lang=en').getContentText();
var nz = UrlFetchApp.fetch('https://www.tiktok.com/#purenz?lang=en').getContentText();
var aus = UrlFetchApp.fetch('https://www.tiktok.com/#australia?lang=en').getContentText();
var vic = UrlFetchApp.fetch('https://www.tiktok.com/#visitmelbourne?lang=en').getContentText();
//find folowers with regex
var match_qld = raw_data.exec(qld);
var match_nsw = raw_data.exec(nsw);
var match_syd = raw_data.exec(syd);
var match_tas = raw_data.exec(tas);
var match_nt = raw_data.exec(nt);
var match_nz = raw_data.exec(nz);
var match_aus = raw_data.exec(aus);
var match_vic = raw_data.exec(vic);
Logger.log(match_qld);
Logger.log(match_nsw);
Logger.log(match_syd);
Logger.log(match_tas);
Logger.log(match_nt);
Logger.log(match_nz);
Logger.log(match_aus);
Logger.log(match_vic);
Issue:
From your situation, I remembered that the request of UrlFetchApp with the custom function is different from the request of UrlFetchApp with the script editor. So I thought that the reason for your issue might be related to this thread. https://stackoverflow.com/a/63024816 In your situation, your situation seems to be the opposite of this thread. But, it is considered that this issue is due to the specification of the site.
In order to check this difference, I checked the file size of the retrieved HTML data.
The file size of HTML data retrieved by UrlFetchApp executing with the script editor is 518k bytes.
The file size of HTML data retrieved by UrlFetchApp executing with the custom function is 9k bytes.
It seems that the request of UrlFetchApp executing with the custom function is the same as that of UrlFetchApp executing withWeb Apps. The data of 9k bytes are retrieved by using this.
From the above result, it is found that the retrieved HTML is different between the script editor and the custom function. Namely, the HTML data retrieved by the custom function doesn't include the regex of ("followerCount":)([0-9]+). By this, such an error occurs. I thought that this might be the reason for your issue.
Workaround:
When I tested your situation with Web Apps and triggers, the same issue occurs. By this, in the current stage, I thought that the method for automatically executing the script might not be able to be used. So, as a workaround, how about using a button and the custom menu? When the script is run by the button and the custom menu, the script works. It seems that this method is the same as that of the script editor.
The sample script is as follows.
Sample script:
Before you run the script, please set range. For example, please assign this function to a button on Spreadsheet. When you click the button, the script is run. In this sample, it supposes that the values like #charlidamelio are put to the column "A".
function sample() {
var range = "A2:A10"; // Please set the range of "handle".
var raw_data = new RegExp(/("followerCount":)([0-9]+)/g);
var sheet = SpreadsheetApp.getActiveSheet();
var r = sheet.getRange(range);
var values = r.getValues();
var res = values.map(([handle]) => {
if (handle != "") {
var web_content = UrlFetchApp.fetch('https://www.tiktok.com/'+ handle + '?lang=en').getContentText();
var match_text = raw_data.exec(web_content);
return [match_text[2]];
}
return [""];
});
r.offset(0, 1).setValues(res);
}
When this script is run, the values are retrieved from the URL and put to the column "B".
Note:
This is a simple script. So please modify it for your actual situation.
Reference:
Related thread.
UrlFetchApp request fails in Menu Functions but not in Custom Functions (connecting to external REST API)
Added:
About the following additional question,
whilst this works for 1 TikTok handle, when trying to run a list of multiple it fails each time, with the error TypeError: Cannot read property '2' of null. After doing some investigating and manually mapping out 8 handles, I can see that each time it runs, it returns "null" for one or more of the web_content variables. Is there a way to slow the script down/run each UrlFetchApp one at a time to ensure each returns content?
i've tried this and still getting an error. Have tried up to 10000ms. I've added some more detail to the original question, hope this makes sense as to the error. It is always in a different log that I get nulls, hence why I think it's a timing or cache issue.
In this case, how about the following sample script?
Sample script:
In this sample script, when the value cannot be retrieved from the URL, the value is tried to retrieve again as the retry. This sample script uses the 2 times as the retry. So when the value cannot be retrieved by 2 retries, the empty value is returned.
function sample() {
var range = "A2:A10"; // Please set the range of "handle".
var raw_data = new RegExp(/("followerCount":)([0-9]+)/g);
var sheet = SpreadsheetApp.getActiveSheet();
var r = sheet.getRange(range);
var values = r.getValues();
var res = values.map(([handle]) => {
if (handle != "") {
var web_content = UrlFetchApp.fetch('https://www.tiktok.com/'+ handle + '?lang=en').getContentText();
var match_text = raw_data.exec(web_content);
if (!match_text || match_text.length != 3) {
var retry = 2; // Number of retry.
for (var i = 0; i < retry; i++) {
Utilities.sleep(3000);
web_content = UrlFetchApp.fetch('https://www.tiktok.com/'+ handle + '?lang=en').getContentText();
match_text = raw_data.exec(web_content);
if (match_text || match_text.length == 3) break;
}
}
return [match_text && match_text.length == 3 ? match_text[2] : ""];
}
return [""];
});
r.offset(0, 1).setValues(res);
}
Please adjust the value of retry and Utilities.sleep(3000).
This works for me as a Custom Function:
function MYFUNK(n=2) {
const url = 'my website url'
const re = new RegExp(`<p id="un${n}.*\/p>`,'g')
const r = UrlFetchApp.fetch(url).getContentText();
const v = r.match(re);
Logger.log(v);
return v;
}
I used my own website and I have several paragraphs with ids from un1 to un7 and I'm taking the value of A1 for the only parameter. It returns the correct string each time I change it.
I need to parse this xml by Google Script. jsonformatter.org tells me that the XML is valid
I want to get text of ICO but //var ico = root.getChild('Ares_odpovedi').getChild('Odpoved').getChild('VBAS').getChild('ICO').getText(); is throwing an error
The full code is
function getARES() {
var url = 'https://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?'
+ 'ico=06018025'
+ '&xml=1';
var response = UrlFetchApp.fetch(url);
var responseText = response.getContentText(); //.replace(/D:/g,'');
var document = XmlService.parse(responseText);
var root = document.getRootElement();
var ico_tmp0 = root.getName(); // value is "Ares_odpovedi"
var ico_tmp1 = root.getContentSize(); // value is 3
var ico_tmp2 = root.getChild('Ares_odpovedi'); // value is null
var ico_tmp3 = root.getChild('Odpoved'); // value is null
//var ico = root.getChild('Ares_odpovedi').getChild('Odpoved').getChild('VBAS').getChild('ICO').getText();
//var ico = root.getChild('Odpoved').getChild('VBAS').getChild('ICO').getText();
Logger.log(response);
Logger.log(" ");
Logger.log(responseText);
}
I believe your goal as follows.
You want to retrieve the text of ICO using Google Apps Script.
In this case, it is required to use the name space when getChild is used. When this is reflected to your script, it becomes as follows.
Modified script:
function getARES() {
var url = 'https://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?'
+ 'ico=06018025'
+ '&xml=1';
var response = UrlFetchApp.fetch(url);
var responseText = response.getContentText(); //.replace(/D:/g,'');
var document = XmlService.parse(responseText);
var root = document.getRootElement();
// I modified below script.
var ns1 = XmlService.getNamespace("/ares/xml_doc/schemas/ares/ares_answer_basic/v_1.0.3");
var ns2 = XmlService.getNamespace("/ares/xml_doc/schemas/ares/ares_datatypes/v_1.0.3");
var res = root.getChild("Odpoved", ns1).getChild("VBAS", ns2).getChild("ICO", ns2).getText();
Logger.log(res)
}
When above script is run, 06018025 is retrieved.
When http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?ico=27074358&xml=1 is used as the URL of UrlFetchApp.fetch, 27074358 is obtained.
References:
XML Service
Added:
From your replying of Any idea why var res2 = root.getChild("Odpoved", ns1).getChild("VBAS", ns2).getChild("DIC", ns2).getText(); does not work?, now I noticed that your question had been changed.
In your question, you wanted to retrieve the value of ICO. But in the case for retrieving the value of DIC, it is required to check the structure of XML. Because in your script in your question, the XML from var url = 'https://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?' + 'ico=06018025' + '&xml=1'; doesn't include the value of DIC. I think that this is the reason of your issue.
When you want to retrieve the value of DIC from http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?ico=27074358&xml=1, please use the following script.
Modified script:
function getARES() {
var url = 'http://wwwinfo.mfcr.cz/cgi-bin/ares/darv_bas.cgi?ico=27074358&xml=1'; // <--- Modified
var response = UrlFetchApp.fetch(url);
var responseText = response.getContentText(); //.replace(/D:/g,'');
var document = XmlService.parse(responseText);
var root = document.getRootElement();
var ns1 = XmlService.getNamespace("/ares/xml_doc/schemas/ares/ares_answer_basic/v_1.0.3");
var ns2 = XmlService.getNamespace("/ares/xml_doc/schemas/ares/ares_datatypes/v_1.0.3");
var res = root.getChild("Odpoved", ns1).getChild("VBAS", ns2).getChild("DIC", ns2).getText(); // <--- Modified
Logger.log(res) // In this case, CZ27074358 is retrieved.
}
Note:
About the name space, these threads might be useful.
What are XML namespaces for?
How does XPath deal with XML namespaces?
This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I'm attempting to scrape options pricing data from Yahoo Finance in Google Sheets. Although I'm able to pull the options chain just fine, i.e.
=IMPORTHTML("https://finance.yahoo.com/quote/TCOM/options?date=1610668800","table",2)
I find that it's returning results that don't completely match what's actually shown on Yahoo Finance. Specifically, the scraped results are incomplete - they're missing some strikes. i.e. the first 5 rows of the chart may match, but then it will start returning only every other strike (aka skipping every other strike).
Why would IMPORTHTML be returning "abbreviated" results, which don't match what's actually shown on the page? And more importantly, is there some way to scrape complete data (i.e. that doesn't skip some portion of the available strikes)?
In Yahoo finance, all data are available in a big json called root.App.main. So to get the complete set of data, proceed as following
var source = UrlFetchApp.fetch(url).getContentText()
var jsonString = source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}'
var data = JSON.parse(jsonString)
You can then choose to fetch the informations you need. Take a copy of this example https://docs.google.com/spreadsheets/d/1sTA71PhpxI_QdGKXVAtb0Rc3cmvPLgzvXKXXTmiec7k/copy
edit
if you want to get a full list of available data, you can retrieve it by this simple script
// mike.steelson
let result = [];
function getAllDataJSON(url = 'https://finance.yahoo.com/quote/TCOM/options?date=1610668800') {
var source = UrlFetchApp.fetch(url).getContentText()
var jsonString = source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}'
var data = JSON.parse(jsonString)
getAllData(eval(data),'data')
var sh = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet()
sh.getRange(1, 1, result.length, result[0].length).setValues(result);
}
function getAllData(obj,id) {
const regex = new RegExp('[^0-9]+');
for (let p in obj) {
var newid = (regex.test(p)) ? id + '["' + p + '"]' : id + '[' + p + ']';
if (obj[p]!=null){
if (typeof obj[p] != 'object' && typeof obj[p] != 'function'){
result.push([newid, obj[p]]);
}
if (typeof obj[p] == 'object') {
getAllData(obj[p], newid );
}
}
}
}
Here's a simpler way to get the last market price of a given option. Add this function to you Google Sheets Script Editor.
function OPTION(ticker) {
var ticker = ticker+"";
var URL = "finance.yahoo.com/quote/"+ticker;
var html = UrlFetchApp.fetch(URL).getContentText();
var count = (html.match(/regularMarketPrice/g) || []).length;
var query = "regularMarketPrice";
var loc = 0;
var n = parseInt(count)-2;
for(i = 0; i<n; i++) {
loc = html.indexOf(query,loc+1);
}
var value = html.substring(loc+query.length+9, html.indexOf(",", loc+query.length+9));
return value*100;
}
In your google sheets input the Yahoo Finance option ticker like below
=OPTION("AAPL210430C00060000")
I believe your goal as follows.
You want to retrieve the complete table from the URL of https://finance.yahoo.com/quote/TCOM/options?date=1610668800, and want to put it to the Spreadsheet.
Issue and workaround:
I could replicate your issue. When I saw the HTML data, unfortunately, I couldn't find the difference of HTML between the showing rows and the not showing rows. And also, I could confirm that the complete table is included in the HTML data. By the way, when I tested it using =IMPORTXML(A1,"//section[2]//tr"), the same result of IMPORTHTML occurs. So I thought that in this case, IMPORTHTML and IMPORTXML might not be able to retrieve the complete table.
So, in this answer, as a workaround, I would like to propose to put the complete table parsed using Sheets API. In this case, Google Apps Script is used. By this, I could confirm that the complete table can be retrieved by parsing the HTML table with Sheet API.
Sample script:
Please copy and paste the following script to the script editor of Spreadsheet, and please enable Sheets API at Advanced Google services. And, please run the function of myFunction at the script editor. By this, the retrieved table is put to the sheet of sheetName.
function myFunction() {
// Please set the following variables.
const url ="https://finance.yahoo.com/quote/TCOM/options?date=1610668800";
const sheetName = "Sheet1"; // Please set the destination sheet name.
const sessionNumber = 2; // Please set the number of session. In this case, the table of 2nd session is retrieved.
const html = UrlFetchApp.fetch(url).getContentText();
const section = [...html.matchAll(/<section[\s\S\w]+?<\/section>/g)];
if (section.length >= sessionNumber) {
if (section[sessionNumber].length == 1) {
const table = section[sessionNumber][0].match(/<table[\s\S\w]+?<\/table>/);
if (table) {
const ss = SpreadsheetApp.getActiveSpreadsheet();
const body = {requests: [{pasteData: {html: true, data: table[0], coordinate: {sheetId: ss.getSheetByName(sheetName).getSheetId()}}}]};
Sheets.Spreadsheets.batchUpdate(body, ss.getId());
}
} else {
throw new Error("No table.");
}
} else {
throw new Error("No table.");
}
}
const sessionNumber = 2; means that 2 of =IMPORTHTML("https://finance.yahoo.com/quote/TCOM/options?date=1610668800","table",2).
References:
Method: spreadsheets.batchUpdate
PasteDataRequest
I am looking for help from this community regarding the below issue.
// I am searching my Gmail inbox for a specific email
function getWeeklyEmail() {
var emailFilter = 'newer_than:7d AND label:inbox AND "Report: Launchpad filter"';
var threads = GmailApp.search(emailFilter, 0, 5);
var messages=[];
threads.forEach(function(threads)
{
messages.push(threads.getMessages()[0]);
});
return messages;
}
// Trying to parse the HTML table contained within the email
function getParsedMsg() {
var messages = getWeeklyEmail();
var msgbody = messages[0].getBody();
var doc = XmlService.parse(msgbody);
var html = doc.getRootElement();
var tables = doc.getDescendants();
var templ = HtmlService.createTemplateFromFile('Messages1');
templ.tables = [];
return templ.evaluate();
}
The debugger crashes when I try to step over the XmlService.parse function. The msgbody of the email contains both text and HTML formatted table. I am getting the following error: TypeError: Cannot read property 'getBody' of undefined (line 19, file "Code")
If I remove the getParsedMsg function and instead just display the content of the email, I get the email body along with the element tags etc in html format.
Workaround
Hi ! The issue you are experiencing is due to (as you previously mentioned) XmlService only recognising canonical XML rather than HTML. One possible workaround to solve this issue is to search in the string you are obtaining with getBody() for your desired tags.
In your case your main issue is var doc = XmlService.parse(msgbody);. To solve it you could iterate through the whole string looking for the table tags you need using Javascript search method. Here is an example piece of code retrieving an email with a single table:
function getWeeklyEmail() {
var emailFilter = 'newer_than:7d AND label:inbox AND "Report: Launchpad filter"';
var threads = GmailApp.search(emailFilter, 0, 5);
var messages=[];
threads.forEach(function(threads)
{
messages.push(threads.getMessages()[0]);
});
return messages;
}
// Trying to parse the HTML table contained within the email
function getParsedMsg() {
var messages = getWeeklyEmail();
var msgbody = messages[0].getBody();
var indexOrigin = msgbody.search('<table');
var indexEnd = msgbody.search('</table');
// Get what is in between those indexes of the string.
// I am adding 8 as it indexEnd only gets the first index of </table
// i.e the one before <
var Table = msgbody.substring(indexOrigin,indexEnd+8);
Logger.log(Table);
}
If you are looking for more than one table in your message, you can change getParsedMsg to the following:
function getParsedMsg() {
// If you are not sure about how many you would be expecting, use an approximate number
var totalTables = 2;
var messages = getWeeklyEmail();
var msgbody = messages[0].getBody();
var indexOrigin = msgbody.indexOf('<table');
var indexEnd = msgbody.indexOf('</table');
var Table = []
for(i=0;i<totalTables;i++){
// go over each stable and store their strings in elements of an array
var start = msgbody.indexOf('<table', (indexOrigin + i))
var end = msgbody.indexOf('</table', (indexEnd + i))
Table.push(msgbody.substring(start,end+8));
}
Logger.log(Table);
}
This will let you store each table in an element of an array. If you want to use these you would just need to retrieve the elements of this array and use them accordingly (for exaple to use them as HTML tables.
I hope this has helped you. Let me know if you need anything else or if you did not understood something. :)
I am looking to bring in the "bid" values from each "ticker" from this API call https://api.etherdelta.com/returnTicker into Google Sheet cells.
An example cell value will have something like: =crypt("PPT).
Here is the code I have so far, but I am having a hard time figuring out how I can get the data for each ticker (I know I haven't declared "ticker" anywhere in the code).
function crypt(ticker) {
var url = "https://api.etherdelta.com/returnTicker";
var response = UrlFetchApp.fetch(url);
var text = response.getContentText();
var json = JSON.parse(text);
var price = json[bid];
return parseFloat(bid);
}
How about the following modifications?
Modification points :
Each ticker name has a header of ETH_.
ETH_ + ticker is a key of the object.
When =crypt("PPT") is used, the key is ETH_PPT and "bid" you want is in the value of ETH_PPT.
The modified script which was reflected above is as follows.
Modified script :
function crypt(ticker) {
var url = "https://api.etherdelta.com/returnTicker";
var response = UrlFetchApp.fetch(url);
var text = response.getContentText();
var json = JSON.parse(text);
var price = json["ETH_" + ticker].bid; // Modified
return parseFloat(price); // Modified
}
This modified script retrieves the value of bid for each ticker by putting =crypt("PPT") to a cell in the spreadsheet.
Note :
It seems that an error response is sometimes returned from the URL.
If I misunderstand your question, I'm sorry.