How to get all pages of a Google site with many pages

I have a Google site and I use Google Apps Script to get all the pages of the site and export their data to JSON format.
I use the getAllDescendants function with a code similar to this:
function getAllSitePages(site) {
var result = [], i = 0;
while(true) {
var pages = site.getAllDescendants({start: i});
if(!pages || pages.length == 0) break;
result = result.concat(pages);
i += pages.length;
};
return result;
}
But this only gets me the first 891 (?!) pages. If my site has around 1000 pages, is there a way to get all of them with the Sites Service?

For now, I was able to bypass the problem by using the getChildren function instead (as I currently don't have any page, including the root, that has more than 800 direct children):
function getAllSitePages(root, result) {
result = result || []
var start = 0;
while (true) {
var pages = root.getChildren({ start });
if (!pages || pages.length == 0) break;
result.push(...pages);
pages.forEach(page => getAllSitePages(page, result));
start += pages.length;
};
return result;
}
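The workaround above mixes two concerns: paging through one level by offset, and walking the tree. A minimal sketch that separates them is shown below. The helper names collectAllByOffset and collectDescendants are hypothetical, and the only assumption about a page object is that it exposes getChildren({start}) as used in the question. It does not lift whatever limit getAllDescendants runs into; it only makes the per-level paging explicit.

```javascript
// Hypothetical helper: collects every item from an offset-paged source.
// fetchPage(start) must return an array (empty or null once exhausted).
function collectAllByOffset(fetchPage) {
  var result = [];
  var start = 0;
  while (true) {
    var page = fetchPage(start);
    if (!page || page.length === 0) break;
    result = result.concat(page);
    start += page.length;
  }
  return result;
}

// Walks the tree depth-first, paging through the direct children of each node.
// Assumes each node offers getChildren({start: n}) like the pages in the question.
function collectDescendants(node) {
  var children = collectAllByOffset(function (start) {
    return node.getChildren({ start: start });
  });
  var all = [];
  children.forEach(function (child) {
    all.push(child);
    all = all.concat(collectDescendants(child));
  });
  return all;
}
```

With a site object this would be called as collectDescendants(site), provided the site itself supports getChildren in the same way as a page.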

Related

Retrieve all my subscriptions in Google Sheets

I'm using YouTube Data API v3 and Google Apps Script to retrieve all my subscriptions.
The problem I'm facing is that, using the following code, the response brings duplicated channels:
do {
const mySubsResponse = YouTube.Subscriptions.list('snippet', {
mine: true,
//channelId: "<MY_CHANNEL_ID>",
maxResults: 50,
fields: "pageInfo(totalResults),nextPageToken,items(snippet(title,resourceId(channelId)))"
});
if (!mySubsResponse || mySubsResponse == undefined) {
Logger.log('No subscriptions found.');
SpreadsheetApp.getUi().alert("No subscriptions found.");
break;
}
// Loop all my subscriptions found in the response:
for (let j = 0; j < mySubsResponse.items.length; j++) {
const mySubItem = mySubsResponse.items[j];
sheet.getRange("H" + incrSub).setValue(mySubItem.snippet.title);
sheet.getRange("I" + incrSub).setValue(mySubItem.snippet.resourceId.channelId);
incrSub++;
}
nextPageToken = mySubsResponse.nextPageToken;
} while (nextPageToken);
I believe this is because each item in the response is actually a video uploaded by a channel I'm subscribed to; I don't think it's a problem with the page token.
In the code above, I've commented out the channelId parameter. I've tested with both mine: true and channelId: <MY_CHANNEL_ID>, and totalResults shows that I have 479 subscriptions, but when I loop over the results, I get duplicated channels.
For example, I'm subscribed to the channel called "Channel_1"; this
channel had uploaded three videos today. The response of the code
above brings me "Channel_1" three times, when it should be only 1 -
because I'm subscribed to "Channel_1" once.
What I want to get is a list of all channels I'm subscribed to.
I've checked the subscriptions:list documentation, but, it's not clear how I can get my subscriptions only.
If the subscriptions:list endpoint is not the correct one for this task, which endpoint enables me to get the desired results (a list of all channels I'm subscribed to)?

After checking more closely (and, I admit, after taking a little break), I finally found the problem and the solution:
The problem is: I wasn't sending the nextPageToken on each iteration, so I was basically requesting the same page over and over without actually paginating.
In this section:
const mySubsResponse = YouTube.Subscriptions.list('snippet', {
mine: true,
//channelId: "<MY_CHANNEL_ID>",
maxResults: 50,
fields: "pageInfo(totalResults),nextPageToken,items(snippet(title,resourceId(channelId)))"
});
It can be seen that pageToken: nextPageToken is not set.
The solution, then, is to modify the code so that it sends the nextPageToken obtained from the previous response.
This is the modified code:
// Call my subscriptions:
/** Token pagination. */
var nextPageToken = "";
/** Row position where to start writing the results. */
var incrSub = 6;
/**
* Get all my subscriptions.
*/
do {
const mySubsResponse = YouTube.Subscriptions.list('snippet', {
channelId: "<MY_CHANNEL_ID>", // also works with "mine: true".
maxResults: 50,
// Here, the first time the call is made, the "nextPageToken" value
// is empty. In every iteration (if "nextPageToken" is retrieved),
// the "nextPageToken" is used - in order to get the next page.
pageToken: nextPageToken,
fields: "nextPageToken,items(snippet(title,resourceId(channelId)))"
});
if (!mySubsResponse || mySubsResponse == undefined) {
Logger.log('No subscriptions found.');
SpreadsheetApp.getUi().alert("No subscriptions found.");
break;
}
// Write the subscriptions returned in the response:
for (let j = 0; j < mySubsResponse.items.length; j++) {
const mySubItem = mySubsResponse.items[j];
sheet.getRange("H" + incrSub).setValue(mySubItem.snippet.title);
sheet.getRange("I" + incrSub).setValue(mySubItem.snippet.resourceId.channelId);
incrSub++;
}
// Check the token: keep looping only while the response carries a
// nextPageToken; when it is missing, the last page has been reached.
nextPageToken = mySubsResponse.nextPageToken;
} while (nextPageToken);
With this modified code, all of my subscriptions are returned successfully.
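The token loop can also be pulled out into a small reusable function, which makes the "same page requested every time" mistake harder to repeat. The sketch below is only an illustration: collectAllByToken and toRows are hypothetical names, and the response shape ({ items, nextPageToken }) is the one used in the code above. The commented usage writes all rows with a single setValues call instead of one setValue per cell, which is an optional change and not part of the original fix.

```javascript
// Hypothetical helper: follows nextPageToken until the API stops returning one.
// listPage(pageToken) must return an object like { items: [...], nextPageToken: '...' }.
function collectAllByToken(listPage) {
  var items = [];
  var pageToken; // undefined on the first request
  do {
    var response = listPage(pageToken);
    if (!response) break;
    items = items.concat(response.items || []);
    pageToken = response.nextPageToken;
  } while (pageToken);
  return items;
}

// Converts subscription items into [title, channelId] rows for the sheet.
function toRows(items) {
  return items.map(function (item) {
    return [item.snippet.title, item.snippet.resourceId.channelId];
  });
}

// Possible usage inside Apps Script (assumes the YouTube advanced service
// and the `sheet` variable from the code above):
//
// var items = collectAllByToken(function (token) {
//   var options = { mine: true, maxResults: 50 };
//   if (token) options.pageToken = token;
//   return YouTube.Subscriptions.list('snippet', options);
// });
// var rows = toRows(items);
// if (rows.length) sheet.getRange(6, 8, rows.length, 2).setValues(rows); // H6 downwards
```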

xpath in apps script?

I made a formula to extract some Wikipedia data in Google Sheets which works fine. Here is the formula:
=regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Geography')]]"))),"\[[^\]]+\]","")&char(10)&char(10)&iferror(regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Education')]]"))),"\[[^\]]+\]",""))
Where D2 is a URL like https://en.wikipedia.org/wiki/Abbeville,_Alabama
This extracts some Geography and Education data from the Wikipedia page. Trouble is that importxml only runs a few times before it dies due to quota.
So I thought maybe better to use Apps Script where there are much higher limits on fetching and parsing. I could not see a good way however of using Xpath in Apps Script. Older posts on the web discuss using a deprecated service called Xml but it seems to no longer work. There is a Service called XmlService which looks like it may do the job but you can't just plug in an Xpath. It looks like a lot of sweating to get to the result. Any solutions out there where you can just plug in Xpath?

Here is an alternative solution I actually do in a case like this.
I have used XmlService, but only for parsing the content, not for evaluating XPath. The approach relies on the element tags and has been pretty consistent in my tests so far, although it might need tweaks when certain tags appear in the result, in which case you would have to add them to the exclusion condition.
I tested the code below with both links:
https://en.wikipedia.org/wiki/Abbeville,_Alabama#Geography
https://en.wikipedia.org/wiki/Montgomery,_Alabama#Education
My test shows that the formula above did not return the proper output for the 2nd link, while the code does (maybe because the output was too long).
Code:
function getGeoAndEdu(path) {
var data = UrlFetchApp.fetch(path).getContentText();
// wikipedia is divided into sections, if output is cut, increase the number
var regex = /.{1,100000}/g;
var results = [];
// flag to determine if matches should be added
var foundFlag = false;
var m;
// loop until the regex runs out of matches (exec returns null)
while ((m = regex.exec(data)) !== null) {
if (foundFlag) {
// if another header is found during generation of data, stop appending the matches
if (matchTag(m[0], "<h2>"))
foundFlag = false;
// exclude tables, sub-headers and divs containing image description
else if(matchTag(m[0], "<div") || matchTag(m[0], "<h3") ||
matchTag(m[0], "<td") || matchTag(m[0], "<th"))
continue;
else
results.push(m[0]);
}
// start capturing if either IDs are found
if (matchTag(m[0], "id=\"Geography\"") ||
matchTag(m[0], "id=\"Education\"")) {
foundFlag = true;
}
}
var output = results.map(function (str) {
// clean tags for XmlService
str = str.replace(/<[^>]*>/g, '').trim();
var decode = XmlService.parse('<d>' + str + '</d>');
// convert html entity codes (e.g. &nbsp;) to text
return decode.getRootElement().getText();
// filter blank results due to cleaning and empty sections
// separate data and remove citations before returning output
}).filter(result => result.trim().length > 1).join("\n").replace(/\[\d+\]/g, '');
return output;
}
// check if tag is found in string
function matchTag(string, tag) {
var regex = RegExp(tag);
return string.match(regex) && string.match(regex)[0] == tag;
}
Output and difference: (screenshots omitted: formula ending output, script ending output, and the ending of the Education section on Wikipedia)
Note:
UrlFetchApp is still subject to quotas, but they should be more generous than IMPORTXML's limit, depending on the type of your account.
Reference:
Apps Script Quotas

Sorry, I got very busy this week so I didn't reply. I took a look at your answer, which seems to work fine, but it was quite code-heavy. I wanted something I would understand, so I coded my own solution. Not that mine is any simpler; it's just my own code, so it's easier for me to follow:
function getTextBetweenTags(html, paramatersInFirstTag, paramatersInLastTag) { //finds text values between 2 tags and removes internal tags to leave plain text.
//eg getTextBetweenTags(html,[['class="mw-headline"'],['id="Geography"']],[['class="wikitable mw-collapsible mw-made-collapsible"']])
// **Note: you may want to replace &#number; with ascII number
var openingTagPos = null;
var closingTagPos = null;
var previousChar = '';
var readingTag = false;
var newTag = '';
var tagEnd = false;
var regexFirstTagParams = [];
var regexLastTagParams = [];
//prepare regexes to test for parameters in opening and closing tags. put regexes in arrays so each condition can be tested separately
for (var i in paramatersInFirstTag) {
regexFirstTagParams.push(new RegExp(escapeRegex(paramatersInFirstTag[i][0])))
}
for (var i in paramatersInLastTag) {
regexLastTagParams.push(new RegExp(escapeRegex(paramatersInLastTag[i][0])))
}
var startTagIndex = null;
var endTagIndex = null;
var matches = 0;
for (var i = 0; i < html.length - 1; i++) {
var nextChar = html.substr(i, 1);
if (nextChar == '<' && previousChar != '\\') {
readingTag = true;
}
if (nextChar == '>' && previousChar != '\\') { //if end of tag found, check tag matches start or end tag
readingTag = false;
newTag += nextChar;
//test for firstTag
if (startTagIndex == null) {
var alltestsPass = true;
for (var j in regexFirstTagParams) {
if (!regexFirstTagParams[j].test(newTag)) alltestsPass = false;
}
if (alltestsPass) {
startTagIndex = i + 1;
//console.log('Start Tag',startTagIndex)
matches++;
}
}
//test for lastTag
else if (startTagIndex != null) {
var alltestsPass = true;
for (var j in regexLastTagParams) {
if (!regexLastTagParams[j].test(newTag)) alltestsPass = false;
}
if (alltestsPass) {
endTagIndex = i + 1;
matches++;
}
}
if(startTagIndex && endTagIndex) break;
newTag = '';
}
if (readingTag) newTag += nextChar;
previousChar = nextChar;
}
if (matches < 2) return 'No matches';
else return html.substring(startTagIndex, endTagIndex).replace(/<[^>]+>/g, '');
}
function escapeRegex(string) {
if (string == null) return string;
return string.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}
My function requires an array of attributes for the start tag and an array of attributes for the end tag. It gets any text in between and removes any tags found in between. One issue I also noticed was that there were often special characters (e.g. &nbsp;), so they need to be replaced. I did that outside the scope of the function above.
The function could be easily improved to check the tag type (eg h2), but it wasn't necessary for the wikipedia case.
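The entity replacement mentioned above is not shown in this answer. One way it could be done is sketched below. The function name decodeBasicEntities and the short table of named entities are assumptions for illustration, not the author's code; named entities outside the table are left untouched.

```javascript
// Hypothetical helper: decodes numeric entities (&#160;, &#x41;) and a few
// common named ones. Unknown named entities are returned unchanged.
function decodeBasicEntities(text) {
  var named = { nbsp: ' ', amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };
  return text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, function (match, body) {
    if (body.charAt(0) === '#') {
      var isHex = body.charAt(1).toLowerCase() === 'x';
      var code = parseInt(body.substring(isHex ? 2 : 1), isHex ? 16 : 10);
      // leave out-of-range code points as they are instead of throwing
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(named, body) ? named[body] : match;
  });
}
```

Note that this sketch maps &nbsp; to a regular space, which is usually what is wanted for plain text in a spreadsheet cell; the replacement runs in a single pass, so text such as &amp;lt; becomes &lt; rather than being decoded twice.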
Here is a function where I called the above function. The html variable is just the result of UrlFetchApp.fetch('some wikipedia city url').getContentText();
function getWikiTexts(html) {
var geography = getTextBetweenTags(html, [['class="mw-headline"'], ['id="Geography']], [['class="mw-headline"']]);
var economy = getTextBetweenTags(html, [['class="mw-headline"'], ['id="Economy']], [['class="mw-headline"']]);
var education = getTextBetweenTags(html, [['class="mw-headline"'], ['id="Education']], [['class="mw-headline"']]);
var returnString = '';
if (geography != 'No matches' && !/Wikipedia/.test(geography)) returnString += geography + '\n';
if (economy != 'No matches' && !/Wikipedia/.test(economy)) returnString += economy + '\n';
if (education != 'No matches' && !/Wikipedia/.test(education)) returnString += education + '\n';
return returnString
}
Thanks for posting your answer.

Can Google apps script be used to randomize page order on Google forms?

Update #2: Okay, I'm pretty sure my error in update #1 was because of indexing out of bounds over the array (I'm still not used to JS indexing at 0). But here is the new problem... if I write out the different combinations of the loop manually, setting the page index to 1 in moveItem() like so:
newForm.moveItem(itemsArray[0][0], 1);
newForm.moveItem(itemsArray[0][1], 1);
newForm.moveItem(itemsArray[0][2], 1);
newForm.moveItem(itemsArray[1][0], 1);
newForm.moveItem(itemsArray[1][1], 1);
newForm.moveItem(itemsArray[1][2], 1);
newForm.moveItem(itemsArray[2][0], 1);
...
...I don't get any errors but the items end up on different pages! What is going on?
Update #1: Using Sandy Good's answer as well as a script I found at this WordPress blog, I have managed to get closer to what I needed. I believe Sandy Good misinterpreted what I wanted to do because I wasn't specific enough in my question.
I would like to:
1. Get all items from a page (section header, images, question etc)
2. Put them into an array
3. Do this for all pages, adding these arrays to an array (i.e: [[all items from page 1][all items from page 2][all items from page 3]...])
4. Shuffle the elements of this array
5. Repopulate a new form with each element of this array. In this way, page order will be randomized.
My JavaScript skills are poor (this is the first time I've used it). There is a step that produces null entries and I don't know why... I had to remove them manually. I am not able to complete step 5 as I get the following error:
Cannot convert Item,Item,Item to (class).
"Item,Item,Item" is the array element containing all the items from a particular page. So it seems that I can't add three items to a page at a time? Or is something else going on here?
Here is my code:
function shuffleForms() {
var itemsArray,shuffleQuestionsInNewForm,fncGetQuestionID,
newFormFile,newForm,newID,shuffle, sections;
// Copy template form by ID, set a new name
newFormFile = DriveApp.getFileById('1prfcl-RhaD4gn0b2oP4sbcKaRcZT5XoCAQCbLm1PR7I')
.makeCopy();
newFormFile.setName('AAAAA_Shuffled_Form');
// Get ID of new form and open it
newID = newFormFile.getId();
newForm = FormApp.openById(newID);
// Initialize array to put IDs in
itemsArray = [];
function getPageItems(thisPageNum) {
Logger.log("Getting items for page number: " + thisPageNum );
var thisPageItems = []; // Used for result
var thisPageBreakIndex = getPageItem(thisPageNum).getIndex();
Logger.log( "This is index num : " + thisPageBreakIndex );
// Get all items from page
var allItems = newForm.getItems();
thisPageItems.push(allItems[thisPageBreakIndex]);
Logger.log( "Added pagebreak item: " + allItems[thisPageBreakIndex].getIndex() );
for( var i = thisPageBreakIndex+1; ( i < allItems.length ) && ( allItems[i].getType() != FormApp.ItemType.PAGE_BREAK ); ++i ) {
thisPageItems.push(allItems[i]);
Logger.log( "Added non-pagebreak item: " + allItems[i].getIndex() );
}
return thisPageItems;
}
function shuffle(array) {
var currentIndex = array.length, temporaryValue, randomIndex;
Logger.log('shuffle ran')
// While there remain elements to shuffle...
while (0 !== currentIndex) {
// Pick a remaining element...
randomIndex = Math.floor(Math.random() * currentIndex);
currentIndex -= 1;
// And swap it with the current element.
temporaryValue = array[currentIndex];
array[currentIndex] = array[randomIndex];
array[randomIndex] = temporaryValue;
}
return array;
}
function shuffleAndMove() {
// Get page items for all pages into an array
for(i = 2; i <= 5; i++) {
itemsArray[i] = getPageItems(i);
}
// Removes null values from array
itemsArray = itemsArray.filter(function(x){return x});
// Shuffle page items
itemsArray = shuffle(itemsArray);
// Move page items to the new form
for(i = 2; i <= 5; ++i) {
newForm.moveItem(itemsArray[i], i);
}
}
shuffleAndMove();
}
Original post: I have used Google forms to create a questionnaire. For my purposes, each question needs to be on a separate page but I need the pages to be randomized. A quick Google search shows this feature has not been added yet.
I see that the Form class in Google Apps Script has a number of methods that alter or give access to various properties of Google Forms. Since I do not know JavaScript and am not too familiar with the Google Apps APIs, I would like to know if what I am trying to do is even possible before diving in and figuring it all out.
If it is possible, I would appreciate any insight on what methods would be relevant for this task just to give me some direction to get started.
Based on comments from Sandy Good and two SE questions found here and here, this is the code I have so far:
// Script to shuffle question in a Google Form when the questions are in separate sections
function shuffleFormSections() {
getQuestionID();
createNewShuffledForm();
}
// Get question IDs
function getQuestionID() {
var form = FormApp.getActiveForm();
var items = form.getItems();
arrayID = [];
for (var i in items) {
arrayID[i] = items[i].getId();
}
// Logger.log(arrayID);
return(arrayID);
}
// Shuffle function
function shuffle(a) {
var j, x, i;
for (i = a.length; i; i--) {
j = Math.floor(Math.random() * i);
x = a[i - 1];
a[i - 1] = a[j];
a[j] = x;
}
}
// Shuffle IDs and create new form with new question order
function createNewShuffledForm() {
shuffle(arrayID);
// Logger.log(arrayID);
var newForm = FormApp.create('Shuffled Form');
for (var i in arrayID) {
arrayID[i].getItemsbyId();
}
}

Try this. There are a few "constants" to be set at the top of the function; check the comments. Form file copying and opening borrowed from Sandy Good's answer, thanks!
// This is the function to run, all the others here are helper functions
// You'll need to set your source file id and your destination file name in the
// constants at the top of this function here.
// It appears that the "Title" page does not count as a page, so you don't need
// to include it in the PAGES_AT_BEGINNING_TO_NOT_SHUFFLE count.
function shuffleFormPages() {
// UPDATE THESE CONSTANTS AS NEEDED
var PAGES_AT_BEGINNING_TO_NOT_SHUFFLE = 2; // preserve X intro pages; shuffle everything after page X
var SOURCE_FILE_ID = 'YOUR_SOURCE_FILE_ID_HERE';
var DESTINATION_FILE_NAME = 'YOUR_DESTINATION_FILE_NAME_HERE';
// Copy template form by ID, set a new name
var newFormFile = DriveApp.getFileById(SOURCE_FILE_ID).makeCopy();
newFormFile.setName(DESTINATION_FILE_NAME);
// Open the duplicated form file as a form
var newForm = FormApp.openById(newFormFile.getId());
var pages = extractPages(newForm);
shuffleEndOfPages(pages, PAGES_AT_BEGINNING_TO_NOT_SHUFFLE);
var shuffledFormItems = flatten(pages);
setFormItems(newForm, shuffledFormItems);
}
// Builds an array of "page" arrays. Each page array starts with a page break
// and continues until the next page break.
function extractPages(form) {
var formItems = form.getItems();
var currentPage = [];
var allPages = [];
formItems.forEach(function(item) {
if (item.getType() == FormApp.ItemType.PAGE_BREAK && currentPage.length > 0) {
// found a page break (and it isn't the first one)
allPages.push(currentPage); // push what we've built for this page onto the output array
currentPage = [item]; // reset the current page to just this most recent item
} else {
currentPage.push(item);
}
});
// We've got the last page dangling, so add it
allPages.push(currentPage);
return allPages;
};
// startIndex is the array index to start shuffling from. E.g. to start
// shuffling on page 5, startIndex should be 4. startIndex could also be thought
// of as the number of pages to keep unshuffled.
// This function has no return value, it just mutates pages
function shuffleEndOfPages(pages, startIndex) {
var currentIndex = pages.length;
// While there remain elements to shuffle...
while (currentIndex > startIndex) {
// Pick an element between startIndex (inclusive) and currentIndex (exclusive)
var randomIndex = Math.floor(Math.random() * (currentIndex - startIndex)) + startIndex;
currentIndex -= 1;
// And swap it with the current element.
var temporaryValue = pages[currentIndex];
pages[currentIndex] = pages[randomIndex];
pages[randomIndex] = temporaryValue;
}
};
// Sourced from elsewhere on SO:
// https://stackoverflow.com/a/15030117/4280232
function flatten(array) {
return array.reduce(
function (flattenedArray, toFlatten) {
return flattenedArray.concat(Array.isArray(toFlatten) ? flatten(toFlatten) : toFlatten);
},
[]
);
};
// No safety checks around items being the same as the form length or whatever.
// This mutates form.
function setFormItems(form, items) {
items.forEach(function(item, index) {
form.moveItem(item, index);
});
};
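To check that the intro pages really stay in place, the tail shuffle can be exercised on plain arrays before running it against a form. The sketch below follows the same swap logic as shuffleEndOfPages, but takes the random source as a parameter; that parameter and the name shuffleTail are additions made here purely so the result can be verified deterministically.

```javascript
// Same idea as shuffleEndOfPages, with an injectable random source
// (an assumption added here for testing; defaults to Math.random).
function shuffleTail(array, startIndex, random) {
  random = random || Math.random;
  for (var current = array.length; current > startIndex; current--) {
    // pick an index in [startIndex, current)
    var pick = Math.floor(random() * (current - startIndex)) + startIndex;
    var tmp = array[current - 1];
    array[current - 1] = array[pick];
    array[pick] = tmp;
  }
  return array;
}

// Example with plain strings standing in for page arrays:
var demoPages = ['intro', 'consent', 'q1', 'q2', 'q3'];
shuffleTail(demoPages, 2);
// demoPages[0] and demoPages[1] are still 'intro' and 'consent';
// only the last three entries may have changed places.
```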

I tested this code. It created a new Form, and then shuffled the questions in the new Form. It excludes page breaks, images and section headers. You need to provide a source file ID for the original template Form. This function has 3 inner sub-functions. The inner functions are at the top, and they are called at the bottom of the outer function. The arrayOfIDs variable does not need to be returned or passed to another function because it is available in the outer scope.
function shuffleFormSections() {
var arrayOfIDs,shuffleQuestionsInNewForm,fncGetQuestionID,
newFormFile,newForm,newID,items,shuffle;
newFormFile = DriveApp.getFileById('Put the source file ID here')
.makeCopy();
newFormFile.setName('AAAAA_Shuffled_Form');
newID = newFormFile.getId();
newForm = FormApp.openById(newID);
arrayOfIDs = [];
fncGetQuestionID = function() {
var i,L,thisID,thisItem,thisType;
items = newForm.getItems();
L = items.length;
for (i=0;i<L;i++) {
thisItem = items[i];
thisType = thisItem.getType();
if (thisType === FormApp.ItemType.PAGE_BREAK ||
thisType === FormApp.ItemType.SECTION_HEADER ||
thisType === FormApp.ItemType.IMAGE) {
continue;
}
thisID = thisItem.getId();
arrayOfIDs.push(thisID);
}
Logger.log('arrayOfIDs: ' + arrayOfIDs);
//the array arrayOfIDs does not need to be returned since it is available
//in the outermost scope
}// End of fncGetQuestionID function
shuffle = function() {// Shuffle function
var j, x, i;
Logger.log('shuffle ran')
for (i = arrayOfIDs.length; i; i--) {
j = Math.floor(Math.random() * i);
Logger.log('j: ' + j)
x = arrayOfIDs[i - 1];
Logger.log('x: ' + x)
arrayOfIDs[i - 1] = arrayOfIDs[j];
arrayOfIDs[j] = x;
}
Logger.log('arrayOfIDs: ' + arrayOfIDs)
}
shuffleQuestionsInNewForm = function() {
var i,L,thisID,thisItem,thisQuestion,questionType;
L = arrayOfIDs.length;
for (i=0;i<L;i++) {
thisID = arrayOfIDs[i];
Logger.log('thisID: ' + thisID)
thisItem = newForm.getItemById(thisID);
newForm.moveItem(thisItem, i)
}
}
fncGetQuestionID();//Get all the question ID's and put them into an array
shuffle();
shuffleQuestionsInNewForm();
}

Scrape site to report css selector occurrence in HTML

I want to see how much of my team's code has been integrated into a large scale site.
I believe I can achieve this (albeit roughly), by getting statistics on the number of occurrences certain CSS selectors appear across all the HTML pages. I have some unique CSS class selectors that I would like to use when scraping the site to analyze:
On how many pages the selector occurs.
On any page it does, how many times.
I've looked around but can't find any tools. Does anyone know of any, or could you suggest any ideas that may help me achieve this quickly?
Thanks in advance.

Thanks to everyone for their advice.
In the end I decided that there was no single tool that could gather the statistics in the way I described, so I started building the application I needed in Node. Although I had not used Node before, I found it quick to grasp with an intermediate knowledge of JavaScript.
For anyone looking to do the same:
I've used Simplecrawler to crawl the site and Cheerio to find the selectors, and from this I create a simple report in JSON, written out with fs.

I'd recommend using Google Apps Script. You can crawl the site's pages and count the CSS selector occurrences with regex. Modify the following code to search each page for the CSS selector. The code explanation is here.
Code
function onOpen() {
DocumentApp.getUi() // Or SpreadsheetApp or FormApp.
.createMenu('New scrape web docs')
.addItem('Enter Url', 'showPrompt')
.addToUi();
}
function showPrompt() {
var ui = DocumentApp.getUi();
var result = ui.prompt(
'Scrape whole website into text!',
'Please enter website url (with http(s)://):',
ui.ButtonSet.OK_CANCEL);
// Process the user's response.
var button = result.getSelectedButton();
var url = result.getResponseText();
var links=[];
var base_url = url;
if (button == ui.Button.OK) { // User clicked "OK".
if(!isValidURL(url))
{
ui.alert('Your url is not valid.');
}
else {
// gather initial links
var inner_links_arr = scrapeAndPaste(url, 1); // first run and clear the document
links = links.concat(inner_links_arr); // append an array to all the links
var new_links=[]; // array for new links
var processed_urls =[url]; // processed links
var link, current;
while (links.length)
{
link = links.shift(); // get the leftmost link (inner url)
processed_urls.push(link);
current = base_url + link;
new_links = scrapeAndPaste(current, 0); // second and consecutive runs we do not clear up the document
//ui.alert('Processed... ' + current + '\nReturned links: ' + new_links.join('\n') );
// add new links into links array (stack) if appropriate
for (var i in new_links){
var item = new_links[i];
if (links.indexOf(item) === -1 && processed_urls.indexOf(item) === -1)
links.push(item);
}
/* // alert message for debugging
ui.alert('Links in stack: ' + links.join(' ')
+ '\nTotal links in stack: ' + links.length
+ '\nProcessed: ' + processed_urls.join(' ')
+ '\nTotal processed: ' + processed_urls.length);
*/
}
}
}
}
function scrapeAndPaste(url, clear) {
var text;
try {
var html = UrlFetchApp.fetch(url).getContentText();
// some html pre-processing
if (html.indexOf('</head>') !== -1 ){
html = html.split('</head>')[1];
}
if (html.indexOf('</body>') !== -1 ){ // thus we split the body only
html = html.split('</body>')[0] + '</body>';
}
// fetch inner links
var inner_links_arr= [];
var linkRegExp = /href="(.*?)"/gi; // regex expression object
var match = linkRegExp.exec(html);
while (match != null) {
// matched text: match[0]
if (match[1].indexOf('#') !== 0
&& match[1].indexOf('http') !== 0
//&& match[1].indexOf('https://') !== 0
&& match[1].indexOf('mailto:') !== 0
&& match[1].indexOf('.pdf') === -1 ) {
inner_links_arr.push(match[1]);
}
// match start: match.index
// capturing group n: match[n]
match = linkRegExp.exec(html);
}
text = getTextFromHtml(html);
outputText(url, text, clear); // output text into the current document with given url
return inner_links_arr; //we return all inner links of this doc as array
} catch (e) {
MailApp.sendEmail(Session.getActiveUser().getEmail(), "Scrape error report at "
+ Utilities.formatDate(new Date(), "GMT", "yyyy-MM-dd HH:mm:ss"),
"\r\nMessage: " + e.message
+ "\r\nFile: " + e.fileName+ '.gs'
+ "\r\nWeb page under scrape: " + url
+ "\r\nLine: " + e.lineNumber);
outputText(url, 'Scrape error for this page cause of malformed html!', clear);
}
}
function getTextFromHtml(html) {
// Note: Xml is the legacy Apps Script XML service, which has since been deprecated
return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
switch(x.toString()) {
case 'XmlText': return x.toXmlString();
case 'XmlElement': return x.getNodes().map(getTextFromNode).join(' ');
default: return '';
}
}
function outputText(url, text, clear){
var body = DocumentApp.getActiveDocument().getBody();
if (clear){
body.clear();
}
else {
body.appendHorizontalRule();
}
var section = body.appendParagraph(' * ' + url);
section.setHeading(DocumentApp.ParagraphHeading.HEADING2);
body.appendParagraph(text);
}
function isValidURL(url){
var RegExp = /^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/;
if(RegExp.test(url)){
return true;
}else{
return false;
}
}
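The crawler above only extracts text; the counting step suggested in this answer is left to the reader. A rough sketch of that step follows, assuming the pages have already been fetched into a url-to-html map. The names countClassOccurrences and buildClassReport are hypothetical. Being regex based, it only looks at quoted class attributes and can miscount markup that appears inside comments or scripts, so treat the numbers as an estimate.

```javascript
// Hypothetical helper: counts elements whose quoted class attribute contains
// className as a whole class token (so "team-card-title" does not match "team-card").
function countClassOccurrences(html, className) {
  var count = 0;
  var classAttr = /\sclass\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  var match;
  while ((match = classAttr.exec(html)) !== null) {
    var value = match[1] !== undefined ? match[1] : match[2];
    if (value.trim().split(/\s+/).indexOf(className) !== -1) count++;
  }
  return count;
}

// Answers both questions at once: on how many pages the class occurs,
// and how many times on each of those pages.
function buildClassReport(pagesByUrl, className) {
  var report = { pagesWithSelector: 0, totalOccurrences: 0, perPage: {} };
  Object.keys(pagesByUrl).forEach(function (url) {
    var n = countClassOccurrences(pagesByUrl[url], className);
    if (n > 0) {
      report.pagesWithSelector++;
      report.totalOccurrences += n;
      report.perPage[url] = n;
    }
  });
  return report;
}
```

In the crawler above, the html string fetched in scrapeAndPaste could be stored per url and passed to buildClassReport once the crawl finishes.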

Export multiple html tables to Excel

I've scavenged the inter web for answers and though I found some, they were mostly incomplete or not working.
What I'm trying to do is: I have an info page which displays information about a customer or server (or something else). This information is displayed in a table, sometimes multiple tables (I sometimes create my own table for some of the data and use Html.Grid(Model.list) to create tables for the rest of the data stored in lists, all on 1 page).
I found this website, which is awesome: http://www.excelmashup.com/ and does exactly what I want for 1 table, though I need this for multiple tables (they must all be in the same Excel file). I know I can create multiple files (1 for each table) but this is not the desired output.
So I kept on searching and I found a post on stackoverflow: Export multiple HTML tables to Excel with JavaScript function
This seemed promising so I tried using it but the code had some minor errors which I tried to fix:
var tableToExcel = (function () {
var uri = 'data:application/vnd.ms-excel;base64,'
, template = '<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"><head><!--[if gte mso 9]><xml><x:ExcelWorkbook><x:ExcelWorksheets><x:ExcelWorksheet><x:Name>{worksheet}</x:Name><x:WorksheetOptions><x:DisplayGridlines/></x:WorksheetOptions></x:ExcelWorksheet></x:ExcelWorksheets></x:ExcelWorkbook></xml><![endif]--></head><body><table>{table}</table></body></html>'
, base64 = function (s) { return window.btoa(unescape(encodeURIComponent(s))) }
, format = function (s, c) { return s.replace(/{(\w+)}/g, function (m, p) { return c[p]; }) }
return function (table, name) {
if (!table.nodeType) table = document.getElementById(table)
var ctx = { worksheet: name || 'Worksheet', table: table.innerHTML }
window.location.href = uri + base64(format(template, ctx))
}
})()
The button I use to trigger it:
<input type="button" onclick="tableToExcel('InformatieTable', 'W3C Example Table')" value="Export to Excel">
but alas to no avail (I did not know what to do with the if (!table.nodeType) table = table line so I just commented it since it seemed to do nothing special).
Now I get an error, or well not really an error but this is what it says when I try to run this code:
Resource interpreted as Document but transferred with MIME type application/vnd.ms-excel: "data:application/vnd.ms-excel;base64,PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbW…JzZXQ9VVRGLTgiLz48L2hlYWQ+PGJvZHk+PHRhYmxlPjwvdGFibGU+PC9ib2R5PjwvaHRtbD4=".
And I get an Excel file as a download in my browser, but when I try to open it I get a warning that the content and file extension do not match, asking whether I would still like to open it. If I click OK, it opens an empty Excel sheet and that's it.
I am currently trying to fix that error, though I don't think it will make any difference to the content of the Excel file.
Is there anyone who can help me fix this, or provide another way of doing this?
I would prefer it to run client side (so jQuery/JavaScript) instead of server side to minimize server load.
EDIT
I've found a better example of the jQuery (one that does work) on http://www.codeproject.com/Tips/755203/Export-HTML-table-to-Excel-With-CSS
This converts 1 table into an Excel file, which is obviously not good enough. But now I have the code to do this, so I should be able to adapt it to loop through all tables on the web page.
Also updated the code in this example to the correct version I'm using now.
I also still get the same warning, yet when I click OK while opening the Excel file it does show me the content of the table, so I'm just ignoring that for now. If anyone has a solution for this, please share.

Thanks to @Axel Richter I got my answer; he referred me to the following question.
I have adapted the code a bit so that it takes all the tables on the web page. It now looks like this:
<script type="text/javascript">
var tablesToExcel = (function () {
var uri = 'data:application/vnd.ms-excel;base64,'
, tmplWorkbookXML = '<?xml version="1.0"?><?mso-application progid="Excel.Sheet"?><Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
+ '<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"><Author>Axel Richter</Author><Created>{created}</Created></DocumentProperties>'
+ '<Styles>'
+ '<Style ss:ID="Currency"><NumberFormat ss:Format="Currency"></NumberFormat></Style>'
+ '<Style ss:ID="Date"><NumberFormat ss:Format="Medium Date"></NumberFormat></Style>'
+ '</Styles>'
+ '{worksheets}</Workbook>'
, tmplWorksheetXML = '<Worksheet ss:Name="{nameWS}"><Table>{rows}</Table></Worksheet>'
, tmplCellXML = '<Cell{attributeStyleID}{attributeFormula}><Data ss:Type="{nameType}">{data}</Data></Cell>'
, base64 = function (s) { return window.btoa(unescape(encodeURIComponent(s))) }
, format = function (s, c) { return s.replace(/{(\w+)}/g, function (m, p) { return c[p]; }) }
return function (wsnames, wbname, appname) {
var ctx = "";
var workbookXML = "";
var worksheetsXML = "";
var rowsXML = "";
var tables = $('table');
for (var i = 0; i < tables.length; i++) {
for (var j = 0; j < tables[i].rows.length; j++) {
rowsXML += '<Row>'
for (var k = 0; k < tables[i].rows[j].cells.length; k++) {
var dataType = tables[i].rows[j].cells[k].getAttribute("data-type");
var dataStyle = tables[i].rows[j].cells[k].getAttribute("data-style");
var dataValue = tables[i].rows[j].cells[k].getAttribute("data-value");
dataValue = (dataValue) ? dataValue : tables[i].rows[j].cells[k].innerHTML;
var dataFormula = tables[i].rows[j].cells[k].getAttribute("data-formula");
dataFormula = (dataFormula) ? dataFormula : (appname == 'Calc' && dataType == 'DateTime') ? dataValue : null;
ctx = {
attributeStyleID: (dataStyle == 'Currency' || dataStyle == 'Date') ? ' ss:StyleID="' + dataStyle + '"' : ''
, nameType: (dataType == 'Number' || dataType == 'DateTime' || dataType == 'Boolean' || dataType == 'Error') ? dataType : 'String'
, data: (dataFormula) ? '' : dataValue.replace(/<br>/g, '') // strip every <br>, not only the first
, attributeFormula: (dataFormula) ? ' ss:Formula="' + dataFormula + '"' : ''
};
rowsXML += format(tmplCellXML, ctx);
}
rowsXML += '</Row>'
}
ctx = { rows: rowsXML, nameWS: wsnames[i] || 'Sheet' + i };
worksheetsXML += format(tmplWorksheetXML, ctx);
rowsXML = "";
}
ctx = { created: (new Date()).getTime(), worksheets: worksheetsXML };
workbookXML = format(tmplWorkbookXML, ctx);
console.log(workbookXML);
var link = document.createElement("A");
link.href = uri + base64(workbookXML);
link.download = wbname || 'Workbook.xls';
link.target = '_blank';
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
}
})();
</script>
So now, whenever I want a page to have an option to export to Excel, I add a reference to that script and add the following button to my page:
<button onclick="tablesToExcel(['ServerInformatie', 'Relaties'], 'VirtueleMachineInfo.xls', 'Excel')">Export to Excel</button>
So the call has this form:
tablesToExcel(WorksheetNames, fileName, 'Excel')
Here WorksheetNames is an array that should contain at least as many names as there are tables on the page; any table without a name falls back to 'Sheet' plus its index. You could of course choose to create the worksheet names in a different way.
fileName is the name of the file you'll be downloading.
Not having it all in one worksheet is a shame, but this will do for now.
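As one possible "different way" of creating the worksheet names, they could be derived from the tables themselves. The sketch below is only an illustration: worksheetNamesFor is a hypothetical helper, and it assumes each table has (or lacks) an id. It replaces the characters Excel does not allow in sheet names and trims to Excel's 31-character limit; it does not check for duplicate names.

```javascript
// Hypothetical helper: derive one worksheet name per table from its id,
// falling back to 'Sheet' + index as the script above does for missing names.
function worksheetNamesFor(tables) {
  return tables.map(function (table, i) {
    var name = table.id || 'Sheet' + i;
    // Excel rejects : \ / ? * [ ] in sheet names and limits them to 31 characters.
    return name.replace(/[:\\\/?*\[\]]/g, '_').slice(0, 31);
  });
}

// Plain objects stand in for <table> elements so the sketch runs anywhere.
console.log(worksheetNamesFor([{ id: 'Server/Info' }, { id: '' }]));

// In the browser this could be combined with the function above, e.g.:
// tablesToExcel(worksheetNamesFor($('table').toArray()), 'VirtueleMachineInfo.xls', 'Excel')
```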

Here is the code that I used to put multiple HTML tables in the same Excel sheet:
import TableExport from 'tableexport';
const tbOptions = {
formats: ["xlsx"], // (String[]), filetype(s) for the export, (default: ['xlsx', 'csv', 'txt'])
bootstrap: true, // (Boolean), style buttons using bootstrap, (default: true)
exportButtons: false, // (Boolean), automatically generate the built-in export buttons for each of the specified formats (default: true)
position: "bottom", // (top, bottom), position of the caption element relative to table, (default: 'bottom')
}
const DowlandExcel = (key) => {
const table = TableExport(document.getElementById(key), tbOptions);
var exportData = table.getExportData();
var xlsxData = exportData[key].xlsx;
console.log(xlsxData); // Replace with the kind of file you want from the exportData
table.export2file(xlsxData.data, xlsxData.mimeType, xlsxData.filename, xlsxData.fileExtension, xlsxData.merges, xlsxData.RTL, xlsxData.sheetname)
}
const DowlandExcelMultiTable = (keys) => {
const tables = []
const xlsxDatas = []
keys.forEach(key => {
const selector = document.getElementById(key);
if (selector) {
const table = TableExport(selector, tbOptions);
tables.push(table);
xlsxDatas.push(table.getExportData()[key].xlsx)
}
});
const mergeXlsxData = {
RTL: false,
data: [],
fileExtension: ".xlsx",
filename: 'rapor',
merges: [],
mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
sheetname: "Rapor"
}
for (let i = 0; i < xlsxDatas.length; i++) {
const xlsxData = xlsxDatas[i];
mergeXlsxData.data.push(...xlsxData.data)
xlsxData.merges = xlsxData.merges.map(merge => {
const diff = mergeXlsxData.data.length - xlsxData.data.length;
merge.e.r += diff;
merge.s.r += diff;
return merge
});
mergeXlsxData.merges.push(...xlsxData.merges)
mergeXlsxData.data.push([null]);
}
console.log(mergeXlsxData);
tables[0].export2file(mergeXlsxData.data, mergeXlsxData.mimeType, mergeXlsxData.filename, mergeXlsxData.fileExtension, mergeXlsxData.merges, mergeXlsxData.RTL, mergeXlsxData.sheetname)
}
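The key step in the loop above is the row offset: each table's rows are appended below the rows already collected, so its merged-cell ranges have to be shifted down by the number of rows that precede it. The following sketch isolates that step on plain data, without the TableExport library. stackSheets is a hypothetical helper, and the c (column) fields of the merge ranges are an assumption, since the code above only touches s.r and e.r.

```javascript
// Hypothetical helper mirroring the merge loop above, without mutating its input.
// Each sheet is { data: row[], merges: [{ s: { r, c }, e: { r, c } }] }.
function stackSheets(sheets) {
  const data = [];
  const merges = [];
  for (const sheet of sheets) {
    const offset = data.length; // rows already placed above this table
    data.push(...sheet.data);
    for (const m of sheet.merges) {
      // Shift only the row indices; columns stay where they were.
      merges.push({
        s: { r: m.s.r + offset, c: m.s.c },
        e: { r: m.e.r + offset, c: m.e.c },
      });
    }
    data.push([null]); // blank spacer row after each table
  }
  return { data, merges };
}

const first = { data: [['a', 'b'], ['c', 'd']], merges: [] };
const second = {
  data: [['title', null], ['x', 'y']],
  merges: [{ s: { r: 0, c: 0 }, e: { r: 0, c: 1 } }],
};
const stacked = stackSheets([first, second]);
console.log(stacked.data.length);   // 6: two tables of 2 rows plus 2 spacer rows
console.log(stacked.merges[0].s.r); // 3: shifted past the first table and its spacer
```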
