Export multiple html tables to Excel

Export multiple html tables to Excel - html

I've scavenged the inter web for answers and though I found some, they were mostly incomplete or not working.
What I'm trying to do is: I have a info page which displays information about a customer or server (or something else), this information is displayed in a table, sometimes multiple tables (I sometimes create my own table for some of the data and use Html.Grid(Model.list) to create tables for the rest of the data stored in lists, all on 1 page).
I found this website which is an awesome: http://www.excelmashup.com/ and does exactly what I want for 1 table, though I need this for multiple tables (they must all be in the same Excel file). I know I can create multiple files (1 for each table) but this is not the desired output.
So I kept on searching and I found a post on stackoverflow: Export multiple HTML tables to Excel with JavaScript function
This seemed promising so I tried using it but the code had some minor errors which I tried to fix:
var tableToExcel = (function () {
var uri = 'data:application/vnd.ms-excel;base64,'
, template = '<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"><head><!--[if gte mso 9]><xml><x:ExcelWorkbook><x:ExcelWorksheets><x:ExcelWorksheet><x:Name>{worksheet}</x:Name><x:WorksheetOptions><x:DisplayGridlines/></x:WorksheetOptions></x:ExcelWorksheet></x:ExcelWorksheets></x:ExcelWorkbook></xml><![endif]--></head><body><table>{table}</table></body></html>'
, base64 = function (s) { return window.btoa(unescape(encodeURIComponent(s))) }
, format = function (s, c) { return s.replace(/{(\w+)}/g, function (m, p) { return c[p]; }) }
return function (table, name) {
if (!table.nodeType) table = document.getElementById(table)
var ctx = { worksheet: name || 'Worksheet', table: table.innerHTML }
window.location.href = uri + base64(format(template, ctx))
}
})()
The button I use to trigger it:
<input type="button" onclick="tableToExcel('InformatieTable', 'W3C Example Table')" value="Export to Excel">
but alas to no avail (I did not know what to do with the if (!table.nodeType) table = table line so I just commented it since it seemed to do nothing special).
Now I get an error, or well not really an error but this is what it says when I try to run this code:
Resource interpreted as Document but transferred with MIME type application/vnd.ms-excel: "data:application/vnd.ms-excel;base64,PGh0bWwgeG1sbnM6bz0idXJuOnNjaGVtYXMtbW…JzZXQ9VVRGLTgiLz48L2hlYWQ+PGJvZHk+PHRhYmxlPjwvdGFibGU+PC9ib2R5PjwvaHRtbD4=".
And I get an Excel file as download in my browser but when I try to open it I get an error about the content and file extension not matching and if I would still like to open it. So if I click ok it opens a empty Excel sheet and that's it.
I am currently trying to fix that error, though i don't think it will make any difference to the content of the Excel file.
Is there anyone that can help me fix this? Or provide an other way of doing this?
I do prefer it to be run client side (so jQuery/java) instead of server side to minimize server load.
EDIT
I've found a better example of the jQuery (one that does work) on http://www.codeproject.com/Tips/755203/Export-HTML-table-to-Excel-With-CSS
This converts 1 table into an excel file which is obviously not good enough. But now I have the code to do this so I should be able to adapt it to loop trough all tables on the web page.
Also updated the code in this example to the correct version I'm using now.
I also still get the same error yet when I click on ok when trying to open the Excel file it does show me the content of the table, so I'm just ignoring that for now. anyone who has a solution for this please share.

Thanks to #Axel Richter I got my answer, he reffered me to the following question
I have adapted the code a bit so it would Take all the tables on the web page so it now looks like this:
<script type="text/javascript">
var tablesToExcel = (function () {
var uri = 'data:application/vnd.ms-excel;base64,'
, tmplWorkbookXML = '<?xml version="1.0"?><?mso-application progid="Excel.Sheet"?><Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
+ '<DocumentProperties xmlns="urn:schemas-microsoft-com:office:office"><Author>Axel Richter</Author><Created>{created}</Created></DocumentProperties>'
+ '<Styles>'
+ '<Style ss:ID="Currency"><NumberFormat ss:Format="Currency"></NumberFormat></Style>'
+ '<Style ss:ID="Date"><NumberFormat ss:Format="Medium Date"></NumberFormat></Style>'
+ '</Styles>'
+ '{worksheets}</Workbook>'
, tmplWorksheetXML = '<Worksheet ss:Name="{nameWS}"><Table>{rows}</Table></Worksheet>'
, tmplCellXML = '<Cell{attributeStyleID}{attributeFormula}><Data ss:Type="{nameType}">{data}</Data></Cell>'
, base64 = function (s) { return window.btoa(unescape(encodeURIComponent(s))) }
, format = function (s, c) { return s.replace(/{(\w+)}/g, function (m, p) { return c[p]; }) }
return function (wsnames, wbname, appname) {
var ctx = "";
var workbookXML = "";
var worksheetsXML = "";
var rowsXML = "";
var tables = $('table');
for (var i = 0; i < tables.length; i++) {
for (var j = 0; j < tables[i].rows.length; j++) {
rowsXML += '<Row>'
for (var k = 0; k < tables[i].rows[j].cells.length; k++) {
var dataType = tables[i].rows[j].cells[k].getAttribute("data-type");
var dataStyle = tables[i].rows[j].cells[k].getAttribute("data-style");
var dataValue = tables[i].rows[j].cells[k].getAttribute("data-value");
dataValue = (dataValue) ? dataValue : tables[i].rows[j].cells[k].innerHTML;
var dataFormula = tables[i].rows[j].cells[k].getAttribute("data-formula");
dataFormula = (dataFormula) ? dataFormula : (appname == 'Calc' && dataType == 'DateTime') ? dataValue : null;
ctx = {
attributeStyleID: (dataStyle == 'Currency' || dataStyle == 'Date') ? ' ss:StyleID="' + dataStyle + '"' : ''
, nameType: (dataType == 'Number' || dataType == 'DateTime' || dataType == 'Boolean' || dataType == 'Error') ? dataType : 'String'
, data: (dataFormula) ? '' : dataValue.replace('<br>', '')
, attributeFormula: (dataFormula) ? ' ss:Formula="' + dataFormula + '"' : ''
};
rowsXML += format(tmplCellXML, ctx);
}
rowsXML += '</Row>'
}
ctx = { rows: rowsXML, nameWS: wsnames[i] || 'Sheet' + i };
worksheetsXML += format(tmplWorksheetXML, ctx);
rowsXML = "";
}
ctx = { created: (new Date()).getTime(), worksheets: worksheetsXML };
workbookXML = format(tmplWorkbookXML, ctx);
console.log(workbookXML);
var link = document.createElement("A");
link.href = uri + base64(workbookXML);
link.download = wbname || 'Workbook.xls';
link.target = '_blank';
document.body.appendChild(link);
link.click();
document.body.removeChild(link);
}
})();
</script>
so now when ever I want a page to have an option to be exported to excel i add a refference to that script and i add the following button to my page:
<button onclick="tablesToExcel(['ServerInformatie', 'Relaties'], 'VirtueleMachineInfo.xls', 'Excel')">Export to Excel</button>
so the method:
tablesToExcel(WorksheetNames, fileName, 'Excel')
Where worksheetNames is an array which needs to contain as much names (or more) as there are tables on the page. You could ofcourse chose to create the worksheet names in a different way.
And where fileName is ofcourse the name of the file you'll be downloading.
Not having it all in 1 worksheet is a shame but at least this will do for now.

Here is the code that I used to put multiple HTML tables in the same Excel sheet:
import TableExport from 'tableexport';
const tbOptions = {
formats: ["xlsx"], // (String[]), filetype(s) for the export, (default: ['xlsx', 'csv', 'txt'])
bootstrap: true, // (Boolean), style buttons using bootstrap, (default: true)
exportButtons: false, // (Boolean), automatically generate the built-in export buttons for each of the specified formats (default: true)
position: "bottom", // (top, bottom), position of the caption element relative to table, (default: 'bottom')
}
DowlandExcel = (key) => {
const table = TableExport(document.getElementById(key), tbOptions);
var exportData = table.getExportData();
var xlsxData = exportData[key].xlsx;
console.log(xlsxData); // Replace with the kind of file you want from the exportData
table.export2file(xlsxData.data, xlsxData.mimeType, xlsxData.filename, xlsxData.fileExtension, xlsxData.merges, xlsxData.RTL, xlsxData.sheetname)
}
DowlandExcelMultiTable = (keys) => {
const tables = []
const xlsxDatas = []
keys.forEach(key => {
const selector = document.getElementById(key);
if (selector) {
const table = TableExport(selector, tbOptions);
tables.push(table);
xlsxDatas.push(table.getExportData()[key].xlsx)
}
});
const mergeXlsxData = {
RTL: false,
data: [],
fileExtension: ".xlsx",
filename: 'rapor',
merges: [],
mimeType: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
sheetname: "Rapor"
}
for (let i = 0; i < xlsxDatas.length; i++) {
const xlsxData = xlsxDatas[i];
mergeXlsxData.data.push(...xlsxData.data)
xlsxData.merges = xlsxData.merges.map(merge => {
const diff = mergeXlsxData.data.length - xlsxData.data.length;
merge.e.r += diff;
merge.s.r += diff;
return merge
});
mergeXlsxData.merges.push(...xlsxData.merges)
mergeXlsxData.data.push([null]);
}
console.log(mergeXlsxData);
tables[0].export2file(mergeXlsxData.data, mergeXlsxData.mimeType, mergeXlsxData.filename, mergeXlsxData.fileExtension, mergeXlsxData.merges, mergeXlsxData.RTL, mergeXlsxData.sheetname)
}

Related

xpath in apps script?

I made a formula to extract some Wikipedia data in Google Seets which works fine. Here is the formula:
=regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Geography')]]"))),"\[[^\]]+\]","")&char(10)&char(10)&iferror(regexreplace(join("",flatten(IMPORTXML(D2,".//p[preceding-sibling::h2[1][contains(., 'Education')]]"))),"\[[^\]]+\]",""))
Where D2 is a URL like https://en.wikipedia.org/wiki/Abbeville,_Alabama
This extracts some Geography and Education data from the Wikipedia page. Trouble is that importxml only runs a few times before it dies due to quota.
So I thought maybe better to use Apps Script where there are much higher limits on fetching and parsing. I could not see a good way however of using Xpath in Apps Script. Older posts on the web discuss using a deprecated service called Xml but it seems to no longer work. There is a Service called XmlService which looks like it may do the job but you can't just plug in an Xpath. It looks like a lot of sweating to get to the result. Any solutions out there where you can just plug in Xpath?

Here is an alternative solution I actually do in a case like this.
I have used XmlService but only for parsing the content, not for using Xpath. This makes use of the element tags and so far pretty consistent on my tests. Although, it might need tweaks when certain tags are in the result and you might have to include them into the exclusion condition.
Tested the code below in both links:
https://en.wikipedia.org/wiki/Abbeville,_Alabama#Geography
https://en.wikipedia.org/wiki/Montgomery,_Alabama#Education
My test shows that the formula above used did not return the proper output from the 2nd link while the code does. (Maybe because it was too long)
Code:
function getGeoAndEdu(path) {
var data = UrlFetchApp.fetch(path).getContentText();
// wikipedia is divided into sections, if output is cut, increase the number
var regex = /.{1,100000}/g;
var results = [];
// flag to determine if matches should be added
var foundFlag = false;
do {
m = regex.exec(data);
if (foundFlag) {
// if another header is found during generation of data, stop appending the matches
if (matchTag(m[0], "<h2>"))
foundFlag = false;
// exclude tables, sub-headers and divs containing image description
else if(matchTag(m[0], "<div") || matchTag(m[0], "<h3") ||
matchTag(m[0], "<td") || matchTag(m[0], "<th"))
continue;
else
results.push(m[0]);
}
// start capturing if either IDs are found
if (m != null && (matchTag(m[0], "id=\"Geography\"") ||
matchTag(m[0], "id=\"Education\""))) {
foundFlag = true;
}
} while (m);
var output = results.map(function (str) {
// clean tags for XmlService
str = str.replace(/<[^>]*>/g, '').trim();
decode = XmlService.parse('<d>' + str + '</d>')
// convert html entity codes (e.g.  ) to text
return decode.getRootElement().getText();
// filter blank results due to cleaning and empty sections
// separate data and remove citations before returning output
}).filter(result => result.trim().length > 1).join("\n").replace(/\[\d+\]/g, '');
return output;
}
// check if tag is found in string
function matchTag(string, tag) {
var regex = RegExp(tag);
return string.match(regex) && string.match(regex)[0] == tag;
}
Output:
Difference:
Formula ending output
Script ending output
Education ending in wikipedia
Note:
You still have quota when using UrlFetchApp but should be better than IMPORTXML's limit depending on the type of your account.
Reference:
Apps Script Quotas

Sorry I got very busy this week so I didn't reply. I took a look at your answer which seems to work fine, but it was quite code heavy. I wanted something I would understand so I coded my own solution. not that mine is any simpler. It's just my own code so it's easier for me to follow:
function getTextBetweenTags(html, paramatersInFirstTag, paramatersInLastTag) { //finds text values between 2 tags and removes internal tags to leave plain text.
//eg getTextBetweenTags(html,[['class="mw-headline"'],['id="Geography"']],[['class="wikitable mw-collapsible mw-made-collapsible"']])
// **Note: you may want to replace &#number; with ascII number
var openingTagPos = null;
var closingTagPos = null;
var previousChar = '';
var readingTag = false;
var newTag = '';
var tagEnd = false;
var regexFirstTagParams = [];
var regexLastTagParams = [];
//prepare regexes to test for parameters in opening and closing tags. put regexes in arrays so each condition can be tested separately
for (var i in paramatersInFirstTag) {
regexFirstTagParams.push(new RegExp(escapeRegex(paramatersInFirstTag[i][0])))
}
for (var i in paramatersInLastTag) {
regexLastTagParams.push(new RegExp(escapeRegex(paramatersInLastTag[i][0])))
}
var startTagIndex = null;
var endTagIndex = null;
var matches = 0;
for (var i = 0; i < html.length - 1; i++) {
var nextChar = html.substr(i, 1);
if (nextChar == '<' && previousChar != '\\') {
readingTag = true;
}
if (nextChar == '>' && previousChar != '\\') { //if end of tag found, check tag matches start or end tag
readingTag = false;
newTag += nextChar;
//test for firstTag
if (startTagIndex == null) {
var alltestsPass = true;
for (var j in regexFirstTagParams) {
if (!regexFirstTagParams[j].test(newTag)) alltestsPass = false;
}
if (alltestsPass) {
startTagIndex = i + 1;
//console.log('Start Tag',startTagIndex)
matches++;
}
}
//test for lastTag
else if (startTagIndex != null) {
var alltestsPass = true;
for (var j in regexLastTagParams) {
if (!regexLastTagParams[j].test(newTag)) alltestsPass = false;
}
if (alltestsPass) {
endTagIndex = i + 1;
matches++;
}
}
if(startTagIndex && endTagIndex) break;
newTag = '';
}
if (readingTag) newTag += nextChar;
previousChar = nextChar;
}
if (matches < 2) return 'No matches';
else return html.substring(startTagIndex, endTagIndex).replace(/<[^>]+>/g, '');
}
function escapeRegex(string) {
if (string == null) return string;
return string.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
}
My function requires an array of attributes for the start tag and an array of attributes for the end tag. It gets any text in between and removes any tags found inbetween. One issue I also noticed was there were often special characters (eg  ) so they need to be replaced. I did that outside the scope of the function above.
The function could be easily improved to check the tag type (eg h2), but it wasn't necessary for the wikipedia case.
Here is a function where I called the above function. the html variable is just the result of UrlFetchApp.fetch('some wikipedia city url').getContextText();
function getWikiTexts(html) {
var geography = getTextBetweenTags(html, [['class="mw-headline"'], ['id="Geography']], [['class="mw-headline"']]);
var economy = getTextBetweenTags(html, 'span', [['class="mw-headline"'], ['id="Economy']], 'span', [['class="mw-headline"']])
var education = getTextBetweenTags(html, 'span', [['class="mw-headline"'], ['id="Education']], 'span', [['class="mw-headline"']])
var returnString = '';
if (geography != 'No matches' && !/Wikipedia/.test(geography)) returnString += geography + '\n';
if (economy != 'No matches' && !/Wikipedia/.test(economy)) returnString += economy + '\n';
if (education != 'No matches' && !/Wikipedia/.test(education)) returnString += education + '\n';
return returnString
}
Thanks for posting your answer.

Google app script - getting all hyperlinks from document [duplicate]

Given a "normal document" in Google Docs/Drive (e.g. paragraphs, lists, tables) which contains external links scattered throughout the content, how do you compile a list of links present using Google Apps Script?
Specifically, I want to update all broken links in the document by searching for oldText in each url and replace it with newText in each url, but not the text.
I don't think the replacing text section of the Dev Documentation is what I need -- do I need to scan every element of the doc? Can I just editAsText and use an html regex? Examples would be appreciated.

This is only mostly painful! Code is available as part of a gist.
Yeah, I can't spell.
getAllLinks
Here's a utility function that scans the document for all LinkUrls, returning them in an array.
/**
* Get an array of all LinkUrls in the document. The function is
* recursive, and if no element is provided, it will default to
* the active document's Body element.
*
* #param {Element} element The document element to operate on.
* .
* #returns {Array} Array of objects, vis
* {element,
* startOffset,
* endOffsetInclusive,
* url}
*/
function getAllLinks(element) {
var links = [];
element = element || DocumentApp.getActiveDocument().getBody();
if (element.getType() === DocumentApp.ElementType.TEXT) {
var textObj = element.editAsText();
var text = element.getText();
var inUrl = false;
for (var ch=0; ch < text.length; ch++) {
var url = textObj.getLinkUrl(ch);
if (url != null) {
if (!inUrl) {
// We are now!
inUrl = true;
var curUrl = {};
curUrl.element = element;
curUrl.url = String( url ); // grab a copy
curUrl.startOffset = ch;
}
else {
curUrl.endOffsetInclusive = ch;
}
}
else {
if (inUrl) {
// Not any more, we're not.
inUrl = false;
links.push(curUrl); // add to links
curUrl = {};
}
}
}
if (inUrl) {
// in case the link ends on the same char that the element does
links.push(curUrl);
}
}
else {
var numChildren = element.getNumChildren();
for (var i=0; i<numChildren; i++) {
links = links.concat(getAllLinks(element.getChild(i)));
}
}
return links;
}
findAndReplaceLinks
This utility builds on getAllLinks to do a find & replace function.
/**
* Replace all or part of UrlLinks in the document.
*
* #param {String} searchPattern the regex pattern to search for
* #param {String} replacement the text to use as replacement
*
* #returns {Number} number of Urls changed
*/
function findAndReplaceLinks(searchPattern,replacement) {
var links = getAllLinks();
var numChanged = 0;
for (var l=0; l<links.length; l++) {
var link = links[l];
if (link.url.match(searchPattern)) {
// This link needs to be changed
var newUrl = link.url.replace(searchPattern,replacement);
link.element.setLinkUrl(link.startOffset, link.endOffsetInclusive, newUrl);
numChanged++
}
}
return numChanged;
}
Demo UI
To demonstrate the use of these utilities, here are a couple of UI extensions:
function onOpen() {
// Add a menu with some items, some separators, and a sub-menu.
DocumentApp.getUi().createMenu('Utils')
.addItem('List Links', 'sidebarLinks')
.addItem('Replace Link Text', 'searchReplaceLinks')
.addToUi();
}
function searchReplaceLinks() {
var ui = DocumentApp.getUi();
var app = UiApp.createApplication()
.setWidth(250)
.setHeight(100)
.setTitle('Change Url text');
var form = app.createFormPanel();
var flow = app.createFlowPanel();
flow.add(app.createLabel("Find: "));
flow.add(app.createTextBox().setName("searchPattern"));
flow.add(app.createLabel("Replace: "));
flow.add(app.createTextBox().setName("replacement"));
var handler = app.createServerHandler('myClickHandler');
flow.add(app.createSubmitButton("Submit").addClickHandler(handler));
form.add(flow);
app.add(form);
ui.showDialog(app);
}
// ClickHandler to close dialog
function myClickHandler(e) {
var app = UiApp.getActiveApplication();
app.close();
return app;
}
function doPost(e) {
var numChanged = findAndReplaceLinks(e.parameter.searchPattern,e.parameter.replacement);
var ui = DocumentApp.getUi();
var app = UiApp.createApplication();
sidebarLinks(); // Update list
var result = DocumentApp.getUi().alert(
'Results',
"Changed "+numChanged+" urls.",
DocumentApp.getUi().ButtonSet.OK);
}
/**
* Shows a custom HTML user interface in a sidebar in the Google Docs editor.
*/
function sidebarLinks() {
var links = getAllLinks();
var sidebar = HtmlService
.createHtmlOutput()
.setTitle('URL Links')
.setWidth(350 /* pixels */);
// Display list of links, url only.
for (var l=0; l<links.length; l++) {
var link = links[l];
sidebar.append('<p>'+link.url);
}
DocumentApp.getUi().showSidebar(sidebar);
}

I offer another, shorter answer for your first question, concerning iterating through all links in a document's body. This instructive code returns a flat array of links in the current document's body, where each link is represented by an object with entries pointing to the text element (text), the paragraph element or list item element in which it's contained (paragraph), the offset index in the text where the link appears (startOffset) and the URL itself (url). Hopefully, you'll find it easy to suit it for your own needs.
It uses the getTextAttributeIndices() method rather than iterating over every character of the text, and is thus expected to perform much more quickly than previously written answers.
EDIT: Since originally posting this answer, I modified the function a couple of times. It now also (1) includes the endOffsetInclusive property for each link (note that it can be null for links that extend to the end of the text element - in this case one can use link.text.length-1 instead); (2) finds links in all sections of the document, not only the body, and (3) includes the section and isFirstPageSection properties to indicate where the link is located; (4) accepts the argument mergeAdjacent, which when set to true, will return only a single link entry for a continuous stretch of text linked to the same URL (which would be considered separate if, for instance, part of the text is styled differently than another part).
For the purpose of including links under all sections, a new utility function, iterateSections(), was introduced.
/**
* Returns a flat array of links which appear in the active document's body.
* Each link is represented by a simple Javascript object with the following
* keys:
* - "section": {ContainerElement} the document section in which the link is
* found.
* - "isFirstPageSection": {Boolean} whether the given section is a first-page
* header/footer section.
* - "paragraph": {ContainerElement} contains a reference to the Paragraph
* or ListItem element in which the link is found.
* - "text": the Text element in which the link is found.
* - "startOffset": {Number} the position (offset) in the link text begins.
* - "endOffsetInclusive": the position of the last character of the link
* text, or null if the link extends to the end of the text element.
* - "url": the URL of the link.
*
* #param {boolean} mergeAdjacent Whether consecutive links which carry
* different attributes (for any reason) should be returned as a single
* entry.
*
* #returns {Array} the aforementioned flat array of links.
*/
function getAllLinks(mergeAdjacent) {
var links = [];
var doc = DocumentApp.getActiveDocument();
iterateSections(doc, function(section, sectionIndex, isFirstPageSection) {
if (!("getParagraphs" in section)) {
// as we're using some undocumented API, adding this to avoid cryptic
// messages upon possible API changes.
throw new Error("An API change has caused this script to stop " +
"working.\n" +
"Section #" + sectionIndex + " of type " +
section.getType() + " has no .getParagraphs() method. " +
"Stopping script.");
}
section.getParagraphs().forEach(function(par) {
// skip empty paragraphs
if (par.getNumChildren() == 0) {
return;
}
// go over all text elements in paragraph / list-item
for (var el=par.getChild(0); el!=null; el=el.getNextSibling()) {
if (el.getType() != DocumentApp.ElementType.TEXT) {
continue;
}
// go over all styling segments in text element
var attributeIndices = el.getTextAttributeIndices();
var lastLink = null;
attributeIndices.forEach(function(startOffset, i, attributeIndices) {
var url = el.getLinkUrl(startOffset);
if (url != null) {
// we hit a link
var endOffsetInclusive = (i+1 < attributeIndices.length?
attributeIndices[i+1]-1 : null);
// check if this and the last found link are continuous
if (mergeAdjacent && lastLink != null && lastLink.url == url &&
lastLink.endOffsetInclusive == startOffset - 1) {
// this and the previous style segment are continuous
lastLink.endOffsetInclusive = endOffsetInclusive;
return;
}
lastLink = {
"section": section,
"isFirstPageSection": isFirstPageSection,
"paragraph": par,
"textEl": el,
"startOffset": startOffset,
"endOffsetInclusive": endOffsetInclusive,
"url": url
};
links.push(lastLink);
}
});
}
});
});
return links;
}
/**
* Calls the given function for each section of the document (body, header,
* etc.). Sections are children of the DocumentElement object.
*
* #param {Document} doc The Document object (such as the one obtained via
* a call to DocumentApp.getActiveDocument()) with the sections to iterate
* over.
* #param {Function} func A callback function which will be called, for each
* section, with the following arguments (in order):
* - {ContainerElement} section - the section element
* - {Number} sectionIndex - the child index of the section, such that
* doc.getBody().getParent().getChild(sectionIndex) == section.
* - {Boolean} isFirstPageSection - whether the section is a first-page
* header/footer section.
*/
function iterateSections(doc, func) {
// get the DocumentElement interface to iterate over all sections
// this bit is undocumented API
var docEl = doc.getBody().getParent();
var regularHeaderSectionIndex = (doc.getHeader() == null? -1 :
docEl.getChildIndex(doc.getHeader()));
var regularFooterSectionIndex = (doc.getFooter() == null? -1 :
docEl.getChildIndex(doc.getFooter()));
for (var i=0; i<docEl.getNumChildren(); ++i) {
var section = docEl.getChild(i);
var sectionType = section.getType();
var uniqueSectionName;
var isFirstPageSection = (
i != regularHeaderSectionIndex &&
i != regularFooterSectionIndex &&
(sectionType == DocumentApp.ElementType.HEADER_SECTION ||
sectionType == DocumentApp.ElementType.FOOTER_SECTION));
func(section, i, isFirstPageSection);
}
}

I was playing around and incorporated #Mogsdad's answer -- here's the really complicated version:
var _ = Underscorejs.load(); // loaded via http://googleappsdeveloper.blogspot.com/2012/11/using-open-source-libraries-in-apps.html, rolled my own
var ui = DocumentApp.getUi();
// #region --------------------- Utilities -----------------------------
var gDocsHelper = (function(P, un) {
// heavily based on answer https://stackoverflow.com/a/18731628/1037948
var updatedLinkText = function(link, offset) {
return function() { return 'Text: ' + link.getText().substring(offset,100) + ((link.getText().length-offset) > 100 ? '...' : ''); }
}
P.updateLink = function updateLink(link, oldText, newText, start, end) {
var oldLink = link.getLinkUrl(start);
if(0 > oldLink.indexOf(oldText)) return false;
var newLink = oldLink.replace(new RegExp(oldText, 'g'), newText);
link.setLinkUrl(start || 0, (end || oldLink.length), newLink);
log(true, "Updating Link: ", oldLink, newLink, start, end, updatedLinkText(link, start) );
return { old: oldLink, "new": newLink, getText: updatedLinkText(link, start) };
};
// moving this reused block out to 'private' fn
var updateLinkResult = function(text, oldText, newText, link, urls, sidebar, updateResult) {
// and may as well update the link while we're here
if(false !== (updateResult = P.updateLink(text, oldText, newText, link.start, link.end))) {
sidebar.append('<li>' + updateResult['old'] + ' → ' + updateResult['new'] + ' at ' + updateResult['getText']() + '</li>');
}
urls.push(link.url); // so multiple links get added to list
};
P.updateLinksMenu = function() {
// https://developers.google.com/apps-script/reference/base/prompt-response
var oldText = ui.prompt('Old link text to replace').getResponseText();
var newText = ui.prompt('New link text to replace with').getResponseText();
log('Replacing: ' + oldText + ', ' + newText);
var sidebar = gDocUiHelper.createSidebar('Update All Links', '<h3>Replacing</h3><p><code>' + oldText + '</code> → <code>' + newText + '</code></p><hr /><ol>');
// current doc available to script
var doc = DocumentApp.getActiveDocument().getBody();//.getActiveSection();
// Search until a link is found
var links = P.findAllElementsFor(doc, function(text) {
var i = -1, n = text.getText().length, link = false, url, urls = [], updateResult;
// note: the following only gets the FIRST link in the text -- while(i < n && !(url = text.getLinkUrl(i++)));
// scan the text element for links
while(++i < n) {
// getLinkUrl will continue to get a link while INSIDE the stupid link, so only do this once
if(url = text.getLinkUrl(i)) {
if(false === link) {
link = { start: i, end: -1, url: url };
// log(true, 'Type: ' + text.getType(), 'Link: ' + url, function() { return 'Text: ' + text.getText().substring(i,100) + ((n-i) > 100 ? '...' : '')});
}
else {
link.end = i; // keep updating the end position until we leave
}
}
// just left the link -- reset link tracking
else if(false !== link) {
// and may as well update the link while we're here
updateLinkResult(text, oldText, newText, link, urls, sidebar);
link = false; // reset "counter"
}
}
// once we've reached the end of the text, must also check to see if the last thing we found was a link
if(false !== link) updateLinkResult(text, oldText, newText, link, urls, sidebar);
return urls;
});
sidebar.append('</ol><p><strong>' + links.length + ' links reviewed</strong></p>');
gDocUiHelper.attachSidebar(sidebar);
log(links);
};
P.findAllElementsFor = function(el, test) {
// generic utility function to recursively find all elements; heavily based on https://stackoverflow.com/a/18731628/1037948
var results = [], searchResult = null, i, result;
// https://developers.google.com/apps-script/reference/document/body#findElement(ElementType)
while (searchResult = el.findElement(DocumentApp.ElementType.TEXT, searchResult)) {
var t = searchResult.getElement().editAsText(); // .asParagraph()
// check to add to list
if(test && (result = test(t))) {
if( _.isArray(result) ) results = results.concat(result); // could be big? http://jsperf.com/self-concatenation/
else results.push(result);
}
}
// recurse children if not plain text item
if(el.getType() !== DocumentApp.ElementType.TEXT) {
i = el.getNumChildren();
var result;
while(--i > 0) {
result = P.findAllElementsFor(el.getChild(i));
if(result && result.length > 0) results = results.concat(result);
}
}
return results;
};
return P;
})({});
// really? it can't handle object properties?
function gDocsUpdateLinksMenu() {
gDocsHelper.updateLinksMenu();
}
gDocUiHelper.addMenu('Zaus', [ ['Update links', 'gDocsUpdateLinksMenu'] ]);
// #endregion --------------------- Utilities -----------------------------
And I'm including the "extra" utility classes for creating menus, sidebars, etc below for completeness:
var log = function() {
// return false;
var args = Array.prototype.slice.call(arguments);
// allowing functions delegates execution so we can save some non-debug cycles if code left in?
if(args[0] === true) Logger.log(_.map(args, function(v) { return _.isFunction(v) ? v() : v; }).join('; '));
else
_.each(args, function(v) {
Logger.log(_.isFunction(v) ? v() : v);
});
}
// #region --------------------- Menu -----------------------------
var gDocUiHelper = (function(P, un) {
P.addMenuToSheet = function addMenu(spreadsheet, title, items) {
var menu = ui.createMenu(title);
// make sure menu items are correct format
_.each(items, function(v,k) {
var err = [];
// provided in format [ [name, fn],... ] instead
if( _.isArray(v) ) {
if ( v.length === 2 ) {
menu.addItem(v[0], v[1]);
}
else {
err.push('Menu item ' + k + ' missing name or function: ' + v.join(';'))
}
}
else {
if( !v.name ) err.push('Menu item ' + k + ' lacks name');
if( !v.functionName ) err.push('Menu item ' + k + ' lacks function');
if(!err.length) menu.addItem(v.name, v.functionName);
}
if(err.length) {
log(err);
ui.alert(err.join('; '));
}
});
menu.addToUi();
};
// list of things to hook into
var initializers = {};
P.addMenu = function(menuTitle, menuItems) {
if(initializers[menuTitle] === un) {
initializers[menuTitle] = [];
}
initializers[menuTitle] = initializers[menuTitle].concat(menuItems);
};
P.createSidebar = function(title, content, options) {
var sidebar = HtmlService
.createHtmlOutput()
.setTitle(title)
.setWidth( (options && options.width) ? width : 350 /* pixels */);
sidebar.append(content);
if(options && options.on) DocumentApp.getUi().showSidebar(sidebar);
// else { sidebar.attach = function() { DocumentApp.getUi().showSidebar(this); }; } // should really attach to prototype...
return sidebar;
};
P.attachSidebar = function(sidebar) {
DocumentApp.getUi().showSidebar(sidebar);
};
P.onOpen = function() {
var spreadsheet = SpreadsheetApp.getActive();
log(initializers);
_.each(initializers, function(v,k) {
P.addMenuToSheet(spreadsheet, k, v);
});
};
return P;
})({});
// #endregion --------------------- Menu -----------------------------
/**
* A special function that runs when the spreadsheet is open, used to add a
* custom menu to the spreadsheet.
*/
function onOpen() {
gDocUiHelper.onOpen();
}

Had some trouble getting Mogsdad's solution to work. Specifically it misses links which end their parent element so there isn't a trailing non-link character to terminate it. I've implemented something which addresses this and returns a standard range element. Sharing here incase someone finds it useful.
function getAllLinks(element) {
var rangeBuilder = DocumentApp.getActiveDocument().newRange();
// Parse the text iteratively to find the start and end indices for each link
if (element.getType() === DocumentApp.ElementType.TEXT) {
var links = [];
var string = element.getText();
var previousUrl = null; // The URL of the previous character
var currentLink = null; // The latest link being built
for (var charIndex = 0; charIndex < string.length; charIndex++) {
var currentUrl = element.getLinkUrl(charIndex);
// New URL means create a new link
if (currentUrl !== null && previousUrl !== currentUrl) {
if (currentLink !== null) links.push(currentLink);
currentLink = {};
currentLink.url = String(currentUrl);
currentLink.startOffset = charIndex;
}
// In a URL means extend the end of the current link
if (currentUrl !== null) {
currentLink.endOffsetInclusive = charIndex;
}
// Not in a URL means close and push the link if ready
if (currentUrl === null) {
if (currentLink !== null) links.push(currentLink);
currentLink = null;
}
// End the loop and go again
previousUrl = currentUrl;
}
// Handle the end case when final character is a link
if (currentLink !== null) links.push(currentLink);
// Convert the links into a range before returning
links.forEach(function(link) {
rangeBuilder.addElement(element, link.startOffset, link.endOffsetInclusive);
});
}
// If not a text element then recursively get links from child elements
else if (element.getNumChildren) {
for (var i = 0; i < element.getNumChildren(); i++) {
rangeBuilder.addRange(getAllLinks(element.getChild(i)));
}
}
return rangeBuilder.build();
}

You are right ... search and replace is not applicable here.
Use setLinkUrl() https://developers.google.com/apps-script/reference/document/container-element#setLinkUrl(String)
Basically you have to iterate through the elements recursively (elements can contain elements) and for each
use getLinkUrl() to get the oldText
if not null , setLinkUrl(newText) .... leaves displayed text unchanged

This Excel macro lists the links from a Word doc. You'd need to copy your data into a Word doc first.
Sub getLinks()
Dim wApp As Word.Application, wDoc As Word.Document
Dim i As Integer, r As Range
Const filePath = "C:\test\test.docx"
Set wApp = CreateObject("Word.Application")
'wApp.Visible = True
Set wDoc = wApp.Documents.Open(filePath)
Set r = Range("A1")
For i = 1 To wDoc.Hyperlinks.Count
r = wDoc.Hyperlinks(i).Address
Set r = r.Offset(1, 0)
Next i
wApp.Quit
Set wDoc = Nothing
Set wApp = Nothing
End Sub

Here's a quick and dirty way to accomplish the same goal with no scripting:
From Google Docs, save the document in RTF format.
In your editor of choice, edit the links in the RTF file (in my case, I wanted to modify all the hyperlinks, so I used Emacs and regexp-replace). Save the file when you're done.
Create a fresh, new Google Doc, and from the menu, select File>Open and open the RTF file. Docs will convert your edited RTF file back into a proper Google Doc, restoring all formatting.
Google Docs' RTF format is pretty complete--I haven't noticed any loss of fidelity in making the round trip, and it has the advantage of fully exposing all the hyperlinks, formatting, and everything else about the document in a form that's easy to edit and to apply regex tools to.

Can Google apps script be used to randomize page order on Google forms?

Update #2: Okay, I'm pretty sure my error in update #1 was because of indexing out of bounds over the array (I'm still not used to JS indexing at 0). But here is the new problem... if I write out the different combinations of the loop manually, setting the page index to 1 in moveItem() like so:
newForm.moveItem(itemsArray[0][0], 1);
newForm.moveItem(itemsArray[0][1], 1);
newForm.moveItem(itemsArray[0][2], 1);
newForm.moveItem(itemsArray[1][0], 1);
newForm.moveItem(itemsArray[1][1], 1);
newForm.moveItem(itemsArray[1][2], 1);
newForm.moveItem(itemsArray[2][0], 1);
...
...I don't get any errors but the items end up on different pages! What is going on?
Update #1:: Using Sandy Good's answer as well as a script I found at this WordPress blog, I have managed to get closer to what I needed. I believe Sandy Good misinterpreted what I wanted to do because I wasn't specific enough in my question.
I would like to:
Get all items from a page (section header, images, question etc)
Put them into an array
Do this for all pages, adding these arrays to an array (i.e: [[all items from page 1][all items from page 2][all items from page 3]...])
Shuffle the elements of this array
Repopulate a new form with each element of this array. In this way, page order will be randomized.
My JavaScript skills are poor (this is the first time I've used it). There is a step that produces null entries and I don't know why... I had to remove them manually. I am not able to complete step 5 as I get the following error:
Cannot convert Item,Item,Item to (class).
"Item,Item,Item" is the array element containing all the items from a particular page. So it seems that I can't add three items to a page at a time? Or is something else going on here?
Here is my code:
function shuffleForms() {
var itemsArray,shuffleQuestionsInNewForm,fncGetQuestionID,
newFormFile,newForm,newID,shuffle, sections;
// Copy template form by ID, set a new name
newFormFile = DriveApp.getFileById('1prfcl-RhaD4gn0b2oP4sbcKaRcZT5XoCAQCbLm1PR7I')
.makeCopy();
newFormFile.setName('AAAAA_Shuffled_Form');
// Get ID of new form and open it
newID = newFormFile.getId();
newForm = FormApp.openById(newID);
// Initialize array to put IDs in
itemsArray = [];
function getPageItems(thisPageNum) {
Logger.log("Getting items for page number: " + thisPageNum );
var thisPageItems = []; // Used for result
var thisPageBreakIndex = getPageItem(thisPageNum).getIndex();
Logger.log( "This is index num : " + thisPageBreakIndex );
// Get all items from page
var allItems = newForm.getItems();
thisPageItems.push(allItems[thisPageBreakIndex]);
Logger.log( "Added pagebreak item: " + allItems[thisPageBreakIndex].getIndex() );
for( var i = thisPageBreakIndex+1; ( i < allItems.length ) && ( allItems[i].getType() != FormApp.ItemType.PAGE_BREAK ); ++i ) {
thisPageItems.push(allItems[i]);
Logger.log( "Added non-pagebreak item: " + allItems[i].getIndex() );
}
return thisPageItems;
}
function shuffle(array) {
var currentIndex = array.length, temporaryValue, randomIndex;
Logger.log('shuffle ran')
// While there remain elements to shuffle...
while (0 !== currentIndex) {
// Pick a remaining element...
randomIndex = Math.floor(Math.random() * currentIndex);
currentIndex -= 1;
// And swap it with the current element.
temporaryValue = array[currentIndex];
array[currentIndex] = array[randomIndex];
array[randomIndex] = temporaryValue;
}
return array;
}
function shuffleAndMove() {
// Get page items for all pages into an array
for(i = 2; i <= 5; i++) {
itemsArray[i] = getPageItems(i);
}
// Removes null values from array
itemsArray = itemsArray.filter(function(x){return x});
// Shuffle page items
itemsArray = shuffle(itemsArray);
// Move page items to the new form
for(i = 2; i <= 5; ++i) {
newForm.moveItem(itemsArray[i], i);
}
}
shuffleAndMove();
}
Original post: I have used Google forms to create a questionnaire. For my purposes, each question needs to be on a separate page but I need the pages to be randomized. A quick Google search shows this feature has not been added yet.
I see that the Form class in the Google apps script has a number of methods that alter/give access to various properties of Google Forms. Since I do not know Javascript and am not too familiar with Google apps/API I would like to know if what I am trying to do is even possible before diving in and figuring it all out.
If it is possible, I would appreciate any insight on what methods would be relevant for this task just to give me some direction to get started.
Based on comments from Sandy Good and two SE questions found here and here, this is the code I have so far:
// Script to shuffle question in a Google Form when the questions are in separate sections
function shuffleFormSections() {
getQuestionID();
createNewShuffledForm();
}
// Get question IDs
function getQuestionID() {
var form = FormApp.getActiveForm();
var items = form.getItems();
arrayID = [];
for (var i in items) {
arrayID[i] = items[i].getId();
}
// Logger.log(arrayID);
return(arrayID);
}
// Shuffle function
function shuffle(a) {
var j, x, i;
for (i = a.length; i; i--) {
j = Math.floor(Math.random() * i);
x = a[i - 1];
a[i - 1] = a[j];
a[j] = x;
}
}
// Shuffle IDs and create new form with new question order
function createNewShuffledForm() {
shuffle(arrayID);
// Logger.log(arrayID);
var newForm = FormApp.create('Shuffled Form');
for (var i in arrayID) {
arrayID[i].getItemsbyId();
}
}

Try this. There's a few "constants" to be set at the top of the function, check the comments. Form file copying and opening borrowed from Sandy Good's answer, thanks!
// This is the function to run, all the others here are helper functions
// You'll need to set your source file id and your destination file name in the
// constants at the top of this function here.
// It appears that the "Title" page does not count as a page, so you don't need
// to include it in the PAGES_AT_BEGINNING_TO_NOT_SHUFFLE count.
function shuffleFormPages() {
// UPDATE THESE CONSTANTS AS NEEDED
var PAGES_AT_BEGINNING_TO_NOT_SHUFFLE = 2; // preserve X intro pages; shuffle everything after page X
var SOURCE_FILE_ID = 'YOUR_SOURCE_FILE_ID_HERE';
var DESTINATION_FILE_NAME = 'YOUR_DESTINATION_FILE_NAME_HERE';
// Copy template form by ID, set a new name
var newFormFile = DriveApp.getFileById(SOURCE_FILE_ID).makeCopy();
newFormFile.setName(DESTINATION_FILE_NAME);
// Open the duplicated form file as a form
var newForm = FormApp.openById(newFormFile.getId());
var pages = extractPages(newForm);
shuffleEndOfPages(pages, PAGES_AT_BEGINNING_TO_NOT_SHUFFLE);
var shuffledFormItems = flatten(pages);
setFormItems(newForm, shuffledFormItems);
}
// Builds an array of "page" arrays. Each page array starts with a page break
// and continues until the next page break.
function extractPages(form) {
var formItems = form.getItems();
var currentPage = [];
var allPages = [];
formItems.forEach(function(item) {
if (item.getType() == FormApp.ItemType.PAGE_BREAK && currentPage.length > 0) {
// found a page break (and it isn't the first one)
allPages.push(currentPage); // push what we've built for this page onto the output array
currentPage = [item]; // reset the current page to just this most recent item
} else {
currentPage.push(item);
}
});
// We've got the last page dangling, so add it
allPages.push(currentPage);
return allPages;
};
// startIndex is the array index to start shuffling from. E.g. to start
// shuffling on page 5, startIndex should be 4. startIndex could also be thought
// of as the number of pages to keep unshuffled.
// This function has no return value, it just mutates pages
function shuffleEndOfPages(pages, startIndex) {
var currentIndex = pages.length;
// While there remain elements to shuffle...
while (currentIndex > startIndex) {
// Pick an element between startIndex and currentIndex (inclusive)
var randomIndex = Math.floor(Math.random() * (currentIndex - startIndex)) + startIndex;
currentIndex -= 1;
// And swap it with the current element.
var temporaryValue = pages[currentIndex];
pages[currentIndex] = pages[randomIndex];
pages[randomIndex] = temporaryValue;
}
};
// Sourced from elsewhere on SO:
// https://stackoverflow.com/a/15030117/4280232
function flatten(array) {
return array.reduce(
function (flattenedArray, toFlatten) {
return flattenedArray.concat(Array.isArray(toFlatten) ? flatten(toFlatten) : toFlatten);
},
[]
);
};
// No safety checks around items being the same as the form length or whatever.
// This mutates form.
function setFormItems(form, items) {
items.forEach(function(item, index) {
form.moveItem(item, index);
});
};

I tested this code. It created a new Form, and then shuffled the questions in the new Form. It excludes page breaks, images and section headers. You need to provide a source file ID for the original template Form. This function has 3 inner sub-functions. The inner functions are at the top, and they are called at the bottom of the outer function. The arrayOfIDs variable does not need to be returned or passed to another function because it is available in the outer scope.
function shuffleFormSections() {
var arrayOfIDs,shuffleQuestionsInNewForm,fncGetQuestionID,
newFormFile,newForm,newID,items,shuffle;
newFormFile = DriveApp.getFileById('Put the source file ID here')
.makeCopy();
newFormFile.setName('AAAAA_Shuffled_Form');
newID = newFormFile.getId();
newForm = FormApp.openById(newID);
arrayOfIDs = [];
fncGetQuestionID = function() {
var i,L,thisID,thisItem,thisType;
items = newForm.getItems();
L = items.length;
for (i=0;i<L;i++) {
thisItem = items[i];
thisType = thisItem.getType();
if (thisType === FormApp.ItemType.PAGE_BREAK ||
thisType === FormApp.ItemType.SECTION_HEADER ||
thisType === FormApp.ItemType.IMAGE) {
continue;
}
thisID = thisItem.getId();
arrayOfIDs.push(thisID);
}
Logger.log('arrayOfIDs: ' + arrayOfIDs);
//the array arrayOfIDs does not need to be returned since it is available
//in the outermost scope
}// End of fncGetQuestionID function
shuffle = function() {// Shuffle function
var j, x, i;
Logger.log('shuffle ran')
for (i = arrayOfIDs.length; i; i--) {
j = Math.floor(Math.random() * i);
Logger.log('j: ' + j)
x = arrayOfIDs[i - 1];
Logger.log('x: ' + x)
arrayOfIDs[i - 1] = arrayOfIDs[j];
arrayOfIDs[j] = x;
}
Logger.log('arrayOfIDs: ' + arrayOfIDs)
}
shuffleQuestionsInNewForm = function() {
var i,L,thisID,thisItem,thisQuestion,questionType;
L = arrayOfIDs.length;
for (i=0;i<L;i++) {
thisID = arrayOfIDs[i];
Logger.log('thisID: ' + thisID)
thisItem = newForm.getItemById(thisID);
newForm.moveItem(thisItem, i)
}
}
fncGetQuestionID();//Get all the question ID's and put them into an array
shuffle();
shuffleQuestionsInNewForm();
}

Scrape site to report css selector occurrence in HTML

I want to see how much of my team's code has been integrated into a large scale site.
I believe I can achieve this (albeit roughly), by getting statistics on the number of occurrences certain CSS selectors appear across all the HTML pages. I have some unique CSS class selectors that I would like to use when scraping the site to analyze:
On how many pages the selector occurs.
On any page it does, how many times.
I've looked around but can't find any tools - does anyone know of any, or could suggest any idea's that may help me quickly achieve this ?
Thanks in advance.

Thanks to everyone for their advice.
In the end I decided that there was no one tool that could help me gather the statistics in the way I described so I already started to build up the application I needed in Node. Although I've not used Node before I've found it quick to grasp with an intermediate knowledge of Javascript.
For anyone looking to do the same:
I've used Simplecrawler to run over the site and Cheerio to find selectors and from this I can create a simple report created in Json using FS.

I'd recommend you to use Google App Scripting. You might manage to crawl site's pages and count the CSS selector occurrences with regex. Modify he following code to search each page for CSS selector. The code explanation is here.
Code
function onOpen() {
DocumentApp.getUi() // Or DocumentApp or FormApp.
.createMenu('New scrape web docs')
.addItem('Enter Url', 'showPrompt')
.addToUi();
}
function showPrompt() {
var ui = DocumentApp.getUi();
var result = ui.prompt(
'Scrape whole website into text!',
'Please enter website url (with http(s)://):',
ui.ButtonSet.OK_CANCEL);
// Process the user's response.
var button = result.getSelectedButton();
var url = result.getResponseText();
var links=[];
var base_url = url;
if (button == ui.Button.OK) { // User clicked "OK".
if(!isValidURL(url))
{
ui.alert('Your url is not valid.');
}
else {
// gather initial links
var inner_links_arr = scrapeAndPaste(url, 1); // first run and clear the document
links = links.concat(inner_links_arr); // append an array to all the links
var new_links=[]; // array for new links
var processed_urls =[url]; // processed links
var link, current;
while (links.length)
{
link = links.shift(); // get the most left link (inner url)
processed_urls.push(link);
current = base_url + link;
new_links = scrapeAndPaste(current, 0); // second and consecutive runs we do not clear up the document
//ui.alert('Processed... ' + current + '\nReturned links: ' + new_links.join('\n') );
// add new links into links array (stack) if appropriate
for (var i in new_links){
var item = new_links[i];
if (links.indexOf(item) === -1 && processed_urls.indexOf(item) === -1)
links.push(item);
}
/* // alert message for debugging
ui.alert('Links in stack: ' + links.join(' ')
+ '\nTotal links in stack: ' + links.length
+ '\nProcessed: ' + processed_urls.join(' ')
+ '\nTotal processed: ' + processed_urls.length);
*/
}
}
}
}
function scrapeAndPaste(url, clear) {
var text;
try {
var html = UrlFetchApp.fetch(url).getContentText();
// some html pre-processing
if (html.indexOf('</head>') !== -1 ){
html = html.split('</head>')[1];
}
if (html.indexOf('</body>') !== -1 ){ // thus we split the body only
html = html.split('</body>')[0] + '</body>';
}
// fetch inner links
var inner_links_arr= [];
var linkRegExp = /href="(.*?)"/gi; // regex expression object
var match = linkRegExp.exec(html);
while (match != null) {
// matched text: match[0]
if (match[1].indexOf('#') !== 0
&& match[1].indexOf('http') !== 0
//&& match[1].indexOf('https://') !== 0
&& match[1].indexOf('mailto:') !== 0
&& match[1].indexOf('.pdf') === -1 ) {
inner_links_arr.push(match[1]);
}
// match start: match.index
// capturing group n: match[n]
match = linkRegExp.exec(html);
}
text = getTextFromHtml(html);
outputText(url, text, clear); // output text into the current document with given url
return inner_links_arr; //we return all inner links of this doc as array
} catch (e) {
MailApp.sendEmail(Session.getActiveUser().getEmail(), "Scrape error report at "
+ Utilities.formatDate(new Date(), "GMT", "yyyy-MM-dd HH:mm:ss"),
"\r\nMessage: " + e.message
+ "\r\nFile: " + e.fileName+ '.gs'
+ "\r\nWeb page under scrape: " + url
+ "\r\nLine: " + e.lineNumber);
outputText(url, 'Scrape error for this page cause of malformed html!', clear);
}
}
function getTextFromHtml(html) {
return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
switch(x.toString()) {
case 'XmlText': return x.toXmlString();
case 'XmlElement': return x.getNodes().map(getTextFromNode).join(' ');
default: return '';
}
}
function outputText(url, text, clear){
var body = DocumentApp.getActiveDocument().getBody();
if (clear){
body.clear();
}
else {
body.appendHorizontalRule();
}
var section = body.appendParagraph(' * ' + url);
section.setHeading(DocumentApp.ParagraphHeading.HEADING2);
body.appendParagraph(text);
}
function isValidURL(url){
var RegExp = /^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-f\d]{2,2})+(:([\d\w]|%[a-fA-f\d]{2,2})+)?#)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-f\d]{2,2})=?)*)?(#([-+_~.\d\w]|%[a-fA-f\d]{2,2})*)?$/;
if(RegExp.test(url)){
return true;
}else{
return false;
}
}

Get All Links in a Document

Given a "normal document" in Google Docs/Drive (e.g. paragraphs, lists, tables) which contains external links scattered throughout the content, how do you compile a list of links present using Google Apps Script?
Specifically, I want to update all broken links in the document by searching for oldText in each url and replace it with newText in each url, but not the text.
I don't think the replacing text section of the Dev Documentation is what I need -- do I need to scan every element of the doc? Can I just editAsText and use an html regex? Examples would be appreciated.

This is only mostly painful! Code is available as part of a gist.
Yeah, I can't spell.
getAllLinks
Here's a utility function that scans the document for all LinkUrls, returning them in an array.
/**
* Get an array of all LinkUrls in the document. The function is
* recursive, and if no element is provided, it will default to
* the active document's Body element.
*
* #param {Element} element The document element to operate on.
* .
* #returns {Array} Array of objects, vis
* {element,
* startOffset,
* endOffsetInclusive,
* url}
*/
function getAllLinks(element) {
var links = [];
element = element || DocumentApp.getActiveDocument().getBody();
if (element.getType() === DocumentApp.ElementType.TEXT) {
var textObj = element.editAsText();
var text = element.getText();
var inUrl = false;
for (var ch=0; ch < text.length; ch++) {
var url = textObj.getLinkUrl(ch);
if (url != null) {
if (!inUrl) {
// We are now!
inUrl = true;
var curUrl = {};
curUrl.element = element;
curUrl.url = String( url ); // grab a copy
curUrl.startOffset = ch;
}
else {
curUrl.endOffsetInclusive = ch;
}
}
else {
if (inUrl) {
// Not any more, we're not.
inUrl = false;
links.push(curUrl); // add to links
curUrl = {};
}
}
}
if (inUrl) {
// in case the link ends on the same char that the element does
links.push(curUrl);
}
}
else {
var numChildren = element.getNumChildren();
for (var i=0; i<numChildren; i++) {
links = links.concat(getAllLinks(element.getChild(i)));
}
}
return links;
}
findAndReplaceLinks
This utility builds on getAllLinks to do a find & replace function.
/**
* Replace all or part of UrlLinks in the document.
*
* #param {String} searchPattern the regex pattern to search for
* #param {String} replacement the text to use as replacement
*
* #returns {Number} number of Urls changed
*/
function findAndReplaceLinks(searchPattern,replacement) {
var links = getAllLinks();
var numChanged = 0;
for (var l=0; l<links.length; l++) {
var link = links[l];
if (link.url.match(searchPattern)) {
// This link needs to be changed
var newUrl = link.url.replace(searchPattern,replacement);
link.element.setLinkUrl(link.startOffset, link.endOffsetInclusive, newUrl);
numChanged++
}
}
return numChanged;
}
Demo UI
To demonstrate the use of these utilities, here are a couple of UI extensions:
function onOpen() {
// Add a menu with some items, some separators, and a sub-menu.
DocumentApp.getUi().createMenu('Utils')
.addItem('List Links', 'sidebarLinks')
.addItem('Replace Link Text', 'searchReplaceLinks')
.addToUi();
}
function searchReplaceLinks() {
var ui = DocumentApp.getUi();
var app = UiApp.createApplication()
.setWidth(250)
.setHeight(100)
.setTitle('Change Url text');
var form = app.createFormPanel();
var flow = app.createFlowPanel();
flow.add(app.createLabel("Find: "));
flow.add(app.createTextBox().setName("searchPattern"));
flow.add(app.createLabel("Replace: "));
flow.add(app.createTextBox().setName("replacement"));
var handler = app.createServerHandler('myClickHandler');
flow.add(app.createSubmitButton("Submit").addClickHandler(handler));
form.add(flow);
app.add(form);
ui.showDialog(app);
}
// ClickHandler to close dialog
function myClickHandler(e) {
var app = UiApp.getActiveApplication();
app.close();
return app;
}
function doPost(e) {
var numChanged = findAndReplaceLinks(e.parameter.searchPattern,e.parameter.replacement);
var ui = DocumentApp.getUi();
var app = UiApp.createApplication();
sidebarLinks(); // Update list
var result = DocumentApp.getUi().alert(
'Results',
"Changed "+numChanged+" urls.",
DocumentApp.getUi().ButtonSet.OK);
}
/**
* Shows a custom HTML user interface in a sidebar in the Google Docs editor.
*/
function sidebarLinks() {
var links = getAllLinks();
var sidebar = HtmlService
.createHtmlOutput()
.setTitle('URL Links')
.setWidth(350 /* pixels */);
// Display list of links, url only.
for (var l=0; l<links.length; l++) {
var link = links[l];
sidebar.append('<p>'+link.url);
}
DocumentApp.getUi().showSidebar(sidebar);
}

I offer another, shorter answer for your first question, concerning iterating through all links in a document's body. This instructive code returns a flat array of links in the current document's body, where each link is represented by an object with entries pointing to the text element (text), the paragraph element or list item element in which it's contained (paragraph), the offset index in the text where the link appears (startOffset) and the URL itself (url). Hopefully, you'll find it easy to suit it for your own needs.
It uses the getTextAttributeIndices() method rather than iterating over every character of the text, and is thus expected to perform much more quickly than previously written answers.
EDIT: Since originally posting this answer, I modified the function a couple of times. It now also (1) includes the endOffsetInclusive property for each link (note that it can be null for links that extend to the end of the text element - in this case one can use link.text.length-1 instead); (2) finds links in all sections of the document, not only the body, and (3) includes the section and isFirstPageSection properties to indicate where the link is located; (4) accepts the argument mergeAdjacent, which when set to true, will return only a single link entry for a continuous stretch of text linked to the same URL (which would be considered separate if, for instance, part of the text is styled differently than another part).
For the purpose of including links under all sections, a new utility function, iterateSections(), was introduced.
/**
* Returns a flat array of links which appear in the active document's body.
* Each link is represented by a simple Javascript object with the following
* keys:
* - "section": {ContainerElement} the document section in which the link is
* found.
* - "isFirstPageSection": {Boolean} whether the given section is a first-page
* header/footer section.
* - "paragraph": {ContainerElement} contains a reference to the Paragraph
* or ListItem element in which the link is found.
* - "text": the Text element in which the link is found.
* - "startOffset": {Number} the position (offset) in the link text begins.
* - "endOffsetInclusive": the position of the last character of the link
* text, or null if the link extends to the end of the text element.
* - "url": the URL of the link.
*
* #param {boolean} mergeAdjacent Whether consecutive links which carry
* different attributes (for any reason) should be returned as a single
* entry.
*
* #returns {Array} the aforementioned flat array of links.
*/
function getAllLinks(mergeAdjacent) {
var links = [];
var doc = DocumentApp.getActiveDocument();
iterateSections(doc, function(section, sectionIndex, isFirstPageSection) {
if (!("getParagraphs" in section)) {
// as we're using some undocumented API, adding this to avoid cryptic
// messages upon possible API changes.
throw new Error("An API change has caused this script to stop " +
"working.\n" +
"Section #" + sectionIndex + " of type " +
section.getType() + " has no .getParagraphs() method. " +
"Stopping script.");
}
section.getParagraphs().forEach(function(par) {
// skip empty paragraphs
if (par.getNumChildren() == 0) {
return;
}
// go over all text elements in paragraph / list-item
for (var el=par.getChild(0); el!=null; el=el.getNextSibling()) {
if (el.getType() != DocumentApp.ElementType.TEXT) {
continue;
}
// go over all styling segments in text element
var attributeIndices = el.getTextAttributeIndices();
var lastLink = null;
attributeIndices.forEach(function(startOffset, i, attributeIndices) {
var url = el.getLinkUrl(startOffset);
if (url != null) {
// we hit a link
var endOffsetInclusive = (i+1 < attributeIndices.length?
attributeIndices[i+1]-1 : null);
// check if this and the last found link are continuous
if (mergeAdjacent && lastLink != null && lastLink.url == url &&
lastLink.endOffsetInclusive == startOffset - 1) {
// this and the previous style segment are continuous
lastLink.endOffsetInclusive = endOffsetInclusive;
return;
}
lastLink = {
"section": section,
"isFirstPageSection": isFirstPageSection,
"paragraph": par,
"textEl": el,
"startOffset": startOffset,
"endOffsetInclusive": endOffsetInclusive,
"url": url
};
links.push(lastLink);
}
});
}
});
});
return links;
}
/**
* Calls the given function for each section of the document (body, header,
* etc.). Sections are children of the DocumentElement object.
*
* #param {Document} doc The Document object (such as the one obtained via
* a call to DocumentApp.getActiveDocument()) with the sections to iterate
* over.
* #param {Function} func A callback function which will be called, for each
* section, with the following arguments (in order):
* - {ContainerElement} section - the section element
* - {Number} sectionIndex - the child index of the section, such that
* doc.getBody().getParent().getChild(sectionIndex) == section.
* - {Boolean} isFirstPageSection - whether the section is a first-page
* header/footer section.
*/
function iterateSections(doc, func) {
// get the DocumentElement interface to iterate over all sections
// this bit is undocumented API
var docEl = doc.getBody().getParent();
var regularHeaderSectionIndex = (doc.getHeader() == null? -1 :
docEl.getChildIndex(doc.getHeader()));
var regularFooterSectionIndex = (doc.getFooter() == null? -1 :
docEl.getChildIndex(doc.getFooter()));
for (var i=0; i<docEl.getNumChildren(); ++i) {
var section = docEl.getChild(i);
var sectionType = section.getType();
var uniqueSectionName;
var isFirstPageSection = (
i != regularHeaderSectionIndex &&
i != regularFooterSectionIndex &&
(sectionType == DocumentApp.ElementType.HEADER_SECTION ||
sectionType == DocumentApp.ElementType.FOOTER_SECTION));
func(section, i, isFirstPageSection);
}
}

I was playing around and incorporated #Mogsdad's answer -- here's the really complicated version:
var _ = Underscorejs.load(); // loaded via http://googleappsdeveloper.blogspot.com/2012/11/using-open-source-libraries-in-apps.html, rolled my own
var ui = DocumentApp.getUi();
// #region --------------------- Utilities -----------------------------
var gDocsHelper = (function(P, un) {
// heavily based on answer https://stackoverflow.com/a/18731628/1037948
var updatedLinkText = function(link, offset) {
return function() { return 'Text: ' + link.getText().substring(offset,100) + ((link.getText().length-offset) > 100 ? '...' : ''); }
}
P.updateLink = function updateLink(link, oldText, newText, start, end) {
var oldLink = link.getLinkUrl(start);
if(0 > oldLink.indexOf(oldText)) return false;
var newLink = oldLink.replace(new RegExp(oldText, 'g'), newText);
link.setLinkUrl(start || 0, (end || oldLink.length), newLink);
log(true, "Updating Link: ", oldLink, newLink, start, end, updatedLinkText(link, start) );
return { old: oldLink, "new": newLink, getText: updatedLinkText(link, start) };
};
// moving this reused block out to 'private' fn
var updateLinkResult = function(text, oldText, newText, link, urls, sidebar, updateResult) {
// and may as well update the link while we're here
if(false !== (updateResult = P.updateLink(text, oldText, newText, link.start, link.end))) {
sidebar.append('<li>' + updateResult['old'] + ' → ' + updateResult['new'] + ' at ' + updateResult['getText']() + '</li>');
}
urls.push(link.url); // so multiple links get added to list
};
P.updateLinksMenu = function() {
// https://developers.google.com/apps-script/reference/base/prompt-response
var oldText = ui.prompt('Old link text to replace').getResponseText();
var newText = ui.prompt('New link text to replace with').getResponseText();
log('Replacing: ' + oldText + ', ' + newText);
var sidebar = gDocUiHelper.createSidebar('Update All Links', '<h3>Replacing</h3><p><code>' + oldText + '</code> → <code>' + newText + '</code></p><hr /><ol>');
// current doc available to script
var doc = DocumentApp.getActiveDocument().getBody();//.getActiveSection();
// Search until a link is found
var links = P.findAllElementsFor(doc, function(text) {
var i = -1, n = text.getText().length, link = false, url, urls = [], updateResult;
// note: the following only gets the FIRST link in the text -- while(i < n && !(url = text.getLinkUrl(i++)));
// scan the text element for links
while(++i < n) {
// getLinkUrl will continue to get a link while INSIDE the stupid link, so only do this once
if(url = text.getLinkUrl(i)) {
if(false === link) {
link = { start: i, end: -1, url: url };
// log(true, 'Type: ' + text.getType(), 'Link: ' + url, function() { return 'Text: ' + text.getText().substring(i,100) + ((n-i) > 100 ? '...' : '')});
}
else {
link.end = i; // keep updating the end position until we leave
}
}
// just left the link -- reset link tracking
else if(false !== link) {
// and may as well update the link while we're here
updateLinkResult(text, oldText, newText, link, urls, sidebar);
link = false; // reset "counter"
}
}
// once we've reached the end of the text, must also check to see if the last thing we found was a link
if(false !== link) updateLinkResult(text, oldText, newText, link, urls, sidebar);
return urls;
});
sidebar.append('</ol><p><strong>' + links.length + ' links reviewed</strong></p>');
gDocUiHelper.attachSidebar(sidebar);
log(links);
};
P.findAllElementsFor = function(el, test) {
// generic utility function to recursively find all elements; heavily based on https://stackoverflow.com/a/18731628/1037948
var results = [], searchResult = null, i, result;
// https://developers.google.com/apps-script/reference/document/body#findElement(ElementType)
while (searchResult = el.findElement(DocumentApp.ElementType.TEXT, searchResult)) {
var t = searchResult.getElement().editAsText(); // .asParagraph()
// check to add to list
if(test && (result = test(t))) {
if( _.isArray(result) ) results = results.concat(result); // could be big? http://jsperf.com/self-concatenation/
else results.push(result);
}
}
// recurse children if not plain text item
if(el.getType() !== DocumentApp.ElementType.TEXT) {
i = el.getNumChildren();
var result;
while(--i > 0) {
result = P.findAllElementsFor(el.getChild(i));
if(result && result.length > 0) results = results.concat(result);
}
}
return results;
};
return P;
})({});
// really? it can't handle object properties?
function gDocsUpdateLinksMenu() {
gDocsHelper.updateLinksMenu();
}
gDocUiHelper.addMenu('Zaus', [ ['Update links', 'gDocsUpdateLinksMenu'] ]);
// #endregion --------------------- Utilities -----------------------------
And I'm including the "extra" utility classes for creating menus, sidebars, etc below for completeness:
var log = function() {
// return false;
var args = Array.prototype.slice.call(arguments);
// allowing functions delegates execution so we can save some non-debug cycles if code left in?
if(args[0] === true) Logger.log(_.map(args, function(v) { return _.isFunction(v) ? v() : v; }).join('; '));
else
_.each(args, function(v) {
Logger.log(_.isFunction(v) ? v() : v);
});
}
// #region --------------------- Menu -----------------------------
var gDocUiHelper = (function(P, un) {
P.addMenuToSheet = function addMenu(spreadsheet, title, items) {
var menu = ui.createMenu(title);
// make sure menu items are correct format
_.each(items, function(v,k) {
var err = [];
// provided in format [ [name, fn],... ] instead
if( _.isArray(v) ) {
if ( v.length === 2 ) {
menu.addItem(v[0], v[1]);
}
else {
err.push('Menu item ' + k + ' missing name or function: ' + v.join(';'))
}
}
else {
if( !v.name ) err.push('Menu item ' + k + ' lacks name');
if( !v.functionName ) err.push('Menu item ' + k + ' lacks function');
if(!err.length) menu.addItem(v.name, v.functionName);
}
if(err.length) {
log(err);
ui.alert(err.join('; '));
}
});
menu.addToUi();
};
// list of things to hook into
var initializers = {};
P.addMenu = function(menuTitle, menuItems) {
if(initializers[menuTitle] === un) {
initializers[menuTitle] = [];
}
initializers[menuTitle] = initializers[menuTitle].concat(menuItems);
};
P.createSidebar = function(title, content, options) {
var sidebar = HtmlService
.createHtmlOutput()
.setTitle(title)
.setWidth( (options && options.width) ? width : 350 /* pixels */);
sidebar.append(content);
if(options && options.on) DocumentApp.getUi().showSidebar(sidebar);
// else { sidebar.attach = function() { DocumentApp.getUi().showSidebar(this); }; } // should really attach to prototype...
return sidebar;
};
P.attachSidebar = function(sidebar) {
DocumentApp.getUi().showSidebar(sidebar);
};
P.onOpen = function() {
var spreadsheet = SpreadsheetApp.getActive();
log(initializers);
_.each(initializers, function(v,k) {
P.addMenuToSheet(spreadsheet, k, v);
});
};
return P;
})({});
// #endregion --------------------- Menu -----------------------------
/**
* A special function that runs when the spreadsheet is open, used to add a
* custom menu to the spreadsheet.
*/
function onOpen() {
gDocUiHelper.onOpen();
}

Had some trouble getting Mogsdad's solution to work. Specifically it misses links which end their parent element so there isn't a trailing non-link character to terminate it. I've implemented something which addresses this and returns a standard range element. Sharing here incase someone finds it useful.
function getAllLinks(element) {
var rangeBuilder = DocumentApp.getActiveDocument().newRange();
// Parse the text iteratively to find the start and end indices for each link
if (element.getType() === DocumentApp.ElementType.TEXT) {
var links = [];
var string = element.getText();
var previousUrl = null; // The URL of the previous character
var currentLink = null; // The latest link being built
for (var charIndex = 0; charIndex < string.length; charIndex++) {
var currentUrl = element.getLinkUrl(charIndex);
// New URL means create a new link
if (currentUrl !== null && previousUrl !== currentUrl) {
if (currentLink !== null) links.push(currentLink);
currentLink = {};
currentLink.url = String(currentUrl);
currentLink.startOffset = charIndex;
}
// In a URL means extend the end of the current link
if (currentUrl !== null) {
currentLink.endOffsetInclusive = charIndex;
}
// Not in a URL means close and push the link if ready
if (currentUrl === null) {
if (currentLink !== null) links.push(currentLink);
currentLink = null;
}
// End the loop and go again
previousUrl = currentUrl;
}
// Handle the end case when final character is a link
if (currentLink !== null) links.push(currentLink);
// Convert the links into a range before returning
links.forEach(function(link) {
rangeBuilder.addElement(element, link.startOffset, link.endOffsetInclusive);
});
}
// If not a text element then recursively get links from child elements
else if (element.getNumChildren) {
for (var i = 0; i < element.getNumChildren(); i++) {
rangeBuilder.addRange(getAllLinks(element.getChild(i)));
}
}
return rangeBuilder.build();
}

You are right ... search and replace is not applicable here.
Use setLinkUrl() https://developers.google.com/apps-script/reference/document/container-element#setLinkUrl(String)
Basically you have to iterate through the elements recursively (elements can contain elements) and for each
use getLinkUrl() to get the oldText
if not null , setLinkUrl(newText) .... leaves displayed text unchanged

This Excel macro lists the links from a Word doc. You'd need to copy your data into a Word doc first.
Sub getLinks()
Dim wApp As Word.Application, wDoc As Word.Document
Dim i As Integer, r As Range
Const filePath = "C:\test\test.docx"
Set wApp = CreateObject("Word.Application")
'wApp.Visible = True
Set wDoc = wApp.Documents.Open(filePath)
Set r = Range("A1")
For i = 1 To wDoc.Hyperlinks.Count
r = wDoc.Hyperlinks(i).Address
Set r = r.Offset(1, 0)
Next i
wApp.Quit
Set wDoc = Nothing
Set wApp = Nothing
End Sub

Here's a quick and dirty way to accomplish the same goal with no scripting:
From Google Docs, save the document in RTF format.
In your editor of choice, edit the links in the RTF file (in my case, I wanted to modify all the hyperlinks, so I used Emacs and regexp-replace). Save the file when you're done.
Create a fresh, new Google Doc, and from the menu, select File>Open and open the RTF file. Docs will convert your edited RTF file back into a proper Google Doc, restoring all formatting.
Google Docs' RTF format is pretty complete--I haven't noticed any loss of fidelity in making the round trip, and it has the advantage of fully exposing all the hyperlinks, formatting, and everything else about the document in a form that's easy to edit and to apply regex tools to.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Export multiple html tables to Excel - html

Related

xpath in apps script?

Google app script - getting all hyperlinks from document [duplicate]

Can Google apps script be used to randomize page order on Google forms?

Scrape site to report css selector occurrence in HTML

Get All Links in a Document

Categories

Resources