Extract href attribute from HTML text in Google Sheets - html

I have about 3000 rows in my Google Spreadsheet, and each row contains data about one article from our website. One column (e.g. A:A) stores formatted text as HTML. I need to extract all URLs inside the href="" attributes from this column and work with them later. (The result could be an array, or a text string separated by commas or spaces, in column B.)
I tried to use the REGEXEXTRACT formula, but it gives me only the first result. Then I tried REGEXREPLACE, but I'm unable to write a proper expression that returns only the URLs.
I know that regex is not the proper way to get anything out of HTML. Is there another way to extract these values from the HTML text in one cell?
Link to sample data: Google Spreadsheet
Thank you in advance! I'm a real newbie here, and in scripting, parsing, etc. too.

How about these samples? I used href=\"(.*?)\" to retrieve the URLs. A sample on regex101.com is here.
1. Using Google Sheets functions:
=TEXTJOIN(CHAR(10),TRUE,ARRAYFORMULA(IFERROR(REGEXEXTRACT(SPLIT(a1,">"),"href="&CHAR(34)&"(.*?)"&CHAR(34)))))
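Since REGEXEXTRACT returns only the first match, the cell content is first split on ">" with SPLIT so that each fragment contains at most one tag, and REGEXEXTRACT is then applied to each fragment. IFERROR drops the fragments without an href, and TEXTJOIN joins the URLs with line breaks.
2. Using Google Apps Script: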
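function myFunction(str) {
  // Global regex so that exec() advances through every href in the string.
  var re = /href="(.*?)"/g;
  var urls = [];
  var res; // declared locally; the original leaked it as an implicit global
  while ((res = re.exec(str)) !== null) {
    urls.push(res[1]);
  }
  // One URL per line, matching the output of the formula above.
  return urls.join("\n");
}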
This script can be used as a custom function. To use it, enter =myFunction(A1) in a cell.
The result is the same as with the method above.
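As a variation on the script above, here is a minimal sketch that uses matchAll() and also accepts single-quoted attributes. The function name extractHrefs is hypothetical, and the pattern assumes reasonably well-formed markup; it is still a regex, not an HTML parser.

```javascript
// Hypothetical helper: returns every href value found in an HTML string.
// Assumes attribute values are wrapped in matching single or double quotes.
function extractHrefs(html) {
  // Group 1 captures the opening quote, \1 requires the same closing quote,
  // group 2 is the URL itself.
  const re = /href\s*=\s*(["'])(.*?)\1/gi;
  return Array.from(String(html).matchAll(re), m => m[2]);
}
```

Used as a custom function, returning extractHrefs(str).join("\n") would give one URL per line, like the script above.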
If I misunderstand your question, I'm sorry.

Related

Format a Google Sheets cell in numerical formatting 000 via Apps Script

I'm looking to set a column to the format 000, which will display leading zeros.
So, if a cell displays "3", I want the script to make it display "003".
This column is located in BDD tab, 13th column starting from the second row.
function FormattingGpeTrait() {
  const sheet = SpreadsheetApp.getActiveSheet().getSheetByName("BDD").getRange(2,13)
  sheet.setNumberFormat('000')
}
Modification points:
getSheetByName is a method of Class Spreadsheet. Your script calls it on Class Sheet (the object returned by getActiveSheet()), which causes an error. This has already been mentioned in the comments.
From "13th column starting from the second row", I thought that you might want to set the number format 000 on "M2:M". Your script sets the number format on only the single cell "M2".
If you want to set the number format on the cells "M2:M" of the sheet named "BDD", how about the following modification?
Modified script:
function FormattingGpeTrait() {
const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("BDD");
sheet.getRange("M2:M" + sheet.getLastRow()).setNumberFormat('000');
}
When you run this script, the number format "000" is set on the cells from "M2" down to the last row with data in the "BDD" sheet.
If you want to set the number format on the whole open-ended range "M2:M" instead, please change getRange("M2:M" + sheet.getLastRow()) to getRange("M2:M").
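One edge case with the getLastRow() approach: on a sheet that has only a header row, the range string becomes "M2:M1", which is probably not what is intended. Below is a minimal sketch of a guard for that case. The helper name setColumnFormat and its parameters are hypothetical; the sheet argument is assumed to behave like an Apps Script Sheet (getLastRow(), getRange(a1Notation)).

```javascript
// Hypothetical helper; `sheet` is expected to behave like an Apps Script Sheet.
// Applies `format` to one column, from `startRow` down to the last row with data.
function setColumnFormat(sheet, columnLetter, startRow, format) {
  const lastRow = sheet.getLastRow();
  if (lastRow < startRow) {
    return false; // no data rows yet, so there is nothing to format
  }
  const a1 = columnLetter + startRow + ":" + columnLetter + lastRow;
  sheet.getRange(a1).setNumberFormat(format);
  return true;
}

// In Apps Script this might be called as:
// setColumnFormat(SpreadsheetApp.getActiveSpreadsheet().getSheetByName("BDD"), "M", 2, "000");
```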
References:
getActiveSpreadsheet()
getSheetByName(name)
The easiest way to get a range on a named sheet is to include the sheet name in the range reference, like this:
function formattingGpeTrait() {
  // M2:M, since the question asks for the 13th column starting from the second row.
  SpreadsheetApp.getActive().getRange('BDD!M2:M').setNumberFormat('000');
}
I think that you can't use the standard number formats, as they all evaluate your value as a real number, and '003' is numerically equal to '3'.
You have two real options: either store the value in a Text column as "003", or prepend an apostrophe ("'003"), which is basically the same as storing it as Text, but the column can remain numeric.
You can also create a custom number format for a cell/column to do this, but I am not certain how to accomplish this programmatically. Basically, this is still going to end up like the Text variations I mention above, only you have a named format you can call. The data will still be stored as Text.

How to use custom function ExtractAllRegex() as an array formula? [Google Sheets]

I'm using @Wiktor Stribiżew's custom function ExtractAllRegex(). The script extracts all occurrences of a regex pattern. In the example, I extract all words in column A that start with "v_".
Here is a Google Sheet showing what I'm trying to do.
The original strings are stored in column A. The custom function/the matches are in column B.
Wiktor's function works great for single cells. It also works great when I manually drag the formula down the column.
Here's Wiktor's original code:
function ExtractAllRegex(input, pattern,groupId,separator) {return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);}
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
The question is, how do I turn column B into a working array formula? Or, perhaps better, how do I modify Wiktor's script so that it accepts a range instead and auto-fills down column B?
I updated your script to:
function ExtractAllRegex(input, pattern,groupId,separator) {
return input.map ? input.map( inp => ExtractAllRegex(inp, pattern, groupId, separator)) :
Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
and changed the formula in B2 to
=ExtractAllRegex(A2:A13,"(v_.+?\b)",0," ")
See if that works for you?
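A range passed to a custom function arrives as a 2D array, and blank or numeric cells do not arrive as strings. As a hedged sketch of the same recursive idea, the variant below converts each cell to a string before matching. The name extractAllRegexSafe is hypothetical, and treating non-string cells as text is an assumption about the desired behavior.

```javascript
// Hypothetical variant of ExtractAllRegex that tolerates ranges, blanks and numbers.
function extractAllRegexSafe(input, pattern, groupId, separator) {
  // A range arrives as an array of rows; recurse until a single cell value is reached.
  if (Array.isArray(input)) {
    return input.map(item => extractAllRegexSafe(item, pattern, groupId, separator));
  }
  // Assumption: empty or non-string cells are treated as text rather than raising an error.
  const text = input === null || input === undefined ? "" : String(input);
  return Array.from(text.matchAll(new RegExp(pattern, "g")), m => m[groupId]).join(separator);
}
```

The formula would stay the same apart from the function name, e.g. =extractAllRegexSafe(A2:A13,"(v_.+?\b)",0," ").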

Google Apps Script: count occurrences of string in HTML query

I am trying to get the count of occurrences of a string in the text fetched from a website
var html = UrlFetchApp.fetch('https://www.larvalabs.com/cryptopunks/details/0000').getContentText();
var offers = html.match('Offered');
Logger.log(offers);
However, I get the following data returned: [Offered]
I tried several methods, but I cannot find much documentation on which ones I can use for this task, which sounds simple.
I should add that I tried to parse the page with XmlService, but some errors in the HTML make it fail.
As one approach, how about using matchAll()? Without the g flag, match() stops at the first occurrence, which is why only [Offered] was returned.
Modified script:
var html = UrlFetchApp.fetch('https://www.larvalabs.com/cryptopunks/details/0000').getContentText();
var offers = [...html.matchAll('Offered')]; // or [...html.matchAll(/Offered/g)]
Logger.log(offers.length);
When I tested the above, 3 was returned.
Note:
This match is case-sensitive, so please be careful about upper- and lowercase letters.
Reference:
matchAll()
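If the search text may contain regex metacharacters, or the count should ignore case, a small helper can wrap matchAll(). This is a minimal sketch; the name countOccurrences and its ignoreCase parameter are hypothetical.

```javascript
// Hypothetical helper: counts non-overlapping occurrences of a literal string.
function countOccurrences(text, needle, ignoreCase = false) {
  if (needle === "") {
    return 0; // an empty pattern would otherwise match at every position
  }
  // Escape regex metacharacters so that the needle is matched literally.
  const escaped = needle.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(escaped, ignoreCase ? "gi" : "g");
  return [...text.matchAll(re)].length;
}
```

With the script above, this would be called as countOccurrences(html, 'Offered').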

Remove characters from a cell with imported API data in Google Sheets - or format api import

I'll start by saying that my knowledge on using APIs is extremely limited. I'm impressed I've gotten as far as I have on this.
I've created a workbook in Google Sheets with imported data from the iexcloud API, which I'm using for data on stocks.
The requests have a cell reference in them so they update whenever a different symbol is selected.
So far, everything I've needed to request from it has the option to format as csv, so I can get cells with just the values.
However, this last thing I want doesn't have that option, so the whole response is wrapped in ["" ].
That really messes up what I need it for.
Here's an example
["PSA" CCI SHO ACC]
with each symbol being in its own cell.
I'm using the Peer Groups request.
A sample request:
> https://sandbox.iexapis.com/stable/stock/aapl/peers?token=Tsk_2b4c7c6fd98542f6a99f904cb7a3e721
Using Find and Replace doesn't work. I'm assuming because it's imported.
I need to use the cells with those symbols: PSA, CCI, SHO, ACC to reference in another request.
I recreated this in another Google Sheet that you can edit. The section in question is highlighted in blue:
https://docs.google.com/spreadsheets/d/1BQ6FBD0S2YkDtDGZGIkDmQoKrQT4VmVDjuNsgV4mrXM/edit?usp=sharing
So I'm wondering if there's a way to have [ " ] automatically removed from any cells in that row, or if I copy and paste the values only, to have the values updated when the original cells are updated with new symbols (since I can have those characters removed in that row)
Or if there's a way I can format the response in sheets.
Any ideas?
I believe your goal is as follows.
You want to convert ["CCI" SBAC CTL TDS RCI RCI-A-CT DTEGY] to CCI SBAC CTL TDS RCI RCI-A-CT DTEGY using the built-in functions of Google Sheets.
Modified formula:
=ARRAYFORMULA(REGEXREPLACE(IMPORTDATA("https://cloud.iexapis.com/stable/stock/"&B3&"/peers?format=psv&token=###"),"[\[\]""]",""))
In this modified formula, the characters [, ] and " are removed using REGEXREPLACE.
Please replace ### with your token in the above formula.
Result:
In this result, the values retrieved with =IMPORTDATA("https://cloud.iexapis.com/stable/stock/"&B3&"/peers?format=psv&token=###") are used as the input, so the formula in cell "C9" is =ARRAYFORMULA(REGEXREPLACE(C6:I6,"[\[\]""]","")). The modified formula above, which wraps IMPORTDATA directly, can be used instead.
Note:
In this answer, I removed your token because I thought that it might be your personal information.
Reference:
REGEXREPLACE
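For readers who prefer a script, the same character class can be applied in Apps Script. This is only a sketch: the helper name stripJsonPunctuation is hypothetical, and it assumes the cell values look like the example in the question.

```javascript
// Hypothetical helper: removes [, ] and " from each value in a row of cells.
function stripJsonPunctuation(cells) {
  // Same character class as in the REGEXREPLACE formula: [ ] and the double quote.
  return cells.map(cell => String(cell).replace(/[\[\]"]/g, ""));
}
```

For the row in the question, stripJsonPunctuation(['["PSA"', 'CCI', 'SHO', 'ACC]']) would yield PSA, CCI, SHO and ACC.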

Issue with body.replaceText() in Google Docs

I am populating a Google Doc template based on a Google Form submission. Upon submit, the program copies the template Google Doc, captures the first item from the Google Form which is always the person's name (because this is a required field), and then replaces {{Name}} in the new file with the entered name using:
var name = itemResponses[0].getResponse();
body.replaceText('{{Name}}', name);
That works correctly. But then I iterate through the rest of the item response and not all the items are required, so I use a lookup table in a Google sheet. The loop takes the item id in the item response and then looks up the text that the response will replace. Then the program does:
var textToReplace //this value is from column B in the Google Sheet lookup table
var newText //this value is the entered response from the Google Form
body.replaceText(textToReplace, newText);
When I do this, I get an "Exception: Invalid argument: searchPattern" error. Why do these two body.replaceText() calls behave differently? They both look for a variable within double curly brackets in the Google Doc, but only one of them works.
To be clear, this worked correctly for the last couple of months and only recently stopped working (maybe Google changed something?). My hypothesis is that it has to do with a regex pattern in the first parameter of replaceText.
The "searchPattern" error is a good clue: it tells us that the value of textToReplace is not a valid search pattern. Since the code hasn't changed, the lookup table in your spreadsheet or the fields on the form probably have.
One of your lookups is returning a value that isn't a valid search pattern. Perhaps it is returning null or an empty string?
You can get more information by using console.log to write debug info to the Stackdriver logging interface provided by Google, like so:
console.log('text to replace: "'+textToReplace+'"'); //this value is from column B in the Google Sheet lookup table
console.log('value: "'+newText+'"'); //this value is the entered response from the Google Form
body.replaceText(textToReplace, newText);
Then, to view the logs, select "stackdriver logging" from the View menu.
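Once the offending value is identified, the lookup result can be checked before it reaches replaceText. The following is a minimal sketch, assuming the problem is an empty lookup value or unescaped regex metacharacters; the helper name toSearchPattern is hypothetical, and whether escaping is needed for your placeholders is an assumption to verify against the logs.

```javascript
// Hypothetical helper: turns a lookup value into a literal search pattern,
// or returns null when there is nothing usable to search for.
function toSearchPattern(raw) {
  if (raw === null || raw === undefined) {
    return null;
  }
  const text = String(raw).trim();
  if (text === "") {
    return null; // an empty lookup result should be skipped, not searched for
  }
  // Escape regex metacharacters so that the placeholder is matched literally.
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// In the loop, this might be used as:
// var pattern = toSearchPattern(textToReplace);
// if (pattern !== null) { body.replaceText(pattern, newText); }
```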