Search in DriveApp doesn't find many files - google-apps-script

I'm working with DriveApp in Google Apps Script and trying to find documents that contain a certain word or phrase:
function SearchFile(Phrase, FolderID) {
  var SearchString = 'fullText contains "' + Phrase + '" and "' + FolderID + '" in parents';
  var files = DriveApp.searchFiles(SearchString);
  return Output(files);
}
That works quite well so far, but it only finds a few files, and I don't understand why. I have about 30 documents in that folder that all contain the word "Hello", but the search only finds 8 of them. The same happens with other words.
Is there a bug in the search?

Not enough information, but as you only mention that many files "contain" the word Hello, I will assume that the word Hello can appear anywhere in the name in your case.
The first things that come to mind are the following two notes from the documentation.
This one applies when using the name field:
The contains operator only performs prefix matching for a name. For
example, the name "HelloWorld" would match for name contains 'Hello'
but not name contains 'World'.
This one applies when using the fullText field:
The contains operator only performs matching on entire string tokens
for fullText. For example, if the full text of a doc contains the
string "HelloWorld" only the query fullText contains 'HelloWorld'
returns a result. Queries such as fullText contains 'Hello' do not
return results in this scenario.
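If the token-matching rule quoted above is what hides the missing files, one possible workaround is to skip the fullText query and do a plain substring check yourself. The following is only a minimal sketch: filterBySubstring and the { name, text } object shape are hypothetical names chosen for illustration, and the commented wiring assumes the folder holds native Google Docs.

```javascript
// Pure helper: keeps the documents whose text contains the phrase anywhere,
// including inside a longer word such as "HelloWorld".
// The { name, text } shape is an assumption made for this sketch.
function filterBySubstring(docs, phrase) {
  var needle = String(phrase).toLowerCase();
  return docs.filter(function (doc) {
    return String(doc.text).toLowerCase().indexOf(needle) !== -1;
  });
}

// Hypothetical Apps Script wiring (not executed here): read the text of each
// Google Doc in the folder, then filter locally.
// function searchFolderLocally(phrase, folderId) {
//   var it = DriveApp.getFolderById(folderId).getFilesByType(MimeType.GOOGLE_DOCS);
//   var docs = [];
//   while (it.hasNext()) {
//     var file = it.next();
//     docs.push({
//       name: file.getName(),
//       text: DocumentApp.openById(file.getId()).getBody().getText()
//     });
//   }
//   return filterBySubstring(docs, phrase);
// }
```

Opening every document is much slower than a Drive query, so this is only reasonable for small folders like the roughly 30 files mentioned in the question.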

Google Drive API: find a file by name using wildcards?

When using the Google Drive API v3, can one search for a file by its name using wildcards or regular expressions? The docs don't mention anything.
I am trying to match a set of files whose names have the format
backup_YYYY-MM-DD-XXXX_NameOfWebsite_xxxxxxxxxx.zip
and I am wondering what the best way is to construct a pattern that matches it. Of course, I could follow the docs and just do something like:
q="name contains 'backup' and name contains 'NameOfWebsite'"
But if I need to match a different pattern, or something with more than 2 distinctive strings in its filename ("backup_" and "NameOfWebsite"), you can quickly see what a pain it would be to construct a query that way:
q="name contains 'string1' and name contains 'string2' and name contains...
Answer:
You can't use a wildcard in the middle of a file name when making a Files: list request with a q parameter.
More Information:
The name field only takes three operators - =, != and contains:
The = operator is regular equivalence and with this you can not use a wildcard.
name = 'backup*' will return no results.
The != operator tests for non-equivalence. It is not relevant here, but it does the opposite of =.
The contains operator. You can use wildcards with this, but with restrictions:
name contains 'backup*' will return all files with filenames starting with the string backup.
name contains '*NameOfWebsite' will return all files with filenames that have a word starting with the string NameOfWebsite. The file name backup0194364-NameOfWebsite.zip will not be returned, because there is no space before the string.
Therefore, the only way for this to work is if you do it the way you have already started to realise; string chaining:
name contains 'backup' and name contains 'NameOfWebsite' and name contains ...
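To make the chaining less painful, the query can be generated from a list of fragments, and the exact name format can then be checked on the client with a regular expression. This is a sketch under assumptions: buildNameQuery, filterBackupNames and BACKUP_PATTERN are hypothetical names, the escaping follows the Drive search documentation as I understand it, and the regular expression guesses that the XXXX and xxxxxxxxxx parts contain no underscore.

```javascript
// Builds "name contains '...' and name contains '...'" from a list of fragments.
// Backslashes and single quotes are escaped for the Drive query syntax.
function buildNameQuery(parts) {
  return parts.map(function (part) {
    var escaped = String(part).replace(/\\/g, '\\\\').replace(/'/g, "\\'");
    return "name contains '" + escaped + "'";
  }).join(' and ');
}

// Assumed shape of backup_YYYY-MM-DD-XXXX_NameOfWebsite_xxxxxxxxxx.zip;
// adjust the two [^_]+ groups once the real format of those parts is known.
var BACKUP_PATTERN = /^backup_\d{4}-\d{2}-\d{2}-[^_]+_NameOfWebsite_[^_]+\.zip$/;

// Keeps only the names that match the full pattern.
function filterBackupNames(names) {
  return names.filter(function (name) {
    return BACKUP_PATTERN.test(name);
  });
}
```

The idea is to send the broad query from buildNameQuery(['backup', 'NameOfWebsite']) as the q parameter and then run filterBackupNames over the returned file names.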
References:
Files: list | Google Drive API | Google Developers
Search for files | Google Drive API | Google Developers

Extract href attribute from HTML text in Google Sheets

I have about 3000 rows in my Google Spreadsheet, and each row contains data about one article from our website. One column (e.g. A:A) stores formatted text in HTML. I need to extract all URLs inside the href="" attributes from this column and work with them later. (The result could be an array or a text string separated by commas or spaces in column B.)
I tried to use the REGEXEXTRACT formula, but it gives me only the first result. Then I tried to use REGEXREPLACE, but I'm unable to write a proper expression to get only the URLs.
I know that regex is not the proper way to get anything out of HTML. Is there another way to extract these values from HTML text in one cell?
Link to sample data: Google Spreadsheet
Thank you in advance! I'm a real newbie here, and in scripting, parsing etc. too.
How about these samples? I used href=\"(.*?)\" for retrieving the URLs. A sample on regex101.com is here.
1. Using Google spreadsheets functions :
=TEXTJOIN(CHAR(10),TRUE,ARRAYFORMULA(IFERROR(REGEXEXTRACT(SPLIT(a1,">"),"href="&CHAR(34)&"(.*?)"&CHAR(34)))))
In this case, since REGEXEXTRACT retrieves only the first matched string, the cell data is first separated by SPLIT, and then a URL is retrieved from each piece by REGEXEXTRACT.
2. Using Google Apps Script :
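function myFunction(str) {
  // Global regex so that exec() walks through every href in the string.
  var re = /href=\"(.*?)\"/g;
  var result = "";
  var res; // declared locally instead of leaking a global
  while ((res = re.exec(str)) !== null) {
    result += res[1] + "\n";
  }
  // Drop the trailing newline.
  return result.slice(0, -1);
}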
This script can be used as a custom function. To use it, put =myFunction(A1) in a cell.
Result: the same as with the method above.
If I misunderstand your question, I'm sorry.
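With about 3000 rows, it may be more convenient if the custom function also accepts a whole range, since Sheets passes a range to a custom function as a two-dimensional array. Below is a minimal sketch of that variant. EXTRACT_HREFS and extractHrefsFromText are hypothetical names, and the regex is the same simple one as above, so it shares its limitations with unusual HTML.

```javascript
// Returns all href values found in one HTML string, joined by newlines.
function extractHrefsFromText(html) {
  var re = /href="(.*?)"/g;
  var urls = [];
  var match;
  while ((match = re.exec(String(html))) !== null) {
    urls.push(match[1]);
  }
  return urls.join('\n');
}

// Custom function: accepts a single cell value or a range (2D array).
function EXTRACT_HREFS(input) {
  if (Array.isArray(input)) {
    return input.map(function (row) {
      return row.map(function (cell) {
        return extractHrefsFromText(cell);
      });
    });
  }
  return extractHrefsFromText(input);
}
```

In a sheet this would be used as =EXTRACT_HREFS(A1) for one cell or =EXTRACT_HREFS(A1:A3000) for the whole column.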

KNIME manually modify node settings

I have a wide table filled with ID numbers (starting with a variable number of zeros) and I want to import it into KNIME but the columns are automatically detected as Integer. I tried to manually modify the settings.xml file corresponding to the import node in order to enforce a String type import without spending my afternoon clicking on each column, every time I get a new file. The entry is now:
<entry key="cell_class" type="xstring" value="org.knime.core.data.def.StringCell"/>
I get an error when re-opening the workflow. So I also modified the MissValuePattern entry to:
<entry key="MissValuePattern" type="xstring" value="?"/>
Still getting an error when re-opening the workflow. I don't see any difference between a string and an integer column so I'm a bit stuck.
Use the Line Reader node to read each line, one at a time, into a single column. Then attach it to a Cell Splitter node and use a space character (or whatever the delimiter is) to separate the columns. Select the "as new columns" radio button and the new columns will have the same type as the original column, i.e., String.
You can manually execute arbitrary Java code to create a new column or to replace an existing one using the Java Snippet (simple) or Java Snippet node. For example, you can concatenate the values of the Integer columns Col0, Col1 and Col2 as:
String myCol = $Col0$ + " " + $Col1$ + " " + $Col2$;
return myCol;
//or return $Col0$ + " " + $Col1$ + " " + $Col2$;
In general, it is a really useful method for creating new parameters of a dataset.
Use the Number to String node. You can select "always include all columns", and it should automatically select all columns every time you import a new file.
Probably it would work better if you were using a Line Reader and split the columns based on your preferred delimiter.
You can specify the column type for each column in the dialog of the File Reader node. To do so, open the dialog and double-click the header of the column you want to change. A small window will open in which you can specify the type of the column. Change it from Integer to String.

Drive Files:List query returns no results when searching for a number in a title

File query returns no result when searching for a number in a title. For example, if file title is "File.123", then search query "title contains 'File'" is successful, but query "title contains '123'" is not.
The problem can be easily reproduced in API explorer.
Also, the IndexableText field is no longer searchable, i.e. queries return no results when searching for values in that field.
Thank you.
My mistake. It seems that the search matches whole words, not character sequences, so 123 would never be found. If the file name were "File 123", the search would return results.

Difference between the key value in the address line and the getId() value in Google Script

I wanted to ask what the difference is between the value in the address line and the ID I get when I use getId().
For example, for one document the getId() value is:
t8K_TLQPmKzgB72pY4TblUg
while in the address line the key is:
0Amu7sNvd2IoudDhLX1RMUVBtS3pnQjcycFk0VGJsVWc
What I have figured out so far is that when you encode the getId() value in base64, you get more or less the last part of the key in the address line
(base64Encode(t8K_TLQPmKzgB72pY4TblUg) = dDhLX1RMUVBtS3pnQjcycFk0VGJsVWc=).
But I still don't know what 0Amu7sNvd2Iou stands for. I have the impression that this part is also different in older documents, so I can't just build the key by always putting 0Amu7sNvd2Iou at the beginning.
Why I need to know this: my scripts use the getId method, but some users fill in their IDs manually (they just copy and paste the key from the address line). The result is that when I try to compare them, I can't match them even though they refer to the same document; they look completely different.
Thanks a lot for shedding light on this problem.
edit #taras:
I can also open the document with both the key and the ID. It's just weird that there are two different IDs for one document. For example, if I want to check whether a value somebody copied from the address line is the same as the file I have opened, I won't get true, even if it is the same file:
var keyFromHeadline = "0Amu7sNvd2IoudDhLX1RMUVBtS3pnQjcycFk0VGJsVWc";
var id = SpreadsheetApp.getActiveSpreadsheet().getId();
if (keyFromHeadline == id) Browser.msgBox("blabla");
Therefore I would be interested in the reason for the two different values and in how I could match them.
If you need to have unique file IDs, just normalize them. Every time a user enters an ID manually, run it through the fileIdNormalize function:
function fileIdNormalize(id) {
  if (typeof id == 'string' && id.length > 0)
    return DocsList.getFileById(id).getId();
  return '';
}
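Another option, based only on the observation in the question that the base64-encoded getId() value is the tail of the address-line key, is to compare the two values directly without opening the file. This is a rough sketch, not a documented rule about how the keys are built: idMatchesKey and base64OfId are hypothetical names, and the Buffer fallback is only there so the helper can also run outside Apps Script.

```javascript
// Base64-encodes a string. Apps Script provides Utilities.base64Encode;
// the Buffer branch is an assumption for running the sketch in Node.js.
function base64OfId(id) {
  if (typeof Utilities !== 'undefined') {
    return Utilities.base64Encode(id);
  }
  return Buffer.from(id, 'utf8').toString('base64');
}

// True if both values are identical, or if the key ends with the
// base64-encoded id (with the "=" padding removed).
function idMatchesKey(id, key) {
  if (id === key) {
    return true;
  }
  var tail = base64OfId(id).replace(/=+$/, '');
  return tail.length > 0 && key.slice(-tail.length) === tail;
}
```

With the values from the question, idMatchesKey('t8K_TLQPmKzgB72pY4TblUg', '0Amu7sNvd2IoudDhLX1RMUVBtS3pnQjcycFk0VGJsVWc') evaluates to true.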
Just a suggestion :
Since base64Encode seems to give you a significant part of the address-line key, you could use a match to check whether the document is the same.
Something like :
if ('manually_entered_key'.indexOf(base64Encode('the_value_obtained_by_getId').replace(/=+$/, '')) !== -1) {
// consider as the same doc ...