I've added a GCS script to a static blog based on Jekyll. Development is done on localhost, then pushed to a GitHub Pages location. The blog is a combination of posts and static pages.
Results thus far are very inconsistent. Some keywords return zero results despite being clearly visible in a document's body or title. The same keywords return different results when entered into a standalone GCS URL.
Any suggestions on test strategies for this type of problem?
I'm building an application that monitors URLs for changes. For the application logic I am using Google Apps Script and a Google Sheet.
Here is the monitoring mechanism I have in mind. First, the script will read data from a sheet with the following columns:
URL: the URLs we want to monitor.
First Time: whether this is the first time the URL has been analyzed.
Changes: whether the page has changed since the previous analysis.
HashValue: the MD5 hash of the HTML returned by the URL the last time it was analyzed.
When the script runs, it reads through the rows of the sheet. For each row:
It reads the URL and calls UrlFetchApp to get a response from that web page.
It calls getContentText on the response to obtain the page's HTML and saves it in a variable.
It applies the MD5 hash algorithm to the HTML and saves the result in a variable.
If the URL is being analyzed for the first time, it records in the Changes column that no changes were detected (there is nothing to compare against yet) and saves the hash in the HashValue column.
If the URL has been analyzed before, it compares the previously recorded HashValue with the one just computed.
If the values differ, it records in the Changes column that there were changes and saves the new hash in the HashValue column.
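A minimal sketch of this flow in Apps Script, assuming a sheet named "Monitor" whose columns are URL | First Time | Changes | HashValue in that order (the sheet name and column layout are illustrative assumptions, not from the original post):

function checkUrls() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Monitor');
  var rows = sheet.getDataRange().getValues();
  for (var i = 1; i < rows.length; i++) { // row 0 is the header
    var url = rows[i][0];
    var previousHash = rows[i][3];
    var html = UrlFetchApp.fetch(url).getContentText();
    var hash = md5Hex(html);
    // First analysis: nothing to compare against yet, so record "No"
    var changed = previousHash ? (hash !== previousHash ? 'Yes' : 'No') : 'No';
    sheet.getRange(i + 1, 3).setValue(changed); // Changes column
    sheet.getRange(i + 1, 4).setValue(hash);    // HashValue column
  }
}

// MD5 of a string, rendered as a lowercase hex string
function md5Hex(text) {
  return Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, text)
    .map(function (b) { return ((b & 0xff) + 0x100).toString(16).slice(1); })
    .join('');
}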
I have already written the code, and it works with some websites but not with others. After comparing the HTML of the sites where it failed using an online text comparator, I noticed the following:
There are websites where reloading the same page twice produces slightly different code, even though the content is static. For example, an HTML tag may have the ID box-wrap-140 on one load and box-wrap-148 on the next.
As implemented, the script would therefore detect changes, because the HTML code differs. After much research I can't find an alternative that solves this problem, hence the question in the title.
PS: We can ignore details such as the website being down or returning 404, 301, etc. response codes. That part is already programmed and works correctly.
PS2: Sorry for my level of English.
You can use cheerioGS to look for specific tags and exclude their changes (e.g. <footer>) or include them (e.g. <div>).
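For example, a hedged sketch using the cheeriogs library (added to the project as "Cheerio" via its Apps Script library key); which tags to strip before hashing is an assumption to adjust for the sites you monitor:

function stableHash(url) {
  var html = UrlFetchApp.fetch(url).getContentText();
  var $ = Cheerio.load(html);
  $('footer').remove(); // drop sections whose markup churns between loads
  $('script').remove(); // inline scripts often embed random IDs
  var stableHtml = $('body').html() || '';
  return Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, stableHtml)
    .map(function (b) { return ((b & 0xff) + 0x100).toString(16).slice(1); })
    .join('');
}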
I am designing a blog and I want others to be able to log in and create new blog posts.
The contents will be stored in a database, and when a person visits the URL, the content will be loaded from the database and presented in a template file.
Since the HTML is not stored in files but in a database, will Google be able to index it?
Clients on the web, be they browsers, search engines, or anything else, have no way of knowing, and no reason to care, what a web server does to determine what content to return for the resource associated with a URL. The server might read a static file, combine data from a database with a template, or generate random data with a Markov chain text generator; the client just sees the resulting HTML.
My current site (a golf league) uses several scripts to let players schedule whether they are playing, display various results pages, etc. It seems the new Google Sites implementation does not allow a parameter passed in the page URL to be picked up by an embedded Google Web App (published from my script).
This link shows an example https://sites.google.com/site/kitchenergaffers/home/general-gaffers-information/publish/directory-of-results?display=directory
That page embeds my web app (built from a Google Apps Script) that does a doGet(e). The "display" parameter tells the script which page to format and display, which it gets by extracting e.queryString. I use a similar approach for players scheduling their absences: another URL parameter identifies the player who may be changing their availability.
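For context, the pattern in use looks roughly like this (formatDirectory is a hypothetical stand-in for whatever builds each page):

function doGet(e) {
  // e.parameter holds parsed values; e.queryString has the raw query string
  var display = e.parameter.display; // e.g. ...?display=directory
  if (display === 'directory') {
    return HtmlService.createHtmlOutput(formatDirectory());
  }
  return HtmlService.createHtmlOutput('Unknown display value: ' + display);
}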
It seems this ability will not be supported in the new Google Sites, so I am looking for an alternative (and free) web-building facility where I can launch GAS web apps and access the page URL parameters in the same (or a similar) way. WordPress, Wix, etc. may be candidates, but it is difficult to tell from their introductory material whether it can be done. If someone has already found a suitable site facility and methodology, I would appreciate the guidance.
Just in case anyone finds this in a search, I have found a workaround.
What I had missed is that a script can be the target of a URL and will execute in a browser on its own; it does not need a "hosting" page. So to achieve what I need, instead of sending the link to the Google Sites page, I can send a link to the script directly, and it will happily execute in its own browser environment. In some cases I may need to add a bit of text to the HTML returned by the script to replace what was on the Sites page.
So the link below achieves what I needed. Be aware that the links displayed by the script currently still point to the original Sites page.
https://script.google.com/macros/s/AKfycbxichdoGrHbImuudkJbuhhD00GpHvVvc-Ph_BTpSI4863pMevVx/exec?display=directory
Background
The UN Secretary-General and other organs issue hundreds of reports to the General Assembly each year, and there is no unified list of these reports as there is for other documents. There is, however, a simplified URL for reading these reports using their document codes, http://undocs.org/[document code], with the document codes having the format A/[Session]/[Document Number]. An example document code would be "A/71/1", and the URL for accessing it would be "https://undocs.org/A/71/1".
I'm trying to download all of these documents for the past 15 years, but instead of manually typing in each of these, I'd like to set up a Google Apps Script to do it for me.
Problem
When I try the simple method UrlFetchApp.fetch("http://undocs.org/A/71/1"), for example, it fetches an error page saying that I am using an unauthorized method of accessing the page. This is the same page that shows up if you block cookies, or sometimes when you try to access the page in an incognito window.
Now, I'm not looking to hack into the UN, but simply to download some PDFs that are up for public access. I need to figure out what parameters to pass with the .fetch() method for the request to be authorized by the page.
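For reference, options are passed to UrlFetchApp.fetch as a second argument; a hedged starting point for inspecting what the server actually does (the header value is an illustrative guess, not a known requirement of undocs.org):

function tryFetch() {
  var response = UrlFetchApp.fetch('http://undocs.org/A/71/1', {
    muteHttpExceptions: true, // return error pages instead of throwing
    followRedirects: false,   // observe each redirect hop yourself
    headers: { 'User-Agent': 'Mozilla/5.0' } // some servers reject the default agent
  });
  Logger.log(response.getResponseCode());
  Logger.log(JSON.stringify(response.getAllHeaders())); // look for Set-Cookie here
}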
Note: I scoured the undocs.org site looking for any guidance, and I found none.
tl;dr
Trying to access the United Nations Official Document System using UrlFetchApp from Google Apps Script, but I can't figure out how to get the request authorized.
Short answer - I don't think you'll be able to get it with a one-line fetch.
If you look at the HTML returned when you fetch https://undocs.org/A/71/1, you'll see that it embeds a frame that gets its content from https://daccess-ods.un.org/access.nsf/Get?OpenAgent&DS=A/71/1&Lang=E. Then, if you look at the HTML returned by that frame, you'll see two things:
A frame that loads https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234
A redirect to the actual PDF at https://documents-dds-ny.un.org/doc/UNDOC/GEN/N16/206/02/PDF/N1620602.pdf?OpenElement
I presume that the first link sets a cookie indicating that the login has occurred, which the second link then verifies before returning the content.
Things you could try:
A multi-step fetch, where you first get the content from undocs.org, parse it to get the link to the actual PDF, then log in and fetch the PDF (see the sketch after this list). Google Apps Script would have to persist cookies between fetches, though.
Write your script in a different tool (such as Python).
Use a spider/crawler tool to navigate the UN site as if it were a real human.
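A rough sketch of the multi-step approach, assuming the frame/redirect structure described above stays stable; the URL-extracting regexes and cookie handling are illustrative guesses, and the site may require more state (such as the login frame) than this:

function fetchUnDocument(code) {
  // Step 1: fetch the landing page and pull out the embedded frame URL
  var landing = UrlFetchApp.fetch('https://undocs.org/' + code, { muteHttpExceptions: true });
  var frameMatch = landing.getContentText().match(/src="(https:\/\/daccess-ods\.un\.org[^"]+)"/);
  if (!frameMatch) throw new Error('Frame URL not found');

  // Step 2: fetch the frame; capture cookies and the PDF redirect target
  var frame = UrlFetchApp.fetch(frameMatch[1], { muteHttpExceptions: true });
  var cookies = frame.getAllHeaders()['Set-Cookie'];
  var pdfMatch = frame.getContentText().match(/URL=(https:\/\/documents-dds-ny\.un\.org[^"' >]+)/);
  if (!pdfMatch) throw new Error('PDF URL not found');

  // Step 3: replay the cookies when fetching the PDF itself
  var pdf = UrlFetchApp.fetch(pdfMatch[1], {
    muteHttpExceptions: true,
    headers: { Cookie: [].concat(cookies || []).join('; ') }
  });
  return pdf.getBlob();
}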
Our application needs a full list of the user's files and folders. We use files.list() via the JavaScript library (essentially the same code as the example shown in the official API reference).
We use the "drive.files" scope.
Examining the response to the list, we find that some files are always missing. I ran various tests to understand the problem:
The files clearly exist. They show up in the Google Drive web app and, if I explicitly request them by ID, I can get them via the API without problems.
It's reproducible: always the same files are missing.
It is not transient. I tried a day later and the same files were still missing. I know of a few strange effects in the API that go away after some time, but not this one.
It is not a one-time thing (e.g. something weird going wrong during upload). If I repeat the test with a completely different Google account, files are again missing. In one test, 4 of a small set of 147 uploaded files were missed by the files.list call; in another test with the same 147 files on another account, 23 files were missing.
It only occurs when I use the drive.files scope. If I relax the scope to drive, all files are returned. If I look at "Details" in the Google Drive web app, the missing files are also shown as created by our application, so they do not seem to have lost their origin somehow.
It also occurs when I specify a search query. If I call files.list with the search term q: modifiedDate > '2012-06-04T12:00:00', which should also return all files, the same files are missing.
I re-implemented the same thing as a pure REST call to the API to rule out an issue with the JavaScript library. The error remains.
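A minimal sketch of the kind of paging loop involved, against the Drive v2 REST endpoint (assuming an Apps Script context where ScriptApp.getOAuthToken() returns a token carrying a Drive scope):

function listAllFiles(maxResults) {
  var token = ScriptApp.getOAuthToken();
  var items = [];
  var pageToken = null;
  do {
    var url = 'https://www.googleapis.com/drive/v2/files?maxResults=' + maxResults +
      (pageToken ? '&pageToken=' + encodeURIComponent(pageToken) : '');
    var page = JSON.parse(UrlFetchApp.fetch(url, {
      headers: { Authorization: 'Bearer ' + token }
    }).getContentText());
    items = items.concat(page.items || []); // a page can come back empty yet still carry a nextLink
    pageToken = page.nextPageToken;         // absent on the last page
  } while (pageToken);
  Logger.log('maxResults=' + maxResults + ' returned ' + items.length + ' items');
  return items;
}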
Update: I tracked it down to an issue with paging and the maxResults parameter. With different values, the API returns different numbers of items:
With maxResults=100 I get 100+100+7=207.
With maxResults=99 I get 99+99+28=226.
With maxResults=101 I get 101+101+0=202.
The last result is interesting: it gave me a nextLink indicating there are more results, but the items array in the last response was actually empty. This alone might indicate a bug.
Still, this only occurs in the drive.file scope; the counts are consistent in the full drive scope.
I'd be glad to hear ideas for a workaround. I'm aware of other ways to keep track of the user's files, e.g. using the changes feed. I'm using that already, but for a specific part of our application I simply need a reliable and complete list of all our application's items in a user's account.
One more note: we had other issues with the drive.files scope before (see Listing files with search query returns out-of-scope results (drive.files.list call, using drive.files scope)). That turned out to be an easy fix. Perhaps this issue is related.
For me, the issue was the difference between files "shared with me" and my own files/folders. What Google Drive presents was not the same as the result I got when searching without the correct flags.
When I listed all the files and folders, I found that I had to specify which corpus of files the search should cover:
- Include deleted files
- Include files shared with me
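As a hedged illustration, in Drive API v2 search syntax those flags correspond to query terms along these lines (by default, v2 includes trashed items unless you filter them out):

// Which corpus a files.list query covers, in v2 q-syntax (illustrative)
var ownedNotTrashed = "'me' in owners and trashed = false";
var sharedOnly = 'sharedWithMe';                      // only items shared with the user
var ownedOrShared = "sharedWithMe or 'me' in owners"; // both corpora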