Script for counting total pages on website - json

I am trying to write a script that will check our website every day for the total number of web pages we have. How can I do this using an API like Google Analytics? Using JSON would be nice. Here is what the query might look like - maybe someone can help please?
{
  "startDate": "{date.startOfMonth.format()}",
  "endDate": "{date.today}",
  "dimensions": ["query", "page"]
}

As nyuen has pointed out, you cannot count every page in your web presence with Google Analytics. GA will only register pages that a) contain the GA tracking code and b) have executed this tracking code at least once in your selected timeframe. Usually that's most of the pages, but you can't be sure.
What you can do is issue a query against the page path dimension plus at least one metric - pageviews would be the obvious choice. That's not because you actually need the number of pageviews for your purpose, but because a query without at least one metric will not work. Send the query via the API or the Query Explorer and then simply count the number of rows in the result set. Since the page path is unique, the number of results is the number of distinct pages with pageviews in the selected timeframe, which is the closest you will get with GA.
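A minimal sketch of that query in Python, assuming the Core Reporting API v3 via google-api-python-client with a service account; the key file name and profile id are placeholders you must supply:

from oauth2client.service_account import ServiceAccountCredentials
from googleapiclient.discovery import build

PROFILE_ID = '12345678'  # placeholder: your GA view (profile) id

credentials = ServiceAccountCredentials.from_json_keyfile_name(
    'key.json', ['https://www.googleapis.com/auth/analytics.readonly'])
service = build('analytics', 'v3', credentials=credentials)

# Query pagePath plus one metric - the metric is mandatory even though
# we only care about the row count. (Paging is needed past 10000 rows.)
result = service.data().ga().get(
    ids='ga:' + PROFILE_ID,
    start_date='30daysAgo',
    end_date='today',
    metrics='ga:pageviews',
    dimensions='ga:pagePath',
    max_results=10000).execute()

# Each row is one distinct page path, so the row count is the number
# of distinct pages with pageviews in the timeframe.
print(len(result.get('rows', [])))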
But there are actually tools for what you are trying to do, so you might want to start with those - for example, your script could make a system call (assuming a Linux system) to wget with the --spider option (plus --recursive so it follows links), which will produce a list of files on a given domain. This does not require tracking code (it works by following links in the pages' source code). There is also web spider software like Screaming Frog on Windows (it doesn't really run in a script, but Windows has task-scheduling tools that allow you to start programs at pre-defined times), which not only counts pages but also returns information about the health of your site (dead links etc.).
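The system-call approach might look like this in Python (the exact wget log format varies between versions, so treat the regex as an assumption; the domain is a placeholder):

import re
import subprocess

# Spider the site without downloading anything; wget writes its log
# to stderr.
out = subprocess.run(
    ['wget', '--spider', '--recursive', '--no-verbose',
     'https://example.com/'],
    capture_output=True, text=True)

# Count the unique URLs wget reported.
urls = set(re.findall(r'URL:\s*(\S+)', out.stderr))
print(len(urls))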
Or, since this seems to be your server, you might write a script that traverses the file system and lists the files it encounters there (this will not work if your pages are dynamically generated, since it counts only physical files).
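For instance (the document root is a placeholder):

import os

WEB_ROOT = '/var/www/html'  # placeholder: your document root
# Walk the tree and count every physical file under the web root.
total = sum(len(files) for _, _, files in os.walk(WEB_ROOT))
print(total)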
Or you write a script that parses your server logs and extracts calls to content files (this will work only for files that have actually been viewed).
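A sketch of the log-parsing variant, assuming a common/combined-format access log at a placeholder path:

import re

paths = set()
with open('/var/log/apache2/access.log') as log:  # placeholder path
    for line in log:
        # Pull the request path out of lines like: "GET /about.html HTTP/1.1"
        m = re.search(r'"(?:GET|POST) (\S+) HTTP', line)
        if m:
            paths.add(m.group(1).split('?')[0])  # ignore query strings
print(len(paths))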
So there are a number of better alternatives to using Google Analytics for that purpose; you might want to look into one of them first.

Related

Counting views of any element on website

I am using the following MySQL query for measuring view counts:
UPDATE content SET views=views+1 WHERE id='$id'
For example, if I want to check how many times a single page has been viewed, I just put it at the top of the page code. Unfortunately I always receive a number about 5-10x bigger than the results in Google Analytics.
If I am correct, one refresh should increase the value in my database by +1. Doesn't "Views" in Google Analytics work the same way?
If, for example, Google Analytics reports that a single page has been viewed 100 times while my database says 450 times, how could such a simple query generate an additional 350 views? And I don't mean visits or unique visits - just regular views.
Is it possible that Google Analytics interprets such data in a slightly different way and my database result is correct?
There are quite a few reasons why this could be occurring. The most usual culprit is bots and spiders. As soon as you use a third-party API like Google Analytics, or Facebook's API, you'll get their bots making hits to your page.
You need to examine each request in more detail. The user agent is a good place to start, although I do recommend researching this area further - discriminating between human and bot traffic is quite a deep subject.
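As a crude first pass, you could skip obvious bots by user agent before incrementing the counter (the substring list below is just an illustration; real bot detection goes much deeper):

BOT_MARKERS = ('bot', 'spider', 'crawler', 'facebookexternalhit')

def looks_like_bot(user_agent: str) -> bool:
    # Obvious crawlers usually identify themselves in the UA string.
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)

# Only run the UPDATE ... views+1 query when this returns False.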
In Google Analytics the data is provided by the client. A user views a page on your domain, and the browser is then in charge of communicating the pageview to Google; if something fails along the way, the data will not be included in the reports.
Your SQL system, on the other hand, is log-based analytics: the data is collected by your own system, which reduces data-collection failures.
Seen that way, some data can be missed due to slow connections and users that don't execute JavaScript (ad blockers or bots), or because the HTML page is not rendered properly***.
That said, 5x more is a huge discrepancy; in my experience it should be nearer 8-25% (tested at the transaction level; for pageviews it may be more).
What I recommend is: save the device, the browser information, the IP, and any other useful metadata - and don't forget the timestamp. That way you can isolate the problem; maybe it's robots or users with ad blockers, or in the worst case your tracking code is not properly implemented (located in the footer, for example). A sketch of such logging follows the footnote below.
*** I added this because I once had a huge discrepancy, but it was a server error: the HTML was not rendered properly, showing the user an empty page. The MySQL update still recorded the view even though the HTML never made it out. I noticed it when a stress test (via Screaming Frog) showed a lot of 5xx errors (a WordPress blog with no cache).
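A minimal sketch of that kind of logging (table and column names are hypothetical; sqlite3 from the standard library is used just to keep the example self-contained):

import sqlite3
from datetime import datetime, timezone

def record_view(db, page_id, ip, user_agent):
    # Store enough metadata per view to isolate bots and ad-block
    # traffic later.
    db.execute(
        'INSERT INTO views (page_id, ip, user_agent, ts) VALUES (?, ?, ?, ?)',
        (page_id, ip, user_agent, datetime.now(timezone.utc).isoformat()))
    db.commit()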

Trying to achieve predictable search results from Google Drive API

Short version:
What is the proper way to list/query files by suffix: "fullText contains 'ext'", "fileExtension = 'ext'", or "title contains 'ext'"? These do not always return the same results; only the first is documented, and it's not consistent.
Long version:
I've been developing Google Drive apps for years. Every now and then I have to change my list queries to get the correct results. My application needs to find files with a certain suffix. The official documentation indicates that I need to use the "fullText contains 'ext'" syntax, but sometimes this fails to find some files. At one time I switched to the undocumented "fileExtension = 'ext'" syntax, but again after some time I found files that wouldn't show up and went back to fullText searches. However, again I've seen files not showing up with that search and tried using "title contains 'ext'" (or v3 "name contains 'ext'"), which seems to work - but for how long? I don't like using undocumented queries which might just suddenly stop working.
I feel like I'm going in circles since I don't know why fullText fails (and only for some users, and when it does work I've seen the parents field come up empty sometimes...which doesn't happen with other queries) or why the title search works (not documented to search suffixes...and I'm pretty sure it didn't used to work). I might just perform all three searches, but this affects performance, and the "or" keyword with some combinations of those three searches returns no results at all.
My application has thousands of files, each with multiple revisions, in hundreds of folders, and each folder is shared with dozens of users whose permissions change on a regular basis as people are added to and removed from projects. There are hundreds of different owners of the individual files. I suspect this complexity and the time it takes to propagate permissions and file changes affects my queries, but it doesn't explain why one search would work and another wouldn't, or why the information returned on a file in one query would differ from another. That is, even after several days the problem doesn't correct itself, and often a file must be removed and re-uploaded for everyone to see it. I have experienced slow updates to metadata for shared files resulting in mismatches between metadata, files, and search results, but I take all of that into account and still have queries which simply won't work properly.
Maybe I'm expecting too much from a free API? Overall I'm very happy with what I can do, but it can be very frustrating when it's not working and you know you're doing it right! :)
You can search or filter files with the 'files.list' or 'children.list' methods of the Drive API. These methods accept the 'q' parameter, which is the search query.
For more information, see: https://developers.google.com/drive/v3/web/search-parameters
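For illustration, here is how the three suffix queries from the question could be issued in Python, assuming an authorized Drive v3 client built with google-api-python-client (the credentials object is a placeholder from your own OAuth flow):

from googleapiclient.discovery import build

credentials = ...  # placeholder: obtained via your usual OAuth flow
service = build('drive', 'v3', credentials=credentials)

# Compare what each suffix query actually returns.
for q in ("fullText contains 'ext'",
          "fileExtension = 'ext'",
          "name contains 'ext'"):
    resp = service.files().list(q=q, fields='files(id, name)').execute()
    print(q, '->', len(resp.get('files', [])), 'files')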

Get all files in Box account

I need to fetch a list of all the files in a user's box account, such that the list of files can then be displayed in a table view (iOS).
I have successfully implemented this by recursively calling /folders/{folder id}/items on all the folders in my user's Box account.
However, while this works, it's kind of dirty, seeing as how a request is made for each of the user's folders, which could be quite a large number.
Is there any way to get a list of all the files (it's no issue if folders are included, I can ignore those manually) available?
I tried implementing this using search, but I couldn't identify a value for the query parameter that returned everything.
Any help would be appreciated.
Help me, Obi-Wan Kenobi. You're my only hope.
What you are looking for (a recursive call through a Box account) is not available. We have enterprise customers with bajillions of files and millions of folders. Recursively asking for everything would take too long.
What we generally recommend is that you ask for as little as you can, and that you use multiple threads and anticipate what you'll need just a little bit, so that you can deliver a high-performance user-interface to your end-users.
For example, ?fields=item_collection is expensive to retrieve and can add a lot to a payload. It can double, or 10x, the time it takes to get back a payload from the Box API. Most UIs don't need to show all the items inside every folder, so they are better off asking for a minimal ?fields= list.
You can make your application responsive to the user if you make the smallest possible call. Of course there is a balance. Mobile networks have high latency, and sometimes the next API call to show some extra thing is slow. But for a folder tree, you can get high performance by retrieving only the current level, displaying that, and then starting to fetch one level down while the user is looking at the first level.
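A rough sketch of that level-at-a-time pattern against the Box v2 REST API (the access token is a placeholder, and the thread-pool prefetch is just one way to do the background fetch):

import requests
from concurrent.futures import ThreadPoolExecutor

API = 'https://api.box.com/2.0'
HEADERS = {'Authorization': 'Bearer YOUR_ACCESS_TOKEN'}  # placeholder

def folder_items(folder_id='0'):
    # Ask for as little as possible: just id, name and type.
    resp = requests.get(f'{API}/folders/{folder_id}/items',
                        headers=HEADERS,
                        params={'fields': 'id,name,type', 'limit': 1000})
    resp.raise_for_status()
    return resp.json()['entries']

pool = ThreadPoolExecutor(max_workers=4)
top = folder_items('0')  # show this level to the user right away
# Prefetch one level down in the background while the user looks around.
prefetch = {e['id']: pool.submit(folder_items, e['id'])
            for e in top if e['type'] == 'folder'}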
Same goes for displaying thumbnails. If a user drills into a folder and starts looking at thumbnails for pictures, there's a good chance they'll want to see other thumbnails in that same folder. Your app should anticipate that, and start to pull one or two extras down in the background. Yes, it means more API calls, but your users will give your app a higher rating for being fast.

Syncing File Name for Drive Realtime Document

My real-time document allows the user to edit the file name within the editor (much like Google's own apps). I represent this as a collaborative string so all collaborators see the file renames as soon as possible.
I'm trying to determine the best and most efficient way to keep this collaborative string in sync with the actual file name. There are two scenarios to consider:
In Editor Changes
If a user edits the document name within the editor, we need to use the Drive API to push that change out to the file on Google Drive. To avoid race conditions, it is best if only one of the collaborators pushes the change out. The easiest way to do this seems to be checking whether the rename event was local.
I also found it best to add a delay so we are not pushing the rename out to the Drive API with every character change. If a few seconds pass with no more name changes, at that point it pushes the change out. This all seems to work well.
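A minimal sketch of that debounce idea (threading.Timer is used here purely for illustration; the actual app would implement the same pattern client-side):

import threading

class Debouncer:
    """Run an action only after `delay` seconds with no new triggers."""
    def __init__(self, delay, action):
        self.delay, self.action = delay, action
        self._timer = None

    def trigger(self, *args):
        if self._timer is not None:
            self._timer.cancel()  # a new edit restarts the quiet period
        self._timer = threading.Timer(self.delay, self.action, args)
        self._timer.start()

# e.g. push_rename = Debouncer(3.0, push_name_to_drive)
# and call push_rename.trigger(new_name) on every title keystroke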
External Changes
The harder scenario, and the one I am interested in requesting advice on, is when the file name is changed externally - for example, if the user renamed the file within the Drive interface itself. We want this change to update our collaborative string to match.
My application is entirely client-side so I can't use webhook push notifications. So my only solution is to poll the file name every X seconds (currently set to 10). But this presents the following problems:
It is API intensive. If you have 4 collaborators who keep the screen open for 8 hours, that is 11,520 API calls. If my app has lots of users with lots of documents, I could see how this might push me past my API limits.
To avoid race conditions (and reduce API calls) we only want one collaborator to check for changes and update the collaborative string if the file name has changed. But how do we pick one when collaborators might join or exit at any time? Currently I have each collaborator check, any time the collaborator list changes, whether they are the "leader" - the collaborator whose session id is the highest. This seems to work, but it all feels fairly hacky. Also, if collaborators join close together, I wonder whether a race condition might cause multiple collaborators to think they are the leader.
Is there an easier way? A Realtime API function I am missing?
It would be ideal if the real-time API just provided a method that stored the document name. Anytime the real-time API checks for mutations it could grab the latest document name.
I think you've identified the options. There isn't currently any built-in functionality to sync it via the Realtime API specifically.
Personally I'd probably back off the poll time a lot. It's probably not critical that the title is always exactly up to date, so asking every few minutes is probably sufficient and would greatly reduce your QPS.
In terms of identifying a "leader", I can't think of anything better than something deterministic based on the session id. As long as each client rechecks on every session join/leave event, I don't think there should be any issues.
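That deterministic rule is small enough to state directly (a sketch; the session ids are whatever your collaborator objects expose):

def am_leader(my_session_id, all_session_ids):
    # Every client runs this on each join/leave event; exactly one
    # session id is the maximum, so exactly one client polls Drive.
    return my_session_id == max(all_session_ids)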

News Aggregator of sorts

There is a website that my company uses that updates information about 3 specific things throughout the day. We use the information from one of them, and what we want to do is pull this information as it is added to their site and add it to a page of our own for easier viewing. Is this even possible? Can anyone point me in the direction of setting this up? It is all text that we want to pull.
Pick a language (e.g. Perl). Find an HTTP library for it (e.g. LWP). Fetch the page and run it through an HTML parser (e.g. HTML::TreeBuilder). Pull out the bits you want and shove them into a template (e.g. TT), then dump to a file. Stick the program in cron or Windows Scheduler.
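The same pipeline sketched in Python for illustration (requests and beautifulsoup4 are third-party packages; the URL and CSS selector are placeholders for whatever the target page actually uses):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/updates').text  # fetch
soup = BeautifulSoup(html, 'html.parser')                # parse

# Extract the bits you want (the selector is a placeholder).
items = [el.get_text(strip=True) for el in soup.select('.news-item')]

# Shove them into a (trivial) template and dump to a file.
with open('aggregated.html', 'w') as out:
    out.write('<ul>\n')
    for text in items:
        out.write('  <li>%s</li>\n' % text)
    out.write('</ul>\n')

# Schedule it, e.g. in cron:  */15 * * * * /usr/bin/python3 aggregate.py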