Counting views of any element on website - mysql

I am using the following MySQL query to count views:
UPDATE content SET views=views+1 WHERE id='$id'
For example, if I want to check how many times a single page has been viewed, I just put it at the top of the page code. Unfortunately, I always get a count about 5-10x higher than the one in Google Analytics.
If I am correct, one refresh should increase the value in my database by 1. Doesn't "Views" in Google Analytics work the same way?
Say Google Analytics reports that a single page has been viewed 100 times and my database says 450 times. How could such a simple query generate 350 additional views? And I don't mean visits or unique visits, just regular views.
Is it possible that Google Analytics interprets such data a little differently and my database result is correct?

There are quite a few reasons why this could be occurring. The most usual culprit is bots and spiders. As soon as you use a third-party API like Google Analytics, or Facebook's API, you'll get their bots making hits to your page.
You need to examine each request in more detail. The user agent is a good place to start, although I do recommend researching this area further - discriminating between human and bot traffic is quite a deep subject.
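As a starting point for the user-agent check, here is a minimal sketch in Python (used only for illustration; the same idea carries over to PHP). The marker list and the function names `is_probable_bot` and `should_count_view` are assumptions made up for this example, and a short substring list like this will miss many bots.

```python
import re

# Hypothetical, deliberately short list of substrings found in crawler user agents.
BOT_PATTERN = re.compile(r"bot|crawl|spider|slurp|facebookexternalhit", re.IGNORECASE)


def is_probable_bot(user_agent):
    """Return True when the user agent is empty or contains a known crawler marker."""
    if not user_agent:
        return True
    return bool(BOT_PATTERN.search(user_agent))


def should_count_view(user_agent):
    """Only increment the view counter for requests that do not look like bots."""
    return not is_probable_bot(user_agent)
```

The idea would be to run the `UPDATE ... SET views=views+1` query only when `should_count_view` returns True. Bots that send a browser-like user agent will still get through.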

In Google Analytics the data is reported by the user's browser. For example:
A user views a page on your domain; their browser is then in charge of communicating the pageview to Google. If something fails along the way, the data will not be included in the reports.
Your SQL system, on the other hand, is log-based analytics: the data is collected by your own server, which reduces data collection failures.
Seen that way, Google Analytics can miss some data: slow connections, clients that don't execute JavaScript (ad blockers or bots), or an HTML page that is not properly rendered***.
That said, 5x is a huge discrepancy; in my experience it should be nearer 8-25%. (Tested at the transaction level; for pageviews it may be more.)
What I recommend is:
Save the device and browser information, the IP, and any other metadata that may be useful, and don't forget the timestamp. That way you can isolate the problem: maybe it's robots or users with ad blockers, or in the worst case your code is not properly implemented (for example, located in the footer).
*** I added this because I once had a huge discrepancy that turned out to be a server error: the HTML was not properly rendered, and the user got an empty HTTP response. MySQL was not fast enough to save the information and process the HTML. I noticed it when a stress test (via Screaming Frog) showed a lot of 500 errors. (WordPress blog with no cache.)
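To illustrate the recommendation to store metadata with each view, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table name `view_log`, its columns, and the helper functions are assumptions for this example, not part of the original setup.

```python
import sqlite3
import time


def open_view_log(path=":memory:"):
    """Open (or create) a database with one row per view instead of a bare counter."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS view_log (
               content_id INTEGER NOT NULL,
               ip         TEXT,
               user_agent TEXT,
               viewed_at  REAL NOT NULL
           )"""
    )
    return conn


def log_view(conn, content_id, ip, user_agent, viewed_at=None):
    """Record a single view together with the metadata needed to analyse it later."""
    timestamp = time.time() if viewed_at is None else viewed_at
    conn.execute(
        "INSERT INTO view_log (content_id, ip, user_agent, viewed_at) VALUES (?, ?, ?, ?)",
        (content_id, ip, user_agent, timestamp),
    )
    conn.commit()


def views_by_user_agent(conn, content_id):
    """Return (user_agent, count) pairs for one page, most frequent first."""
    rows = conn.execute(
        "SELECT user_agent, COUNT(*) FROM view_log WHERE content_id = ? "
        "GROUP BY user_agent ORDER BY COUNT(*) DESC, user_agent",
        (content_id,),
    )
    return rows.fetchall()
```

Grouping by user agent (or by IP) makes it easier to see whether the extra views come from a handful of crawlers or from ordinary browsers.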

Why can I only get 250 rows (shoe transactions) of StockX data when accessing their API?

I have tried to gather data directly from the StockX API, which seemed possible according to an article from Jan 2019: https://medium.com/@thewillmundy/stockx-sneaker-data-in-three-simple-steps-8977d0016b80 . I am able to get a request URL that returns some transactions in JSON format.
I have tried changing the parameters in the request URL (limit as well as page). That works, but only for the latest 250 transactions; because some shoes sell in high volume, I can only retrieve the sales history for the last few days.
My goal: getting the whole sales history (often several thousand transactions). In the article mentioned above, that's possible.
Could it be a restriction from StockX?
Or is there a way?
Would be so so grateful for help!!!
Best regards, Marvin
I think the API will only give you the 250 most recent sales because that's all the product webpage itself will allow you to load when you click view all sales. Any sales further back in time aren't directly accessible from the product page, and we're essentially requesting the same data that page can request using the link it would use. I guess those are stored and accessed in a different way internally.
I'm guessing StockX changed its API since that article is a little old. I would try to contact StockX about their API via email, but I don't think they're really continuing developer support:
https://twitter.com/stockx/status/1000004306844647424?lang=en
It's pretty disappointing because I was also looking to work with the sales data but what can you do :/

Are google's search results influenced by our data?

I have always wondered that.
For example, If I search for the term "composer" or "what is composer", it shows the php package manager. Why does it show programmer-related results? Obviously, it makes sense that it does that, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example is, if I enter the word "spring" to the search engine, it shows the spring framework, instead of, let's say, the season.
So, my question(s):
Does google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results)
If yes, why doesn't incognito mode work?
How can I avoid google using other parameters, besides the very term I typed in, to affect the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page in a browser that has some google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a google search, there will be very many points in common that allow google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available) even if it blocks cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult because making use of all this information will indeed lead to generally more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.

How to deal with bots using the in-site search and overloading the SQL server with too many requests?

What is the best practise to not annoy users with flood limits, but yet block off bots doing automated searches?
What is going on:
I have become more aware of odd search behaviour, and I finally had the time to find out who is behind it. It is 157.55.39.*, also known as Bing. Which is odd, because when _GET['q'] is detected, noindex is added.
The problem, however, is that they are slowing down the SQL server, as there are just too many requests coming in.
What I have done so far:
I have implemented a search flood limit. But since I did it with a session cookie, checking and calculating from the last search timestamp -- Bing obviously ignores cookies and carries on.
The worst-case scenario is to add reCAPTCHA, but I don't want the "Are you human?" tickbox every time you search. It should appear only when a flood is detected. So basically, the real question is how to detect too many requests from a client in order to trigger some sort of reCAPTCHA to stop the requests.
EDIT #1:
I have handled the situation for now with:
<?php
# Get end IP (@ suppresses the notice when a header is not set)
define('CLIENT_IP', (filter_var(@$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP) ? @$_SERVER['HTTP_X_FORWARDED_IP'] : (filter_var(@$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP) ? @$_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR'])));
# Detect BING:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {
    # Tell them not right now:
    header('HTTP/1.1 503 Service Temporarily Unavailable');
    # ..and block the request
    die();
}
It works. But it seems like another temp solution to a more systematic problem.
I would like to mention that I still want search engines, including Bing, to index /search.html, just not to actually run searches there. There is no "latest searches" list or anything like that, so it's a mystery where they are getting the queries from.
EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.
First of all, it turns out that Bing has the same URL parameter feature that Google has, so I was able to tell Bing to ignore the URL parameter "q".
Based on the accepted answer, I added Disallow rules for the parameter q to robots.txt:
Disallow: /*?q=*
Disallow: /*?*q=*
I also told Bing, in the Bing webmaster console, not to crawl us during peak traffic.
Overall, this immediately had a positive effect on server resource usage. I will, however, implement an overall flood limit for identical queries, specifically where _GET is involved, in case Bing should ever decide to visit an AJAX call (example: ?action=upvote&postid=1).
Spam is a problem that all website owners struggle with.
There are many ways to build good protection, from very simple measures to elaborate and strong mechanisms.
For you, right now, I see one simple solution.
Use robots.txt to disallow the Bing spider from crawling your search page.
This is very easy to do.
Your robots.txt file would look like:
User-agent: bingbot
Disallow: /search.html?q=
But this will completely block the search engine spider from crawling your search results.
If you just want to limit such requests rather than block them completely, try this:
User-agent: bingbot
crawl-delay: 10
This asks Bing to request pages from your website no more often than once every 10 seconds.
With such a delay, it will crawl at most 8,640 pages a day (a very small number of requests per day).
If you are fine with this, you are done.
But what if you want to control this behavior on the server itself, protecting the search form not only from web crawlers but also from hackers?
They could easily send your server over 50,000 requests per hour.
In this case, I would recommend two solutions.
Firstly, put CloudFlare in front of your website, and don't forget to check whether your server's real IP is still discoverable via services like ViewDNS IP History, because many websites with CF protection overlook this (even popular ones).
If your active server IP is visible in the history, consider changing it (highly recommended).
Secondly, you could use Memcached to store flood data and detect whether a certain IP is querying too much (e.g. 30 queries per minute).
If it is, block its ability to perform searches (tracked via Memcached) for some time.
Of course, this is not the best solution available, but it will work and will not cost your server much.
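The per-IP flood check could look roughly like the sketch below. It is written in Python for illustration, and a plain dictionary stands in for Memcached so the example is self-contained; the class name `FloodLimiter`, the fixed-window approach, and the default limits are assumptions for this example. With a real Memcached you would keep the counter and the block flag under per-IP keys with an expiry time instead.

```python
import time


class FloodLimiter:
    """Fixed-window request counter per IP; dictionaries stand in for Memcached."""

    def __init__(self, max_requests=30, window_seconds=60, block_seconds=300,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self.clock = clock
        self.windows = {}  # ip -> (window_start, request_count)
        self.blocked = {}  # ip -> time at which the block expires

    def allow(self, ip):
        """Return True if this request may proceed, False if the IP is flooding."""
        now = self.clock()
        blocked_until = self.blocked.get(ip)
        if blocked_until is not None and now < blocked_until:
            return False
        start, count = self.windows.get(ip, (now, 0))
        if now - start >= self.window_seconds:
            # The previous window has ended; start counting again.
            start, count = now, 0
        count += 1
        self.windows[ip] = (start, count)
        if count > self.max_requests:
            # Over the limit: refuse this request and block the IP for a while.
            self.blocked[ip] = now + self.block_seconds
            return False
        return True
```

When `allow` returns False, the search handler could answer with a 429 or 503 status, or show a reCAPTCHA instead of running the SQL query.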

Script for counting total pages on website

I am trying to write a script that will check our website every day for the total number of web pages we have. How can I do this using an API like Google Analytics? Using JSON would be nice. Here is what it might look like; maybe someone can help?
{
"startDate": "{date.startOfMonth.format()}",
"endDate": "{date.today}",
"dimensions": ["query","page"]
}
As nyuen has pointed out you cannot count every page in your web presence with Google Analytics. GA will only register pages that a) have GA tracking code and b) have executed this tracking code at least once in your selected timeframe. Usually that's most of the pages, but you can't be sure.
What you can do is issue a query that requests the page path dimension and at least one metric; pageviews would be the obvious choice. That's not because you actually need the number of pageviews for your purpose, but because a query without at least one metric will not work. Send the query via the API or the query explorer and then simply count the number of rows in the result set. Since the page path is unique, the number of rows is the number of distinct pages with pageviews in the selected timeframe, which is the closest you will get with GA.
But there are actually tools for what you are trying to do, so you might want to start with those. For example, you might have your script make a system call (assuming a Linux system) to wget with the --spider option, which will create a list of files on a given domain. This does not require tracking code (it works by following links in the page source). There is also web spider software like Screaming Frog on Windows (it doesn't really work in a script, but I guess Windows has some task scheduling tool that allows you to start programs at pre-defined times), which not only does the counting but also returns information about the health of your site (dead links etc.).
Or, since this seems to be your server, you might write a script that traverses the file system and makes a list of the files it encounters there (this will not work if your pages are dynamically generated, since it counts only physical files).
Or you could write a script that parses your server logs and extracts calls to content files (this will work only for files that have actually been viewed).
So there are a number of better alternatives to using Google Analytics for that purpose, you might want to look into one of them first.
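The file-system and server-log approaches could be sketched like this in Python. The file extensions, the assumption that the log uses the Common/Combined Log Format, and the function names are all choices made for this example; adapt them to your server.

```python
import os
import re

# Matches the request and status in a Common/Combined Log Format line,
# e.g. "GET /index.html HTTP/1.1" 200
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


def count_page_files(root, extensions=(".html", ".htm", ".php")):
    """Count physical page files below root (misses dynamically generated pages)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += sum(1 for name in filenames if name.lower().endswith(extensions))
    return total


def distinct_pages_in_log(lines):
    """Collect distinct paths that were served with status 200 (only viewed pages)."""
    pages = set()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            # Drop the query string so /page?a=1 and /page count as one page.
            pages.add(match.group(1).split("?", 1)[0])
    return pages
```

A daily cron job could call `count_page_files` on the document root, or pass the lines of yesterday's access log to `distinct_pages_in_log` and take `len()` of the result.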

Get all files in box account

I need to fetch a list of all the files in a user's box account, such that the list of files can then be displayed in a table view (iOS).
I have successfully implemented this by recursively calling /folders/{folder id}/items on all the folders in my user's Box account.
However, while this works, it's kind of dirty, since a request is made for each of the user's folders, which could be quite a large number.
Is there any way to get a list of all the files (it's no issue if folders are included, I can ignore those manually) available?
I tried implementing this using search, but I couldn't identify a value for the query parameter that returned everything.
Any help would be appreciated.
Help me, Obi-Wan Kenobi. You're my only hope.
What you are looking for (a recursive call through a Box account) is not available. We have enterprise customers with bajillions of files and millions of folders. Recursively asking for everything would take too long.
What we generally recommend is that you ask for as little as you can, and that you use multiple threads and anticipate what you'll need just a little bit, so that you can deliver a high-performance user-interface to your end-users.
For example, ?fields=item_collection is expensive to retrieve and can add a lot to a payload. It can double, or even 10x, the time it takes to get a payload back from the Box API. Most UIs don't need to show all the items inside every folder, so they are better off asking for ?fields=.
You can make your application responsive to the user if you make the smallest possible call. Of course there is a balance. Mobile networks have high latency, and sometimes that next API call to show some extra thing is slow. But for a folder tree, you can get high performance by retrieving only the current level, displaying that, and then starting to fetch one-level down while the user is looking at the first level.
Same goes for displaying thumbnails. If a user drills into a folder and starts looking at thumbnails for pictures, there's a good chance they'll want to see other thumbnails in that same folder. Your app should anticipate that, and start to pull one or two extras down in the background. Yes, it means more API calls, but your users will give your app a higher rating for being fast.
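The "fetch the current level, then prefetch one level down" idea can be sketched as follows. This is Python for illustration rather than iOS code, and it makes no real Box API calls: `fetch_items` is a hypothetical function you would supply that returns the items of one folder, and the item dictionaries with "id" and "type" keys are an assumed shape.

```python
from concurrent.futures import ThreadPoolExecutor


class FolderBrowser:
    """Loads one folder level on demand and prefetches its subfolders in the background."""

    def __init__(self, fetch_items, max_workers=4):
        # fetch_items(folder_id) -> list of dicts such as
        # {"id": "1", "type": "folder", "name": "Docs"} (hypothetical shape).
        self.fetch_items = fetch_items
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.pending = {}  # folder_id -> Future holding that folder's items

    def _request(self, folder_id):
        # Start a fetch only once per folder; later calls reuse the same Future.
        if folder_id not in self.pending:
            self.pending[folder_id] = self.pool.submit(self.fetch_items, folder_id)
        return self.pending[folder_id]

    def open_folder(self, folder_id):
        """Return the items of one folder and start fetching its direct subfolders."""
        items = self._request(folder_id).result()
        for item in items:
            if item["type"] == "folder":
                self._request(item["id"])
        return items

    def close(self):
        self.pool.shutdown(wait=True)
```

When the user drills into a subfolder, `open_folder` usually finds the request already in flight or finished, so the table view can be filled without waiting for a full round trip. The sketch assumes `open_folder` is called from a single thread and never evicts cached results.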