Bulk loading Google Drive: performance optimization (google-drive-api)

We're building a system that migrates documents from a different data store into Drive, and we'll be doing this for different clients on a regular basis. Performance matters to us because it affects our customers' experience, and it also affects our time to market: we need to test the migration, and waiting for files to load prolongs each testing cycle.
We have three areas of Drive interaction:
1. Create folders (there are many, potentially 30,000+)
2. Upload files (similar in magnitude to the number of folders)
3. Recursively delete a file structure
In cases 1 and 2, we run into "User rate limit exceeded" errors with just 2 and 3 threads, respectively. We have an exponential backoff policy, as suggested, that starts at 1 second and retries 8 times. We're setting quotaUser on all requests to a random UUID in an attempt to indicate to the server that we don't require user-specific rate limiting, but this seems to have had no impact compared to when we didn't set quotaUser.
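For illustration, here is a minimal Python sketch of the kind of backoff wrapper described above, written against the google-api-python-client; the 403/429 status check, the jitter, and the commented usage are assumptions for the sketch, not necessarily how our actual client is built.

    import random
    import time

    from googleapiclient.errors import HttpError


    def execute_with_backoff(request, max_retries=8, base_delay=1.0):
        """Execute a Drive API request, retrying rate-limit errors with
        exponential backoff plus jitter (~1s, 2s, 4s, ... between attempts)."""
        for attempt in range(max_retries + 1):
            try:
                return request.execute()
            except HttpError as err:
                # 403 userRateLimitExceeded / 429 are the retryable rate-limit errors.
                if err.resp.status not in (403, 429) or attempt == max_retries:
                    raise
                time.sleep(base_delay * (2 ** attempt) + random.random())

    # Example use for case 1 (creating a folder), assuming service was built with
    # googleapiclient.discovery.build("drive", "v3", credentials=creds):
    #
    # folder = execute_with_backoff(
    #     service.files().create(
    #         body={"name": "Client A",
    #               "mimeType": "application/vnd.google-apps.folder"},
    #         fields="id"))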
Case 3 currently uses batch queries; cases 1 and 2 currently use "normal" (individual) requests.
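And for case 3, a sketch of the batch-delete path in the same style (illustrative only; Drive limits a batch to 100 calls, so the id list would need to be chunked, and each batched call still counts against quota):

    def delete_in_batch(service, file_ids):
        """Delete up to 100 Drive files/folders with a single batched HTTP request."""

        def handle(request_id, response, exception):
            if exception is not None:
                print("delete %s failed: %s" % (request_id, exception))

        batch = service.new_batch_http_request(callback=handle)
        for file_id in file_ids:
            batch.add(service.files().delete(fileId=file_id), request_id=file_id)
        batch.execute()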
I'm looking for guidance on how best to improve the performance of this system.

Related

Do Firebase Functions count toward my GB bandwidth?

I have written two functions in Firebase which maintain data, for example deleting old data daily.
My question is: when I run a query to get data, does that count toward my GB-downloaded limit, which is $1/GB on the Blaze plan?
Since the data is transferred from Firebase's servers (Google's servers) to a user's computer (you, in this case), you will be charged for all of that data transferred to your computer.

Should I use files or a database?

I'm building a cloud sync application which syncs a user's data across multiple devices. I am at a crossroads, deciding whether to store the data on the server as files or in a relational database. I am using Amazon Web Services and will use S3 for user files, or their database service if I choose to store the data in a table instead. The data I'm storing is the state of the application every ten seconds. This could be problematic to store in a database because the average number of rows per user would be about 100,000, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? That would be about 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to be a key/value store and if you're able to diff the changes and ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client side.
You get a big cost saving from not having to operate a database server that can store tonnes of rows and stay up to serve them to the clients quickly.
My only real concern would be that the data in these files can be difficult to parse if you wanted to aggregate stats/data/info across multiple users as a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values etc, and would have to open the relevant files, process them with something like awk or regular expressions etc, and then compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, though, so there's probably some overlap there!
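A minimal boto3 sketch of what that key/value layout could look like (bucket name and key scheme are illustrative, not a prescription):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-sync-bucket"  # illustrative bucket name


    def upload_state_chunk(user_id, chunk_name, payload):
        """Store one chunk of a user's state under a predictable per-user key."""
        s3.put_object(Bucket=BUCKET, Key="%s/%s" % (user_id, chunk_name), Body=payload)


    def download_state(user_id):
        """Fetch every chunk for a user so the client can diff and merge locally."""
        chunks = {}
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=user_id + "/"):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
                chunks[obj["Key"]] = body.read()
        return chunks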

Does Google Drive SDK/API have a throttling/limiting policy (bandwidth limit)?

I created a program that downloads an entire user's Drive. To improve performance, it's a multi-threaded .NET application, and I increased the value of System.Net.ServicePointManager.DefaultConnectionLimit to raise the limit on simultaneous connections. I can confirm that if the application asks for 50 concurrent connections, they are correctly opened and used.
Currently, what I have found is that I can increase the number of threads to improve the number of files processed per second. However, beyond a certain number of threads there is no further difference in performance (throttling?).
I have profiled the bandwidth and it seems to be capped at around 1.5 MB/s.
The application can download as many files as the bandwidth allows, and after a certain threshold the downloading threads slow down.
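For illustration only, here is a rough Python sketch of the concurrent-download pattern described above (the actual application is .NET; the build_service helper and the worker count of 50 are assumptions):

    import io
    from concurrent.futures import ThreadPoolExecutor

    from googleapiclient.http import MediaIoBaseDownload


    def download_one(build_service, file_id):
        """Download a single file; each worker builds its own service object
        because the underlying httplib2 transport is not thread-safe."""
        service = build_service()  # e.g. a small wrapper around discovery.build(...)
        request = service.files().get_media(fileId=file_id)
        buf = io.BytesIO()
        downloader = MediaIoBaseDownload(buf, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()
        return file_id, buf.getvalue()


    def download_all(build_service, file_ids, workers=50):
        """Fan the downloads out over a thread pool (50 mirrors the connection limit)."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(lambda fid: download_one(build_service, fid), file_ids))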
Does Google limit the number of concurrent connections or the amount of bandwidth? In the documentation, I only saw that they impose a limit of API calls per day.
Thanks for your help.

LoadRunner vuser limit

So I've read elsewhere that LoadRunner is well known to support 2-4k users easily enough, but what that didn't tell me was what sort of environment LoadRunner needs to do that. Is there any guidance available on what the environment needs to be for various loads?
For example, would a single dual-core 2.4GHz CPU with 4GB RAM support 1,000 concurrent vusers easily? What about if we were testing something at a larger scale (say 10,000 users), where I assume we'd need a small server farm to generate the load? What would be the effect of fewer machines but with more network cards?
There have been tests run with LoadRunner well into the several-hundred-thousand-user range. You can imagine the logistical effort on the infrastructure required to run such tests.
Your question of how many users a server can support is actually quite complex. Just like any other piece of engineered software, each virtual user takes a slice of resources from the finite pool of CPU, disk, network, and RAM. So simply adding more network cards doesn't buy you anything if your limiting factor is CPU for your virtual users. Each virtual user type has a base weight, and your own development and deployment models alter that weight. I have observed a single load generator that could take 1,000 Winsock users easily at less than 50% of all resources used, and then drop to 25 users for a web application which had significantly higher network data flows, lots of state-management variables, and some disk activity related to loading files as part of the business process. You also don't want to max out your virtual user hosts, in order to limit the possibility of test-bed influences on your test results.
If you have inexperienced LoadRunner users, then you can virtually guarantee you will be running less-than-optimal virtual user code in terms of resource utilization, which could result in as little as 10% of the load you should expect a given host to produce, because of choices made in virtual user type, development, and deployment run-time settings.
I know this is likely not the answer you wanted to hear, i.e., "for your hosts you can get 5732 virtual users of type xfoo," but there is no definitive answer without holding both the application and the skills of the tool's user constant. Only then can you move from protocol to protocol and from host to host and find out how many users you can get per box.
As a rule of thumb, each virtual user needs around 4 MB of RAM, so you can estimate how many your existing machine can handle, keeping in mind that CPU, disk, or network can become the limiting factor first.
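As a back-of-the-envelope check on that rule of thumb (the overhead figure here is an assumption):

    # Rough capacity estimate for one load generator; rule-of-thumb numbers only.
    ram_gb = 4            # total RAM on the generator
    os_overhead_gb = 1    # assumed headroom for the OS and LoadRunner itself
    per_vuser_mb = 4      # approximate footprint per virtual user

    max_vusers = (ram_gb - os_overhead_gb) * 1024 // per_vuser_mb
    print(max_vusers)     # ~768, before CPU, disk, or network become the bottleneck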

Most efficient method of logging data to MySQL

We have a service which sees several hundred simultaneous connections throughout the day, peaking at about 2,000, for about 3 million hits a day, and growing. With each request I need to log 4 or 5 pieces of data to MySQL. We originally used the logging that came with the app we're using; however, it was terribly inefficient, ran the db server at more than 3x the average CPU load, and would eventually bring the server to its knees.
At this point we are going to add our own logging to the application (PHP). The only option I have for logging the data is the MySQL db, as this is the only common resource available to all of the HTTP servers. This data will be mostly writes; however, every day we generate reports based on the data, then crunch and archive the old data.
What recommendations can be made to ensure that I don't take down our services with logging data?
The solution we took to this problem was to create an archive table, then regularly (every 15 minutes, on an app server) crunch the data and put it into the tables that are used to generate reports. The archive table of course did not have any indices; the tables the reports are generated from have several.
Some stats on this approach:
Short Version: >360 times faster
Long Version:
The original code/model did direct inserts into the indexed table, and the average insert took .036 seconds; using the new code/model, inserts took less than .0001 seconds (I was not able to get an accurate fix on the insert time, so I measured 100,000 inserts and averaged to get the per-insert time). The post-processing (crunch) took an average of 12 seconds for several tens of thousands of records. Overall we were very pleased with this approach, and so far it has worked incredibly well for us.
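A stripped-down sketch of the shape of this approach (Python with illustrative table and column names; the production code is PHP and the schema will differ):

    import mysql.connector  # assumes the mysql-connector-python package

    # Illustrative schema: log_archive has no indices; report_hourly is indexed.
    conn = mysql.connector.connect(host="db", user="app", password="...", database="logs")


    def log_hit(endpoint, user_id, status, duration_ms):
        """Fast path: append one row to the unindexed archive table."""
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO log_archive (endpoint, user_id, status, duration_ms, hit_at) "
            "VALUES (%s, %s, %s, %s, NOW())",
            (endpoint, user_id, status, duration_ms),
        )
        conn.commit()
        cur.close()


    def crunch():
        """Every ~15 minutes: roll archive rows up into the indexed report table,
        then clear what was processed. (A real job would snapshot or mark rows
        first so hits logged mid-crunch are not lost.)"""
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO report_hourly (day, hour, endpoint, hits, total_duration_ms) "
            "SELECT DATE(hit_at), HOUR(hit_at), endpoint, COUNT(*), SUM(duration_ms) "
            "FROM log_archive GROUP BY DATE(hit_at), HOUR(hit_at), endpoint"
        )
        cur.execute("DELETE FROM log_archive")
        conn.commit()
        cur.close()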
Based on what you describe, I recommend you leverage the fact that you don't need to read this data immediately and pursue a "periodic bulk commit" route. That is, buffer the logging data in RAM on the app servers and do periodic bulk commits. If you have multiple application nodes, some sort of randomized approach helps even more (e.g., commit updated info every 5 +/- 2 minutes).
The main drawback with this approach is that if an app server fails, you lose the buffered data. However, that's only bad if (a) you absolutely need all of the data and (b) your app servers crash regularly. Small chance that both are true, but in the event they are, you can simply persist your buffer to local disk (temporarily) on an app server if that's really a concern.
The main idea is:
buffering the data
periodic bulk commits (leveraging some randomization across a distributed system helps; see the sketch after this list)
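A minimal sketch of the buffering idea (Python for illustration; the flush interval, jitter, and table name are assumptions):

    import random
    import threading
    import time

    import mysql.connector  # assumes the mysql-connector-python package

    _buffer = []
    _lock = threading.Lock()


    def log_hit(endpoint, user_id, status, duration_ms):
        """Cheap in-memory append; no database round trip on the request path."""
        with _lock:
            _buffer.append((endpoint, user_id, status, duration_ms))


    def flush_forever(base_interval=300, jitter=120):
        """Background thread: bulk-insert the buffer every 5 +/- 2 minutes.
        The randomized interval keeps multiple app nodes from flushing in lockstep."""
        conn = mysql.connector.connect(host="db", user="app", password="...", database="logs")
        while True:
            time.sleep(base_interval + random.uniform(-jitter, jitter))
            with _lock:
                rows = _buffer[:]
                del _buffer[:]
            if not rows:
                continue
            cur = conn.cursor()
            cur.executemany(
                "INSERT INTO log_archive (endpoint, user_id, status, duration_ms) "
                "VALUES (%s, %s, %s, %s)",
                rows,
            )
            conn.commit()
            cur.close()


    threading.Thread(target=flush_forever, daemon=True).start()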
Another approach is to stop opening and closing connections if possible (e.g., keep longer-lived connections open). While that's likely a good first step, it may require a fair amount of work on a part of the system that you may not have control over. But if you do, it's worth exploring.