Working on a Google Sites site that builds several charts dynamically from a spreadsheet, I noticed that Google Apps Script works quite slowly. I profiled the code and optimized it by using the Cache Service where possible. After optimization the charting code takes approx. 3 seconds (2759 ms is one of the fastest times I have ever seen) to draw 11 charts from 127 rows, and that is for the case when all data are already in the cache. The first execution, which fetches data from the spreadsheet and places it in the cache, takes around 10 seconds. The profiled code spent significant time (tens of milliseconds) even in simple places. To measure the GAS performance, I wrote a very simple procedure and executed it in the GAS environment (as a deployed web application) and in the Caja Playground. I also submitted an issue to the GAS issue tracker.
Eric Koleda reasonably pointed out that it is not correct to compare server-side code with code running on a client. I rewrote the benchmark code, and the results are below, followed by the details and explanations.
Engine           | List To Map | Adjust | Quick Sort | Sort | Complete
GAS              |         138 |    196 |        155 |   38 |      570
rhino-1.6.5      |          67 |     44 |         31 |    9 |      346
spidermonkey-1.7 |          40 |     36 |         11 |    5 |      104
GAS - a row containing the execution times (in milliseconds) of the different functions run on the GAS engine. The GAS execution time drifts within quite wide limits; the table shows the fastest times I observed across 5-10 executions. The worst Complete time I have seen was 1194 ms. The source code is here. The results are here.
rhino-1.6.5 and spidermonkey-1.7 - rows containing the execution times of the same functions as for GAS, but executed on the corresponding JavaScript engines via ideone.com. The code and times for these engines are here and here.
The benchmark code contains a few functions.
List To Map [listToMap] - a function that converts a list of objects to a map with a compound key. It is taken from the site script and takes approx. 9.2% (256 of 2759 ms) of the charting code's time.
Adjust [adjustData_] - a function that converts all date columns in a matrix to text in a predefined format, transposes the matrix, and converts rows from the [[[a], [1]], [[b], [2]]] form to the [[a, 1], [b, 2]] form. It is also taken from the script and consumes approx. 30.7% (857 of 2759 ms).
Sort - the standard Array.sort function, included in the test to see how fast standard functions work.
Quick Sort [quick_sort] - a quick sort function taken from here. It is added to the benchmark to compare against the Array.sort execution time.
Complete [test] - a function that prepares the test data and calls the functions mentioned above. This time is not the sum of the other times in a row.
Conclusion: The execution time of GAS functions drifts. The GAS Complete function is 1.6 times slower than the slowest competitor. The standard Array.sort function in GAS is 4 times slower than in the slower of the two other engines. The List To Map and Adjust functions together are 3 times slower (334 ms vs 111 ms) than in the slowest competitor. These functions take 39.2% (1113 of 2759 ms) of the charting function. I did not expect these functions to work so slowly. It is possible to optimize them, for instance by using the cache. If we assume that after optimization their execution time drops to 0 ms, the charting function would take 1646 ms.
Wishes: If the GAS team could optimize their engine to the speed of the slowest competitor, the execution time could be expected to drop to 1 second or less. It would also be great to optimize the time needed to fetch data from a spreadsheet. I understand that spreadsheets are not designed to handle large amounts of data, but in any case it would improve overall performance.
I've been able to replicate this performance, and I'll post updates on the issue as I receive them.
Related
I am trying to test the GeoMesa Cassandra backend.
I have ingested ~2M points from OSM and am sending DWITHIN and BBOX queries to Cassandra through GeoMesa, using GeoTools ECQL.
Then I ran some performance tests, and the results do not look reasonable to me.
Cassandra is installed on a Linux machine with a 16-core Xeon, 32 GB RAM and one SSD drive. I get ~150 queries per second.
I started to investigate the GeoMesa execution plan for my queries.
Trace logs coming from org.locationtech.geomesa.index.utils.Explainer were really helpful; they do a great job of explaining what is going on.
What looks confusing to me is the number of range scans that go through Cassandra.
For example, I see the following in my logs:
Table: osm_poi_a7_c_osm_5fpoi_5fa7_attr_v2
Ranges (49): SELECT * FROM ..
The number 49 is the actual number of range scans sent to Cassandra.
Different queries give me different results; they vary approximately from ~10 to ~130.
10 looks quite reasonable to me, but 130 looks enormous.
Could you please explain what is causing GeoMesa to send such a huge number of range scans?
Is there any way to decrease the number of range scans?
Maybe there are some configuration options?
Are there other options, like decreasing the precision of the z-index, to improve such queries?
Thanks anyway!
In general, GeoMesa uses common query planning algorithms among its various back-end implementations. The default values are tilted more towards HBase and Accumulo, which support scans with large numbers of ranges. However, there are various knobs you can use to modify the behavior.
You can reduce the number of ranges that are generated at runtime through the system property geomesa.scan.ranges.target (see here). Note that this will be a rough upper limit, so you will generally get more ranges than specified.
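For example, a minimal sketch of setting that property from code before the data store is created (the value 20 is only an illustration, not a recommendation):

// Sketch: cap the number of ranges GeoMesa plans per query.
// Set this before creating the DataStore; 20 is just an example value.
// It can equally be passed on the JVM command line as -Dgeomesa.scan.ranges.target=20.
public class ScanRangeConfig {
    public static void main(String[] args) {
        System.setProperty("geomesa.scan.ranges.target", "20");
        // ... create the Cassandra DataStore and run your queries as usual ...
    }
}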
When creating your simple feature type schema, you can also disable sharding, which defaults to 4. The number of ranges generated will be multiplied by the number of shards. See here and here.
If you are querying multiple 'time bins' (weeks by default), then the number of ranges will be multiplied by the number of time bins you are querying. You can set this to a longer interval when creating your schema; see here.
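As a rough sketch, both schema-level options are set as user-data hints on the SimpleFeatureType before calling createSchema. The exact keys used here (geomesa.z.splits, geomesa.z3.interval) are my assumption from the GeoMesa documentation, so verify them for your version:

import org.geotools.data.DataUtilities;
import org.opengis.feature.simple.SimpleFeatureType;

// Sketch only: assumed user-data keys, hypothetical feature type.
public class SchemaHints {
    public static void main(String[] args) throws Exception {
        SimpleFeatureType sft = DataUtilities.createType("osm_poi",
                "name:String,dtg:Date,geom:Point:srid=4326");
        sft.getUserData().put("geomesa.z.splits", "1");         // one shard instead of the default 4
        sft.getUserData().put("geomesa.z3.interval", "month");  // wider time bins than the default week
        // dataStore.createSchema(sft); // then create the schema as usual
    }
}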
Thanks,
As per the attached, we have a Balanced Data Distributor set up in a data transformation covering about 2 million rows. The script tasks are identical - each one opens a connection to Oracle and executes first a delete and then an insert. (This isn't relevant, but it's done that way due to parameter issues with the OLE DB Command and the Microsoft OLE DB Provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current screenshot is from a 16-core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, BDD can make use of more than five paths. You shouldn't have to do anything special to force it; by definition it should do it for you automatically. Then why isn't it using more than 5 paths? Because your source is producing data faster than your destination can consume it, causing backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to it's outputs, however many there are." In your set up, there are 10 outputs. So input data should be equally distributed to all the 10 outputs at the same time and you should see 10 paths executing at the same time - again in theory.
But another aspect of BDD is that "instead of routing individual rows, the BDD operates on buffers of data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are in use, each with the same number of rows. If additional buffers had been started, you would have seen more paths being used. SSIS couldn't use additional buffers, and ultimately additional paths, because of a mechanism called backpressure; it kicks in when the source produces data faster than the destination can consume it. If that were allowed to continue, all memory would be used up by the source data and SSIS would have no memory left for the transformation and destination components. To avoid this, SSIS limits the number of active buffers. It is set to 5 (and can't be changed), which is exactly the number of threads you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I've discovered via this article on CodeProject:
[T]his component uses an internal buffer of 9,947 rows (as per the experiment, I found so) and it is pre-set. There is no way to override this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine thousand nine forty seven) rows in our input file and will observe the behavior. After running the package, we will find that all the rows are being transferred to the first output component and the other components received nothing.
Now let us increase the number of rows in our input file from 9,947 to 9,948 (Nine thousand nine forty eight). After running the package, we find that the first output component received 9,947 rows while the second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
I have a script that generates about 20,000 small objects with about 8 simple properties. My desire was to toss these objects into ScriptDb for later processing of the data.
What I'm experiencing, though, is that even with a saveBatch operation the process takes much longer than desired and then silently stops. By too much time, I mean it's often greater than the 5-minute execution limit, though no error is thrown. The script runs so long that I've not attempted to check a mutation result to see what didn't make it, but from a check after execution it appears that most objects do not.
So, although I'm quite certain that my collection of objects is below the storage size limit, is there a lesser-known limit or throttle on access that is causing me problems? Is the number of objects the culprit here? Should I instead be attempting to save one big object that is a collection of the smaller ones?
I think it's the amount of data you're writing. I know you can store 20,000 small objects, you just can't write that much in 5 minutes. Write 1000, then quit. Write the next thousand, and so on. Run your function 20 times and the data is loaded. If you need to do this more often or in an automated way, use ScriptApp.
So I have a bit of a performance problem. I have made a Java program that constructs a database. The problem is loading in the data: I am loading 5,000 files into a SQL database. When the program starts off, it can process about 10% of the files in 10 minutes, but it gets much slower as it progresses. Currently, at 28%, it is going to finish in 16 hours at its current rate, and that rate is still slowing down considerably.
My question is why the program gets progressively slower as it runs, and how to fix that.
EDIT: I have two versions. One is threaded (capped at 5 threads) and one is not. The difference between the two is negligible. I can post the code again if anyone likes, but I took it out because I am now fairly certain that the bottleneck is MySQL (also retagged appropriately). I went ahead and used batch inserts. This did cause an initial increase in speed, but once again, after processing about 30% of the data, it drops off quickly.
So SQL Points
My Engine for all 64 tables is InnoDB version 10.
The tables have about 300k rows at this point (~30% of the data).
All tables have one "joint" (composite) primary key: an id and a date.
Looking at MySQL Workbench, I see that there is one query per thread (5 queries).
I am not sure of the unit of time (I am just reading it from MySQL Administrator), but the queries that check whether a file is already inserted are taking ~300. (This query should be fast, as it is a SELECT MyIndex FROM MyTable WHERE Date = date LIMIT 1.) Since I have been starting and stopping the program, I built in this check to see whether a file was already inserted; that way I am able to restart after each change and see what improvement there is, if any, without starting the whole process again.
I am fairly certain that the degradation of performance is related to the tables' sizes. (I can stop and start the program now and the process remains slow. It is only when the tables are small that the process runs at an acceptable speed.)
Please ask, and I will post whatever information you need.
DONE! Well, I just let it run for the 4 days it needed. Thank you all for the help.
Cheers,
--Orlan
Q1: Why does the program get progressively slower?
In your problem space, you have 2 systems interacting: a producer that reads from the file system and produces data, and a consumer that transforms that data into records and stores them in the db. Your code currently couples these two processes tightly, so your system runs at the slower of the two speeds.
In your program you have a fixed arrival rate (1/sec - the wait when you have more than 10 threads running). If you have indexes on the tables being filled, inserts will take longer as the tables grow bigger. That means that while your arrival rate is fixed at 1/sec, the time to service each insert keeps growing, i.e. your exit rate is continuously decreasing. Therefore, you will be creating more and more threads that share the same CPU/IO resources and getting less done per unit of time. Creating threads is also a very expensive operation.
Q2: Could it have to do with how I am constructing the queries from Strings?
Only partially. Your string manipulation is a fixed cost in the system. It increases the cost of servicing one request. But string operations are CPU-bound and your problem is I/O-bound, meaning that improving the string handling (which you should do) will only marginally improve the performance of the system. (See Amdahl's Law.)
Q3: how to fix that (performance issue)
Separate the file reader process from the db insert process. See the Consumer-Producer pattern. See also CompletionService for an implementation built into the JDK:
(FileReaderProducer) --> queue --> (DBBulkInsertConsumer)
Don't create new Threads yourself. Use the facilities provided by the java.util.concurrent package, like the ExecutorService or the CompletionService mentioned above. For a "bare" thread pool, use the Executors factory.
For this specific problem, having 2 separate thread pools (one for the consumer, one for the producer) will allow you to tune your system for best performance. File reading improves with parallelization (up to your I/O bound), but db inserts do not (I/O + indexes + relational consistency checks), so you might need to limit the number of file reading threads (3-5) to match the insertion rate (2-3). You can monitor the queue size to evaluate your system's performance (a minimal sketch of this split is shown after this list).
Use JDBC bulk inserts: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/
Use StringBuilder instead of String concatenation. Strings in Java are immutable. That means that every time you do myString += ";", you are creating a new String and making the old String eligible for garbage collection. In turn, this increases the garbage collection penalty.
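A minimal sketch of the producer/consumer split described in the list above (listFiles(), parseFile() and insertBatch() are hypothetical placeholders for your own file-reading and JDBC code):

import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch: file-reading producers feed a bounded queue; a single consumer drains it into the db.
public class LoadPipeline {
    private static final List<String> POISON = new ArrayList<>(); // end-of-work marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(100);
        ExecutorService producers = Executors.newFixedThreadPool(4);    // file readers
        ExecutorService consumer = Executors.newSingleThreadExecutor(); // single db writer

        consumer.submit(() -> {
            for (List<String> rows = queue.take(); rows != POISON; rows = queue.take()) {
                insertBatch(rows); // JDBC batch insert lives here
            }
            return null;
        });

        for (Path file : listFiles()) {
            producers.submit(() -> { queue.put(parseFile(file)); return null; });
        }
        producers.shutdown();
        producers.awaitTermination(1, TimeUnit.HOURS);
        queue.put(POISON); // tell the consumer there is nothing left
        consumer.shutdown();
    }

    // Hypothetical placeholders for the real file-system and JDBC code.
    private static List<Path> listFiles() { return new ArrayList<>(); }
    private static List<String> parseFile(Path file) { return new ArrayList<>(); }
    private static void insertBatch(List<String> rows) { }
}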
You can use a direct insert from a file into the database (read here). It works faster. When I did the same for Postgres, I got a 20x performance increase.
Also, download the YourKit profiler and profile your application for performance. Then you will see where your time goes.
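For MySQL the equivalent is LOAD DATA [LOCAL] INFILE. A hedged sketch (file path, table and columns are made up, and recent Connector/J versions need allowLoadLocalInfile=true in the URL):

import java.sql.*;

// Sketch: bulk-load a CSV file straight into a table instead of row-by-row inserts.
// URL, credentials, file path, table and columns are all placeholders.
public class BulkLoad {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/mydb?allowLoadLocalInfile=true";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement st = con.createStatement()) {
            st.execute("LOAD DATA LOCAL INFILE '/data/rows.csv' "
                    + "INTO TABLE my_table "
                    + "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
                    + "(id, dt, value)");
        }
    }
}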
There are a number of things in your code that could contribute to the speed problems, and you are correct in suspecting that the Strings play a role.
Take for example this code:
String rowsString = "";
// - an extra 1 to not have a comma at the end
for (int i = 0; i <= numberOfRows - 3; i++) {
rowsString += "(DATA), \n";
}
rowsString += "(DATA)";
Depending on how many rows there are, this is a potential bottleneck and memory hog. I think it's best if you use a StringBuilder here. I see a lot of String manipulation that is better suited to StringBuilder. Might I suggest you read up on String handling a bit and optimise these, especially where you += Strings?
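For example, the loop above could be rewritten along these lines (same loop bounds and placeholder row text as in your code):

// Sketch: build the row list with a StringBuilder instead of repeated String concatenation.
StringBuilder rows = new StringBuilder();
for (int i = 0; i <= numberOfRows - 3; i++) { // same bounds as the original loop
    rows.append("(DATA), \n");
}
rows.append("(DATA)");
String rowsString = rows.toString();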
Then the next question is: how is your table designed? There could be things making your inserts slow, like incorrect default lengths for varchars, no indexes, or too many indexes, etc.
Most databases load data more efficiently if:
you load in batches of data,
you load in a relatively small number of threads, e.g. one or two.
As you add more threads you add more overhead, so you should expect it to be slower.
Try using an ExecutorService with a fixed size pool, e.g. 2-4, and try loading the data in batches of, say, 100 at a time in a transaction.
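A minimal sketch of that suggestion, assuming the rows arrive already grouped into batches (URL, credentials, SQL and row types are placeholders):

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch: a small fixed pool, each task inserting one batch of ~100 rows inside a transaction.
public class BatchLoader {
    private static final String URL = "jdbc:mysql://localhost:3306/mydb";

    static void loadBatch(List<Object[]> rows) throws SQLException {
        try (Connection con = DriverManager.getConnection(URL, "user", "password")) {
            con.setAutoCommit(false); // one transaction per batch
            try (PreparedStatement ps =
                     con.prepareStatement("INSERT INTO my_table (id, dt) VALUES (?, ?)")) {
                for (Object[] row : rows) {
                    ps.setObject(1, row[0]);
                    ps.setObject(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            con.commit();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2); // fixed size pool, e.g. 2-4
        for (List<Object[]> batch : batches()) {                // your batches of ~100 rows each
            pool.submit(() -> { loadBatch(batch); return null; });
        }
        pool.shutdown();
    }

    private static List<List<Object[]>> batches() { return new ArrayList<>(); } // placeholder
}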
You have several good tried and tested options for speeding up database access.
Use an ExecutorService for your threads. This may not help speed-wise but it will help you implement the following.
Hold a ThreadLocal Connection instead of making a new connection for every file. Also, obviously, don't close it.
Create a single PreparedStatement instead of making a new one every time around.
Batch up your statement executions.
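A rough sketch of points 2-4, assuming each worker thread handles many files (URL, SQL and row types are placeholders):

import java.sql.*;

// Sketch: one Connection and one PreparedStatement per worker thread,
// reused for every file that thread processes, with batched executions.
public class PerThreadDb {
    private static final ThreadLocal<Connection> CONNECTION = ThreadLocal.withInitial(() -> {
        try {
            return DriverManager.getConnection("jdbc:mysql://localhost:3306/mydb", "user", "pw");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });

    private static final ThreadLocal<PreparedStatement> INSERT = ThreadLocal.withInitial(() -> {
        try {
            return CONNECTION.get().prepareStatement("INSERT INTO my_table (id, dt) VALUES (?, ?)");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });

    // Called once per file by whichever worker thread picked it up.
    static void insertRows(Iterable<Object[]> rows) throws SQLException {
        PreparedStatement ps = INSERT.get(); // created once per thread, then reused
        for (Object[] row : rows) {
            ps.setObject(1, row[0]);
            ps.setObject(2, row[1]);
            ps.addBatch();
        }
        ps.executeBatch();
    }
}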
Inspired by this xkcd cartoon, I wondered exactly what the best mechanism is to provide the user with an estimate of a file copy / move.
The alt tag on xkcd reads as follows:
They could say "the connection is probably lost," but it's more fun to do naive time-averaging to give you hope that if you wait around for 1,163 hours, it will finally finish.
Ignoring the funny, is that really how it's done in Windows? How about other OSes? Is there a better way?
Have a look at my answer to a similar question (and the other answers there) on how the remaining time is estimated in Windows Explorer.
In my opinion, there is only one way to get good estimates:
Calculate the exact number of bytes to be copied before you begin the copy process
Recalculate your estimate regularly (every 1, 5 or 10 seconds, YMMV) based on the current transfer speed
The current transfer speed can fluctuate heavily when you are copying over a network, so use an average, for example based on the number of bytes transferred since your last estimate.
Note that the first point may require quite some work if you are copying many files. That is probably why the folks at Microsoft decided to go without it. You need to decide for yourself whether the additional overhead created by that calculation is worth giving your user a better estimate.
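A minimal sketch of that approach (the names, the sampling interval and the smoothing factor are all just examples):

// Sketch: recompute the estimate from the bytes moved since the last sample,
// smoothing the rate so short spikes or stalls don't dominate the result.
public class CopyEstimator {
    private long lastBytes;
    private long lastTimeMs;
    private double avgBytesPerSec; // smoothed transfer rate

    public CopyEstimator(long startTimeMs) {
        this.lastTimeMs = startTimeMs;
    }

    /** Call every few seconds with the running byte count; returns estimated seconds left. */
    public double update(long bytesCopied, long totalBytes, long nowMs) {
        double intervalSec = Math.max((nowMs - lastTimeMs) / 1000.0, 0.001);
        double currentRate = (bytesCopied - lastBytes) / intervalSec;
        avgBytesPerSec = avgBytesPerSec == 0 ? currentRate
                                             : 0.7 * avgBytesPerSec + 0.3 * currentRate;
        lastBytes = bytesCopied;
        lastTimeMs = nowMs;
        return (totalBytes - bytesCopied) / Math.max(avgBytesPerSec, 1.0);
    }
}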
I've done something similar to estimate when a queue will be empty, given that items are being dequeued faster than they are being enqueued. I used linear regression over the most recent N readings of (time,queue size).
This gives better results than a naive
bytes_left_to_copy / (bytes_copied_so_far / elapsed_time)
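The same idea applied to a file copy might look like this sketch: fit a least-squares line through the last N (time, bytes remaining) samples and report where it crosses zero (all names are mine):

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: linear regression over the most recent samples of (time, bytes remaining);
// the estimate is where the fitted line reaches zero bytes remaining.
public class RegressionEstimator {
    private final Deque<double[]> samples = new ArrayDeque<>(); // {timeSec, bytesRemaining}
    private final int maxSamples;

    public RegressionEstimator(int maxSamples) {
        this.maxSamples = maxSamples;
    }

    public double secondsRemaining(double timeSec, double bytesRemaining) {
        samples.addLast(new double[] { timeSec, bytesRemaining });
        if (samples.size() > maxSamples) samples.removeFirst();
        if (samples.size() < 2) return Double.NaN; // not enough data yet

        double n = samples.size(), sumT = 0, sumB = 0, sumTT = 0, sumTB = 0;
        for (double[] s : samples) {
            sumT += s[0]; sumB += s[1];
            sumTT += s[0] * s[0]; sumTB += s[0] * s[1];
        }
        double slope = (n * sumTB - sumT * sumB) / (n * sumTT - sumT * sumT); // bytes/sec, negative while progressing
        double intercept = (sumB - slope * sumT) / n;
        if (slope >= 0) return Double.POSITIVE_INFINITY; // no progress over the window
        return (-intercept / slope) - timeSec;           // time at which remaining hits zero, minus now
    }
}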
Start a global timer that fires, say, every 1000 milliseconds and updates a total elapsed time counter. Let's call this variable "elapsedTime".
While the file is being copied, update a local variable with the amount already copied. Let's call this variable "totalCopied".
In the periodically raised timer event, divide totalCopied by elapsedTime to get the number of bytes copied per timer interval (in this case, 1000 ms). Let's call this variable "bytesPerSec".
Divide the total file size by bytesPerSec to obtain the total number of seconds theoretically required to copy the file. Let's call this variable "remainingTime".
Subtract elapsedTime from remainingTime and you have a somewhat accurate estimate of the remaining file copy time.
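Put together, the steps above come out roughly as this sketch (using the variable names from the list; the timer wiring itself is omitted):

// Sketch of the steps above, driven by a 1000 ms timer.
// The copy loop calls updateCopied(); the timer event calls remainingSeconds().
public class NaiveEstimator {
    private final long totalFileSize; // known before the copy starts
    private long elapsedTime;         // seconds, incremented once per timer tick
    private long totalCopied;         // bytes copied so far

    public NaiveEstimator(long totalFileSize) {
        this.totalFileSize = totalFileSize;
    }

    public void updateCopied(long bytesCopiedSoFar) {
        totalCopied = bytesCopiedSoFar;
    }

    /** Called from the timer event, once per second. */
    public long remainingSeconds() {
        elapsedTime++;
        if (totalCopied == 0) return -1;                      // no data yet
        double bytesPerSec = (double) totalCopied / elapsedTime;
        double remainingTime = totalFileSize / bytesPerSec;   // total seconds theoretically needed
        return Math.round(remainingTime - elapsedTime);       // seconds still to go
    }
}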
I think dialogs should just admit their limitations. It's not annoying because it's failing to give a useful time estimate, it's annoying because it's authoritatively offering an estimate that's obvious nonsense.
So, estimate however you like: based on the current rate or the average rate so far, rolling averages discarding outliers, or whatever. It depends on the operation and the typical duration of the events that delay it, so you might use different algorithms when you know the file copy involves a network drive. But until your estimate has been fairly consistent for a period of time equal to the lesser of 30 seconds or 10% of the estimated time, display "oh dear, there seems to be some kind of holdup" when it has massively slowed, or just ignore the change if it has massively sped up.
For example, dialog messages taken at 1-second intervals when a connection briefly stalls:
remaining: 60 seconds // estimate is 60 seconds
remaining: 59 seconds // estimate is 59 seconds
remaining: delayed [was 59 seconds] // estimate is 12 hours
remaining: delayed [was 59 seconds] // estimate is infinity
remaining: delayed [was 59 seconds] // got data: estimate is 59 seconds
// six seconds later
remaining: 53 seconds // estimate is 53 seconds
Most of all I would never display seconds (only hours and minutes). I think it's really frustrating when you sit there and wait for a minute while the timer jumps between 10 and 20 seconds. And always display real information like: xxx/yyyy MB copied.
I would also include something like this:
if timeLeft > 5h --> Inform user that this might not work properly
if timeLeft > 10h --> Inform user that there might be better ways to move the file
if timeLeft > 24h --> Abort and check for problems
I would also inform the user if the estimated time varies too much
And if it's not too complicated, there should be an auto-check function that checks if the process is still alive and working properly every 1-10 minutes (depending on the application).
Speaking of network file copy, the best thing is to calculate the file size to be transferred, the network response time, etc. An approach that I used once was:
Connection speed - ping and calculate the round-trip time for packets of 15 KB.
Get the file size and work out, theoretically, how much time it would take if I broke it into 15 KB packets at my connection speed.
Recalculate my connection speed after the transfer has started and adjust the time that will be spent.
I've been pondering this one myself. I have a copy routine - via a Windows Explorer style interface - which allows the transfer of selected files from an Android device to a PC.
At the start, I know the total size of the file(s) to be copied, and as I am using C#.NET, I use a Stopwatch to get the elapsed time; while the copy is in progress, I keep a running total, in bytes, of what has been copied so far.
I haven't actually tested it yet, but the best way seems to be this -
estimated = elapsed * ((totalSize - copiedSoFar) / copiedSoFar)
I never saw it the way you guys are explaining it - by transferred bytes and total bytes.
The "experience" always made a lot more sense (not that it was good or accurate) if you assume it instead uses the bytes of each file and the file count. That would explain how the estimate swings so wildly.
If you are transferring large files first, the estimate goes long, even with the connection steady. It is as if it naively assumes that all files are the average size of those transferred so far, and then guesses on the assumption that this average file size will hold for the rest of the transfer.
This, and the other ways, all get worse when the connection 'speed' varies...