I have Neo4j with a quite simple schema. There is only one type of node and one type of relationship that can bind nodes. Each node has one property (indexed) and each relationship has four properties. These are the numbers:
neo4j-sh (?)$ dbinfo -g "Primitive count"
{
"NumberOfNodeIdsInUse": 19713210,
"NumberOfPropertyIdsInUse": 109295019,
"NumberOfRelationshipIdsInUse": 44903404,
"NumberOfRelationshipTypeIdsInUse": 1
}
I run this database on a virtual machine with Debian, 7 cores and 26 GB of RAM. This is my Neo4j configuration:
neo4j.properties:
neostore.nodestore.db.mapped_memory=3000M
neostore.relationshipstore.db.mapped_memory=4000M
neostore.propertystore.db.mapped_memory=4000M
neostore.propertystore.db.strings.mapped_memory=300M
neostore.propertystore.db.arrays.mapped_memory=300M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseParallelGC
#wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.initmemory=2000
wrapper.java.maxmemory=10000
I use UseParallelGC instead of UseConcMarkSweepGC because I noticed that with UseConcMarkSweepGC only one CPU core is used during a query, whereas with UseParallelGC all cores are utilized. I do not run any queries in parallel, only one at a time in neo4j-shell, but mostly queries touching the whole set of nodes, for example:
match (n:User)-->(k:User)
return n.id, count(k) as degree
order by degree desc limit 100;
and it takes 726230 ms (roughly 12 minutes) to execute. I also tried:
match (n:User)-->()-->(k:User)
return n.id, count(DISTINCT k) as degree
order by degree desc limit 100;
but after a long time I get only "Error occurred in server thread; nested exception is: java.lang.OutOfMemoryError: GC overhead limit exceeded". I have not yet tried queries with restrictions on relationship properties, but that is also planned.
I think that my configuration is not optimal. I noticed that Neo4j uses at most 50% of system memory during a query and the remaining memory is free. I could change this by setting a larger value in wrapper.java.maxmemory, but I have read that I have to leave some memory for the mapped_memory settings. However, I am not sure they are taken into account, because during a query there is a lot of free memory. How should I set the configuration for such queries?
Your queries are global queries that get slower as the amount of data grows. For every user node the number of outgoing relationships is calculated, put into a collection and sorted by count. This kind of operation consumes a lot of CPU and memory. Instead of tweaking the config, I guess you're better off refactoring your graph model.
Depending on your use case, consider storing the degree of a user in a property on the user node. Of course, any operation adding or removing a relationship for a user needs to be reflected in the degree property. Additionally, you might want to index the degree property.
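A rough Cypher sketch of that approach (the degree property name and the one-off batch refresh below are illustrative; in production every relationship add/remove would have to adjust the property instead of recomputing it):

// recompute a denormalized degree property for every user (run offline / in chunks for ~20M nodes)
MATCH (n:User)
OPTIONAL MATCH (n)-->(k:User)
WITH n, count(k) AS degree
SET n.degree = degree;

// optionally index the property
CREATE INDEX ON :User(degree);

// the top-100 query then no longer has to count relationships per user
MATCH (n:User)
RETURN n.id, n.degree AS degree
ORDER BY degree DESC LIMIT 100;

Whether the index actually helps the ORDER BY depends on your Neo4j version; the main win is simply that the query reads a precomputed property instead of counting relationships.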
We have a web application backed by MySQL serving hundreds of queries per second. I'm looking for a way to measure the "cost" of every query in production. I'm imagining some option where, for every query, MySQL returns the query results along with the CPU and I/O cost of executing that query.
The end goal is to aggregate those costs by endpoint (e.g. "/search") and by the logged-in user ID. That way, when we're having issues with the site, we can quickly see if there's a particular action or user ID that is using up a large chunk of our MySQL resources.
Close but not quite (AFAICT):
This answer comes close: https://stackoverflow.com/a/12880997/163832
It describes the precision and accuracy problems with EXPLAIN and recommends an alternative that measures what actually happened rather than estimating what will happen.
The alternative does seem better for my use case, but there are still problems:
I looked at the available stats and can't find ones that measure CPU or I/O.
I don't think I can afford to do FLUSH STATUS and then SHOW SESSION STATUS ... on every query (sketched below).
This doesn't work when many queries are running concurrently.
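For reference, the per-query measurement from that answer looks roughly like this (status variable names and exact reset semantics vary a little between MySQL versions):

FLUSH STATUS;                           -- reset the session status counters
SELECT ...;                             -- the query being measured
SHOW SESSION STATUS LIKE 'Handler%';    -- row operations performed by the storage engine
SHOW SESSION STATUS LIKE 'Created%';    -- temporary tables created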
As per the attached, we have a Balanced Data Distributor set up in a data transformation covering about 2 million rows. The script tasks are identical - each one opens a connection to Oracle and executes first a delete and then an insert. (This isn't relevant, but it's done that way due to parameter issues with the OLE DB command and the Microsoft OLE DB provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current shot is from a 16 core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, BDD can make use of more than five paths. You shouldn't have to do anything special to force it; by definition it should do it for you automatically. Then why isn't it using more than 5 paths? Because your source is producing data faster than your destinations can consume it, causing backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to it's outputs, however many there are." In your setup there are 10 outputs, so input data should be distributed equally to all 10 outputs at the same time and you should see 10 paths executing at the same time - again, in theory.
But another aspect of BDD is that "instead of routing individual rows, the BDD operates on buffers on data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are used, each with the same number of rows. If additional buffers were started, you would have seen more paths being used. SSIS couldn't use additional buffers, and ultimately additional paths, because of a mechanism called backpressure; it happens when the source produces data faster than the destination can consume it. If that were allowed to continue, all memory would be used up by the source data and SSIS would not have any memory to use for the transformation and destination components. To avoid that, SSIS limits the number of active buffers. It is set to 5 (and can't be changed), which is exactly the number of threads you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I've discovered via this article on CodeProject.
[T]his component uses an internal buffer of 9,947 rows (as per the experiment, I found so) and it is pre-set. There is no way to override this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine thousand nine forty seven) rows in our input file and will observe the behavior. After running the package, we will find that all the rows are being transferred to the first output component and the other components received nothing.

Now let us increase the number of rows in our input file from 9,947 to 9,948 (Nine thousand nine forty eight). After running the package, we find that the first output component received 9,947 rows while the second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
"The output column "A" (67) on output "Output0" (5) and component "Data Flow Task" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance."
Please resolve my problem
These warnings indicate that you have columns in your data flow that are not used. A Data Flow works by allocating "buckets" of fixed-size memory, filling them with data from the source, and allowing the downstream components to access that memory directly to perform synchronous transformations.
Memory is a finite resource. If SSIS detects it has 1 GB to work with and one row of data will cost 4096 KB, then you could have at most 256 rows of data in the pipeline before running out of memory space. Those 256 rows would get split into N buckets of rows because, as much as possible, you want to perform set-based operations when working with databases.
Why does all this matter? SSIS detects whether you've used everything you've brought into the pipeline. If a column is never used, you're wasting memory. Instead of a single row costing 4096 KB, by excluding unused columns you reduce the amount of memory required for each row to 1024 KB, and now you can have 1024 rows in the pipeline just by taking only what you need.
How do you get there? In your data source, write a query instead of selecting a table. Don't use SELECT * FROM myTable; instead, explicitly enumerate all of the columns you need and nothing more. The same goes for Flat File Sources - uncheck the columns that are never used. You'll still pay a disk penalty for having to read the whole row in, but those columns don't have to hit your data flow and consume that memory. The same story for any Lookups - only query the data you need.
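For example, instead of SELECT * FROM myTable, the source query might look like this (the column names are made up for illustration):

SELECT OrderID, CustomerID, OrderDate   -- only the columns the data flow actually uses
FROM myTable;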
Asynchronous components are the last thing to be aware of as this has turned into a diatribe on performance. The above calculations are much like freshman calculus classes: assume a cow is a sphere to make the math easier. Asynchronous components result in your memory being split before and after the component. They radically change the shape of the rows going through a component such that downstream components can't reuse the address space above it. This results in a physical memory copy which is a slow operation.
My final comment, though, is that if your package is performing adequately and finishing in an acceptable time frame, then unless you have nothing else to do, leave it be and go on to your next task. These are just warnings and should not "grow up" into full-blown errors.
So I have a bit of a performance problem. I have made a Java program that constructs a database. The problem is when loading in the data: I am loading 5,000 files into a SQL database. When the program starts, it can process about 10% of the files in 10 minutes; however, it gets much slower as it progresses. Currently at 28%, it is going to finish in 16 hours at its current rate, and that rate is slowing down considerably.
My question is why the program gets progressively slower as it runs, and how to fix that.
EDIT: I have two versions. One is threaded (capped at 5 threads) and one is not; the difference between the two is negligible. I can post the code again if anyone likes, but I took it out because I am now fairly certain that the bottleneck is MySQL (also appropriately re-tagged). I went ahead and used batch inserts. This did cause an initial increase in speed, but once again, after processing about 30% of the data it drops off quickly.
So SQL Points
My Engine for all 64 tables is InnoDB version 10.
The tables have about 300k rows at this point (~30% of the data)
All tables have one "joint" (composite) primary key: an id and a date.
Looking at MySQL WorkBench I see that there is a query per thread (5 queries)
I am not sure of the unit of time (I'm just reading from MySQL Administrator), but the queries to check if a file is already inserted are taking ~300. (This query should be fast, as it is a SELECT MyIndex FROM MyTable WHERE Date = date LIMIT 1.) Since I have been starting and stopping the program, I built in this check to see if a file was already inserted. That way I am able to restart it after each change and see what improvement there is, if any, without starting the whole process again.
I am fairly certain that the degradation of performance is related to the tables' sizes. (I can stop and restart the program now and the process remains slow. It is only when the tables are small that the process goes at an acceptable speed.)
Please, please ask and I will post whatever information you need.
DONE! Well, I just let it run for the 4 days it needed. Thank you all for the help.
Cheers,
--Orlan
Q1: Why does the program get progressively slower?
In your problem space, you have two systems interacting: a producer that reads from the file system and produces data, and a consumer that transforms that data into records and stores them in the db. Your code currently hard-links these two processes, so your system works at the speed of the slower of the two.
In your program you have a fixed arrival rate (1/sec - the wait when you have more than 10 threads running). If you have indexes on the tables being filled, inserts will take longer as the tables grow bigger. That means that while your arrival rate is fixed at 1/sec, your exit rate is continuously decreasing. Therefore, you will be creating more and more threads that share the same CPU/IO resources and getting fewer things done per unit of time. Creating threads is also a very expensive operation.
Q2: Could it have to do with how I am constructing the queries from Strings?
Only partially. Your string manipulation is a fixed cost in the system; it increases the cost of servicing one request. But string operations are CPU-bound and your problem is I/O-bound, meaning that improving the string handling (which you should) will only marginally improve the performance of the system. (See Amdahl's Law.)
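(As a rough illustration of Amdahl's Law here: if string handling accounts for a fraction p of the total run time, then even making it infinitely fast speeds the whole load up by at most a factor of 1/(1 - p); for p = 0.1 that is only about an 11% improvement.)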
Q3: How to fix that (performance issue)?
Separate the file reader process from the db insert process. See the Producer-Consumer pattern. See also CompletionService for an implementation built into the JDK:
(FileReaderProducer) --> queue --> (DBBulkInsertConsumer)
Don't create new Threads yourself. Use the facilities provided by the java.util.concurrent package, like the executor service or the CompletionService mentioned above. For a "bare" thread pool, use the Executors factory.
For this specific problem, having 2 separate thread pools (one for the consumers, one for the producers) will allow you to tune your system for best performance. File reading improves with parallelization (up to your I/O bound), but db inserts do not (I/O + indexes + relational consistency checks), so you might need to limit the number of file-reading threads (3-5) to match the insertion rate (2-3). You can monitor the queue size to evaluate your system's performance.
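A minimal sketch of that split; parseFile and insertBatch are hypothetical stand-ins for your real code, and the bounded queue is what keeps the file readers from running ahead of the slower database writers:

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LoaderPipeline {
    // Bounded queue: put() blocks when it is full, throttling the producers (backpressure).
    private final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(50);
    private final ExecutorService readers = Executors.newFixedThreadPool(4); // file parsing
    private final ExecutorService writers = Executors.newFixedThreadPool(2); // db inserts

    public void load(List<Path> files) {
        for (int i = 0; i < 2; i++) {                  // one consumer task per writer thread
            writers.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        insertBatch(queue.take());     // blocks until a parsed file is available
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // real code would stop cleanly with a poison pill
                }
            });
        }
        for (Path file : files) {
            readers.submit(() -> {
                queue.put(parseFile(file));            // blocks when the queue is full (backpressure)
                return null;                           // Callable, so the checked exception is allowed
            });
        }
        // real code would also shut down both pools and await termination
    }

    private List<String> parseFile(Path file) { /* read and transform one file */ return new ArrayList<>(); }
    private void insertBatch(List<String> rows) { /* PreparedStatement batch, see below */ }
}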
Use JDBC bulk inserts: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/
Use StringBuilder instead of String concatenation. Strings in Java are immutable. That means that every time you do myString += ","; you are creating a new String and making the old String eligible for garbage collection. In turn, this increases the garbage collection penalty.
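Putting the last two points together, a batch insert with a PreparedStatement might look roughly like this (table and column names are placeholders for your real schema); using parameters also removes most of the string concatenation:

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    // Hypothetical table and columns; replace with your real schema.
    private static final String SQL = "INSERT INTO MyTable (id, date, value) VALUES (?, ?, ?)";

    static void insertBatch(Connection conn, List<Row> rows) throws Exception {
        conn.setAutoCommit(false);                      // one transaction per batch
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            for (Row r : rows) {
                ps.setLong(1, r.id);
                ps.setDate(2, r.date);
                ps.setDouble(3, r.value);
                ps.addBatch();
            }
            ps.executeBatch();                          // execute all buffered inserts as one batch
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    static final class Row {
        final long id;
        final Date date;
        final double value;
        Row(long id, Date date, double value) { this.id = id; this.date = date; this.value = value; }
    }
}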
You can also insert directly from the file into the database (read here). It works faster. When I did the same for Postgres I got a 20x performance increase.
Also, download the YourKit profiler and profile your application for performance. Then you will see what takes your time.
There are a number of things in your code that could contribute to the speed problems, and you are correct in suspecting that the Strings play a role.
Take for example this code:
String rowsString = "";
// - an extra 1 to not have a comma at the end
for (int i = 0; i <= numberOfRows - 3; i++) {
    rowsString += "(DATA), \n";
}
rowsString += "(DATA)";
Depending on how many rows there are, this is a potential bottleneck and memory hog. I think it's best if you use a StringBuilder here. I see a lot of String manipulation that is better suited to StringBuilder. Might I suggest you read up on String handling a bit and optimise these, especially where you += Strings?
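For illustration, the same loop with a StringBuilder (numberOfRows and the "(DATA)" placeholder are from the snippet above):

// builds the same string without creating a new String object on every iteration
StringBuilder rowsString = new StringBuilder();
for (int i = 0; i <= numberOfRows - 3; i++) {
    rowsString.append("(DATA), \n");
}
rowsString.append("(DATA)");
String result = rowsString.toString();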
Then the next question is: how is your table designed? There could be things making your inserts slow, like incorrect default lengths for varchars, no indexes, or too many indexes, etc.
Most databases load data more efficiently if:
you load in batches of data,
you load in a relatively small number of threads, e.g. one or two.
As you add more threads you add more overhead, so you expect it to be slower.
Try using an ExecutorService with a fixed-size pool, e.g. 2-4, and try loading the data in batches of, say, 100 at a time in a transaction.
You have several good tried and tested options for speeding up database access.
Use an ExecutorService for your threads. This may not help speed-wise but it will help you implement the following.
Hold a ThreadLocal Connection instead of making a new connection for every file (a sketch follows these points). Also, obviously, don't close it.
Create a single PreparedStatement instead of making a new one every time around.
Batch up your statement executions.
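For the ThreadLocal connection, a minimal sketch (the JDBC URL and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public final class Db {
    // One connection per worker thread, created lazily and reused for every file that thread processes.
    private static final ThreadLocal<Connection> CONNECTION = ThreadLocal.withInitial(() -> {
        try {
            return DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    });

    static Connection get() {
        return CONNECTION.get();
    }
}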
Let's say I query a table with 500K rows. I would like to begin viewing rows in the fetch buffer, which holds the result set, even though the query has not yet completed, and to scroll through the fetch buffer. If I scroll too far ahead, I want to display a message like: "REACHED LAST ROW IN FETCH BUFFER... QUERY HAS NOT YET COMPLETED".
Could this be accomplished using fgets() to read the fetch buffer while the query continues building the result set? Doing this implies multi-threading.
Can a feature like this, other than the FIRST ROWS hint directive, be provided in Oracle, Informix, MySQL, or other RDBMS?
The whole idea is to have the ability to start viewing rows before a long query completes, while displaying a counter of how many rows are available for immediate viewing.
EDIT: What I'm suggesting may require a fundamental change in a DB server's architecture, as to the way they handle their internal fetch buffers, e.g. locking up the result set until the query has completed, etc. A feature like the one I am suggesting would be very useful, especially for queries which take a long time to complete. Why have to wait until the whole query completes, when you could start viewing some of the results while the query continues to gather more results!
Paraphrasing:
I have a table with 500K rows. An ad-hoc query without a good index to support it requires a full table scan. I would like to immediately view the first rows returned while the full table scan continues. Then I want to scroll through the next results.
It seems that what you would like is some sort of system where there can be two (or more) threads at work. One thread would be busy synchronously fetching the data from the database, and reporting its progress to the rest of the program. The other thread would be dealing with the display.
In the meantime, I would like to display the progress of the table scan, example: "Searching...found 23 of 500,000 rows so far".
It isn't clear that your query will return 500,000 rows (indeed, let us hope it does not), though it may have to scan all 500,000 rows (and may well have only found 23 rows that match so far). Determining the number of rows to be returned is hard; determining the number of rows to be scanned is easier; determining the number of rows already scanned is very difficult.
If I scroll too far ahead, I want to display a message like: "Reached last row in look-ahead buffer...query has not completed yet".
So, the user has scrolled past the 23rd row, but the query is not yet completed.
Can this be done? Maybe like: spawn/exec, declare scroll cursor, open, fetch, etc.?
There are a couple of issues here. The DBMS (true of most databases, and certainly of IDS) keeps the current connection tied up processing the one statement. Obtaining feedback on how a query has progressed is difficult. You could look at the estimated number of rows to be returned when the query was started (information in the SQLCA structure), but those values are apt to be wrong. You'd have to decide what to do when you reach row 200 of 23, or when you only get to row 23 of 5,697. It is better than nothing, but it is not reliable. Determining how far a query has progressed is very difficult. And some queries require an actual sort operation, which means that it is very hard to predict how long they will take because no data is available until the sort is done (and once the sort is done, there is only the time taken to communicate between the DBMS and the application to hold up the delivery of the data).
Informix 4GL has many virtues, but thread support is not one of them. The language was not designed with thread safety in mind, and there is no easy way to retrofit it into the product.
I do think that what you are seeking would be most easily supported by two threads. In a single-threaded program like an I4GL program, there isn't an easy way to go off and fetch rows while waiting for the user to type some more input (such as 'scroll down the next page full of data').
The FIRST ROWS optimization is a hint to the DBMS; it may or may not give a significant benefit to the perceived performance. Overall, it typically means that the query is processed less optimally from the DBMS perspective, but getting results to the user quickly can be more important than the workload on the DBMS.
Somewhere down below in a much down-voted answer, Frank shouted (but please don't SHOUT):
That's exactly what I want to do, spawn a new process to begin displaying first_rows and scroll through them even though the query has not completed.
OK. The difficulty here is organizing the IPC between the two client-side processes. If both are connected to the DBMS, they have separate connections, and therefore the temporary tables and cursors of one session are not available to the other.
When a query is executed, a temporary table is created to hold the query results for the current list. Does the IDS engine place an exclusive lock on this temp table until the query completes?
Not all queries result in a temporary table, though the result set for a scroll cursor usually does have something approximately equivalent to a temporary table. IDS does not need to place a lock on the temporary table backing a scroll cursor because only IDS can access the table. If it was a regular temp table, there'd still not be a need to lock it because it cannot be accessed except by the session that created it.
What I meant by the 500K rows is nrows in the queried table, not how many results are expected to be returned.
Maybe a more accurate status message would be:
Searching 500,000 rows...found 23 matching rows so far
I understand that an accurate count of nrows can be obtained in sysmaster:sysactptnhdr.nrows?
Probably; you can also get a fast and accurate count with 'SELECT COUNT(*) FROM TheTable'; this does not scan anything but simply accesses the control data - probably effectively the same data as in the nrows column of the SMI table sysmaster:sysactptnhdr.
So, spawning a new process is not clearly a recipe for success; you have to transfer the query results from the spawned process to the original process. As I stated, a multithreaded solution with separate display and database access threads would work after a fashion, but there are issues with doing this using I4GL because it is not thread-aware. You'd still have to decide how the client-side code is going to store the information for display.
There are three basic limiting factors:
The execution plan of the query. If the execution plan has a blocking operation at the end (such as a sort or an eager spool), the engine cannot return rows early in the query execution. It must wait until all rows are fully processed, after which it will return the data as fast as possible to the client. The time for this may itself be appreciable, so this part could be applicable to what you're talking about. In general, though, you cannot guarantee that a query will have much available very soon.
The database connection library. When returning recordsets from a database, the driver can use server-side paging or client-side paging. Which is used can and does affect which rows will be returned and when. Client-side paging forces the entire query to be returned at once, reducing the opportunity for displaying any data before it is all in. Careful use of the proper paging method is crucial to any chance to display data early in a query's lifetime.
The client program's use of synchronous or asynchronous methods. If you simply copy and paste some web example code for executing a query, you will be a bit less likely to be working with early results while the query is still running; instead the method will block and you will get nothing until it is all in. Of course, server-side paging (see point #2) can alleviate this; however, in any case your application will be blocked for at least a short time if you do not specifically use an asynchronous method. For anyone reading this who is using .Net, you may want to check out Asynchronous Operations in .Net Framework.
If you get all of these right, and use the FAST FIRSTROW technique, you may be able to do some of what you're looking for. But there is no guarantee.
It can be done with an analytic function, but Oracle has to full-scan the table to determine the count no matter what you do if there's no index. An analytic function could simplify your query:
SELECT x,y,z, count(*) over () the_count
FROM your_table
WHERE ...
Each row returned will have the total count of rows returned by the query in the_count. As I said, however, Oracle will have to finish the query to determine the count before anything is returned.
Depending on how you're processing the query (e.g., a PL/SQL block in a form), you could use the above query to open a cursor, then loop through the cursor and display sets of records and give the user the chance to cancel.
I'm not sure how you would accomplish this, since the query has to complete prior to the results being known. No RDBMS (that I know of) offers any means of determining how many results to a query have been found prior to the query completing.
I can't speak factually for how expensive such a feature would be in Oracle because I have never seen the source code. From the outside in, however, I think it would be rather costly and could double (if not more) the length of time a query took to complete. It would mean updating an atomic counter after each result, which isn't cheap when you're talking millions of possible rows.
So I am putting up my comments in this answer.
In terms of Oracle:
Oracle maintains its own buffer cache inside the System Global Area (SGA) for each instance. The hit ratio on the buffer cache depends on its sizing and reaches 90% most of the time, which means 9 out of 10 reads are satisfied without going to disk.
Considering the above, even if there were a "way" (so to speak) to access the buffer cache for a query you run, the results would depend heavily on the cache sizing. If the buffer cache is too small, the cache hit ratio will be low and more physical disk I/O will result, which makes the buffer cache unreliable as a source of intermediate data. If the buffer cache is too big, then parts of it will be under-utilized and memory resources will be wasted, and you would spend too much unnecessary processing trying to peek into the buffer cache for the data you want.
Also, depending on your cache sizing and SGA memory, it would be up to the ODBC driver / optimizer to determine when and how much of each to use (cache buffering or direct disk I/O).
In terms of trying to access the "buffer cache" to find "the row" you are looking for, there might be a way (now or in the near future) to do it, but there would be no way to know whether what you are looking for ("the row") is there at all.
Also, full table scans of large tables usually result in physical disk reads and a lower buffer cache hit ratio. You can get an idea of full table scan activity at the data file level by querying v$filestat and joining to SYS.dba_data_files. Following is a query you can use:
SELECT A.file_name, B.phyrds, B.phyblkrd
FROM SYS.dba_data_files A, v$filestat B
WHERE B.file# = A.file_id
ORDER BY A.file_id;
Since this whole ordeal depends heavily on multiple parameters and statistics, the results of what you are looking for may remain a probability driven by those factors.