Logarithmically increasing execution time for each loop of a ForEach control - sql-server-2008

First, some background, I’m an SSIS newbie and I’ve just completed my second data-import project.
The package is very simple and consists of a data flow that imports a tab-separated customer values file of ~30,000 records into an ADO recordset variable, which in turn is used to drive a ForEach Loop Container that executes a piece of SQL, passing in values from each row of the recordset.
The import of the first ~21,000 records took 59 hours before it failed! The last ~9,000 took a further 8 hours. Yes, 67 hours in total!
The SQL consists of a check to determine whether the record already exists, a call to a procedure to generate a new password, and a final call to another procedure to insert the customer data into our system. The final procedure returns a recordset, but I'm not interested in the result, so I have simply ignored it. I don't know whether SSIS discards the recordset or not. I am aware that this is the slowest possible way of getting the data into the system, but I did not expect it to be this slow, nor to fail two thirds of the way through, and then again whilst processing the last ~9,000.
When I tested a ~3,000 record subset on my local machine, the Execute Package Utility reported that each insert was taking approximately 1 second. A bit of quick math suggested that the total import would take around 8 hours to run. That seemed like a long time, but one I had expected given all that I had read about SSIS and RBAR execution. I figured that the final import would be a bit quicker as the server is considerably more powerful. I am accessing the server remotely, but I wouldn't have expected this to be an issue, as I have performed imports in the past using bespoke C# console applications with simple ADO connections, and nothing has ever run anywhere near as slowly.
Initially the destination table wasn’t optimised for the existence check, and I thought this could be the cause of the slow performance. I added an appropriate index to the table to change the test from a scan to a seek, expecting that this would get rid of the performance issue. Bizarrely it seemed to have no visible effect!
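For reference, the index I added was along these lines (the name and exact column list here are illustrative rather than a copy of what is on the server; the column names are the ones used in the SQL shown further down):
-- Illustrative only: a narrow index so the existence check can seek rather than scan.
create nonclustered index IX_tblSubscriber_EmailProductSource
    on [dbo].[tblSubscriber] (strSubscriberEmail, ProductId, strTrialSource);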
The reason we use the sproc to insert the data into our system is for consistency. It represents the same route that the data takes if it is inserted into our system via our web front-end. The insertion of the data also causes a number of triggers to fire and update various other entities in the database.
What's been occurring during this import, though, and what has me scratching my head, is that the execution time for the SQL batch, as reported by the output of the Execute Package Utility, has been logarithmically increasing during the run. What starts out as a sub-one-second execution time ends up, over the course of the import, at greater than 20 seconds, and eventually the import package simply grinds to a complete halt.
I've searched all over the web multiple times, thanks Google, as well as StackOverflow, and haven’t found anything that describes these symptoms.
Hopefully someone out there has some clues.
Thanks
In response to ErikE: (I couldn’t fit this into a comment, so I've added it here.)
Erik, as per your request I ran the profiler over the database whilst putting the three-thousand-item test file through its paces.
I wasn't able to easily figure out how to get SSIS to inject something into the SQL that would show up as a distinct marker in the profiler, so I just ran the profiler for the whole run. I know there will be some overhead associated with this, but, theoretically, it should be more or less consistent over the run.
The duration on a per item basis remains pretty constant over the whole run.
Below is cropped output from the trace. In the run I did here, the first 800 rows overlapped previously entered data, so the system was effectively doing no work (yay indexes!). As soon as the index stopped being useful and the system was actually inserting new data, you can see the times jump accordingly, but they don't seem to change much, if at all, between the first and last elements; the number of reads is the largest figure.
------------------------------------------
| Item | CPU | Reads | Writes | Duration |
------------------------------------------
| 0001 |   0 |    29 |      0 |        0 |
| 0002 |   0 |    32 |      0 |        0 |
| 0003 |   0 |    27 |      0 |        0 |
| ...  |     |       |        |          |
| 0799 |   0 |    32 |      0 |        0 |
| 0800 |  78 |  4073 |     40 |      124 |
| 0801 |  32 |  2122 |      4 |       54 |
| 0802 |  46 |  2128 |      8 |      174 |
| 0803 |  46 |  2128 |      8 |      174 |
| 0804 |  47 |  2131 |     15 |      242 |
| ...  |     |       |        |          |
| 1400 |  16 |  2156 |      1 |       54 |
| 1401 |  16 |  2167 |      3 |       72 |
| 1402 |  16 |  2153 |      4 |       84 |
| ...  |     |       |        |          |
| 2997 |  31 |  2193 |      2 |       72 |
| 2998 |  31 |  2195 |      2 |       48 |
| 2999 |  31 |  2184 |      2 |       35 |
| 3000 |  31 |  2180 |      2 |       53 |
------------------------------------------
Overnight I also put the system through a full re-run of the import with the profiler switched on to see how things fared. It managed to get through a third of the import in 15.5 hours on my local machine. I exported the trace data to a SQL table so that I could get some statistics from it. Looking at the data in the trace, the delta between inserts increases by ~1 second per thousand records processed, so by the time it's reached record 10,000 it's taking 10 seconds per record to perform the insert. Don't bother critiquing the procedure: the SQL was written by the self-taught developer who was originally our receptionist, long before anyone with actual developer education was employed by the company. We are well aware that it's not good. The main thing is that I believe it should execute at a constant rate, and it very obviously doesn't. The actual code being executed for each record is below.
if not exists
(
    select 1
    from [dbo].[tblSubscriber]
    where strSubscriberEmail = @EmailAddress
    and ProductId = @ProductId
    and strTrialSource = @Source
)
begin
    declare @ThePassword varchar(20)
    select @ThePassword = [dbo].[DefaultPassword]()
    exec [dbo].[MemberLookupTransitionCDS5]
        @ProductId
        ,@EmailAddress
        ,@ThePassword
        ,NULL        --IP Address
        ,NULL        --BrowserName
        ,NULL        --BrowserVersion
        ,2           --blnUpdate
        ,@FirstName  --strFirstName
        ,@Surname    --strLastName
        ,@Source     --strTrialSource
        ,@Comments   --strTrialComments
        ,@Phone      --strSubscriberPhone
        ,@TrialType  --intTrialType
        ,NULL        --Redundant MonitorGroupID
        ,NULL        --strTrialFirstPage
        ,NULL        --strTrialRefererUrl
        ,30          --intTrialSubscriptionDaysLength
        ,0           --SourceCategoryId
end
GO
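For reference, the deltas shown next were worked out from the exported trace data with something along these lines; dbo.ImportTrace and its columns are placeholder names for whatever the export produced, so treat this as a sketch rather than the exact query I ran.
-- Sketch only: gap in milliseconds between consecutive statement start times.
;with Ordered as
(
    select StartTime,
           row_number() over (order by StartTime) as RowNum
    from dbo.ImportTrace        -- placeholder name for the exported trace table
)
select cur.RowNum as [Row],
       datediff(millisecond, prev.StartTime, cur.StartTime) as DeltaMs
from Ordered as cur
join Ordered as prev
    on prev.RowNum = cur.RowNum - 1
order by cur.RowNum;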
Results of determining the difference in time between each execution (cropped for brevity).
----------------------
| Row   | Delta (ms) |
----------------------
|   500 |        510 |
|  1000 |        976 |
|  1500 |       1436 |
|  2000 |       1916 |
|  2500 |       2336 |
|  3000 |       2816 |
|  3500 |       3263 |
|  4000 |       3726 |
|  4500 |       4163 |
|  5000 |       4633 |
|  5500 |       5223 |
|  6000 |       5563 |
|  6500 |       6053 |
|  7000 |       6510 |
|  7500 |       6926 |
|  8000 |       7393 |
|  8500 |       7846 |
|  9000 |       8503 |
|  9500 |       8820 |
| 10000 |       9296 |
| 10500 |       9750 |
----------------------

Let's take this in steps:
Advice: Isolate whether it is a server issue or a client one. Run a trace and see how long the first insert takes compared to the 3,000th. Include in the SQL statements some difference on the 1st and 3,000th iterations that can be filtered for in the trace, so it is not capturing the other events. Try to avoid the statement completion events; use batch or RPC completion instead.
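For example (just a sketch; the marker text and the table the trace is saved to are placeholder names), tagging the statement with a comment built from the loop's iteration counter would let you pull those two iterations out of a saved trace:
/* TRACE_MARKER iteration 0001 */   -- prepend to the batch, building the number from the SSIS loop variable
-- ...the existing "if not exists / exec" batch runs here unchanged...

-- Then, with the trace saved to a table (dbo.ImportTrace is a placeholder name):
select TextData, CPU, Reads, Writes, Duration
from dbo.ImportTrace
where cast(TextData as nvarchar(max)) like '%TRACE_MARKER iteration 0001%'
   or cast(TextData as nvarchar(max)) like '%TRACE_MARKER iteration 3000%';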
Response: The recorded CPU, reads, and duration from your profiler trace are not increasing, but the actual elapsed/effective insert time is.
Advice: Assuming that the above pattern holds true through the 10,000th insert (please advise if different), my best guess is that some blocking is occurring, maybe something like a constraint validation that is doing a nested loop join, which would scale logarithmically with the number of rows in the table just as you are seeing. Would you please do the following:
Provide the full execution plan of the INSERT statement using SET SHOWPLAN_TEXT ON (there is a sketch of how to capture this after this list).
Run a trace on the Blocked Process Report event and report on anything interesting.
Read Eliminating Deadlocks Caused by Foreign Keys with Large Transactions and let me know if this might be the cause or if I am barking up the wrong tree.
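Roughly what I have in mind for the first two items; the threshold value is only an example, and the plan capture compiles the batch without actually running it:
-- Item 1: capture the text execution plan of the insert batch.
set showplan_text on;
go
-- ...paste the "if not exists / exec" batch here; with showplan on it is compiled, not executed...
go
set showplan_text off;
go

-- Item 2: raise a Blocked Process Report event whenever a block lasts longer
-- than 5 seconds (5 is just an example threshold), then trace that event.
exec sp_configure 'show advanced options', 1;
reconfigure;
exec sp_configure 'blocked process threshold', 5;
reconfigure;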
If none of this makes progress on the problem, simply update your question with any new information and comment here, and I'll continue to do my best to help.

Related

Using an SQL View to dynamically place field data in buckets

I have a complex(?) SQL query that I need to build. We have an application that captures a simple data set for multiple clients:
ClientID | AttributeName | AttributeValue | TimeReceived
------------------------------------------------------------
002      | att1          | 123.98         | 23:02:00 02-03-2017
------------------------------------------------------------
003      | att2          | 987.2          | 23:02:00 02-03-2017
I need to be able to return a single record per client that looks something like this
Attribute | Hour_1 | Hour_2 | Hour_x |
--------------------------------------
att1      | 120.67 |
--------------------------------------
att2      | 10     | 89.3   |
The hours are to be determined by a time provided to the query. If the time was 11:00 on 02-03-2017, then hour 1 would be from 10-11 on 02-03-2017, and hour 2 from 9-10 on 02-03-2017. Attributes will be allocated to these hourly buckets based on the hour/date in their timestamp (not all buckets will have data). There will be a limit on the number of hours allocated in a single query. In summary, there are possibly 200-300 attributes and hourly blocks of up to 172 hours. To be honest I am not really sure where to start in building a query like this. Any guidance appreciated.
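To make the bucketing rule concrete, the per-record calculation I have in mind is roughly the following (SQL Server style syntax; the table and variable names mirror the sample above and are placeholders):
-- Sketch only: assign each record to an hour bucket counted back from @QueryTime,
-- which is assumed to be supplied on the hour as in the example above.
declare @QueryTime datetime = dateadd(hour, datediff(hour, 0, getdate()), 0);  -- stands in for the supplied time

select ClientID,
       AttributeName,
       AttributeValue,
       datediff(hour, TimeReceived, @QueryTime) as HourBucket   -- 1 = the hour ending at @QueryTime
from CapturedData                                               -- placeholder name for the table above
where TimeReceived < @QueryTime
  and datediff(hour, TimeReceived, @QueryTime) <= 172;          -- cap on the number of buckets
Pivoting those buckets out into Hour_1, Hour_2, ... columns per client/attribute would then presumably be conditional aggregation (or PIVOT), which is the part I am least sure how to structure.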

MySQL Queries taking too long to load

I have a database table with 300,000 rows, 113.7 MB in size. My database is running on Ubuntu 13.10 with 8 cores and 8 GB of RAM. As things are now, the MySQL server uses up an average of 750% CPU and 6.5% MEM (results obtained by running top in the CLI). It also runs on the same server as the Apache2 web server.
Here's what I get on the Mem line:
Mem: 8141292k total, 6938244k used, 1203048k free, 211396k buffers
When I run: show processlist; I get something like this in return:
2098812 | admin | localhost | phpb | Query | 12 | Sending data | SELECT * FROM items WHERE thumb = 'Halloween 2013 Horns/thumbs/Halloween 2013 Horns (Original).png'
2098813 | admin | localhost | phpb | Query | 12 | Sending data | SELECT * FROM items WHERE thumb = 'Halloween 2013 Witch Hat/thumbs/Halloween 2013 Witch Hat (Origina
2098814 | admin | localhost | phpb | Query | 12 | Sending data | SELECT * FROM items WHERE thumb = 'Halloween 2013 Blouse/thumbs/Halloween 2013 Blouse (Original).png
2098818 | admin | localhost | phpb | Query | 11 | Sending data | SELECT * FROM items WHERE parent = 210162 OR auto = 210162
Some queries are taking in excess of 10 seconds to execute. This is not the top of the list, but somewhere in the middle, just to give some perspective on how many queries are stacking up in this list. I feel that it may have something to do with my query cache configuration. Here is the output from running SHOW STATUS LIKE 'Qc%';
+-------------------------+----------+
| Variable_name | Value |
+-------------------------+----------+
| Qcache_free_blocks | 434 |
| Qcache_free_memory | 2037880 |
| Qcache_hits | 62580686 |
| Qcache_inserts | 10865474 |
| Qcache_lowmem_prunes | 4157011 |
| Qcache_not_cached | 3140518 |
| Qcache_queries_in_cache | 1260 |
| Qcache_total_blocks | 4440 |
+-------------------------+----------+
I noticed that Qcache_lowmem_prunes seems a bit high; is this normal?
I've been searching around StackOverflow, but I couldn't find anything that would solve my problem. Any help with this would be greatly appreciated, thank you!
This is probably one for http://dba.stackexchange.com. That said...
Why are your queries running slow? Do they return a large result set, or are they just really complex?
Have you tried running one of these queries using EXPLAIN SELECT column FROM ...?
Are you using indexes correctly?
How have you configured MySQL in your my.cnf file?
What table types are you using?
Are you getting any errors?
Edit: Okay, looking at your query examples: what data type is items.thumb? VARCHAR, TEXT? Is it really not possible to query this table by some method other than literal text matching (e.g. an ID number)? Does this column have an index?
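To be concrete, the kind of checks I mean look roughly like this (the index name and prefix length are only examples; adjust to your actual schema):
-- See how MySQL resolves one of the slow queries; type=ALL means a full table scan.
EXPLAIN SELECT * FROM items
WHERE thumb = 'Halloween 2013 Horns/thumbs/Halloween 2013 Horns (Original).png';

-- If thumb has no index, a (possibly prefixed) index may help; name and prefix are examples.
ALTER TABLE items ADD INDEX idx_items_thumb (thumb(100));

-- Compare the cache counters above with the configured cache size.
SHOW VARIABLES LIKE 'query_cache%';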

very poor performance from MySQL INNODB

Lately my MySQL 5.5.27 has been performing very poorly. I have changed just about everything in the config to see if it makes a difference, with no luck. Tables are constantly getting locked up, reaching 6-9 locks per table. My select queries take forever: 300-1200 seconds.
Moved Everything to PasteBin because it exceeded 30k chars
http://pastebin.com/bP7jMd97
SYS ACTIVITIES
90% UPDATES AND INSERTS
10% SELECT
My slow query log is backed up. Below is my MySQL info. Please let me know if there is anything I should add that would help.
Server version 5.5.27-log
Protocol version 10
Connection XX.xx.xxx via TCP/IP
TCP port 3306
Uptime: 21 hours 39 min 40 sec
Uptime: 78246 Threads: 125 Questions: 6764445 Slow queries: 25 Opens: 1382 Flush tables: 2 Open tables: 22 Queries per second avg: 86.451
SHOW OPEN TABLES
+----------+---------------+--------+-------------+
| Database | Table         | In_use | Name_locked |
+----------+---------------+--------+-------------+
| aridb    | ek            |      0 |           0 |
| aridb    | ey            |      0 |           0 |
| aridb    | ts            |      4 |           0 |
| aridb    | tts           |      6 |           0 |
| aridb    | tg            |      0 |           0 |
| aridb    | tgle          |      2 |           0 |
| aridb    | ts            |      5 |           0 |
| aridb    | tg2           |      1 |           0 |
| aridb    | bts           |      0 |           0 |
+----------+---------------+--------+-------------+
I've hit a brick wall and need some guidance. thanks!
From looking through your log, it would seem the problem (as I'm quite sure you're aware) is due to the huge number of locks that are present, given the amount of data being updated/selected/inserted, possibly all at the same time.
It is really hard to give performance tips without first knowing lots of information which you don't provide, such as table sizes, schema, hardware, config, topology, etc.; SO probably isn't the best place for such a broad question anyway!
I’ll keep my answer as generic as I can, but possible things to look at or try would be:
Run EXPLAIN on the select queries and make sure they are finding data selectively rather than performing full table scans or churning through huge amounts of unneeded data (see the sketch after this list)
Leave the server to do its inserts and updates, but create a read replica for reporting; this way the reporting reads won't lock the live data
If you're updating many rows at a time, think about updating with a LIMIT so that less data is locked at once (also sketched below)
If you are able to, delay the inserts to relieve pressure
Look at a hardware fix such as solid state disks for IO performance, and more memory so that more indexes/data can be held in memory or a larger buffer can be used
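As a rough illustration of the EXPLAIN and LIMIT points above (the table and column names here are invented, purely to show the shape of it):
-- Invented names: check how a slow SELECT is resolved; look for full scans
-- and large "rows" estimates in the output.
EXPLAIN SELECT order_id, status
FROM orders
WHERE customer_id = 12345;

-- Invented names: update in smaller chunks so each statement locks fewer rows;
-- repeat the statement until it affects 0 rows.
UPDATE orders
SET status = 'archived'
WHERE status = 'closed'
LIMIT 1000;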

Comparison between friend-of-friend-of-friend-of... relationships in MySQL and Neo4J

To see the advantages of using Neo4J for friend relationships, I created a MySQL database with one table for the persons ("Persons", 20,900 rows):
id | name
--------------
1 | Peter
2 | Max
3 | Sam
... | ...
20900 | Rudi
and one table for the relationships ("Friendships", each person with 50 to 100 friends):
personen_id_1 | personen_id_2
-------------------------
1 | 2
1 | 3
2 | 56
... | ...
20900 | 201
So there are around 1.2 million relationships.
Now I want to know the friends-of-friends-of-friends-of-friends of the person with id=1, so I crafted a query like this:
select distinct P.name
from Friendships f
join Friendships f2 ON f.personen_id_2 = f2.personen_id_1
join Friendships f3 ON f2.personen_id_2 = f3.personen_id_1
join Friendships f4 ON f3.personen_id_2 = f4.personen_id_1
join Persons P ON f4.personen_id_2 = P.id
where f.personen_id_1 = 1
The query took around 30 seconds for user id 1.
In Neo4J I created one node per person (20,900 nodes) with a name property. All nodes were connected according to the Friendships table in MySQL, so again there are 1.2 million relationships.
To get the same friend set here, I typed into Gremlin:
gremlin> g.v(1).outE.inV.loop(2){ it.loops <= 4 }.name.dedup.map()
This took around 1 minute. I didn't expect this at all!
So is my comparison correct? And if so, how should I modify this example to show the advantages of using Neo4j for this task?
I'm not overly familiar with Gremlin, but I generated a similar sized dataset (stats below) and ran an equivalent query in Cypher:
START person=node:user(name={name})
MATCH person-[:FRIEND]-()-[:FRIEND]-()-[:FRIEND]-()-[:FRIEND]-friend
RETURN friend.name AS name
I ran this 1000 times against the dataset, each time picking a different user as the starting point. I didn't warm the cache before running the tests, so this was from a standing start. Average response time: 33 ms.
Running on a MacBook Pro, 2.2 GHz Intel Core i7, 8 GB RAM, 4 GB heap
Here are the graph stats:
+----------------------------------------------+
| user | 20900 |
+----------------------------------------------+
| | Average | High | Low |
+----------------------------------------------+
| FRIEND |
+----------------------------------------------+
| OUTGOING | 74 | 100 | 48 |
| incoming | 74 | 123 | 31 |
+----------------------------------------------+
+----------------------------------------------+
| _UNKNOWN | 1 |
+----------------------------------------------+
| | Average | High | Low |
+----------------------------------------------+
+----------------------------------------------+
| Totals |
+----------------------------------------------+
| Nodes | 20901 |
| Relationships | 1565787 |
+----------------------------------------------+
| FRIEND | 1565787 |
+----------------------------------------------+
If you know you are doing 4 loops, do this:
g.v(1).out.out.out.out.name.dedup.map
There is a known semantic bug in Gremlin where loop() will turn into a breadth-first query.
https://github.com/tinkerpop/pipes/issues/25
Moreover, don't do outE.inV if you don't need to; the equivalent is out. Also, realize you are doing a 4-step search, which is a massive computation (combinatorial explosion). This is something that graph databases are not good at. You will want to look at a batch analytics framework like Faunus for this -- http://thinkaurelius.github.com/faunus/. For a reason why, see http://thinkaurelius.com/2012/04/21/loopy-lattices/
Graph databases are optimized for local traversals; by 4 steps you have (most likely) touched your entire dataset, and that "get get get" style of database access is not efficient.
HTH,
Marko.

MS Access: Any Ideas on How to Create an Excel-Like Form?

Objective: Convert an overgrown Excel sheet into an Access database, but maintain a front-end that is familiar and easy to use.
There are several aspects to this, but the one I'm stuck on is one of the input forms. I'm not going to clutter this question with the back-end implementation that I have already tried because I'm open to changing it. Currently, an Excel spreadsheet is used to input employee hour allocations to various tasks. It looks something like the following.
Employee | Task | 10/03/10 | 10/10/10 | 10/17/10 | 10/24/10 | ... | 12/26/11
---------------------------------------------------------------------------------
Doe, John | Code | 16 | 16 | 20 | 20 | ... | 40
---------------------------------------------------------------------------------
Smith, Jane | Code | 32 | 32 | 16 | 32 | ... | 32
---------------------------------------------------------------------------------
Doe, John | Test | 24 | 24 | 20 | 20 | ... | 0
---------------------------------------------------------------------------------
Smith, Jane | Test | 0 | 0 | 16 | 0 | ... | 0
---------------------------------------------------------------------------------
Smith, Jane | QA | 8 | 8 | 8 | 8 | ... | 8
---------------------------------------------------------------------------------
TOTAL | 80 | 80 | 80 | 80 | ... | 80
Note that there are fifteen months of data on the sheet and that employee allocations are entered for each week of those fifteen months. Currently, at the end of the fifteen months, a new sheet is created, but the database should maintain this data for historical purposes.
Does anyone have any ideas on how to create an editable form/datasheet that has the same look and feel? If not, how about an alternative solution that still provides the user a quick glance at all fifteen months and allows easy editing of the data? What would the back-end tables look like for your proposed solution?
This is a classic de-normalization problem.
To produce an editable spreadsheet-like view of your database, you'll need a table with 66 columns (the two identifying columns and 64 weekly integer columns). The question is whether you want the permanent storage of the data to use this table, or to use a normalized table with four columns (the two identifiers, the week-starting date, and the integer hours value).
I would give serious consideration to storing the data in the normalized form, then converting (using a temporary table) into the denormalized form, allowing the user to print/edit the data, and then converting back to normal form.
Using this technique you get the following benefits:
The ability to support rolling windows into the data (with 66 columns, you will see a specified 15 month period, and no other). With a rolling window you can show them last month and the next 14 months, or whatever.
It will be substantially easier to do things like check people's total hours per month, or compare the hours spent in testing vs QA for an arbitrary range of dates.
Of course, you have to write the code to translate between the normal and denormalized forms, but that should be pretty straightforward.
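By way of illustration, the normalized table I have in mind is something like this (the names are placeholders, and the DDL is written in generic SQL, so the exact type names and date literals will differ slightly in Access):
-- Placeholder names: one row per employee, task and week instead of 60-odd weekly columns.
CREATE TABLE EmployeeTaskHours
(
    EmployeeID     INTEGER NOT NULL,
    TaskID         INTEGER NOT NULL,
    WeekStarting   DATE    NOT NULL,
    HoursAllocated INTEGER NOT NULL,
    PRIMARY KEY (EmployeeID, TaskID, WeekStarting)
);

-- Checking someone's total hours for a month then becomes a simple aggregate
-- rather than a sum across the weekly columns.
SELECT EmployeeID, SUM(HoursAllocated) AS TotalHours
FROM EmployeeTaskHours
WHERE WeekStarting BETWEEN '2010-10-03' AND '2010-10-31'
GROUP BY EmployeeID;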