My senior developer and I witnessed a Lookup Transformation sending a random number of rows. For example: 150 rows are read from the CSV but only 90 rows, or sometimes 135 rows, are sent to the database.
For test purposes we are only dealing with 150 rows, but in deployment anywhere from 1,000 to 10,000 rows are expected.
We noticed that when the lookup's cache mode setting was changed among No Cache, Partial Cache, and Full Cache, only Full Cache transferred the full 150-row count to the database, matching the 150 rows sent as input to the Lookup Transformation (results were as expected). We also noticed that computer B, which has higher specs than computer A (the machine showing the problem), produced the expected results consistently.
Can anyone advise on this issue?
Recently we noticed that this issue only occurred with the originally generated CSV; after editing the file in Excel and re-saving it, the results were fine.
I figured it out. It turns out I had horribly misconceived the idea that matching only one column to a related column in the Lookup Transformation was sufficient to mark a row as unique. All the rows in that first column contained the same data.
So after I matched all the input columns to their corresponding lookup columns, I finally got the expected results.
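The pitfall can be illustrated outside SSIS. This Python sketch (with invented data, purely for illustration) shows why matching on a single column whose values are all identical cannot distinguish rows, while a composite key over all columns can:

```python
# Hypothetical rows: the first column is identical everywhere,
# so it cannot serve as a unique match key on its own.
rows = [
    ("ACME", "2015-01-01", 10),
    ("ACME", "2015-01-02", 20),
    ("ACME", "2015-01-03", 30),
]

# Matching on the first column only: every row collapses to one key.
single_key = {r[0] for r in rows}
print(len(single_key))     # 1 distinct "match" for 3 rows

# Matching on all columns (what the fix amounted to): each row is unique.
composite_key = {r for r in rows}
print(len(composite_key))  # 3 distinct matches
```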
I think my partial excuse is that the random results kept me from trying the very approach billinkc suggested. Strangely, I am still not sure how attaching that data viewer to the Match Output stopped the Lookup Transformation from sending rows to the No Match Output, which would have made better sense.
Thanks for all your suggestions regarding debugging the problem, and my apologies for this silly mistake.
I have a table with 27 columns and 300,000 rows of data, of which 8 columns are filled with 0, 1, or NULL. Using LabVIEW I get the total count of each of these columns using the following query:
select
d_1_result,
d_2_value_1_result,
de_2_value_2_result,
d_3_result,
d_4_value_1_result,
d_4_value_2_result,
d_5_result
from Table_name_vp
where ( insp_time between
"15-02-02 06:00:00" and "15-02-02 23:59:59" or
inspection_time between "15-02-03 00:00:00" and "15-02-03 06:00:00")
and partname = "AbvQuene";
This query runs for the number of days the user inputs, for example 120 days.
I found that the total time taken by the query is 8 seconds, which is not good.
I want to reduce the time to 8 milliseconds.
I have also changed the engine to MyISAM.
Any suggestions to reduce the time consumed by the query? (The LabVIEW processing is not what is taking the time.)
It depends on the data, and on how many of the 300,000 rows are actually selected by your WHERE clause. Obviously, if all 300,000 are included, the whole table will need to be read. If it's a smaller number of rows, an index on insp_time or inspection_time (is this just a typo, or are these actually the same field?) and/or partname might help. The exact index will depend on your data.
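As a rough illustration of the index advice (using SQLite rather than MySQL, and only a subset of the columns, purely as a sketch): with a composite index on (partname, insp_time), the engine can seek directly to the matching rows instead of scanning the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table_name_vp (insp_time TEXT, partname TEXT, d_1_result INTEGER)")
conn.executemany(
    "INSERT INTO Table_name_vp VALUES (?, ?, ?)",
    [("15-02-02 %02d:00:00" % (i % 24), "AbvQuene" if i % 3 == 0 else "Other", i % 2)
     for i in range(100_000)],
)

# Without this index, the WHERE clause below scans all 100,000 rows;
# with it, the engine seeks on partname and range-scans insp_time.
conn.execute("CREATE INDEX idx_part_time ON Table_name_vp (partname, insp_time)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT d_1_result FROM Table_name_vp "
    "WHERE partname = 'AbvQuene' AND insp_time BETWEEN '15-02-02 06:00:00' AND '15-02-02 23:59:59'"
).fetchall()
print(plan)  # the plan should mention idx_part_time rather than a full table scan
```

MySQL's equivalent would be a CREATE INDEX on the real table, checked with EXPLAIN.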
Update 2:
I can't see any reason why you wouldn't be able to load your whole DB into memory, because it should be less than 60 MB. Do you agree?
Please post your answers to the following questions (you can edit a question after you have asked it; that's easier than commenting).
Next steps:
I should have mentioned this before: before running a query in LabVIEW, I would always test it first using your DB admin tool (e.g. MySQL Workbench). Please post whether that worked or not.
Post your LabVIEW code.
You can try running your query with fewer than 300K rows, say 50K, and see how much your memory use increases. If there's some limitation on how many rows you can query at one time, then you can break your giant query into smaller ones pretty easily and just add up the result sets. I can post an example if needed.
Update:
It sounds like there's something wrong with your schema.
For example, if you had 27 columns of doubles and datetimes (both are 8 bytes each), your total DB size would only be about 60 MB (300K * 27 * 8 / 1048576).
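The arithmetic behind that estimate:

```python
# 300K rows x 27 columns x 8 bytes per value, converted to MB.
rows, cols, bytes_per_value = 300_000, 27, 8
size_mb = rows * cols * bytes_per_value / 1048576
print(round(size_mb, 1))  # about 61.8, i.e. roughly 60 MB
```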
Please post your schema for further help (you can use SHOW CREATE TABLE tablename).
8 milliseconds is an extremely low time; I assume that's being driven by some type of hardware timing requirement? If not, please explain that requirement, as a typical user requirement is around 1 second.
To get the response time that low you will need to do the following:
Query the DB at the start of your app and load all 300,000 rows into memory (e.g. a LabVIEW array)
Update the array with new values (e.g. array append)
Run the "query" against the array (e.g. using a for loop with a case select)
On a separate thread (i.e. a LabVIEW "loop") insert the new records into the database, or do it right before the app closes
This approach assumes that only one instance of the app is running at a time because synchronizing database changes across multiple instances will be very hard with that timing requirement.
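The in-memory approach above (minus the LabVIEW specifics) can be sketched in Python; the field names and values here are invented for illustration:

```python
# Hypothetical in-memory store: load all rows once at startup,
# then "query" them with plain iteration instead of hitting the DB.
rows = [
    {"insp_time": "15-02-02 07:00:00", "partname": "AbvQuene", "d_1_result": 1},
    {"insp_time": "15-02-02 08:00:00", "partname": "Other",    "d_1_result": 0},
    {"insp_time": "15-02-03 05:00:00", "partname": "AbvQuene", "d_1_result": 1},
]
pending_inserts = []  # flushed to the DB on a separate thread / at shutdown

def query(start, end, part):
    # Equivalent of the WHERE clause: a linear scan over the in-memory array.
    return sum(r["d_1_result"] for r in rows
               if start <= r["insp_time"] <= end and r["partname"] == part)

def append(row):
    rows.append(row)             # visible to queries immediately
    pending_inserts.append(row)  # persisted later, off the hot path

print(query("15-02-02 06:00:00", "15-02-02 23:59:59", "AbvQuene"))  # 1
```

A linear scan over 300,000 small records is typically well under 8 ms on modern hardware, which is the point of moving the query out of the database.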
When I run a query in Web Intelligence, I only get a part of the data.
But I want to get all the data.
The resulting data set I am retrieving from the database is quite large (10 million rows). However, I do not want 10 million rows in my report; I want to summarize it so that the report has at most 50 rows.
Why am I getting only a partial data set as the result of the WebI query?
(I also noticed that in the bottom right corner there is an exclamation mark indicating that I am working with a partial data set, and when I click refresh I still get the partial data set.)
BTW, I know I can see the SQL query when I build it using the query editor, but can I see the corresponding query when I make a certain report? If yes, how?
UPDATE: I have tried editing the 'Limit size of result set to:' option in the Query Options in the Business Layer, first by setting the value to 9,999,999 and then by unchecking the option. However, I am still getting the partial result.
UPDATE: I have checked the number of rows in the resulting set: it is 9.6 million. Now it's even more confusing why I'm not getting all the rows (the max number of rows was set to 9,999,999).
SELECT
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Description_TXT,
count(I_ATA_MV_FinanceTreasury.VWD_Party_A.Party_KEY)
FROM
I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A RIGHT OUTER JOIN
I_ATA_MV_FinanceTreasury.VWD_Party_A ON
(I_ATA_MV_FinanceTreasury.VWD_Segment_Value_A.Segment_Value_KEY=I_ATA_MV_FinanceTreasury.VWD_Party_A.Segment_Value_KEY)
GROUP BY 1
The "Limit size of result set" setting is a little misleading. You can choose an amount lower than the associated setting in the universe, but not higher. That is, if the universe is set to a limit of 5,000, you can set your report to a limit lower than 5,000, but you can't increase it.
Does your query include any measures? If not, and your query is set to retrieve duplicate rows, you will get an un-aggregated result.
If you're comfortable reading SQL, take a look at the report's generated SQL, and that might give you a clue as to what's going on. It's possible that there is a measure in the query that does not have an aggregate function (as it should).
While this may be a little off-topic, I personally would advise against loading that much data into a Web Intelligence document, especially if you're going to aggregate it to 50 rows in your report.
These are not the kind of data volumes WebI was designed to handle (regardless whether it will or not). Ideally, you should push down as much of the aggregation as possible to your database (which is much better equipped to handle such volumes) and return only the data you really need.
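The push-down advice can be demonstrated with a toy example (SQLite and an invented table here, purely as a sketch): when the database does the GROUP BY, only the summary rows travel to the report instead of the full result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE party (segment TEXT, party_key INTEGER)")
conn.executemany("INSERT INTO party VALUES (?, ?)",
                 [("Segment-%d" % (i % 5), i) for i in range(10_000)])

# Unaggregated: all 10,000 rows would be pulled into the document.
raw = conn.execute("SELECT segment, party_key FROM party").fetchall()

# Aggregated in the database: only 5 summary rows come back.
summary = conn.execute(
    "SELECT segment, COUNT(party_key) FROM party GROUP BY segment"
).fetchall()

print(len(raw), len(summary))  # 10000 5
```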
Have a look at this link, which contains some best practices. For example, slide 13 specifies that:
50.000 rows per document is a reasonable number
What you need to do is to add a measure to your query and make sure that this measure uses an aggregate database function (e.g. SUM()). This will cause WebI to create a SQL statement with GROUP BY.
Another alternative is to disable the option Retrieve duplicate rows. You can set this option by opening the data provider's properties.
My questions are: (1) is there a better strategy for solving my problem; (2) is it possible to tweak or improve my solution so that it works and doesn't split the aggregation, in a reliable manner; and (3, the least important) how can I debug this more intelligently? Figuring out what the aggregator is doing is difficult because it only fails on giant batches, which are hard to debug because of their size. Answers to any of these would be very useful, most importantly the first two.
I think the problem is that I'm not expressing to Camel correctly that it needs to treat the incoming CSV file as a single lump, and that I don't want the aggregator to stop until all the records have been aggregated.
I'm writing a route to digest a million-line CSV file, split it, then aggregate the data on some key primary fields, then write the aggregated records to a table.
Unfortunately, the primary-key constraints of the table (which also correspond to the aggregation keys) are being violated, implying that the aggregator is not waiting for the whole input to finish.
It works fine for small files of a few thousand records, but at the size it will actually face in production (1,000,000 records) it fails.
Firstly it fails with a Java heap memory error on the split after the CSV unmarshalling. I fixed this with .streaming(). This impacts the aggregator, which now 'completes' too early.
To illustrate:
A 1
A 2
B 2
--- aggregator split ---
B 1
A 2
--> A(3),B(2) ... A(2),B(1) = constraint violation, because there are 2 lots of A's etc., when what I want is A(5),B(3)
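The failure mode can be reproduced in a few lines of plain Python (no Camel), aggregating the example records in two chunks versus in one pass:

```python
from collections import Counter

records = [("A", 1), ("A", 2), ("B", 2), ("B", 1), ("A", 2)]

def aggregate(batch):
    totals = Counter()
    for key, value in batch:
        totals[key] += value
    return dict(totals)

# Aggregator completes mid-stream: two partial result sets, so keys
# "A" and "B" are each emitted twice -> primary-key constraint violation.
chunked = [aggregate(records[:3]), aggregate(records[3:])]
print(chunked)  # [{'A': 3, 'B': 2}, {'B': 1, 'A': 2}]

# Aggregator waits for the whole input: one row per key, as desired.
whole = aggregate(records)
print(whole)    # {'A': 5, 'B': 3}
```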
With examples of 100, 1,000, etc. records it works fine and correctly, but when it processes 1,000,000 records, which is the real size it needs to handle, the split() first throws an OutOfMemoryError (Java heap space).
I felt that simply increasing the heap size would be a short-term fix, just pushing the problem back until the next upper limit of records comes through, so I got around it by using .streaming() on the split.
Unfortunately, the aggregator is now being drip-fed the records instead of getting them in one big clump, and it seems to be completing early and starting another aggregation, which violates my primary-key constraint.
from("file://inbox")
    .unmarshal().bindy( ... )
    .split().body().streaming()
    .setHeader("X", /* expression building a string of the primary-key fields */)
    .aggregate(header("X"), ... ).completionTimeout(15000)
    // etc.
I think part of the problem is that I'm depending on the streaming split not stalling for longer than a fixed amount of time, which just isn't foolproof; e.g. a system task might reasonably cause this. Also, every time I increase this timeout, debugging and testing this stuff takes longer and longer.
Probably a better solution would be to read the size of the incoming CSV file and not allow the aggregator to complete until every record has been processed. I have no idea how I'd express this in Camel, however.
Very possibly I just have a fundamental strategy misunderstanding of how I should be approaching/describing this problem. There may be a much better (simpler) approach that I don't know about.
There is also such a large number of records going in that I can't realistically debug them by hand to get an idea of what's happening (and I suspect that when I do, I'm also breaking the timeout on the aggregator).
You can split the file first on a line-by-line basis, and then convert each line to CSV. That lets you run the splitter in streaming mode, and therefore have low memory consumption and be able to read a file with a million records.
There are some blog links on this page, http://camel.apache.org/articles, about splitting big files in Camel. They cover XML, but they are relevant to splitting big CSV files as well.
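Outside of Camel, the idea of "stream the lines, but hold the aggregation open until end of file" looks like this in Python (a sketch; real code would write the totals to the database instead of keeping a dict):

```python
from collections import defaultdict
import csv, io

# Simulate a large CSV arriving as a stream; only one line is held
# in memory at a time, but the per-key totals stay open until EOF.
data = io.StringIO("A,1\nA,2\nB,2\nB,1\nA,2\n")

totals = defaultdict(int)
for key, value in csv.reader(data):  # streaming: line by line
    totals[key] += int(value)

# Only after the entire file is consumed is anything "completed",
# so each key is written exactly once -> no constraint violations.
print(dict(totals))  # {'A': 5, 'B': 3}
```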
In my DB there are two large tables. The first one (A) has 1.7 million rows; the second one (B) has 2.1 million. Records in A and B are of fairly identical size.
I can do any operation on A. It takes time, but it works. On B, I can't do anything. Even a simple SELECT COUNT(*) just hangs forever. The problem is I don't see any error: it just hangs (when I show the process list, it just says "updating" forever).
It seems weird to me that the small delta (percentage-wise) between 1.7 and 2.1 million could make such a difference (from being able to do everything to not even being able to do the simplest operation).
Can there be some kind of 2-million-row hard limit?
I am on Linux 2.6+, and I use InnoDB.
Thanks!
Pierre
It appears to depend more on the amount of data in each row than on the total number of rows. If the rows contain little data, then the maximum number of rows returned will be higher than for rows with more data. Check this link for more info:
http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html
The row size (the number of bytes needed to store one row) might be much larger for the second table. COUNT(*) may require a full table scan, i.e. reading through the entire table on disk; larger rows mean more I/O and a longer time.
The presence/absence of indexes will likely make a difference too.
As I was saying in my initial post, the two tables were fairly similar, so the row size would be fairly close in both tables. That's why I was a bit surprised and started to think that maybe, somehow, a 2-million-row limit was set somewhere.
It turns out my table was corrupted. It is bizarre, since I was still able to access some records (using joins with other tables) and MySQL was not "complaining". I found out by doing a CHECK TABLE: it did not return any error, but it crashed mysqld every time...
Anyway, thank you all for your help on this.
Pierre
I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which, up to this point, have been stored in text files by the DOS programs that run the machines). I have decided to modernise and create a new database to store all the machine results.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas with all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucketloads of rows, or is it better to just add extra columns to represent each analyte, so that it generates only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases, the PK (RequestID and SampleNumber) remains the same, so I guess it's just a matter of what would load quicker. I know the multiple-column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW, the database is MS Jet / Access 2010.
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I were starting it out, I'd definitely choose a server database from the get-go. That's not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road, when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value
-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't needlessly duplicate all the rest of the data, you can avoid the join in queries where you don't need the AnalyteName, you can support extra analytes, and it is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.