I'm relatively new to Talend OSDI. I have managed to run simple queries against MySQL with the tMySqlInput component. However, today I have a more ambitious query and am having trouble making it work.
I need a query where the result depends on the previous row. I got it working in MySQL Workbench but not in Talend. Example: the delay between two dates.
Here is the query:
SET @var = NULL;
SELECT id, start_date, end_date, @var precedent, UNIX_TIMESTAMP(TIMEDIFF(start_date,@var)) AS diff, @var:=start_date AS temp
FROM ma_table
ORDER BY start_date;
and the error is:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT id, start_date, end_date, id_process_type, @var precedent, UNIX_TIMESTAMP' at line 2
...Not very useful. Is this syntax forbidden in Talend? Are there other ways to do this kind of query in Talend (for example, the delay between two dates), or another component maybe? I am looking into tMysqlRow.
Thanks for ideas !
As @Gabriele B mentions, you might want to consider doing this in a more "Talend" way.
I'd personally make use of the tMemorizeRows component to do this though.
To simplify this I've made the start and end dates integers, but it should be trivial to handle this using proper dates.
If we have some data that shows the start and end date of a process and we want to work out the delay between finishing the last one and starting the next process we can read all of the data in and then use the tMemorizeRows component to remember the last 2 rows:
We then access the memorized data by looking at the array index. So here we go to a tJavaRow component that has an extra output column, startDelay. We calculate it by subtracting the last process's end date from the current process's start date:
output_row.id = input_row.id;
output_row.startdate = input_row.startdate;
output_row.enddate = input_row.enddate;
if (id_tMemorizeRows_1[0] != 1) {
    output_row.startDelay = startdate_tMemorizeRows_1[0] - enddate_tMemorizeRows_1[1];
} else {
    output_row.startDelay = 0;
}
The conditional statement is there to avoid null pointer errors on the first row of the data, as enddate_tMemorizeRows_1[1] will be null at that point. You could handle the null in other ways, of course.
This process is reasonably easy to understand and maintain (although there is that small bit of Java code in there) and has the benefits of only needing to load the data once and only keeping a small part of it in memory at any one time. It should also be very fast.
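For readers who want to see the idea outside Talend, the "remember the previous row" logic can be sketched in plain Python. This is only an illustration of the technique, not Talend code: the function name add_start_delay and the sample rows are made up for the example, and the dates are integers as in the simplified job above.

```python
def add_start_delay(rows):
    """Yield (id, startdate, enddate, startdelay) for rows sorted by start date.

    startdelay is the current start minus the previous row's end. The first
    row has no predecessor, so its delay is set to 0, the same choice the
    tJavaRow snippet makes.
    """
    previous_end = None  # plays the role of enddate_tMemorizeRows_1[1]
    for row_id, startdate, enddate in rows:
        delay = 0 if previous_end is None else startdate - previous_end
        yield (row_id, startdate, enddate, delay)
        previous_end = enddate


# Hypothetical sample data: (id, startdate, enddate)
processes = [(1, 1, 3), (2, 5, 6), (3, 10, 12)]
result = list(add_start_delay(processes))
```

Only one previous value is kept at a time, which is why this approach stays cheap in memory however large the input is.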
You should consider refactoring the statement to do it the "Talend" way; it may be a little slower, but it is more portable and robust.
If your table is not huge, for example, I would recommend loading it into memory using tCacheOutput/tCacheInput (you can find them on Talend Exchange) and this design:
tMySqlLoad----->tCacheOutput_1
      |
      |
      |
  OnSubjobOk
      |
      |
      v
tCacheInput_1------->tMap_1--------+
                                   |
                                   |
                                 tJoin-------------->tMap_3------------>[output]
                                   |
                                   |
tCacheInput_2------->tMap_2--------'
First of all, you dump your table into a memory buffer
Then you read this buffer twice. It's in memory, so it won't hurt performance
In tMap_1 you add an auto-increment index using a Numeric.sequence
You do the same in tMap_2 but with a starting number of 2 (basically, you shift the index)
Then you self-join the table using these brand new columns
Finally, in tMap_3 you do the actual work (i.e. compute the diff)
This is going to be a verbose but robust solution if your table is small. If it's not and performance is not an issue, you can try an even more verbose solution like Prepared Statements.
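To make the shifted-index self-join concrete, here is a minimal sketch of the same steps in plain Python. It is not Talend code: the function name shifted_self_join, the dictionary keys and the sample rows are assumptions for illustration, and the rows are assumed to be already sorted by start_date.

```python
from datetime import datetime


def shifted_self_join(rows):
    """Pair each row with its predecessor by joining on a shifted index."""
    # tMap_1: number the rows 1, 2, 3, ...
    current = {i: row for i, row in enumerate(rows, start=1)}
    # tMap_2: the same rows numbered from 2, so index n holds row n-1
    previous = {i: row for i, row in enumerate(rows, start=2)}
    joined = []
    # tJoin + tMap_3: inner join on the index, then compute the diff in seconds
    for index, row in current.items():
        if index in previous:
            delta = row["start_date"] - previous[index]["start_date"]
            joined.append((row["id"], delta.total_seconds()))
    return joined


# Hypothetical sample rows
rows = [
    {"id": 1, "start_date": datetime(2014, 1, 1, 10, 0)},
    {"id": 2, "start_date": datetime(2014, 1, 1, 10, 30)},
    {"id": 3, "start_date": datetime(2014, 1, 1, 12, 0)},
]
diffs = shifted_self_join(rows)
```

Because this is an inner join, the first row (which has no predecessor) drops out of the result; an outer join would keep it with an empty diff.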
I am very much a SQL developer and am new to Redis, but its performance is very interesting. I have a problem I think Redis could help me a lot with. I have a SQL table similar to this:
| CONTAINER <String><NoUnq> | PROCESS <String><NoUnq> | PROCESS_DATA <String><NoUnq> | TimeCreated <TimeStamp><NoUnq>|
This table when populated to its max has roughly ~450,000,000 rows. I am running this on AWS. With these rows I select all the processes within a container (~1,000,000 containers), so I would have something like this in sql (of course container is indexed):
SELECT * FROM table WHERE container = '[CONTAINER_NAME]';
I then have a cronjob script which runs every hour and removes old processes from containers with something like this:
DELETE FROM table WHERE TimeCreated <= [SOME_TIME];
So essentially I want to keep only processes that are no older than ~4-5 hours. Looking at Redis, I feel I could engineer something similar to improve my performance, but I am having trouble converting this SQL-like design into Redis.
My first thought was to use HSET, but I found out that HSET does not allow the EXPIRE command on fields, so I could not automatically remove old processes. I am most concerned about performance and efficiency.
It looks like you can (and probably should) use HSET, and that you do not need to expire fields: you need to expire keys. Base the key name on the container name and set EXPIREAT on that key. If you are describing a relational table structure like the one above, the closest analogue is one key per table row:
MULTI
HMSET <container name:rowId> PROCESS <value> PROCESS_DATA <value>
EXPIREAT <container name:rowId> <TimeCreated + max age in seconds>
EXEC
You can also use a ZSET to store a time-ordered list of rows:
ZADD <container name> <TimeCreated> <rowId>
So you can use ZRANGE as the SELECT equivalent. You can also use Lua scripting to get the content of a container with one request. Something like this (I may have made a mistake somewhere in the Lua syntax):
local result = {}
-- ZRANGE returns the rowIds stored in the container's sorted set
local ids = redis.call('zrange', KEYS[1], ARGV[1], ARGV[2])
for _, id in ipairs(ids) do
    -- each row is a hash stored under <container name>:<rowId>
    result[#result + 1] = redis.call('hgetall', KEYS[1] .. ':' .. id)
end
return result
Where KEYS[1] is the container name, and ARGV[1] and ARGV[2] are the start and end indexes of the range.
P.S. You should also understand how Redis expires keys, so that you know what happens to memory on your instance.
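As a rough sketch of how the write side could look from application code, here is a version using the redis-py client. The function name store_row and the max_age default of five hours are assumptions for illustration; the client is passed in, so nothing here depends on a particular connection setup.

```python
def store_row(client, container, row_id, process, process_data,
              time_created, max_age=5 * 3600):
    """Store one table row as a hash and index it in the container's ZSET.

    client is expected to be a redis-py client (for example redis.Redis()).
    time_created is a Unix timestamp in seconds.
    """
    key = f"{container}:{row_id}"
    pipe = client.pipeline(transaction=True)  # sends MULTI ... EXEC
    pipe.hset(key, mapping={"PROCESS": process, "PROCESS_DATA": process_data})
    # The hash disappears max_age seconds after the row was created
    pipe.expireat(key, int(time_created) + max_age)
    # Index the row by creation time so it can be listed per container
    pipe.zadd(container, {row_id: time_created})
    return pipe.execute()
```

One thing to keep in mind with this layout: sorted-set members do not expire together with the hashes they point to, so the index would still need trimming from time to time, for instance with ZREMRANGEBYSCORE.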
Assuming that all values of MBR_DTH_DT evaluate to a Date data type other than the value '00000000', could the following UPDATE SQL fail when running on multiple processors if the CAST were performed before the filter by racing threads?
UPDATE a
SET a.[MBR_DTH_DT] = cast(a.[MBR_DTH_DT] as date)
FROM [IPDP_MEMBER_DEMOGRAPHIC_DECBR] a
WHERE a.[MBR_DTH_DT] <> '00000000'
I am trying to find the source of the following error
Error: 2014-01-30 04:42:47.67
Code: 0xC002F210
Source: Execute csp_load_ipdp_member_demographic Execute SQL Task
Description: Executing the query "exec dbo.csp_load_ipdp_member_demographic" failed with the following error: "Conversion failed when converting date and/or time from character string.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
End Error
It could be another UPDATE or INSERT query, but the others in question appear to have properly typed data from what I can see, so I am left only with the above.
No, it simply sounds like you have bad data in the MBR_DTH_DT column, which is VARCHAR but should be a date (once you clean out the bad data).
You can identify those rows using:
SELECT MBR_DTH_DT
FROM dbo.IPDP_MEMBER_DEMOGRAPHIC_DECBR
WHERE ISDATE(MBR_DTH_DT) = 0;
Now, you may only get rows that happen to match the where clause you're using to filter (e.g. MBR_DTH_DT = '00000000').
This has nothing to do with multiple processors, race conditions, etc. It's just that SQL Server can try to perform the cast before it applies the filter.
Randy suggests adding an additional clause, but this is not enough, because the CAST can still happen before any/all filters. You usually work around this by something like this (though it makes absolutely no sense in your case, when everything is the same column):
UPDATE dbo.IPDP_MEMBER_DEMOGRAPHIC_DECBR
SET MBR_DTH_DT = CASE
WHEN ISDATE(MBR_DTH_DT) = 1 THEN CAST(MBR_DTH_DT AS DATE)
ELSE MBR_DTH_DT END
WHERE MBR_DTH_DT <> '00000000';
(I'm not sure why in the question you're using UPDATE alias FROM table AS alias syntax; with a single-table update, this only serves to make the syntax more convoluted.)
However, in this case, this does you absolutely no good; since the target column is a string, you're just trying to convert a string to a date and back to a string again.
The real solution: stop using strings to store dates, and stop using token strings like '00000000' to denote that a date isn't available. Either use a dimension table for your dates or just live with NULL already.
Not likely. Even with multiple processors, there is no guarantee the query will be processed in parallel.
Why not try something like this, assuming you're using SQL Server 2012. Even if you're not, you could write a UDF to validate a date like this.
UPDATE a
SET a.[MBR_DTH_DT] = cast(a.[MBR_DTH_DT] as date)
FROM [IPDP_MEMBER_DEMOGRAPHIC_DECBR] a
WHERE a.[MBR_DTH_DT] <> '00000000' And IsDate(MBR_DTH_DT) = 1
Most likely you have bad data and are not aware of it.
Whoops, just checked. IsDate has been available since SQL 2005. So try using it.
I have a UNION statement. When I try to take parameters from a form and pass them to the UNION SELECT statement, Access says there are too many parameters. This is using MS Access.
SELECT Statement FROM table 1 where Date = Between [Forms]![DateIN]![StartDate]
UNION
SELECT Statement FROM table 2 where Date = Between [Forms]![DateIN]![StartDate]
This is the first time I am using Windows DB applications to build database apps. I am a Linux type of person and always use MySQL for my projects, but for this one I have to use MS Access.
Is there another way to pass parameters to a UNION statement? This method of defining values in a form works on single SELECT statements, and I don't know why the problem exists here.
Between "Determines whether the value of an expression falls within a specified range of values" like this ...
expr [Not] Between value1 And value2
But your query only gives it one value ... Between [Forms]![DateIN]![StartDate]
So you need to add And plus another date value ...
Between [Forms]![DateIN]![StartDate] And some_other_date
Also Date is a reserved word. If you're using it as a field name, enclose it in brackets to avoid confusing the db engine: [Date]
If practical, rename the field to avoid similar problems in the future.
And as Gord pointed out, you must also bracket table names which include a space. The same applies to field names.
I am still getting problems when using this method of passing the dates from the form to the UNION statement. The actual query that I am trying to use is shown below.
I don't want to reinvent the wheel, but I was thinking: since Date() can be used in Between Date() And Date()-6 to represent a 7-day range, I might have to write a module that takes the values from the form and returns them. That way I could define something like Sdate() and Edate() and use them as Between Sdate() And Edate().
I have not tried this yet, and it would be my last option. I don't even know if it will work, but it is worth a try. Before I do that, I want to try everything Access offers to make my life easier, such as the OO features it has for helping DB programmers.
SELECT
"Expenditure" as [TransactionType], *
FROM
Expenditures
WHERE
(((Expenditures.DateofExpe) Between [Forms]!Form1![Text0] and [Forms]![Form1]![Text11]))
UNION
SELECT
"Income" as [TransactionType], *
FROM
Income
WHERE
(((Income.DateofIncom) Between [Forms]!Form1![Text0] and [Forms]![Form1]![Text11] ));
Access VBA has great power, but I don't want to use it yet, as it would be hard for a user who does not know how to program to make changes. I am trying to keep this DB app as simple as possible so that a non-technical user can fully operate it.
Any comments are much appreciated.
I have the following code:
echo "<form><center><input type=submit name=subs value='Submit'></center></form>";
$val=$_POST['resulta']; //this is from a textarea name='resulta'
if (isset($_POST['subs'])) //from submit name='subs'
{
$aa=mysql_query("select max(reservno) as 'maxr' from reservation") or die(mysql_error()); //select maximum reservno
$bb=mysql_fetch_array($aa);
$cc=$bb['maxr'];
$lines = explode("\n", $val);
foreach ($lines as $line) {
mysql_query("insert into location_list (reservno, location) values ('$cc', '$line')")
or die(mysql_error()); //insert value of textarea then save it separately in location_list if \n is found
}
If I input the following data on the textarea (assume that I have maximum reservno '00014' from reservation table),
Davao - Cebu
Cebu - Davao
then submit it, I'll have these data in my location_list table:
loc_id || reservno || location
00001 || 00014 || Davao - Cebu
00002 || 00014 || Cebu - Davao
Then this code:
$gg=mysql_query("SELECT GROUP_CONCAT(IF((@var_ctr := @var_ctr + 1) = @cnt,
                            location,
                            SUBSTRING_INDEX(location,' - ', 1)
                           )
                           ORDER BY loc_id ASC
                           SEPARATOR ' - ') AS locations
                 FROM location_list,
                      (SELECT @cnt := COUNT(1), @var_ctr := 0
                       FROM location_list
                       WHERE reservno='$cc'
                      ) dummy
                 WHERE reservno='$cc'") or die(mysql_error()); //QUERY IN QUESTION
$hh=mysql_fetch_array($gg);
$ii=$hh['locations'];
mysql_query("update reservation set itinerary = '$ii' where reservno = '$cc'")
or die(mysql_error());
is supposed to update reservation table with 'Davao - Cebu - Davao' but it's returning this instead, 'Davao - Cebu - Cebu'. I was previously helped by this forum to have this code working but now I'm facing another difficulty. Just can't get it to work. Please help me. Thanks in advance!
I got it working (without ORDER BY loc_id ASC) as long as I set phpMyAdmin operations loc_id ascending. But whenever I delete all data, it goes back as loc_id descending so I have to reset it. It doesn't entirely solve the problem but I guess this is as far as I can go. :)) I just have to make sure that the table column loc_id is always in ascending order. Thank you everyone for your help! I really appreciate it! But if you have any better answer, like how to set the table column always in ascending order or better query, etc, feel free to post it here. May God bless you all!
The database server is allowed to rewrite your query to optimize its execution. This might affect the order of the individual parts, in particular the order in which the various assignments are executed. I assume that some such reordering causes the result of the query to become undefined, in such a way that it works on sqlfiddle but not on your actual production system.
I can't put my finger on the exact location where things go wrong, but I believe that the core of the problem is the fact that SQL is intended to work on relations, but you try to abuse it for sequential programming. I suggest you retrieve the data from the database using portable SQL without any variable hackery, and then use PHP to perform any post-processing you might need. PHP is much better suited to express the ideas you're formulating, and no optimization or reordering of statements will get in your way there. And as your query currently only results in a single value, fetching multiple rows and combining them into a single value in the PHP code shouldn't increase complexity too much.
Edit:
While discussing another answer using a similar technique (by Omesh as well, just as the answer your code is based upon), I found this in the MySQL manual:
As a general rule, you should never assign a value to a user variable
and read the value within the same statement. You might get the
results you expect, but this is not guaranteed. The order of
evaluation for expressions involving user variables is undefined and
may change based on the elements contained within a given statement;
in addition, this order is not guaranteed to be the same between
releases of the MySQL Server.
So there are no guarantees about the order in which these variable assignments are evaluated, and therefore no guarantees that the query does what you expect. It might work, but it might fail suddenly and unexpectedly. Therefore I strongly suggest you avoid this approach unless you have some reliable mechanism to check the validity of the results, or really don't care whether they are valid.
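The post-processing suggested above is small. Here is a sketch of the combining step, written in Python rather than PHP purely as an illustration; the function name build_itinerary is made up, and the input is assumed to be the location values fetched with a plain SELECT ... ORDER BY loc_id.

```python
def build_itinerary(legs):
    """Combine legs like 'Davao - Cebu' into a single itinerary string.

    Every leg except the last contributes only its origin; the last leg
    is kept whole so that the final destination appears once.
    """
    legs = [leg.strip() for leg in legs if leg.strip()]
    if not legs:
        return ""
    parts = [leg.split(" - ", 1)[0] for leg in legs[:-1]]
    parts.append(legs[-1])
    return " - ".join(parts)


itinerary = build_itinerary(["Davao - Cebu", "Cebu - Davao"])
```

The ordering then comes only from the ORDER BY clause of the SELECT, so it no longer depends on how the server evaluates user variables. The strip() calls also remove any stray carriage returns left over from splitting the textarea on "\n".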
Our website has a problem: one page takes too long to load. We found that the page contains an n*n matrix, and for each item in the matrix it queries three tables in the MySQL database. The queries for every item in the matrix are quite alike.
So I wonder whether the large number of MySQL queries is causing the problem, and I want to try to fix it. Here is one of the things I am unsure about:
1.
m = store.execute('SELECT X FROM TABLE1 WHERE I=1')
result = store.execute('SELECT Y FROM TABLE2 WHERE X in m')
2.
r = store.execute('SELECT X, Y FROM TABLE2')
result = []
for each in r:
    i = store.execute('SELECT I FROM TABLE1 WHERE X=%s', each[0])
    if i[0][0] == 1:
        result.append(each)
There are about 200 items in TABLE1 and more than 400 items in TABLE2. I don't know which part takes the most time, so I can't make a good decision about how to write my SQL statement.
How can I find out how much time an operation takes in MySQL? Thank you!
Rather than installing a bunch of special tools, you could take a dead-simple approach like this (pardon my Ruby):
start = Time.new
# DB query here
puts "Query XYZ took #{Time.now - start} sec"
Hopefully you can translate that to Python. OR... pardon my Ruby again...
QUERY_TIMES = {}
def query(sql)
  start = Time.new
  result = connection.execute(sql)
  elapsed = Time.new - start
  QUERY_TIMES[sql] ||= []
  QUERY_TIMES[sql] << elapsed
  result # return the query result, not the list of timings
end
Then run all your queries through this custom method. After doing a test run, you can make it print out the number of times each query was run, and the average/total execution times.
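Since the question is in Python, here is one way the same wrapper might look there. It is a sketch, not a drop-in: timed_query and report are hypothetical names, and execute is assumed to be whatever callable runs your SQL (store.execute in the question).

```python
import time
from collections import defaultdict

QUERY_TIMES = defaultdict(list)  # SQL text -> list of elapsed seconds


def timed_query(execute, sql, *params):
    """Run execute(sql, *params), record how long it took, return its result."""
    start = time.perf_counter()
    try:
        return execute(sql, *params)
    finally:
        # Recorded even if the query raises, so failures are timed too
        QUERY_TIMES[sql].append(time.perf_counter() - start)


def report():
    """Return (sql, run count, total seconds, average seconds), slowest first."""
    rows = [(sql, len(t), sum(t), sum(t) / len(t))
            for sql, t in QUERY_TIMES.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

Usage would be along the lines of timed_query(store.execute, 'SELECT X FROM TABLE1 WHERE I=1'), followed by printing report() after a test run. Note that this measures the round trip as seen by the application, including network time, not just the time spent inside MySQL.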
For the future, plan to spend some time learning about "profilers" (if you haven't already). Get a good one for your chosen platform, and spend a little time learning how to use it well.
I use MySQL Workbench for SQL development. It reports response times and can connect remotely to MySQL servers, provided you have permission (which in this case will give you a more accurate reading).
http://www.mysql.com/products/workbench/
Also, as you've realized, it appears you have a SQL statement inside a for loop. That can drastically affect performance. You'll want to take a different route to retrieve that data.
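One such route is to let the database do the matching in a single JOIN instead of issuing one query per row. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table and column names come from the question, while the sample data and the assumption that X is unique in TABLE1 are mine.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (X INTEGER, I INTEGER);
    CREATE TABLE TABLE2 (X INTEGER, Y TEXT);
    INSERT INTO TABLE1 VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO TABLE2 VALUES (1, 'a'), (2, 'b'), (3, 'c');
""")

# One round trip: keep the TABLE2 rows whose X has I = 1 in TABLE1
rows = conn.execute("""
    SELECT t2.X, t2.Y
    FROM TABLE2 AS t2
    JOIN TABLE1 AS t1 ON t1.X = t2.X
    WHERE t1.I = 1
    ORDER BY t2.X
""").fetchall()
```

The SELECT itself is standard SQL and should carry over to MySQL unchanged. If X can repeat in TABLE1, the JOIN would return duplicate rows, and an IN or EXISTS subquery may be the better fit.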