LOAD DATA INFILE MySQL (MariaDB) - CSV

I am trying to load a .csv file into MariaDB but I am struggling with the query.
Here is how the .csv file is formatted:
USER DATE TIME TESTRESULT ERRORCODE
Esa_Test 16.5.2022 12:36:59 Fail 1(MinMaxError)
Esa_Test 16.5.2022 12:38:02 Fail 1(MinMaxError)
Esa_Test 16.5.2022 12:55:40 Fail 1(MinMaxError)
Esa_Test 17.5.2022 16:15:00 Fail 1(MinMaxError)
DPHYD_Ate 18.5.2022 9:50:11 OK 0(NoError)
When I use this query:
LOAD DATA LOW_PRIORITY LOCAL INFILE 'C:\\xampp\\mysql\\data\\test\\log2.csv' IGNORE INTO TABLE `test`.`testova2` IGNORE 4 LINES (`USER`, `DATE`, `TIME`, `TESTRESULT`, `ERRORCODE`);
The data is loaded successfully but with spaces like this:
USER;DATE;TIME;TESTRESULT;ERRORCODE
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 3 6 : 5 9 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 3 8 : 0 2 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 5 5 : 4 0 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 7 . 5 . 2 0 2 2 ; 1 6 : 1 5 : 0 0 ; F a i l ; 1 ( M i n M a x E r r o r )
D P H Y D _ A t e ; 1 8 . 5 . 2 0 2 2 ; 9 : 5 0 : 1 1 ; O K ; 0 ( N o E r r o r )
I tried to define some "limits" via FIELDS TERMINATED BY '\t' LINES TERMINATED BY '|' in the query, but it is not working. The original file's encoding is UTF-16LE according to Notepad++.
Please help me build the proper query for my case so that the data is inserted correctly.

If someone is looking at this:
I managed to find a solution.
The problem was the encoding: in UTF-16LE every ASCII character is followed by a NUL byte, so a loader expecting a single-byte charset sees those NUL bytes as the extra "spaces".
The solution was to write a short Python script to change the encoding before the upload into MySQL. The code is:
import codecs
with codecs.open("input.csv", "r", encoding="utf_16") as fin:
    with codecs.open("output.csv", "w", encoding="utf_8") as fout:
        fout.write(fin.read())
I automated this and now everything is working.
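For anyone automating the same thing, here is a rough sketch of the whole pipeline. The pymysql library, the connection details, and the tab separator / CRLF line endings are my assumptions, not something confirmed by the original setup:
import codecs
import pymysql  # assumed client library; any MySQL/MariaDB connector works similarly

def to_utf8(src, dst):
    # Re-encode UTF-16 to UTF-8 so the server no longer sees a NUL byte after every character.
    with codecs.open(src, "r", encoding="utf_16") as fin:
        with codecs.open(dst, "w", encoding="utf_8") as fout:
            fout.write(fin.read())

to_utf8("log2.csv", "log2_utf8.csv")

conn = pymysql.connect(host="localhost", user="root", password="",
                       database="test", local_infile=True)  # hypothetical credentials
with conn.cursor() as cur:
    cur.execute(r"""
        LOAD DATA LOW_PRIORITY LOCAL INFILE 'C:/xampp/mysql/data/test/log2_utf8.csv'
        IGNORE INTO TABLE test.testova2
        FIELDS TERMINATED BY '\t'
        LINES TERMINATED BY '\r\n'
        IGNORE 4 LINES
        (`USER`, `DATE`, `TIME`, `TESTRESULT`, `ERRORCODE`)
    """)
conn.commit()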

Related

How to remove non-printable character ^# in Perl

I have a simple Perl script, as below:
use strict;
use warnings 'all';
use Text::CSV;
use open ":std", ":encoding(UTF-8)";
my $in_file = $ARGV[0] or die "Usage: $0 filename\n";
my $out_file = $ARGV[1] or die "Usage: $0 filename\n";
my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 });
my $out_csv = Text::CSV->new({ binary       => 1,
                               auto_diag    => 1,
                               eol          => $/,
                               sep_char     => ',',
                               quote_char   => '"',
                               always_quote => 1 });
open my $data, '<:encoding(UTF-8)', $in_file or die $!;
open my $fh_out, '>:encoding(UTF-8)', $out_file or die $!;
while (my $words = $csv->getline($data))
{
    tr/\r\n//d for @$words;   # strip embedded CR/LF from each field
    tr/,/;/    for @$words;   # turn commas into semicolons
    tr/"/'/    for @$words;   # turn double quotes into single quotes
    $out_csv->print($fh_out, $words);
}
All it does is basically convert a CSV file into a more structured one that is easily understood by an external application with limitations. The script removes unwanted newlines, commas and double quotes which are entered as part of user text entry. This has been working well. However, recently one of the input files contained the non-printable character ^#. The script doesn't fail, but the output file has the additional characters "0.
Not sure if I can give an example of the input file since the character is non-printable. Please see the command output below.
cat Input.csv
ObjectId,OrgId,Title,ObjectType,Type,ObjectId,Location,StartDate,EndDate,DueDate,ObjectId1,ObjectId2,ObjectId3,LastModified,IsDeleted,SortOrder,Depth,DummyId,IsHidden,ResultId,DeletedDate,CreatedBy,LastModifiedBy,DeletedBy
3386484,532947,Test,Topic,Auto,3386415,http://www.test.com ,,,,,,,2016-06-27T05:08:26.3070000Z,1,443,3,,False,,2017-02-16T00:31:39.4870000Z,,,
With the cat -e option you can see the non-printable character.
cat -e Input.csv
M-oM-;M-?ObjectId,OrgId,Title,ObjectType,Type,ObjectId,Location,StartDate,EndDate,DueDate,ObjectId1,ObjectId2,ObjectId3,LastModified,IsDeleted,SortOrder,Depth,DummyId,IsHidden,ResultId,DeletedDate,CreatedBy,LastModifiedBy,DeletedBy^M$
3386484,532947,Test,Topic,Auto,3386415,http://www.test.com ^#,,,,,,,2016-06-27T05:08:26.3070000Z,1,443,3,,False,,2017-02-16T00:31:39.4870000Z,,,^M$
Once the file goes through the script, the output looks like this, with an additional "0 at the end of http://www.test.com:
cat -e Output.csv
"M-oM-;M-?ObjectId","OrgId","Title","ObjectType","Type","ObjectId","Location","StartDate","EndDate","DueDate","ObjectId1","ObjectId2","ObjectId3","LastModified","IsDeleted","SortOrder","Depth","DummyId","IsHidden","ResultId","DeletedDate","CreatedBy","LastModifiedBy","DeletedBy"$
"3386484","532947","Test","Topic","Auto","3386415","http://www.test.com "0","","","","","","","2016-06-27T05:08:26.3070000Z","1","443","3","","False","","2017-02-16T00:31:39.4870000Z","","",""$
Adding command output as per Dave's request
od -ch Input.csv
0000000 357 273 277 O b j e c t I d , O r g I
bbef 4fbf 6a62 6365 4974 2c64 724f 4967
0000020 d , T i t l e , O b j e c t T y
2c64 6954 6c74 2c65 624f 656a 7463 7954
0000040 p e , T y p e , O b j e c t I d
6570 542c 7079 2c65 624f 656a 7463 6449
0000060 , L o c a t i o n , S t a r t D
4c2c 636f 7461 6f69 2c6e 7453 7261 4474
0000100 a t e , E n d D a t e , D u e D
7461 2c65 6e45 4464 7461 2c65 7544 4465
0000120 a t e , O b j e c t I d 1 , O b
7461 2c65 624f 656a 7463 6449 2c31 624f
0000140 j e c t I d 2 , O b j e c t I d
656a 7463 6449 2c32 624f 656a 7463 6449
0000160 3 , L a s t M o d i f i e d , I
2c33 614c 7473 6f4d 6964 6966 6465 492c
0000200 s D e l e t e d , S o r t O r d
4473 6c65 7465 6465 532c 726f 4f74 6472
0000220 e r , D e p t h , D u m m y I d
7265 442c 7065 6874 442c 6d75 796d 6449
0000240 , I s H i d d e n , R e s u l t
492c 4873 6469 6564 2c6e 6552 7573 746c
0000260 I d , D e l e t e d D a t e , C
6449 442c 6c65 7465 6465 6144 6574 432c
0000300 r e a t e d B y , L a s t M o d
6572 7461 6465 7942 4c2c 7361 4d74 646f
0000320 i f i e d B y , D e l e t e d B
6669 6569 4264 2c79 6544 656c 6574 4264
0000340 y \r \n 3 3 8 6 4 8 4 , 5 3 2 9 4
0d79 330a 3833 3436 3438 352c 3233 3439
0000360 7 , T e s t , T o p i c , A u t
2c37 6554 7473 542c 706f 6369 412c 7475
0000400 o , 3 3 8 6 4 1 5 , h t t p : /
2c6f 3333 3638 3134 2c35 7468 7074 2f3a
0000420 / w w w . t e s t . c o m \0 ,
772f 7777 742e 7365 2e74 6f63 206d 2c00
0000440 , , , , , , 2 0 1 6 - 0 6 - 2 7
2c2c 2c2c 2c2c 3032 3631 302d 2d36 3732
0000460 T 0 5 : 0 8 : 2 6 . 3 0 7 0 0 0
3054 3a35 3830 323a 2e36 3033 3037 3030
0000500 0 Z , 1 , 4 4 3 , 3 , , F a l s
5a30 312c 342c 3334 332c 2c2c 6146 736c
0000520 e , , 2 0 1 7 - 0 2 - 1 6 T 0 0
2c65 322c 3130 2d37 3230 312d 5436 3030
0000540 : 3 1 : 3 9 . 4 8 7 0 0 0 0 Z ,
333a 3a31 3933 342e 3738 3030 3030 2c5a
0000560 , , \r \n
2c2c 0a0d
0000564
How can I handle this in the script? I do not want the additional "0 in the output CSV.
Thanks
Once you find out the numerical value for that character, you can use \x{} to specify it by that code number:
s/\x{....}//g;
For example, if the character is 🐱, I can find its ordinal value. I typically do that with a hex dump, but there are a variety of ways. It's U+1F431:
s/\x{1F431}//g;
You can also specify the code number in octal for the non-wide characters:
s/\200//g;
Or use those in a range in a character class:
s/[\200-\377]//g;
s/[\000-\037]//g;
s/[\000-\037\200-\377]//g;
But, there may be another thing you want to do. You might match it by a property it has (see perluniprops):
s/\p{XPosixCntrl}//g
Or, with an uppercase P, a property it doesn't have:
s/\P{Print}//g
So, we've learned two things from your hex dump of the file:
It's definitely a UTF-8 file, as the first three bytes are the UTF-8 Byte Order Mark (BOM).
Your problematic character is actually a null (with a codepoint of zero).
So, following brian's advice, you'll be able to remove the character with:
s/\0//g;
I'm not sure how you discovered it was ^#, but you should be able to refer to any such thing by using \c#, where the \c stands for "CTRL-". So, if it were an SOH, \x01, and you saw it displayed as ^A, you could specify s/\cA//g to get rid of it.
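If it is ever more convenient to do this cleanup as a separate preprocessing pass rather than inside the Perl script, a rough Python equivalent would be (file names are placeholders):
# Sketch: strip the UTF-8 BOM and any NUL bytes before further processing.
# encoding="utf-8-sig" transparently removes the BOM on read.
with open("Input.csv", "r", encoding="utf-8-sig", newline="") as fin:
    with open("Clean.csv", "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line.replace("\x00", ""))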

How to know the number of rows deleted using pandas

Here, filtering is done across these two .csv files and rows with common email IDs are deleted. I am able to get the total after deletion, but is there any option that gives how many rows were deleted using pandas?
Using MySQL:
delete a from data a, data1 b where a.email=b.email; select row_count();
How can this be done using pandas?
import pandas as pd
colnames=['id','emailid']
data=pd.read_csv("input.csv",names=colnames,header=None)
colnames=['email']
data1= pd.read_csv("compare.csv",names=colnames,header=None)
emailid_suppress1=data1['email'].str.lower()
suppress_md5=data[~data['emailid'].isin(emailid_suppress1)]
print(suppress_md5.count())
I believe you need the sum of the True values, which are treated like 1:
data = pd.DataFrame({'id':list('abcde'), 'emailid':list('klmno')})
print (data)
id emailid
0 a k
1 b l
2 c m
3 d n
4 e o
data1 = pd.DataFrame({'email':list('ABCKLDEFG')})
print (data1)
email
0 A
1 B
2 C
3 K
4 L
5 D
6 E
7 F
8 G
emailid_suppress1=data1['email'].str.lower()
print ((~data['emailid'].isin(emailid_suppress1)).sum())
3
suppress_md5=data[~data['emailid'].isin(emailid_suppress1)]
print (suppress_md5)
id emailid
2 c m
3 d n
4 e o
EDIT:
print ((data['emailid'].isin(emailid_suppress1)).sum())
2
suppress_md5=data[data['emailid'].isin(emailid_suppress1)]
print (suppress_md5)
id emailid
0 a k
1 b l
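Putting it together with the variable and file names from the question, a sketch of the full flow might be:
import pandas as pd

data = pd.read_csv("input.csv", names=["id", "emailid"], header=None)
data1 = pd.read_csv("compare.csv", names=["email"], header=None)

mask = data["emailid"].isin(data1["email"].str.lower())
print("rows deleted:", mask.sum())        # the pandas analogue of SELECT ROW_COUNT()
suppress_md5 = data[~mask]
print("rows remaining:", len(suppress_md5))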

SQL: simplify and, or query on multiple columns

I have a table named employe with the following content:
id A B C D E F G
1 8 4 6 3 2 5 2
1 7 3 1 2 1 3 7
3 9 2 3 3 2 6 1
4 6 1 2 4 5 5 7
4 6 4 6 3 2 5 2
I do a query like this
employe.where(
A > 7 &
(
(C<3 & D<3 & E<3 & F<3 & G<3)
|(B<3 & D<3 & E<3 & F<3 & G<3)
|(B<3 & C<3 & E<3 & F<3 & G<3)
|(B<3 & C<3 & D<3 & F<3 & G<3)
|(B<3 & C<3 & D<3 & E<3 & G<3)
|(B<3 & C<3 & D<3 & E<3 & F<3)
)
)
Is there any way to simplify the above query? I have more than 20 columns in my table and I have to do the above query for all of them. It looks ugly, with the same code on every line. Even something like this would look nicer:
q1 = [C, D, E, F, G]
q2 = [B, D, E, F, G]
q3 = [B, C, E, F, G]
q4 = [B, C, D, F, G]
q5 = [B, C, D, E, G]
q6 = [B, C, D, E, F]
employe.where(
A > 7 &
(
q1<3 | q2<3 | q3<3 | q4<3 | q5<3 | q6<3
)
)
I don't have MySQL to check, but is it possible to do something like:
SELECT * FROM #employee e
INNER JOIN (SELECT id, A,
CASE WHEN a < 3 THEN 0 ELSE 1 END +
CASE WHEN b < 3 THEN 0 ELSE 1 END +
CASE WHEN c < 3 THEN 0 ELSE 1 END +
CASE WHEN d < 3 THEN 0 ELSE 1 END +
CASE WHEN e < 3 THEN 0 ELSE 1 END +
CASE WHEN f < 3 THEN 0 ELSE 1 END +
CASE WHEN g < 3 THEN 0 ELSE 1 END as cnt from #employee) c
on e.id = c.id
WHERE e.A > 7 AND c.cnt < 3
If I understand you correctly, you are trying to find rows where one column is greater than 7 and at least five of the other six columns are less than three. My example achieves this (for all columns) by a simple trick of logic: count all the instances of A-G that are not less than 3, and check that this count is less than 3 for a row where some column is > 7. Then you know at most one of the remaining columns is not less than 3, since the > 7 column itself is already one of the (at most two) columns that are not less than 3.
Expanding the example for all columns, you then get:
SELECT * FROM #employee e
INNER JOIN (SELECT id,
CASE WHEN a < 3 THEN 0 ELSE 1 END +
CASE WHEN b < 3 THEN 0 ELSE 1 END +
CASE WHEN c < 3 THEN 0 ELSE 1 END +
CASE WHEN d < 3 THEN 0 ELSE 1 END +
CASE WHEN e < 3 THEN 0 ELSE 1 END +
CASE WHEN f < 3 THEN 0 ELSE 1 END +
CASE WHEN g < 3 THEN 0 ELSE 1 END as cnt from #employee) c
on e.id = c.id
WHERE (e.A > 7 OR e.B > 7 OR e.C > 7 OR e.D > 7 OR e.E > 7 OR e.F > 7 OR e.G > 7)
AND c.cnt < 3
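Since the employe.where(...) syntax in the question looks like a DataFrame filter rather than SQL, here is the same counting trick sketched in pandas, assuming the table fits in a DataFrame (the data below is the sample from the question):
import pandas as pd

employe = pd.DataFrame(
    [[1, 8, 4, 6, 3, 2, 5, 2],
     [1, 7, 3, 1, 2, 1, 3, 7],
     [3, 9, 2, 3, 3, 2, 6, 1],
     [4, 6, 1, 2, 4, 5, 5, 7],
     [4, 6, 4, 6, 3, 2, 5, 2]],
    columns=["id", "A", "B", "C", "D", "E", "F", "G"])

cols = list("ABCDEFG")
not_small = (employe[cols] >= 3).sum(axis=1)  # how many of A-G are NOT < 3
result = employe[(employe["A"] > 7) & (not_small < 3)]
print(result)  # empty for this sample: no row satisfies the condition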
If all values (columns) from the list must be less than 3, then the greatest value in this list must also be less than 3, so you can use the GREATEST function, something like:
Where
Greatest( c, d, e, f, g ) < 3
Or
Greatest( b, d, e, f, g ) < 3
Or
Greatest( b, c, d, e, f ) < 3
Or
.....
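The GREATEST variant maps to a row-wise max; here is a hedged pandas sketch that enumerates the drop-one-column combinations instead of writing them all out (toy data, illustration only):
import pandas as pd

employe = pd.DataFrame({"A": [8, 9], "B": [4, 2], "C": [6, 3], "D": [3, 3],
                        "E": [2, 2], "F": [5, 6], "G": [2, 1]})
others = list("BCDEFG")
# GREATEST(five columns) < 3 must hold for at least one drop-one combination
cond = pd.concat([employe[others].drop(columns=c).max(axis=1) < 3
                  for c in others], axis=1).any(axis=1)
result = employe[(employe["A"] > 7) & cond]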

How to delete duplicate data with a condition

I have written a query to delete the duplicate data:
DELETE FROM tbl_test t
WHERE t.ROWID > ANY (SELECT b.ROWID
FROM tbl_test b
WHERE b.ser_no = t.ser_no
);
This gives me 199 records, which is correct.
The duplicate records have status L, F or R, and I have to delete all the records which are F or R.
For example if I have two records
ID ser_no Sta
1 20 L
2 20 F
3 15 R
4 15 L
5 89 L
6 89 F
7 10 R
8 10 R
So only one of the duplicates, with status R or F, should be deleted. There is no case in which both duplicates have L,L or F,R status.
So I tried
DELETE FROM tbl_test t
WHERE t.ROWID > ANY (SELECT b.ROWID
FROM tbl_test b
WHERE b.ser_no = t.ser_no
AND b.Sta<>'L'
);
It did not work. It displays 125 records.
The final result (the records that get deleted) should be:
ID ser_no Sta
2 20 F
3 15 R
6 89 F
8 10 R
Try this delete statement; it works as described in the rules and gave the correct output for your examples (SQLFiddle):
delete from tbl_test t
where t.sta in ('F', 'R') and exists (
select 1 from tbl_test b
where b.ser_no = t.ser_no
and (sta='L' or (sta<>'L' and b.id<t.id)) )
Try this query:
DELETE FROM tbl_test t
WHERE t.ROWID in (SELECT max(b.ROWID)
FROM tbl_test b
where b.Sta in('R','F')
group by b.ser_no,b.Sta
);
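As a cross-check outside the database, the keep/delete rule (keep the 'L' row per ser_no if there is one, otherwise the lowest ID) can be replayed in pandas on the sample data. This is only a verification sketch, not a replacement for the DELETE:
import pandas as pd

df = pd.DataFrame({"ID": range(1, 9),
                   "ser_no": [20, 20, 15, 15, 89, 89, 10, 10],
                   "Sta": list("LFRLLFRR")})

keep = (df.assign(not_l=df["Sta"] != "L")   # 'L' sorts first within each ser_no
          .sort_values(["ser_no", "not_l", "ID"])
          .groupby("ser_no").head(1))
deleted = df.drop(keep.index).sort_values("ID")
print(deleted)  # IDs 2, 3, 6, 8 -- matching the expected final result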

Improvement of VBA code in Access to do something like a pivot table

I have a table of data in MS Access 2007. There are 6 fields per record, thousands of records. I want to make a sort of pivot table like object. That is, if any two rows happens to be the same in the first 4 fields, then they will end up grouped together into one row. The column headers in this pivot table will be the values from the 5th field, and the value in the pivot table will be the 6th field, a dollar amount. Think of the 5th field as letters A, B, C, D, E, F, G. So, the table I start with might have a row with A in the 5th field and $3.48 in the 6th field. Another row may match in the first 4 fields, have B in the 5th field and $8.59 in the 6th field. Another may match in the first 4 fields, have E in the 5th field and $45.20 in the 6th field. I want all these rows to be turned into one row (in a new table) that starts with the first 4 fields where they match, then lists $3.48, $8.59, $0.00, $0.00, $45.20, $0.00, $0.00, corresponding to column headers A, B, C, D, E, F, G (since no records contained C, D, F, G, their corresponding values are $0.00), and then ends with one more field that totals up the money in that row.
Currently, I have some VBA code that does this, written by someone else a few years ago. It is extremely slow and I am hoping for a better way. I asked a previous question (but not very clearly so I was advised to create a new question), where I was asking if there was a better way to do this in VBA. My question asked about reading and writing large amounts of data all at once in Access through VBA, which I know is a good practice in Excel. That is, I was hoping to take my original table and just assign the entire thing to an array all at once (as in Excel, instead of cell by cell), then work with that array in VBA and create some new array and then write that entire array all at once to a new table (instead of record by record, field by field). From the answers in that question, it seems like that is not really a possibility in Access, but my best bet might be to use some sort of query. I tried the Query Wizard and found the Cross Tab query which is close to what I describe above. But, there appears to be a max of 3 fields used in the Row Heading, whereas here I have 4. And, instead of putting $0.00 when a value is not specified (like C, D, F, G in my example above), it just leaves a blank.
Update (in response to Remou's comment to give sample data): Here is some sample data.
ID a b c d e f
7 1 2 3 5 A 5
8 1 2 3 5 B 10
9 1 2 3 5 C 15
10 1 2 3 5 D 20
11 1 2 3 5 E 25
12 1 2 4 4 A 16
13 1 2 4 4 B 26
14 1 3 3 7 D 11
15 1 3 3 7 B 11
The result should be:
a b c d an bn cn dn en Total
1 2 3 5 5 10 15 20 25 75
1 2 4 4 16 26 0 0 0 42
1 3 3 7 0 11 0 11 0 22
But, when I copy and paste the SQL given by Remou, the only output I get is
a b c d an bn cn dn en
1 2 3 5 5 10 15 20 25
This is, I think, what you want, but it would be better to consider database design, because this is a spreadsheet-like solution.
SELECT t0.a,
t0.b,
t0.c,
t0.d,
Iif(Isnull([a1]), 0, [a1]) AS an,
Iif(Isnull([b1]), 0, [b1]) AS bn,
Iif(Isnull([c1]), 0, [c1]) AS cn,
Iif(Isnull([d1]), 0, [d1]) AS dn,
Iif(Isnull([e1]), 0, [e1]) AS en
FROM (((((SELECT DISTINCT t.a,
t.b,
t.c,
t.d
FROM table3 t) AS t0
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS a1
FROM table3 t
WHERE t.e = "A") AS a0
ON ( t0.d = a0.d )
AND ( t0.c = a0.c )
AND ( t0.b = a0.b )
AND ( t0.a = a0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS b1
FROM table3 t
WHERE t.e = "B") AS b0
ON ( t0.d = b0.d )
AND ( t0.c = b0.c )
AND ( t0.b = b0.b )
AND ( t0.a = b0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS c1
FROM table3 t
WHERE t.e = "C") AS c0
ON ( t0.d = c0.d )
AND ( t0.c = c0.c )
AND ( t0.b = c0.b )
AND ( t0.a = c0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS d1
FROM table3 t
WHERE t.e = "D") AS d0
ON ( t0.d = d0.d )
AND ( t0.c = d0.c )
AND ( t0.b = d0.b )
AND ( t0.a = d0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS e1
FROM table3 t
WHERE t.e = "E") AS e0
ON ( t0.d = e0.d )
AND ( t0.c = e0.c )
AND ( t0.b = e0.b )
AND ( t0.a = e0.a );
Table3
ID a b c d e f
1 1 2 3 4 a €10.00
2 1 2 3 4 b €10.00
3 1 2 3 4 c €10.00
4 1 2 3 4 d €10.00
5 1 2 3 4 e €10.00
6 1 2 3 5 a €10.00
7 1 2 3 5 b
8 1 2 3 5 c €10.00
9 1 2 3 5 d €10.00
10 1 2 3 5 e €10.00
Result
There are two rows, because there are only two different sets in the first four columns.
a b c d an bn cn dn en
1 2 3 4 €10.00 €10.00 €10.00 €10.00 €10.00
1 2 3 5 €10.00 €0.00 €10.00 €10.00 €10.00
The way the SQL above is supposed to work is that it selects each of the four definition columns and the currency column from the table where the sort column has a particular sort letter, and labels the currency column with that letter. Each of these subqueries is then assembled; however, you can take a subquery on its own and look at the results. The last one is the part between the parentheses:
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS e1
FROM table3 t
WHERE t.e = "E") AS e0