How to remove non-printable character ^# in Perl - CSV

I have a simple Perl script, as below:
use strict;
use warnings 'all';
use Text::CSV;
use open ":std", ":encoding(UTF-8)";

my $in_file  = $ARGV[0] or die "Usage: $0 infile outfile\n";
my $out_file = $ARGV[1] or die "Usage: $0 infile outfile\n";

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );
my $out_csv = Text::CSV->new( {
    binary       => 1,
    auto_diag    => 1,
    eol          => $/,
    sep_char     => ',',
    quote_char   => '"',
    always_quote => 1,
} );

open my $data,   '<:encoding(UTF-8)', $in_file  or die $!;
open my $fh_out, '>:encoding(UTF-8)', $out_file or die $!;

while ( my $words = $csv->getline($data) ) {
    tr/\r\n//d for @$words;    # strip embedded CR/LF
    tr/,/;/    for @$words;    # turn commas into semicolons
    tr/"/'/    for @$words;    # turn double quotes into single quotes
    $out_csv->print( $fh_out, $words );
}
All it does is basically convert a CSV file into a more structured one that is easily understood by an external application with limitations. The script removes unwanted newlines, commas and double quotes that are entered as part of user text entry. This has been working well. However, recently one of the input files contained the non-printable character ^#. The script doesn't fail, but the output file has the additional characters "0.
I'm not sure I can give an example of the input file since the character is non-printable. Please see the command output below.
cat Input.csv
ObjectId,OrgId,Title,ObjectType,Type,ObjectId,Location,StartDate,EndDate,DueDate,ObjectId1,ObjectId2,ObjectId3,LastModified,IsDeleted,SortOrder,Depth,DummyId,IsHidden,ResultId,DeletedDate,CreatedBy,LastModifiedBy,DeletedBy
3386484,532947,Test,Topic,Auto,3386415,http://www.test.com ,,,,,,,2016-06-27T05:08:26.3070000Z,1,443,3,,False,,2017-02-16T00:31:39.4870000Z,,,
With the cat -e option you can see the non-printable character.
cat -e Input.csv
M-oM-;M-?ObjectId,OrgId,Title,ObjectType,Type,ObjectId,Location,StartDate,EndDate,DueDate,ObjectId1,ObjectId2,ObjectId3,LastModified,IsDeleted,SortOrder,Depth,DummyId,IsHidden,ResultId,DeletedDate,CreatedBy,LastModifiedBy,DeletedBy^M$
3386484,532947,Test,Topic,Auto,3386415,http://www.test.com ^#,,,,,,,2016-06-27T05:08:26.3070000Z,1,443,3,,False,,2017-02-16T00:31:39.4870000Z,,,^M$
Once the file goes through the script, the output looks like this, with an additional "0 at the end of http://www.test.com:
cat -e Output.csv
"M-oM-;M-?ObjectId","OrgId","Title","ObjectType","Type","ObjectId","Location","StartDate","EndDate","DueDate","ObjectId1","ObjectId2","ObjectId3","LastModified","IsDeleted","SortOrder","Depth","DummyId","IsHidden","ResultId","DeletedDate","CreatedBy","LastModifiedBy","DeletedBy"$
"3386484","532947","Test","Topic","Auto","3386415","http://www.test.com "0","","","","","","","2016-06-27T05:08:26.3070000Z","1","443","3","","False","","2017-02-16T00:31:39.4870000Z","","",""$
Adding command output as per Dave's request
od -ch Input.csv
0000000 357 273 277 O b j e c t I d , O r g I
bbef 4fbf 6a62 6365 4974 2c64 724f 4967
0000020 d , T i t l e , O b j e c t T y
2c64 6954 6c74 2c65 624f 656a 7463 7954
0000040 p e , T y p e , O b j e c t I d
6570 542c 7079 2c65 624f 656a 7463 6449
0000060 , L o c a t i o n , S t a r t D
4c2c 636f 7461 6f69 2c6e 7453 7261 4474
0000100 a t e , E n d D a t e , D u e D
7461 2c65 6e45 4464 7461 2c65 7544 4465
0000120 a t e , O b j e c t I d 1 , O b
7461 2c65 624f 656a 7463 6449 2c31 624f
0000140 j e c t I d 2 , O b j e c t I d
656a 7463 6449 2c32 624f 656a 7463 6449
0000160 3 , L a s t M o d i f i e d , I
2c33 614c 7473 6f4d 6964 6966 6465 492c
0000200 s D e l e t e d , S o r t O r d
4473 6c65 7465 6465 532c 726f 4f74 6472
0000220 e r , D e p t h , D u m m y I d
7265 442c 7065 6874 442c 6d75 796d 6449
0000240 , I s H i d d e n , R e s u l t
492c 4873 6469 6564 2c6e 6552 7573 746c
0000260 I d , D e l e t e d D a t e , C
6449 442c 6c65 7465 6465 6144 6574 432c
0000300 r e a t e d B y , L a s t M o d
6572 7461 6465 7942 4c2c 7361 4d74 646f
0000320 i f i e d B y , D e l e t e d B
6669 6569 4264 2c79 6544 656c 6574 4264
0000340 y \r \n 3 3 8 6 4 8 4 , 5 3 2 9 4
0d79 330a 3833 3436 3438 352c 3233 3439
0000360 7 , T e s t , T o p i c , A u t
2c37 6554 7473 542c 706f 6369 412c 7475
0000400 o , 3 3 8 6 4 1 5 , h t t p : /
2c6f 3333 3638 3134 2c35 7468 7074 2f3a
0000420 / w w w . t e s t . c o m \0 ,
772f 7777 742e 7365 2e74 6f63 206d 2c00
0000440 , , , , , , 2 0 1 6 - 0 6 - 2 7
2c2c 2c2c 2c2c 3032 3631 302d 2d36 3732
0000460 T 0 5 : 0 8 : 2 6 . 3 0 7 0 0 0
3054 3a35 3830 323a 2e36 3033 3037 3030
0000500 0 Z , 1 , 4 4 3 , 3 , , F a l s
5a30 312c 342c 3334 332c 2c2c 6146 736c
0000520 e , , 2 0 1 7 - 0 2 - 1 6 T 0 0
2c65 322c 3130 2d37 3230 312d 5436 3030
0000540 : 3 1 : 3 9 . 4 8 7 0 0 0 0 Z ,
333a 3a31 3933 342e 3738 3030 3030 2c5a
0000560 , , \r \n
2c2c 0a0d
0000564
How can I handle this in the script? I do not want the additional "0 in the output CSV.
Thanks

Once you find out the numerical value for that character, you can use \x{} to specify it by that code number:
s/\x{....}//g;
For example, if the character is 🐱, I can find its ordinal value. I typically do that with a hex dump, but there are a variety of ways. It's U+1F431:
s/\x{1F431}//g;
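If it helps, here is one way to list the code points from inside Perl itself (just an illustration; $string is a placeholder for your already-decoded text, and a hex dump works just as well):
printf "U+%04X\n", ord($_) for split //, $string;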
You can also specify the code number in octal for the non-wide characters:
s/\200//g;
Or use those in a range in a character class:
s/[\200-\377]//g;
s/[\000-\037]//g;
s/[\000-\037\200-\377]//g;
But, there may be another thing you want to do. You might match it by a property it has (see perluniprops):
s/\p{XPosixCntrl}//g
Or, with an uppercase P, a property it doesn't have:
s/\P{Print}//g

So, we've learned two things from your hex dump of the file:
It's definitely a UTF-8 file, as the first three bytes are the UTF-8 byte order mark (BOM).
Your problematic character is actually a null (with a codepoint of zero).
So, following brian's advice, you'll be able to remove the character with:
s/\0//g;
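A minimal sketch of how that could slot into your existing loop, using tr as you already do for the other characters (assuming you simply want the NULs dropped from every field):
while ( my $words = $csv->getline($data) ) {
    tr/\0//d   for @$words;    # drop embedded NUL characters
    tr/\r\n//d for @$words;
    tr/,/;/    for @$words;
    tr/"/'/    for @$words;
    $out_csv->print( $fh_out, $words );
}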

I'm not sure how you discovered it was ^#, but you should be able to refer to any such thing by using \c#, where the \c stands for "CTRL-". So, if it were an SOH, \x01, and you saw it displayed as ^A, you could specify s/\cA//g to get rid of it.

Related

LOAD DATA INFILE MySQL(MariaDB)

I am trying to load a .csv file into MariaDB but I am struggling with the query.
Here is how the .csv file is formatted:
USER DATE TIME TESTRESULT ERRORCODE
Esa_Test 16.5.2022 12:36:59 Fail 1(MinMaxError)
Esa_Test 16.5.2022 12:38:02 Fail 1(MinMaxError)
Esa_Test 16.5.2022 12:55:40 Fail 1(MinMaxError)
Esa_Test 17.5.2022 16:15:00 Fail 1(MinMaxError)
DPHYD_Ate 18.5.2022 9:50:11 OK 0(NoError)
When I use this query:
LOAD DATA LOW_PRIORITY LOCAL INFILE 'C:\\xampp\\mysql\\data\\test\\log2.csv' IGNORE INTO TABLE `test`.`testova2` IGNORE 4 LINES (`USER`, `DATE`, `TIME`, `TESTRESULT`, `ERRORCODE`);
The data is loaded successfully but with spaces like this:
USER;DATE;TIME;TESTRESULT;ERRORCODE
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 3 6 : 5 9 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 3 8 : 0 2 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 6 . 5 . 2 0 2 2 ; 1 2 : 5 5 : 4 0 ; F a i l ; 1 ( M i n M a x E r r o r )
E s a _ T e s t ; 1 7 . 5 . 2 0 2 2 ; 1 6 : 1 5 : 0 0 ; F a i l ; 1 ( M i n M a x E r r o r )
D P H Y D _ A t e ; 1 8 . 5 . 2 0 2 2 ; 9 : 5 0 : 1 1 ; O K ; 0 ( N o E r r o r )
I tried to define some "limits" via FIELDS TERMINATED BY '\t' LINES TERMINATED BY '|' in the query, but it is not working. The original file's encoding is UTF-16LE according to Notepad++.
Please help me build the proper query for my case so that the data is inserted correctly.
If someone is looking at this:
I managed to find a solution.
The problem was the encoding.
The solution was to make a short Python script to change the encoding before the upload into MySQL; the code is:
import codecs
with codecs.open("input.csv", "r", encoding="utf_16") as fin:
    with codecs.open("output.csv", "w", encoding="utf_8") as fout:
        fout.write(fin.read())
I automated this and now everything is working.
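As far as I know, LOAD DATA cannot read UTF-16 files directly, so some conversion step is needed either way. A hedged alternative to the Python script, assuming the iconv command-line tool is available (the output filename is just an example):
iconv -f UTF-16LE -t UTF-8 log2.csv > log2-utf8.csv
Then point LOAD DATA INFILE at the converted file.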

How to know the count of rows deleted using pandas

Here, filtering is done on these 2 .csv files and common email ids are deleted. I am able to get the total after deletion, but is there any option that gives how many rows were deleted using pandas?
Using MySQL:
delete a from data a, data1 b where a.email=b.email; select row_count();
How can this be done using pandas?
import pandas as pd
colnames=['id','emailid']
data=pd.read_csv("input.csv",names=colnames,header=None)
colnames=['email']
data1= pd.read_csv("compare.csv",names=colnames,header=None)
emailid_suppress1=data1['email'].str.lower()
suppress_md5=data[~data['emailid'].isin(emailid_suppress1)]
print suppress_md5.count()
I believe you need the sum of True values, which are processed like 1:
data = pd.DataFrame({'id':list('abcde'), 'emailid':list('klmno')})
print (data)
id emailid
0 a k
1 b l
2 c m
3 d n
4 e o
data1 = pd.DataFrame({'email':list('ABCKLDEFG')})
print (data1)
email
0 A
1 B
2 C
3 K
4 L
5 D
6 E
7 F
8 G
emailid_suppress1=data1['email'].str.lower()
print ((~data['emailid'].isin(emailid_suppress1)).sum())
3
suppress_md5=data[~data['emailid'].isin(emailid_suppress1)]
print (suppress_md5)
id emailid
2 c m
3 d n
4 e o
EDIT:
print ((data['emailid'].isin(emailid_suppress1)).sum())
2
suppress_md5=data[data['emailid'].isin(emailid_suppress1)]
print (suppress_md5)
id emailid
0 a k
1 b l
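As a small cross-check (with suppress_md5 as in the first definition above, i.e. the kept rows), the number of deleted rows is also just the difference in length before and after filtering:
print (len(data) - len(suppress_md5))
2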

Sum Over unique identifier in Spotfire

I am trying to create a bar chart that sums up values in a field, but only for each unique identifier. For instance, for my data:
Condition CT_ID Enrollment Company
I 5127 24 H
J 5127 24 H
P 5127 24 H
I 5127 24 O
J 5127 24 O
P 5127 24 O
L 27668 387 C
R 27668 387 C
D 38190 650 D
Q 38190 650 D
F 38785 30 A
E 39682 30 B
M 41818 17 I
O 44093 188 G
A 54850 18 K
G 59183 F
C 59891 266 J
G 61142 48 F
H 61425 28 L
K 61449 N
A 61793 12 E
N 61793 12 E
B 61910 120 M
B 61917 120 M
B 61961 130 M
Or, since I really want to eventually summarize these data by Condition, let me just show the above data resorted by Condition instead of [CT_ID].
Condition CT_ID Enrollment Company
A 54850 18 K
A 61793 12 E
B 61910 120 M
B 61917 120 M
B 61961 130 M
C 59891 266 J
D 38190 650 D
E 39682 30 B
F 38785 30 A
G 59183 F
G 61142 48 F
H 61425 28 L
I 5127 24 H
I 5127 24 O
J 5127 24 H
J 5127 24 O
K 61449 N
L 27668 387 C
M 41818 17 I
N 61793 12 E
O 44093 188 G
P 5127 24 H
P 5127 24 O
Q 38190 650 D
R 27668 387 C
The rows are duplicated by different values in Condition and Company. CT_ID is the identifier for the record.
I want to sum up my Enrollment column, but I will be overcounting. So I thought I could create a custom expression like:
Sum(Avg([Enrollment]) OVER ([CT_ID]))
But this is somehow not a valid expression. Where am I going wrong?
For my bar chart, I'd like to have the Condition as the category axis and the Enrollment as the value axis. The below table shows how the Enrollment value should be calculated. Focus on Conditions P, I, and J.
Condition Enrollment
A 30
B 370
C 266
D 650
E 30
F 30
G 48
H 28
I 24
J 24
K
L 387
M 17
N 12
O 188
P 24
Q 650
R 387
My current solution uses a rank function and only puts the enrollment info in the first row for each unique CT_ID, but that is just plain wrong when I start filtering data. For instance, imagine in my dataset above I only had an Enrollment value for the first row of CT_ID 5127. If I filter out Condition "I" (the one in the first row), now the Enrollment value is zero!
Any help you can provide would be greatly appreciated! I'm no expert on OVER expressions, so hopefully there is an easy solution!
This is too long for a comment...
Can you provide some expected results? It looks as though each CT_ID can only have one value, so summing them distinctly would just give any one of the values... right? If not, what determines a duplicate?
Here is how you can accomplish what you were attempting in your expression above, but with your test data it is just going to give you the value of Enrollment, since each CT_ID seemingly only has one value (duplicated).
Sum([Enrollment]) over ([CT_ID]) / Count() OVER ([CT_ID])
You could also just use First()
First([Enrollment]) OVER ([CT_ID])
EDIT
Since you have some duplicates and some not... let's just split the duplicated value into an average over however many duplicates there are. Insert this calculated column:
Max([Enrollment]) over (Intersect([CT_ID],[Condition])) / Count([Enrollment]) over (Intersect([CT_ID],[Condition]))
Then use this column in place of Enrollment in whatever expression you want to ignore duplicates, where a duplicate is the same Condition, CT_ID, and Enrollment value.
For example... the rows for Condition = P and CT_ID = 5127 will have 12 instead of 24.

MySQL query the latest duplicate

EDIT:
I'm trying to do a MySQL query which will give me the latest entry for duplicates together with those without any duplicates.
This is what my table look like:
mentor table:
mentor_id applicant_id mentor_count mentor_email mentor_name mentor_pass
193 92 1 test#yahoo.com test 1234
194 92 2 table#yahoo.com table 4567
195 92 3 lamp#yahoo.com lamp 7890
196 92 1 test#yahoo.com test 1234
197 92 2 table#yahoo.com table 4567
198 92 3 lamp#yahoo.com lamp 7890
mf table:
mf_id mentor_id dept contact orgname yrs length sak social char goal resp emomat res others impact evaluation
43 193 math dept 9111111 etc 1 1 e e e e e e e e e good
114 196 math dept 9111111 etc 1 1 e e e e e e e e e good
193 197 sci dept 9222222 org 2 2 n n n n n n n n n medium
194 194 sci dept 9222222 org 2 2 n n n n n n n n n medium
220 195 eng dept 9333333 hello 3 3 q q q q q q q q q bad
I tried using this query:
SELECT *
FROM mentor m1
LEFT JOIN (
SELECT mentor_name, max( mentor_id ) AS maxid
FROM mentor m
GROUP BY m.mentor_id
)m2 ON m1.mentor_name = m2.mentor_name
AND m1.mentor_id < m2.maxid
LEFT JOIN mf ON m1.mentor_id = mf.mentor_id
WHERE m1.applicant_id =833
AND m2.maxid IS NULL
ORDER BY m1.mentor_id ASC
LIMIT 0 , 30
but this is what happens:
mentor_id applicant_id mentor_count mentor_email mentor_name mentor_pass mentor_name maxid mf_id mentor_id dept contact orgname yrs length sak social char goal resp emomat res others spirit concept comm impact evaluation
/*there is data here but the column for mentor_name onwards is null*/
How can I make it so that the columns from mentor_name onwards are not null, but the query still displays the latest duplicates as well as those without any duplicates?
try
select * from mentor
where mentor_id in
(
SELECT max(mentor_id) from mf
where applicant_id = 92
group by mentor_id
)
I guess you want to add mentor.application_id = mf.application_id to the JOIN condition
select *
from mentor
inner join
(
SELECT *, max(mentor_id) as maxid
from mf
group by mentor_id
) mf on mentor.mentor_id = mf.maxid AND mentor.application_id = mf.application_id
where applicant_id = 92
Typically you will need an extra condition to get the duplicates. WHERE applicant_id = 92 won't be a duplicate unless there were others with the same applicant_id in the same table.
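For reference, here is a hedged sketch of the usual greatest-row-per-group pattern, assuming duplicates are rows sharing mentor_name and "latest" means the highest mentor_id (table and column names are taken from the question; mentors without a matching mf row will show NULLs there):
SELECT m1.*, f.*
FROM mentor AS m1
JOIN (
    SELECT mentor_name, MAX(mentor_id) AS maxid
    FROM mentor
    WHERE applicant_id = 92
    GROUP BY mentor_name
) AS m2 ON m1.mentor_id = m2.maxid
LEFT JOIN mf AS f ON f.mentor_id = m1.mentor_id
ORDER BY m1.mentor_id;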

Improvement of VBA code in Access to do something like a pivot table

I have a table of data in MS Access 2007. There are 6 fields per record, thousands of records. I want to make a sort of pivot table like object. That is, if any two rows happens to be the same in the first 4 fields, then they will end up grouped together into one row. The column headers in this pivot table will be the values from the 5th field, and the value in the pivot table will be the 6th field, a dollar amount. Think of the 5th field as letters A, B, C, D, E, F, G. So, the table I start with might have a row with A in the 5th field and $3.48 in the 6th field. Another row may match in the first 4 fields, have B in the 5th field and $8.59 in the 6th field. Another may match in the first 4 fields, have E in the 5th field and $45.20 in the 6th field. I want all these rows to be turned into one row (in a new table) that starts with the first 4 fields where they match, then lists $3.48, $8.59, $0.00, $0.00, $45.20, $0.00, $0.00, corresponding to column headers A, B, C, D, E, F, G (since no records contained C, D, F, G, their corresponding values are $0.00), and then ends with one more field that totals up the money in that row.
Currently, I have some VBA code that does this, written by someone else a few years ago. It is extremely slow and I am hoping for a better way. I asked a previous question (but not very clearly so I was advised to create a new question), where I was asking if there was a better way to do this in VBA. My question asked about reading and writing large amounts of data all at once in Access through VBA, which I know is a good practice in Excel. That is, I was hoping to take my original table and just assign the entire thing to an array all at once (as in Excel, instead of cell by cell), then work with that array in VBA and create some new array and then write that entire array all at once to a new table (instead of record by record, field by field). From the answers in that question, it seems like that is not really a possibility in Access, but my best bet might be to use some sort of query. I tried the Query Wizard and found the Cross Tab query which is close to what I describe above. But, there appears to be a max of 3 fields used in the Row Heading, whereas here I have 4. And, instead of putting $0.00 when a value is not specified (like C, D, F, G in my example above), it just leaves a blank.
Update (in response to Remou's comment to give sample data): Here is some sample data.
ID a b c d e f
7 1 2 3 5 A 5
8 1 2 3 5 B 10
9 1 2 3 5 C 15
10 1 2 3 5 D 20
11 1 2 3 5 E 25
12 1 2 4 4 A 16
13 1 2 4 4 B 26
14 1 3 3 7 D 11
15 1 3 3 7 B 11
The result should be:
a b c d an bn cn dn en Total
1 2 3 5 5 10 15 20 25 75
1 2 4 4 16 26 0 0 0 42
1 3 3 7 0 11 0 11 0 22
But, when I copy and paste the SQL given by Remou, the only output I get is
a b c d an bn cn dn en
1 2 3 5 5 10 15 20 25
This is, I think, what you want, but it would be better to consider database design, because this is a spreadsheet-like solution.
SELECT t0.a,
t0.b,
t0.c,
t0.d,
Iif(Isnull([a1]), 0, [a1]) AS an,
Iif(Isnull([b1]), 0, [b1]) AS bn,
Iif(Isnull([c1]), 0, [c1]) AS cn,
Iif(Isnull([d1]), 0, [d1]) AS dn,
Iif(Isnull([e1]), 0, [e1]) AS en
FROM (((((SELECT DISTINCT t.a,
t.b,
t.c,
t.d
FROM table3 t) AS t0
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS a1
FROM table3 t
WHERE t.e = "A") AS a0
ON ( t0.d = a0.d )
AND ( t0.c = a0.c )
AND ( t0.b = a0.b )
AND ( t0.a = a0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS b1
FROM table3 t
WHERE t.e = "B") AS b0
ON ( t0.d = b0.d )
AND ( t0.c = b0.c )
AND ( t0.b = b0.b )
AND ( t0.a = b0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS c1
FROM table3 t
WHERE t.e = "C") AS c0
ON ( t0.d = c0.d )
AND ( t0.c = c0.c )
AND ( t0.b = c0.b )
AND ( t0.a = c0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS d1
FROM table3 t
WHERE t.e = "D") AS d0
ON ( t0.d = d0.d )
AND ( t0.c = d0.c )
AND ( t0.b = d0.b )
AND ( t0.a = d0.a ))
LEFT JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS e1
FROM table3 t
WHERE t.e = "E") AS e0
ON ( t0.d = e0.d )
AND ( t0.c = e0.c )
AND ( t0.b = e0.b )
AND ( t0.a = e0.a );
Table3
ID a b c d e f
1 1 2 3 4 a €10.00
2 1 2 3 4 b €10.00
3 1 2 3 4 c €10.00
4 1 2 3 4 d €10.00
5 1 2 3 4 e €10.00
6 1 2 3 5 a €10.00
7 1 2 3 5 b
8 1 2 3 5 c €10.00
9 1 2 3 5 d €10.00
10 1 2 3 5 e €10.00
Result
There are two rows, because there are only two different sets in the first four columns.
a b c d an bn cn dn en
1 2 3 4 €10.00 €10.00 €10.00 €10.00 €10.00
1 2 3 5 €10.00 €0.00 €10.00 €10.00 €10.00
The way the SQL above is supposed to work is that it selects each of the four definition columns and the currency column from the table where the sort column has a particular sort letter, and labels the currency column with that letter. Each of these sub-queries is then assembled, but you can also take a sub-query on its own and look at its results. The last one is the part between the parentheses:
INNER JOIN (SELECT t.a,
t.b,
t.c,
t.d,
t.f AS e1
FROM table3 t
WHERE t.e = "E") AS e0