Need to insert a column in CSV and parse current date in it - JREPL.BAT?

I could (thanks to dbenham and his powerful JREPL.BAT) remove the header row from a CSV file with this command:
jrepl "^(Date,)?.*" "($1?i++:i)?$0:false" /jmatch /jbeg "var i=0" /f test.txt /o output.txt
I now need to insert into the csv below the date (here 2016-03-27) as a new column for every row, and delete the last row (the Total row). Would jrepl do this too? Thanks!
Report,Begin Date,End Date,Currency,Change Currency
Activity Summary By Account,2016-03-27 00:00:00.000 -0600,2016-03-28 00:00:00.000 -0600,USD,Change Currency
Affiliate,Account,Screen Alias,Total Wagered,Total Payout,Net Win Loss,Percent Payout
FaridZ,BuF,BuFis,1153.00,828.00,325.00,71.81%
JohnX,adel,adel,104.70,71.70,33.00,68.48%
FaridZ,chat00,shat,49065.00,45987.50,3077.50,93.72%
,,Total:,"50,657.70","47,247.20","3,410.50",93.26%
Updated: screenshot of final csv output...

This can be done efficiently with JREPL.BAT, but the solution is fairly complicated if you want to do everything with a single pass of the input data. I'm not sure that the solution is any simpler than writing a custom JScript or VBS script.
Note that I discovered a minor JREPL.BAT bug while developing this solution, so there is an updated version 3.8 at the link with a few bug fixes.
jrepl "^$|^,,Total:,.*|^.*?,(.+?),.*"^
"i=1;''|false|if(ln==2){dt=$4;$0}else i?$0+','+((i++==1)?'Date':dt):$0"^
/jmatch /jbeg "var i=0, dt" /t "|" /f test.txt /o output.txt
I used line continuation to make the code easier to read.
A bit of explanation of the solution is in order.
/JBEG defines a couple of variables that are needed during the find/replace operation.
dt - Holds the captured date string.
i - Used to control whether anything is appended:
if i=0 then no change
if i=1 then append the Date header
else append dt
I used /JMATCH along with the /T option with | as a delimiter. The /T option is similar to the unix tr command. For each delimited search in the find string, there is a corresponding JScript expression in the replacement string.
$1 search ^$ - Looks for an empty line
replace i=1;'' - Triggers i so that subsequent non-empty lines have the date column appended. The replacement value for this line is an empty string.
$2 search ^,,Total:,.* - Looks for the final Total line
replace false - Prevents the total line from being printed
$3 search ^.*?,(.+?),.* - Looks for a line with at least 3 fields, capturing the 2nd field in $4
replace if(ln==2){dt=$4;$0}else i?$0+','+((i++==1)?'Date':dt):$0 - This is where most of the complicated logic resides:
If this is the 2nd line, then save the date string ($4) in dt and replace with the full matched string
else if i is not 0, then increment i and replace with the full matched string plus append string ',Date' the first time, else append the dt value
else i=0, so replace with the original string.
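For comparison, here is what the custom script route might look like: a minimal single-pass Python sketch of the logic above (an illustration, not part of the original answer), assuming input test.txt, output output.txt, and a blank line separating the report header from the table (the line the ^$ search relies on):
with open('test.txt') as src, open('output.txt', 'w') as dst:
    dt = None                                 # date captured from line 2
    i = 0                                     # 0 = report header, 1 = next line is the table header, 2 = data rows
    for ln, line in enumerate(src, 1):
        line = line.rstrip('\n')
        if ln == 2:
            dt = line.split(',')[1]           # 2nd field: the Begin Date value
        if line == '':
            i = 1                             # blank line: the table starts next
            dst.write('\n')
        elif line.startswith(',,Total:,'):
            pass                              # drop the trailing Total row
        elif i == 0:
            dst.write(line + '\n')            # report header lines: unchanged
        elif i == 1:
            dst.write(line + ',Date\n')       # table header: append the Date column
            i = 2
        else:
            dst.write(line + ',' + dt + '\n') # data rows: append the captured date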

Related

Fail to load a 4 column CSV file into OCTAVE - output is only first column, or 1 array per line

Trouble loading a csv file into OCTAVE.
EDIT: as pointed out by ANDY and Eliahu Aaron, I changed ; to ,.
csvread now returns 4 separate columns, each named after its header in the first line.
My MATLAB script throws these errors:
error: 'z' undefined near line 13 column 3
error: called from myScript at line 13 column 2
I can't find z even though there is now a column called z from which it should calculate.
This fixed my issue in the end:
g = cell2mat(A(2:end-1,2));
My csv looks like this:
time;z;y;x
5;15084;-1360;-9664
7;15280;-1296;-9784
10;15032;-1384;-9688
30;15160;-1548;-9772
56;15116;-1532;-9660
First I had to delete the first row, because the matrix was unreadable for Octave.
If I try to csv2cell my file, I only get 1 column filled with all the values in every line:
mycsvdata = csv2cell("file.csv")
If I try csvread, I get 1 column named "ans" containing the values of the first column... the 2nd, 3rd and 4th columns are ignored.
csvread("file.csv")
When I drag and drop the same csv into MATLAB and click on the green tick, every column is named after its first cell and becomes a variable. I end up having 4 vars called time, z, y and x.
In Octave this is kind of impossible for me to achieve.
What am I doing wrong?
This seems to be such a basic problem, but I haven't come across a solution on the internet.
I need to get 4 variables called time, z, y and x, with all the values from the 1st (time), 2nd (z), 3rd (y) and 4th (x) columns stored in them.
I am new to Octave and have code written for MATLAB which I want to port to Octave. I am not even able to test my code, because I am not able to load the csv properly. This is very frustrating for me.
Thanks in advance.
CSV by default uses , as column delimiters but your file has ; as column delimiters.
You can use dlmread("file.csv", ";") instead of csvread but it can't read the first row time;z;y;x.
You can use csv2cell("file.csv", ";"), the first row will be strings and the rest numbers.
To create a struct array with fields time, z, y and x you can use the following code:
pkg load io
A = csv2cell("file.csv", ";");            % first row: header strings, remaining rows: numbers
B = cell2struct(A(2:end,:), A(1,:), 2);   % struct array with fields time, z, y, x

Read a list of CSV files in Talend with ; in field

I have a list of CSV files which I receive for ETL into a database every month. They are all in one folder. My data has ; in many columns as well. For example, the location column contains values like New York; USA, which I want to appear in a single column instead of being split across many columns. How do I specify the delimiter then?
I think you cannot have the field separator included in the field content, or else you have to enclose those values in double quotes. For example:
blabla;"New York; USA";blabla
Other solution: change the field delimiter to a more specific (and unused) character.
I'm afraid there is no better solution.
Regards,
TRF
As TRF mentioned, you can't have the delimiter as part of the non-delimiting text in your file.
My workaround for that would be the following:
1) Read the file with a tFileInputFullRow (https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFileInputFullRow)
2) Use a tReplace to replace the ; with some other character, say -, in the problem cells (in your case, replace "New York;USA" with "New York-USA"). You can also use the regex option in the tReplace component to make it a generic rule (a sketch of such a rule follows these steps).
3) Save that output into another file
4) Now read the new file using ; as the delimiter
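The generic rule from step 2 could be prototyped outside Talend like this; a minimal Python sketch, assuming semicolons only need replacing inside double-quoted cells:
import re

line = 'blabla;"New York; USA";blabla'
# generic rule: swap ';' for '-' only inside double-quoted cells
fixed = re.sub(r'"[^"]*"', lambda m: m.group(0).replace(';', '-'), line)
print(fixed)   # blabla;"New York- USA";blabla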
References:
1) tReplace: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.16+tReplace
2) Regex: https://docs.oracle.com/javase/tutorial/essential/regex/

Convert MySQL "INSERT" commands to text

I am trying to import a recent Wikipedia dump into a MySQL database. The problem is that I am inserting a 50 GiB text table using INSERT INTO text MySQL commands, and I want to convert these into a text file.
My text.sql file has the following structure:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),...,(idN,'textN','flagsN');
However, using mysql -u USERNAME -p DBNAME < text.sql is very slow. I am already disabling autocommit, unique_checks and foreign_key_checks, and I am enclosing all transactions within a START TRANSACTION; ... COMMIT; block, but the import process is still very slow.
After researching, I read here that using LOAD DATA INFILE can be much faster than INSERT commands. Therefore, I am looking to convert text.sql to text.txt as follows:
id1,'text1','flags1'
id2,'text2','flags2'
...
idN,'textN','flagsN'
I was thinking of using awk for this, but my experience with regular expressions is very limited. Furthermore, each INSERT command is given on a single line, as shown above, which makes it even more difficult for me to extract the values.
Given that the text.sql file is 50 GiB, would you recommend using awk or to develop a C/C++ program? If awk is a good approach, how could I achieve the conversion?
Input #1 example:
INSERT INTO text (old_id,old_text,old_flags) VALUES (id1,'text1','flags1'),(id2,'text2','flags2'),(id3,'text3','flags3');
Output #1 example:
id1,'text1','flags1'
id2,'text2','flags2'
id3,'text3','flags3'
Input #2 example: (with parentheses in the values)
INSERT INTO page (page_id,page_namespace,page_title,page_restrictions,page_is_redirect,page_is_new,page_random,page_touched,page_latest,page_len,page_content_model) VALUES (10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'),(12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext');
Output #2 example:
10,0,'AccessibleComputing','',1,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,631144794,69,'wikitext'
12,0,'Anarchism','',0,0,RAND(),DATE_ADD('1970-01-01', INTERVAL UNIX_TIMESTAMP() SECOND)+0,703037144,180446,'wikitext'
Input #3 example: (with escaped ' or ")
INSERT INTO text (old_id,old_text,old_flags) VALUES (631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'),(703037144,'{{Redirect2|Anarchist|Anarchists
|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8');
Output #3 example:
631144794,'#REDIRECT [[Computer accessibility]]\n\n{{Redr|move|from CamelCase|up}}','utf-8'
703037144,'{{Redirect2|Anarchist|Anarchists|the fictional character|Anarchist (comics)|other uses|Anarchists (disambiguation)}}\n{{pp-move-indef}}\n{{Use British English|date=January 2014}}','utf-8'
Edit: after conducting some more research, it appears that examples #2 and #3 may not be convertible using regular expressions alone (sources: #1, #2).
If this isn't what you want:
$ awk -v FPAT='[(][^)]+[)]' '{for (i=2;i<=NF;i++) print substr($i,2,length($i)-2)}' file
id1,'text1','flags1'
id2,'text2','flags2'
idN,'textN','flagsN'
then edit your question to provide clearer, testable sample input and expected output.
The above used GNU awk for FPAT, with other awks you'd use a while(match()) loop.
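For inputs like examples #2 and #3, where values contain parentheses, escaped quotes, or even raw newlines, a small state machine is more reliable than a single regex. A minimal Python sketch of that idea (my own illustration): it assumes backslash escapes inside strings and that the keyword VALUES never appears inside identifiers, and for brevity it slurps the whole file, which a real 50 GiB run would replace with streaming.
import sys

def dump_values(path):
    text = open(path, encoding='utf-8').read()   # sketch only: a 50 GiB file needs streaming
    i, n = 0, len(text)
    while True:
        i = text.find('VALUES', i)
        if i < 0:
            break
        i += len('VALUES')
        depth, quote, row = 0, None, []
        while i < n:
            c = text[i]
            if quote:                            # inside a 'string' or "string"
                row.append(c)
                if c == '\\' and i + 1 < n:      # keep escaped chars: \' \" \n ...
                    i += 1
                    row.append(text[i])
                elif c == quote:
                    quote = None
            elif c in '\'"':
                quote = c
                row.append(c)
            elif c == '(':
                depth += 1
                if depth > 1:
                    row.append(c)                # parenthesis inside a value, e.g. RAND()
            elif c == ')':
                depth -= 1
                if depth == 0:                   # end of one (...) tuple
                    print(''.join(row).replace('\n', ''))  # rejoin values split across lines
                    row = []
                else:
                    row.append(c)
            elif c == ';' and depth == 0:
                break                            # end of this INSERT statement
            elif depth > 0:
                row.append(c)
            i += 1

dump_values(sys.argv[1])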
Use this:
sed -e 's/(//' -e 's/),//' test.csv
(appropriately piped) and all your lines will be clean.
Change first and last lines manually.
Regards

Replace not working

I have a text file of data in key-value pairs that I have managed to convert to a format where the key-value pairs are all separated by underscores, and each key is separated from its value by a colon. I thought this format would be useful for keeping spaces intact within the data. Here's an example with ~~~~~~~s substituted for the data.
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~ ...etc
I want to convert this to a MySQL script to insert the data into a table. My problem is there are nullable fields that aren't included in every record. e.g. A record has a _TYPE1: and may or may not have a _TYPE2:
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
I thought to fix this by inserting _TYPE2: after every _TYPE1 without a _TYPE2:. Since there are only a few different possible types, I managed to select the _ after each _TYPE1:~~~~~~ without a TYPE2: following it. I used the following regex, where egtype is one example of a possible type:
(?<=_TYPE1:egtype)_(?!TYPE2:)
At this point, all I have to do is replace that _ with _TYPE2:_ and every field is present in every line, which makes it easy to convert every row to a MySQL insert statement! Unfortunately, Notepad++ is not replacing it when I click the Replace button. I'm not sure why.
Does anyone know why it wouldn't replace an _ with _TYPE2:_ using that particular regex? Or does anyone have any other suggestions on how to turn all this data into a MySQL insert script?
Regex
To do what you want, try this:
Find:
_TYPE1:[^_]+\K(?!.*_TYPE2)
Replace:
_TYPE2:
You can test it with your sample data and have it explained here.
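As a quick sanity check outside Notepad++, the lookaround idea does work; here is a small Python sketch using the question's original fixed-width lookbehind (Python's re has no \K), with egtype standing in for one concrete type:
import re

line = "_DESCRIPTION:~~~~~~_TYPE1:egtype_ADDRESS:~~~~~~~"
fixed = re.sub(r'(?<=_TYPE1:egtype)_(?!TYPE2:)', '_TYPE2:_', line)
print(fixed)   # _DESCRIPTION:~~~~~~_TYPE1:egtype_TYPE2:_ADDRESS:~~~~~~~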
Python Script plugin
As a side note, I don't think it's possible to convert your data into SQL insert statements with the use of one and only one regular expression, and while I see what you are trying to do by adding fake TYPE2, I don't think it is your best option.
So, my suggestion is to use Notepad++'s Python Script plugin.
Install Python Script plugin, from Plugin Manager or from the official website.
Then go to Plugins > Python Script > New Script. Choose a filename for your new file (eg sql_insert.py) and copy the code that follows.
Run Plugins > Python Script > Scripts > sql_insert.py and a new tab will show up the desired result.
Script:
columns = [[]]
values = [[]]
current_line = 0

# Called once per regex match; collects column names and values per line.
def insert(line, match):
    global current_line
    if line > current_line:          # reached a new line: start fresh lists
        current_line += 1
        columns.append([])
        values.append([])
    if match:
        i = 0
        for m in match.groups():     # group 1: column name, group 2: value
            if i % 2 == 0:
                columns[line].append(m)
            else:
                values[line].append(m)
            i += 1

editor.pysearch("_([A-Z0-9]+):([^_\n]+)", insert)
notepad.new()                        # open a new tab for the generated SQL
for line in range(len(columns)):
    editor.addText("INSERT INTO table (" + ",".join(columns[line]) + ") values (" + ",".join(values[line]) + ");\n")
Note: I'm still learning Python and I've a feeling that this one could be written in a better way. Feel free to edit my answer or drop a comment if you can suggest improvements!
Example input:
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~
Example output:
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,TYPE2) values (~~~,~~~~~,~~~~~~~,~~~~~~,~~~~~~);
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,TYPE2,ADDRESS) values (~~~,~~~~~,~~~~~~,~~~~~~,~~~~~~,~~~~~~~);
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,ADDRESS) values (~~~,~~~~~,~~~~~~,~~~~~~,~~~~~~~);
Try searching for (_TYPE1:)(\S\S\S\S\S\S)(_ADDRESS:)
and replacing with \1\2_TYPE2:~~~~~~\3
I tested it in Notepad++ with your data and it works.
Don't forget to change the Search Mode to regular expression.
To turn it into an INSERT script, just keep using regular expressions like I did above: bracket whichever field you want, then replace with a \number for each field and move them around. It should be pretty simple manual labor; have fun.
For example, search for your whole line; here I am only doing DESCRIPTION, TYPE1 and TYPE2.
Search for, using a regular expression:
(_DESCRIPTION)(:)(\S\S\S\S\S\S)(_TYPE1)(:)(\S\S\S\S\S\S)(_TYPE2)(:)(\S\S\S\S\S\S)
then replace with something like
INSERT INTO table1\(desc,type1,type2\)values\('\3','\6','\9'\); (in notepad++)
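The same transformation, prototyped as a Python sketch (note that, unlike in Notepad++, the parentheses in the replacement need no escaping here):
import re

line = "_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~"
pattern = r'(_DESCRIPTION)(:)(\S{6})(_TYPE1)(:)(\S{6})(_TYPE2)(:)(\S{6})'
print(re.sub(pattern, r"INSERT INTO table1(desc,type1,type2)values('\3','\6','\9');", line))
# prints: INSERT INTO table1(desc,type1,type2)values('~~~~~~','~~~~~~','~~~~~~');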
If this is a one-off problem then a two-step process would work. Step one adds a _TYPE2:SomeDefaultValue to every line; step two removes it from lines where it was not needed.
Step 1: Find what: $, Replace with: _TYPE2:xxx
Step 2: Find what: (_TYPE2:.*)_TYPE2:xxx$, Replace with: \1
In both steps select "regular expression" and un-select "dot matches newline". Also change xxx to your default value.
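For a single line, the same two steps can be prototyped in Python (xxx as the placeholder default, per the steps above):
import re

for line in ["_ID:~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~",
             "_ID:~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~"]:
    step1 = line + "_TYPE2:xxx"                              # step 1: append to every line
    step2 = re.sub(r'(_TYPE2:.*)_TYPE2:xxx$', r'\1', step1)  # step 2: remove where redundant
    print(step2)
# _ID:~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~
# _ID:~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~_TYPE2:xxx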

CSV with comma or semicolon?

How is a CSV file built in general? With commas or semicolons?
Any advice on which one to use?
In Windows it is dependent on the "Regional and Language Options" customize screen where you find a List separator. This is the char Windows applications expect to be the CSV separator.
Of course this only has effect in Windows applications, for example Excel will not automatically split data into columns if the file is not using the above mentioned separator. All applications that use Windows regional settings will have this behavior.
If you are writing a program for Windows that will require importing the CSV into other applications, and you know that the list separator set on your target machines is ,, then go for it; otherwise I prefer ;, since it causes fewer problems with decimal points and digit grouping, and does not appear in much text.
CSV is a standard format, outlined in RFC 4180 (in 2005), so there IS no lack of a standard. https://www.ietf.org/rfc/rfc4180.txt
And even before that, the C in CSV has always stood for Comma, not for semiColon :(
It's a pity Microsoft keeps ignoring that and is still sticking to the monstrosity they turned it into decades ago (yes, I admit, that was before the RFC was created).
One record per line, unless a newline occurs within quoted text (see below).
COMMA as column separator. Never a semicolon.
PERIOD as decimal point in numbers. Never a comma.
Text containing commas, periods and/or newlines enclosed in "double quotation marks".
If text is enclosed in double quotation marks (and only then), quotation marks within the text are escaped by doubling. These two examples represent the same three fields:
1,"this text contains ""quotation marks""",3
1,this text contains "quotation marks",3
The standard does not cover date and time values, personally I try to stick to ISO 8601 format to avoid day/month/year -- month/day/year confusion.
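To see the quoting rule in action: a standards-following parser resolves the doubled quotation marks for you. A quick Python check, using the first example line above:
import csv, io

data = '1,"this text contains ""quotation marks""",3\n'
print(next(csv.reader(io.StringIO(data))))
# ['1', 'this text contains "quotation marks"', '3']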
I'd say stick to comma as it's widely recognized and understood. Be sure to quote your values and escape your quotes though.
ID,NAME,AGE
"23434","Norris, Chuck","24"
"34343","Bond, James ""master""","57"
Also relevant, and specific to Excel: look at this answer and this other one, which suggest inserting a line at the beginning of the CSV with
"sep=,"
to inform Excel which separator to expect.
1. Change the file format to .CSV (semicolon delimited).
To achieve the desired result we need to temporarily change the delimiter setting in the Excel Options:
Move to File -> Options -> Advanced -> Editing section.
Uncheck the “Use system separators” setting and put a comma in the “Decimal Separator” field.
Now save the file in the .CSV format and it will be saved in the semicolon delimited format.
Initially it was to be a comma; however, as the comma is often used as a decimal point, it would not be such a good separator, hence alternatives like the semicolon, mostly country dependent:
http://en.wikipedia.org/wiki/Comma-separated_values#Lack_of_a_standard
CSV is a Comma Separated file. Generally the delimiter is a comma, but I have seen many other characters used as delimiters; they are just not as frequently used.
As for advising you on what to use, we need to know your application. Is the file specific to your application/program, or does this need to work with other programs?
To change comma to semicolon as the default Excel separator for CSV, go to Region -> Additional Settings -> Numbers tab -> List separator and type ; instead of the default ,.
Just to say something in favor of the semicolon: in a lot of countries the comma is used as the decimal mark, not the period. Roughly half of the world follows that convention, while the other half follows the UK standard (how did the UK get so big? O_O), which makes the comma a headache for data that includes numbers, because Excel then refuses to recognize it as a delimiter.
Likewise my country, Vietnam, follows France's standard, while our partners in Hong Kong use the UK standard, so the comma makes CSV unusable for us; we use \t or ; instead for international use, though neither is "standard" per the CSV documents.
The best way will be to save it as a text file with a csv extension:
Sub ExportToCSV()
    Dim i As Integer, j As Integer
    Dim Name As String
    Dim pathfile As String
    Dim fs As Object
    Dim stream As Object
    Set fs = CreateObject("Scripting.FileSystemObject")
    On Error GoTo fileexists
    i = 15                                  ' first data row of the sheet
    Name = Format(Now(), "ddmmyyHHmmss")    ' timestamp used as the file name
    pathfile = "D:\1\" & Name & ".csv"
    Set stream = fs.CreateTextFile(pathfile, False, True)
fileexists:
    If Err.Number = 58 Then
        MsgBox "File already Exists"
        'Your code here
        Exit Sub                            ' the original Return would raise "Return without GoSub"
    End If
    On Error GoTo 0
    j = 1
    Do Until IsEmpty(ThisWorkbook.ActiveSheet.Cells(i, 1).Value)
        ' write columns 1 and 6 separated by ";", converting the decimal point to a comma
        stream.WriteLine (ThisWorkbook.Worksheets(1).Cells(i, 1).Value & ";" & Replace(ThisWorkbook.Worksheets(1).Cells(i, 6).Value, ".", ","))
        j = j + 1
        i = i + 1
    Loop
    stream.Close
End Sub