Reading and splitting a CSV File with Talend - csv

The CSV file contains more than one table. It might look like this:
"Table 1"
,
"id","visits","downloads","emailsent"
1, 4324, 23, 2
2, 664, 42, 1
3, 73, 44, 0
4, 914, 8, 0
...
"Table 2"
,
"id_of_2nd_tab","visits_of_2nd_tab","downloads_of_2nd_tab"
1, 524, 3
2, 564, 52
3, 63, 84
4, 814, 8
...
What is the best way to import those tables into Talend?

Generally, that kind of multi-record CSV format is more complex to parse.
Question: Is there a finite number of tables?
Question: Does each table have a fixed number and order of columns?
Question: What is the separator between "tables" within the CSV?
I believe you need to take a multi-pass approach. You could do something like this.
Pass #1 - Use tFileInputDelimited
Use a row separator such as "Table" and no field separator, so that each table is grabbed as one big field
Alternatively, you could split the first file into separate files at this stage.
Pass #2 - Split Row (on results from pass #1) on the Row Separator "\r\n" etc
Split it into multiple rows but of a single column.
Pass #3 - Extract Delimited Fields (on results from pass #2)
Extract based on a field separator
Recognize a "Table" row
Recognize a "Header" row
Additional handling per Table / set of fields in header
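For comparison, the same multi-pass idea can be sketched outside Talend in plain Python. This is only an illustration under assumptions: each table is taken to start with a one-cell title row matching "Table N", the row right after it is taken to be the header, and the lone "," lines are treated as blank. The names split_tables, TITLE and SAMPLE are made up for this sketch.

```python
import csv
import io
import re

# Assumption: every table starts with a one-cell title row such as "Table 1".
TITLE = re.compile(r"^Table \d+$")

SAMPLE = '''"Table 1"
,
"id","visits","downloads","emailsent"
1, 4324, 23, 2
2, 664, 42, 1
"Table 2"
,
"id_of_2nd_tab","visits_of_2nd_tab","downloads_of_2nd_tab"
1, 524, 3
2, 564, 52
'''

def split_tables(text):
    """Return {title: {"header": [...], "rows": [[...], ...]}}."""
    tables = {}
    current = None
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank lines and the lone "," separator rows
        if len(cells) == 1 and TITLE.match(cells[0]):
            # A "Table" row: start collecting a new table.
            current = {"header": None, "rows": []}
            tables[cells[0]] = current
        elif current is None:
            continue  # ignore anything before the first title row
        elif current["header"] is None:
            current["header"] = cells  # first row after the title
        else:
            current["rows"].append(cells)
    return tables

tables = split_tables(SAMPLE)
```

Each entry in tables can then be written to its own file or loaded separately, which corresponds to the "split into separate files" alternative mentioned in pass #1.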

Use a tFileInputExcel component to read each worksheet. Then you can use tMap to join the worksheets into a target column layout, assuming you want to do some processing on a joined set of columns.

Related

Apache NiFi: Creating a new column by comparing multiple rows with various data

I have a csv which looks like this:
first,second,third,num1,num2,num3
12,312,433,0787388393,0783452323,01123124
12,124345,453,07821323,077213424,0123124421
33,2432,214,077213424,07821323,0234234211
I have to create another column according to the data stored in num1 and num2. There can be various values in the columns, but the new column should contain only one of two values: original or fake. (I should only compare the first 3 digits of num1 and num2.)
For the mapping part I have another csv which looks like this (I have many more rows):
078,078,fake
072,078,original
077,078,original
My Output csv should look like this after mapping:
first,second,third,num1,num2,num3,status
12,312,433,0787388393,0783452323,01123124,fake
12,124345,453,07821323,072213424,0123124421,original
33,2432,214,078213424,07821323,0234234211,fake
I hope you can suggest a NiFi workflow to get this done.
You can use LookupRecord for this, but due to the special logic you'll likely have to write your own ScriptedLookupService to read in the mapping file and compare the first 3 digits.
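The comparison such a lookup would have to perform can be sketched in plain Python. This is not NiFi's scripting API, just the lookup logic. It assumes the mapping columns are ordered as num1 prefix, num2 prefix, status, and it uses a hypothetical "unknown" default for prefix pairs that the mapping file does not list. All function and variable names are made up for the sketch.

```python
import csv
import io

MAPPING = "078,078,fake\n072,078,original\n077,078,original\n"
DATA = (
    "first,second,third,num1,num2,num3\n"
    "12,312,433,0787388393,0783452323,01123124\n"
    "33,2432,214,077213424,07821323,0234234211\n"
)

def load_mapping(text):
    """Build {(num1_prefix, num2_prefix): status} from the headerless mapping CSV."""
    mapping = {}
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            num1_prefix, num2_prefix, status = (cell.strip() for cell in row)
            mapping[(num1_prefix, num2_prefix)] = status
    return mapping

def add_status(data_text, mapping, default="unknown"):
    """Append a status column based on the first 3 digits of num1 and num2."""
    reader = csv.DictReader(io.StringIO(data_text))
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames + ["status"], lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Values stay strings, so leading zeros are preserved.
        key = (row["num1"][:3], row["num2"][:3])
        row["status"] = mapping.get(key, default)  # default is an assumption
        writer.writerow(row)
    return out.getvalue()

result = add_status(DATA, load_mapping(MAPPING))
```

Keeping the numbers as strings matters here: parsing them as integers would drop the leading zero and shift the three-digit prefix.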

MySQL to CSV - separating multiple values

I have downloaded a MySQL table as CSV, which has over a thousand entries of the following type:
id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
Now, when I am trying to create a chart out of this data, it is taking "male" as one value, and "male,female" as a separate value.
So, for the above example, rather than counting 2 "male", and 3 "female", the chart is showing 3 separate categories ("male", "female", "male,female"), with one count each.
I want the output as follows, for chart to have the correct count:
id,gender,garment-color
1,male,white
2,male,black
2,female,black
3,female,red
3,female,pink
The only way I know is to copy the row in MS Excel and adjust the values manually, which is too tedious for 1000+ entries. Is there a better way?
From MySQL command line or whatever tool you are using to send queries to MySQL:
select * from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"'
Then download /tmp/out.txt from the server and it should be good to go, assuming your data is good. If it is not, you might need to massage it with some SQL functions in the select.
The CSV likely came from a poorly designed (unnormalized) database that stored several values in the same field. You could try using selects and updates, along with some built-in string functions, on such rows to spawn additional rows containing the extra values and update the original rows to remove those values. You would have to repeat this until all commas are removed (if some field holds more than one), and you would have to decide whether a row containing multiple fields with such comma-separated lists needs to be multiplied out (i.e. should 2 genders and 4 colors mean 8 rows total).
More likely, you'll want to create additional tables, X_garmentcolors and X_genders, where X is whatever the original table is supposed to be describing. These tables would have an X_id field referencing the original row and a [garmentcolor|gender] value field holding one of the values from the original row's lists. Ideally, they should reference [gender|garmentcolor] lookup tables instead of holding actual values, but you'd have to do the grunt work of picking out all the unique colors and genders from your data first. Once that is done, you can do something like:
INSERT INTO X_[garmentcolor|gender] (X_id, Y_id)
SELECT X.X_id, Y.Y_id
FROM originalTable AS X
INNER JOIN valueTable AS Y
ON X.Y_valuelist LIKE CONCAT('%,', Y.value) -- Value at end of list
OR X.Y_valuelist LIKE CONCAT('%,', Y.value, ',%') -- Value in middle of list
OR X.Y_valuelist LIKE CONCAT(Y.value, ',%') -- Value at start of list
OR X.Y_valuelist = Y.value -- Value is entire list
;
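If the goal is only a chart-ready CSV rather than a normalized schema, the rows could also be multiplied out after export. Below is a minimal Python sketch, assuming every comma inside a quoted cell separates list items and that multi-valued fields should be combined as a full cross product (so 2 genders and 4 colors would give 8 rows). The function name explode is made up for the sketch.

```python
import csv
import io
from itertools import product

SOURCE = '''id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
'''

def explode(text):
    """Emit one output row per combination of the values listed in each cell."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Assumption: a comma inside a cell always separates list items.
        choices = [[value.strip() for value in cell.split(",")] for cell in row]
        for combination in product(*choices):
            writer.writerow(combination)
    return out.getvalue()

exploded = explode(SOURCE)
```

For the three sample rows this yields the five rows asked for in the question, in the same order.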

Notepad++ How do I add a comma at a specific column position?

I have a pretty large ASCII file (1.7 million rows) that I need to insert commas into at specific column positions. I am doing this because I am trying to convert the file to CSV so I can import it into MySQL. Unless there is a better approach (no doubt there is), what I am trying to do is insert a comma at the specific column positions where I know fields end. This is not a job for column mode, as dragging through 1.7 million rows would be insane.
I have tried this solution - How do I add a character at a specific postion in a string?
but it did not work. Does anyone have a suggestion?
Thanks!
To insert after the 4th character on each line:
Find: ^(.{4})
Replace: \1,
(Tick "Regular expression" in the Find/Replace dialog.)
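The same find/replace generalizes to several field boundaries by adding one capture group per field. Here is a small Python sketch of that idea. The helper name insert_commas and the example widths are made up; the real widths depend on the file's layout.

```python
import re

def insert_commas(line, widths):
    """Insert a comma after each leading field whose width is listed in widths."""
    # For widths [4, 3] this builds the pattern ^(.{4})(.{3})
    pattern = "^" + "".join(f"(.{{{width}}})" for width in widths)
    # ... and the replacement \g<1>,\g<2>,
    replacement = "".join(f"\\g<{i}>," for i in range(1, len(widths) + 1))
    return re.sub(pattern, replacement, line)

one_field = insert_commas("abcd1234", [4])
two_fields = insert_commas("AAAABBBCC", [4, 3])
```

Lines shorter than the combined widths do not match the pattern and are returned unchanged, so it may be worth checking for those separately.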
Another way to do it is to import the text file into MySQL as a table with a single column, then split the string using SUBSTRING():
SELECT
SUBSTRING(col, 1, 8) AS Column1
, SUBSTRING(col, 9, 8) AS Column2
, SUBSTRING(col, 17, 16) AS Column3
FROM table
You can adapt this query into an INSERT INTO ... SELECT or a CREATE TABLE ... AS SELECT, depending on how you want to get the data into the final table.
I've had to do it this way before because it was a recurring process and needed to be automated.
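For a recurring job, the same slicing could also be done before the import. The Python sketch below mirrors the SUBSTRING() calls above, using 1-based (start, length) pairs. The FIELDS layout just copies the example offsets and is not the real file layout, and stripping the padding from each field is an assumption.

```python
import csv
import io

# Hypothetical layout copied from the SUBSTRING() example: (1-based start, length).
FIELDS = [(1, 8), (9, 8), (17, 16)]

def fixed_width_to_csv(lines, fields=FIELDS):
    """Slice each fixed-width line into fields and return the result as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\r\n")
        writer.writerow(
            # SUBSTRING(col, start, length) corresponds to line[start-1 : start-1+length]
            [line[start - 1:start - 1 + length].strip() for start, length in fields]
        )
    return out.getvalue()

converted = fixed_width_to_csv(["AAAAAAAABBBB    CCCCCCCCCCCCCCCC\n"])
```

Unlike plain comma insertion, csv.writer quotes any field that itself contains a comma, which keeps the output importable.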

Split a String every 5 characters and get distinct strings

So I have a column with strings that are multiples of 5 characters, such as "12345" or "abcde12345" or "asdfghjkli12345". What I'm trying to do is write a query to split each of these strings into 5 character chunks and return the distinct ones.
So with "12345" , "abcde12345" , "asdfghjkli12345"
I would get back "12345" "abcde" "asdfg" and "hjkli"
Is this possible?
MySQL does not include a function to split a delimited string. Although separated data would normally be split into separate fields within a relational database, splitting such data can be useful either during initial data load/validation or where such data is held in a text field.
The following formula can be used to extract the Nth item in a delimited list, in this case the 3rd item "ccccc" in the example comma separated list.
select replace(substring(substring_index('aaa,bbbb,ccccc', ',', 3), length(substring_index('aaa,bbbb,ccccc', ',', 3 - 1)) + 1), ',', '') ITEM3
The above formula does not need the first item to be handled as a special case and returns empty strings correctly when the item count is less than the position requested.
You can also create your own split function and use it; see "Split value from one field to two".
Source: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html
This sounds like it would be outside the limitations of SQL and would certainly require a programming language. It would be VERY easy in any 3GL.
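As an illustration of how short this is in a general-purpose language, here is a Python sketch. The function name distinct_chunks is made up, and the column values are assumed to have been fetched into a list already.

```python
def distinct_chunks(values, size=5):
    """Split every string into size-character chunks and keep each chunk once."""
    seen = {}  # dicts preserve insertion order, so chunks stay in first-seen order
    for value in values:
        for start in range(0, len(value), size):
            seen.setdefault(value[start:start + size], None)
    return list(seen)

chunks = distinct_chunks(["12345", "abcde12345", "asdfghjkli12345"])
```

With the three sample strings from the question, chunks holds "12345", "abcde", "asdfg" and "hjkli".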

Is a partial delete of column values possible in an SQL database?

I need to batch edit a column of values in a database. Right now I have a "location" field formatted for Row Bay Level as follows: R001B002L004
Since there are fewer than ten Rows, Bays or Levels, the R00, B00 and L00 prefixes are completely redundant, and the field would be easier to manage if it were formatted as a three-digit number, e.g. 124 for the previous example.
Is there a way I can batch edit these 800 or so values to convert the R00*B00*L00* format to the three-digit number format?
Here is one way:
update t
set location = replace(replace(replace(location, 'R00', ''), 'B00', ''), 'L00', '');
If you want to turn this into a number, then you have a bit of a challenge. The current type of location is some sort of string and changing the type is probably a lot of unnecessary work. I would just go with a digit-only string.
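Before running the update, it may be worth previewing the conversion on a copy of the values. The Python sketch below is slightly stricter than the nested replace(): it only rewrites values that match the whole R00xB00xL00x shape and leaves everything else untouched. The name shorten_location is made up for the sketch.

```python
import re

# Matches exactly R00<digit>B00<digit>L00<digit>, e.g. R001B002L004.
LOCATION = re.compile(r"^R00(\d)B00(\d)L00(\d)$")

def shorten_location(location):
    """Return the three-digit form, or the input unchanged if it does not match."""
    match = LOCATION.match(location)
    if match is None:
        return location  # unexpected format: leave it for manual review
    return "".join(match.groups())

short = shorten_location("R001B002L004")
```

The result is kept as a digit-only string, in line with the suggestion above not to change the column type.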