Pentaho Kettle split CSV into multiple records - csv

I'm new to Kettle, but getting on well with it so far. However, I can't figure out how to do this.
I have a CSV which looks something like this:
a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3
The first column starts with a key (a, b, c), and then the rest of the columns follow. What I want to do is read in the CSV (got that covered) and then split it based on the key, so I have 3 chunks/groups of data, and then convert each of those chunks of data into a separate JSON file, which I think I can get.
What I can't get my head around is grouping the data and then performing a separate action (convert to JSON) on each of those separate groups. It's not creating the JSON that I have an issue with.
The data is from a sensor network of many environmental sensors, so there are many keys, hundreds, and new ones get added. I've used MapReduce to process this data before, as the concept of partitioning is what I'm trying to replicate here, without using the Hadoop elements of Kettle because the deployment is different. Once I've partitioned the data it needs to be loaded into different places as separate records. The key is a unique ID (serial number) of a sensor.
Any ideas please?
Thanks

I guess you could create a JavaScript step to output the fields of a row as a JSON-like string added to the row:
{"id":"a","col1":"1","col2":"2","col3":"3"}
Next you could use the Group By step, set the group field to the 'id' field, and have as aggregate the JavaScript value with the type 'Concatenate strings separated by ,':
{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"}, {"id":"a","col1":"7","col2":"8","col3":"9"}
Wrap it in square brackets and you have valid JSON. Next you could assemble a file name using a JavaScript step:
var file_name="C:\\dir\\"+ id + ".txt";
Use the Text file output step and set the file name field to 'file_name'. Clear the separator/enclosure options so no extra formatting is added, and you are done.

If I have understood your question correctly, you can use the "Group By" step to group the rows on the key column (i.e. the first column in your data set) and then store these in memory.
Once this is done, use parameter looping to "get the variables" and dynamically generate multiple JSON outputs. Check the image below:
In the JSON output step, use variables like header1 to generate multiple files. Highlighted below are the changes I made in the JSON Output step.
In case you find it confusing, I have uploaded a sample of the code here.
Hope it helps :)

Related

Can I get a Json Key into a Hive Column?

I am trying to read data from JSON files in S3 into my Hive table. If the column names and JSON keys are the same, it all loads properly. But now I want to read the data in such a way that the nested JSON values go into specific columns. For example, for the JSON
{"data1": {"key1": "value1"}}
I want the data1.key1 value to go into a column named data1_key1, which I understand is achievable with SERDEPROPERTIES. My next problem is that there can be multiple JSON keys, and I want the key names to be column values in my Hive table.
Also, depending upon those keys, the keys that go into other columns will also change.
For eg my json files will be either:
{"data1" : {"key1":"value1"}}
or
{"data2" : { "key2" : "value2"}}
This needs to produce a table like the one below:
col1 col2
data1 value1
data2 value2
Is this possible? If so how should it be done?
You can do it using regular expressions. Define the json column as a string in the table DDL and use regexp to parse it. Tested on your data example:
Demo:
with your_table as ( --Replace this CTE with your table
select stack(2,
'{"data1": {"key1": "value1"}}',
'{"data2" : { "key2" : "value2"}}'
) as json
)
select regexp_extract(json,'^\\{ *\\"(\\w+)\\" *:', 1) as col1, --capturing group 1 in a parenthesis START{spaces"(word)"spaces:
regexp_extract(json,': *\\"(.+)\\" *\\} *\\}$', 1) as col2 --:spaces"(value characters)"spaces}spaces}END
from your_table;
Result:
col1,col2
data1,value1
data2,value2
Please read the comments in the code. You can adjust this solution to fit your JSON. This approach allows you to extract keys and values from JSON without knowing their names; json_tuple and get_json_object are not applicable in this case.
Alternatively, you can use RegexSerDe to do the same in the table DDL, as in this answer: https://stackoverflow.com/a/47944328/2700344. For the RegexSerDe solution you need to write a single, more complex regexp containing one capturing group (in parentheses) for each column.
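A hedged sketch of that RegexSerDe variant, with made-up table and column names; the regex mirrors the one above (capturing group 1 fills col1, group 2 fills col2) and would still need testing against your real files:
CREATE EXTERNAL TABLE json_raw (
  col1 STRING,
  col2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '^\\{ *"(\\w+)" *: *\\{ *"\\w+" *: *"(.+)" *\\} *\\}$'
)
STORED AS TEXTFILE
LOCATION '/path/to/json/files';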

Abinitio : Writing data to a target table

I am new to Ab Initio and I need help with the following.
Table 1 has columns :
Col1
Col2
Col3
Table 2 has columns :
col4
col5
I am using a JOIN component and also a REFORMAT component, and got the output as col2, col3, col4, col5. I am writing this to a target table which has
id, col2, col3, col4, col5, created_by, created_date_time, last_modified_date.
I have data for col2, col3, col4, col5 from the output of the JOIN component, but not for id, created_by, created_date_time, last_modified_date.
How do I add these using Ab Initio? Any help on this is greatly appreciated, and I apologize if this kind of basic question was already discussed.
Regards.
You could connect a REFORMAT component to the output flow of the JOIN component. The transform function in the REFORMAT component could pass the col2, col3, col4, col5 values through using the wildcard rule out.* :: in.*. The rest of the columns in the output table should be present in the DML of the REFORMAT's output port as well, so you can then assign data to those columns in the REFORMAT's transform function, e.g. out.created_by :: "something".
After the JOIN component connect a REFORMAT component, and in the ports section of the REFORMAT component change the DML of the output port by adding all the relevant columns you need in the output. Then change the transform function of the REFORMAT component as follows:
1. For all the incoming values from the JOIN, use out.* :: in.*
2. For all the additional columns that you added in the output port DML, assign a value using out.column_name :: "value you need to pass"
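A rough sketch of what that REFORMAT transform could look like; the id rule and the default values are only illustrative assumptions, not something taken from the question:
out :: reformat(in) =
begin
  out.* :: in.*;                      /* col2..col5 pass through unchanged */
  out.id :: next_in_sequence();       /* illustrative surrogate key */
  out.created_by :: "etl_user";       /* illustrative literal */
  out.created_date_time :: now();
  out.last_modified_date :: now();
end;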
In the JOIN component, you could simply write the transformation for all the required columns. Include id, created_by, created_date_time, last_modified_date in the output port DML (embedded) of the JOIN. For col2, col3, col4, col5 you can map from the respective input columns, and for id, created_by, created_date_time, last_modified_date add the required transformation. That way you avoid one extra REFORMAT component.

empty returned rows in Hive query

I have created an external Hive table from a tweets JSON file which was exported from MongoDB. Whenever I select more than one column from the Hive table, the retrieved results are not well formatted. Some columns are empty or NULL (even if I conditioned on specific values) and some data appears in the wrong columns.
I think this is happening because the tweet text has commas in it. When I query the Hive table without selecting the text of the tweets, the results make sense. But I don't know how to fix that.
Does anyone have any idea how to fix this?
Best,
Why don't you try formatting the output? Something like this:
SELECT
CONCAT(COALESCE(COL1,''),
'|', COALESCE(COL2,''),
'|', COALESCE(COL3,''),
'|', COALESCE(COL4,''),
'|', COALESCE(COL5,''),
'|', COALESCE(COL6,''),
'|', COALESCE(COL7,'')) as tweetsout
FROM (
SELECT COL1, COL2, COL3, COL4, COL5, COL6, COL7
FROM TWEETS
) TOUT
This would give you the output delimited by a pipe instead of the standard tab-delimited output.
It is difficult to tell without knowing the exact create table command you used...
Usually the table is parsed incorrectly if the input data contains the table delimiters. For example, some tweets in your input data may contain \n, which might be the row separator in the Hive table you created.
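For illustration only (made-up table and column names), a DDL of the kind that runs into this problem looks like:
CREATE EXTERNAL TABLE tweets_raw (
  id STRING,
  user_name STRING,
  tweet_text STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','    -- a comma inside tweet_text shifts all later columns
LINES TERMINATED BY '\n'    -- a newline inside tweet_text starts a bogus new row
STORED AS TEXTFILE
LOCATION '/path/to/exported/tweets';
With a layout like that, any delimiter character occurring inside the tweet text shifts or blanks out the remaining columns, which matches the symptoms you describe.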

MySQL to CSV - separating multiple values

I have downloaded a MySQL table as CSV, which has over a thousand entries of the following type:
id,gender,garment-color
1,male,white
2,"male,female",black
3,female,"red,pink"
Now, when I am trying to create a chart out of this data, it is taking "male" as one value, and "male,female" as a separate value.
So, for the above example, rather than counting 2 "male", and 3 "female", the chart is showing 3 separate categories ("male", "female", "male,female"), with one count each.
I want the output as follows, for chart to have the correct count:
id,gender,garment-color
1,male,white
2,male,black
2,female,black
3,female,red
3,female,pink
The only way I know is to copy the row in MS Excel and adjust the values manually, which is too tedious for 1000+ entries. Is there a better way?
From the MySQL command line, or whatever tool you are using to send queries to MySQL:
select * from the_table
into outfile '/tmp/out.txt' fields terminated by ',' enclosed by '"'
Then download /tmp/out.txt from the server and it should be good to go, assuming your data is good. If it is not, you might need to massage it with some SQL functions used in the select.
The CSV likely came from a poorly designed/normalized database that had both of those values in the same row. You could try using selects and updates, along with some built-in string functions, on such rows to spawn additional rows containing the additional values and then update the original rows to remove those values; but you will have to repeat this until all commas are removed (if some field has more than one), and you will have to decide whether a row containing comma-separated lists in multiple fields needs to be multiplied out (i.e. should 2 genders and 4 colors mean 8 rows total).
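A rough sketch of that spawn-and-strip pass for the gender column (the garment-color column would get the same treatment); it assumes id is not a unique key, and the pair of statements is repeated until no commas remain:
INSERT INTO the_table (id, gender, `garment-color`)
SELECT id,
       SUBSTRING_INDEX(gender, ',', 1),   -- first value in the comma-separated list
       `garment-color`
FROM the_table
WHERE gender LIKE '%,%';
UPDATE the_table
SET gender = SUBSTRING(gender, LOCATE(',', gender) + 1)   -- drop the value just copied out
WHERE gender LIKE '%,%';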
More likely, you'll probably want to create additional tables for X_garmentcolors and X_genders, where X is whatever the original table is supposed to be describing. These tables would have an X_id field referencing the original row and a [garmentcolor|gender] value field holding one of the values from the original row's lists. Ideally, they should actually reference [gender|garmentcolor] lookup tables instead of holding actual values, but you'd have to do the grunt work of picking out all the unique colors and genders from your data first. Once that is done, you can do something like:
INSERT INTO X_[garmentcolor|gender] (X_id, Y_id)
SELECT X.X_id, Y.Y_id
FROM originalTable AS X
INNER JOIN valueTable AS Y
ON X.Y_valuelist LIKE CONCAT('%,', Y.value) -- Value at end of list
OR X.Y_valuelist LIKE CONCAT('%,', Y.value, ',%') -- Value in middle of list
OR X.Y_valuelist LIKE CONCAT(Y.value, ',%') -- Value at start of list
OR X.Y_valuelist = Y.value -- Value is entire list
;

Exporting table with SequelPro into csv file

I'm trying to export a MySQL table into a CSV file using SequelPro; however, the resulting file is not properly formatted. For instance, the first column should be the id: in some of the records there is a number, but in others there is text instead. This happens in pretty much every column.
One explanation might be that the fields are being exported vertically rather than horizontally (you have the id 1 on the first row and the id 10 on the 10th row).
Maybe I'm getting the options (fields) wrong, in which case I'd appreciate help.
What I need is for the file to have the titles as the first row and the actual data in the subsequent rows with each data property as a column.
Appreciate the help
PS: I've tried this but I'm getting the data vertically as well.
For convenience in executing and sharing across the team, consider adding the commas as part of the SELECT statement itself, like this:
SELECT
col1 as 'Column 1',',',
col2 as 'Column 2',',',
col3 as 'Column 3',',',
col4 as 'Column 4',',',
col5 as 'Column 5',','
FROM
table1, table2
WHERE
clause1 and clause2
The data and headers then automatically get their necessary commas.