Ab Initio: Writing data to a target table

I am new to Ab Initio and need help with the following.
Table 1 has columns:
col1
col2
col3
Table 2 has columns:
col4
col5
I am using a JOIN component and a REFORMAT component, and the output I get is col2, col3, col4, col5. I am writing this to a target table which has
id, col2, col3, col4, col5, created_by, created_date_time, last_modified_date.
I have data for col2, col3, col4, col5 from the output of the JOIN component, but not for id, created_by, created_date_time and last_modified_date.
How do I add these using Ab Initio? Any help on this is greatly appreciated, and I apologize if this kind of basic question was already discussed.
Regards.

You could connect a REFORMAT component to the output flow of the JOIN component. The transform function in the REFORMAT can pass the col2, col3, col4, col5 values through with the wildcard rule out.* :: in.*. The remaining columns of the target table should be present in the DML of the REFORMAT's output port as well, so you can then assign data to these columns in the transform function, e.g. out.created_by :: "something".

After the JOIN component, connect a REFORMAT component. In the Ports section of the REFORMAT, change the DML of the output port by adding all the columns you need in the output, then change the transform function as follows (a sketch is shown after this list):
1. For all the incoming values from the JOIN, use out.* :: in.*
2. For each additional column you added to the output port DML, assign a value using out.column_name :: "value you need to pass"
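A minimal sketch of such a REFORMAT transform, assuming the output DML already lists all eight target columns and that built-ins like now() and next_in_sequence() are acceptable for the audit columns (the "etl_user" literal is just a placeholder):
out :: reformat(in) =
begin
  out.* :: in.*;                      /* passes col2, col3, col4, col5 through */
  out.id :: next_in_sequence();       /* surrogate key; replace with your own key logic if needed */
  out.created_by :: "etl_user";       /* placeholder value */
  out.created_date_time :: now();
  out.last_modified_date :: now();
end;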

Alternatively, in the JOIN component you could simply write the transformation for all the required columns. Include id, created_by, created_date_time and last_modified_date in the (embedded) output port DML of the JOIN. Map col2, col3, col4, col5 from the respective input columns, and add the required transformation for id, created_by, created_date_time and last_modified_date. That way you avoid one extra REFORMAT component.
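A rough sketch of that JOIN transform, where in0 and in1 stand for the two input ports and the audit-column assignments are placeholders to adapt:
out :: join(in0, in1) =
begin
  out.col2 :: in0.col2;
  out.col3 :: in0.col3;
  out.col4 :: in1.col4;
  out.col5 :: in1.col5;
  out.id :: next_in_sequence();        /* placeholder surrogate key */
  out.created_by :: "etl_user";        /* placeholder value */
  out.created_date_time :: now();
  out.last_modified_date :: now();
end;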

Related

Querying custom attributes in Athena

I apologize in advance if this is very simple and I am just missing it.
Would any of you know how to put custom attributes as column headers? I currently have a simple opt-in survey on Amazon Connect, and I would like to have each of the 4 items as column headers with the score in the table results. I pull the data into Excel using an ODBC connection, so ideally I would like to just add this to the end of my current table if I can figure out how to do it.
This is how it currently looks in the output
{"effortscore":"5","promoterscore":"5","satisfactionscore":"5","survey_opt_in":"True"}
If you have any links or something I can follow to improve my knowledge, that would be great.
Thanks in advance
There are multiple options to query data in JSON format in Athena, and based on your use case (data source, query frequency, query destination, etc.) you can choose what makes more sense.
String Column + JSON functions
This is usually the most straightforward option and a good starting point. You define the survey_output as a string column, and when you need to extract the specific attributes from the JSON string, you can apply the JSON functions in Trino/Athena: https://trino.io/docs/current/functions/json.html. For example:
SELECT
  id,
  json_query(
    survey_output,
    'lax $.satisfactionscore'
  ) AS satisfactionscore
FROM customers
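Note that json_query returns the matched fragment as JSON text (string values keep their quotes); if you want the bare value, Athena also provides json_extract_scalar, for example:
SELECT
  id,
  json_extract_scalar(survey_output, '$.satisfactionscore') AS satisfactionscore
FROM customers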
String Column + JSON functions + View
Another way to simplify access to the data, without users having to call the json_query functions themselves, is to define a VIEW on that table using the json_query syntax in the VIEW creation. A DBA defines the view once, and when users query the data they see the columns they care about. For example:
CREATE VIEW survey_results AS
SELECT
  id,
  json_query(
    survey_output,
    'lax $.satisfactionscore'
  ) AS satisfactionscore
FROM customers;
With such a view you have more flexibility in what data is easily exposed to the users.
Create a Table with STRUCT
Another option is to create the external table from the data source (files in S3, for example) with the STRUCT definition.
CREATE EXTERNAL TABLE survey (
  id string,
  survey_results struct<
    effortscore:string,
    promoterscore:string,
    satisfactionscore:string,
    survey_opt_in:string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<YOUR BUCKET HERE>/<FILES>'
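You can then address the struct fields with dot notation, for example:
SELECT
  id,
  survey_results.satisfactionscore,
  survey_results.survey_opt_in
FROM survey;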

How to query an array field (AWS Glue)?

I have a table in AWS Glue, and the crawler has defined one field as array.
The content is in S3 files that have a json format.
The table is TableA, and the field is members.
There are a lot of other fields such as strings, booleans, doubles, and even structs.
I am able to query them all using a simple query such as:
SELECT
content.my_boolean,
content.my_string,
content.my_struct.value
FROM schema.tableA;
The issue is when I add content.members into the query.
The error I get is: [Amazon](500310) Invalid operation: schema "content" does not exist.
content exists, because I am able to select other fields from the main key in the JSON (content).
It is probably something related to how to query an array field in Spectrum.
Any idea?
You have to alias the table to extract the fields from the external schema:
SELECT
a.content.my_boolean,
a.content.my_string,
a.content.my_struct.value
FROM schema.tableA a;
I had the same issue with my data; I really don't know why it needs this alias, but it works. If you need to access elements of an array you have to explode it like:
SELECT member.<your-field>
FROM schema.tableA a, a.content.members AS member;
You need to create a Glue Classifier. Select JSON as the classifier type and for the JSON path enter the following:
$[*]
Then run your crawler. It will infer your schema and populate your table with the correct fields instead of just one big array. Not sure if this is what you were looking for, but I figured I'd drop this here in case others have the same problem I had.

Can I get a Json Key into a Hive Column?

I am trying to read data from JSON files in S3 into my Hive table. If the column names and JSON keys are the same, it all loads properly. But now I want to read the data in such a way that the nested JSON values go into specific columns. For example, for the JSON
{"data1": {"key1": "value1"}}
I want the data1.key1 value to go into a column named data1_key1, which I understand is achievable with SERDEPROPERTIES. My next problem is that there can be multiple JSON keys, and I want the key names to be column values in my Hive table.
Also, depending upon those keys, the keys that go into other columns will also change.
For eg my json files will be either:
{"data1" : {"key1":"value1"}}
or
{"data2" : { "key2" : "value2"}}
This needs to create a table as below:
col1 col2
data1 value1
data2 value2
Is this possible? If so how should it be done?
You can do it using regular expressions. Define the JSON column as a string in the table DDL and use regexp to parse it. Tested on your data example:
Demo:
with your_table as ( --Replace this CTE with your table
select stack(2,
'{"data1": {"key1": "value1"}}',
'{"data2" : { "key2" : "value2"}}'
) as json
)
select regexp_extract(json,'^\\{ *\\"(\\w+)\\" *:', 1) as col1, --capturing group 1 in a parenthesis START{spaces"(word)"spaces:
regexp_extract(json,': *\\"(.+)\\" *\\} *\\}$', 1) as col2 --:spaces"(value characters)"spaces}spaces}END
from your_table;
Result:
col1,col2
data1,value1
data2,value2
Please read the comments in the code. You can adjust this solution to fit your JSON. This approach allows you to extract keys and values from JSON without knowing their names; json_tuple and get_json_object are not applicable in this case.
Alternatively, you can use RegexSerDe to do the same in the table DDL, as in this answer: https://stackoverflow.com/a/47944328/2700344. For the RegexSerDe solution you need to write a single, more complex regexp containing one capturing group (in parentheses) for each column.
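A rough sketch of such a DDL for the two-column case above, assuming one single-line JSON record per line (table name and location are placeholders):
CREATE EXTERNAL TABLE my_json_table (
  col1 STRING,
  col2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- group 1 captures the outer key, group 2 the nested value
  'input.regex' = '\\{ *"(\\w+)" *: *\\{ *"\\w+" *: *"([^"]+)" *\\} *\\}'
)
LOCATION 's3://<your-bucket>/<path>/';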

Is it possible to include column names in the csv with a copy into statement in Snowflake?

For example:
COPY INTO @my_stage/my_test.csv
FROM (select * from my_table)
FILE_FORMAT = (TYPE = CSV)
OVERWRITE=TRUE SINGLE=TRUE
will result in a csv but does not include column headers. If it is not possible with a copy into statement, is there perhaps any non-obvious technique that might accomplish this?
Thanks in advance.
Snowflake has added this feature. You can simply add an option HEADER=TRUE:
COPY INTO @my_stage/my_test.csv
FROM (select * from my_table)
FILE_FORMAT = (TYPE = CSV)
OVERWRITE=TRUE SINGLE=TRUE HEADER=TRUE
We've seen this request before, and it's on our roadmap. If it's high priority for you, please contact Snowflake support.
If you're looking for a workaround, it's hard to come up with a truly generic one.
One option is to add a single row with explicit column names, but you'd need to know them in advance and it might not be efficient if not all your fields are strings.
Another is to convert all records using OBJECT_CONSTRUCT(*) and export as JSON; then you will have column names, but of course it will only be useful if you can ingest JSON.
But I hope Snowflake will add this functionality in the not-so-far future.
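For the first workaround, a sketch could look like the following (the column names and the TO_VARCHAR casts are illustrative, and the ORDER BY on a helper column is meant to keep the header row first, although row order on unload is not strictly guaranteed):
COPY INTO @my_stage/my_test.csv
FROM (
  SELECT col1, col2 FROM (
    SELECT 0 AS ord, 'COL1' AS col1, 'COL2' AS col2
    UNION ALL
    SELECT 1, TO_VARCHAR(col1), TO_VARCHAR(col2) FROM my_table
  ) t
  ORDER BY ord
)
FILE_FORMAT = (TYPE = CSV)
OVERWRITE=TRUE SINGLE=TRUE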
To supplement Jiaxing's answer, the Snowflake HEADER feature also allows you to explicitly define your column names by naming the columns via AS:
COPY INTO @my_stage/my_test.csv
FROM (
  SELECT
    column1 AS "Column 1",
    column2 AS "Column 2"
  FROM my_table
)
FILE_FORMAT = (TYPE = CSV)

Pentaho Kettle split CSV into multiple records

I'm new to Kettle, but getting on well with it so far. However I can't figure out how to do this.
I have a csv which looks something like this
a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3
The first column starts with a key (a, b, c), and then the rest of the columns follow. What I want to do is read in the CSV (got that covered), then split it based on the key so I have 3 chunks/groups of data, and then convert each of those chunks into a separate JSON file, which I think I can manage.
What I can't get my head around is grouping the data and then performing a separate action (convert to JSON) on each of those separate groups. It's not creating the JSON that I have an issue with.
The data is from a sensor network of many environmental sensors, so there are many keys, hundreds, and new ones get added. I've used map reduce to process this data before, as the concept of partitioning is what I'm trying to replicate here, without using the Hadoop elements of Kettle since the deployment is different. Once I've partitioned the data it needs to be loaded into different places as separate records. The key is a unique ID (serial number) of a sensor.
Any ideas please?
Thanks
I would create a JavaScript step that outputs the fields of a row as a JSON-like string added to the row:
{"id":"a","col1":"1","col2":"2","col3":"3"}
Next you could use the Group By step: set the group field to the 'id' field and aggregate the JavaScript value with the type 'Concatenate strings separated by ,':
{"id":"a","col1":"1","col2":"2","col3":"3"},{"id":"a","col1":"4","col2":"5","col3":"6"}, {"id":"a","col1":"7","col2":"8","col3":"9"}
Wrap brackets around it and you have valid JSON. Next you could assemble a file name using a JavaScript step:
var file_name="C:\\dir\\"+ id + ".txt";
Use the Text file output step and set the file name field to 'file_name'. Remove the separator/enclosure options so there is no extra formatting, and you are done.
If I have understood your question correctly, you can use the "Group By" step to group on the key column (i.e. the first column in your data set) and then store these rows in memory.
Once this is done, use parameter looping to "get the variables" and dynamically generate multiple JSON outputs.
In the JSON Output step, use variables like header1 to generate multiple files.
In case you find it confusing, I have uploaded sample code here.
Hope it helps :)