I'm trying to get a handle on how Kettle 4.4 handles data transformations by porting something I currently do in Python to a Kettle job.
I have a relational database with four tables that I need to import into my data pipeline. Here's a simplified version of the model...
Widgets
+-----------+-------------+----------------+
| WIDGET_ID | Name        | Notes          |
+-----------+-------------+----------------+
| 1         | Gizmo       | Red paint job  |
| 2         | Large Gizmo | Blue paint job |
+-----------+-------------+----------------+
Customers
+-----------+------------+----------------------------------+
| WIDGET_ID | Name       | Mailing_Address                  |
+-----------+------------+----------------------------------+
| 1         | Acme, Inc. | 123 Fake Street, Springfield, IL |
| 2         | Fake Corp. | 555 Main Street, Small Town, IN  |
| 2         | Acme, Inc. | 123 Fake Street, Springfield, IL |
+-----------+------------+----------------------------------+
Inventory
+-----------+--------+------------+
| WIDGET_ID | Amount | Date       |
+-----------+--------+------------+
| 2         | 11000  | 2012-01-15 |
| 1         | 13000  | 2012-02-05 |
| 1         | 900    | 2013-01-01 |
+-----------+--------+------------+
I'd like to be able to take the above and produce JSON output like this:
{
  "id": 1,
  "Name": "Gizmo",
  "Notes": "Red paint job",
  "Customers": [
    {
      "Name": "Acme, Inc.",
      "Address": "123 Fake Street..."
    }
  ],
  "Inventory": [
    {
      "Amount": 13000,
      "Date": "2012-02-05"
    },
    {
      "Amount": 900,
      "Date": "2013-01-01"
    }
  ]
}
My attempts to use Kettle's joins, JS transforms and JSON output have not been very successful, and I find the documentation to be quite lacking. Can anyone help me out, or point me in the right direction?
Thanks!
You can use 3 (well, 6 in total) Kettle steps for this transformation:
1) Add 3 Table Input steps, one for each table.
2) Next, add a Multiway Merge Join step and connect the arrows (hops) from the 3 Table Input steps to it.
Choose widget_id as the key field and inner join as the join type.
3) Add 1 JSON Output step on the output of the Multiway Merge Join step.
To produce the final JSON format you have to use JSONPath notation:
http://goessner.net/articles/JsonPath/
Hope it helps.
(If you are new to Kettle, I recommend going through the samples folder included with Kettle Spoon.)
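For comparison, the nesting the question asks for can be sketched in plain Python, the language the asker is porting from. This is not Kettle configuration; it only illustrates the target shape: group the child rows by WIDGET_ID, then attach each group to its widget. The rows are copied from the question, and the helper name `nest_widgets` is hypothetical.

```python
import json
from collections import defaultdict

# Sample rows copied from the question's tables.
widgets = [
    {"WIDGET_ID": 1, "Name": "Gizmo", "Notes": "Red paint job"},
    {"WIDGET_ID": 2, "Name": "Large Gizmo", "Notes": "Blue paint job"},
]
customers = [
    {"WIDGET_ID": 1, "Name": "Acme, Inc.", "Mailing_Address": "123 Fake Street, Springfield, IL"},
    {"WIDGET_ID": 2, "Name": "Fake Corp.", "Mailing_Address": "555 Main Street, Small Town, IN"},
    {"WIDGET_ID": 2, "Name": "Acme, Inc.", "Mailing_Address": "123 Fake Street, Springfield, IL"},
]
inventory = [
    {"WIDGET_ID": 2, "Amount": 11000, "Date": "2012-01-15"},
    {"WIDGET_ID": 1, "Amount": 13000, "Date": "2012-02-05"},
    {"WIDGET_ID": 1, "Amount": 900, "Date": "2013-01-01"},
]


def nest_widgets(widgets, customers, inventory):
    """Return one nested document per widget (hypothetical helper)."""
    # Group each child table by its foreign key first.
    customers_by_widget = defaultdict(list)
    for row in customers:
        customers_by_widget[row["WIDGET_ID"]].append(
            {"Name": row["Name"], "Address": row["Mailing_Address"]}
        )
    inventory_by_widget = defaultdict(list)
    for row in inventory:
        inventory_by_widget[row["WIDGET_ID"]].append(
            {"Amount": row["Amount"], "Date": row["Date"]}
        )
    # Then attach the groups to their parent widget.
    return [
        {
            "id": w["WIDGET_ID"],
            "Name": w["Name"],
            "Notes": w["Notes"],
            "Customers": customers_by_widget.get(w["WIDGET_ID"], []),
            "Inventory": inventory_by_widget.get(w["WIDGET_ID"], []),
        }
        for w in widgets
    ]


documents = nest_widgets(widgets, customers, inventory)
print(json.dumps(documents[0], indent=2))
```

Grouping each child table separately, rather than joining all three tables in one pass, keeps a customer from being repeated once per inventory row.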
So for example when the data is in string format I can do something like this:
->orderBy(DB::raw('FIELD(animal_type, "fish", "amphibian", "reptile", "bird", "mammal", "") ASC, animal_type'))
But if the data for 'animal_type' is stored in JSON format like this:
["vertebrate", "amphibian"]
Let's say I have a table that looks like this:
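+----+-----------------------+-------------------------------------------------------------+
| id | animal                | animal_type                                                 |
+----+-----------------------+-------------------------------------------------------------+
| 1  | Leaf green tree frog  | ["vertebrate", "amphibian", "ectothermic", "pelodryadidae"] |
| 2  | Seymouria             | ["vertebrate", "amphibian"]                                 |
| 3  | Dermophis mexicanus   | ["amphibian"]                                               |
| 4  | Old World sparrow     | ["vertebrate", "bird"]                                      |
| 5  | Parrot                | ["vertebrate", "bird", "psittacines"]                       |
| 6  | African bush elephant | ["vertebrate", "mammal"]                                    |
+----+-----------------------+-------------------------------------------------------------+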
Ideally I'd like to sort by a single characteristic of the animal. Say order by "bird", "amphibian". Then the result would look like:
Old World sparrow
Parrot
Leaf green tree frog
Seymouria
Dermophis mexicanus
How would I go about creating a query that can do an orderBy in this kind of scenario?
MySQL 8.0 introduces a JSON function that can help: https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html#operator_member-of
Here's how it works:
mysql> select * from mytable order by
'bird' member of(animal_type) desc,
'amphibian' member of(animal_type) desc;
+----+-----------------------+--------------------------------------------+
| id | animal                | animal_type                                |
+----+-----------------------+--------------------------------------------+
| 4  | Old World sparrow     | ["vertebrate", "bird"]                     |
| 5  | Parrot                | ["vertebrate", "bird", "psittacines"]      |
| 1  | Leaf green tree frog  | ["vertebrate", "amphibian", "ectothermic"] |
| 2  | Seymouria             | ["vertebrate", "amphibian"]                |
| 3  | Dermophis mexicanus   | ["amphibian"]                              |
| 6  | African bush elephant | ["vertebrate", "mammal"]                   |
+----+-----------------------+--------------------------------------------+
MySQL 8.0 also supports creating a multi-valued index on JSON data to help searches for values, but this only helps optimize row filtering (the WHERE clause). It does not yet optimize sorting (the ORDER BY clause).
If you use a version of MySQL 5.x that doesn't support MEMBER OF(), you're out of luck. You should be making plans to upgrade anyway, because 5.x is going to be end of life in October 2023.
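If upgrading is not immediately possible, one fallback is to sort in application code after fetching the rows. The sketch below is only an illustration in Python, not MySQL or Laravel code. The rows are hard-coded from the question, and the helper name `sort_by_types` is hypothetical. It mirrors the ORDER BY above: rows whose array contains the first type come first, then rows containing the second type, then everything else.

```python
import json

# Rows as (id, animal, animal_type JSON text), copied from the question.
rows = [
    (1, "Leaf green tree frog", '["vertebrate", "amphibian", "ectothermic", "pelodryadidae"]'),
    (2, "Seymouria", '["vertebrate", "amphibian"]'),
    (3, "Dermophis mexicanus", '["amphibian"]'),
    (4, "Old World sparrow", '["vertebrate", "bird"]'),
    (5, "Parrot", '["vertebrate", "bird", "psittacines"]'),
    (6, "African bush elephant", '["vertebrate", "mammal"]'),
]


def sort_by_types(rows, priorities):
    """Order rows by membership of each priority type, in turn (hypothetical helper)."""

    def key(row):
        types = json.loads(row[2])
        # False sorts before True, so "not in" puts matching rows first.
        return tuple(p not in types for p in priorities)

    # sorted() is stable, so ties keep their original (id) order.
    return sorted(rows, key=key)


ordered = sort_by_types(rows, ["bird", "amphibian"])
for _id, animal, _types in ordered:
    print(animal)
```

This only makes sense when the result set is small enough to fetch in full; paginated queries would still need the ordering done in the database.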
I've been trying to set up a MySQL to Elasticsearch data pipeline for real-time data replication.
The MySQL database has around 10 different tables that are highly normalized. In Elasticsearch, however, I need all of the data from these tables in a single index, similar to the output of a big compound JOIN query. I've tried a lot to figure this out, please help 🙂
(Changing the DB schema isn't feasible as there are a lot of other dependent services. )
For example :
Input from MySQL:
Table: main_profile
+--------+------+
| name   | city |
+--------+------+
| Edward | 1    |
| Jake   | 9    |
+--------+------+
Table: city_master
+---------+----------+
| city_id | name     |
+---------+----------+
| 1       | New York |
| 9       | Tampa    |
+---------+----------+
Document stored in Elasticsearch:
{
  "0": {
    "name": "Edward",
    "city": "New York"
  },
  "1": {
    "name": "Jake",
    "city": "Tampa"
  }
}
You can use Kafka Streams to aggregate records from two different topics into a unified message. See this example, which uses a Debezium source: https://github.com/debezium/debezium-examples/tree/master/kstreams
The target is MongoDB in the example but the principle is the same.
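The principle can be sketched without any Kafka or Debezium API: keep the latest city_master rows in a lookup keyed by city_id, and enrich each main_profile record from it before indexing. The Python below is only an in-memory illustration with the question's sample rows; the dict stands in for whatever state a streaming join would maintain, and the function names `upsert_city` and `build_document` are hypothetical.

```python
# Lookup state keyed by city_id; stands in for the join's state store.
city_by_id = {}


def upsert_city(row):
    """Apply an insert or update from city_master to the lookup."""
    city_by_id[row["city_id"]] = row["name"]


def build_document(profile):
    """Replace the city foreign key with the city name for indexing."""
    return {"name": profile["name"], "city": city_by_id.get(profile["city"])}


# Change events from city_master, taken from the question's sample data.
for city_row in [{"city_id": 1, "name": "New York"}, {"city_id": 9, "name": "Tampa"}]:
    upsert_city(city_row)

# Records from main_profile, enriched into flat documents.
profiles = [{"name": "Edward", "city": 1}, {"name": "Jake", "city": 9}]
documents = [build_document(p) for p in profiles]
print(documents)
```

A real pipeline also has to re-emit documents when a city row changes after its profiles were indexed, which this sketch does not cover.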
I have a MySQL database column that contains JSON array encoded strings. I would like to search the JSON array for objects where the "Elapsed" value is greater than a particular number and return the TaskID value of each object in which the value was found. I have been attempting to use combinations of the JSON_SEARCH, JSON_CONTAINS, and JSON_EXTRACT functions, but I am not getting the desired results.
[
  {
    "TaskID": "TAS00000012344",
    "Elapsed": "25"
  },
  {
    "TaskID": "TAS00000012345",
    "Elapsed": "30"
  },
  {
    "TaskID": "TAS00000012346",
    "Elapsed": "35"
  },
  {
    "TaskID": "TAS00000012347",
    "Elapsed": "40"
  }
]
Referencing the JSON above, if I search for "Elapsed" > "30", then 2 records should be returned:
'TAS00000012346'
'TAS00000012347'
I am using MySQL version 5.7.11 and am new to querying JSON data. Any help would be appreciated. Thanks.
With MySQL pre-8.0, there is no easy way to turn a JSON array into a recordset (i.e., the function JSON_TABLE() is not yet available).
So, one way or another, we need to manually iterate through the array to extract the relevant pieces of data (using JSON_EXTRACT()). Here is a solution that uses an inline query to generate a list of numbers; another classic approach is to use a numbers table.
Assuming a table called mytable with a column called js holding the JSON content:
SELECT
    JSON_EXTRACT(js, CONCAT('$[', n.idx, '].TaskID')) TaskID,
    JSON_EXTRACT(js, CONCAT('$[', n.idx, '].Elapsed')) Elapsed
FROM mytable t
CROSS JOIN (
    SELECT 0 idx
    UNION ALL SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
) n
WHERE JSON_EXTRACT(js, CONCAT('$[', n.idx, '].Elapsed')) * 1.0 > 30
NB: in the WHERE clause, the * 1.0 operation is there to force the conversion to a number.
Demo on DB Fiddle with your sample data:
| TaskID | Elapsed |
| -------------- | ------- |
| TAS00000012346 | 35 |
| TAS00000012347 | 40 |
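The inline numbers list has to cover the longest array in the column, so when array lengths vary a lot, another option on 5.7 is to fetch the column and filter in application code. The following Python is a minimal sketch of that alternative, not part of the SQL solution above; the JSON text is the question's sample, and the helper name `task_ids_over` is hypothetical.

```python
import json

# The JSON text as it would come back from the js column (sample from the question).
js = """
[
  {"TaskID": "TAS00000012344", "Elapsed": "25"},
  {"TaskID": "TAS00000012345", "Elapsed": "30"},
  {"TaskID": "TAS00000012346", "Elapsed": "35"},
  {"TaskID": "TAS00000012347", "Elapsed": "40"}
]
"""


def task_ids_over(js_text, threshold):
    """Return the TaskID of every element whose Elapsed exceeds threshold."""
    tasks = json.loads(js_text)
    # Elapsed is stored as a string, so convert it before comparing,
    # which plays the same role as the "* 1.0" in the SQL version.
    return [t["TaskID"] for t in tasks if float(t["Elapsed"]) > threshold]


print(task_ids_over(js, 30))
```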
Yes, you can definitely do it using the JSON_EXTRACT() function in MySQL.
Let's take a table that contains JSON (table client_services here):
+-----+-----------+--------------------------------------+
| id  | client_id | service_values                       |
+-----+-----------+--------------------------------------+
| 100 | 1000      | { "quota": 1,"data_transfer":160000} |
| 101 | 1000      | { "quota": 2,"data_transfer":800000} |
| 102 | 1000      | { "quota": 3,"data_transfer":70000}  |
| 103 | 1001      | { "quota": 1,"data_transfer":97000}  |
| 104 | 1001      | { "quota": 2,"data_transfer":1760}   |
| 105 | 1002      | { "quota": 2,"data_transfer":1060}   |
+-----+-----------+--------------------------------------+
Now let's say we want the client_id of every row with quota > 1. Use this query:
SELECT
id,client_id,
JSON_EXTRACT(service_values, '$.quota') AS quota
FROM client_services
WHERE JSON_EXTRACT(service_values, '$.quota') > 1;
This returns:
+-----+-----------+-------+
| id  | client_id | quota |
+-----+-----------+-------+
| 101 | 1000      | 2     |
| 102 | 1000      | 3     |
| 104 | 1001      | 2     |
| 105 | 1002      | 2     |
+-----+-----------+-------+
Hope this helps!
I am using Vertica DB (and DBeaver as SQL Editor) - I am new to both tools.
I have a view with multiple columns:
someint | xyz   | c         | json
5       | 1542  | none      | {"range":23, "rm": 51, "spx": 30}
5       | 1442  | none      | {"range":24, "rm": 50, "spx": 3 }
3       | 1462  | none      | {"range":24, "rm": 50, "spx": 30}
(int)   | (int) | (Varchar) | (Long Varchar)
I want to create another view (or for the beginning, just be able to query it properly) of the above, but with the "json" column separated into the individual fields/columns "range", "rm" and "spx".
I imagine the output of the query / the new view to be something like the following:
someint | xyz  | c    | range | rm | spx
5       | 1542 | none | 23    | 51 | 30
5       | 1442 | none | 24    | 50 | 3
....
So far I have not been able to even query the "range" for example.
Hence my questions:
How can I separate the json column key-value structure into individual columns (in a query output)?
How can I transfer the desired output into a new view in Vertica?
I haven't found much help in the documentation as the procedure there is to load json text files from a drive or operate on tables, which I cannot do as I only have access to a view.
I have found a solution, so for anyone else encountering this problem:
SELECT someint, xyz, c,
MAPLOOKUP(MapJSONExtractor(json), 'range') AS range,
MAPLOOKUP(MapJSONExtractor(json), 'rm') AS rm,
MAPLOOKUP(MapJSONExtractor(json), 'spx') AS spx
FROM test;
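What the query does to each row can be sketched in Python: parse the JSON text once, then look up each wanted key and emit it as its own column. This is only an illustration of the idea, not Vertica code. The rows are the question's sample, the helper name `flatten_row` is hypothetical, and returning None for a missing key is an assumption of this sketch.

```python
import json

# Rows of the view as (someint, xyz, c, json), copied from the question.
rows = [
    (5, 1542, "none", '{"range":23, "rm": 51, "spx": 30}'),
    (5, 1442, "none", '{"range":24, "rm": 50, "spx": 3 }'),
    (3, 1462, "none", '{"range":24, "rm": 50, "spx": 30}'),
]


def flatten_row(row, keys=("range", "rm", "spx")):
    """Split the JSON column into one output column per key."""
    someint, xyz, c, raw = row
    parsed = json.loads(raw)
    # A key missing from the JSON becomes None (assumption of this sketch).
    return (someint, xyz, c) + tuple(parsed.get(k) for k in keys)


flat = [flatten_row(r) for r in rows]
for line in flat:
    print(line)
```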
I have a MySQL table authors with columns id, name and published_books. In this, published_books is a JSON column. With sample data,
id | name | published_books
-----------------------------------------------------------------------
1 | Tina | {
| | "17e9bf8f": {
| | "name": "Book 1",
| | "tags": [
| | "self Help",
| | "Social"
| | ],
| | "language": "English",
| | "release_date": "2017-05-01"
| | },
| | "8e8b2470": {
| | "name": "Book 2",
| | "tags": [
| | "Inspirational"
| | ],
| | "language": "English",
| | "release_date": "2017-05-01"
| | }
| | }
-----------------------------------------------------------------------
2 | John | {
| | "8e8b2470": {
| | "name": "Book 4",
| | "tags": [
| | "Social"
| | ],
| | "language": "Tamil",
| | "release_date": "2017-05-01"
| | }
| | }
-----------------------------------------------------------------------
3 | Keith | {
| | "17e9bf8f": {
| | "name": "Book 5",
| | "tags": [
| | "Comedy"
| | ],
| | "language": "French",
| | "release_date": "2017-05-01"
| | },
| | "8e8b2470": {
| | "name": "Book 6",
| | "tags": [
| | "Social",
| | "Life"
| | ],
| | "language": "English",
| | "release_date": "2017-05-01"
| | }
| | }
-----------------------------------------------------------------------
As you can see, the published_books column holds nested JSON data (one level deep). The JSON has dynamic UUIDs as keys, and each value is a JSON object with the book's details.
I want to search for books matching certain conditions and return only the JSON data of those books as the result.
This is the query I've written:
select JSON_EXTRACT(published_books, '$.*') from authors
where JSON_CONTAINS(published_books->'$.*.language', '"English"')
and JSON_CONTAINS(published_books->'$.*.tags', '["Social"]');
This query performs the search and returns the entire published_books JSON, but I want only the JSON of the matching books.
The expected result:
result
--------
"17e9bf8f": {
"name": "Book 1",
"tags": [
"self Help",
"Social"
],
"language": "English",
"release_date": "2017-05-01"
}
-----------
"8e8b2470": {
"name": "Book 6",
"tags": [
"Social",
"Life"
],
"language": "English",
"release_date": "2017-05-01"
}
There is no JSON function yet that filters elements of a document or array with "WHERE"-like logic.
But this is a task that some people using JSON data may want to do, so the solution MySQL has provided is the JSON_TABLE() function, which transforms the JSON document into a format as if you had stored your data in a normal table. Then you can apply a standard SQL WHERE clause to the fields returned.
You can't use this function in MySQL 5.7, but if you upgrade to MySQL 8.0 you can do this.
select authors.id, authors.name, books.* from authors,
json_table(published_books, '$.*'
columns(
bookid for ordinality,
name text path '$.name',
tags json path '$.tags',
language text path '$.language',
release_date date path '$.release_date')
) as books
where books.language = 'English'
and json_search(tags, 'one', 'Social') is not null;
+----+-------+--------+--------+-------------------------+----------+--------------+
| id | name  | bookid | name   | tags                    | language | release_date |
+----+-------+--------+--------+-------------------------+----------+--------------+
| 1  | Tina  | 1      | Book 1 | ["self Help", "Social"] | English  | 2017-05-01   |
| 3  | Keith | 2      | Book 6 | ["Social", "Life"]      | English  | 2017-05-01   |
+----+-------+--------+--------+-------------------------+----------+--------------+
Note that nested JSON arrays are still difficult to work with, even with JSON_TABLE(). In this example, I exposed the tags as a JSON array, and then used JSON_SEARCH() to find the tag you wanted.
I agree with Rick James: you might as well store the data in normalized tables and columns. You think that using JSON will save you some work, but it won't. It might make it more convenient to store the data as a single JSON document instead of multiple rows across several tables, but you just have to unravel the JSON again before you can query it the way you want.
Furthermore, if you store data in JSON, you will have to solve this sort of JSON_TABLE() expression every time you want to query the data. That's going to make a lot more work for you on an ongoing basis than if you had stored the data normally.
Frankly, I have yet to see a question on Stack Overflow about using JSON with MySQL that wouldn't lead to the conclusion that storing data in relational tables is a better idea than using JSON, if the structure of the data doesn't need to vary.
You are approaching the task backwards.
Do the extraction as you insert the data. Insert into a small number of tables (Authors, Books, Tags, and maybe a couple more) and build relations between them. No JSON is needed in this database.
The result is an easy-to-query and fast database. However, it requires learning about RDBMS and SQL.
JSON is useful when the data is a collection of random stuff. Your JSON is very regular, hence the data fits very nicely into RDBMS technology. In that case, JSON is merely a standard way to serialize the data. But it should not be used for querying.
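A minimal sketch of such a normalized layout, using Python's built-in sqlite3 module so it runs without a MySQL server. The table and column names (authors, books, book_tags) and the integer book ids are assumptions made for illustration; the rows come from the question's sample data. With this layout the "English books tagged Social" search becomes an ordinary join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per entity, related by foreign keys; no JSON column anywhere.
conn.executescript("""
CREATE TABLE authors (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE books (
    id           INTEGER PRIMARY KEY,
    author_id    INTEGER NOT NULL REFERENCES authors(id),
    name         TEXT NOT NULL,
    language     TEXT NOT NULL,
    release_date TEXT NOT NULL
);
CREATE TABLE book_tags (
    book_id INTEGER NOT NULL REFERENCES books(id),
    tag     TEXT NOT NULL,
    PRIMARY KEY (book_id, tag)
);
""")

conn.executemany("INSERT INTO authors VALUES (?, ?)",
                 [(1, "Tina"), (2, "John"), (3, "Keith")])
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", [
    (1, 1, "Book 1", "English", "2017-05-01"),
    (2, 1, "Book 2", "English", "2017-05-01"),
    (4, 2, "Book 4", "Tamil", "2017-05-01"),
    (5, 3, "Book 5", "French", "2017-05-01"),
    (6, 3, "Book 6", "English", "2017-05-01"),
])
conn.executemany("INSERT INTO book_tags VALUES (?, ?)", [
    (1, "self Help"), (1, "Social"), (2, "Inspirational"),
    (4, "Social"), (5, "Comedy"), (6, "Social"), (6, "Life"),
])

# English books tagged "Social", with their authors.
english_social = conn.execute("""
    SELECT a.name, b.name
    FROM books AS b
    JOIN authors AS a ON a.id = b.author_id
    JOIN book_tags AS t ON t.book_id = b.id
    WHERE b.language = 'English' AND t.tag = 'Social'
    ORDER BY b.id
""").fetchall()
print(english_social)
```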