I am trying to use Pentaho Kettle to read a JSON file with the structure below and insert the data into the DW (Redshift).
{
"_id": {
"_data": "11111111111111"
},
"operationType": "insert",
"clusterTime": {
"$timestamp": {
"t": 1599495064,
"i": 1
}
},
"ns": {
"db": "abc",
"coll": "abc"
},
"documentKey": {
"_id": {
"$uuid": "abcabcabcabcabcabc"
}
},
"fullDocument": {
"_id": {
"$uuid": "abcabcabcabcabcabc"
},
"orderNumber": "1234567",
"externalOrderId": "12345678",
"orderDateTime": "2020-09-11T08:06:26Z[UTC]",
"attraction": "abc",
"entryDate": {
"$date": 1599523200000
},
"entryTime": {
"$date": 1599472800000
},
"requestId": "abc",
"ticketUrl": "abc",
"tickets": [
{
"passId": "1111111",
"externalTicketId": "1234567"
},
{
"passId": "222222222",
"externalTicketId": "122442492"
}
],
"_class": "abc"
}
}
As you can see above, both fields "entry_date" and "entry_time" arrive as Unix epoch timestamps (in milliseconds). I need to somehow take the date component from "entry_date" and the time component from "entry_time" and combine them into a single field that gives me output like: "GMT: Monday, September 7, 2020 4:11:04 PM".
I would also like to achieve the same for the field "orderDateTime": is there any way I could use Pentaho to transform it into that same "GMT: Monday, September 7, 2020 4:11:04 PM" format?
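For what it's worth, this is the kind of logic I'm after, sketched as a Modified Java Script Value step placed before the Select Values step (a sketch only; it assumes both fields reach the step as epoch-millisecond numbers named entry_date and entry_time):

// Build one UTC timestamp from the date part of entry_date
// and the clock-time part of entry_time (both epoch milliseconds).
var datePart = new Date(entry_date);
var timePart = new Date(entry_time);

var entry_datetime = new Date(Date.UTC(
  datePart.getUTCFullYear(),
  datePart.getUTCMonth(),
  datePart.getUTCDate(),
  timePart.getUTCHours(),
  timePart.getUTCMinutes(),
  timePart.getUTCSeconds()
));
// Expose entry_datetime as a new output field of type Date,
// then apply the desired display format in a later step.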
The 3 fields above ("entry_time", "entry_date" and "orderDateTime") currently pass through a "Select Values" step; from there I take the data into the DW with the Table Output step. Any help is appreciated.
Related
I have a few records in Elasticsearch. I want to group the records by user_id and fetch the latest record whose event_type is 1.
If the latest record's event_type value is not 1, we should not fetch that record. I did it with a MySQL query; please let me know how I can do the same in Elasticsearch.
The MySQL query:
SELECT * FROM user_events
WHERE id IN( SELECT max(id) FROM `user_events` group by user_id ) AND event_type=1;
I need the same output from Elasticsearch aggregations.
Elasticsearch Query:
GET test_analytic_report/_search
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"event_date": {
"gte": "2022-10-01",
"lte": "2023-02-06"
}
}
}
]
}
},
"sort": {
"event_date": {
"order": "desc"
}
},
"aggs": {
"group": {
"terms": {
"field": "user_id"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"_source": ["user_id", "event_date", "event_type"],
"sort": {
"user_id": "desc"
}
}
}
}
}
}
}
With the above query: I have two users, with user_id 55 and 56. The aggregation fetches records with other event_type values too, but I want only the latest record per user, and only when that latest record has event_type=1; if a user's last record does not have event_type=1, that user should not be returned. For example, user_id 56's latest record has event_type 2, so it should not appear in my aggregation results.
I tried, but it's not returning the exact result that I want.
Note: event_date is the current date and time. I inserted the records manually, which is why the dates differ.
GET user_events/_search
{
"size": 1,
"query": {
"term": {
"event_type": 1
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
Explanation: This is an Elasticsearch API request in JSON format. It retrieves the most recent event of type 1 (specified by "event_type": 1 in the query) from the user_events index, returning a single document ("size": 1) sorted in descending order by the id field.
If your ES version supports it, you can do this with the field collapse feature. Here is an example query:
{
"_source": false,
"query": {
"bool": {
"filter": {
"term": {
"event_type": 1
}
}
}
},
"collapse": {
"field": "user_id",
"inner_hits": {
"name": "the_record",
"size": 1,
"sort": [
{
"id": "desc"
}
]
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
In the response, you will see that the document you want is in inner_hits, under the name you gave it. In my example it is the_record. You can increase the inner hits size if you want more records in each group, and sort them as well.
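For illustration, the response shape looks roughly like this (the values here are made up):

{
  "hits": {
    "hits": [
      {
        "_id": "...",
        "inner_hits": {
          "the_record": {
            "hits": {
              "hits": [
                {
                  "_source": { "user_id": 55, "event_type": 1, "id": 101 }
                }
              ]
            }
          }
        }
      }
    ]
  }
}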
Tldr;
There are many ways to go about it:
Sorting
Collapsing
Latest Transform
All of those solutions are approximations of what you could get with SQL.
But my personal favourite is the transform.
Solution - transform jobs
Set up
We create 2 users, each with 2 events.
PUT 75324839/_bulk
{"create":{}}
{"user_id": 1, "type": 2, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 1, "type": 1, "date": "2016-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 1, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 2, "date": "2016-01-01T00:00:00.000Z"}
Transform job
This transform job is going to run against the index 75324839.
It will find the latest document per user_id, based on the value of the date field.
And the results are going to be stored in latest_75324839.
PUT _transform/75324839
{
"source": {
"index": [
"75324839"
]
},
"latest": {
"unique_key": [
"user_id"
],
"sort": "date"
},
"dest": {
"index": "latest_75324839"
}
}
If you were to query latest_75324839, you would find:
{
  "hits": [
    {
      "_index": "latest_75324839",
      "_id": "AGvuZWuqqz7c5ytICzX5Z74AAAAAAAAA",
      "_score": 1,
      "_source": {
        "date": "2016-01-01T00:00:00.000Z",
        "user_id": 1,
        "type": 1
      }
    },
    {
      "_index": "latest_75324839",
      "_id": "AA3tqz9zEwuio1D73_EArycAAAAAAAAA",
      "_score": 1,
      "_source": {
        "date": "2016-01-01T00:00:00.000Z",
        "user_id": 2,
        "type": 2
      }
    }
  ]
}
Get the final results
To get the number of users with type=1, a simple search query such as the following works:
GET latest_75324839/_search
{
"query": {
"term": {
"type": {
"value": 1
}
}
},
"aggs": {
"number_of_user": {
"cardinality": {
"field": "user_id"
}
}
}
}
Side notes
This transform job runs in batch mode, which means it will only run once.
It is also possible to run it in a continuous fashion, so that the destination index always holds the latest event per user_id; see the sketch below.
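A continuous version of the same job could look like this (a sketch; the frequency and sync delay values are assumptions you would tune to your ingest latency):

PUT _transform/75324839_continuous
{
  "source": {
    "index": ["75324839"]
  },
  "latest": {
    "unique_key": ["user_id"],
    "sort": "date"
  },
  "dest": {
    "index": "latest_75324839"
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "date",
      "delay": "60s"
    }
  }
}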
You are looking for an SQL HAVING clause, which would allow you to filter results after grouping. But sadly there is no equivalent in Elasticsearch.
So it is not possible to:
sort, collapse and filter afterwards (even post_filter does not help here)
use a top_hits aggregation with custom sorting and then filter
use any map/reduce scripted aggregations, as they do not support sorting
work with subqueries.
So basically, Elastic is not a database. Any sorting or relation to other documents should be based on scoring, and the score should be calculated independently for each document, distributed across shards.
But there is a tiny loophole which might be the solution for your use case. It is based on a top_metrics aggregation followed by a bucket_selector to eliminate the unwanted event types:
GET test_analytic_report/_search
{
"size": 0,
"aggs": {
"by_id": {
"terms": {
"field": "user_id",
"size": 100
},
"aggs": {
"tm": {
"top_metrics": {
"metrics": {
"field": "event_type"
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
},
"event_type_filter": {
"bucket_selector": {
"buckets_path": {
"event_type": "tm.event_type"
},
"script": "params.event_type == 1"
}
}
}
}
}
}
If you require more fields from the source document, you can add them to the top_metrics, as sketched below. It is sorted by id now, but you could also sort by event_date.
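For example (a sketch), the metrics parameter accepts a list of fields:

"top_metrics": {
  "metrics": [
    { "field": "event_type" },
    { "field": "event_date" }
  ],
  "sort": [
    { "id": { "order": "desc" } }
  ]
}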
I'm trying to limit the map in my view to a specific set of documents, either by having the id start with a given string or based on there being a specific node in the JSON. I can't seem to get a result set once I add an if statement. The reduce is a simple _count:
function(doc, meta) {
if (doc.metricType == "Limit_Exceeded") {
emit([doc.ownedByCustomerNumber, doc.componentProduct.category], meta.id);
}
}
I've also tried if (doc.metricType) and also if (meta.id.startsWith("Turnaway:")).
Example Doc:
{
"OvidUserId": 26105400,
"id": "Turnaway:00005792:10562440",
"ipAddress": "111187081038",
"journalTurnawayNumber": 10562440,
"metricType": "Limit_Exceeded",
"oaCode": "OA_Gold",
"orderNumber": 683980,
"ovidGroupID": 3113900,
"ovidGroupName": "tnu999",
"ovidUserName": "tnu999",
"ownedByCustomerNumber": 59310,
"platform": "Lippincott",
"samlString": "",
"serialName": "00005792",
"sessionID": "857616ee-dab7-43d0-a08b-abb2482297dd",
"soldProduct": {
"category": "Multidisciplinary Subjects",
"name": "Custom Collection For CALIS - LWW TA 2020",
"productCode": "CCFCCSI20",
"productNumber": 33410,
"subCategory": "",
"subject": "Multidisciplinary Subjects"
},
"soldToCustomer": {
"customerNumber": 59310,
"keyAccount": false,
"name": "Tongji University"
},
"turnawayDateTime": "2022-05-04T03:01:44.600",
"usedByCustomer": {
"customerNumber": 59310,
"keyAccount": false,
"name": "Tongji University"
},
"usedByCustomerNumber": 59310,
"yearMonth": "202205"
}
Thanks,
Gerry
Found it (of course, right after posting the question). The second component of the key in the emit has to exist: I had entered doc.componentProduct.category instead of doc.soldProduct.category.
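For reference, the working map function looks like this (a sketch, with an extra existence guard added):

function (doc, meta) {
  // Only emit for the documents we care about, and only when the
  // nested soldProduct object actually exists on the document.
  if (doc.metricType == "Limit_Exceeded" && doc.soldProduct) {
    emit([doc.ownedByCustomerNumber, doc.soldProduct.category], meta.id);
  }
}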
I've got a Select operation on an object that contains one key more than once.
It's practically two versions of one object in one JSON array.
I want to get the id of both of those objects.
When I inspect the object, I can clearly see the two different IDs, but the Select operation returns only one of them, twice.
This is the original Object:
[
  {
    "Created": "2020-06-05T11:47:42",
    "ID": 9
  },
  {
    "Created": "2020-06-05T11:06:04",
    "ID": 10
  }
]
The Select-Operation looks like this:
{
"inputs": {
"from": "#body('Rest')?['value']",
"select": {
"ID": "#triggerBody()?['ID']",
"Created": "#triggerBody()?['Created']"
}
}
}
And it returns:
[
  {
    "Created": "2020-06-05T11:47:42",
    "ID": 9
  },
  {
    "Created": "2020-06-05T11:47:42",
    "ID": 9
  }
]
I don't really understand what's going on.
"select": {
"ID": "#triggerBody()?['ID']",
"Created": "#triggerBody()?['Created']"
}
is wrong: triggerBody() always resolves to the same trigger payload, so every element of the output repeats that one record. Inside a Select, the current element must be referenced with item()?['ID'] instead.
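The corrected action would look roughly like this (a sketch, shown with the @ prefix that Logic Apps code view normally uses, where the question shows #):

{
  "inputs": {
    "from": "@body('Rest')?['value']",
    "select": {
      "ID": "@item()?['ID']",
      "Created": "@item()?['Created']"
    }
  }
}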
I have a MongoDB that is structured as below:
[
{
"subject_id": "1",
"name": "Maria",
"dob": "1/1/00",
"gender": "F",
"visits": {
"1/1/18": {
"date_entered": "1/2/18",
"entered_by": "Sally"
},
"1/2/18": {
"date_entered": "1/2/18",
"entered_by": "Tim",
}
},
"samples": {
"XXX123": {
"collected_by": "Sally",
"collection_date": "1/3/18"
}
}
},
{
"subject_id": "2",
"name": "Bob",
"dob": "1/2/00",
"gender": "M",
"visits": {
"1/3/18": {
"date_entered": "1/4/18",
"entered_by": "Tim"
}
},
"samples": {
"YYY456": {
"collected_by": "Sally",
"collection_date": "1/5/18"
},
"ZZZ789": {
"collected_by": "Tim",
"collection_date": "1/6/18"
},
"AAA123": {
"collected_by": "Sally",
"collection_date": "1/7/18"
}
}
}
]
If I wanted to query the database to find all samples collected by Sally or all visits entered by Tim, what would be the best way of doing that?
I'm new to MongoDB, and my attempts with various regexes haven't produced results. Any advice would be greatly appreciated.
I first used $project with $objectToArray on the required fields, then $unwind to create separate records for each element of the arrays created in the $project stage.
The results are then filtered using $match.
This works for the data provided in the question:
db.so.aggregate([
{$project: {visits: {$objectToArray: "$visits"}, samples: {$objectToArray: "$samples"}}},
{$unwind: "$visits"},
{$unwind: "$samples"},
{ $match: {
$or : [
{ "visits.v.entered_by" : "Tim" },
{ "samples.v.collected_by" : "Sally" }
]
}
}
])
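For instance, one of the documents this returns for subject 2 looks roughly like this (trimmed; the ObjectId is elided):

{
  "_id": ObjectId("..."),
  "visits": {
    "k": "1/3/18",
    "v": { "date_entered": "1/4/18", "entered_by": "Tim" }
  },
  "samples": {
    "k": "YYY456",
    "v": { "collected_by": "Sally", "collection_date": "1/5/18" }
  }
}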
I have JSON data like the following:
{
"!type": "alarm",
"$": {
"12279": {
"!type": "alarm",
"title": "Default",
"$": {
"5955": {
"!type": "alarm",
"name": "Wake",
"day": "SUN",
"startTime": "06:00"
},
"29323": {
"!type": "alarm",
"name": "Away",
"day": "SUN",
"startTime": "08:00"
},
"2238": {
"!type": "alarm",
"name": "Home",
"day": "SUN",
"startTime": "18:00"
}
}
}
}
}
My .fbs schema looks like this:
namespace space.alarm;
table Atom {
  !type: string;
  name: string;
  startDay: string;
  startTime: string;
}

table AtomShell {
  key: string (required, key);
  value: Atom;
}

table Alarm {
  !type: string;
  title: string;
  $: [AtomShell];
}

table AlarmShell {
  key: string (required, key);
  value: Alarm;
}

table Weeklyalarm {
  !type: string;
  $: [AlarmShell];
}

root_type Weeklyalarm;
I'm trying to implement Google FlatBuffers, but I'm getting errors like:
alarm.fbs:4:0: error: illegal character: !
alarm.fbs:23:0: error: illegal character: $
Sample.json:25:0: error: unknown field: 12279
(I have removed ! from !type and changed $ to dollar to check that FlatBuffers otherwise works, but I can't change the dynamic IDs.)
Now my questions:
Is it possible to use dynamic IDs in FlatBuffers? If so, how should I proceed?
Can I use special characters in IDs? If so, how?
Thanks in advance.
You can't have characters like ! and $ in field names. Just use type instead of !type, etc.
Not sure what you mean by dynamic ids. All field names (keys) have to be declared in the schema, so they can't be dynamic. You can still achieve similar results though, if you make your JSON look something like this:
{
"type": "alarm",
"data": [
{
id: "12279",
"type": "alarm",
"title": "Default",
"data": [
{
"id": "5955",
"type": "alarm",
"name": "Wake",
"day": "SUN",
"startTime": "06:00"
},
{
"id": "29323",
"type": "alarm",
"name": "Away",
"day": "SUN",
"startTime": "08:00"
},
{
"id": "2238",
"type": "alarm",
"name": "Home",
"day": "SUN",
"startTime": "18:00"
}
]
}
]
}
And then write the corresponding schema, sketched below.
Note that I made the "dynamic" list into a vector, and moved the id into the object itself.
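A matching schema could look like this (a sketch; the table and field names simply mirror the restructured JSON above, they are not prescribed by FlatBuffers):

namespace space.alarm;

table Atom {
  id: string;
  type: string;
  name: string;
  day: string;
  startTime: string;
}

table Alarm {
  id: string;
  type: string;
  title: string;
  data: [Atom];
}

table WeeklyAlarm {
  type: string;
  data: [Alarm];
}

root_type WeeklyAlarm;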
Other tip: string values that are not dynamic (like "alarm") will take up way less space if you make them into an enum instead.