How to format the TSV file in Druid

How to format the TSV file in Druid - json

I am trying to load in a TSV in druid using this ingestion speck:
MOST UPDATED SPEC BELOW:
{
"type" : "index",
"spec" : {
"ioConfig" : {
"type" : "index",
"inputSpec" : {
"type": "local",
"baseDir": "quickstart",
"filter": "test_data.json"
}
},
"dataSchema" : {
"dataSource" : "local",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none",
"intervals" : ["2016-07-18/2016-07-22"]
},
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : ["name", "email", "age"]
},
"timestampSpec" : {
"format" : "yyyy-MM-dd HH:mm:ss",
"column" : "date"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"type" : "doubleSum",
"name" : "age",
"fieldName" : "age"
}
]
}
}
}
If my schema looks like this:
Schema: name email age
And actual dataset looks like this:
name email age Bob Jones 23 Billy Jones 45
Is this how the columns should be formatted^^ in the above dataset for a TSV? Like name email age should be first (the columns) and then the actual data. I am confused how Druid will know how to map the columns to the actual dataset in TSV format.

TSV stands for tab separated format, so it looks the same as csv but you will use tabs instead of commas e.g.
Name<TAB>Age<TAB>Address
Paul<TAB>23<TAB>1115 W Franklin
Bessy the Cow<TAB>5<TAB>Big Farm Way
Zeke<TAB>45<TAB>W Main St
you will use frist line as header to define your column names - so you can use "name", "age" or "email" in dimensions in your spec file
as for the gmt and utc, they are basically the same
There is no time difference between Greenwich Mean Time and
Coordinated Universal Time
first one is time zone, the other one is a time standard
btw don`t forget to include a column with some time value in your tsv file!!
so e.g. if you will have tsv file that looks like:
"name" "position" "office" "age" "start_date" "salary"
"Airi Satou" "Accountant" "Tokyo" "33" "2016-07-16T19:20:30+01:00" "162700"
"Angelica Ramos" "Chief Executive Officer (CEO)" "London" "47" "2016-07-16T19:20:30+01:00" "1200000"
your spec file should look like this:
{
"spec" : {
"ioConfig" : {
"inputSpec" : {
"type": "local",
"baseDir": "path_to_folder",
"filter": "name_of_the_file(s)"
}
},
"dataSchema" : {
"dataSource" : "local",
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "hour",
"queryGranularity" : "none",
"intervals" : ["2016-07-01/2016-07-28"]
},
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "tsv",
"dimensionsSpec" : {
"dimensions" : [
"position",
"age",
"office"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "start_date"
}
}
},
"metricsSpec" : [
{
"name" : "count",
"type" : "count"
},
{
"name" : "sum_sallary",
"type" : "longSum",
"fieldName" : "salary"
}
]
}
}
}

Related

jq - how can I flat my list of lists into one level list

how can I have transformed my json
{
"clients": [
{
"id" : "qwerty",
"accounts" : [{"number" : "6666"}, {"number" : "7777"}]
},
{
"id" : "zxcvb",
"accounts" : [{"number" : "1111"}, {"number" : "2222"}]
}
]
}
into following type of json? using JQ
{
"items": [
{
"id" : "qwerty",
"number" : "6666"
},{
"id" : "qwerty",
"number" : "7777"
},{
"id" : "zxcvb",
"number" : "1111"
},{
"id" : "zxcvb",
"number" : "2222"
}]
}
What kind of tools from JQ can help me? I can't choose any possible way to do it

Something like this should do the trick:
{items: [.clients[] | {id} + .accounts[]]}
Online demo

Mongo forEach Query

I have the JSON that you can see below and I want to sum the values of the two objects, but when I make an aggregation it returns me 0.Here you can see the query that I use; really the first line I only use it to be sure that the path works, and it does. On the other hand,when I use this path in the aggregation query it gives me the "ID" and the "COUNT" with right values,but the "SUM" is always 0 when it must be 3600.Any idea?
db.getCollection('TEST').find({"prices.year.months.day.csv.price.valPrice":1800})
db.TEST.aggregate([
{ $match: {"location.cp":"20830"}},
{$group:{_id:"20830",total:{$sum:"$prices.year.months.day.csv.price.valPrice"}, count: { $sum: 1 }
}}])
And this is the JSON:
{
"_id" : "20830:cas:S:3639",
"lodgtype" : "Casa",
"lodg" : "Motrico: country holiday home - San sebastian",
"webid" : "6107939",
"location" : {
"thcod" : "20",
"cp" : "20830",
"th" : "Gipuzkoa",
"geometry" : {
"type" : "Point",
"coordinates" : [
43.31706238,
-2.40293598
]
}
},
"prices" : {
"year" : [
{
"valYear" : "2018",
"months" : [
{
"valMonth" : "02",
"day" : [
{
"valDay" : "13",
"csv" : [
{
"valCsv" : "20180205210908_223",
"price" : [
{
"valPrice" : 1800.0
}
]
}
]
}
]
}
]
}
]
},
"reg" : {
"created" : "20180213",
"updated" : "20180213",
"viewed" : "20180213"
}
},{
"_id" : "TEST20830:cas:S:3639",
"lodgtype" : "Casa",
"lodg" : "TESTMotrico: country holiday home - San sebastian",
"webid" : "6107930",
"location" : {
"thcod" : "20",
"cp" : "20830",
"th" : "Gipuzkoa",
"geometry" : {
"type" : "Point",
"coordinates" : [
43.31706238,
-2.40293598
]
}
},
"prices" : {
"year" : [
{
"valYear" : "2018",
"months" : [
{
"valMonth" : "02",
"day" : [
{
"valDay" : "13",
"csv" : [
{
"valCsv" : "20180205210908_223",
"price" : [
{
"valPrice" : 1800.0
}
]
}
]
}
]
}
]
}
]
},
"reg" : {
"created" : "20180213",
"updated" : "20180213",
"viewed" : "20180213"
}
}

Since you've deeply nested array you've to unwind to flatten to a document structure. To count the number of matches you've to use extra group after $match with $push with $$ROOT to keep the matching data.
db.TEST.aggregate([
{"$match":{"location.cp":"20830"}},
{"$group":{
"_id":"20830",
"data":{"$push":"$$ROOT"},
"count":{"$sum":1}
}},
{"$unwind":"$data.prices.year"},
{"$unwind":"$data.prices.year"},
{"$unwind":"$data.prices.year.months"},
{"$unwind":"$data.prices.year.months.day"},
{"$unwind":"$data.prices.year.months.day.csv"},
{"$unwind":"$data.prices.year.months.day.csv.price"},
{"$group":{
"_id":"20830",
"total":{"$sum":"$prices.year.months.day.csv.price.valPrice"},
"count":{"$first":"$count"}
}}
])

Suggest term search result not proper in elasticsearch

I am using elasticsearch 6.1 version and I want to use "Suggester" of its feature. I have dumped data in the format its required for suggester.
I have used these queries.
PUT /hotels
{
"mappings": {
"hotel" : {
"properties" : {
"name" : { "type" : "keyword" },
"city" : { "type" : "keyword" },
"name_suggest" : {
"type" : "completion"
}
}
}
}
}
put hotels/hotel/1
{
"name" : "Mercure Hotel Munich",
"city" : "Munich",
"name_suggest" : "Mercure Hotel Munich"
}
put /hotels/hotel/2
{
"name" : "Hotel Monaco",
"city" : "Munich",
"name_suggest" : "Hotel Monaco"
}
put /hotels/hotel/3
{
"name" : "Courtyard by Marriot Munich City",
"city" : "Munich",
"name_suggest" : "Courtyard by Marriot Munich City"
}
Then I fire my search query that is
Post http://localhost:9200/hotels/_search
{
"suggest": {
"name_suggest": {
"text": "h",
"completion": {
"field": "name_suggest"
}
}
}
}
I get the output result. It only return me the 2 data that is "Hotel Monaco" but it doesn't suggest me "Mercure Hotel Munich" it has hotel.
I want both in my suggest result set
I have tried with "prefix" as well.
Has anyone tried it. Please suggest me any solution to it.

How do I use min/max on a nested field in rethinkdb

I'm trying to determine the best way to calculate the elapsed time it took for each operation, series of actions. Looking at my example data below, how might I take the min/max for the "actions" array, for each corresponding operation, which includes 'take' and 'throw' actions:
{
"name" : "test",
"location" : "here",
"operation" "hammer use",
"actions" : [
{
"action" : "take",
"object" : "hammer",
"timestamp" : "12332234234"
},
{
"action" : "drop",
"object" : "hammer",
"timestamp" : "12332234255"
},
{
"action" : "take",
"object" : "hammer",
"timestamp" : "12332234266"
},
{
"action" : "throw",
"object" : "hammer",
"timestamp" : "12332234277"
}
},
{
"name" : "test 2",
"location" : "there",
"operation" : "rock use",
"actions" : [
{
"action" : "take",
"object" : "rock",
"timestamp" : "12332534277"
},
{
"action" : "drop",
"object" : "rock",
"timestamp" : "12332534288"
},
{
"action" : "take",
"object" : "rock",
"timestamp" : "12332534299"
},
{
"action" : "throw",
"object" : "rock",
"timestamp" : "12332534400"
},
{
"name" : "test 3",
"location" : "elsewhere",
"operation" : "seal hose",
"actions" : [
{
"action" : "create",
"object" : "grommet",
"timestamp" : "12332534277"
},
{
"action" : "place",
"object" : "grommet",
"timestamp" : "12332534288"
},
{
"action" : "tighten",
"object" : "hose",
"timestamp" : "12332534299"
}
}
Expected output:
{
"operation" : "hammer use",
"elapsed_time" : 123
},
{
"operation" : "rock use",
"elapsed_time" : 123
}
I'm still new to rethinkdb and trying to get a hang for it. So far, I've come up with the following query to pick the specific records, i'm interested in, from the table:
r.db('test').table('operations').filter(function(row) {
return row('actions').contains(function(x) {
return x('action').eq('take')}).and(
row('actions').contains(function(x) { return x('action').eq('throw') })
);
});
I'm still trying to figure out how to aggregate the results by taking the min/max of the timestamp and subtracting them from each other.
I hope there's enough detail there to get an idea for the goal at hand. Let me know otherwise. Any help greatly appreciated.

Well, nobody tugged on this so I had to solve it without any help. Took a bit longer but finally figured out how. Here's the pseudocode for finding min/max on the nested fields above, and elapsed_time:
r.db('test').table('operations').filter(function(row) {
return row('actions').contains(function(x) { return x('action').eq("take") }).and(
row('actions').contains(function(x) { return x('action').eq("throw") })
);
}).map(function(doc) {
return {
operation: doc('operation'),
min: doc('actions')('timestamp').min(),
max: doc('actions')('timestamp').max(),
elapsed_time: doc('actions')('timestamp').max().sub(doc('actions')('timestamp').min())
}
})

Sub-records in Avro with Morphlines

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema as displayed below (subrec.avsc). As you can see the schema consist of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**"]
commands : [
# Read the JSON blob
{ readJson: {} }
{ logError { format : "record: {}", args : ["#{}"] } }
# Extract JSON
{ extractJsonPaths { flatten: false, paths: {
"/record_type[]/alert/action" : /alert/action,
"/record_type[]/alert/signature_id" : /alert/signature_id,
"/record_type[]/alert/signature" : /alert/signature,
"/record_type[]/alert/category" : /alert/category,
"/record_type[]/alert/severity" : /alert/severity
} } }
{ logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } }
{ extractJsonPaths { flatten: false, paths: {
timestamp : /timestamp,
event_type : /event_type,
source_ip : /src_ip,
source_port : /src_port,
destination_ip : /dest_ip,
destination_port : /dest_port,
protocol : /proto,
} } }
# Create Avro according to schema
{ logError { format : "WE GO TO AVRO"} }
{ toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }
# Create Avro container
{ logError { format : "WE GO TO BINARY"} }
{ writeAvroToByteArray { format: containerlessBinary } }
{ logError { format : "DONE!!!"} }
]
}
]
And the subrec.avsc
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "event_type",
"type" : "string"
}, {
"name" : "source_ip",
"type" : "string"
}, {
"name" : "source_port",
"type" : "int"
}, {
"name" : "destination_ip",
"type" : "string"
}, {
"name" : "destination_port",
"type" : "int"
}, {
"name" : "protocol",
"type" : "string"
}, {
"name": "record_type",
"type" : ["null", {
"name" : "alert",
"type" : "record",
"fields" : [ {
"name" : "action",
"type" : "string"
}, {
"name" : "signature_id",
"type" : "int"
}, {
"name" : "signature",
"type" : "string"
}, {
"name" : "category",
"type" : "string"
}, {
"name" : "severity",
"type" : "int"
}
] } ]
} ]
}
The output on { logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } } I output the following:
[{
/record_type[]/alert / action = [allowed],
/record_type[]/alert / category = [],
/record_type[]/alert / severity = [3],
/record_type[]/alert / signature = [GeoIP from NL,
Netherlands],
/record_type[]/alert / signature_id = [88006],
_attachment_body = [{
"timestamp": "2015-03-23T07:42:01.303046",
"event_type": "alert",
"src_ip": "1.1.1.1",
"src_port": 18192,
"dest_ip": "46.231.41.166",
"dest_port": 62004,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": "1",
"signature_id": "88006",
"rev": "1",
"signature" : "GeoIP from NL, Netherlands ",
"category" : ""
"severity" : "3"
}
}],
_attachment_mimetype=[json/java + memory],
basename = [simple_eve.json]
}]

UPDATE 2017-06-22
you MUST populate the data in the structure in order for this to work, by using addValues or setValues
{
addValues {
micDefaultHeader : [
{
eventTimestampString : "2017-06-22 18:18:36"
}
]
}
}
after debugging the sources of morphline toAvro, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
the solution is quite simple, but unfortunately took a little extra time, eclipse, running the flume agent in debug mode, cloning the source code and lots of coffee.
here it goes.
my schema:
{
"type" : "record",
"name" : "co_lowbalance_event",
"namespace" : "co.tigo.billing.cboss.lowBalance",
"fields" : [ {
"name" : "dummyValue",
"type" : "string",
"default" : "dummy"
}, {
"name" : "micDefaultHeader",
"type" : {
"type" : "record",
"name" : "mic_default_header_v_1_0",
"namespace" : "com.millicom.schemas.root.struct",
"doc" : "standard millicom header definition",
"fields" : [ {
"name" : "eventTimestampString",
"type" : "string",
"default" : "12345678910"
} ]
}
} ]
}
morphlines file:
morphlines : [
{
id : convertJsonToAvro
importCommands : ["org.kitesdk.**"]
commands : [
{
readJson {
outputClass : java.util.Map
}
}
{
addValues {
micDefaultHeader : [{}]
}
}
{
logDebug { format : "my record: {}", args : ["#{}"] }
}
{
toAvro {
schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
}
}
{
writeAvroToByteArray {
format : containerlessJSON
codec : null
}
}
]
}
]
the magic lies here:
{
addValues {
micDefaultHeader : [{}]
}
}
and in the mappings:
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
explanation:
inside the code the first field name that is evaluated is micDefaultHeader of type RECORD. as there's no way to specify a default value for a RECORD (logically correct), the toAvro code evaluates this, does not get any value configured in mappings and therefore it fails at it detects (wrongly) that the record is empty when it shouldn't.
however, taking a look at the code, you may see that it requires a Map object, containing no values to please the parser and continue to the next element.
so we add a map object using the addValues and fill it with an empty map [{}]. notice that this must match the name of the record that is causing you an empty value. in my case "micDefaultHeader"
feel free to comment if you have a better solution, as this looks like a "dirty fix"

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

How to format the TSV file in Druid - json

Related

jq - how can I flat my list of lists into one level list

Mongo forEach Query

Suggest term search result not proper in elasticsearch

How do I use min/max on a nested field in rethinkdb

Sub-records in Avro with Morphlines

Categories

Resources