Need to extract the timestamp from a logstash elasticsearch cluster - json

I'm trying to determine the freshness of the most recent record in my logstash cluster, but I'm having a bit of trouble digesting the Elasticsearch DSL.
Right now I am doing something like this to extract the timestamp:
curl -sX GET 'http://localhost:9200/logstash-2015.06.02/' -d'{"query": {"match_all": {} } }' | json_pp | grep timestamp
which gets me:
"@timestamp" : "2015-06-02T00:00:28.371+00:00",
I'd like to use an elasticsearch query directly with no grep hackiness.
The raw JSON (snipped for length) looks like this:
{
"took" : 115,
"timed_out" : false,
"hits" : {
"hits" : [
{
"_index" : "logstash-2015.06.02",
"_source" : {
"type" : "syslog",
"@timestamp" : "2015-06-02T00:00:28.371+00:00",
"tags" : [
"sys",
"inf"
],
"message" : " 2015/06/02 00:00:28 [INFO] serf: EventMemberJoin: generichost.example.com 10.1.1.10",
"file" : "/var/log/consul.log",
"@version" : 1,
"host" : "generichost.example.com"
},
"_id" : "AU4xcf51cXOri9NL1hro",
"_score" : 1,
"_type" : "syslog"
},
],
"total" : 8605141,
"max_score" : 1
},
"_shards" : {
"total" : 50,
"successful" : 50,
"failed" : 0
}
}
Any help would be appreciated. I know the query is simple, I just don't know what it is.

You don't need to use the DSL for this. You can simply cram everything into the URL query string, like this:
curl -s -XGET 'localhost:9200/logstash-2015.06.02/_search?_source=@timestamp&size=1&sort=@timestamp:desc&format=yaml'
So:
_source=@timestamp means we're only interested in getting the @timestamp value
size=1 means we only need one result
sort=@timestamp:desc means we want to sort on @timestamp descending (i.e. latest first)
format=yaml will get you the result in YAML format which is a bit more concise than JSON in your case
The output would look like this:
- _index: "logstash-2015.06.02"
  _type: "syslog"
  _id: "AU4xcf51cXOri9NL1hro"
  _score: 1.0
  _source:
    @timestamp: "2015-06-02T00:00:28.371+00:00"
You don't need json_pp anymore, and you can still simply grep for @timestamp to get the data you need.
Note that in 1.6.0, there will be a way to filter out all the metadata (i.e. _index, _type, _id, _score) and only get the _source for a search result using the filter_path parameter in the URL.
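Since the stated goal is to judge how fresh the newest record is, the single hit returned by such a query can be turned into an age in seconds. The following is only a minimal Python sketch: it assumes the response was requested as JSON (without format=yaml) and has already been fetched and parsed. The function name newest_record_age and the sample dictionary are illustrative and not part of Elasticsearch.

```python
from datetime import datetime, timezone


def newest_record_age(response, now=None):
    """Return the age in seconds of the first hit's @timestamp.

    Assumes `response` is the parsed JSON body of a search that was
    sorted on @timestamp descending with size=1.
    """
    hits = response["hits"]["hits"]
    if not hits:
        return None  # empty result: nothing to measure
    stamp = hits[0]["_source"]["@timestamp"]
    # %z accepts offsets such as "+00:00" on Python 3.7 and later.
    newest = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")
    now = now or datetime.now(timezone.utc)
    return (now - newest).total_seconds()


# Hand-written stand-in for a real response, reduced to the fields used above.
sample = {
    "hits": {
        "hits": [
            {"_source": {"@timestamp": "2015-06-02T00:00:28.371+00:00"}}
        ]
    }
}
reference = datetime(2015, 6, 2, 0, 1, 28, 371000, tzinfo=timezone.utc)
print(newest_record_age(sample, now=reference))  # 60.0
```

Passing `now` explicitly keeps the example reproducible; in a monitoring script it would be left out so the current time is used.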

Related

How can I match fields with wildcards using jq?

I have a JSON object of the following form:
{
"Task11c-0-20181209-12:59:30-65611" : {
"attributes" : {
"configname" : "Task11c",
"datetime" : "20181209-12:59:30",
"experiment" : "Task11c",
"inifile" : "lab1.ini",
"iterationvars" : "",
"iterationvarsf" : "",
"measurement" : "",
"network" : "Manhattan1_1C",
"processid" : "65611",
"repetition" : "0",
"replication" : "#0",
"resultdir" : "results",
"runnumber" : "0",
"seedset" : "0"
},
......
},
......
"Task11b-12-20181209-13:03:17-65612" : {
....
....
},
.......
}
I reported only the first part, but in general I have many other sub-objects which match a string like Task11c-0-20181209-12:59:30-65611. They all have in common the initial word Task. I want to extract the processid from each sub-object. I'm trying to use a wildcard like in bash, but it seems not to be possible.
I also read about the match() function, but it works with strings and not json objects.
Thanks for the support.
Filter the keys that start with Task and get only the attribute of your choice using the select() expression:
jq 'to_entries[] | select(.key|startswith("Task")).value.attributes.processid' json
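For comparison, the same prefix filter can be sketched in Python. This assumes the whole document fits in memory. The helper name process_ids and the sample data are made up for illustration (the second object's contents are elided in the question), and unlike the jq filter this version silently skips entries that have no attributes object.

```python
import json


def process_ids(runs, prefix="Task"):
    """Collect attributes.processid from every top-level key starting with prefix."""
    return [
        entry["attributes"]["processid"]
        for key, entry in runs.items()
        if key.startswith(prefix) and "attributes" in entry
    ]


# Invented sample with the same shape as the object in the question.
document = json.loads("""
{
  "Task11c-0-20181209-12:59:30-65611": {"attributes": {"processid": "65611"}},
  "Task11b-12-20181209-13:03:17-65612": {"attributes": {"processid": "65612"}},
  "Other-0": {"attributes": {"processid": "1"}}
}
""")
print(process_ids(document))  # ['65611', '65612']
```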

Reg Expression to reduce huge file size?

I have a series of gigantic (40-80mb) exported Google Location History JSON files, with which I've been tasked to analyze select activity data. Unfortunately Google has no parameter or option at their download site to choose anything except "one giant JSON containing forever". (The KML option is twice as big.)
I tried the obvious choices, such as JSON-Converter (the laexcel-test incarnation of VBA-JSON), parsing line by line with VBA, and even Notepad++. They all crash and burn. I'm thinking RegEx might be the answer.
This Python script can extract the timestamp and location from a 40mb file in two seconds (with RegEx?). How is Python doing it so fast? (Would it be as fast in VBA?)
I'd be able to extract everything I need, piece by piece, if only I had a magic chunk of RegEx, perhaps with this logic:
Delete everything except:
When timestampMs and WALKING appear between the same set of [square brackets]:
I need the 13-digit number that follows timestampMs, and,
the one- to three-digit number that follows WALKING.
If it's simpler to include a little more data, like "all the timestamps", or "all activities", I could easily sift through it later. The goal is to make the file small enough that I can manipulate it without the need to rent a supercomputer, lol.
I tried adapting existing RegEx's but I have a serious issue with both RegEx and musical instruments: no matter how hard I try, I just can't wrap my head around it. So, this is indeed a "please write code for me" question, but it's just one expression, and I'll pay it forward by writing code for others today! Thanks... ☺
.
}, {
"timestampMs" : "1515564666086", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"activity" : [ {
"timestampMs" : "1515564665992", ◁― EXAMPLE: I want only this, and...
"activity" : [ {
"type" : "STILL",
"confidence" : 65
}, { ↓
"type" : "TILTING",
"confidence" : 4
}, {
"type" : "IN_RAIL_VEHICLE",
"confidence" : 20 ↓
}, {
"type" : "IN_ROAD_VEHICLE",
"confidence" : 5
}, {
"type" : "ON_FOOT", ↓
"confidence" : 3
}, {
"type" : "UNKNOWN",
"confidence" : 3
}, {
"type" : "WALKING", ◁―┬━━ ...AND, I also want this.
"confidence" : 3 ◁―┘
} ]
} ]
}, {
"timestampMs" : "1515564662594", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"altitude" : 42
}, {
Edit:
For testing purposes I made a sample file, representative of the original (except for the size). The raw JSON can be loaded directly from this Pastebin link, or downloaded as a local copy with this TinyUpload link, or copied as "one long line" below:
{"locations" : [ {"timestampMs" : "1515565441334","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 2299}, {"timestampMs" : "1515565288606","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42,"activity" : [ {"timestampMs" : "1515565288515","activity" : [ {"type" : "STILL","confidence" : 98}, {"type" : "ON_FOOT","confidence" : 1}, {"type" : "UNKNOWN","confidence" : 1}, {"type" : "WALKING","confidence" : 1} ]} ]}, {"timestampMs" : "1515565285131","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42}, {"timestampMs" : "1513511490011","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511369962","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511179720","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513511059677","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510928842","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510942911","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]}, {"timestampMs" : "1513510913776","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 15,"altitude" : -11,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513507320258","activity" : [ {"type" : "TILTING","confidence" : 100} ]} ]}, {"timestampMs" : "1513510898735","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510874140","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 19,"altitude" : -12,"verticalAccuracy" : 
2,"activity" : [ {"timestampMs" : "1513510874245","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]} ]}
The file tested as valid with JSONLint and FreeFormatter.
Obvious choices ...
The obvious choice here is a JSON-aware tool that can handle large files quickly. In the following, I'll use jq, which can easily handle gigabyte-size files quickly so long as there is sufficient RAM to hold the file in memory, and which can also handle very large files even if there is insufficient RAM to hold the JSON in memory.
First, let's assume that the file consists of an array of JSON objects of the form shown, and that the goal is to extract the two values for each admissible sub-object.
This is a jq program that would do the job:
.[].activity[]
| .timestampMs as $ts
| .activity[]
| select(.type == "WALKING")
| [$ts, .confidence]
For the given input, this would produce:
["1515564665992",3]
More specifically, assuming the above program is in a file named program.jq and that the input file is input.json, a suitable invocation of jq would be as follows:
jq -cf program.jq input.json
It should be easy to modify the jq program given above to handle other cases, e.g. if the JSON schema is more complex than has been assumed above. For example, if there is some irregularity in the schema, try sprinkling in some postfix ?s, e.g.:
.[].activity[]?
| .timestampMs as $ts
| .activity[]?
| select(.type? == "WALKING")
| [$ts, .confidence]
You may try this
(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$
See the Regex Demo. The pattern walks towards the two target values (the timestamp and the WALKING confidence) by way of the keywords "longitude", "activity", "[", "timestampMs", "]", "WALKING" and "confidence", and captures each value in its own group.
Python script
import re

ss = """ copy & paste the file contents' strings (above sample text) in this area """
regx = re.compile(r"(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$")
matching = regx.match(ss)  # method 1: read the capturing groups of the match object
timestamp = matching.group(1)
walkingval = matching.group(2)
print("\ntimestamp is %s\nwalking value is %s" % (timestamp, walkingval))
print("\n" + regx.sub(r'\1 \2', ss))  # method 2: rewrite the text with sub()
Output is
timestamp is 1515564665992
walking value is 3
1515564665992 3
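To collect every timestamp/confidence pair rather than a single one, a two-step search with re.finditer is one option. This is only a sketch: it assumes the keys always appear in the order shown in the export ("timestampMs" directly before the inner "activity" array, and "type" directly before "confidence"), which a regex cannot guarantee for arbitrary JSON. The names BLOCK, WALKING and walking_pairs are made up for this example.

```python
import re

# Inner activity record: a 13-digit timestampMs immediately followed by an
# "activity" array whose first element is a {"type": ...} object.
BLOCK = re.compile(
    r'"timestampMs"\s*:\s*"(\d{13})"\s*,\s*"activity"\s*:\s*\[\s*(\{\s*"type"[^\]]*)\]'
)
WALKING = re.compile(r'"type"\s*:\s*"WALKING"\s*,\s*"confidence"\s*:\s*(\d{1,3})')


def walking_pairs(text):
    """Yield (timestampMs, confidence) for every record with a WALKING entry."""
    for block in BLOCK.finditer(text):
        hit = WALKING.search(block.group(2))
        if hit:
            yield block.group(1), int(hit.group(1))


# Two locations taken from the sample file in the question, shortened.
sample = (
    '{"locations" : [ {"timestampMs" : "1515565288606","latitudeE7" : 123456789,'
    '"longitudeE7" : -123456789,"accuracy" : 12,"activity" : [ '
    '{"timestampMs" : "1515565288515","activity" : [ '
    '{"type" : "STILL","confidence" : 98}, {"type" : "WALKING","confidence" : 1} ]} ]}, '
    '{"timestampMs" : "1513510928842","latitudeE7" : 123456789,'
    '"longitudeE7" : -123456789,"activity" : [ '
    '{"timestampMs" : "1513510942911","activity" : [ '
    '{"type" : "STILL","confidence" : 100} ]} ]} ]}'
)
print(list(walking_pairs(sample)))  # [('1515565288515', 1)]
```

The outer location timestamps are skipped because they are followed by "latitudeE7" rather than by the inner "activity" array.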
Unfortunately it seems you've picked a language without a performant JSON parser.
With Python you could have:
#!/usr/bin/env python3
import json
import time


def get_history(filename):
    # Parse the whole export into memory in one go.
    with open(filename) as history_file:
        return json.load(history_file)


def walking_confidence(history):
    # Yield (timestampMs, confidence) for every activity record with a WALKING entry.
    for location in history["locations"]:
        if "activity" not in location:
            continue
        for outer_activity in location["activity"]:
            confidence = extract_walking_confidence(outer_activity)
            if confidence is not None:  # keep records whose confidence is 0
                timestampMs = int(outer_activity["timestampMs"])
                yield (timestampMs, confidence)


def extract_walking_confidence(outer_activity):
    # Return the WALKING confidence, or None if the record has no WALKING entry.
    for inner_activity in outer_activity["activity"]:
        if inner_activity["type"] == "WALKING":
            return inner_activity["confidence"]
    return None


if __name__ == "__main__":
    # time.clock() was removed in Python 3.8; perf_counter() replaces it here.
    start = time.perf_counter()
    history = get_history("history.json")
    middle = time.perf_counter()
    wcs = list(walking_confidence(history))
    end = time.perf_counter()
    print("load json: " + str(middle - start) + "s")
    print("loop json: " + str(end - middle) + "s")
On my 98MB JSON history this prints:
load json: 3.10292s
loop json: 0.338841s
That isn't terribly performant, but certainly not bad.

Curl Get Specific value from the output

I have a curl command whose output looks like this:
{
"page" : 1,
"records" : 1,
"total" : 1,
"rows" : [ {
"automated" : true,
"collectionProtocol" : "MagBead Standard Seq v2",
"comments" : "",
"copy" : false,
"createdBy" : "stest",
"custom1" : "User Defined Field 1=",
"custom2" : "User Defined Field 2=",
"custom3" : "User Defined Field 3=",
"custom4" : "User Defined Field 4=",
"custom5" : "User Defined Field 5=",
"custom6" : "User Defined Field 6=",
"description" : null,
"editable" : false,
"expanded" : false,
"groupName" : "99111",
"groupNames" : [ "all" ],
"inputCount" : 1,
"instrumentId" : 1,
"instrumentName" : "42223",
"jobId" : 11111,
"jobStatus" : "In Progress",
"leaf" : true,
"modifiedBy" : null,
"name" : "Copy_of_Test_Running2"
} ]
}
I want to extract only the value of jobId.
This output will be
11111
If there are multiple rows, there will be multiple jobId values:
11111
11112
11113
I want to extract only jobId and process in the while loop.
like below,
while read job; do
echo "$job"
done < <(curl command)
and I want to use that job id in another command.
That curl results could be multiple.
Do you have an idea for an easy way to extract the values from the curl output and loop over them with while or for?
I think jq (thanks to @Mircea) is a nice solution.
Besides, I can offer a simple awk solution, but it only works if the curl output is consistently formatted and contains no stray symbols.
So, use it with care:
while IFS= read -r line
do
    echo "$line" | awk -F':' '/jobId/{split($2,a,",");for(i in a){if(a[i]){printf("%d\n",a[i])}}}'
done < "$file"
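If a JSON parser is available, the values can also be pulled out without relying on the text layout. A minimal Python sketch, assuming the response always has the rows array shown in the question; the helper name job_ids and the shortened sample payload are illustrative only.

```python
import json


def job_ids(payload):
    """Return the jobId of every entry in the "rows" array of the response."""
    return [row["jobId"] for row in json.loads(payload).get("rows", [])]


# Trimmed-down stand-in for the curl output shown in the question.
sample = '{"page": 1, "records": 2, "total": 1, "rows": [{"jobId": 11111}, {"jobId": 11112}]}'

for job in job_ids(sample):
    print(job)
```

In a pipeline the payload would come from standard input instead of a literal, and the printed lines (one jobId each) can feed a while read loop.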

Elasticsearch queries on "empty index"

In my application I use several Elasticsearch indices that contain no indexed documents in their initial state. I consider that can be called "empty" :)
The document's mapping is correct and working.
The application also has a relational database that contain entities, that MIGHT have documents associated in elasticsearch.
In the initial state of the application it is very common that there are only entities without documents, so not a single document has been indexed, hence the "empty index". The index has been created nevertheless, and the document's mapping has been put to the index and is present in the index's metadata.
Anyway, when I query Elasticsearch with a SearchQuery to find a document for one of the entities (the document contains a unique id from the entity), Elasticsearch throws an ElasticSearchException that complains about no mapping being present for field xy, etc.
BUT IF I insert one single blank document into the index first, the query won't fail.
Is there a way to "initialize" an index in a way to prevent the query from failing and to get rid of the silly "dummy document workaround"?
UPDATE:
Plus, the workaround with the dummy doc pollutes the index, as for example a count query now returns always +1....so I added a deletion to the workaround as well...
Your question lacks details and is not clear. If you had provided the gist of your index schema and query, that would have helped. You should also have provided the version of Elasticsearch that you are using.
"No mapping" exception that you have mentioned has nothing to do with initializing the index with some data. Most likely you are sorting on the field which doesn't exist. This is common if you are querying multiple indexes at once.
Solution: The fix depends on your Elasticsearch version. If you are on 1.3.x or lower, use ignore_unmapped. If you are on 1.4 or later, use unmapped_type.
See the official Elasticsearch documentation on sorting for details.
If you find the documentation confusing, this example should make it clear:
Let's create two indexes, testindex1 and testindex2:
curl -XPUT localhost:9200/testindex1 -d '{"mappings":{"type1":{"properties":{"firstname":{"type":"string"},"servers":{"type":"nested","properties":{"name":{"type":"string"},"location":{"type":"nested","properties":{"name":{"type":"string"}}}}}}}}}'
curl -XPUT localhost:9200/testindex2 -d '{"mappings":{"type1":{"properties":{"firstname":{"type":"string"},"computers":{"type":"nested","properties":{"name":{"type":"string"},"location":{"type":"nested","properties":{"name":{"type":"string"}}}}}}}}}'
The only difference between these two indexes is that testindex1 has a "servers" field and testindex2 has a "computers" field.
Now let's insert test data in both the indexes.
Index test data on testindex1:
curl -XPUT localhost:9200/testindex1/type1/1 -d '{"firstname":"servertom","servers":[{"name":"server1","location":[{"name":"location1"},{"name":"location2"}]},{"name":"server2","location":[{"name":"location1"}]}]}'
curl -XPUT localhost:9200/testindex1/type1/2 -d '{"firstname":"serverjerry","servers":[{"name":"server2","location":[{"name":"location5"}]}]}'
Index test data on testindex2:
curl -XPUT localhost:9200/testindex2/type1/1 -d '{"firstname":"computertom","computers":[{"name":"computer1","location":[{"name":"location1"},{"name":"location2"}]},{"name":"computer2","location":[{"name":"location1"}]}]}'
curl -XPUT localhost:9200/testindex2/type1/2 -d '{"firstname":"computerjerry","computers":[{"name":"computer2","location":[{"name":"location5"}]}]}'
Query examples:
Using "unmapped_type" for elasticsearch version > 1.3.x
curl -XPOST 'localhost:9200/testindex2/_search?pretty' -d '{"fields":["firstname"],"query":{"match_all":{}},"sort":[{"servers.location.name":{"order":"desc","unmapped_type":"string"}}]}'
Using "ignore_unmapped" for elasticsearch version <= 1.3.5
curl -XPOST 'localhost:9200/testindex2/_search?pretty' -d '{"fields":["firstname"],"query":{"match_all":{}},"sort":[{"servers.location.name":{"order":"desc","ignore_unmapped":"true"}}]}'
Output of query1:
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : null,
"hits" : [ {
"_index" : "testindex2",
"_type" : "type1",
"_id" : "1",
"_score" : null,
"fields" : {
"firstname" : [ "computertom" ]
},
"sort" : [ null ]
}, {
"_index" : "testindex2",
"_type" : "type1",
"_id" : "2",
"_score" : null,
"fields" : {
"firstname" : [ "computerjerry" ]
},
"sort" : [ null ]
} ]
}
}
Output of query2:
{
"took" : 10,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : null,
"hits" : [ {
"_index" : "testindex2",
"_type" : "type1",
"_id" : "1",
"_score" : null,
"fields" : {
"firstname" : [ "computertom" ]
},
"sort" : [ -9223372036854775808 ]
}, {
"_index" : "testindex2",
"_type" : "type1",
"_id" : "2",
"_score" : null,
"fields" : {
"firstname" : [ "computerjerry" ]
},
"sort" : [ -9223372036854775808 ]
} ]
}
}
Note:
These examples were created on Elasticsearch 1.4.
These examples also demonstrate how to do sorting on nested fields.
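When the search body is assembled in application code, the sort entry can be generated so that unmapped_type is never forgotten. The following is a rough Python sketch that only builds the request body and does not talk to a cluster; the helper name sort_clause is hypothetical, and the field and type values are the ones from the example above.

```python
import json


def sort_clause(field, order="desc", unmapped_type=None):
    """Build one entry of the "sort" array for a search body."""
    options = {"order": order}
    if unmapped_type is not None:
        # Type to assume for the field in indexes that have no mapping for it.
        options["unmapped_type"] = unmapped_type
    return {field: options}


body = {
    "fields": ["firstname"],
    "query": {"match_all": {}},
    "sort": [sort_clause("servers.location.name", unmapped_type="string")],
}
print(json.dumps(body))
```

The printed JSON is the same body that is passed to curl with -d in the first query example.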
Are you doing a sort when you search? I've run into the same issue ("No mapping found for [field] in order to sort on"), but only when trying to sort results. In that case, the solution is simply to add the ignore_unmapped: true property to the sort parameter in your query:
{
...
"body": {
...
"sort": [
{"field_name": {
"order": "asc",
"ignore_unmapped": true
}}
]
...
}
...
}
I found my solution here:
No mapping found for field in order to sort on in ElasticSearch

MongoDB AND Comparison Fails

I have a Collection named StudentCollection with two documents given below,
> db.studentCollection.find().pretty()
{
"_id" : ObjectId("52d7c0c744b4dd77efe93df7"),
"regno" : 101,
"name" : "Ajeesh",
"gender" : "Male",
"docs" : [
"voterid",
"passport",
"drivinglic"
]
}
{
"_id" : ObjectId("52d7c6a144b4dd77efe93df8"),
"regno" : 102,
"name" : "Sathish",
"gender" : "Male",
"dob" : ISODate("2013-12-09T21:05:00Z")
}
Why does the query below return a document when it doesn't fulfil the criteria I gave in the find command? I know it's a bad and stupid query for an AND comparison. I tried the equivalent in MySQL and it returned nothing, as expected, so why does NoSQL behave differently? I suspect it only considers the last field for the comparison.
> db.studentCollection.find({regno:101,regno:102}).pretty()
{
"_id" : ObjectId("52d7c6a144b4dd77efe93df8"),
"regno" : 102,
"name" : "Sathish",
"gender" : "Male",
"dob" : ISODate("2013-12-09T21:05:00Z")
}
Can anyone briefly explain why MongoDB works this way?
MongoDB leverages JSON/BSON, and names within an object should be unique (http://www.ietf.org/rfc/rfc4627.txt, section 2.2). I found this in another post, "How to generate a JSON object dynamically with duplicate keys?". I am guessing the value for 'regno' gets overridden to 102 in your case.
If what you want is an OR query, try the following (regno is stored as a number, so the values are not quoted):
db.studentCollection.find ( { $or : [ { "regno" : 101 }, { "regno" : 102 } ] } );
Or even better, use $in:
db.studentCollection.find ( { "regno" : { $in: [101, 102] } } );
Hope this helps!
Edit : Typo!
MongoDB receives your query as a JavaScript object. Since the query does not use an explicit $and, the second regno entry overwrites the first, leaving only regno: 102. Hence you get the last document as the result.
If you want an explicit $and, you may use the following (it returns nothing here, because no document has both values):
db.studentCollection.find({$and:[{regno:102}, {regno:101}]});
A range query, by contrast, returns both documents:
db.studentCollection.find({regno:{$gte:101, $lte:102}});
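The overwrite described in these answers can be reproduced outside MongoDB. The sketch below uses Python purely as an illustration: a dict literal and Python's json parser both keep only the last value for a repeated name, which mirrors what happens to the query object before it reaches the server. No MongoDB driver is involved.

```python
import json

# A dict literal with a repeated key keeps only the last value.
query = {"regno": 101, "regno": 102}
print(query)  # {'regno': 102}

# Python's json module does the same when parsing duplicate names.
parsed = json.loads('{"regno": 101, "regno": 102}')
print(parsed)  # {'regno': 102}

# An explicit operator states the intent unambiguously.
either = {"regno": {"$in": [101, 102]}}
print(either)
```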