I have a series of gigantic (40-80mb) exported Google Location History JSON files, with which I've been tasked to analyze select activity data. Unfortunately Google has no parameter or option at their download site to choose anything except "one giant JSON containing forever". (The KML option is twice as big.)
The obvious choices, like JSON-Converter (laexcel-test incarnation of VBA-JSON), parsing line by line with VBA, and even Notepad++, all crash and burn. I'm thinking RegEx might be the answer.
This Python script can extract the timestamp and location from a 40mb file in two seconds (with RegEx?). How is Python doing it so fast? (Would it be as fast in VBA?)
I'd be able to extract everything I need, piece by piece, if only I had a magic chunk of RegEx, perhaps with this logic:
Delete everything except:
When timestampMs and WALKING appear between the *same set of [square brackets] :
I need the 13-digit number that follows timestampMs, and,
the one- to three- digit number that follows WALKING.
If it's simpler to include a little more data, like "all the timestamps", or "all activities", I could easily sift through it later. The goal is to make the file small enough that I can manipulate it without the need to rent a supercomputer, lol.
I tried adapting existing RegEx's but I have a serious issue with both RegEx and musical instruments: no matter how hard I try, I just can't wrap my head around them. So, this is indeed a "please write code for me" question, but it's just one expression, and I'll pay it forward by writing code for others today! Thanks... ☺
.
}, {
"timestampMs" : "1515564666086", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"activity" : [ {
"timestampMs" : "1515564665992", ◁― EXAMPLE: I want only this, and...
"activity" : [ {
"type" : "STILL",
"confidence" : 65
}, { ↓
"type" : "TILTING",
"confidence" : 4
}, {
"type" : "IN_RAIL_VEHICLE",
"confidence" : 20 ↓
}, {
"type" : "IN_ROAD_VEHICLE",
"confidence" : 5
}, {
"type" : "ON_FOOT", ↓
"confidence" : 3
}, {
"type" : "UNKNOWN",
"confidence" : 3
}, {
"type" : "WALKING", ◁―┬━━ ...AND, I also want this.
"confidence" : 3 ◁―┘
} ]
} ]
}, {
"timestampMs" : "1515564662594", ◁― (Don't need this but it won't hurt)
"latitudeE7" : -6857630899,
"longitudeE7" : -1779694452999,
"altitude" : 42
}, {
Edit:
For testing purposes I made a sample file, representative of the original (except for the size). The raw JSON can be loaded directly from this Pastebin link, or downloaded as a local copy with this TinyUpload link, or copied as "one long line" below:
{"locations" : [ {"timestampMs" : "1515565441334","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 2299}, {"timestampMs" : "1515565288606","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42,"activity" : [ {"timestampMs" : "1515565288515","activity" : [ {"type" : "STILL","confidence" : 98}, {"type" : "ON_FOOT","confidence" : 1}, {"type" : "UNKNOWN","confidence" : 1}, {"type" : "WALKING","confidence" : 1} ]} ]}, {"timestampMs" : "1515565285131","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 12,"velocity" : 0,"heading" : 350,"altitude" : 42}, {"timestampMs" : "1513511490011","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511369962","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 25,"altitude" : -9,"verticalAccuracy" : 2}, {"timestampMs" : "1513511179720","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513511059677","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510928842","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513510942911","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]}, {"timestampMs" : "1513510913776","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 15,"altitude" : -11,"verticalAccuracy" : 2,"activity" : [ {"timestampMs" : "1513507320258","activity" : [ {"type" : "TILTING","confidence" : 100} ]} ]}, {"timestampMs" : "1513510898735","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 16,"altitude" : -12,"verticalAccuracy" : 2}, {"timestampMs" : "1513510874140","latitudeE7" : 123456789,"longitudeE7" : -123456789,"accuracy" : 19,"altitude" : -12,"verticalAccuracy" : 
2,"activity" : [ {"timestampMs" : "1513510874245","activity" : [ {"type" : "STILL","confidence" : 100} ]} ]} ]}
The file tested as valid with JSONLint and FreeFormatter.
Obvious choices ...
The obvious choice here is a JSON-aware tool that can handle large files quickly. In the following, I'll use jq, which easily handles gigabyte-size files so long as there is sufficient RAM to hold the file in memory, and which can also handle very large files even if there is insufficient RAM to hold the JSON in memory (via its streaming parser, the --stream option).
First, let's assume that the file consists of an array of JSON objects of the form shown, and that the goal is to extract the two values for each admissible sub-object.
This is a jq program that would do the job:
.[].activity[]
| .timestampMs as $ts
| .activity[]
| select(.type == "WALKING")
| [$ts, .confidence]
For the given input, this would produce:
["1515564665992",3]
More specifically, assuming the above program is in a file named program.jq and that the input file is input.json, a suitable invocation of jq would be as follows:
jq -cf program.jq input.json
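It should be easy to modify the jq program given above to handle other cases, e.g. if the JSON schema is more complex than has been assumed above. (If the array sits under a top-level key, as with the "locations" key in the question's sample file, the first line would start with .locations[] instead of .[].) If there is some irregularity in the schema, try sprinkling in some postfix ?s, e.g.: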
.[].activity[]?
| .timestampMs as $ts
| .activity[]?
| select(.type? == "WALKING")
| [$ts, .confidence]
You may try this
(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$
Regex Demo: the pattern works its way to the two target values (the timestamp and the WALKING confidence) by stepping through the keywords "longitude", "activity", "[", "timestampMs", "]", "WALKING" and "confidence".
Python script
import re

# Paste the file contents (e.g. the sample text above) between the triple quotes.
ss = """ copy & paste the file contents' strings (above sample text) in this area """
regx = re.compile(r"(?s)^.*?\"longitude[^\[]*?\"activity[^\[]*\[[^\]]*?timestampMs\"[^\"\]]*\"(\d+)\"[^\]]*WALKING[^\]]*?confidence\"\s*:\s*(\b\d{1,3}\b)[^\]]*?\].*$")
matching = regx.match(ss)  # method 1: read the capturing groups from match()
timestamp = matching.group(1)
walkingval = matching.group(2)
print("\ntimestamp is %s\nwalking value is %s" % (timestamp, walkingval))
print("\n" + regx.sub(r'\1 \2', ss))  # method 2: collapse the text to the two values with sub()
Output is
timestamp is 1515564665992
walking value is 3
1515564665992 3
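The pattern above is anchored to the start and end of the text, so as written it reports a single pair. If every WALKING record is wanted, one option is to scan with re.finditer instead. What follows is only a sketch, assuming each inner record has the layout of the sample file, i.e. "timestampMs" immediately followed by its "activity" array. The names INNER, WALKING and walking_pairs are made up for illustration.

```python
import re

# Inner record: a 13-digit timestamp directly followed by its "activity" array.
# (Assumption: the outer location timestamp is followed by "latitudeE7", so it is skipped.)
INNER = re.compile(r'"timestampMs"\s*:\s*"(\d{13})"\s*,\s*"activity"\s*:\s*\[([^\]]*)\]')
# A WALKING entry and its confidence inside that array.
WALKING = re.compile(r'"type"\s*:\s*"WALKING"\s*,\s*"confidence"\s*:\s*(\d{1,3})')


def walking_pairs(text):
    """Yield (timestamp, confidence) for every inner record that lists WALKING."""
    for record in INNER.finditer(text):
        hit = WALKING.search(record.group(2))
        if hit:
            yield record.group(1), int(hit.group(1))


sample = ('{"locations" : [ {"timestampMs" : "1515565288606","latitudeE7" : 123456789,'
          '"activity" : [ {"timestampMs" : "1515565288515","activity" : [ '
          '{"type" : "STILL","confidence" : 98}, {"type" : "WALKING","confidence" : 1} ]} ]} ]}')
print(list(walking_pairs(sample)))
```

Splitting the work into two small patterns keeps each one readable, at the cost of relying on the key order seen in the sample.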
Unfortunately it seems you've picked a language without a performant JSON parser.
With Python you could have:
#!/usr/bin/env python3
import json
import time


def get_history(filename):
    # Parse the whole export into memory in one go.
    with open(filename) as history_file:
        return json.load(history_file)


def walking_confidence(history):
    # Yield (timestampMs, confidence) for every activity record that lists WALKING.
    for location in history["locations"]:
        if "activity" not in location:
            continue
        for outer_activity in location["activity"]:
            confidence = extract_walking_confidence(outer_activity)
            if confidence is not None:  # keep records whose confidence is 0
                timestampMs = int(outer_activity["timestampMs"])
                yield (timestampMs, confidence)


def extract_walking_confidence(outer_activity):
    # Return the WALKING confidence, or None if WALKING is not listed.
    for inner_activity in outer_activity["activity"]:
        if inner_activity["type"] == "WALKING":
            return inner_activity["confidence"]
    return None


if __name__ == "__main__":
    # time.clock() was removed in Python 3.8; perf_counter() is used for timing instead.
    start = time.perf_counter()
    history = get_history("history.json")
    middle = time.perf_counter()
    wcs = list(walking_confidence(history))
    end = time.perf_counter()
    print("load json: " + str(middle - start) + "s")
    print("loop json: " + str(end - middle) + "s")
On my 98MB JSON history this prints:
load json: 3.10292s
loop json: 0.338841s
That isn't terribly performant, but certainly not bad.
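Since the stated goal is a file small enough to open elsewhere, the extracted pairs could be written straight to CSV. This is a minimal sketch, assuming the pairs have the (timestampMs, confidence) shape yielded by walking_confidence and that timestampMs counts milliseconds since the Unix epoch. The function name write_pairs_csv and the column headers are illustrative choices, not part of the script above.

```python
import csv
import io
from datetime import datetime, timezone


def write_pairs_csv(pairs, out):
    """Write (timestamp_ms, confidence) pairs as CSV rows with a readable UTC time."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timestamp_ms", "utc_time", "walking_confidence"])
    for timestamp_ms, confidence in pairs:
        # Assumption: the value is milliseconds since the Unix epoch.
        utc = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
        writer.writerow([timestamp_ms, utc.strftime("%Y-%m-%d %H:%M:%S"), confidence])


# Demo with an in-memory buffer; a real run would pass an open file instead.
buffer = io.StringIO()
write_pairs_csv([(1515565288515, 1)], buffer)
print(buffer.getvalue())
```

When writing to a real file, the csv module documentation recommends opening it with newline="" so that line endings are not doubled on Windows.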
I have a shell script running on Unix that goes through a list of JSON objects like the following, collecting values like <init>() # JSONInputData.java:82. There are also other objects with other values that I need to retrieve.
Is there a better option than grepping for "STACKTRACE_LINE",\n\s*.* and then splitting up that result?
inb4: "add X package to the OS". Need to run generically.
. . .
"probableStartLocationView" : {
"lines" : [ {
"fragments" : [ {
"type" : "STACKTRACE_LINE",
"value" : "<init>() # JSONInputData.java:82"
} ],
"text" : "<init>() # JSONInputData.java:82"
} ],
"nested" : false
},
. . . .
What if I was looking for "description" : "Dangerous Data Received" in a series of objects like the following knowing that I need to know that it is associated with event 12345 and not with another event listed in the same file?
. . .
"events" : [ {
"id" : "12345",
"important" : true,
"type" : "Creation",
"description" : "Dangerous Data Received",
. . .
Is there a better option than grepping for "STACKTRACE_LINE",\n\s*.* and then splitting up that result?
Yes. Use jq to filter and extract the interesting parts.
Example 1, given this JSON:
{
"probableStartLocationView": {
"lines": [
{
"fragments": [
{
"type": "STACKTRACE_LINE",
"value": "<init>() # JSONInputData.java:82"
}
],
"text": "<init>() # JSONInputData.java:82"
}
],
"nested": false
}
}
Extract value where type is "STACKTRACE_LINE":
jq -r '.probableStartLocationView.lines[] | .fragments[] | select(.type == "STACKTRACE_LINE") | .value' file.json
This is going to produce one line per value.
Example 2, given this JSON:
{
"events": [
{
"id": "12345",
"important": true,
"type": "Creation",
"description": "Dangerous Data Received"
}
]
}
Extract the id where description starts with "Dangerous":
jq -r '.events[] | select(.description | startswith("Dangerous")) | .id' file.json
And so on.
See the jq manual for more examples and capabilities.
Also there are many questions on Stack Overflow using jq,
that should help you find the right combination of filtering and extracting the relevant parts.
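If jq cannot be installed on the target machines, a Python interpreter is often already present. That is an assumption about the environment, not something the question confirms. Here is a rough sketch of the first extraction using only the standard library (the helper name stacktrace_values is hypothetical):

```python
import json


def stacktrace_values(document):
    """Yield the value of every fragment whose type is STACKTRACE_LINE."""
    view = document.get("probableStartLocationView", {})
    for line in view.get("lines", []):
        for fragment in line.get("fragments", []):
            if fragment.get("type") == "STACKTRACE_LINE":
                yield fragment["value"]


# Inline copy of the Example 1 document, so the sketch runs on its own.
text = """
{"probableStartLocationView": {"lines": [{"fragments": [
    {"type": "STACKTRACE_LINE", "value": "<init>() # JSONInputData.java:82"}],
  "text": "<init>() # JSONInputData.java:82"}], "nested": false}}
"""
for value in stacktrace_values(json.loads(text)):
    print(value)
```

Like the jq filter, this prints one line per matching value. The .get() calls with defaults let it skip objects that lack the expected keys.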
I have a JSON like this
{
"images" : [
{
"size" : "29x29",
"idiom" : "iphone",
"filename" : "Icon-Small@2x.png",
"scale" : "2x"
}
......
......
{
"size" : "60x60",
"idiom" : "iphone",
"filename" : "Icon-60@3x.png",
"scale" : "3x"
}
],
"info" : {
"version" : 1,
"author" : "xcode"
}
}
I want to iterate through each dictionary in images array.
For that I wrote
declare -a images=($(cat Contents.json | jq ".images[]"))
for image in "${images[@]}"
do
echo "image --$image"
done
I expect each iteration to print one whole dictionary, that is:
image --{
"size" : "29x29",
"idiom" : "iphone",
"filename" : "Icon-Small@2x.png",
"scale" : "2x"
}
image --{
"size" : "29x29",
"idiom" : "iphone",
"filename" : "Icon-Small@3x.png",
"scale" : "3x"
}
image --{
"size" : "40x40",
"idiom" : "iphone",
"filename" : "Icon-Spotlight-40@2x.png",
"scale" : "2x"
}
Etc
But it's iterating through every single element in each dictionary, like
image --{
image --"size":
image --"29x29",
image --"idiom":
image --"iphone",
image --"filename":
....
....
....
What is wrong with my code?
The problem with your code is that an array initialization in bash looks like this:
declare -a arr=(item1 item2 item3)
Items are separated by space or newline. You can also use:
declare -a arr=(
item1
item2
item3
)
However, the jq output in the example contains both spaces and newlines, which is why the reported behaviour is to be expected.
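The effect can be reproduced outside the shell. The sketch below uses Python's str.split(), which only approximates bash word splitting, to show why pretty-printed output falls apart into tokens while compact one-object-per-line output (what jq -c emits) stays whole. The images list is a cut-down stand-in for the real file.

```python
import json

# Cut-down stand-in for the "images" array in Contents.json.
images = [{"size": "29x29", "idiom": "iphone"}]

# Pretty-printed output, as jq prints by default: splitting on whitespace
# breaks every object into many small tokens.
pretty_tokens = json.dumps(images[0], indent=2).split()
print(pretty_tokens)

# One compact object per line: splitting on newlines keeps each object whole,
# because JSON strings never contain a raw newline.
compact_text = "\n".join(json.dumps(image, separators=(",", ":")) for image in images)
compact_lines = compact_text.split("\n")
print(compact_lines)
```

The first list starts with '{', '"size":' and '"29x29",', the same fragments the loop in the question printed.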
Workaround:
I would get the keys first, pipe them to a read loop and then call jq for each item of the list:
jq -r '.images|keys[]' Contents.json | while read key ; do
echo "image --$(jq ".images[$key]" Contents.json)"
done
You can also use this jq command if you don't care about pretty printing:
jq -r '.images[]|"image --" + tostring' Contents.json
To access a certain property of the subarray you can use:
jq -r '.images|keys[]' Contents.json | while read key ; do
echo "image --$(jq ".images[$key].filename" Contents.json)"
done
The above code will print the filename property for each node, for example.
However this can be expressed much simpler using jq only:
jq -r '.images[]|"image --" + .filename' Contents.json
Or even simpler:
jq -r '"image --\(.images[].filename)"' Contents.json
I have a curl command; when I run it, the output is as below:
{
"page" : 1,
"records" : 1,
"total" : 1,
"rows" : [ {
"automated" : true,
"collectionProtocol" : "MagBead Standard Seq v2",
"comments" : "",
"copy" : false,
"createdBy" : "stest",
"custom1" : "User Defined Field 1=",
"custom2" : "User Defined Field 2=",
"custom3" : "User Defined Field 3=",
"custom4" : "User Defined Field 4=",
"custom5" : "User Defined Field 5=",
"custom6" : "User Defined Field 6=",
"description" : null,
"editable" : false,
"expanded" : false,
"groupName" : "99111",
"groupNames" : [ "all" ],
"inputCount" : 1,
"instrumentId" : 1,
"instrumentName" : "42223",
"jobId" : 11111,
"jobStatus" : "In Progress",
"leaf" : true,
"modifiedBy" : null,
"name" : "Copy_of_Test_Running2"
} ]
}
I want to extract only the value of jobId.
The output would be
11111
If there are multiple rows, then there are multiple jobIds:
11111
11112
11113
I want to extract only the jobId values and process them in a while loop,
like below:
while read job; do
echo $job
done < <(curl command)
and I want to use each job id in another command.
The curl command may return multiple results.
Do you have an idea for an easy way to extract the values from the curl output and feed them to a while or for loop?
I think jq (thanks to @Mircea) is a nice solution.
Besides that, I can offer a simple awk solution, but it only works if the format of the curl output is regular and contains no stray symbols.
So, be careful when using this:
while IFS= read -r line
do
echo "$line" | awk -F':' '/jobId/{split($2,a,",");for(i in a){if(a[i]){printf("%d\n",a[i])}}}'
done < "$file"
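If Python is available, the response can also be parsed as JSON instead of being matched as text. This is a minimal sketch, assuming the response always has a top-level "rows" array as in the sample. job_ids is a hypothetical helper name.

```python
import json


def job_ids(response_text):
    """Return the jobId of every row in the curl response, in order."""
    response = json.loads(response_text)
    return [row["jobId"] for row in response.get("rows", []) if "jobId" in row]


# Shortened stand-in for the curl output shown in the question.
sample = '{"page": 1, "records": 2, "rows": [{"jobId": 11111, "leaf": true}, {"jobId": 11112}]}'
for job in job_ids(sample):
    print(job)
```

In a pipeline, the same function could be fed sys.stdin.read() and print one id per line, which a while read loop can then consume.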
I'm trying to determine the freshness of the most recent record in my logstash cluster, but I'm having a bit of trouble digesting the Elasticsearch DSL.
Right now I am doing something like this to extract the timestamp:
curl -sX GET 'http://localhost:9200/logstash-2015.06.02/' -d'{"query": {"match_all": {} } }' | json_pp | grep timestamp
which gets me:
"@timestamp" : "2015-06-02T00:00:28.371+00:00",
I'd like to use an elasticsearch query directly with no grep hackiness.
The raw JSON (snipped for length) looks like this:
{
"took" : 115,
"timed_out" : false,
"hits" : {
"hits" : [
{
"_index" : "logstash-2015.06.02",
"_source" : {
"type" : "syslog",
"@timestamp" : "2015-06-02T00:00:28.371+00:00",
"tags" : [
"sys",
"inf"
],
"message" : " 2015/06/02 00:00:28 [INFO] serf: EventMemberJoin: generichost.example.com 10.1.1.10",
"file" : "/var/log/consul.log",
"@version" : 1,
"host" : "generichost.example.com"
},
"_id" : "AU4xcf51cXOri9NL1hro",
"_score" : 1,
"_type" : "syslog"
},
],
"total" : 8605141,
"max_score" : 1
},
"_shards" : {
"total" : 50,
"successful" : 50,
"failed" : 0
}
}
Any help would be appreciated. I know the query is simple, I just don't know what it is.
You don't need to use the DSL for this. You can simply cram everything into the URL query string, like this:
curl -s -XGET 'localhost:9200/logstash-2015.06.02/_search?_source=@timestamp&size=1&sort=@timestamp:desc&format=yaml'
So:
_source=@timestamp means we're only interested in getting the @timestamp value
size=1 means we only need one result
sort=@timestamp:desc means we want to sort on @timestamp descending (i.e. latest first)
format=yaml will get you the result in YAML format which is a bit more concise than JSON in your case
The output would look like this:
- _index: "logstash-2015.06.02"
_type: "syslog"
_id: "AU4xcf51cXOri9NL1hro"
_score: 1.0
_source:
@timestamp: "2015-06-02T00:00:28.371+00:00"
You don't need json_pp anymore, but you can still simply grep for @timestamp to get the data you need.
Note that in 1.6.0, there will be a way to filter out all the metadata (i.e. _index, _type, _id, _score) and only get the _source for a search result using the filter_path parameter in the URL.
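If the response is requested as JSON rather than YAML, the value can be picked out without grep. This is a sketch, assuming the response has the hits.hits[]._source shape shown in the question and that the field is named @timestamp, the usual Logstash name. latest_timestamp is an illustrative name, and the query string reuses the parameters of the curl call minus format=yaml.

```python
import json
from urllib.parse import urlencode

# Same parameters as the curl example, without format=yaml so the reply is JSON.
query = urlencode({"_source": "@timestamp", "size": 1, "sort": "@timestamp:desc"})


def latest_timestamp(response_text):
    """Return @timestamp of the first hit, or None when there are no hits."""
    hits = json.loads(response_text)["hits"]["hits"]
    return hits[0]["_source"]["@timestamp"] if hits else None


# Trimmed stand-in for a search response.
sample = '{"hits": {"hits": [{"_source": {"@timestamp": "2015-06-02T00:00:28.371+00:00"}}]}}'
print(query)
print(latest_timestamp(sample))
```

Because the search is sorted descending with size=1, the first hit is the most recent record.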