I have a dump of the IMDb database in the form of a CSV file.
The CSV looks like this:
name, movie, role
"'El Burro' Van Rankin, Jorge","Serafín (1999)",PLAYED_IN
"'El Burro' Van Rankin, Jorge","Serafín (1999)",PLAYED_IN
"'El Burro' Van Rankin, Jorge","Serafín (1999)",PLAYED_IN
.........
"A.S., Alwi","Rumah masa depan (1984)",PLAYED_IN
"A.S., Giri","Sumangali (1940)",PLAYED_IN
"A.S., Luis","Bob the Drag Queen: Bloodbath (2016)",PLAYED_IN
"A.S., Pragathi","Suli (2016)",PLAYED_IN
"A.S.F. Dancers, The","D' Lucky Ones! (2006)",PLAYED_IN
.........
My goal is to put the data into Elasticsearch, but I don't want duplicate actors, so I want to aggregate the movies they played in so that each document in the dataset looks like this:
{
"_index": "imdb13",
"_type": "logs",
"_id": "AVmw9JHCrsOFTsZwAmBm",
"_score": 13.028783,
"_source": {
"#timestamp": "2017-01-18T09:42:15.149Z",
"movie": [
"Naomi and Ely's No Kiss List (2015)",
"Staten Island Summer (2015/II)",
"What Happened Last Night (2016)",
...
],
"#version": "1",
"name": "Abernethy, Kevin",
}
}
So I am using Logstash to push the data into Elasticsearch. I use the aggregate filter plugin, and my configuration file is the following:
input {
  file {
    path => "/home/maeln/imdb-data/roles.csv"
    start_position => "beginning"
  }
}

filter {
  csv {
    columns => [ "name", "movie" ]
    remove_field => ["role", "message", "host", "column3", "path"]
    separator => ","
  }

  aggregate {
    task_id => "%{name}"
    code => "
      map['movie'] ||= []
      event.to_hash.each do |key,value|
        map[key] = value unless map.has_key?(key)
        map[key] << value if map[key].is_a?(Array)
      end
    "
    push_previous_map_as_event => true
    timeout => 30
    timeout_tags => ['aggregated']
  }

  if "aggregated" not in [tags] {
    drop {}
  }
}

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "imdb13"
  }
}
But then, when I do a simple search on the index, all the actors are duplicated, each document holding only one movie in the "movie" field, like this:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 149,
"max_score": 13.028783,
"hits": [
{
"_index": "imdb13",
"_type": "logs",
"_id": "AVmw9JHCrsOFTsZwAmBm",
"_score": 13.028783,
"_source": {
"#timestamp": "2017-01-18T09:42:15.149Z",
"movie": [
"Naomi and Ely's No Kiss List (2015)"
],
"#version": "1",
"name": "Abernethy, Kevin",
"tags": [
"aggregated"
]
}
},
{
"_index": "imdb13",
"_type": "logs",
"_id": "AVmw9JHCrsOFTsZwAmBq",
"_score": 12.998644,
"_source": {
"#timestamp": "2017-01-18T09:42:15.149Z",
"movie": [
"Staten Island Summer (2015/II)"
],
"#version": "1",
"name": "Abernethy, Kevin",
"tags": [
"aggregated"
]
}
},
{
"_index": "imdb13",
"_type": "logs",
"_id": "AVmw9JHCrsOFTsZwAmBu",
"_score": 12.998644,
"_source": {
"#timestamp": "2017-01-18T09:42:15.150Z",
"movie": [
"What Happened Last Night (2016)"
],
"#version": "1",
"name": "Abernethy, Kevin",
"tags": [
"aggregated"
]
}
},
.....
Is there a way to fix this?
The log from Logstash with the --debug option (only partial; the whole log is around 1 GiB): paste (I put it on Pastebin because of the 30,000-character limit on Stack Overflow).
The last lines of the log:
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"path"}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"role"}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] Event after csv filter {:event=>2017-01-18T10:34:09.900Z %{host} %{message}}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"message"}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"path"}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"host"}
[2017-01-18T11:34:09,977][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.897Z, "movie"=>["Tayong dalawa (2009)"], "@version"=>"1", "name"=>"Anselmuccio, Alex", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] Event after csv filter {:event=>2017-01-18T10:34:09.915Z %{host} %{message}}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"column3"}
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.aggregate] Aggregate create_timeout_event call with task_id 'Anson, Christopher'
[2017-01-18T11:34:09,977][DEBUG][logstash.filters.csv ] filters/LogStash::Filters::CSV: removing field {:field=>"path"}
[2017-01-18T11:34:09,977][DEBUG][logstash.util.decorators ] filters/LogStash::Filters::Aggregate: adding tag {"tag"=>"aggregated"}
[2017-01-18T11:34:09,977][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.917Z, "movie"=>["Tabi tabi po! (2001)"], "@version"=>"1", "name"=>"Anson, Alvin", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.csv ] Event after csv filter {:event=>2017-01-18T10:34:09.921Z %{host} %{message}}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.aggregate] Aggregate successful filter code execution {:code=>"\n\t\t\t\tmap['movie'] ||= []\n\t\t\t\t\tevent.to_hash.each do |key,value|\n\t\t\t\t\tmap[key] = value unless map.has_key?(key)\n\t\t\t\t\tmap[key] << value if map[key].is_a?(Array)\n\t\t\t\tend\n\t\t\t\t"}
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.911Z, "movie"=>["21 Jump Street (1987)"], "@version"=>"1", "name"=>"Ansley, Zachary", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.aggregate] Aggregate create_timeout_event call with task_id 'Anseth, Elias Moussaoui'
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.897Z, "movie"=>["Tayong dalawa (2009)"], "@version"=>"1", "name"=>"Anselmuccio, Alex", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.util.decorators ] filters/LogStash::Filters::Aggregate: adding tag {"tag"=>"aggregated"}
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.917Z, "movie"=>["The Death Match: Fighting Fist of Samurai Joe (2013)"], "@version"=>"1", "name"=>"Anson, Alvin", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.aggregate] Aggregate successful filter code execution {:code=>"\n\t\t\t\tmap['movie'] ||= []\n\t\t\t\t\tevent.to_hash.each do |key,value|\n\t\t\t\t\tmap[key] = value unless map.has_key?(key)\n\t\t\t\t\tmap[key] << value if map[key].is_a?(Array)\n\t\t\t\tend\n\t\t\t\t"}
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.917Z, "movie"=>["The Diplomat Hotel (2013)"], "@version"=>"1", "name"=>"Anson, Alvin", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.aggregate] Aggregate create_timeout_event call with task_id 'Anson, Alvin'
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] output received {"event"=>{"@timestamp"=>2017-01-18T10:34:09.897Z, "movie"=>["Tayong dalawa (2009)"], "@version"=>"1", "name"=>"Anselmuccio, Alex", "tags"=>["aggregated"]}}
[2017-01-18T11:34:09,978][DEBUG][logstash.pipeline ] filter received {"event"=>{"path"=>"/home/maeln/Projets/oracle-of-bacon/imdb-data/roles.csv", "@timestamp"=>2017-01-18T10:34:09.900Z, "@version"=>"1", "host"=>"maeln-GE70-2PE", "message"=>"\"Ansfelt, Jacob\",\"Manden med de gyldne ører (2009)\",PLAYED_IN"}}
[2017-01-18T11:34:09,978][DEBUG][logstash.util.decorators ] filters/LogStash::Filters::Aggregate: adding tag {"tag"=>"aggregated"}
[2017-01-18T11:34:09,978][DEBUG][logstash.filters.aggregate] Aggregate successful filter code execution {:code=>"\n\t\t\t\tmap['movie'] ||= []\n\t\t\t\t\tevent.to_hash.each do |key,value|\n\t\t\t\t\tmap[key] = value unless map.has_key?(key)\n\t\t\t\t\tmap[key] << value if map[key].is_a?(Array)\n\t\t\t\tend\n\t\t\t\t"}
Pastebin with only the lines containing logstash.filters.aggregate: link
The issue you're facing relates to the fact that once a line is read, it is handed off to a filter+output worker thread.
If you have several CPUs, several of those threads will process your lines in parallel, so the output order is no longer guaranteed. More importantly, each aggregate map is local to a given worker thread, so it's entirely possible that several lines relating to the same actor (even if read in order) get processed by different threads in parallel, each building its own partial aggregate; that is why you see the same actor several times with only one movie each.
One solution would be to run Logstash with the -w 1 option so that only a single worker thread is created, but you'll decrease throughput by doing so.
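For example, a minimal sketch of such an invocation (the configuration file name imdb.conf is just a placeholder for wherever the pipeline above is saved):
bin/logstash -w 1 -f imdb.conf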
Related
I need to split the results of a SonarQube analysis history into individual files. Assuming the starting input below,
{
"paging": {
"pageIndex": 1,
"pageSize": 100,
"total": 3
},
"measures": [
{
"metric": "coverage",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "100.0"
},
{
"date": "2018-11-21T12:22:39+0000",
"value": "100.0"
},
{
"date": "2018-11-21T13:09:02+0000",
"value": "100.0"
}
]
},
{
"metric": "bugs",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "0"
},
{
"date": "2018-11-21T12:22:39+0000",
"value": "0"
},
{
"date": "2018-11-21T13:09:02+0000",
"value": "0"
}
]
},
{
"metric": "vulnerabilities",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "0"
},
{
"date": "2018-11-21T12:22:39+0000",
"value": "0"
},
{
"date": "2018-11-21T13:09:02+0000",
"value": "0"
}
]
}
]
}
How do I use jq to clean the results so that each element only retains the history entries for a given date? The desired output is something like this (output-20181118123808.json for the analysis done on "2018-11-18T12:37:08+0000"):
{
"paging": {
"pageIndex": 1,
"pageSize": 100,
"total": 3
},
"measures": [
{
"metric": "coverage",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "100.0"
}
]
},
{
"metric": "bugs",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "0"
}
]
},
{
"metric": "vulnerabilities",
"history": [
{
"date": "2018-11-18T12:37:08+0000",
"value": "0"
}
]
}
]
}
I am lost on how to operate only on the sub-elements while leaving the parent structure intact. The naming of the JSON files will be handled outside of the jq utility. The sample data provided will be split into 3 files. Other inputs can have a variable number of entries; some may have up to 10000. Thanks.
Here is a solution which uses awk to write the distinct files. The solution assumes that the dates for each measure are the same and in the same order, but imposes no limit on the number of distinct dates, or the number of distinct measures.
jq -c 'range(0; .measures[0].history|length) as $i
| (.measures[0].history[$i].date|gsub("[^0-9]";"")), # basis of filename
reduce range(0; .measures|length) as $j (.;
.measures[$j].history |= [.[$i]])' input.json |
awk -F\\t 'fn {print >> fn; fn="";next}{fn="output-" $1 ".json"}'
Comments
The choice of awk here is just for convenience.
The disadvantage of this approach is that if each file is to be neatly formatted, an additional run of a pretty-printer (such as jq) would be required for each file. Thus, if the output in each file is required to be neat, a case could be made for running jq once for each date, thus obviating the need for the post-processing (awk) step.
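For instance, a sketch of such a post-processing pass over the files written by the awk step (the output-*.json pattern simply matches the naming convention used above):
for f in output-*.json; do jq . "$f" > "$f.tmp" && mv "$f.tmp" "$f"; done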
If the dates of the measures are not in lock-step, then the same approach as above could still be used, but of course the gathering of the dates and the corresponding measures would have to be done differently.
Output
The first two lines produced by the invocation of jq above are as follows:
"201811181237080000"
{"paging":{"pageIndex":1,"pageSize":100,"total":3},"measures":[{"metric":"coverage","history":[{"date":"2018-11-18T12:37:08+0000","value":"100.0"}]},{"metric":"bugs","history":[{"date":"2018-11-18T12:37:08+0000","value":"0"}]},{"metric":"vulnerabilities","history":[{"date":"2018-11-18T12:37:08+0000","value":"0"}]}]}
In the comments, the following addendum to the original question appeared:
is there a variation wherein the filtering is based on the date value and not the position? It is not guaranteed that the order will be the same or the number of elements in each metric is going to be the same (i.e. some dates may be missing "bugs", some might have additional metric such as "complexity").
The following will produce a stream of JSON objects, one per date. This stream can be annotated with the date as per my previous answer, which shows how to use these annotations to create the various files. For ease of understanding, we use two helper functions:
def dates:
INDEX(.measures[].history[].date; .)
| keys;
def gather($date): map(select(.date==$date));
dates[] as $date
| .measures |= map( .history |= gather($date) )
INDEX/2
If your jq does not have INDEX/2, now would be an excellent time to upgrade, but in case that's not feasible, here is its def:
def INDEX(stream; idx_expr):
reduce stream as $row ({};
.[$row|idx_expr|
if type != "string" then tojson
else .
end] |= $row);
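Putting the pieces together, here is a sketch of a complete invocation in the style of the first solution, assuming INDEX/2 is available (or defined as above); the filename annotation and the awk step are carried over from the first solution as assumptions, not part of the helper functions themselves:
jq -c 'def dates: INDEX(.measures[].history[].date; .) | keys;
       def gather($date): map(select(.date==$date));
       dates[] as $date
       | ($date | gsub("[^0-9]";"")),
         (.measures |= map(.history |= gather($date)))' input.json |
awk -F\\t 'fn {print >> fn; fn="";next}{fn="output-" $1 ".json"}'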
So I basically have JSON output from the JIRA Insights API. I've been digging around and found jq for parsing the JSON, but I'm struggling to wrap my head around how to parse the following so that it only returns values for the objectTypeAttributeIds I am interested in.
For example, I'm only interested in the value of objectTypeAttributeId 887, provided that objectTypeAttributeId 911's status name is "Active", but then I would also like to return the name value of another objectTypeAttributeId.
Can this be achieved using jq only? Or should I be using something else?
I can filter down to this level, which is the 'attributes' section of the JSON output, and print each value, but I'm struggling to find an example that caters to my situation.
{
"id": 137127,
"objectTypeAttributeId": 887,
"objectAttributeValues": [
{
"value": "false"
}
],
"objectId": 9036,
"position": 16
},
{
"id": 137128,
"objectTypeAttributeId": 888,
"objectAttributeValues": [
{
"value": "false"
}
],
"objectId": 9036,
"position": 17
},
{
"id": 137296,
"objectTypeAttributeId": 911,
"objectAttributeValues": [
{
"status": {
"id": 1,
"name": "Active",
"category": 1
}
}
],
"objectId": 9036,
"position": 18
},
Can this be achieved using jq only?
Yes, jq was designed precisely for this kind of query. In your case, you could use any, select and if ... then ... else ... end, along the lines of:
if any(.[]; .objectTypeAttributeId == 911 and
any(.objectAttributeValues[]; .status.name == "Active"))
then map(select(.objectTypeAttributeId == 887))
else "whatever"
end
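For instance, assuming the attribute objects shown in the question are wrapped in a JSON array stored in a file (hypothetically named attributes.json here), an invocation could look like:
jq 'if any(.[]; .objectTypeAttributeId == 911 and
            any(.objectAttributeValues[]; .status.name == "Active"))
    then map(select(.objectTypeAttributeId == 887))
    else "whatever"
    end' attributes.json
The "whatever" branch is just a placeholder for whatever you want to emit when attribute 911 is not Active.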
I'm looking for a way to parse raw JSON into CSV, and I'm a total novice with anything related to coding or programming. I've found a site, https://json-csv.com/, that does exactly what I need, but the data sets I'm parsing are bigger than their free tier allows, so I basically pay $10 a month for something I believe could be done with a macro or something else I could figure out.
I'm essentially looking for a quick way to parse the chunk below into structured, column-based data. The columns would be: Key, Value, Context_Geography, Context_CompanyID, Context_ProductID, Description, Created by, Updated by, updated date.
{"policies":[{"key":"viaPayEnabledRates","value":"","context":{"geography":"","companyID":"","productID":""},"created_by":"0","updated_by":"0","updated_date":"2014-03-24T21:22:25.420+0000"},{"key":"viaPayEnabledRates","value":"[\"WSPNConsortia\",\"WSPNNegotiated\",\"WSPNPublished\"]","context":{"geography":"","companyID":"*","productID":"60003"},"description":"Central Payment Pilot","created_by":"10130590","updated_by":"10130590","updated_date":"2016-04-05T07:51:29.043+0000"}
Here is a solution using jq
If the file filter.jq contains
def headers:
[
"Key", "Value", "Context_Geography", "Context_CompanyID", "Context_ProductID",
"Description", "Created by", "Updated by", "updated date"
]
;
def fields:
[
.key, .value, .context.geography, .context.companyID, .context.productID,
.description, .created_by, .updated_by, .updated_date
]
;
headers, (.policies[] | fields)
| @csv
and the file data.json contains your sample data
{
"policies": [
{
"key": "viaPayEnabledRates",
"value": "",
"context": {
"geography": "",
"companyID": "",
"productID": ""
},
"created_by": "0",
"updated_by": "0",
"updated_date": "2014-03-24T21:22:25.420+0000"
},
{
"key": "viaPayEnabledRates",
"value": "[\"WSPNConsortia\",\"WSPNNegotiated\",\"WSPNPublished\"]",
"context": {
"geography": "",
"companyID": "*",
"productID": "60003"
},
"description": "Central Payment Pilot",
"created_by": "10130590",
"updated_by": "10130590",
"updated_date": "2016-04-05T07:51:29.043+0000"
}
]
}
then running jq as
jq -M -r -f filter.jq data.json
produces the output
"Key","Value","Context_Geography","Context_CompanyID","Context_ProductID","Description","Created by","Updated by","updated date"
"viaPayEnabledRates","","","","",,"0","0","2014-03-24T21:22:25.420+0000"
"viaPayEnabledRates","[""WSPNConsortia"",""WSPNNegotiated"",""WSPNPublished""]","","*","60003","Central Payment Pilot","10130590","10130590","2016-04-05T07:51:29.043+0000"
I am using jq 1.4 on a 64-bit Windows machine.
Below are the contents of the input file IP.txt:
{
"results": [
{
"name": "Google",
"employees": [
{
"name": "Michael",
"division": "Engineering"
},
{
"name": "Laura",
"division": "HR"
},
{
"name": "Elise",
"division": "Marketing"
}
]
},
{
"name": "Microsoft",
"employees": [
{
"name": "Brett",
"division": "Engineering"
},
{
"name": "David",
"division": "HR"
}
]
}
]
}
{
"results": [
{
"name": "Amazon",
"employees": [
{
"name": "Watson",
"division": "Marketing"
}
]
}
]
}
The file contains two "results" documents. The first contains information for two companies, Google and Microsoft; the second contains information for Amazon.
I want to convert this JSON into a CSV file with company name and employee name:
"Google","Michael"
"Google","Laura"
"Google","Elise"
"Microsoft","Brett"
"Microsoft","David"
"Amazon","Watson"
I am able to write the script below:
jq -r "[.results[0].name,.results[0].employees[0].name]|@csv" IP.txt
"Google","Michael"
"Amazon","Watson"
Can someone guide me on writing the script without hardcoding the index values?
The script should be able to generate output for any number of results, each containing information for any number of companies.
I tried the script below, which didn't generate the expected output:
jq -r "[.results[].name,.results[].employees[].name]|@csv" IP.txt
"Google","Microsoft","Michael","Laura","Elise","Brett","David"
"Amazon","Watson"
You need to flatten down the results first into rows of company and employee names. Then with that, you can convert to CSV rows.
map(.results | map({ cn: .name, en: .employees[].name } | [ .cn, .en ])) | add[] | #csv
Since you have a stream of inputs, you'll have to slurp (-s) them in. Since you want to output CSV, you'll want to use raw output (-r).
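Putting the filter and those two flags together (the filter is quoted with double quotes, as in the question, since it is being run on Windows):
jq -s -r "map(.results | map({ cn: .name, en: .employees[].name } | [ .cn, .en ])) | add[] | @csv" IP.txt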
EDIT 2014-05-01: I tried fromJSON first (as suggested below), but that only parsed the first line. I found out that there were commas missing between the brackets of each JSON line, so I changed that in TextEdit and saved the file. I also added [ at the beginning of the file and ] at the end, and then it worked with fromJSON. Now the next step: from a list (with embedded lists) to a data frame (or CSV).
I get a data package from edX every now and then on the courses we are evaluating. Some of these are just plain .csv files, which are quite easy to handle; others are more difficult for me (not having a CS or programming background).
I have 2 files I want to open and parse into CSV files for analysis in R. I have tried many, many JSON-to-CSV tools out there, but to no avail. I also tried the simple methods described here to turn JSON into CSV.
The data is confidential, so I cannot share the entire data set, but I will share the first two lines of the file; maybe that helps. The problem is that I cannot find anything about .mongo files anywhere, which seems quite strange to me. Do they even exist? Or is this just a JSON file that may be corrupted (which could explain the errors)?
Any suggestions are welcome.
The first 2 lines in one of the .mongo files:
{
"_id": {
"$oid": "52d1e62c350e7a3156000009"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
],
"at_position_list": [
],
"body": "the delft university accredited course with the scholarship (fundamentals of water treatment) is supposed to start in about a month's time. But have the scholarship list been published? Any tentative date??",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"author_id": "269835",
"comment_thread_id": {
"$oid": "52cd40c5ab40cf347e00008d"
},
"author_username": "tachak59",
"sk": "52d1e62c350e7a3156000009",
"updated_at": {
"$date": 1389487660636
},
"created_at": {
"$date": 1389487660636
}
}{
"_id": {
"$oid": "52d0a66bcb3eee318d000012"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "52c63278100c07c0d1000028"
}
],
"at_position_list": [
],
"body": "I got it. Thank you!",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "52c63278100c07c0d1000028"
},
"author_id": "2655027",
"comment_thread_id": {
"$oid": "52c4f303b03c4aba51000013"
},
"author_username": "dmoronta",
"sk": "52c63278100c07c0d1000028-52d0a66bcb3eee318d000012",
"updated_at": {
"$date": 1389405803386
},
"created_at": {
"$date": 1389405803386
}
}{
"_id": {
"$oid": "52ceea0cada002b72c000059"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "5287e8d5906c42f5aa000013"
}
],
"at_position_list": [
],
"body": "if u please send by mail \n",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "5287e8d5906c42f5aa000013"
},
"author_id": "2276302",
"comment_thread_id": {
"$oid": "528674d784179607d0000011"
},
"author_username": "totah1993",
"sk": "5287e8d5906c42f5aa000013-52ceea0cada002b72c000059",
"updated_at": {
"$date": 1389292044203
},
"created_at": {
"$date": 1389292044203
}
}
R doesn't have "native" support for these files, but there is a JSON parser in the rjson package. So I might load my .mongo file with:
library(rjson)

myfile <- "path/to/myfile.mongo"
# rjson's fromJSON expects a single string, so collapse the lines returned by readLines
myJSON <- paste(readLines(myfile), collapse = "")
myNiceData <- fromJSON(myJSON)
Since rjson converts the JSON into a data structure that matches the object being read, you'll have to do some additional snooping, but once you have an R data type you shouldn't have any trouble working with it from there.
Another package to consider when parsing JSON data is jsonlite. It will build data frames for you, so you can write them out in CSV format with write.table or some other applicable method for writing objects.
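For instance, a minimal sketch with jsonlite, assuming the file has already been repaired into a valid JSON array as described in the question's edit (the file names here are placeholders):
library(jsonlite)
# flatten = TRUE turns nested objects (votes, _id, ...) into dotted column names
comments <- fromJSON("repaired.json", flatten = TRUE)
# list columns (such as the empty arrays) cannot be written to CSV directly, so keep the atomic ones
write.csv(comments[sapply(comments, is.atomic)], "comments.csv", row.names = FALSE)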
NOTE: if it is easier to connect to MongoDB and get the data from a query, then RMongo may be a good bet. R-bloggers also has a post about using RMongo with a nice little walkthrough.
I used rjson as suggested by @theWanderer and, with the help of a colleague, wrote the following code to parse the data into columns, choosing the specific columns that are needed and checking each of the instances to see whether they return the right variables.
Entire workflow:
Checked some of the data in JSONLint and corrected the errors: },{ instead of }{ between each record, and [ and ] at the beginning and end of the file
Made a smaller file to play with, containing about 11 JSON lines
Used the code below to parse the data file, first checking that the different listItems are not lists themselves (that causes problems). As you will see, I also removed things like \n because that gave errors, and added an empty value for parent_id if there is none in the data (otherwise it would mix up the data)
The code to import the .mongo file into R and then parse it into CSV:
library(rjson)
###### set working directory to write out the data file
setwd("/your/favourite/dir/json to csv/")
#never ever convert strings to factors
options(stringsAsFactors = FALSE)
#import the .mongo file to R
temp.data = fromJSON(file="temp.mongo", method="C", unexpected.escape="error")
file.remove("temp.csv") ## removes the old datafile if there is one
## (so the data is not appended to the file,
## but a new file is created)
listItem = temp.data[[1]] ## prepare the listItem the first time
for (listItem in temp.data){
  parent_id = ""
  if (length(listItem$parent_id)>0){
    parent_id = listItem$parent_id
  }
  write.table(t(c(
    listItem$votes$up_count, listItem$visible, parent_id,
    gsub("\n", "", listItem$body), listItem$course_id, unlist(listItem["_type"]),
    listItem$endorsed, listItem$anonymous, listItem$author_id,
    unlist(listItem$comment_thread_id), listItem$author_username,
    as.POSIXct(unlist(listItem$created_at)/1000, origin="1970-01-01"))), # end t(), c()
    file="temp.csv", sep="\t", append=TRUE, row.names=FALSE, col.names=FALSE)
}