I'm pretty new to Spark and to teach myself I have been using small json files, which work perfectly. I'm using Pyspark with Spark 2.2.1 However I don't get how to read in a single data line instead of the entire json file. I have been looking for documentation on this but it seems pretty scarce. I have to process a single large (larger than my RAM) json file (wikipedia dump: https://archive.org/details/wikidata-json-20150316) and want to do this in chuncks or line by line. I thought Spark was designed to do just that but can't find out how to do it and when I request the top 5 observations in a naive way I run out of memory. I have tried RDD .
SparkRDD= spark.read.json("largejson.json").rdd
SparkRDD.take(5)
and Dataframe
SparkDF= spark.read.json("largejson.json")
SparkDF.show(5,truncate = False)
So in short:
1) How do I read in just a fraction of a large JSON file? (Show first 5 entries)
2) How do I filter a large JSON file line by line to keep just the required results?
Also: I don't want to predefine the datascheme for this to work.
I must be overlooking something.
Thanks
Edit: With some help I have gotten a look at the first observation but it by itself is already too huge to post here so I'll just put a fraction of it here.
[
{
"id": "Q1",
"type": "item",
"aliases": {
"pl": [{
"language": "pl",
"value": "kosmos"
}, {
"language": "pl",
"value": "\\u015bwiat"
}, {
"language": "pl",
"value": "natura"
}, {
"language": "pl",
"value": "uniwersum"
}],
"en": [{
"language": "en",
"value": "cosmos"
}, {
"language": "en",
"value": "The Universe"
}, {
"language": "en",
"value": "Space"
}],
...etc
That's very similar to Select only first line from files under a directory in pyspark
Hence something like this should work :
def read_firstline(filename):
with open(filename, 'rb') as f:
return f.readline()
# files is a list of filenames
rdd_of_firstlines = sc.parallelize(files).flatMap(read_firstline)
Related
I am trying to read a JSON file which is 370 MB
import json
data = open( "data.json" ,"r")
json.loads(data.read())
and it's not possible to easily find the root cause of the following error,
json.decoder.JSONDecodeError: Extra data: line 1 column 1024109 (char 1024108)
I looked at similar questions and tried the following StackOverflow answer
import json
data = [json.loads(line) for line in open('data.json', 'r')]
But it didn't resolve the issue. I am wondering if there is any solution to find where the error happens in the file. I am getting some other files from the same source and they run without any problem.
A small piece of the Json file is a list of dicts like,
{
"uri": "p",
"source": {
"uri": "dail",
"dataType": "pr",
"title": "Daily"
},
"authors": [
{
"type": "author",
"isAgency": false
}
],
"concepts": [
{
"amb": false,
"imp": true,
"date": "2019-05-23",
"textStart": 2459,
"textEnd": 2467
},
{
"amb": false,
"imp": true,
"date": "2019-05-09",
"textStart": 2684,
"textEnd": 2691
}
],
"shares": {},
"wgt": 100,
"relevance": 100
}
The problem with json library is loaded everything to memory and parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.
Instead I would suggest to take a look at https://github.com/henu/bigjson
import bigjson
with open('data.json', 'rb') as f:
json_data = bigjson.load(f)
I'm currently trying to do a basic JSON file import into my ELK stack. I tried importing it directly via a POST request like this:
curl -XPOST http://localhost:9200/kwd_results/TS_Cart -d #/home/local/TS_Cart.json
ES says ok for the import, but when I'm trying to view the logs in Kibanna, they are not indexed by the nodes of the JSON file. I'm guessing I need like a template mapping to view it properly.
My JSON file looks like this:
{
"testResults": {
"FitNesseVersion": "v20160618",
"rootPath": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart",
"result": [
{
"counts": {
"right": "16",
"wrong": "2",
"ignores": "3",
"exceptions": "1"
},
"date": "2017-05-10T00:01:11+02:00",
"runTimeInMillis": "117242",
"relativePageName": "TestCase_1",
"pageHistoryLink": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart.B2CFreeCatalogueOrder?pageHistory&resultDate=20170510000111",
"tags": "de, at"
},
{
"counts": {
"right": "16",
"wrong": "0",
"ignores": "0",
"exceptions": "0"
},
"date": "2017-05-10T00:03:08+02:00",
"runTimeInMillis": "85680",
"relativePageName": "TestCase_2",
"pageHistoryLink": "K1System.CountryDe.DriverFirefox.TestCases.MainFolder.TestVariants.SmokeTests_B2C.TS_Cart.B2CGiftCardOrderWithAdvancePayment?pageHistory&resultDate=20170510000308",
"tags": "at, de"
}
],
"finalCounts": {
"right": "4",
"wrong": "1",
"ignores": "0",
"exceptions": "0"
},
"totalRunTimeInMillis": "482346"
}
}
Basically I would need rootPath to be used as an index, while having the following childs: counts, relativePageName, date and tags. Notice that I have two nodes that are childs of the result[] array.
Any help would be greatly appreciated!
Thank you.
Well, it's one JSON document so Elasticsearch treats it as such.
You'll need to (programmatically) split up the document into the right documents and then you can store them (potentially with one _bulk request).
For the index name:
Must be lowercase, so you'll need to cast that value.
Will you have many different root paths with jut a few docs each? Then you shouldn't make all of them an index since there is an overhead for each one of them (actually the underlying shards).
I'm experiencing issues when loading JSON which are dependent on formatting of input JSON file.
According to Spark documentation on JSON Datasets, each line on input file must be a valid JSON Object. re:
"Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail."
So, if I have an input JSON file such as:
{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}
Are there any existing tools or scripts to convert to:
{"Year": "2013","First Name": "DAVID","County": "KINGS","Sex": "M","Count":"272"},
{"Year": "2013","First Name": "JAYDEN","County": "KINGS","Sex": "M","Count": "268"}
where the JSON conforms to "Each line must contain a separate, self-contained valid JSON object"
If I format to this style above, things work as expected. But, I made these mods manually over a few rows. I cannot do this for entire data set, so looking for an existing script or tool.
OR
I could load to JDBC available database if that's a better option. Thoughts?
Thanks in advance
You can simply load the JSON files into an RDD first using sc.wholeTextFiles() and remove the file name column, then run the SQLContext read on the RDD contents.
e.g.
val jsonRdd = sc.wholeTextFiles("samplefile.json").map(x => x._2)
val jsonDf = sqlContext.read.json(jsonRdd)
What if you make it an array by adding square brackets. like this;
[
{
"Year": "2013",
"FName": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
},
{
"Year": "2013",
"FName": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}
]
If I take your file and add the brackets I can iterate through it with Node.js and output a file that looks like what you want. The caveat in node.js being I cant have variable First Name-- I had to change it to FName.
I have a large JSON file that looks similar to the code below. Is there anyway I can iterate through each object, look for the field "element_type" (it is not present in all objects in the file if that matters) and extract or write each object with the same element type to a file? For example each user would end up in a file called user.json and each book in a file called book.json?
I thought about using javascript but to my knowledge js can't write to files, I also tried to do it using linux command line tools by removing all new lines, then inserting a new line after each "}," and then iterating through each line to find the element type and write it to a file. This worked for most of the data; however, where there were objects like the "problem_type" below, it inserted a new line in the middle of the data due to the nested json in the "times" element. I've run out of ideas at this point.
{
"data": [
{
"element_type": "user",
"first": "John",
"last": "Doe"
},
{
"element_type": "user",
"first": "Lucy",
"last": "Ball"
},
{
"element_type": "book",
"name": "someBook",
"barcode": "111111"
},
{
"element_type": "book",
"name": "bookTwo",
"barcode": "111111"
},
{
"element_type": "problem_type",
"name": "problem object",
"times": "[{\"start\": \"1230\", \"end\": \"1345\", \"day\": \"T\"}, {\"start\": \"1230\", \"end\": \"1345\", \"day\": \"R\"}]"
}
]
}
I would recommend Java for this purpose. It sounds like you're running on Linux so it should be a good fit.
You'll have no problems writing to files. And you can use a library like this - http://json-lib.sourceforge.net/ - to gain access to things like JSONArray and JSONObject. Which you can easily use to iterate through the data in your JSON request, and check what's in "element_type" and write to a file accordingly.
EDIT 2014-05-01: I tried fromJSON first (as suggested below), but that only parsed the first line. I found out that there were commas missing between the brackets of each JSON line so I changed that in TextEdit and saved the file. I also added [ at the beginning of the file and ] at the end and then it worked with JSON. Now the next step: from a list (with embedded lists) to a dataframe (or csv).
I get a data package from edX every now and then on the courses we are evaluating. Some of these are just plain .csv files which are quite easy to handle, others are more difficult for me (not having a CS or programming background).
I have 2 files I want to open and parse into csv files for analysis in R. I have tried many many json2csv tools out there, but to no avail. I also tried the simple methods described here to turn json into csv.
The data is confidential, so I cannot share the entire data set, but will share the first two lines of the file, maybe that helps. The problem is that nowhere I find anything about .mongo files, which to me seems quite strange, do they even exist? Or is this just a JSON file that may be corrupted (which could explain the errors)?
Any suggestions are welcome.
The first 2 lines in one of the .mongo files:
{
"_id": {
"$oid": "52d1e62c350e7a3156000009"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
],
"at_position_list": [
],
"body": "the delft university accredited course with the scholarship (fundamentals of water treatment) is supposed to start in about a month's time. But have the scholarship list been published? Any tentative date??",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"author_id": "269835",
"comment_thread_id": {
"$oid": "52cd40c5ab40cf347e00008d"
},
"author_username": "tachak59",
"sk": "52d1e62c350e7a3156000009",
"updated_at": {
"$date": 1389487660636
},
"created_at": {
"$date": 1389487660636
}
}{
"_id": {
"$oid": "52d0a66bcb3eee318d000012"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "52c63278100c07c0d1000028"
}
],
"at_position_list": [
],
"body": "I got it. Thank you!",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "52c63278100c07c0d1000028"
},
"author_id": "2655027",
"comment_thread_id": {
"$oid": "52c4f303b03c4aba51000013"
},
"author_username": "dmoronta",
"sk": "52c63278100c07c0d1000028-52d0a66bcb3eee318d000012",
"updated_at": {
"$date": 1389405803386
},
"created_at": {
"$date": 1389405803386
}
}{
"_id": {
"$oid": "52ceea0cada002b72c000059"
},
"votes": {
"up": [
],
"down": [
],
"up_count": 0,
"down_count": 0,
"count": 0,
"point": 0
},
"visible": true,
"abuse_flaggers": [
],
"historical_abuse_flaggers": [
],
"parent_ids": [
{
"$oid": "5287e8d5906c42f5aa000013"
}
],
"at_position_list": [
],
"body": "if u please send by mail \n",
"course_id": "DelftX/CTB3365x/2013_Fall",
"_type": "Comment",
"endorsed": false,
"anonymous": false,
"anonymous_to_peers": false,
"parent_id": {
"$oid": "5287e8d5906c42f5aa000013"
},
"author_id": "2276302",
"comment_thread_id": {
"$oid": "528674d784179607d0000011"
},
"author_username": "totah1993",
"sk": "5287e8d5906c42f5aa000013-52ceea0cada002b72c000059",
"updated_at": {
"$date": 1389292044203
},
"created_at": {
"$date": 1389292044203
}
}
R doesn't have "native" support for these files but there is a JSON parser with the rjson package. So I might load my .mongo file with:
myfile <- "path/to/myfile.mongo"
myJSON <- readLines(myfile)
myNiceData <- fromJSON(myJSON)
Since RJson converts into a data structure that fits the object being read, you'll have to do some additional snooping but once you have an R data type you shouldn't have any trouble working with it from there.
Another package to consider when parsing JSON data is jsonlite. It will make data frames for you so you can write them to a csv format with write.table or some other applicable method for writing objects.
NOTE: if it is easier to connect to the MongoDB and get the data from a request, then RMongo may be a good bet. The R-Bloggers also made a post about using RMongo that has a nice little walkthrough.
I used RJSON as suggested by #theWanderer and with the help of a colleague wrote the following code to parse the data into columns, choosing the specific columns that are needed, and checking each of the instances if they return the right variables.
Entire workflow:
Checked some of the data in jsonlint - corrected the errors → },{ instead of }{ between each line and [ and ] at the beginning and end of the file
Made a smaller file to play with, containing about 11 JSON lines
Used the code below to parse the datafile - however, checking the different listItems first if they are not lists themselves (that gives problems) // as you will see, I also removed things like \n because that gave errors and added an empty value for parent_id if there is none in the data (otherwise it would mix up the data)
The code to import the .mongo file into R and then parse it into CSV:
library(rjson)
###### set working directory to write out the data file
setwd("/your/favourite/dir/json to csv/")
#never ever convert strings to factors
options(stringsAsFactors = FALSE)
#import the .mongo file to R
temp.data = fromJSON(file="temp.mongo", method="C", unexpected.escape="error")
file.remove("temp.csv") ## removes the old datafile if there is one
## (so the data is not appended to the file,
## but a new file is created)
listItem = temp.data[[1]] ## prepare the listItem the first time
for (listItem in temp.data){
parent_id = ""
if (length(listItem$parent_id)>0){
parent_id = listItem$parent_id
}
write.table(t(c(
listItem$votes$up_count, listItem$visible, parent_id,
gsub("\n", "", listItem$body), listItem$course_id, unlist(listItem["_type"]),
listItem$endorsed, listItem$anonymous, listItem$author_id,
unlist(listItem$comment_thread_id), listItem$author_username,
as.POSIXct(unlist(listItem$created_at)/1000, origin="1970-01-01"))), # end t(), c()
file="temp.csv", sep="\t", append=TRUE, row.names=FALSE, col.names=FALSE)
}