Split a big JSON file into small pieces without breaking the format

I'm using spark.read() to read a big JSON file on Databricks, and it failed with "spark driver has stopped unexpectedly and is restarting" after running for a long time. I assumed it was because the file is too big, so I decided to split it. I used this command:
split -b 100m -a 1 test.json
This did split my file into small pieces, and I can now read them on Databricks. But what I got back is a set of null values. I think that's because I split the file only by size, so some pieces are no longer valid JSON. For example, I might get something like this at the end of a file:
{"id":aefae3,......
Then it can't be read by spark.read.format("json"). So is there any way I can separate the JSON file into small pieces without breaking the JSON format?
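If the file is (or can be converted to) JSON Lines, i.e. one complete JSON object per line, which is the layout Spark's JSON reader expects by default, then splitting on line boundaries keeps every piece valid. A minimal sketch (the function name and size limit are illustrative, not from any library):

```python
import json


def split_jsonl(path, max_bytes, prefix):
    """Split a JSON Lines file on line boundaries so every piece stays valid.

    Assumes one complete JSON object per line; pieces are roughly max_bytes
    each, but a line is never cut in half.
    """
    outputs, out, size = [], None, 0
    with open(path) as src:
        for line in src:
            # start a new piece when the current one would exceed the limit
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                name = f"{prefix}{len(outputs)}.json"
                out = open(name, "w")
                outputs.append(name)
                size = 0
            out.write(line)
            size += len(line)
    if out:
        out.close()
    return outputs
```

For a newline-delimited file, `split -l <lines>` (split by line count instead of `-b` byte count) achieves the same thing from the shell.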

Related

Pyspark: how to read CSV file with additional lines

I have the following CSV file. It contains some additional strings between the valid rows; Excel seems to do a good job reading those (it just ignores the extras).
However, the story with Spark is a bit different.
I read it with spark.read.csv(path, header=True, multiLine=True, sep='|')
Is there some simple way to handle it?
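One cheap workaround is to pre-filter the raw lines before Spark parses them, keeping only lines whose separator count matches the header. This is a sketch under the assumption that stray strings don't contain the right number of separators and that no field embeds a quoted separator; the function name is illustrative:

```python
def keep_valid_rows(lines, sep="|"):
    """Yield the header plus only those lines whose field count matches it.

    Assumes no embedded separators inside quoted fields; stray text between
    valid rows is dropped because its separator count doesn't match.
    """
    it = iter(lines)
    header = next(it)
    n = header.count(sep)
    yield header
    for line in it:
        if line.count(sep) == n:
            yield line
```

You could run the file through this once, write the cleaned copy, and point spark.read.csv at the cleaned file.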

Spark File Format Escaping \n Loading CSV

I'm reading a pipe-delimited CSV data file using Spark. It's quote-qualified. A block of text has a \n in it, and it's corrupting the read. What I don't understand is that the text is quote-qualified, so surely the parser should just skip over it!? The rows themselves are CR+LF delimited.
Anyhow, it doesn't. How do I get around this? I can cleanse the newlines out on extract, but that doesn't seem very elegant to me.
This is what I'm using to load the data
val sch = spark.table("db.mytable").schema
val df = spark.read
  .format("csv")
  .schema(sch)
  .option("header", "true")
  .option("delimiter", "|")
  .option("mode", "PERMISSIVE")
  .option("quote", "\"")
  .load("/yadaydayda/mydata.txt")
Glad to know I'm not the only one who's dealt with this issue in Spark!
Spark reads files line by line, so CSVs with embedded newlines cause problems for the parser. Reading line by line makes it easier for Spark to handle large CSV files: scanning all of the content for matching quote pairs would significantly hurt performance, for a case that usually doesn't arise in high-performing analytics workloads.
For cases where I knew newlines were a possibility, I've used a third-party CSV parsing library: run the CSV "lines" through it (it handles the quoted newlines correctly), strip the newlines, write/cache the file somewhere, and read from that cached file. For a production use case, those files would be loaded into a database. For log files, or anything else you don't want in a database, using Parquet like you suggested works pretty well, or really just enforcing the absence of newlines somewhere before the files reach Spark.
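That preprocessing step can be sketched with Python's standard csv module, which parses quoted newlines correctly; the function name and the space replacement are illustrative choices, not from any particular library:

```python
import csv


def strip_embedded_newlines(src_path, dst_path, delimiter="|"):
    """Rewrite a quote-qualified CSV so newlines inside fields are replaced,
    leaving one physical line per record for Spark's line-based reader."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter, quotechar='"')
        writer = csv.writer(dst, delimiter=delimiter, quotechar='"')
        for row in reader:
            # replace embedded CR/LF inside fields with a plain space
            writer.writerow(
                [field.replace("\r", " ").replace("\n", " ") for field in row]
            )
```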
Got around this by initially stripping them on extract. The final solution I settled on, however, was to use the Parquet format on extract; then all these problems just go away.

Remove last few characters from big JSON file

I have several huge (>2GB) JSON files that end in ,\n]. Here is my test file example, which is the last 25 characters of a 2 GB JSON file:
test.json
":{"value":false}}}}}},
]
I need to delete the trailing ,\n while keeping the closing ]. The entire file is on three lines: the opening and closing brackets are each on their own line, and all the contents of the JSON array are on the second line.
I can't load the entire stream into memory to do something like:
string[0..-2]
because the file is way too large. I tried several approaches, including Ruby's:
chomp!(",\n]")
and UNIX's:
sed
both of which made no change to my JSON file. I viewed the last 25 characters by doing:
tail -c 25 filename.json
and also did:
ls -l
to verify that the byte size of the new and the old file versions were the same.
Can anyone help me understand why none of these approaches is working?
It's not necessary to read in the whole file if you're looking to make a surgical operation like this. Instead you can just overwrite the last few bytes in the file:
file = 'huge.json'
# the file ends in ",\n]": overwrite the comma with a space, in place
IO.write(file, " \n]", File.stat(file).size - 3)
The key here is to write exactly as many bytes as you back-track from the end; otherwise you'll need to trim the file length, which you can also do if necessary with truncate. Whitespace before the closing ] is legal JSON, so replacing the comma with a space leaves a valid file of the same size.
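The same in-place overwrite can be sketched in Python, assuming (as in the example above) that the file ends in exactly ,\n] with no trailing newline; the function name is illustrative:

```python
import os


def patch_tail(path):
    """Overwrite a trailing ',\\n]' with ' \\n]' in place.

    Replacing the comma with a space keeps the file length unchanged, so no
    truncation is needed, and the whole file never has to fit in memory.
    """
    with open(path, "r+b") as f:
        f.seek(-3, os.SEEK_END)
        if f.read(3) != b",\n]":
            raise ValueError("file does not end in ',\\n]'")
        f.seek(-3, os.SEEK_END)
        f.write(b" \n]")
```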

Spark: spark-csv takes too long

I am trying to create a DataFrame from a CSV source that is on S3 on an EMR Spark cluster, using the Databricks spark-csv package and the flights dataset:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('s3n://h2o-airlines-unpacked/allyears.csv')
df.first()
This does not terminate on a cluster of 4 m3.xlarge instances. I am looking for suggestions for creating a DataFrame from a CSV file on S3 in PySpark. Alternatively, I have tried putting the file on HDFS and reading from HDFS, but that also does not terminate. The file is not overly large (12 GB).
For reading a well-behaved CSV file that is only 12 GB, you can copy it onto all of your workers and the driver machine, then manually split on ",". This won't parse arbitrary RFC 4180 CSV, but it parsed what I had.
Add at least 12 GB of extra disk space for each worker when you requisition the cluster.
Use a machine type that has at least 12 GB of RAM, such as c3.2xlarge. Go bigger if you don't intend to keep the cluster idle and can afford the larger charges. Bigger machines mean less disk-to-disk copying to get started. I regularly see c3.8xlarge under $0.50/hour on the spot market.
Copy the file to each of your workers, into the same directory on each. This should be a physically attached drive, i.e. a different physical drive on each machine.
Make sure you have that same file and directory on the driver machine as well.
import re

raw = sc.textFile("/data.csv")
print("Counted %d lines in /data.csv" % raw.count())
raw_fields = raw.first()

# this regular expression is for quoted fields, i.e. "23","38","blue",...
matchre = r'^"(.*)"$'
pmatchre = re.compile(matchre)

def uncsv_line(line):
    return [pmatchre.match(s).group(1) for s in line.split(',')]

fields = uncsv_line(raw_fields)

def raw_to_dict(raw_line):
    return dict(zip(fields, uncsv_line(raw_line)))

parsedData = (raw
              .map(raw_to_dict)
              .cache())

print("Counted %d parsed lines" % parsedData.count())
parsedData will be an RDD of dicts, where the keys are the CSV field names from the first row and the values are the CSV values of the current row. If you don't have a header row in the CSV data, this may not be right for you, but you could override the code that reads the first line and set up the fields manually.
Note that this is not immediately useful for creating DataFrames or registering a Spark SQL table. But for anything else it is fine, and you can further extract and transform it into a better format if you need to dump it into Spark SQL.
I use this on a 7 GB file with no issues, except that I've removed some filter logic for detecting valid data, which as a side effect also removes the header from the parsed data. You might need to reimplement some filtering.

Cassandra RPC Timeout on import from CSV

I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should be able to suck this CSV in without a problem.
The Cassandra installation folder has a .yaml file with a setting for the RPC timeout, "rpc_timeout_in_ms"; you can raise the value and restart Cassandra.
But another way is to cut your big CSV into multiple files and import them one by one.
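Cutting the CSV into pieces can be sketched in Python; the important detail is repeating the header in every piece so each file can be imported independently with "with header = true". The function name and chunk size are illustrative:

```python
import csv


def split_csv(path, rows_per_file, prefix):
    """Split a CSV into smaller files, repeating the header in each piece."""
    outputs = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        out, writer, count = None, None, 0
        for row in reader:
            # start a new piece once the current one reaches its row limit
            if out is None or count >= rows_per_file:
                if out:
                    out.close()
                name = f"{prefix}{len(outputs)}.csv"
                out = open(name, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                outputs.append(name)
                count = 0
            writer.writerow(row)
            count += 1
        if out:
            out.close()
    return outputs
```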
This actually ended up being my own misinterpretation of COPY FROM, as the CSV was about 17 million rows. In this case the best option was to follow the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 smaller CSVs, which is an option.