How to get the source file of an RDD in Spark - JSON

I am playing with Spark RDDs and JSON files, and I am doing something like the following:
val uisJson5 = sqlContext.read.json(
  sc.textFile("s3n://localtion/*")
    .filter(line =>
      line.contains("\"xyz\":\"A\"")
        && line.contains("\"id\":\"adasdfasdfasd\"")
    ))
uisJson5.show()
I also want to know which source JSON files the results are coming from. Is there any way I can do this?
Edit:
I was able to do it using the code below:
val uisJson1 = sc.textFile("s3n://localtion/*")
  .filter(line => line.contains("\"xyz\":\"A\"")
    && line.contains("\"id\":\"adasdfasdfasd\""))
uisJson1.collect().foreach(println)

You are looking for wholeTextFiles along with flatMapValues.
wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
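For example, here is a minimal PySpark sketch of that idea (the Scala API is analogous; the path and filter strings simply mirror the question):

# (filename, content) pairs, one record per file
files = sc.wholeTextFiles("s3n://localtion/*")

# split each file's content into lines, keeping the source filename attached to every line
lines = files.flatMapValues(lambda content: content.splitlines())

# apply the original filters; each surviving record is (source_filename, matching_json_line)
matches = lines.filter(lambda fn_line: '"xyz":"A"' in fn_line[1]
                       and '"id":"adasdfasdfasd"' in fn_line[1])
matches.collect()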

Related

Python: Reading and Writing HUGE Json files

I am new to Python, so please excuse me if I am not asking the questions in a Pythonic way.
My requirements are as follows:
I need to write Python code to implement this requirement.
I will be reading 60 JSON files as input. Each file is approximately 150 GB.
The sample structure for all 60 JSON files is shown below. Please note that each file will have only ONE JSON object, and the huge size of each file comes from the number and size of the "array_element" array contained in that one huge JSON object.
{
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"string_1":"abc",
"array_element":[]
}
The transformation logic is simple: I need to merge all the array_element arrays from all 60 files and write them into one HUGE JSON file. That is, the output JSON file will be almost 150 GB x 60 in size.
Questions for which I am requesting your help on:
For reading: I am planning to use the "ijson" module's ijson.items(file_object, "array_element"). Could you please tell me whether ijson.items will "yield" (that is, NOT load the entire file into memory) one item at a time from the "array_element" array in the JSON file? I don't think json.load is an option here, because we cannot hold such a huge dictionary in memory.
For writing: I am planning to read each item using ijson.items, "encode" it with json.dumps, and then write it to the file using file_object.write, NOT json.dump, since I cannot hold such a huge dictionary in memory. Could you please let me know whether the f.flush() in the code shown below is needed? To my understanding, the internal buffer is flushed automatically when it is full, and its size is constant and won't dynamically grow to an extent that it overloads the memory. Please let me know.
Is there any better approach than the ones mentioned above for incrementally reading and writing huge JSON files?
Code snippet showing the reading and writing logic described above:
import ijson
import json

for input_file in input_files:
    with open(input_file, "r") as f:
        objects = ijson.items(f, "array_element")
        for item in objects:
            item_str = json.dumps(item, indent=2)
            with open("output.json", "a") as out:
                out.write(item_str)
                out.write(",\n")
                out.flush()

with open("output.json", "a") as f:
    f.seek(0, 2)
    f.truncate(f.tell() - 1)
    f.write("]\n}")
Hope I have asked my questions clearly. Thanks in advance!!
The following program assumes that the input files have a format that is predictable enough to skip JSON parsing for the sake of performance.
My assumptions, inferred from your description, are:
All files have the same encoding.
All files have a single position somewhere at the start where "array_element":[ can be found, after which the "interesting portion" of the file begins
All files have a single position somewhere at the end where ]} marks the end of the "interesting portion"
All "interesting portions" can be joined with commas and still be valid JSON
When all of these points are true, concatenating a predefined header fragment, the respective file ranges, and a footer fragment would produce one large, valid JSON file.
import re
import mmap

head_pattern = re.compile(br'"array_element"\s*:\s*\[\s*', re.S)
tail_pattern = re.compile(br'\s*\]\s*\}\s*$', re.S)

input_files = ['sample1.json', 'sample2.json']

with open('result.json', "wb") as result:
    head_bytes = 500
    tail_bytes = 50
    chunk_bytes = 16 * 1024

    result.write(b'{"JSON": "fragment", "array_element": [\n')

    for input_file in input_files:
        print(input_file)
        with open(input_file, "r+b") as f:
            mm = mmap.mmap(f.fileno(), 0)
            start = head_pattern.search(mm[:head_bytes])
            end = tail_pattern.search(mm[-tail_bytes:])
            if not (start and end):
                print('unexpected file format')
                break
            start_pos = start.span()[1]
            end_pos = mm.size() - end.span()[1] + end.span()[0]
            if input_files.index(input_file) > 0:
                result.write(b',\n')
            pos = start_pos
            mm.seek(pos)
            while True:
                if pos + chunk_bytes >= end_pos:
                    result.write(mm.read(end_pos - pos))
                    break
                else:
                    result.write(mm.read(chunk_bytes))
                    pos += chunk_bytes

    result.write(b']\n}')
If the file format is 100% predictable, you can throw out the regular expressions and use mm[:head_bytes].index(b'...') etc for the start/end position arithmetic.
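For instance, a rough sketch of that index()-based variant, assuming the hypothetical literal markers below appear verbatim near the start and end of every file, would replace the two regex searches:

# Hypothetical fixed markers; adjust them to the exact bytes present in your files.
head_marker = b'"array_element":['
tail_marker = b']}'

head_chunk = mm[:head_bytes]   # first 500 bytes as plain bytes
tail_chunk = mm[-tail_bytes:]  # last 50 bytes as plain bytes

# absolute offset just past the opening marker
start_pos = head_chunk.index(head_marker) + len(head_marker)
# absolute offset of the closing marker
end_pos = mm.size() - tail_bytes + tail_chunk.rindex(tail_marker)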

Python3 Replacing special character from .csv file after convert the same from JSON

I am trying to develop a program using Python 3.6.4 which converts a JSON file into a CSV file, and I also need to clean the data in the CSV file. For example:
My JSON File:
{emp:[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting that into CSV, it is creating a blank row between every two rows, as shown:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I am using emp but I need to use it dynamically.
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
We will not know the structure or content of the upcoming JSON files in advance.
Problem 3:
I also need to remove all # characters from the CSV file.
For solving Problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For problem 2, you can use the dictionary keys and then get the zeroth item out of it.
import json

fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read().replace("#", "")
print(jsonCont)
fobj.close()

employee_parsed = json.loads(jsonCont)
first_key = list(employee_parsed.keys())[0]
emp_data = employee_parsed[first_key]
I can't solve Problem 1 without more code to see how you are exporting the result. It may be that your data has newlines in it; in that case, you could add .replace("\n", "") and/or .replace("\r", "") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", "").
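If the CSV is being written with Python's csv module, the blank rows are often just the classic Windows newline issue rather than newlines in the data; opening the output file with newline='' usually fixes it. Here is a minimal sketch, assuming the parsed JSON resolves to a list of flat dicts (the file names are illustrative):

import csv
import json

with open("jsonSample.txt") as f:
    parsed = json.loads(f.read().replace("#", ""))

first_key = list(parsed.keys())[0]  # e.g. "emp"
records = parsed[first_key]         # list of row dicts

# newline='' stops the csv module from writing an extra blank line after every row on Windows
with open("output.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)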

Save content of Spark DataFrame as a single CSV file [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 4 years ago.
Say I have a Spark DataFrame which I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the path provided.
How would I save a DF with:
Path mapping to the exact file name instead of folder
Header available in first line
Save as a single file instead of multiple files.
One way to deal with it is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the disadvantage of collecting it on the master machine, which therefore needs to have enough memory.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?
Just solved this myself using PySpark with dbutils to get the .csv and rename it to the wanted filename.
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + 'export.csv'
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer can be improved by not using [-1], but the .csv seems to always be last in the folder. Simple and fast solution if you only work on smaller files and can use repartition(1) or coalesce(1).
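If you would rather not rely on the .csv being the last entry in the listing, a small variation (a sketch, assuming exactly one part file ends in .csv) filters the listing by extension instead:

# pick the part file by its .csv suffix instead of by position
csv_files = [f.path for f in dbutils.fs.ls(csv_location) if f.path.endswith(".csv")]
assert len(csv_files) == 1, "expected exactly one .csv part file"
dbutils.fs.cp(csv_files[0], file_location)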
Use:
df.toPandas().to_csv("sample_file.csv", header=True)
See documentation for details:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas
df.coalesce(1).write.option("inferSchema", "true").csv("/newFolder", header='true', dateFormat="yyyy-MM-dd HH:mm:ss")
The following Scala method works in local or client mode, and writes the df to a single CSV with the chosen name. It requires that the df fit into memory, otherwise collect() will blow up.
import java.io.{BufferedWriter, OutputStreamWriter}
import org.apache.hadoop.conf
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, Row}

val SPARK_WRITE_LOCATION = some_directory
val SPARKSESSION = org.apache.spark.sql.SparkSession

def saveResults(results : DataFrame, filename: String) {
  var fs = FileSystem.get(this.SPARKSESSION.sparkContext.hadoopConfiguration)
  if (SPARKSESSION.conf.get("spark.master").toString.contains("local")) {
    fs = FileSystem.getLocal(new conf.Configuration())
  }

  val tempWritePath = new Path(SPARK_WRITE_LOCATION)

  if (fs.exists(tempWritePath)) {
    val x = fs.delete(new Path(SPARK_WRITE_LOCATION), true)
    assert(x)
  }

  if (results.count > 0) {
    val hadoopFilepath = new Path(SPARK_WRITE_LOCATION, filename)
    val writeStream = fs.create(hadoopFilepath, true)
    val bw = new BufferedWriter(new OutputStreamWriter(writeStream, "UTF-8"))

    val x = results.collect()
    for (row: Row <- x) {
      val rowString = row.mkString(start = "", sep = ",", end = "\n")
      bw.write(rowString)
    }
    bw.close()
    writeStream.close()

    val resultsWritePath = new Path(WRITE_DIRECTORY, filename)
    if (fs.exists(resultsWritePath)) {
      fs.delete(resultsWritePath, true)
    }
    fs.copyToLocalFile(false, hadoopFilepath, resultsWritePath, true)
  } else {
    System.exit(-1)
  }
}
This solution is based on a shell script and is not parallelized, but is still very fast, especially on SSDs. It uses cat and output redirection on Unix systems. Suppose that the CSV directory containing the partitions is located at /my/csv/dir and that the output file is /my/csv/output.csv:
#!/bin/bash
echo "col1,col2,col3" > /my/csv/output.csv
for i in /my/csv/dir/*.csv ; do
    echo "Processing $i"
    cat $i >> /my/csv/output.csv
    rm $i
done
echo "Done"
It will remove each partition after appending it to the final CSV in order to free space.
"col1,col2,col3" is the CSV header (here we have three columns named col1, col2 and col3). You must tell Spark not to put the header in each partition (this is accomplished with .option("header", "false")), because the shell script will do it.
For those still wanting to do this, here's how I got it done using Spark 2.1 in Scala, with some java.nio.file help.
Based on https://fullstackml.com/how-to-export-data-frame-from-apache-spark-3215274ee9d6
import java.nio.file.Files

val df: org.apache.spark.sql.DataFrame = ??? // data frame to write
val file: java.nio.file.Path = ??? // target output file (i.e. 'out.csv')

import scala.collection.JavaConversions._

// write csv into temp directory which contains the additional spark output files
// could use Files.createTempDirectory instead
val tempDir = file.getParent.resolve(file.getFileName + "_tmp")
df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save(tempDir.toAbsolutePath.toString)

// find the actual csv file
val tmpCsvFile = Files.walk(tempDir, 1).iterator().toSeq.find { p =>
  val fname = p.getFileName.toString
  fname.startsWith("part-00000") && fname.endsWith(".csv") && Files.isRegularFile(p)
}.get

// move to desired final path
Files.move(tmpCsvFile, file)

// delete temp directory
Files.walk(tempDir)
  .sorted(java.util.Comparator.reverseOrder())
  .iterator().toSeq
  .foreach(Files.delete(_))
The FileUtil.copyMerge() from the Hadoop API should solve your problem.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
  // the "true" setting deletes the source files once they are merged into the new output
}
See Write single CSV file using spark-csv
This is how distributed computing works! Multiple files inside a directory is exactly how distributed computing works; this is not a problem at all, since all software can handle it.
Your question should rather be "how is it possible to download a CSV composed of multiple files?" -> there are already lots of solutions on SO.
Another approach could be to use Spark as a JDBC source (with the awesome Spark Thrift server), write a SQL query and transform the result to CSV.
In order to prevent OOM in the driver (since the driver will get ALL the data), use incremental collect (spark.sql.thriftServer.incrementalCollect=true); more info at http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/.
Small recap about Spark "data partition" concept:
INPUT (X PARTITIONs) -> COMPUTING (Y PARTITIONs) -> OUTPUT (Z PARTITIONs)
Between "stages", data can be transferred between partitions; this is the "shuffle". You want "Z" = 1, but with Y > 1 and without a shuffle? That is impossible.

In Scrapy how to seperate items in output json file

I am a new learner of Scrapy and have encountered a problem. I get several JSON responses when crawling websites (that part I have already done). I want to fill items with them and then output everything to one JSON file, but the output file is not what I expected.
The item class looks like this:
class USLPlayer(scrapy.Item):
    ln = scrapy.Field()
    fn = scrapy.Field()
    ...
The original json file structure looks like this:
{"players":{"4752569":{"ln":"Musa","fn":"Yahaya", .... ,"apprvd":"59750"}, "4801435":{"ln":"Ackley","fn":"Brian", ... ,"apprvd":"59750"}, ...}}
The expected result I hope to be looks like this:
{"item" :{"ln":"Musa","fn":"Yahaya", .... ,"apprvd":"59750"}},{"item": {"ln":"Ackley","fn":"Brian", ... ,"apprvd":"59750"}, ...
Basically, I hope every item ends up as a separate entry in a list.
The code that fills the items is:
players = json.loads(plain_text)
for id, player in players["players"].items():
    for key, value in player.items():
        item = USLPlayer()
        item[key] = value
    yield item
Is there any way I can output the JSON file as I expected? Thank you very much for your kind answer.
Have you tried the JSON lines feed exporter?
It will output your items as JSON objects one per line. Then, reading the list of players from the file is as easy as using json.loads on each line.
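For example (the spider name and file names are illustrative), running the crawl with a .jl output selects the JSON Lines exporter, and the file can then be read back one object per line:

# shell: scrapy crawl uslplayers -o players.jl

import json

with open("players.jl") as f:
    players = [json.loads(line) for line in f]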

Using Python's csv.dictreader to search for specific key to then print its value

BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that csv.DictReader assumes the first line/row of the file contains the fieldnames; however, my CSV dictionary file simply starts with "key","value" pairs and goes on for at least 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) to the screen using the print function. My problem is how to use csv.DictReader to search for a specific key and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if I want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program; I just need help with csv.DictReader. I appreciate any assistance.
EDITED PART:
import csv

def main():
    with open('anchor_summary2.csv', 'rb') as file_data:
        list_of_stuff = []
        reader = csv.DictReader(file_data, ("title", "value"))
        for i in reader:
            list_of_stuff.append(i)
        print list_of_stuff

main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the text of the file, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
    list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
    data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data. The dict() constructor in Python can take any iterable.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io

with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
    data = dict(csv.reader(f))

print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io

# Python 2/3 compat
try:
    input = raw_input
except NameError:
    pass

def main():
    # Case-insensitive & leading/trailing whitespace insensitive
    user_city = input('Enter a city: ').strip().lower()

    with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
        for city, value in csv.reader(f):
            if user_city == city.lower():
                print(value)
                break
        else:
            print("City not found.")

if __name__ == '__main__':
    main()
The advantage of this technique is that the CSV isn't loaded into memory and the data is only iterated over once. I also added a little code that calls lower() on both keys to make the match case-insensitive. Another advantage is that if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.
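As a rough sketch of that idea (the database file and table name are made up), the CSV could be loaded once into SQLite and then queried by key:

import csv
import sqlite3

conn = sqlite3.connect("cities.db")
conn.execute("CREATE TABLE IF NOT EXISTS cities (title TEXT PRIMARY KEY, value TEXT)")

# load the two-column CSV once; INSERT OR REPLACE keeps the last value for duplicate titles
with open("anchor_summary2.csv", newline="") as f:
    conn.executemany("INSERT OR REPLACE INTO cities VALUES (?, ?)", csv.reader(f))
conn.commit()

row = conn.execute("SELECT value FROM cities WHERE title = ?", ("Station Palais",)).fetchone()
print(row[0] if row else "City not found.")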