Converting CSV to Parquet in spark without writing/saving it - csv

I am trying to convert a CSV to Parquet and then use it for other purposes and I couldn't find how to do it without saving it, I've only seen this kind of structure to convert and it always writes/saves the data:
data = spark.read.load(sys.argv[1], format="csv", sep="|", inferSchema="true", header="true")
data.write.format("parquet").mode("overwrite").save(sys.argv[2])
Am I missing something? I am also happy with scala solutions.
Thanks!
I've tried to play with the mode option in "write" but it didn't help.

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and using a token within the JSON return pass that pack to a function top call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However the problem I encounter is roughly on file 66 the system (Python/Databricks/Spark) decides to change the file type of a field. It is always a string and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that the unionbyName fails because of different datatypes.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema so you can to a unionByName with the allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
StructField('file_name',StringType(),True),
StructField('id',LongType(),True),
StructField('dataset_name',StringType(),True),
StructField('snapshotdate',TimestampType(),True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps

reading .csv file + JSON with Matlab

So I have a .CSV file that contains dataset information, the data seems to be described in JSON. I want to read it with MatLab. One line example(7000 total) of the data:
imagename.jpg,"[[{""name"":""nose"",""position"":[2911.68,1537.92]},{""name"":""left eye"",""position"":[3101.76,544.32]},{""name"":""right eye"",""position"":[2488.32,544.32]},{""name"":""left ear"",""position"":null},{""name"":""right ear"",""position"":null},{""name"":""left shoulder"",""position"":null},{""name"":""right shoulder"",""position"":[190.08,1270.08]},{""name"":""left elbow"",""position"":null},{""name"":""right elbow"",""position"":[181.44,3231.36]},{""name"":""left wrist"",""position"":[2592,3093.12]},{""name"":""right wrist"",""position"":[2246.4,3965.76]},{""name"":""left hip"",""position"":[3006.72,3360.96]},{""name"":""right hip"",""position"":[155.52,3412.8]},{""name"":""left knee"",""position"":null},{""name"":""right knee"",""position"":null},{""name"":""left ankle"",""position"":[2350.08,4786.56]},{""name"":""right ankle"",""position"":[1460.16,5019.84]}]]","[[{""segment"":[[0,17.28],[933.12,5175.36],[0,5166.72],[0,2306.88]]}]]",https://imageurl.jpg,
If I use the Import functionlity/tool, I am able separate the data in four colums using the , as delimiter:
Image File Name,Key Points,Segmentation,Image URL,
imagename.jpg,
"[[{""name"":""nose"",""position"":[2911.68,1537.92]},{""name"":""left eye"",""position"":[3101.76,544.32]},{""name"":""right eye"",""position"":[2488.32,544.32]},{""name"":""left ear"",""position"":null},{""name"":""right ear"",""position"":null},{""name"":""left shoulder"",""position"":null},{""name"":""right shoulder"",""position"":[190.08,1270.08]},{""name"":""left elbow"",""position"":null},{""name"":""right elbow"",""position"":[181.44,3231.36]},{""name"":""left wrist"",""position"":[2592,3093.12]},{""name"":""right wrist"",""position"":[2246.4,3965.76]},{""name"":""left hip"",""position"":[3006.72,3360.96]},{""name"":""right hip"",""position"":[155.52,3412.8]},{""name"":""left knee"",""position"":null},{""name"":""right knee"",""position"":null},{""name"":""left ankle"",""position"":[2350.08,4786.56]},{""name"":""right ankle"",""position"":[1460.16,5019.84]}]]",
"[[{""segment"":[[0,17.28],[933.12,5175.36],[0,5166.72],[0,2306.88]]}]]",
https://imageurl.jpg,
But I have truble trying to use the tool to do further decomposition of the data. Of corse the ideal would be to separate the data in a code.
I hope someone can orientate me on how to or the tools I need to use. I have seen other questions, but they don't seem to fit my particular case.
Thank you very much!!
You can read a JSON file and store it in a MATLAB structure using the following command structure1 = matlab.internal.webservices.fromJSON(json_string)
You can create a JSON string from a MATLAB structure using the following command json_string= matlab.internal.webservices.toJSON(structure1)
JSONlab is what you want. It has a 'loadjson' function which inputs a char array of JSON data and returns a struct with all the data

CSV data exported/copied to HDFS going in weird format

I am using a spark job for reading csv file data from a stating area and coping that data into HDFS using following code line:
val conf = new SparkConf().setAppName("WCRemoteReadHDFSWrite").set("spark.hadoop.validateOutputSpecs", "true");
val sc = new SparkContext(conf)
val rdd = sc.textFile(source)
rdd.saveAsTextFile(destination)
csv file is having data in following format:
CTId,C3UID,region,product,KeyWord
1,1004634181441040000,East,Mobile,NA
2,1004634181441040000,West,Tablet,NA
whereas when data goes into HDFS it goes in following format:
CTId,C3UID,region,product,KeyWord
1,1.00463E+18,East,Mobile,NA
2,1.00463E+18,West,Tablet,NA
I am not able to find any valid reason behind this.
Any kind of help would be appreciated.
Regards,
Bhupesh
What happens is that because your C3UID is a large number, it gets parsed as Double and then is saved in standard Double notation. You need to fix the schema, and make sure you read the second column either as Long, BigDecimal or String, then there will be no change in String-representation.
Sometimes your CSV file could also be the culprit. Do NOT open CSV file in excel as excel could convert those big numeric values into exponential format and hence once you use spark job for importing data into hdfs, it goes as it is in string format.
Hence be very sure that your data in CSV should never be opened in excel before importing to hdfs using spark job. If you really want to see the content of your excel use either notepad++ or any other text editor tool

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for futher processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but semicolon delimited file, and the numbers are formatted with German notation, i.e. comma is used as decimal delimiter.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in different Locale or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in Spark code and one nasty thing that seems to be causing issue is in CSVInferSchema.scala line 268 (spark 2.1.0) - the parser enforces US formatting rather than e.g. rely on the Locale set for the JVM, or allowing configuring this somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use yelp's data challenge's data set, however i can not open it since it is in json format and almost 2 gb. In its website its been said that the dataset can be opened in phyton using mrjob, but I am also not very good with programming. I searched online and looked some of the codes yelp provided in github however I couldn't seem to find an article or something which explains how to open the dataset, clearly.
Can you please tell me step by step how to open this file and maybe how to convert it to csv?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
data is in .tar format when u extract it again it has another file,rename it to .tar and then extract it.you will get all the json files
yes you can use pandas. Take a look:
import pandas as pd
# read the entire file into a python array
with open('yelp_academic_dataset_review.json', 'rb') as f:
data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
Case, you want convert it directly to csv, you can use this script
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it can help you
To process huge json files, use a streaming parser.
Many of these files aren't a single json, but a stream of jsons (known as "jsons format"). Then a regular json parser will consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and wrote them to the desired output; then continue writing.
There is no single json-to-csv conversion.
Thus, you will not find a general conversion utility, you have to customize the conversion for your needs.
The reason is that a JSON is a tree but a CSV is not. There exists no ultimative and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding, to become a better programmer. To succeed with such amounts of data, you need to become a better programmer.