CSV data exported/copied to HDFS going in weird format - csv

I am using a spark job for reading csv file data from a stating area and coping that data into HDFS using following code line:
val conf = new SparkConf().setAppName("WCRemoteReadHDFSWrite").set("spark.hadoop.validateOutputSpecs", "true");
val sc = new SparkContext(conf)
val rdd = sc.textFile(source)
rdd.saveAsTextFile(destination)
csv file is having data in following format:
CTId,C3UID,region,product,KeyWord
1,1004634181441040000,East,Mobile,NA
2,1004634181441040000,West,Tablet,NA
whereas when data goes into HDFS it goes in following format:
CTId,C3UID,region,product,KeyWord
1,1.00463E+18,East,Mobile,NA
2,1.00463E+18,West,Tablet,NA
I am not able to find any valid reason behind this.
Any kind of help would be appreciated.
Regards,
Bhupesh

What happens is that because your C3UID is a large number, it gets parsed as Double and then is saved in standard Double notation. You need to fix the schema, and make sure you read the second column either as Long, BigDecimal or String, then there will be no change in String-representation.

Sometimes your CSV file could also be the culprit. Do NOT open CSV file in excel as excel could convert those big numeric values into exponential format and hence once you use spark job for importing data into hdfs, it goes as it is in string format.
Hence be very sure that your data in CSV should never be opened in excel before importing to hdfs using spark job. If you really want to see the content of your excel use either notepad++ or any other text editor tool

Related

Converting CSV to Parquet in spark without writing/saving it

I am trying to convert a CSV to Parquet and then use it for other purposes and I couldn't find how to do it without saving it, I've only seen this kind of structure to convert and it always writes/saves the data:
data = spark.read.load(sys.argv[1], format="csv", sep="|", inferSchema="true", header="true")
data.write.format("parquet").mode("overwrite").save(sys.argv[2])
Am I missing something? I am also happy with scala solutions.
Thanks!
I've tried to play with the mode option in "write" but it didn't help.

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and using a token within the JSON return pass that pack to a function top call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However the problem I encounter is roughly on file 66 the system (Python/Databricks/Spark) decides to change the file type of a field. It is always a string and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that the unionbyName fails because of different datatypes.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema so you can to a unionByName with the allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
StructField('file_name',StringType(),True),
StructField('id',LongType(),True),
StructField('dataset_name',StringType(),True),
StructField('snapshotdate',TimestampType(),True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps

reading .csv file + JSON with Matlab

So I have a .CSV file that contains dataset information, the data seems to be described in JSON. I want to read it with MatLab. One line example(7000 total) of the data:
imagename.jpg,"[[{""name"":""nose"",""position"":[2911.68,1537.92]},{""name"":""left eye"",""position"":[3101.76,544.32]},{""name"":""right eye"",""position"":[2488.32,544.32]},{""name"":""left ear"",""position"":null},{""name"":""right ear"",""position"":null},{""name"":""left shoulder"",""position"":null},{""name"":""right shoulder"",""position"":[190.08,1270.08]},{""name"":""left elbow"",""position"":null},{""name"":""right elbow"",""position"":[181.44,3231.36]},{""name"":""left wrist"",""position"":[2592,3093.12]},{""name"":""right wrist"",""position"":[2246.4,3965.76]},{""name"":""left hip"",""position"":[3006.72,3360.96]},{""name"":""right hip"",""position"":[155.52,3412.8]},{""name"":""left knee"",""position"":null},{""name"":""right knee"",""position"":null},{""name"":""left ankle"",""position"":[2350.08,4786.56]},{""name"":""right ankle"",""position"":[1460.16,5019.84]}]]","[[{""segment"":[[0,17.28],[933.12,5175.36],[0,5166.72],[0,2306.88]]}]]",https://imageurl.jpg,
If I use the Import functionlity/tool, I am able separate the data in four colums using the , as delimiter:
Image File Name,Key Points,Segmentation,Image URL,
imagename.jpg,
"[[{""name"":""nose"",""position"":[2911.68,1537.92]},{""name"":""left eye"",""position"":[3101.76,544.32]},{""name"":""right eye"",""position"":[2488.32,544.32]},{""name"":""left ear"",""position"":null},{""name"":""right ear"",""position"":null},{""name"":""left shoulder"",""position"":null},{""name"":""right shoulder"",""position"":[190.08,1270.08]},{""name"":""left elbow"",""position"":null},{""name"":""right elbow"",""position"":[181.44,3231.36]},{""name"":""left wrist"",""position"":[2592,3093.12]},{""name"":""right wrist"",""position"":[2246.4,3965.76]},{""name"":""left hip"",""position"":[3006.72,3360.96]},{""name"":""right hip"",""position"":[155.52,3412.8]},{""name"":""left knee"",""position"":null},{""name"":""right knee"",""position"":null},{""name"":""left ankle"",""position"":[2350.08,4786.56]},{""name"":""right ankle"",""position"":[1460.16,5019.84]}]]",
"[[{""segment"":[[0,17.28],[933.12,5175.36],[0,5166.72],[0,2306.88]]}]]",
https://imageurl.jpg,
But I have truble trying to use the tool to do further decomposition of the data. Of corse the ideal would be to separate the data in a code.
I hope someone can orientate me on how to or the tools I need to use. I have seen other questions, but they don't seem to fit my particular case.
Thank you very much!!
You can read a JSON file and store it in a MATLAB structure using the following command structure1 = matlab.internal.webservices.fromJSON(json_string)
You can create a JSON string from a MATLAB structure using the following command json_string= matlab.internal.webservices.toJSON(structure1)
JSONlab is what you want. It has a 'loadjson' function which inputs a char array of JSON data and returns a struct with all the data

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use yelp's data challenge's data set, however i can not open it since it is in json format and almost 2 gb. In its website its been said that the dataset can be opened in phyton using mrjob, but I am also not very good with programming. I searched online and looked some of the codes yelp provided in github however I couldn't seem to find an article or something which explains how to open the dataset, clearly.
Can you please tell me step by step how to open this file and maybe how to convert it to csv?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
data is in .tar format when u extract it again it has another file,rename it to .tar and then extract it.you will get all the json files
yes you can use pandas. Take a look:
import pandas as pd
# read the entire file into a python array
with open('yelp_academic_dataset_review.json', 'rb') as f:
data = f.readlines()
# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)
data_json_str = "[" + ','.join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
Case, you want convert it directly to csv, you can use this script
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it can help you
To process huge json files, use a streaming parser.
Many of these files aren't a single json, but a stream of jsons (known as "jsons format"). Then a regular json parser will consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and wrote them to the desired output; then continue writing.
There is no single json-to-csv conversion.
Thus, you will not find a general conversion utility, you have to customize the conversion for your needs.
The reason is that a JSON is a tree but a CSV is not. There exists no ultimative and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding, to become a better programmer. To succeed with such amounts of data, you need to become a better programmer.

How to use ILNumerics function csvread to load complex array from a csv file

I want to load a csv format data file with function csvread. The data is in complex format. For example, it looks like "36.151-202.64i,236.74+2.1788i,26.234+201.94i, ...."
When I call csvread with ILNumerics, I can only first column data. All other data are zeros.
Here is the simple 2 lines of code.
var inputDataFile = File.ReadAllText(#"C:\Temp\Input.txt"); ILArray datComplex = csvread(inputDataFile);
Please help.
It is a popular request, and we are currently working on it. The upcoming version will support reading complex numbers and arbitrary formatted elements in CSV files. Stay tuned!