convert CSV to JSON using Python - json

I need to convert a CSV file to a JSON file using Python. I used this:
variable = csv.DictReader(file.csv)
It throws this error:
csv.Error: line contains NULL byte
I checked the CSV file in Excel and it shows no NULL characters, but when I printed the data from the CSV file using Python, some values looked like SOHNULNULHG (only the last two letters, HG, are what Excel displays). I need to remove these ASCII control characters from the CSV file while converting it to JSON (i.e. I need only HG from the above string).

I just ran into the same issue. I converted my CSV file to CSV UTF-8 and ran it again without any errors. That seemed to fix the ASCII character issue.
To convert the CSV type, I opened my file in Excel, chose Save As, then selected CSV UTF-8 (Comma delimited) (*.csv) under Save as type.
Hope that helps.
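If re-saving in Excel is not an option, the stray bytes can also be stripped in Python before parsing. This is only a minimal sketch: it assumes the file fits in memory and that the unwanted characters are ASCII control characters such as NUL and SOH. The helper name csv_to_json and its parameters are hypothetical. (If almost every other byte is NUL, the file is more likely UTF-16, and opening it with encoding="utf-16" would be the better fix.)

```python
import csv
import io
import json
import re

# ASCII control characters, except tab, LF and CR (which CSV files legitimately use)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def csv_to_json(csv_path, json_path, encoding="utf-8"):
    """Strip control characters from a CSV file, then write its rows as JSON."""
    # newline="" leaves line endings untouched so the csv module can handle them
    with open(csv_path, newline="", encoding=encoding, errors="replace") as f:
        cleaned = CONTROL_CHARS.sub("", f.read())
    # DictReader uses the first row as keys for every following row
    rows = list(csv.DictReader(io.StringIO(cleaned, newline="")))
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    return rows
```

With this, a cell stored as SOH NUL NUL HG comes out as plain HG in the JSON output.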

Related

Invalid literal because symbol appears when reading a csv file

In Replit I can remove the little symbol that appears when I drag and drop a CSV file in, so that my main.py can read it; otherwise I get an "invalid literal for int() with base 10" error. I am now trying to run this on my local machine with Sublime Text and I get the same error, since the file is read straight from the directory, so I assume the symbol is being added before the file is read. I can click on the CSV file in Replit and edit it, but I cannot do this in Sublime.
Can someone explain what this symbol is for? How can I get Python to read the plain comma-delimited numbers in the file (it is a game tile map)?
with open(f'level{level}_data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
I saved it as a comma-delimited CSV instead of a UTF-8 comma-delimited CSV. It then imports without the 'question mark in a diamond' symbol. I understand this is an unrecognised special character, but I have nothing apart from integers in my table. Maybe someone could clarify that?
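A likely explanation, offered as an assumption since the file itself is not shown: Excel's "CSV UTF-8" format writes a byte order mark (BOM) at the start of the file, so the first cell is read as '\ufeff' followed by the number, and int() rejects it. Opening the file with encoding="utf-8-sig" drops a leading BOM if there is one and reads BOM-less files normally. The function name load_tile_map is hypothetical.

```python
import csv

def load_tile_map(path):
    """Read a comma-delimited grid of integers, ignoring a leading BOM if present."""
    # "utf-8-sig" strips the BOM that Excel's "CSV UTF-8" export puts at the start
    with open(path, newline="", encoding="utf-8-sig") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        # skip completely empty lines, convert every cell to int
        return [[int(cell) for cell in row] for row in reader if row]
```

This would let the file stay in UTF-8 format rather than having to be re-saved as plain comma-delimited CSV.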

Apache Nifi : How to create parquet file from CSV file with schema saved in "avro.schema" attribute

I am trying to create a parquet file from a CSV file using Apache Nifi.
I am able to convert the CSV to a parquet file, but the problem is that the schema of the parquet file contains a struct type, which I need to convert to a string type.
I am using Apache Nifi 1.14.0 on Windows Server 2016.
This is what I've tried so far to convert CSV to parquet.
I have used the 3 controller services below:
CSVReader
CSVRecordSetWriter
ParquetRecordSetWriter
And these are the processors in the flow:
GetFile
ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
UpdateAttribute (updating the "avro.schema" attribute: wherever 2 data types have been inferred, I replace them with '["null","string"]')
ConvertRecord (CSVReader to ParquetRecordSetWriter)
UpdateAttribute (for appending '.parquet' to the filename)
PutFile
I also want to know how to view a .parquet file on Windows. Currently, I am reading the parquet file via PySpark and checking the schema. :|
This is what the parquet file schema looks like after conversion. I want string instead of struct as output.
Please note: there are lots of CSVs with many columns/fields. I don't want to create the schema manually.
Alternatively, any other way to achieve this would be very helpful.
Thanks!
After playing around with some more options of "ParquetRecordSetWriter", I was able to create a parquet file with the schema that I've captured in "avro.schema" attribute.
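The schema rewrite described in the UpdateAttribute step above can be illustrated outside NiFi. The following is only a sketch of the transformation, not NiFi configuration: it assumes the "avro.schema" attribute holds a standard Avro record schema in JSON, and that "2 data types inferred" means a union with more than one non-null branch. The function name flatten_union_types is hypothetical.

```python
import json

def flatten_union_types(avro_schema_json):
    """Replace every multi-type union in an Avro record schema with ["null", "string"]."""
    schema = json.loads(avro_schema_json)
    for field in schema.get("fields", []):
        field_type = field["type"]
        # Avro unions are JSON arrays, e.g. ["null", "int", "string"]
        if isinstance(field_type, list):
            non_null = [t for t in field_type if t != "null"]
            if len(non_null) > 1:
                field["type"] = ["null", "string"]
    return json.dumps(schema)
```

Fields that already have a single type (or a simple nullable type such as ["null", "long"]) are left unchanged.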

Reading csv without specifying enclosure characters in Weka

I have a dataset that I want to open in Weka, so I converted it to a CSV file. (The file contains some text including commas, apostrophes and quotation marks, while its separator is the pipe character.)
When I try to read this CSV file, in the options window I specify pipe (|) as my fieldSeparator, leave enclosureCharacters empty, and don't touch the rest of the options. This can be seen in the screenshot:
Then I get this error:
File not recognised as an 'CSV data files' file. Reason: Enclosures
can only be single characters.
It seems that Weka's CSV loader does not accept an empty enclosureCharacters field. What can I write into this field? I think my file does not have enclosures for its text data.

how to use ascii character for quote in COPY in cqlsh

I am uploading data from a big .csv file into Cassandra using COPY in cqlsh.
I am using Cassandra 1.2 and CQL 3.0.
However, since " is part of my data, I have to use some other character as the quote character when uploading, and I need it to be an extended ASCII character. I tried various approaches, but they all fail.
The following commands work, but I need to use an extended ASCII character for my purpose:
copy (<columnnames>) from <filename> with delimiter='|' and quote = '"';
copy (<columnnames>) from <filename> with delimiter='|' and quote = '~';
When I give quote='ß', I get the error below:
:"quotechar" must be an 1-character string
Please advise on how I can use an extended ASCII character for the quote parameter.
Thanks in advance
A note on the COPY documentation page suggests that for bulk loading (like in your case), the json2sstable utility should be used. You can then load the sstables to your cluster using sstableloader. So I suggest that you write a script/program to convert your CSV to JSON and use these tools for your big CSV. JSON will not have any problem handling all characters from ASCII table.
I had a similar problem, and inspected the source code of cqlsh (it's a Python script). In my case, I was generating the CSV with Python, so it was a matter of finding the right Python csv parameters.
Here's the key information from cqlsh:
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
escapechar='\\', quotechar='"')
So if you are lucky enough to generate your .csv file from Python, it's just a matter of using the csv module with:
writer = csv.writer(open("output.csv", 'w'), **csv_dialect_defaults)
Hope this helps, even if you are not using python.
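To see what those dialect settings do, here is a small self-contained sketch that writes rows containing " and , with the defaults quoted above and reads them back with the same settings. The helper names write_rows and read_rows are hypothetical, and the sketch only exercises Python's csv module; it does not call cqlsh.

```python
import csv
import io

# Dialect defaults as quoted above from the cqlsh source
csv_dialect_defaults = dict(delimiter=',', doublequote=False,
                            escapechar='\\', quotechar='"')

def write_rows(rows):
    """Serialize rows to CSV text using the cqlsh dialect defaults."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, **csv_dialect_defaults)
    writer.writerows(rows)
    return buffer.getvalue()

def read_rows(text):
    """Parse CSV text back into rows using the same dialect defaults."""
    return list(csv.reader(io.StringIO(text, newline=""), **csv_dialect_defaults))

rows = [["1", 'say "hi"', "a,b"]]
text = write_rows(rows)
# With doublequote=False, an embedded quote is written with the escape character
assert '\\"' in text
assert read_rows(text) == rows
```

Because embedded quotes are escaped with a backslash rather than doubled, data containing " survives the round trip without needing a different quote character.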

Sqoop HDFS to Couchbase: json file format

I'm trying to export data from HDFS to Couchbase and I have a problem with my file format.
My configuration:
Couchbase server 2.0
Stack hadoop cdh4.1.2
sqoop 1.4.2 (compiled with hadoop2.0.0)
couchbase/hadoop connector (compiled with hadoop2.0.0)
When I run the export command, I can easily export files with this kind of format:
id,"value"
or
id,42
or
id,{"key":"value"}
But when I want to export a JSON object with more than one key, it doesn't work!
id,{"key1":"value1","key2":"value2"}
The content is truncated at the first comma and displayed in base64 by Couchbase, because the content is then no longer valid JSON.
So, my question is: how must the file be formatted to be stored as a JSON document?
Can we only export a key/value file?
I want to export JSON files from HDFS the way cbdocloader does with files from the local file system.
I'm afraid this is expected behavior, as Sqoop parses your input file as CSV with a comma as the separator. You might need to tweak your file format to either escape the separator or enclose the entire JSON string. I would recommend reading how exactly Sqoop deals with escaping separators and enclosing strings in the user guide [1].
Links:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#id387098
I think your best bet is to convert the files to tab-delimited, if you're still working on this. If you look at the Sqoop documentation (http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_large_objects), there's an option --fields-terminated-by which allows you to specify which characters Sqoop splits fields on.
If you passed it --fields-terminated-by '\t', and a tab-delimited file, it would leave the commas in place in your JSON.
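The conversion to tab-delimited input can be sketched as follows. This is a minimal example, assuming every line has the form id,<json>, that ids never contain a comma, and that the JSON contains no literal tab characters; the function name comma_to_tab is hypothetical.

```python
def comma_to_tab(lines):
    """Turn 'id,<json>' lines into 'id<TAB><json>' lines, splitting on the first comma only."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # partition splits at the first comma, so commas inside the JSON are untouched
        key, _, value = line.partition(",")
        yield f"{key}\t{value}"

converted = list(comma_to_tab(['id,{"key1":"value1","key2":"value2"}\n']))
assert converted == ['id\t{"key1":"value1","key2":"value2"}']
```

The resulting lines can then be written back to a file and exported with --fields-terminated-by '\t' as described above.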
@mpiffaretti, can you post your sqoop export command? I think each JSON object should have its own key value.
key1 {"dataOne":"ValueOne"}
key2 {"dataTwo":"ValueTwo"}
http://ajanacs.weebly.com/blog
In your case, changing the data as shown below may help you solve the issue.
id,{"key":"value"}
id2,{"key2":"value2"}
Let me know if you have further questions on it.
[json] [sqoopexport] [couchbase]