I am trying to parse an XML file and write the resulting DataFrame to a CSV file.
My problem is that some characters are not preserved when I write the output to the CSV. For example, the field value Nectarine tree named ‘Polar Zee’ is written with its curly quotation marks garbled.
Are there any settings that need to be changed, or any properties that need to be added?
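The question does not show the code that writes the file, so the following is only a guess at the cause. A common source of this symptom is a file written as UTF-8 and then opened by a program that assumes Windows-1252. The sketch below reproduces that mismatch and then writes the CSV with the "utf-8-sig" encoding, whose byte order mark helps some spreadsheet programs detect UTF-8. The helper name write_csv_bytes is made up for illustration.

```python
import csv
import io

text = "Nectarine tree named ‘Polar Zee’"

# Reproduce the symptom: UTF-8 bytes interpreted as Windows-1252.
garbled = text.encode("utf-8").decode("cp1252")


def write_csv_bytes(rows, encoding="utf-8-sig"):
    """Hypothetical helper: serialize rows to CSV bytes in an explicit encoding."""
    buf = io.BytesIO()
    wrapper = io.TextIOWrapper(buf, encoding=encoding, newline="")
    csv.writer(wrapper).writerows(rows)
    wrapper.flush()
    data = buf.getvalue()
    wrapper.detach()  # keep buf open when the wrapper is garbage collected
    return data


data = write_csv_bytes([["name"], [text]])

# Reading the bytes back with the matching encoding restores the quotes.
restored = data.decode("utf-8-sig")
print(garbled)
print(restored)
```

If the DataFrame is a pandas one, passing the same encoding name to its CSV writer would be the equivalent step, but that depends on which library is actually in use.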
I am trying to create a Parquet file from a CSV file using Apache NiFi.
I am able to convert the CSV to a Parquet file, but the problem is that the schema of the Parquet file contains a struct type, which I need to convert to a string type.
I am using Apache NiFi 1.14.0 on Windows Server 2016.
This is what I've tried so far to convert CSV to Parquet.
I have used the following three controller services:
CSVReader
CSVRecordSetWriter
ParquetRecordSetWriter
And these are the processors in the flow:
GetFile
ConvertRecord (CSVReader to CSVRecordSetWriter; this automatically generates the "avro.schema" attribute, which I update in the next step)
UpdateAttribute (updating the "avro.schema" attribute: wherever two data types were inferred, I replace them with '["null","string"]')
ConvertRecord (CSVReader to ParquetRecordSetWriter)
UpdateAttribute (for appending '.parquet' to the filename)
PutFile
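To make the schema rewrite in the UpdateAttribute step concrete, here is a minimal Python sketch of the same transformation done outside NiFi. It assumes the "avro.schema" attribute holds a record schema as JSON, and that "two data types inferred" means a union with more than one non-null branch. The function name and the sample schema are made up for illustration.

```python
import json


def stringify_ambiguous_unions(schema_json):
    """Replace unions with several non-null branches by ["null", "string"]."""
    schema = json.loads(schema_json)
    for field in schema.get("fields", []):
        field_type = field["type"]
        if isinstance(field_type, list):
            non_null = [t for t in field_type if t != "null"]
            if len(non_null) > 1:
                field["type"] = ["null", "string"]
    return json.dumps(schema)


# Hypothetical inferred schema: "code" was seen as both int and string.
inferred = json.dumps({
    "type": "record",
    "name": "nifiRecord",
    "fields": [
        {"name": "id", "type": ["null", "int"]},
        {"name": "code", "type": ["null", "int", "string"]},
    ],
})

fixed = json.loads(stringify_ambiguous_unions(inferred))
print(fixed["fields"])
```

Fields with a single inferred type are left untouched, so only the ambiguous columns become strings.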
I would also like to know how to view a .parquet file on Windows. Currently, I am reading the Parquet file via PySpark and checking the schema. :|
This is what the Parquet file schema looks like after conversion. I want string instead of struct as the output type.
Please note: there are lots of CSVs with many columns/fields, and I don't want to create the schema manually.
Any other way to achieve this would also be very helpful.
Thanks!
After playing around with some more options of "ParquetRecordSetWriter", I was able to create a Parquet file with the schema that I had captured in the "avro.schema" attribute.
I have a CSV file in which one column contains JSON-formatted data.
How can I extract the JSON data into a CSV file that can be processed in Access or SQL?
Which programming language can be used, and what would the code look like?
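One possible approach, sketched in Python with only the standard library, is to parse the JSON column and spread its top-level keys into separate CSV columns. The column names "id" and "payload", the sample rows, and the function name are hypothetical. Nested values are kept as JSON text rather than flattened further.

```python
import csv
import io
import json


def flatten_json_column(src, dst, json_col):
    """Copy CSV rows from src to dst, expanding json_col into prefixed columns."""
    reader = csv.DictReader(src)
    fieldnames = [c for c in (reader.fieldnames or []) if c != json_col]
    rows = []
    for row in reader:
        parsed = json.loads(row.pop(json_col) or "{}")
        for key, value in parsed.items():
            column = f"{json_col}_{key}"  # prefix avoids clashes with existing columns
            if column not in fieldnames:
                fieldnames.append(column)
            # Keep scalars as they are; store lists and objects as JSON text.
            row[column] = json.dumps(value) if isinstance(value, (list, dict)) else value
        rows.append(row)
    writer = csv.DictWriter(dst, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical input with a JSON column named "payload".
source = io.StringIO(
    'id,payload\n'
    '1,"{""name"": ""Ann"", ""age"": 30}"\n'
    '2,"{""name"": ""Bob"", ""tags"": [""x"", ""y""]}"\n'
)
target = io.StringIO()
flatten_json_column(source, target, "payload")
result = list(csv.DictReader(io.StringIO(target.getvalue())))
print(target.getvalue())
```

The output is a flat CSV with one column per JSON key, which is the shape that an import into Access or a SQL table generally expects.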
Suppose I have multiple CSV files in the same directory, and these files all share the same schema.
/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv
I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?
You have two options I can think of:
1) Use the Input File name
Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use that for any grouping/aggregation operations you need to do. This is probably your best option, as it is more aligned with the parallel processing intent of Spark, where you tell it what to do and let it figure out the how. You can do this with code like the following:
SQL:
SELECT input_file_name() as fname FROM dataframe
Or Python:
from pyspark.sql.functions import input_file_name
newDf = df.withColumn("filename", input_file_name())
2) Gzip your CSV files
Gzip is not a splittable compression format. This means that when gzipped files are loaded, each file will be its own partition.
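For the second option, the compression step itself can be done in plain Python before the files are handed to Spark. This is a minimal sketch using only the standard library. The function name is made up, and the demo uses a temporary directory in place of /tmp/data.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_csv_files(directory):
    """Write a .csv.gz copy next to every .csv file in directory."""
    created = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        gz_path = csv_path.with_name(csv_path.name + ".gz")
        with open(csv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        created.append(gz_path)
    return created


# Demo on two small throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for i in (1, 2):
        Path(tmp, f"myfile{i}.csv").write_text(f"id,value\n{i},a\n", encoding="utf-8")
    created_names = [p.name for p in gzip_csv_files(tmp)]
    with gzip.open(Path(tmp, "myfile1.csv.gz"), "rt", encoding="utf-8") as f:
        restored = f.read()

print(created_names)
```

The original .csv files are left in place, so point the Spark reader at the .csv.gz files only, or move them to a separate directory first.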
I have a dataset that I want to open in Weka, so I converted it to a CSV file. (The file contains some text including commas, apostrophes, and quotation marks, while its separator is the pipe character.)
When I try to read this CSV file, in the options window I specify the pipe (|) as my fieldSeparator, leave enclosureCharacters empty, and don't touch the rest of the options. This can be seen in the screenshot:
Then I get this error:
File not recognised as an 'CSV data files' file. Reason: Enclosures
can only be single characters.
It seems that Weka's CSV loader does not accept an empty enclosureCharacters field. What can I put in this field? I think my file does not have enclosures for its text data.
I have a large CSV file where each line consists of (id, description) in text format. I want to convert each line to a vector using "seq2sparse" and then run "rowsimilarity" to generate a textual similarity result.
The problem is that I need to convert the CSV file to SEQ somehow to work with "seq2sparse", and the existing method "seqdirectory" takes a directory of text files rather than a CSV file. Is there any way to accomplish this?
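Since "seqdirectory" expects a directory of text files, one possible workaround (not something the question confirms) is to split the CSV into one text file per row and run "seqdirectory" on the result. The sketch below assumes two columns per row, no header line, and ids that are safe to use as file names. The function name is made up. For a very large CSV this produces a very large number of small files, which may or may not be acceptable.

```python
import csv
import tempfile
from pathlib import Path


def csv_to_text_files(csv_path, out_dir):
    """Write each (id, description) row to out_dir/<id>.txt and return the row count."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            doc_id, description = row  # assumes exactly two columns
            (out_dir / f"{doc_id}.txt").write_text(description, encoding="utf-8")
            count += 1
    return count


# Demo with a tiny throwaway CSV.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "docs.csv")
    source.write_text('1,"first text, with a comma"\n2,second text\n', encoding="utf-8")
    written = csv_to_text_files(source, Path(tmp, "docs"))
    names = sorted(p.name for p in Path(tmp, "docs").iterdir())
    first = Path(tmp, "docs", "1.txt").read_text(encoding="utf-8")

print(written, names)
```

Each file name then carries the row id, so it can serve as the document key after the directory is converted.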