I am trying to store my PySpark output as CSV, but when I save it the output does not look the same. I have the output in this form:
When I try to convert this to CSV, the Concat tasks column does not show up properly because of the size of the data. Given my requirement, it's necessary for me to store the data in CSV format. Is there a way around this? (P.S. I also see columns showing nonsensical values, even though the PySpark output shows the correct values.)
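A minimal sketch of one common fix, assuming the problem is that the long Concat tasks column contains commas or newlines that break the CSV layout (the DataFrame name df and the output path are placeholders):

# Quote every field and escape embedded quotes so a long, comma-containing
# value stays inside a single CSV column.
(df.coalesce(1)                        # optional: write a single output file
   .write
   .option("header", "true")
   .option("quoteAll", "true")
   .option("escape", '"')
   .mode("overwrite")
   .csv("/tmp/output_csv"))            # hypothetical output directory

Opening the result in a spreadsheet can still truncate very long cells visually, but the underlying file keeps the full values.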
I have encountered issues when converting csv files to parquet in PySpark. When multiple files with the same schema were converted, the resulting parquet files did not end up with the same schema, because sometimes a column of numbers is read as float and other times as integer, etc. There also seems to be an issue with the order of columns: when dataframes with the same columns, but arranged in a different order, are written to parquet, the resulting parquet files cannot be loaded in the same statement.
How can I write dataframes to parquet so that all columns are stored as string type? How should I handle the order of columns? Should I rearrange the columns to the same order for all the dataframes before writing to parquet?
If you want to sort the columns and convert to string type, you can do:
from pyspark.sql import functions as F

# Cast every column to string and order the columns alphabetically.
out_df = df.select([F.col(c).cast('string') for c in sorted(df.columns)])
out_df.write.parquet(...)
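As a follow-up sketch, once every frame has been normalized this way the resulting parquet outputs can be read back together (the paths here are placeholders):

# With identical column order and all-string types, several parquet outputs
# can be read in one statement.
combined = spark.read.parquet("/data/out_a.parquet", "/data/out_b.parquet")
# Or let Spark reconcile any remaining differences:
combined = spark.read.option("mergeSchema", "true").parquet("/data/out_*.parquet")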
I'm trying to convert an Excel file to a JSON file without losing any data.
I have a column in the Excel file that contains values which are both text and numbers (but both are stored as text).
First, I set the column values to Text values.
Then I tried to solve this problem in two ways:
I tried converting the Excel file to a JSON file using an online converter, but it didn't work well since I have text in a foreign language in the file.
I tried converting the Excel file to a csv and then to a JSON file (also using an online converter to convert csv to JSON), and it worked, but for some reason the numbers (which were stored as text) became numbers again.
If there is a solution that involves code, I'm great with that too.
Thanks
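Since code is acceptable, here is a minimal sketch with pandas; the file names are placeholders and it assumes an Excel reader engine such as openpyxl is installed. Reading everything as strings keeps the text-stored numbers as text, and force_ascii=False preserves the foreign-language characters:

import pandas as pd

# Read every cell as a string so numbers stored as text stay text.
df = pd.read_excel("input.xlsx", dtype=str)

# Write JSON without escaping non-ASCII characters.
df.to_json("output.json", orient="records", force_ascii=False)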
I have a table in a Cassandra DB and one of the columns has values in JSON format. I am using DataStax DevCenter to query the DB, and when I try to export the result to CSV, the JSON value gets broken into separate columns wherever there is a comma (,). I even tried to export from the command prompt without giving any delimiter, but that too resulted in a broken JSON value.
Is there any way to achieve this?
Use the COPY command to export the table as a whole with a different delimiter.
For example :
COPY keyspace.your_table (your_id, your_col) TO 'your_table.csv' WITH DELIMITER='|';
Then filter on this data programmatically in whatever way you want.
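For example, a small sketch of how the pipe-delimited export could then be parsed in Python (the column and file names are the placeholders from the COPY statement above):

import csv
import json

with open("your_table.csv", newline="") as f:
    reader = csv.reader(f, delimiter="|")
    for your_id, your_col in reader:
        doc = json.loads(your_col)  # the JSON column arrives in one piece
        # filter / transform as needed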
I have a MySQL table whose data I have to export to .csv and then ingest that .csv into GeoMesa.
My MySQL table structure is like below:
[screenshot of the table structure]
Now, as you can see, the the_geom attribute of the table has data type point, and in the database it is stored as a blob.
Now I have two problems:
1. When I export the MySQL data into a .csv file, the csv file shows (...) for the the_geom attribute instead of any representation that would allow it to be ingested into GeoMesa. How do I overcome this?
2. The csv file also shows # for any attribute with a datetime datatype, but if you widen the column the date-time value can be seen. Will this cause a problem in GeoMesa?
For #1, MySQL's export is not automatically converting the Point datatype into text for you. You might need to call a conversion function such as AsWKT to output the geometry as Well Known Text. The WKT format can be used by GeoMesa to read in the Point data.
For #2, I think you'll need to do the same for the date field. Check out the date and time functions.
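A minimal sketch of the idea, assuming the mysql-connector-python driver and placeholder connection details, table, and column names (only the_geom comes from the question); ST_AsText (or AsWKT on older MySQL versions) turns the Point blob into WKT, and DATE_FORMAT renders the datetime as plain text for the csv:

import csv
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()
# Convert the geometry and datetime columns to text inside the query itself.
cur.execute("""
    SELECT id,
           ST_AsText(the_geom) AS the_geom_wkt,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:%i:%s') AS created_at
    FROM my_table
""")

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cur.description])  # header row
    writer.writerows(cur.fetchall())

conn.close()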
How can MapReduce parse a CSV file with 80 columns where each row of the original Excel data spans two to three lines in the CSV? TextInputFormat doesn't work in this case. Does KeyValueTextInputFormat work here?
You can write your own InputFormat and RecordReader which will read multiple lines and send them as a single record to your Mapper.