Question on writing CSV files to Parquet in PySpark

I have encountered issues when converting CSV files to Parquet in PySpark. When multiple files with the same schema are converted, the resulting Parquet files do not share the same schema, because a column of numbers is sometimes read as float and sometimes as integer, and so on. There also seems to be an issue with the order of columns: when dataframes that have the same columns arranged in a different order are written to Parquet, the resulting Parquet files cannot be loaded in the same statement.
How can I write dataframes to Parquet so that all columns are stored as string type? How should I handle the order of columns? Should I rearrange the columns into the same order for all the dataframes before writing to Parquet?

If you want to sort the columns and cast them all to string type, you can do:
from pyspark.sql import functions as F
out_df = df.select([F.col(c).cast('string') for c in sorted(df.columns)])
out_df.write.parquet(...)
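For a fuller picture, here is a minimal sketch of the whole round trip; the paths under /data/ and the helper name csv_to_parquet are placeholders, not anything from the question. Reading the CSV without inferSchema keeps every column as string, and sorting the column names gives every output file the same layout, so the resulting Parquet files can be loaded together afterwards:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def csv_to_parquet(src_path, dst_path):
    # inferSchema is off by default, so every column arrives as string
    df = spark.read.option("header", True).csv(src_path)
    # cast explicitly and sort the column names so every output file
    # ends up with the same all-string schema in the same order
    out_df = df.select([F.col(c).cast("string") for c in sorted(df.columns)])
    out_df.write.mode("append").parquet(dst_path)

# hypothetical input files, all written into the same Parquet directory
csv_to_parquet("/data/in/file1.csv", "/data/out/table")
csv_to_parquet("/data/in/file2.csv", "/data/out/table")

# because the schemas now match, one statement loads everything
combined = spark.read.parquet("/data/out/table")

Sorting is just one way to fix a column order; any deterministic order applied to every dataframe before writing works equally well.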

Related

Unpack JSON file with different separators

I have a JSON file separating data entries with multiple, different separators. How can I unpack the column with lots of string values into separate columns using Excel? (This is what it looks like: screenshot.)
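If Excel is not a hard requirement, one hedged alternative is to split the column in pandas on a regular expression listing the separators. Everything in this sketch is an assumption: the column name value, the sample data, and the separators semicolon, comma and pipe.

import pandas as pd

# stand-in data; replace with the real file and column
df = pd.DataFrame({"value": ["a;b|c", "d,e;f"]})
# one new column per piece, splitting on any of the assumed separators
parts = df["value"].str.split(r"[;,|]", expand=True, regex=True)
out = df.join(parts.add_prefix("value_"))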

Storing PySpark data in CSV creates problems

I am trying to store my PySpark output as CSV, but when I save it, the output does not look the same. I have the output in this form:
When I try to convert this to CSV, the Concat tasks column does not show up properly because of the size of the data. Given my requirements, I need to store the data in CSV format. Is there a way around this? (P.S. I also see columns showing nonsensical values, even though the PySpark output shows the correct values.)
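One thing worth trying, offered only as a hedged sketch rather than a confirmed fix: let Spark write the CSV itself with explicit header and quoting options, so that long or comma-bearing values such as the Concat tasks column stay inside a single quoted field. The example dataframe and the output path are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# stand-in for the real output dataframe from the question
df = spark.createDataFrame([(1, "a very long, concatenated value")], ["id", "Concat tasks"])

(df.write
   .option("header", True)
   .option("quoteAll", True)   # quote every field so embedded commas/newlines do not split columns
   .option("escape", '"')      # escape embedded quotes the way most CSV readers expect
   .mode("overwrite")
   .csv("/mnt/output/concat_tasks_csv"))

How the values then look also depends on the tool used to open the CSV; Excel, for example, truncates any cell past 32,767 characters.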

TPT12109: Export Operator does not support JSON column. Is there a way to export results involving a JSON column other than BTEQ export?

I have a table with 2 million records. I am trying to dump the contents of the table in JSON format. The issue is that TPT export does not allow JSON columns, and a BTEQ export would take a long time. Is there any way to handle this export in a more optimized way?
Your help is really appreciated.
If the JSON values are not too large, you could potentially CAST them in your SELECT as VARCHAR(64000) CHARACTER SET LATIN, or VARCHAR(32000) CHARACTER SET UNICODE if you have non-LATIN characters, and export them in-line.
Otherwise each JSON object has to be transferred DEFERRED BY NAME, where each object is stored in a separate file and the corresponding filename is stored in the output row. In that case you would need to use BTEQ, the TPT SQL Selector operator, or write your own application.
You can do one thing: load the JSON-formatted rows into another Teradata table.
Keep that table's column as VARCHAR and then do a TPT export of that column/table.
It should work.
INSERT INTO test (col1, col2, ..., jsn_obj)
SELECT col1, col2, ...,
       JSON_Compose(<columns you want to include in your JSON object>)
FROM <schemaname>.<tablename>;

How to read multiple CSV files with different schemas in PySpark?

I have different CSV files kept in subfolders of a given folder; some of them have one format for the column names and some have another.
april_df = spark.read.option("header", True).option("inferSchema", True).csv('/mnt/range/2018_04_28_00_11_11/')
The above command only picks up one format and ignores the other. Is there any quick way to handle this via a parameter, like mergeSchema for Parquet?
The format of some files is:
id, f_facing, l_facing, r_facing, remark
and of others:
id, f_f, l_f, r_f, remark
There is also a chance that some columns will be missing in the future, so I need a robust way to handle this.
There is no such quick option. Either the missing columns should be filled with null in the pipeline, or you will have to specify the schema before you import the files. But if you have an idea of which columns might be missing in the future, you could branch on the length of df.columns and pick the schema accordingly, although that seems tedious; a rough sketch of the per-folder approach follows below.
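For illustration only, a minimal sketch under the assumption that the two layouts listed in the question are the only ones; the second subfolder path is a made-up example. Each subfolder is read on its own, the short column names are renamed onto the long ones, and the frames are unioned by name:

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# the first path comes from the question, the second is a hypothetical sibling folder
paths = ["/mnt/range/2018_04_28_00_11_11/", "/mnt/range/2018_04_29_00_11_11/"]

# map the short column names onto the long ones
rename_map = {"f_f": "f_facing", "l_f": "l_facing", "r_f": "r_facing"}

frames = []
for p in paths:
    df = spark.read.option("header", True).csv(p)
    for old, new in rename_map.items():
        if old in df.columns:
            df = df.withColumnRenamed(old, new)
    frames.append(df)

# union by column name; allowMissingColumns (Spark 3.1+) fills absent columns with null
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)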

Apache Drill Query PostgreSQL Json

I am trying to query a jsonb field from PostgreSQL in Drill and read it as if it were coming from a JSON storage type, but I am running into trouble. I can convert from text to JSON, or at least I think I can, but I cannot seem to query the resulting JSON object. My goal is to avoid reading through millions of uneven JSON objects from PostgreSQL, and to perform joins and other operations with text files such as CSV and XML files. Is there a way to query the text field as if it were coming from a JSON storage type without writing large files to disk?
The goal is to generate results implicitly, which neither PostgreSQL nor Pentaho does, and to integrate these data sets with others of any format.
Attempt:
SELECT * FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Sample Result:
[B@7106dd
Attempt to access an existing field that should be present in any of the JSON objects:
SELECT data[field] FROM (SELECT convert_to(json_field,'JSON') as data FROM postgres.mytable) as q1
Result:
null
Attempting to do anything with jsonb results in a Null Pointer Error.