How to store a CSV into a PCollection in Apache Beam - csv

I have CSV data stored at some location. I want to read this data and store it in a PCollection, but I don't know which element type the PCollection should have or how the data should be stored in it.

A fairly straightforward way is to read the file as plain strings (e.g. TextIO in the Java SDK returns a PCollection<String>) and then parse each line with your own PTransform, which returns a PCollection<YourPOJO> in the required format.
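The per-line parsing step described above can be sketched in Python as a plain function. This is only an illustration and not the Java SDK approach from the answer: the Person record and its fields are hypothetical, and the input is assumed to hold exactly two columns per line.

```python
import csv
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record type; the fields are assumptions for illustration.
    name: str
    age: int


def parse_line(line: str) -> Person:
    """Parse one CSV line into a Person, honouring quoted fields."""
    name, age = next(csv.reader([line]))
    return Person(name=name.strip(), age=int(age))


# In a Beam Python pipeline this function would typically be applied as:
#   lines = pipeline | beam.io.ReadFromText(path, skip_header_lines=1)
#   people = lines | beam.Map(parse_line)
print(parse_line('"Doe, Jane",42'))
```

Using the csv module instead of a plain split(",") keeps quoted fields that contain commas intact.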

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should. I have tried reading each JSON file into a dataframe after it has been downloaded, and then attempted to union that dataframe to a master dataframe, so that I end up with one big dataframe containing all 88 JSON files.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so that you can do a unionByName with allowMissingColumns=True.
Something like:
from pyspark.sql.types import *

# Schema shared by all JSON files
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# Empty dataframe with the fixed schema
output = spark.createDataFrame([], my_schema)
df_json = spark.read.[...your JSON file...]
# unionByName returns a new dataframe, so keep the result
output = output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps

Export as JSON using BigQueryToCloudStorageOperator

When I use the BigQuery console manually, I can see that the 3 options when exporting a table to GCS are CSV, JSON (Newline delimited), and Avro.
With Airflow, when using the BigQueryToCloudStorageOperator operator, what is the correct value to pass to export_format in order to transfer the data to GCS as JSON (Newline delimited)? Is it simply JSON? All examples I've seen online for BigQueryToCloudStorageOperator use export_format='CSV', never for JSON, so I'm not sure what the correct value here is. Our use case needs JSON, since the 2nd task in our DAG (after transferring data to GCS) is to then load that data from GCS into our MongoDB Cluster with mongoimport.
I found that the value export_format='NEWLINE_DELIMITED_JSON' is required. I worked this out from the documentation at https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationextract, by referring to the values listed for destinationFormat.
According to the BigQuery documentation the three possible formats to which you can export BigQuery query results are: CSV, JSON, and Avro (and this is compatible with the UI drop-down menu).
I would try with export_format='JSON' as you already proposed.

Is format always json when SELECTing from stage?

Snowflake supports multiple file types by creating a FILE_FORMAT (avro, json, csv etc).
Now I have tested SELECTing from snowflake stage (s3) both:
*.avro files (generated from nifi processor batching 10k source oracle table).
*.json files (json per line).
When I run SELECT $1 FROM @myStg, Snowflake expands the result into as many rows as there are records in the avro or json files (cool), but the $1 variant is in JSON format in both cases. So now I wonder: whatever Snowflake file_format we use, do records always arrive as JSON in the $1 variant?
I haven't tested csv or other Snowflake file_formats.
Or I wonder whether I get JSON from the avro files (from the Oracle table) because the NiFi processor creates avro files that internally use JSON format.
Maybe I'm confusing things here. I know avro files contain both:
avro schema - language similar to json key/value.
compressed data (binary).
Thanks,
Emanuel O.
I tried with CSV. With CSV, Snowflake parses each record in the file individually.
With JSON, it treats each complete JSON document as one record, so it is displayed in JSON format.

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter",";")
    .(other options...)
    .load(...)
    .write()
    .parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in different Locale or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in Spark code and one nasty thing that seems to be causing issue is in CSVInferSchema.scala line 268 (spark 2.1.0) - the parser enforces US formatting rather than e.g. rely on the Locale set for the JVM, or allowing configuring this somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(stringOnlySchema)
    .option("delimiter",";")
    .(other options...)
    .load(...)
    .javaRDD()
    .map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
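The logic of such a conversion function might look roughly like the sketch below. It is written in Python rather than Java, purely to illustrate the idea. The names parse_german_number and convert_row are hypothetical, and it assumes that "." only ever appears as a thousands separator and "," as the decimal mark.

```python
from decimal import Decimal


def parse_german_number(text: str) -> Decimal:
    """Convert German-formatted text such as '1.234,56' to a Decimal."""
    # Assumption: '.' is only a thousands separator and ',' the decimal mark.
    normalized = text.strip().replace(".", "").replace(",", ".")
    return Decimal(normalized)


def convert_row(row, numeric_indexes):
    """Return a new row with the fields at numeric_indexes parsed as numbers."""
    return [
        parse_german_number(value) if i in numeric_indexes else value
        for i, value in enumerate(row)
    ]


print(convert_row(["A-17", "123,01", "1.234,56"], {1, 2}))
```

Decimal is used here instead of float so that values like 123,01 are kept exactly, which may or may not matter for the target Parquet schema.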

What is the efficient way of reading and processing very large CSV file in scala (> 1GB)?

In Scala, how do you efficiently (in terms of memory consumption and performance) read a very large CSV file? Is it fast enough to just stream it line by line and process each line at each iteration?
What I need to do with the CSV data:
In my application a single line in the CSV file is treated as one record, and all the records of the CSV file are to be converted into XML elements and JSON and saved into other files in XML and JSON formats.
So the question is: while reading the CSV file, is it a good idea to read it in chunks and hand each chunk to another thread, which converts those CSV records into XML/JSON and writes the result to a file? If yes, how?
The data in the CSV can be anything; there is no restriction on the type of the data. It can be numeric, big decimal, string or date. Is there an easy way to handle these different data types before saving to XML? Or don't we need to take care of types?
Many Thanks
If this is not a one-time task, create a program that breaks this 1 GB file into smaller files. Then provide those new files as input to separate futures.
Each future reads one file, and the futures resolve in the order of the file content: File4 resolves after File3, which resolves after File2, which resolves after File1.
As the file has no key-value pairs or hierarchical data structure, I suggest just reading every value as a string.
Hope this helps.
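For comparison, here is a minimal sketch of the plain line-by-line streaming approach raised in the question. It is in Python rather than Scala, and is not the file-splitting approach suggested above. The function name convert_csv and the sample columns are hypothetical, all values are kept as strings, and header names are assumed to be valid XML element names.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def convert_csv(src, json_out, xml_out):
    """Stream CSV rows from src, writing one JSON object and one XML element per row."""
    count = 0
    xml_out.write("<records>\n")
    for row in csv.DictReader(src):  # reads lazily, one row at a time
        json_out.write(json.dumps(row) + "\n")
        record = ET.Element("record")
        for key, value in row.items():
            # Assumption: header names are valid XML element names.
            ET.SubElement(record, key).text = value
        xml_out.write(ET.tostring(record, encoding="unicode") + "\n")
        count += 1
    xml_out.write("</records>\n")
    return count


# In-memory streams stand in for real files opened with open(...).
src = io.StringIO("id,amount\n1,10.5\n2,7\n")
json_out, xml_out = io.StringIO(), io.StringIO()
rows_written = convert_csv(src, json_out, xml_out)
```

Only one row is held in memory at a time, so memory use stays flat regardless of file size. Whether this is fast enough would need to be measured on the real data.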