How to add fields to the Avro schema in Apache NiFi?

I'm able to get Apache NiFi to generate a schema via the CSVReader, and then I can write the schema out to an attribute using ConvertRecord. However, I then need to add fields using UpdateRecord, but the fields are not being added to the flow file or to the schema attribute. I believe this is because the fields are not part of the initially inferred schema. I can't create the schema in the registry because it's being inferred from the file. So how can I add fields to a record when the schema doesn't include those fields?

Are you using InferAvroSchema so you don't have to worry about generating the schema(s), or because you really won't know the schema of the CSV files ahead of time? If the former, send one CSV through, copy the inferred schema into a CSVReader, and add the fields from UpdateRecord to the write schema.
I've written up NIFI-5524 to cover the automation of adding/updating fields in the outgoing schema based on UpdateRecord properties.

Yes, that is because your writer controller service doesn't have the new fields defined in it.
If you are adding new fields, you need to define a new Avro schema, with the additional fields included, in the writer controller service.
Change the Schema Access Strategy to either
Use 'Schema Name' Property (or) Use 'Schema Text' Property
then define your new schema, including the new fields, so that the UpdateRecord processor will add them to the output flowfile.
Please look at this article, where I added ts_tz, current_ts, etc. fields that don't exist in the input data and defined the writer controller service with a new Avro schema that includes all the new and old fields.
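For illustration, a minimal sketch of such a writer schema, assuming the CSV originally inferred only id and name (those two field names are placeholders) and borrowing the ts_tz/current_ts additions mentioned above:

{
  "type": "record",
  "name": "nifi_record",
  "fields": [
    {"name": "id", "type": ["null", "string"]},
    {"name": "name", "type": ["null", "string"]},
    {"name": "ts_tz", "type": ["null", "string"]},
    {"name": "current_ts", "type": ["null", "string"]}
  ]
}

Paste this into the writer's Schema Text property (or register it under a Schema Name); UpdateRecord can then populate /ts_tz and /current_ts even though they are absent from the reader's inferred schema.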

I achieved the same thing by adding columns to the CSV with the ReplaceText processor (this appends the same value to both the header line and the data lines), using the "Line-By-Line" evaluation mode, and then using UpdateRecord to update the values of only the new columns to something meaningful.
No need to know the schema beforehand using this approach.
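A hedged sketch of that ReplaceText configuration (the column name new_col and the regex are illustrative, not from the original post):

Replacement Strategy: Regex Replace
Search Value: (.+)
Replacement Value: $1,new_col
Evaluation Mode: Line-by-Line

Every non-empty line, including the header, gets ,new_col appended, so the header and the data rows share the same placeholder text; UpdateRecord can then overwrite /new_col with something meaningful.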

Related

Apache NiFi: Identifying csv records containing special characters

Using Apache NiFi I need to filter out the records in a csv which have a set of special characters.
As an example if the set of special characters are "FFF". My csv would be
name,age,city
John,23,New York
FFF,45,London
Himsara,18,Adelaide
Then the second record must be taken out of the CSV and put into another CSV. Even if "FFF" is in the city or age column, the whole record must be removed.
Please suggest the processors I need to achieve this. It would also be really helpful if you could list the configurations that need to be changed.
As an alternative, you can use the RouteText processor. It will split the flow file based on a condition. Lines containing FFF will route to the matched relationship, the other lines will route to the unmatched relationship.
RouteText processor setup:
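Since the original screenshot is not reproduced here, a hedged sketch of the configuration it described (property values are assumptions based on the answer above):

Routing Strategy: Route to 'matched' if line matches any condition
Matching Strategy: Contains
Ignore Case: false
contains_special_chars (dynamic property): FFF

Lines containing FFF go to the matched relationship and can be written out as their own CSV; all other lines go to unmatched. Note that the header line is routed like any other line, so the matched output will not include a header unless you add one downstream.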
Use the QueryRecord processor in NiFi and define Record Reader/Writer Avro schemas to read your incoming flowfile.
Then add a new property to the QueryRecord processor with an Apache Calcite SQL query such as
select * from FLOWFILE where name <> 'FFF'
Now use the newly added relationship from the QueryRecord processor for further processing, and NiFi will produce a flowfile containing only the records where name is not equal to 'FFF'.

Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but only sometimes does it recognize the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first rows of my data are always header rows. Some of the tables have many columns (over 30), and I do not want to mess around with schema syntax, because BigQuery always bombs with an uninformative error message when something (I have no idea what) is wrong with the schema.
So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?
I would recommend doing two things here:
Preprocess your file and store the final layout of the file sans the first row, i.e. the header row.
BQ load accepts an additional parameter in the form of a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter (see the sketch below). This gives you the flexibility to alter the schema at any point in time, if required.
Allowing BQ to autodetect the schema is not advised.
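A hedged sketch of that second step (dataset, table, bucket, and column names are invented for illustration). First a schema file, e.g. myschema.json:

[
  {"name": "name", "type": "STRING"},
  {"name": "age", "type": "INTEGER"},
  {"name": "city", "type": "STRING"}
]

then the load, skipping the header row and passing the schema file as the last argument:

bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable gs://mybucket/data.csv ./myschema.json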
Schema auto detection in BigQuery should be able to detect the first row of your CSV file as column names in most cases. One of the cases for which column name detection fails is when you have similar data types all over your CSV file. For instance, BigQuery schema auto detect would not be able to detect header names for the following file since every field is a String.
headerA, headerB
row1a, row1b
row2a, row2b
row3a, row3b
The "Header rows to skip" option in the UI would not help fixing this shortcoming of schema auto detection in BigQuery.
If you are following the GCP documentation for Loading CSV Data from Google Cloud Storage you have the option to skip n number of rows:
(Optional) An integer indicating the number of header rows in the source data.
The option is called "Header rows to skip" in the Web UI, but it's also available as a CLI flag (--skip_leading_rows) and as BigQuery API property (skipLeadingRows)
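For example, a hedged CLI sketch (dataset, table, and bucket names are placeholders):

bq load --source_format=CSV --autodetect --skip_leading_rows=1 mydataset.mytable gs://mybucket/data.csv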
Yes, you can modify the existing schema (a.k.a. the DDL) using bq show:
bq show --schema --format=prettyjson project_id:dataset.table > myschema.json
Note that this will result in you creating a new BQ table altogether.
I have a workaround for getting a usable schema when loading CSV into BigQuery: just edit the values in the first data row. For example:
weight|total|summary
2|4|just string
2.3|89.5|just string
If you use the schema generated by BigQuery, the weight and total fields will be defined as INT64, and the insert will then fail on the second row. So just edit the first row like this:
weight|total|summary
'2'|'4'|just string
2.3|89.5|just string
Now the weight and total fields are detected as STRING, and if you want to aggregate you can simply cast the values in BigQuery.
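For example, a hedged sketch of casting at query time (the table name is a placeholder; SAFE_CAST returns NULL for values, such as the quoted first row, that don't parse as numbers):

SELECT SUM(SAFE_CAST(weight AS FLOAT64)) AS total_weight
FROM `mydataset.mytable`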
cheers
If the 'column name' row and the data rows have the same data type all over the CSV file, BigQuery misinterprets the column names as data and generates its own column names. I couldn't find a technical way to solve this, so I took another approach.
If the data is not sensitive, add another column whose header is a string and whose values are all numbers, e.g. a column named 'Test' with every value set to 0. Upload the file to BigQuery and use this query to drop the helper column:
ALTER TABLE <table name> DROP COLUMN <Test>
Change <table name> and <Test> according to your table.

Ingesting CSV data to MySQL DB in NiFi

I am trying to ingest the data from my CSV file into a MySQL DB. My CSV file has a field called 'MeasurementTime' with values such as 2018-06-27 11:14.50. My flow is treating that field as a string, and thus PutSQL is giving an error. I am using the same template as per this Template, but not using the InferAvroSchema processor, as I already have a pre-defined schema. This is the website: Website link
How can I pass a datetime field into my MySQL DB with the correct data type and not as a string? What setting should I change?
Thank you
With PutDatabaseRecord you can avoid this whole chain of transformations and over-engineering. The flow would be:
GetFile -> PutDatabaseRecord
Configure PutDatabaseRecord with its Record Reader property set to a CSVReader, set the CSVReader's Schema Registry to an AvroSchemaRegistry, and provide a valid schema. You can find a template for a sample flow here.
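A hedged sketch of such a schema (the record name and the second field are assumptions; only MeasurementTime comes from the question). Giving MeasurementTime a timestamp logical type lets the record-based flow hand it to MySQL as a DATETIME instead of a string:

{
  "type": "record",
  "name": "measurement",
  "fields": [
    {"name": "MeasurementTime", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "value", "type": ["null", "string"]}
  ]
}

You will likely also need to set the CSVReader's Timestamp Format property to a pattern matching the values in your file so the text can be parsed into that logical type.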

pyspark: Specified data types in schema while creating dataframe don't reflect in data

I'm creating a dataframe in Spark and I've defined the schema as follows:
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

SCHEMA = StructType([StructField('s3_location', StringType()),
                     StructField('partition_date', StringType()),
                     StructField('table_name', StringType()),
                     StructField('column_name', StringType()),
                     StructField('data_type', StringType()),
                     StructField('number_of_nulls', LongType()),
                     StructField('min', DoubleType()),
                     StructField('max', DoubleType()),
                     StructField('mean', DoubleType()),
                     StructField('variance', DoubleType()),
                     StructField('max_length', LongType())])
I have a bunch of rows that follow this exact schema, and I'm creating the dataframe as follows:
DF = SPARK.createDataFrame(ROWS, schema=SCHEMA)
Then I write this dataframe to a CSV file in AWS S3:
DF.repartition(1).write.mode('append').partitionBy('partition_date').csv(SAVE_PATH, header=True)
This process is successful and creates the CSV file in S3. Now, I crawl this S3 location in AWS Glue and it infers the schema differently. All the fields I specified as DoubleType() are inferred as string instead. So if I want to run any aggregate functions on these values using something like QuickSight, I can't.
Why is this happening? Is there a way to fix it?
A CSV is an untyped file which contains text - i.e. strings.
If you tell AWS Glue that the table contains numeric values, then it will read those values as numbers, but the AWS Glue crawler isn't recognizing your numeric values as such. This could be because you have a header row, or it could be because the columns are quoted, or because you didn't specify the types explicitly.
If you manually create the table in Glue you'll be able to specify the data type for columns. Here's how you can do that from the Athena console.
Click the vertical ellipsis next to your table name and select Generate Create Table DDL.
Using the result from this query, modify data type of your numeric column in the CREATE TABLE query - you might use FLOAT, DOUBLE, or DECIMAL.
Drop the table (e.g. DROP TABLE myschema.mytable;)
Run the modified CREATE TABLE script (a sketch follows below). It's useful to keep all the table properties that Glue initially added, so that any downstream process continues to recognize the table in the same manner.
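A hedged sketch of what the modified DDL might look like, using the column names from the question (the database name, location, and SerDe details are placeholders; keep the SerDe and table properties from the DDL Glue actually generated):

CREATE EXTERNAL TABLE mydb.column_stats (
  s3_location string,
  table_name string,
  column_name string,
  data_type string,
  number_of_nulls bigint,
  `min` double,
  `max` double,
  mean double,
  variance double,
  max_length bigint
)
PARTITIONED BY (partition_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/path/'
TBLPROPERTIES ('skip.header.line.count'='1');

Because the table is partitioned, run MSCK REPAIR TABLE mydb.column_stats; afterwards so the existing partitions are registered.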
Can you include data types in your file and avoid having to tell Glue about data types? Yes! Use one of Glue's more structured file formats, such as Parquet (Spark's favourite) or ORC.
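For example, a hedged sketch reusing DF and SAVE_PATH from the question, with only the output format changed (Parquet embeds the schema, so Glue picks up the DoubleType/LongType columns directly):

DF.repartition(1).write.mode('append').partitionBy('partition_date').parquet(SAVE_PATH)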
When importing CSV files, the crawler auto-assigns the column names and types. This can be fixed by:
Edit and save the schema.
Edit the table schema in the Glue console after the first crawl, and make sure to save the schema with the necessary data types (changing the numeric columns to double).
Change the crawler settings.
Since you have specified that the schema will not change in future runs, update/edit your crawler's output configuration options (optional) before the second run (after fixing the schema), and select "Ignore the change and don't modify the data catalog".
Run the crawler again. It won't show tables being updated or added, but your data gets populated in the required format.

write a spark Dataset to json with all keys in the schema, including null columns

I am writing a dataset to json using:
ds.coalesce(1).write.format("json").option("nullValue",null).save("project/src/test/resources")
For records that have columns with null values, the json document does not write that key at all.
Is there a way to enforce null value keys to the json output?
This is needed because I use this JSON to read it into another dataset (in a test case), and I cannot enforce a schema if some documents do not have all the keys from the case class (I am reading it by putting the JSON file under the resources folder and transforming it into a dataset via RDD[String], as explained here: https://databaseline.bitbucket.io/a-quickie-on-reading-json-resource-files-in-apache-spark/)
I agree with #philantrovert.
ds.na.fill("")
.coalesce(1)
.write
.format("json")
.save("project/src/test/resources")
Since DataSets are immutable you are not altering the data in ds and you can process it (complete with null values and all) in any following code. You are simply replacing null values with an empty string in the saved file.
Since PySpark 3, one can use the ignoreNullFields option when writing to a JSON file.
spark_dataframe.write.json(output_path, ignoreNullFields=False)
Pyspark docs: https://spark.apache.org/docs/3.1.1/api/python/_modules/pyspark/sql/readwriter.html#DataFrameWriter.json