I am trying to ingest the data from my CSV file into a MySQL database. My CSV file has a field called 'MeasurementTime' with values such as 2018-06-27 11:14.50. My flow treats that field as a string, so PutSQL fails with an error. I am using the same flow as in this template from the linked website, but without the InferAvroSchema processor, because I already have a predefined schema.
How can I pass a datetime field into my MySQL database with the correct data type rather than as a string? Which setting should I change?
Thank you
With PutDatabaseRecord you can avoid all this chain of transformations and overengineering. The flow would be like:
GetFile -> PutDatabaseRecord
Configure PutDatabaseRecord with its Record Reader property set to a CSVReader, then configure the CSVReader to use an AvroSchemaRegistry as its Schema Registry and provide a valid schema. You can find the template for a sample flow here.
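As a rough sketch of what such a schema could look like, the Python snippet below builds an Avro record schema that declares MeasurementTime as a timestamp instead of a string, and checks that the sample value from the question parses with a matching pattern. The record name `Measurement` and the single-field layout are assumptions for illustration. The Java-style pattern is what I would expect to put in the reader's timestamp format setting, so treat it as a starting point rather than a confirmed configuration.

```python
import json
from datetime import datetime

# Hypothetical schema: only the MeasurementTime field name comes from the question.
schema = {
    "type": "record",
    "name": "Measurement",
    "fields": [
        {
            "name": "MeasurementTime",
            # Avro logical type for a timestamp stored as epoch milliseconds
            "type": {"type": "long", "logicalType": "timestamp-millis"},
        },
    ],
}
schema_text = json.dumps(schema, indent=2)

# Assumed Java-style pattern for the reader, and its Python equivalent.
JAVA_PATTERN = "yyyy-MM-dd HH:mm.ss"
PY_PATTERN = "%Y-%m-%d %H:%M.%S"


def parse_measurement_time(value: str) -> datetime:
    """Parse a value such as '2018-06-27 11:14.50' (note the dot before the seconds)."""
    return datetime.strptime(value, PY_PATTERN)


parsed = parse_measurement_time("2018-06-27 11:14.50")
print(schema_text)
print(parsed.isoformat())
```

The unusual dot between minutes and seconds is the part most likely to trip up a default timestamp parser, which is why the pattern is spelled out explicitly.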
I am exporting F&O D365 data to ADLS in CSV format. Now I am trying to read the CSV stored in ADLS and copy it into an Azure Synapse dedicated SQL pool table using Azure Data Factory. I can create the pipeline, and it works for a few tables without any issue, but it fails for one table (salesline) because of a mismatch in the number of columns.
Below is a sample of the CSV. There is no header row because the file is exported from the F&O system, and the column names are stored in the salesline.CDM.json file.
5653064010,,,"2022-06-03T20:07:38.7122447Z",5653064010,"B775-92"
5653064011,,,"2022-06-03T20:07:38.7122447Z",5653064011,"Small Parcel"
5653064012,,,"2022-06-03T20:07:38.7122447Z",5653064012,"somedata"
5653064013,,,"2022-06-03T20:07:38.7122447Z",5653064013,"someotherdata",,,,test1, test2
5653064014,,,"2022-06-03T20:07:38.7122447Z",5653064014,"parcel"
5653064016,,,"2022-06-03T20:07:38.7122447Z",5653064016,"B775-92",,,,,,test3
I created an ADF pipeline that uses a Copy data activity to copy the data from ADLS (CSV) to the Synapse SQL table, but I am getting the error below.
Operation on target Copy_hs1 failed: ErrorCode=DelimitedTextMoreColumnsThanDefined,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Error found when processing 'Csv/Tsv Format Text' source 'SALESLINE_00001.csv' with row number 4: found more columns than expected column count 6.,Source=Microsoft.DataTransfer.Common,'
The column mapping looks like this: because the first row of the CSV has 6 columns, only 6 columns appear when importing the schema.
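To see which rows trigger the error before running the pipeline, a small check outside ADF can help. The sketch below is plain Python and not part of Data Factory. It reads the sample rows from the question and reports every row whose column count differs from the expected 6. The function name `find_ragged_rows` is made up for this example.

```python
import csv
import io

# Sample rows copied from the question (no header row).
SAMPLE = '''5653064010,,,"2022-06-03T20:07:38.7122447Z",5653064010,"B775-92"
5653064011,,,"2022-06-03T20:07:38.7122447Z",5653064011,"Small Parcel"
5653064012,,,"2022-06-03T20:07:38.7122447Z",5653064012,"somedata"
5653064013,,,"2022-06-03T20:07:38.7122447Z",5653064013,"someotherdata",,,,test1, test2
5653064014,,,"2022-06-03T20:07:38.7122447Z",5653064014,"parcel"
5653064016,,,"2022-06-03T20:07:38.7122447Z",5653064016,"B775-92",,,,,,test3
'''


def find_ragged_rows(text: str, expected: int) -> list[tuple[int, int]]:
    """Return (row_number, column_count) for rows that do not have `expected` columns."""
    reader = csv.reader(io.StringIO(text))
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected
    ]


ragged = find_ragged_rows(SAMPLE, expected=6)
for number, count in ragged:
    print(f"row {number}: {count} columns")
```

Row 4 is the first one flagged, which matches the row number in the error message.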
I have repro’d with your sample data and got the same error while copying the file using the copy data activity.
Alternatively, I have tried to copy the file using data flow and was able to load the data without any errors.
Source file:
Data flow:
Source dataset: only the first 6 columns are read as the first row contains only 6 columns in the file.
Source transformation: connect source dataset in source transformation.
Source preview:
Sink transformation: Connect sink to synapse dataset.
Settings:
Mappings:
Sink output:
After running the data flow, data is loaded to the sink synapse table.
Converting my CSV to XLSX solved this problem for me in the ADF Copy activity.
1. In the Copy data settings, set "Fault Tolerance" to "Skip Incompatible rows".
2. In the dataset connection settings, set the escape character to double quote.
I have a table in Cassandra, and one of its columns holds values in JSON format. I am using DataStax DevCenter to query the database, and when I export the result to CSV, the JSON value is split into separate columns wherever there is a comma (,). I also tried exporting from the command prompt without specifying any delimiter, and that broke the JSON value too.
Is there any way to achieve this?
Use the COPY command to export the table as a whole with a different delimiter.
For example:
COPY keyspace.your_table (your_id, your_col) TO 'your_table.csv' WITH DELIMITER='|';
Then filter on this data programmatically in whatever way you want.
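A minimal sketch of that programmatic step in Python might look like this. It assumes the export has two columns (`your_id`, `your_col`) separated by `|`, and that embedded double quotes are escaped with a backslash, which I believe is the cqlsh default. Adjust the reader options if your file looks different. The sample row is invented.

```python
import csv
import io
import json

# Invented sample of what a pipe-delimited export row might look like.
EXPORT = r'1|"{\"name\": \"joe\", \"tags\": [\"a\", \"b\"]}"' + "\n"


def read_export(text: str, delimiter: str = "|") -> list[tuple[str, dict]]:
    """Parse each row and decode the JSON column back into a Python object."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        escapechar="\\",    # assumption: quotes inside fields are backslash-escaped
        doublequote=False,
    )
    return [(your_id, json.loads(your_col)) for your_id, your_col in reader]


rows = read_export(EXPORT)
# Example filter: keep only rows whose JSON document has name == "joe".
joes = [row_id for row_id, doc in rows if doc.get("name") == "joe"]
print(rows)
print(joes)
```

Because the delimiter is no longer a comma, the commas inside the JSON value stay in one field and the value can be decoded as a whole.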
I'm able to get Apache NiFi to generate a schema via the CSVReader, and I can then write the schema out to an attribute using ConvertRecord. However, I then need to add fields using UpdateRecord, and the fields are not being added to the flow file or to the schema attribute. I believe this is because the fields are not part of the initially inferred schema. I can't create the schema in the registry because it's being inferred from the file. So how can I add fields to a record when the schema doesn't include them?
Are you using InferAvroSchema to not have to worry about generating the schema(s), or because you really will not know the schema of the CSV files? If the former, then send one CSV through, then copy the inferred schema into a CSVReader, and add the fields from UpdateRecord into the write schema.
I've written up NIFI-5524 to cover the automation of adding/updating fields in the outgoing schema based on UpdateRecord properties.
Yes, that is because your writer controller service doesn't have the new fields defined in it.
If you are adding new fields, you need to define a new Avro schema that includes them in the writer controller service.
Change the Schema Access Strategy to either
Use 'Schema Name' Property (or) Use 'Schema Text' Property
and then define your new schema, including the new fields, so that the UpdateRecord processor adds them to the output flowfile.
Please look at this article, in which I added fields such as ts_tz and current_ts that don't exist in the input data, and defined the writer controller service with a new Avro schema that includes all the new and old fields.
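To make the "new schema with the additional fields" step concrete, here is a small Python sketch that takes an inferred Avro schema and appends nullable fields to it, producing text you could paste into the writer's schema. The inferred schema shown (`id`, `name`) and the helper name `add_nullable_fields` are assumptions for illustration. Only the field names ts_tz and current_ts come from the answer above.

```python
import copy
import json

# Hypothetical schema as it might come out of inference for a two-column CSV.
inferred = {
    "type": "record",
    "name": "inferred_record",
    "fields": [
        {"name": "id", "type": ["null", "string"]},
        {"name": "name", "type": ["null", "string"]},
    ],
}


def add_nullable_fields(schema: dict, names: list[str], avro_type: str = "string") -> dict:
    """Return a copy of `schema` with each missing name added as a nullable field."""
    result = copy.deepcopy(schema)
    existing = {field["name"] for field in result["fields"]}
    for name in names:
        if name not in existing:
            result["fields"].append(
                {"name": name, "type": ["null", avro_type], "default": None}
            )
    return result


write_schema = add_nullable_fields(inferred, ["ts_tz", "current_ts"])
print(json.dumps(write_schema, indent=2))
```

Making the new fields nullable with a null default keeps records valid even before UpdateRecord has filled them in.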
I achieved the same by adding columns to the CSV with a ReplaceText processor in "Line-By-Line" mode (this adds the same value to the header and to every row), and then using UpdateRecord to set only the new columns' values to something meaningful.
No need to know the schema beforehand using this approach.
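The effect of those two steps can be imitated in a few lines of Python, which may help when reasoning about what the flow does to the file. This is not NiFi code. The function names are made up, and the naive comma splitting assumes no quoted commas in the data.

```python
def append_column(csv_text: str, value: str) -> str:
    """Append the same literal to every line, header included (step 1)."""
    return "\n".join(f"{line},{value}" for line in csv_text.splitlines())


def set_column(csv_text: str, column: str, new_value: str) -> str:
    """Overwrite one column in the data rows only, leaving the header alone (step 2)."""
    header, *rows = csv_text.splitlines()
    index = header.split(",").index(column)
    out = [header]
    for row in rows:
        cells = row.split(",")
        cells[index] = new_value
        out.append(",".join(cells))
    return "\n".join(out)


original = "id,name\n1,joe\n2,ann"
widened = append_column(original, "source")       # header and rows all gain "source"
final = set_column(widened, "source", "fileA")    # rows get a meaningful value
print(final)
```

After the first step the header already names the new column, so a reader that infers the schema from the header picks it up without a predefined schema.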
I have JSON files totaling approximately 500 TB, and I have loaded the complete set into a Hive data warehouse.
How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?
The client wants us to validate the JSON data: is the data loaded into Hive correct or not? Is anything missing? If so, which field?
Please help.
How is your data stored in the Hive tables?
One option is to create a Hive UDF that receives the JSON string, validates the data, and returns a string containing the error message, or an empty string if the JSON is well formed.
Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html
With the Hive UDF in place, you can execute queries like:
select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";
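The validation logic such a UDF would wrap can be prototyped first in Python. The sketch below only mirrors the contract described above (empty string when the input is fine, an error message otherwise). It is not a Hive UDF, and the `required` parameter for reporting missing fields is an addition assumed for illustration.

```python
import json


def validate_json(text, required=()):
    """Return "" if `text` is a well-formed JSON object with all required fields,
    otherwise a short message describing the problem."""
    try:
        doc = json.loads(text)
    except (TypeError, ValueError) as exc:
        # ValueError covers json.JSONDecodeError; TypeError covers None input.
        return f"malformed JSON: {exc}"
    if not isinstance(doc, dict):
        return "top-level value is not an object"
    missing = [name for name in required if name not in doc]
    if missing:
        return "missing fields: " + ", ".join(missing)
    return ""


print(repr(validate_json('{"id": 1, "name": "joe"}', required=("id", "name"))))
print(repr(validate_json('{"id": 1}', required=("id", "name"))))
print(repr(validate_json('{"id": ')))
```

Returning the names of the missing fields addresses the "which field was it?" part of the question, and the same checks can then be ported to the UDF.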
Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.
Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?
Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.
First, you would infer the schema of your JSON:
kite-dataset json-schema sample-file.json -o schema.avsc
Then you can use that file to create a Parquet Hive table:
kite-dataset create mytable --schema schema.avsc --format parquet
And finally, you can load your JSON into the dataset.
kite-dataset json-import sample-file.json mytable
You can also import an entire directory stored in HDFS. In that case, Kite will use a MapReduce job to do the import.
You can actually use Drill itself to create a parquet file from the output of any query.
create table student_parquet as select * from `student.json`;
The above line should be good enough. Drill interprets the types based on the data in the fields. You can substitute your own query and create a parquet file.
To add to @rahul's answer: you can use Drill to do this, but I needed to add more to the query to get it working out of the box.
create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t
I needed to specify the storage plugin (dfs). Its "root" config can read from the whole disk but is not writable, whereas the tmp config (dfs.tmp) is writable and writes to /tmp, so I wrote there.
But the problem is that if the JSON is nested, or perhaps contains unusual characters, I get a cryptic error:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:
If I have a structure that looks like members: {id:123, name:"joe"} I would have to change the select to
select members.id as members_id, members.name as members_name
or
select members.id as `members.id`, members.name as `members.name`
to get it to work.
I assume the reason is that parquet is a "column" store so you need columns. JSON isn't by default so you need to convert it.
The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.
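One way to reduce the manual work is to generate the select list from a sample document. The Python sketch below walks a nested JSON object and emits one `path as alias` column per leaf, using the underscore naming from the example above. The function names are made up, the table alias `t` prefix is an assumption about how the query needs to be written, and fields that do not appear in the sample document will be missed.

```python
import json


def leaf_paths(value, prefix=()):
    """Yield the key path (as a tuple) of every non-object leaf in a nested dict."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from leaf_paths(child, prefix + (key,))
    else:
        yield prefix


def build_select(sample: dict, table: str, alias: str = "t") -> str:
    """Build a flattening select such as 't.members.id as members_id'."""
    columns = [
        f"{alias}.{'.'.join(path)} as {'_'.join(path)}"
        for path in leaf_paths(sample)
        if path  # skip the degenerate case of an empty top-level object
    ]
    return f"select {', '.join(columns)} from {table} {alias}"


sample = json.loads('{"members": {"id": 123, "name": "joe"}}')
query = build_select(sample, "dfs.`/tmp/filename.json`")
print(query)
```

Merging the paths from several sample documents before building the query would cover optional fields better than relying on a single record.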