We are trying to move our database from MySQL to Couchbase and implement some CDC (change data capture) logic for copying data to our new DB.
All environments are set up and running: MySQL, Debezium, Kafka, Couchbase, Kubernetes, the pipeline, etc. We have also set up our Kafka source connector for Debezium. Here it is:
- name: "our-connector"
  config:
    connector.class: "io.debezium.connector.mysql.MySqlConnector"
    tasks.max: "1"
    group.id: "our-connector"
    database.server.name: "our-api"
    database.hostname: "******"
    database.user: "******"
    database.password: "******"
    database.port: "3306"
    database.include.list: "our_db"
    column.include.list: "our_db.our_table.our_field"
    table.include.list: "our_db.our_table"
    database.history.kafka.topic: "inf.our_table.our_db.schema-changes"
    database.history.kafka.bootstrap.servers: "kafka-cluster-kafka-bootstrap.kafka:9092"
    value.converter: "org.apache.kafka.connect.json.JsonConverter"
    value.converter.schemas.enable: "false"
    key.converter: "org.apache.kafka.connect.json.JsonConverter"
    key.converter.schemas.enable: "false"
    snapshot.locking.mode: "none"
    tombstones.on.delete: "false"
    event.deserialization.failure.handling.mode: "ignore"
    database.history.skip.unparseable.ddl: "true"
    include.schema.changes: "false"
    snapshot.mode: "initial"
    transforms: "extract,filter,unwrap"
    predicates: "isOurTableChangeOurField"
    predicates.isOurTableChangeOurField.type: "org.apache.kafka.connect.transforms.predicates.TopicNameMatches"
    predicates.isOurTableChangeOurField.pattern: "our-api.our_db.our_table"
    transforms.filter.type: "com.redhat.insights.kafka.connect.transforms.Filter"
    transforms.filter.if: "!!record.value() && record.value().get('op') == 'u' && record.value().get('before').get('our_field') != record.value().get('after').get('our_field')"
    transforms.filter.predicate: "isOurTableChangeOurField"
    transforms.unwrap.type: "io.debezium.transforms.ExtractNewRecordState"
    transforms.unwrap.drop.tombstones: "false"
    transforms.unwrap.delete.handling.mode: "drop"
    transforms.extract.type: "org.apache.kafka.connect.transforms.ExtractField$Key"
    transforms.extract.field: "id"
This configuration publishes this message to Kafka (captured from Kowl):
As you can see, we have the original record's id and the changed field's new value.
No problem so far. Actually, we do have a problem :) Our field is of type DATETIME in MySQL, but Debezium publishes it as a Unix timestamp.
First question: how can we publish this as a formatted datetime (YYYY-mm-dd HH:ii:mm, for example)?
Let's move on.
Here is the actual problem. We have searched a lot, but all the examples write the whole record to Couchbase. We have already created this record in Couchbase and just want to keep its data up to date; we have also manipulated the data on the Couchbase side.
Here is example data from Couchbase:
We want to change only the bill.dateAccepted field in Couchbase. We tried some YAML configs, but had no success on the sink side.
Here is our sink config:
- name: "our-sink-connector-1"
  config:
    connector.class: "com.couchbase.connect.kafka.CouchbaseSinkConnector"
    tasks.max: "2"
    topics: "our-api.our_db.our_table"
    couchbase.seed.nodes: "dev-couchbase-couchbase-cluster.couchbase.svc.cluster.local"
    couchbase.bootstrap.timeout: "10s"
    couchbase.bucket: "our_bucket"
    couchbase.topic.to.collection: "our-api.our_db.our_table=our_bucket._default.ourCollection"
    couchbase.username: "*******"
    couchbase.password: "*******"
    key.converter: "org.apache.kafka.connect.storage.StringConverter"
    key.converter.schemas.enable: "false"
    value.converter: "org.apache.kafka.connect.json.JsonConverter"
    value.converter.schemas.enable: "false"
    connection.bucket: "our_bucket"
    connection.cluster_address: "couchbase://couchbase-srv.couchbase"
    couchbase.document.id: "${/id}"
Partial answer to your first question: one approach would be to use an SPI converter to convert the Unix timestamp to a string. If you want to convert all the datetimes and your input message contains many datetime fields, you can just look at the JDBCType and do the conversion:
https://debezium.io/documentation/reference/stable/development/converters.html
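To make that concrete, here is a minimal sketch of such a converter, assuming Debezium's CustomConverter SPI from the page above and Java 8+ time types for the raw values; the class name, the "format" property, and the handled value types are illustrative assumptions, not something Debezium mandates:

import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Properties;

import org.apache.kafka.connect.data.SchemaBuilder;

import io.debezium.spi.converter.CustomConverter;
import io.debezium.spi.converter.RelationalColumn;

// Sketch: emit MySQL DATETIME columns as formatted strings instead of epoch values.
public class DateTimeStringConverter implements CustomConverter<SchemaBuilder, RelationalColumn> {

    private DateTimeFormatter formatter;

    @Override
    public void configure(Properties props) {
        // "format" is a property of this sketch, supplied via the connector config
        formatter = DateTimeFormatter.ofPattern(
                props.getProperty("format", "yyyy-MM-dd HH:mm:ss"));
    }

    @Override
    public void converterFor(RelationalColumn column,
                             ConverterRegistration<SchemaBuilder> registration) {
        if (!"DATETIME".equalsIgnoreCase(column.typeName())) {
            return; // leave every other column type to Debezium's default handling
        }
        registration.register(SchemaBuilder.string().optional(), value -> {
            if (value == null) {
                return null;
            }
            if (value instanceof LocalDateTime) {
                return formatter.format((LocalDateTime) value);
            }
            if (value instanceof Timestamp) {
                return formatter.format(((Timestamp) value).toLocalDateTime());
            }
            // unexpected runtime type: pass it through unchanged
            return value.toString();
        });
    }
}

You would then register it in the source connector config via the standard converters property, e.g. converters: "datetimeString" plus datetimeString.type set to the converter's fully qualified class name (and, for this sketch, datetimeString.format for the pattern).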
As for extracting the changes from inserts/updates, you can write a custom SMT (single message transform) that has the before and after records as well as the operation type (I/U/D) and, by comparing the before and after fields, extracts the delta. When I tried something like this in the past, I came across the following project, which came in quite handy as a reference. This way you end up with a delta field plus a key, and the sink can just update that field instead of replacing the full document (though the sink has to support this, which will presumably come at some point):
https://github.com/michelin/kafka-connect-transforms-qlik-replicate
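For illustration, a bare-bones sketch of such an SMT is shown below; it assumes the record value is still the full Debezium envelope (i.e. the transform runs before ExtractNewRecordState), and the class name and behaviour are illustrative rather than a drop-in solution:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Sketch: for update events, keep only the columns whose value actually changed.
public class ExtractChangedFields<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        if (!(record.value() instanceof Struct)) {
            return record;
        }
        Struct envelope = (Struct) record.value();
        if (!"u".equals(envelope.getString("op"))) {
            return record; // only diff update events
        }
        Struct before = envelope.getStruct("before");
        Struct after = envelope.getStruct("after");

        // Build a reduced schema and value containing only the changed columns
        SchemaBuilder deltaSchemaBuilder = SchemaBuilder.struct().name("delta");
        Map<String, Object> changed = new LinkedHashMap<>();
        for (Field field : after.schema().fields()) {
            Object oldValue = before == null ? null : before.get(field.name());
            Object newValue = after.get(field.name());
            if (!Objects.equals(oldValue, newValue)) {
                deltaSchemaBuilder.field(field.name(), field.schema());
                changed.put(field.name(), newValue);
            }
        }
        Schema deltaSchema = deltaSchemaBuilder.build();
        Struct delta = new Struct(deltaSchema);
        changed.forEach(delta::put);

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), deltaSchema, delta, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

It would be wired into the connector like any other SMT via the transforms and transforms.<name>.type properties.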
The Couchbase source connector does not support watching individual fields. In general, the Couchbase source connector is better suited for replication than for change data capture. See the caveats mentioned in the Delivery Guarantees documentation.
The Couchbase Kafka sink connector supports partial document updates via the built-in SubDocumentSinkHandler or N1qlSinkHandler. You can select the sink handler by configuring the couchbase.sink.handler connector config property, and customize its behavior with the Sub Document Sink Handler config options.
Here's a config snippet that tells the connector to update the bill.dateAccepted property with the entire value of the Kafka record. (You'd also need to use a Single Message Transform to extract just this field from the source record.)
couchbase.sink.handler=com.couchbase.connect.kafka.handler.sink.SubDocumentSinkHandler
couchbase.subdocument.path=/bill/dateAccepted
If the built-in sink handlers are not flexible enough, you can write your own custom sink handler using the CustomSinkHandler.java example as a template.
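For instance, in the same YAML layout as the sink config above, those two properties could be combined with the standard ExtractField SMT roughly like this; the transform name extractDate and the source field name our_field are assumptions about your record layout, not connector requirements:

- name: "our-sink-connector-1"
  config:
    # ...same connection, converter and topic settings as in the sink config above...
    couchbase.sink.handler: "com.couchbase.connect.kafka.handler.sink.SubDocumentSinkHandler"
    couchbase.subdocument.path: "/bill/dateAccepted"
    # pull only the changed column out of the (already unwrapped) record value
    transforms: "extractDate"
    transforms.extractDate.type: "org.apache.kafka.connect.transforms.ExtractField$Value"
    transforms.extractDate.field: "our_field"
    # the document id would then need to come from the record key, since the
    # record value no longer contains 'id' after the extraction
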
Related
I am trying to evaluate InfluxDB as a real-time, time-series data visualization tool. I have an account with InfluxDB and I have created a bucket for data storage. I now want to upload a CSV file into the bucket via the click-to-upload feature, but I keep getting errors associated with incorrect annotations. The last error I received was:
'Failed to upload the selected CSV: error in csv.from(): failed to read metadata: failed to read header row: wrong number of fields'
I have tried to decipher their docs and examples on how to annotate a csv file and have tried many different combinations of #datatype, #group and #default but nothing works.
This is the latest attempt that generated the error above.
#datatype,string,string,double,dateTime
#group,true,true,false,false
#default,,,,
_measurement,station,_value,_time
device,MBL,-0.814075542,1.65E+18
device,MBL,-0.837942395,1.65E+18
device,MBL,-0.862699339,1.65E+18
device,MBL,-0.891686336,1.65E+18
device,MBL,-0.891492408,1.65E+18
device,MBL,-0.933193098,1.65E+18
device,MBL,-0.933193098,1.65E+18
device,MBL,-0.976859072,1.65E+18
device,MBL,-0.981019863,1.65E+18
device,MBL,-1.011647128,1.65E+18
device,MBL,-1.017813258,1.65E+18
Any thoughts would be greatly appreciated. Thanks.
From the sample data above, I assume "device" is the name of a measurement and "MBL" is a tag whose name is station. Hence, there is 1 measurement, 1 tag, 1 field and a timestamp.
Also, you are mixing data types and line protocol elements when using annotated CSV. You could try the following version:
#datatype,measurement,tag,double,dateTime
#default,device,MBL,,
thisIsYourMeasurementName,station,thisIsYourFieldKeyName,time
device,MBL,-0.814075542,1652669077000000000
device,MBL,-0.837942395,1652669077000000001
device,MBL,-0.862699339,1652669077000000002
device,MBL,-0.891686336,1652669077000000003
device,MBL,-0.891492408,1652669077000000004
device,MBL,-0.933193098,1652669077000000005
device,MBL,-0.933193098,1652669077000000006
device,MBL,-0.976859072,1652669077000000007
device,MBL,-0.981019863,1652669077000000008
device,MBL,-1.011647128,1652669077000000009
device,MBL,-1.017813258,1652669077000000010
Note that the time column should avoid using "1.65E+18". See more details here.
I'm using the bulk loader to load data from csv files on S3 into a Neptune DB cluster.
The data is loaded successfully. However, when I reload the data with some of the nodes' property values modified, the new value does not replace the old one, but rather is added to it, making it a list of values separated by a comma. For example:
Initial values loaded:
~id,~label,ip:string,creationTime:date
2,user,"1.2.3.4",2019-02-13
If I reload this node with a different ip:
2,user,"5.6.7.8",2019-02-13
Then I run the following traversal: g.V(2).valueMap(), and I get: ip=[1.2.3.4, 5.6.7.8], creationTime=[2019-02-13]
While this behavior may be beneficial for some use-cases, it's mostly undesired. I want the new value to replace the old one.
I couldn't find any reference in the documentation to the loader behavior in case of reloading nodes, and there is no relevant parameter to configure in the API request.
How can I have reloaded nodes overwriting the existing ones?
Update: Neptune now supports single cardinality bulk-loading. Just set
updateSingleCardinalityProperties = TRUE
SOURCE: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html
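For reference, a loader request body with that flag set might look roughly like the sketch below, POSTed to the cluster's /loader endpoint; the bucket path, IAM role ARN and region are placeholders:

{
  "source": "s3://your-bucket/nodes/",
  "format": "csv",
  "iamRoleArn": "arn:aws:iam::123456789012:role/NeptuneLoadFromS3",
  "region": "us-east-1",
  "failOnError": "FALSE",
  "updateSingleCardinalityProperties": "TRUE"
}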
Currently the Neptune bulk loader uses set cardinality. To update an existing property, the best way is to use Gremlin via the HTTP or WS endpoint.
From Gremlin you can specify that you want single cardinality (thus replacing rather than adding to the property value). An example would be
g.V('2').property(single,"ip","5.6.7.8")
Hope that helps,
Kelvin
In Airflow, we've created several DAGs, some of which share common properties, for example the directory to read files from. Currently, these properties are listed as a property in each separate DAG, which will obviously become problematic in the future. Say the directory name was to change: we'd have to go into each DAG and update this piece of code (possibly even missing one).
I was looking into creating some sort of configuration file which can be parsed by Airflow and used by the various DAGs when a certain property is required, but I cannot seem to find any documentation or guide on how to do this. The most I could find was the documentation on setting up connection IDs, but that does not meet my use case.
The question of my post: is it possible to do the above scenario, and how?
Thanks in advance.
There are a few ways you can accomplish this based on your setup:
You can use a DagFactory type approach where you have a function generate DAGs. You can find an example of what that looks like here
You can store a JSON config as an Airflow Variable, and parse through that to generate a DAG. You can store something like this in Admin -> Variables:
[
  {
    "table": "users",
    "schema": "app_one",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_one_users",
    "redshift_conn_id": "postgres_default"
  },
  {
    "table": "users",
    "schema": "app_two",
    "s3_bucket": "etl_bucket",
    "s3_key": "app_two_users",
    "redshift_conn_id": "postgres_default"
  }
]
Your DAG could get generated as:
import json

from airflow.models import Variable
# operator import paths may differ between Airflow versions
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.redshift_to_s3_operator import RedshiftToS3Transfer

sync_config = json.loads(Variable.get("sync_config"))

with dag:
    start = DummyOperator(task_id='begin_dag')
    for table in sync_config:
        d1 = RedshiftToS3Transfer(
            task_id='{0}'.format(table['s3_key']),
            table=table['table'],
            schema=table['schema'],
            s3_bucket=table['s3_bucket'],
            s3_key=table['s3_key'],
            redshift_conn_id=table['redshift_conn_id']
        )
        start >> d1
Similarly, you can store that config as a local file and open it as you would any other file. Keep in mind that the best answer here will depend on your infrastructure and use case.
I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I cannot find any option in the official documentation to enable bulk-insert mode.
Wondering if it's possible to set bulk-insert mode in a Dataflow pipeline? If yes, please let me know what I need to change in the code below.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        public void setParameters(KV<Integer, String> element, PreparedStatement query) throws SQLException {
            // bind the key and value of the incoming element to the statement
            query.setInt(1, element.getKey());
            query.setString(2, element.getValue());
        }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    // Count words in input file(s)
    .apply(new CountWords())
    // Format as text
    .apply(MapElements.via(new FormatAsTextFn()))
    // Make key-value pairs with the first letter as the key
    .apply(ParDo.of(new FirstLetterAsKey()))
    // Group the words by first letter
    .apply(GroupByKey.<String, String> create())
    // Get a PCollection of only the values, discarding the keys
    .apply(ParDo.of(new GetValues()))
    // Write the words to the database
    .apply(JdbcIO.<String> writeIterable()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
        .withStatement(INSERT_OR_UPDATE_SQL)
        .withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference with the normal write-method of JdbcIO is the new method writeIterable() that takes a PCollection<Iterable<RowT>> as input instead of PCollection<RowT>. Each Iterable is written as one batch to the database.
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)
I am working on migrating 3.0 code into the new 4.2 framework. I am facing a few difficulties:
1. How do I do CDR-level deduplication in the new 4.2 framework? (Note: table deduplication is already done.)
2. Where do I implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
3. My file is not moving into the archive. The temporary output file is getting generated, but it is empty and outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it is stuck. I printed the TableRowGenerator stream to the logs (end of DataProcessor).
1. and 2.:
You need to select the type of deduplication. There is not a big difference between choosing "table-" or "cdr-level-deduplication".
The ite.businessLogic.transformation.outputType setting affects this. There is only one dedup stage; you cannot have both.
Select recordStream for "cdr-level-deduplication" and do the transformation to table-row format (e.g. if you would like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate handling: reject (discard) tuples, set special column values, or write them to different target tables.
3.:
Possible reasons could be:
Missing forwarding of window punctuations or the statistic tuple
An error in the BloomFilter configuration; you would see this easily because the PE is down and the error log gives hints about the wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. "Debug sinks" also write punctuation markers to the debug files.