Kafka-Connect JDBC Connector tinyint to boolean mapping - mysql

I have a Kafka Connect job configured to query a MySQL table periodically and place messages on a queue. The structure of these messages is defined using an Avro schema. I am having an issue with the mapping for one of my columns.
The column is defined as a tinyint(1) in my MySQL schema, and I am trying to map it to a boolean field in my Avro object:
{
"name": "is_active",
"type": "boolean"
}
The Kafka Connect job runs, and messages are placed on the queue, but when the application that reads from the queue attempts to deserialize the messages I get the following error:
org.apache.avro.AvroTypeException: Found int, expecting boolean
I was hoping that a 1 or 0 value could be automatically mapped to a boolean, but that does not seem to be the case.
I have also tried configuring my job to use a 'Cast' transform, but that just seems to cause issues with the other fields in the message:
"transforms": "Cast",
"transforms.Cast.type": "org.apache.kafka.connect.transforms.Cast$Value",
"transforms.Cast.spec": "is_active:boolean"
Is what I am attempting possible, or will I have to change my application to work with the int value?
Here is my full configuration (I have stripped out some other irrelevant fields).
Kafka Connect job config
{
"connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
"mode": "bulk",
"topic.prefix": "my_topic-name",
"transforms.SetSchemaMetadata.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
"query": "select is_active from my_table",
"poll.interval.ms": "30000",
"transforms": "SetSchemaMetadata",
"name": "job_name",
"connection.url": "connectiondetailshere",
"transforms.SetSchemaMetadata.schema.name": "com.my.model.name"
}
AVRO Schema
{
"type": "record",
"name": "name",
"namespace": "com.my.model",
"fields": [
{
"name": "is_active",
"type": "long"
}
],
"connect.name": "com.my.model.name"
}

You can do this either with a custom Transform (this is a perfect use case for one), or by writing a simple streaming application to do it, for example in KSQL:
CREATE STREAM my_topic AS
SELECT COL1, COL2, …
CASE WHEN is_active=1 THEN TRUE ELSE FALSE END AS is_active_bln
FROM my_source_connect_topic;
ksql> describe my_topic;
Name : my_topic
 Field         | Type
-----------------------------------------
 ROWTIME       | BIGINT           (system)
 ROWKEY        | VARCHAR(STRING)  (system)
 COL1          | INTEGER
 COL2          | VARCHAR
 IS_ACTIVE_BLN | BOOLEAN
-----------------------------------------
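For reference, if you wanted to stay purely in Kafka Connect with the built-in Cast transform, chaining it with the SetSchemaMetadata transform you already have would look roughly like the sketch below. This is assembled only from the configs shown above and is untested; it is not guaranteed to avoid the side effects you saw on the other fields:
"transforms": "Cast,SetSchemaMetadata",
"transforms.Cast.type": "org.apache.kafka.connect.transforms.Cast$Value",
"transforms.Cast.spec": "is_active:boolean",
"transforms.SetSchemaMetadata.type": "org.apache.kafka.connect.transforms.SetSchemaMetadata$Value",
"transforms.SetSchemaMetadata.schema.name": "com.my.model.name"
If that still mangles the other columns, that is exactly the point where a custom Transform that only touches is_active (as suggested above) becomes the cleaner option.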

Related

Failing to generate a filter on debezium connector with error: "op is not a valid field name"

I have created a Debezium connector to a Docker MySQL container.
I tried to set a filter for messages:
{
"name": "my_connector",
"config": {
"name": "my_connector",
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
...
"include.schema.changes": "true",
"transforms": "filter, unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "true",
"transforms.filter.type": "io.debezium.transforms.Filter",
"transforms.filter.language": "jsr223.groovy",
"transforms.filter.condition": "value.source.table == 'table-name' && (value.op == 'd' || value.op == 'c' || (value.op == 'u' && value.after.status != value.before.status))"
}
}
In http://localhost:8070/connectors/my_connector/status I see this:
{
"connector":
{
"state": "RUNNING",
"worker_id": "172.21.0.13:8083"
},
"name": "my_connector",
"tasks":
[
{
"id": 0,
"state": "FAILED",
"trace": "org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded
in error handler\n\tat
org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:178)\n\tat
org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execute(RetryWithToleranceOperator.java:104)\n\tat
org.apache.kafka.connect.runtime.TransformationChain.apply(TransformationChain.java:50)\n\tat
org.apache.kafka.connect.runtime.WorkerSourceTask.sendRecords(WorkerSourceTask.java:320)\n\tat
org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:245)\n\tat
org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)\n\tat
org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)\n\tat
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat
java.base/java.lang.Thread.run(Thread.java:834)\nCaused by:
io.debezium.DebeziumException: Error while evaluating expression
'value.source.table == 'subscription_contract' && (value.op == 'd' ||
value.op == 'c' || (value.op == 'u' && value.after.status !=
value.before.status))' for record
'SourceRecord{sourcePartition={server=subscription_contracts_db },
sourceOffset={file=binlog.000006, pos=19704, snapshot=true}}
ConnectRecord{topic='subscription_contracts_db', kafkaPartition=0,
key=Struct{databaseName=subscription-contracts},
keySchema=Schema{io.debezium.connector.mysql.SchemaChangeKey:STRUCT},
value=Struct{source=Struct{version=1.2.0.Final,connector=mysql,name=subscription_contracts_db,ts_ms=0,snapshot=true,db=subscription-contracts,table=subscription_contract,server_id=0,file=binlog.000006,pos=19704,row=0},databaseName=subscription-contracts,ddl=DROP
TABLE IF EXISTS subscription-contracts.subscription_contract},
valueSchema=Schema{io.debezium.connector.mysql.SchemaChangeValue:STRUCT},
timestamp=null, headers=ConnectHeaders(headers=)}'\n\tat
io.debezium.transforms.scripting.Jsr223Engine.eval(Jsr223Engine.java:116)\n\tat
io.debezium.transforms.Filter.doApply(Filter.java:33)\n\tat
io.debezium.transforms.ScriptingTransformation.apply(ScriptingTransformation.java:189)\n\tat
org.apache.kafka.connect.runtime.TransformationChain.lambda$apply$0(TransformationChain.java:50)\n\tat
org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndRetry(RetryWithToleranceOperator.java:128)\n\tat
org.apache.kafka.connect.runtime.errors.RetryWithToleranceOperator.execAndHandleError(RetryWithToleranceOperator.java:162)\n\t...
11 more\nCaused by: javax.script.ScriptException:
org.apache.kafka.connect.errors.DataException: op is not a valid field
name\n\tat
org.codehaus.groovy.jsr223.GroovyScriptEngineImpl.eval(GroovyScriptEngineImpl.java:320)\n\tat org.codehaus.groovy.jsr223.GroovyCompiledScript.eval(GroovyCompiledScript.java:71)\n\tat
java.scripting/javax.script.CompiledScript.eval(CompiledScript.java:89)\n\tat
io.debezium.transforms.scripting.Jsr223Engine.eval(Jsr223Engine.java:107)\n\t...
16 more\nCaused by: org.apache.kafka.connect.errors.DataException: op
is not a valid field name\n\tat
org.apache.kafka.connect.data.Struct.lookupField(Struct.java:254)\n\tat
org.apache.kafka.connect.data.Struct.get(Struct.java:74)\n\tat
jdk.internal.reflect.GeneratedMethodAccessor1.invoke(Unknown
Source)\n\tat
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat
org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)\n\tat
groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)\n\tat
org.codehaus.groovy.runtime.metaclass.MethodMetaProperty$GetMethodMetaProperty.getProperty(MethodMetaProperty.java:62)\n\tat
org.codehaus.groovy.runtime.callsite.GetEffectivePojoPropertySite.getProperty(GetEffectivePojoPropertySite.java:63)\n\tat
org.codehaus.groovy.runtime.callsite.AbstractCallSite.callGetProperty(AbstractCallSite.java:329)\n\tat
Script9.run(Script9.groovy:1)\n\tat
org.codehaus.groovy.jsr223.GroovyScriptEngineImpl.eval(GroovyScriptEngineImpl.java:317)\n\t... 19 more\n",
"worker_id": "172.21.0.13:8083"
}
],
"type": "source" }
As OneCricketeer pointed out, the basic issue here is:
Caused by: javax.script.ScriptException: org.apache.kafka.connect.errors.DataException: op is not a valid field name
But I am not sure what is wrong with using it, since it seems like it is supposed to be a valid field - here.
After some investigation I seem to have found the answer; hope it helps someone else.
In my connector configuration I had this setting:
"include.schema.changes": "true"
which caused my connector to also produce messages about schema changes in the database.
I have a migrator Docker container that initializes the DB container by running some Flyway migrations, one of which is the DROP TABLE in the exception above.
Since a schema-change message has no reason to contain an op field, it simply doesn't (as shown in the example here).
When the filter tries to fetch that field, it doesn't find it and throws an exception.
Changing the setting to false solved the issue.
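For reference, the fix is a one-line change in the connector config, with everything else unchanged:
"include.schema.changes": "false"
If you did want to keep the schema-change events flowing, another option would be to make the condition itself defensive about the missing field. This is only a sketch and untested; it assumes the filter's value is the usual Kafka Connect Struct, so its schema can be inspected with schema().field() before accessing op:
"transforms.filter.condition": "value.schema().field('op') != null && value.source.table == 'table-name' && (value.op == 'd' || value.op == 'c' || (value.op == 'u' && value.after.status != value.before.status))"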

Analysing and formatting JSON using PostgreSQL

I have a table called api_details where I dump the JSON value below into the JSON column raw_data.
Now I need to build a report from this JSON string, and the expected output is something like this:
action_name              | sent_timestamp                         | Sent  | Delivered
campaign_2475            | 1600416865.928737 - 1601788183.440805  | 7504  | 7483
campaign_d_1084_SUN15_ex | 1604220248.153903 - 1604222469.087918  | 63095 | 62961
Below is the sample JSON output:
{
"header": [
"#0 action_name",
"#1 sent_timestamp",
"#0 Sent",
"#1 Delivered"
],
"name": "campaign - lifetime",
"rows": [
[
"campaign_2475",
"1600416865.928737 - 1601788183.440805",
7504,
7483
],
[
"campaign_d_1084_SUN15_ex",
"1604220248.153903 - 1604222469.087918",
63095,
62961
],
[
"campaign_SUN15",
"1604222469.148829 - 1604411016.029794",
63303,
63211
]
],
"success": true
}
I tried the query below, but it is not getting the results. I can do it in Python by looping through all the elements in the rows list,
but is there an easy solution in PostgreSQL (version 11)?
SELECT raw_data->'rows'->0
FROM api_details
You can use the JSONB_ARRAY_ELEMENTS() function, for example:
SELECT (j.value)->>0 AS action_name,
(j.value)->>1 AS sent_timestamp,
(j.value)->>2 AS Sent,
(j.value)->>3 AS Delivered
FROM api_details
CROSS JOIN JSONB_ARRAY_ELEMENTS(raw_data->'rows') AS j
P.S. In this case the data type of raw_data is assumed to be JSONB; otherwise the argument within the function, raw_data->'rows', should be replaced with raw_data::JSONB->'rows' in order to perform explicit type casting.
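For example, if raw_data is a plain json column, the same query with the explicit cast (and, optionally, the counts cast back to integers) would look like this sketch:
SELECT (j.value)->>0 AS action_name,
       (j.value)->>1 AS sent_timestamp,
       ((j.value)->>2)::int AS sent,
       ((j.value)->>3)::int AS delivered
FROM api_details
CROSS JOIN JSONB_ARRAY_ELEMENTS(raw_data::JSONB->'rows') AS j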

Convert an array of strings to a dictionary with JQ?

I am trying to convert the AWS public IP ranges into a format that can be used with the Terraform external data provider, so I can create a security group rule based on the AWS public CIDRs. The provider requires a single JSON object with this format:
{"string": "string"}
Here is a snippet of the public ranges JSON document:
{
"syncToken": "1589917992",
"createDate": "2020-05-19-19-53-12",
"prefixes": [
{
"ip_prefix": "35.180.0.0/16",
"region": "eu-west-3",
"service": "AMAZON",
"network_border_group": "eu-west-3"
},
{
"ip_prefix": "52.94.76.0/22",
"region": "us-west-2",
"service": "AMAZON",
"network_border_group": "us-west-2"
},
// ...
]
I can successfully extract the ranges I care about with this, [.prefixes[] | select(.region == "us-west-2") | .ip_prefix] | sort | unique, and it gives me this:
[
"100.20.0.0/14",
"108.166.224.0/21",
"108.166.240.0/21",
"13.248.112.0/24",
...
]
I can't figure out how to convert this to an arbitrarily-keyed object with jq. In order to properly use the array object, I need to convert it to a dictionary, something like {"arbitrary-key": "100.20.0.0/14"}, so that I can use it in Terraform like this:
data "external" "amazon-ranges" {
program = [
"cat",
"${path.cwd}/aws-ranges.json"
]
}
resource "aws_default_security_group" "allow-mysql" {
vpc_id = aws_vpc.main.id
ingress {
description = "MySQL"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = [
values(data.external.amazon-ranges.result)
]
}
}
What is the most effective way to extract the AWS public IP ranges document into a single object with arbitrary keys?
The following script uses the .ip_prefix as the key, thus perhaps avoiding the need for the sort|unique. It yields:
{
"35.180.0.0/16": "35.180.0.0/16",
"52.94.76.0/22": "52.94.76.0/22"
}
Script
#!/bin/bash
function data {
cat <<EOF
{
"syncToken": "1589917992",
"createDate": "2020-05-19-19-53-12",
"prefixes": [
{
"ip_prefix": "35.180.0.0/16",
"region": "eu-west-3",
"service": "AMAZON",
"network_border_group": "eu-west-3"
},
{
"ip_prefix": "52.94.76.0/22",
"region": "us-west-2",
"service": "AMAZON",
"network_border_group": "us-west-2"
}
]
}
EOF
}
data | jq '
.prefixes
| map(select(.region | test("west"))
| {(.ip_prefix): .ip_prefix} )
| add '
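An equivalent filter that leans on jq's from_entries builtin (same output, just a different idiom), reusing the data function from the script above:
data | jq '
  .prefixes
  | map(select(.region | test("west"))
        | {key: .ip_prefix, value: .ip_prefix})
  | from_entries '
Against the real document you would feed jq from https://ip-ranges.amazonaws.com/ip-ranges.json instead of the inline sample.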
There's a better option to get at the AWS IP ranges data in Terraform, which is to use the aws_ip_ranges data source, instead of trying to mangle things with the external data source and jq.
The example in the documentation linked above shows something similar to, though slightly more complex than, what you're trying to do here:
data "aws_ip_ranges" "european_ec2" {
regions = ["eu-west-1", "eu-central-1"]
services = ["ec2"]
}
resource "aws_security_group" "from_europe" {
name = "from_europe"
ingress {
from_port = "443"
to_port = "443"
protocol = "tcp"
cidr_blocks = data.aws_ip_ranges.european_ec2.cidr_blocks
ipv6_cidr_blocks = data.aws_ip_ranges.european_ec2.ipv6_cidr_blocks
}
tags = {
CreateDate = data.aws_ip_ranges.european_ec2.create_date
SyncToken = data.aws_ip_ranges.european_ec2.sync_token
}
}
To do your exact thing you would do something like this:
data "aws_ip_ranges" "us_west_2_amazon" {
regions = ["us_west_2"]
services = ["amazon"]
}
resource "aws_default_security_group" "allow-mysql" {
vpc_id = aws_vpc.main.id
ingress {
description = "MySQL"
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = data.aws_ip_ranges.us_west_2_amazon.cidr_blocks
}
}
However, there are two things that are bad here.
The first, and most important, is that you're allowing access to your database from every IP address that AWS has in US-West-2 across all services. That means that anyone in the world is able to spin up an EC2 instance or Lambda function in US-West-2 and then have network access to your database. This is a very bad idea.
The second is that if that returns more than 60 CIDR blocks you are going to end up with more than 60 rules in your security group. AWS security groups have a limit of 60 security group rules per IP address type (IPv4 vs IPv6) and per ingress/egress:
You can have 60 inbound and 60 outbound rules per security group (making a total of 120 rules). This quota is enforced separately for IPv4 rules and IPv6 rules; for example, a security group can have 60 inbound rules for IPv4 traffic and 60 inbound rules for IPv6 traffic. A rule that references a security group or prefix list ID counts as one rule for IPv4 and one rule for IPv6.
From https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-security-groups
This is technically a soft cap, and you can ask AWS to raise it in exchange for reducing the number of security groups that can be applied to a network interface, keeping the maximum number of security group rules at or below 1,000 per network interface. It's probably not something you want to mess around with, though.

PostgreSQL jsonb string format

I'm using PostgreSQL jsonb and have the following in my database record:
{"tags": "[\"apple\",\" orange\",\" pineapple\",\" fruits\"]",
"filename": "testname.jpg", "title_en": "d1", "title_ja": "1",
"description_en": "d1", "description_ja": "1"}
and both SELECT statements below retrieved no results:
SELECT "photo"."id", "photo"."datadoc", "photo"."created_timestamp","photo"."modified_timestamp"
FROM "photo"
WHERE datadoc @> '{"tags": ["apple"]}';
SELECT "photo"."id", "photo"."datadoc", "photo"."created_timestamp", "photo"."modified_timestamp"
FROM "photo"
WHERE datadoc -> 'tags' ? 'apple';
I wonder whether it is because of the extra backslashes added to the JSON array string, or whether the SELECT statement is incorrect.
I'm running "PostgreSQL 10.1, compiled by Visual C++ build 1800, 64-bit" on Windows 10.
PostgreSQL doc is here.
As far as any JSON parser is concerned, the value of your tags key is a string, not an array.
"tags": "[\"apple\",\" orange\",\" pineapple\",\" fruits\"]"
The string itself happens to be another JSON document, like the common case in XML where the contents of a string happen to be an XML or HTML document.
["apple"," orange"," pineapple"," fruits"]
What you need to do is extract that string, then parse it as a new JSON object, and then query that new object.
I can't test it right now, but I think that would look something like this:
(datadoc ->> 'tags') ::jsonb ? 'apple'
That is, "extract the tags value as text, cast that text value as jsonb, then query that new jsonb value.
I know this is a very late answer, but here is a good approach, with the data I have.
Initial data in the DB:
"{\"data\":{\"title\":\"test\",\"message\":\"string\",\"image\":\"string\"},\"registration_ids\":[\"s
tring\"],\"isAllUsersNotification\":false}"
To convert it to JSON:
select (notificationData #>> '{}')::jsonb from sent_notification
result:
{"data": {"image": "string", "title": "string", "message": "string"}, "registration_ids": ["string"], "isAllUsersNotification": false}
Getting the data object from the JSON:
select (notificationData #>> '{}' )::jsonb -> 'data' from sent_notification;
result:
{"image": "string", "title": "string", "message": "string"}
Getting a field from the above result:
select (notificationData #>> '{}' )::jsonb -> 'data' ->>'title' from sent_notification;
result:
string
Performing WHERE operations.
Q: get records where title = 'string'
Answer:
select * from sent_notification where (notificationData #>> '{}' )::jsonb -> 'data' ->>'title' ='string'
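An equivalent filter using the jsonb containment operator @>, assuming the same table and column names as above, would be:
select * from sent_notification where ((notificationData #>> '{}')::jsonb -> 'data') @> '{"title": "string"}'::jsonb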

Using scala play-json 2.4.x, how do I extract the name of a json object into a different object?

Given this json:
{
"credentials": {
"b79a2ba2-lolo-lolo-lolo-lololololol": {
"description": "meaningful description",
"displayName": "git (meaningful description)",
"fullName": "credential-store/_/b79a2ba2-lolo-lolo-lolo-lololololol",
"typeName": "SSH Username with private key"
}
},
"description": "Credentials that should be available irrespective of domain specification to requirements matching.",
"displayName": "Global credentials (unrestricted)",
"fullDisplayName": "Credentials » Global credentials (unrestricted)",
"fullName": "credential-store/_",
"global": true,
"urlName": "_"
}
and this scala destination class:
case class JenkinsCredentials(uuid: String, description: String)
How can I create a Reads[JenkinsCredentials] to extract that first object's key (the uuid b79a2ba2-lolo-lolo-lolo-lololololol) along with its description?
Following the documentation it'd be something along the lines of this:
implicit val credsReader: Reads[JenkinsCredentials] = (
(JsPath).read[String] and
(JsPath \ "description").read[String]
)(JenkinsCredentials.apply _)
Used with (Json.parse(content) \\ "credentials").validate[Seq[JenkinsCredentials]]
But the documentation doesn't discuss anything about extracting the names of the objects as a field used somewhere else...
EDIT: clarifying
My end state would be a Seq of JenkinsCredentials parsed from a JSON object, not an array, because of how the JSON is structured. For each credentials entry under "credentials", I'd pull the key (the UUID) and its "description" into a single object.
I figured this out without using the Reads[T] methods:
//Get the key names inside of the credentials
(json \ "credentials").as[JsObject].fields.map{ case (credId, value) =>
JenkinsCredentials(credId, (value \ "description").validate[String].get)
}
This works, but it doesn't validate anything and doesn't use a transformer.
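If you do want validate to work against a Reads, one way is to wrap that same fields traversal in a Reads[Seq[JenkinsCredentials]]. This is a sketch against the play-json 2.4.x API and is untested:
import play.api.libs.json._

case class JenkinsCredentials(uuid: String, description: String)

implicit val credsSeqReads: Reads[Seq[JenkinsCredentials]] = Reads {
  case obj: JsObject =>
    // each field name is the credential UUID; its value holds the description
    JsSuccess(obj.fields.map { case (uuid, value) =>
      // .as[String] throws if description is missing, same caveat as .validate[String].get above
      JenkinsCredentials(uuid, (value \ "description").as[String])
    })
  case _ =>
    JsError("expected a JSON object keyed by credential UUID")
}

// usage:
// (Json.parse(content) \ "credentials").validate[Seq[JenkinsCredentials]]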