I have a topic whose messages are strings containing JSON. For example, a message could be:
'{"id":"foo", "datetime":1}'
In this topic, everything is treated as a string.
I would like to send these messages into a PostgreSQL table with Kafka Connect. My goal is to make PostgreSQL understand that the messages are JSON; after all, PostgreSQL handles JSON quite well.
How do I tell Kafka Connect or PostgreSQL that the messages are in fact JSON?
Thanks
EDIT:
For now, I use ./bin/connect-standalone config/connect-standalone.properties config/sink-sql-rules.properties.
With:
connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
rest.port=8084
plugin.path=share/java
sink-sql-rules.properties
name=mysink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
# The topics to consume from - required for sink connectors like this one
topics=mytopic
# Configuration specific to the JDBC sink connector.
connection.url=***
connection.user=***
connection.password=***
mode=timestamp+incrementing
auto.create=true
auto.evolve=true
table.name.format=mytable
batch.size=500
EDIT2:
With this configuration I get the following error:
org.apache.kafka.connect.errors.ConnectException: No fields found using key and value schemas for table
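For context, the JsonConverter only hands a schema to the JDBC sink when value.converter.schemas.enable=true, and in that case every message must carry an inline schema next to its payload. A rough sketch of what the example message above would have to look like in that envelope (the optional flags are assumptions):
{
  "schema": {
    "type": "struct",
    "fields": [
      { "field": "id", "type": "string", "optional": false },
      { "field": "datetime", "type": "int64", "optional": true }
    ],
    "optional": false
  },
  "payload": { "id": "foo", "datetime": 1 }
}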
Related
Here is my earlier question: importing-json-data-into-postgres-using-kafka-jdbc-sink-connector
I was able to load JSON data when I produced it in the schema-and-payload format. But it is not possible for me to have a schema assigned to every record, so I started looking for another solution and found Schema Inferencing for JsonConverter. As per that documentation I disabled value.converter.schemas.enable and enabled value.converter.schemas.infer.enable, but I'm still facing the same error,
i.e.,
Caused by: org.apache.kafka.connect.errors.ConnectException: Sink connector 'load_test' is configured with 'delete.enabled=false' and 'pk.mode=none' and therefore requires records with a non-null Struct value and non-null Struct schema, but found record at (topic='dup_emp',partition=0,offset=0,timestamp=1633066307312) with a HashMap value and null value schema.
My configuration:
curl -X PUT http://localhost:8083/connectors/load_test/config \
-H "Content-Type: application/json" \
-d '{
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url":"jdbc:postgresql://localhost:5432/somedb",
"connection.user":"user",
"connection.password":"passwd",
"value.converter":"org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable":"false",
"value.converter.schemas.infer.enable": "true",
"tasks.max" : "1",
"topics":"dup_emp",
"table.name.format":"dup_emp",
"insert.mode":"insert",
"quote.sql.identifiers":"never"
}'
I have gone through the sink_config_options here. As per my understanding, I need to produce records with a key that contains a struct of the primary-key fields, and set pk.mode=record_key and delete.enabled=true.
Correct me if I have misunderstood. If my understanding is correct, how do we produce records whose key is a struct containing all the primary-key fields, and finally, how do I get the data from the topic to populate Postgres successfully?
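For what it's worth, when the key.converter is the JsonConverter with key.converter.schemas.enable=true, a Struct key can be expressed as a schema/payload envelope on the record key, roughly like this (the emp_id field name is purely an assumption about the primary key):
{
  "schema": {
    "type": "struct",
    "fields": [ { "field": "emp_id", "type": "int32", "optional": false } ],
    "optional": false
  },
  "payload": { "emp_id": 42 }
}
With such a key, pk.mode=record_key has named fields to work with, but it does require the producer to be changed.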
it is not possible to have schema assigned to every record
Then it's not possible to use this connector, as it requires a schema in order to know what fields and types exist.
The KIP you linked to is "under discussion", with an unassigned, open JIRA; it is not implemented.
The alternative is to not use JSON, but a structured binary format such as the ones Confluent provides (Avro or Protobuf). You can use KSQL before consuming in Connect to do this translation (requires running Confluent Schema Registry); a sketch follows below.
Otherwise, you need to write your own Converter (or Transform) and add it to the Connect classpath such that it returns a Struct.
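Regarding the KSQL route, the translation amounts to re-serializing the topic as Avro; a rough sketch, where the column names and types are assumptions about the JSON in dup_emp:
CREATE STREAM dup_emp_json (emp_id INT, name VARCHAR)
  WITH (KAFKA_TOPIC='dup_emp', VALUE_FORMAT='JSON');
CREATE STREAM dup_emp_avro
  WITH (KAFKA_TOPIC='dup_emp_avro', VALUE_FORMAT='AVRO') AS
  SELECT * FROM dup_emp_json;
The sink connector would then consume dup_emp_avro with the AvroConverter, which gives the JDBC sink the schema it needs.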
I am trying to set up a full ELK stack for managing logs from our Kubernetes clusters. Our applications log either plain-text lines or JSON objects. I want to be able to search the text logs, and also to index and search the fields in the JSON.
I have Filebeat running on each Kubernetes node, picking up the Docker logs and enriching them with various Kubernetes fields, plus a few fields we use internally. The complete filebeat.yml is:
filebeat.inputs:
- type: container
paths:
- /var/log/containers/*.log
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/log/containers/"
fields:
kubernetes.cluster: <name of the cluster>
environment: <environment of the cluster>
datacenter: <datacenter the cluster is running in>
fields_under_root: true
output.logstash:
hosts: ["logstash-logstash-headless:5044"]
Filebeat ships the resulting logs to a central Logstash instance I have installed. In Logstash I attempt to parse the message field into a new field called message_parsed. The complete pipeline looks like this:
input {
beats {
port => 5044
type => "beats"
tags => ["beats"]
}
}
filter {
json {
source => "message"
target => "message_parsed"
skip_on_invalid_json => true
}
}
output {
elasticsearch {
hosts => [
"elasticsearch-logging-ingest-headless:9200"
]
}
}
I then have an Elasticsearch cluster installed which receives the logs, with separate data, ingest and master nodes. Apart from some CPU and memory configuration, the cluster runs with completely default settings.
The trouble I'm having is that I do not control the contents of the JSON messages. They could have any field of any type, and we have many cases where the same field exists but its values are of differing types. One simple example is the field level, which is usually a string carrying the values "debug", "info", "warn" or "error", but we also run some software that outputs this level as a numeric value. Other cases include error fields that are sometimes objects and sometimes strings, and date fields that are sometimes Unix timestamps and sometimes human-readable dates.
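For illustration, two events of the kind described, which cannot share a single mapping once they both land under message_parsed (the payloads themselves are made up):
{ "level": "error", "error": "connection refused", "date": "2021-04-07T15:57:31Z" }
{ "level": 50, "error": { "message": "connection refused" }, "date": 1617807451 }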
This of course makes Elasticsearch complain with a mapper_parsing_exception. Here's an example of one such error:
[2021-04-07T15:57:31,200][WARN ][logstash.outputs.elasticsearch][main][19f6c57d0cbe928f269b66714ce77f539d021549b68dc20d8d3668bafe0acd21] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1211193c>], :response=>{"index"=>{"_index"=>"logstash-2021.04.06-000014", "_type"=>"_doc", "_id"=>"L80NrXgBRfSv8axlknaU", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [message_parsed.error] tried to parse field [error] as object, but found a concrete value"}}}}
Is there any way I can make Elasticsearch handle that case?
Software Configuration:
Hadoop distribution: Amazon 2.8.3
Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0
I tried to read JSON files with multiple schemas:
val df = spark.read.option("mergeSchema", "true").json("s3a://s3bucket/2018/01/01/*")
This throws an error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:206)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340)
How do I read JSON with multiple schemas in Spark?
This sometimes happens when you are pointing to the wrong path (i.e., the data does not exist there).
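If the path is correct, another route the error message itself suggests is to specify the schema manually, which skips inference entirely. A sketch, where the field names and types are assumptions and would need to form a superset of all the variants present in the files:
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical superset schema; replace the fields with whatever the JSON actually contains.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("datetime", LongType, nullable = true)
))

val df = spark.read
  .schema(schema)
  .json("s3a://s3bucket/2018/01/01/*")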
I use a simple file source reader:
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
The file content is a simple JSON object on each line. I found that there is a way to replace the record key using transformations, like this:
# Add the `id` field as the key using Simple Message Transformations
transforms=InsertKey
# `ValueToKey`: push an object of one of the column fields (`id`) into the key
transforms.InsertKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.InsertKey.fields=ip
But I got an error
Only Struct objects supported for [copying fields from value to key],
found: java.lang.String
Is there a way to parse string json and get a key from there like I can do with Flume and regex_extractor?
When using a transformation with a source connector, the transformation is applied to the List<SourceRecord> returned by SourceConnector.poll().
In your case, the FileStreamSourceConnector reads the lines of the file and puts each line as a String in the SourceRecord object. Therefore, when the transformation gets the SourceRecord, it only sees it as a String and does not know the structure of the object.
To solve this problem,
Either you modify the FileStreamSourceConnector code so that it returns the SourceRecord with a valid Struct and Schema for your input JSON string. You can use Kafka's SchemaBuilder class for this (see the sketch after these options).
Or, if you're consuming this data with a sink connector, you can have Kafka Connect parse the JSON by setting the following config on the sink connector, and then do the transformations on the sink side:
"value.converter":"org.apache.kafka.connect.json.JsonConverter"
"value.converter.schemas.enable": "false"
If you go with the second option, don't forget to put these configs on your source connector:
"value.convertor":"org.apache.kafka.connect.storage.StringConverter"
"value.converter.schemas.enable": "false"
there is a way to replace a record key
There is a separate transform called org.apache.kafka.connect.transforms.ReplaceField$Key
InsertKey will take fields from the value and attempt to copy them out of a Struct/Map, but your value appears to be a plain String.
I'm a newbie to AWS.
I want to use the COPY command to import a table from DynamoDB into Redshift, but I get an error such as "Invalid operation: Unsupported Data Type: Current Version only supports Strings and Numbers". Or I only get values in some columns, and the others (the more important ones, such as the sensor value in the payload) are null.
In DynamoDB the hash key and range key are Strings, but the payload is in JSON format. How can I COPY this payload to Redshift?
The AWS documentation doesn't provide a detailed solution.
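For reference, the COPY-from-DynamoDB form being discussed looks roughly like this; the table names, role ARN and read ratio are placeholders:
COPY my_redshift_table
FROM 'dynamodb://my_dynamo_table'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
READRATIO 50;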
The COPY command can be used to copy data from a DynamoDB table that has scalar data types (i.e. STRING and NUMBER).
If the DynamoDB table has any attributes with other data types (e.g. Map, List, Set, etc.), the COPY command will fail; this is not supported at the moment.
Only Amazon DynamoDB attributes with scalar STRING and NUMBER data types are supported. The Amazon DynamoDB BINARY and SET data types are not supported. If a COPY command tries to load an attribute with an unsupported data type, the command will fail. If the attribute does not match an Amazon Redshift table column, COPY does not attempt to load it, and it does not raise an error.