Elasticsearch-kafka-river plugin issue - json

I am trying to pass data from Kafka to Elasticsearch and then to Kibana. I am using the kafka-river plugin as mentioned in this link: Elasticsearch-river-kafka plugin
After starting ZooKeeper, the Kafka server, and a producer, I am sending the data {"test":"one"}
Then I start Elasticsearch. I am getting the following error in Kafka:
[2016-02-04 00:05:00,094] ERROR Closing socket for /192.168.1.9 because of error (kafka.network.Processor)
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:375)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Processor.read(SocketServer.scala:444)
at kafka.network.Processor.run(SocketServer.scala:340)
at java.lang.Thread.run(Thread.java:745)
And in Elasticsearch I get the following error:
org.codehaus.jackson.JsonParseException: Unexpected character ('S' (code 83)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
Also, I see this in elasticsearch logs:
[2016-02-04 00:14:31,340][WARN ][river.routing ] [ISAAC] no river _meta document found after 5 attempts
Any idea what I am doing wrong? Please help. Thanks.

The concept of rivers is deprecated in Elasticsearch; it adds performance issues. Why don't you look at using the Logstash Kafka plugin for the same thing? You can find out more about it at https://www.elastic.co/blog/logstash-kafka-intro
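For illustration, a minimal Logstash pipeline sketch along the lines of that blog post (the topic name, ZooKeeper address, and index name are placeholders, and the exact option names vary with the Logstash/plugin version):
input {
  kafka {
    # ZooKeeper-based settings used by the older kafka input plugin;
    # newer plugin versions use bootstrap_servers/topics instead
    zk_connect => "localhost:2181"
    topic_id => "mytopic"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "kafka-data"
  }
}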

Related

value schema must be a struct

I am sending nested JSON data to Kafka for a PostgreSQL sink. I am building the sink connector and unfortunately I can't change the data at the source. I want to send the data as it is, without any conversions, using Kafka.
Kafka Connect is showing this error:
[2023-01-04 22:58:15,227] ERROR WorkerSinkTask{id=Kafkapgsink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: Value schema must be of type Struct (org.apache.kafka.connect.runtime.WorkerSinkTask:609)
org.apache.kafka.connect.errors.ConnectException: Value schema must be of type Struct
at io.confluent.connect.jdbc.sink.metadata.FieldsMetadata.extract(FieldsMetadata.java:86)
at io.confluent.connect.jdbc.sink.metadata.FieldsMetadata.extract(FieldsMetadata.java:67)
at io.confluent.connect.jdbc.sink.BufferedRecords.add(BufferedRecords.java:115)
at io.confluent.connect.jdbc.sink.JdbcDbWriter.write(JdbcDbWriter.java:74)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:85)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:581)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:189)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:244)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
My Kafka Connect worker properties are:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
The sink properties are:
name=Kafkapgsink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
task.max=100
connection.url=jdbc:postgresql://localhost:5432/fileintegrity
connection.user=postgres
connection.password=09900
insert.mode=insert
auto.create=true
auto.evolve=true
table.name.format=oi
pk.mode=record_key
delete.enabled=true
Your problem is here
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
Strings do not have typed key/value pairs (i.e. structure).
More details here if you want to use JSON: https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
As you see there, schemas.enable is only a property of JsonConverter.
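If your producer can emit JSON with an embedded schema (or you can switch to Avro/Protobuf with Schema Registry), a minimal sketch of the converter settings would look like this, with the rest of your worker config unchanged:
value.converter=org.apache.kafka.connect.json.JsonConverter
# the JDBC sink needs a schema to derive table columns, so embedded schemas must be on
value.converter.schemas.enable=true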

pSConfig JSON is not valid. Encountered the following validation errors:

When I created the JSON file the first time, it gave me this error:
Error retrieving JSON. Encountered the following error:
garbage after JSON object, at character offset 8 (before ": {\r\n \r\n"...") at /usr/share/perl5/vendor_perl/JSON.pm line 171.
Then I tried to fix it and checked it on the website.
When I uploaded it again, it gave me:
pSConfig JSON is not valid. Encountered the following validation
errors: Node: / Error: Properties not allowed: cck.nnu.tn,
psauhank1.ankabut.ac.ae, aub-asren-ps.aub.edu.lb,
ps.ams-van1G.kaust.edu.sa, 1ge-throughput, 185.19.231.230,
193.227.1.143, perfsonardmz.marwan.ma, ps.sc-bdc10G.kaust.edu.sa, perfsonar.marwan.ma.
I don't know what the problem is or how to fix it.
The first time I had missed a character, but the second time I don't know what the reason is.
I checked it with a JSON validator.
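This is only a guess, since the template itself is not shown: a "Properties not allowed" error at Node: / usually means the host entries ended up at the top level of the document instead of inside the sections a pSConfig template expects. A rough sketch of the expected skeleton (the host name is a placeholder, and the other sections are elided):
{
  "addresses": {
    "perfsonar.example.org": { "address": "perfsonar.example.org" }
  },
  "groups": { ... },
  "tests": { ... },
  "schedules": { ... },
  "tasks": { ... }
}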

Kafka connect string to json in Postgresql

I have a topic with strings containing JSON. For example, a message could be:
'{"id":"foo", "datetime":1}'
In this topic everything is considered a string.
I would like to send the messages to a PostgreSQL table with Kafka Connect. My goal is to let PostgreSQL understand that the messages are JSON; indeed, PostgreSQL handles JSON pretty well.
How do I tell Kafka Connect or PostgreSQL that the messages are in fact JSON?
Thanks
EDIT:
For now, I use ./bin/connect-standalone config/connect-standalone.properties config/sink-sql-rules.properties.
With:
connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
rest.port=8084
plugin.path=share/java
sink-sql-rules.properties
name=mysink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
# The topics to consume from - required for sink connectors like this one
topics=mytopic
# Configuration specific to the JDBC sink connector.
connection.url=***
connection.user=***
connection.password=***
mode=timestamp+incremeting
auto.create=true
auto.evolve=true
table.name.format=mytable
batch.size=500
EDIT 2:
With this configuration I get this error:
org.apache.kafka.connect.errors.ConnectException: No fields found using key and value schemas for table
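With JsonConverter and schemas.enable=false the sink receives schemaless data, so it has no field definitions to build columns from, which is what the "No fields found" error is complaining about. For reference, a rough sketch of the envelope JsonConverter expects when value.converter.schemas.enable=true (field names taken from the example message above; the schema name is made up):
{
  "schema": {
    "type": "struct",
    "name": "mytopic_value",
    "optional": false,
    "fields": [
      { "field": "id", "type": "string", "optional": false },
      { "field": "datetime", "type": "int64", "optional": true }
    ]
  },
  "payload": { "id": "foo", "datetime": 1 }
}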

Read multiple JSON schemas with Spark

Software Configuration:
Hadoop distribution: Amazon 2.8.3
Applications: Hive 2.3.2, Pig 0.17.0, Hue 4.1.0, Spark 2.3.0
I tried to read JSON files with multiple schemas:
val df = spark.read.option("mergeSchema", "true").json("s3a://s3bucket/2018/01/01/*")
It throws an error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:206)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:397)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:340)
How can I read JSON with multiple schemas in Spark?
This sometimes happens when you are pointing to the wrong path, i.e. when no data exists at that location.
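If the path does exist, the other route the error message points at is specifying the schema manually instead of relying on inference. A rough PySpark sketch (the field names are hypothetical):
from pyspark.sql.types import StructType, StructField, StringType, LongType

# hypothetical schema covering the union of fields across the mixed JSON files
schema = StructType([
    StructField("id", StringType(), True),
    StructField("ts", LongType(), True),
])

df = spark.read.schema(schema).json("s3a://s3bucket/2018/01/01/*")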

Spark Structured Streaming error: pyspark.sql.utils.StreamingQueryException: 'assertion failed: Invalid batch:'

I have a Spark Structured-Streaming application, which reads JSON data from s3 and does some transformations and writes it back to s3.
While running the application, the job sometimes errors out and re-attempts (without any visible data loss or corruption, so everything seems fine), but the error message provided is not very descriptive.
Below is the error message:
pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: _ra_guest_gid#1883,_ra_sess_ts#1884,_ra_evt_ts#1885,event#1886,brand#1887,category#1888,funding_daysRemaining#1889,funding_dollarsRemaining#1890,funding_goal#1891,funding_totalBackers#1892L,funding_totalFunded#1893,id#1894,name#1895,price#1896,projectInfo_memberExclusive#1897,projectInfo_memberExclusiveHoursRemaining#1898,projectInfo_numberOfEpisodes#1899,projectInfo_projectState#1900,variant#1901 != _ra_guest_gid#2627,_ra_sess_ts#2628,_
My guess is that this may have something to do with column mismatches, where either:
1) The incoming JSON record does not conform to the schema, or
2) The datatype of a field in the incoming JSON record does not match the datatype declared in the schema.
But I'm not sure how to pinpoint which record or which particular field causes the error.
Any help or suggestions on what the error means, or on how I could log the error in a better way, would be appreciated.
Thanks
I think I have figured out the issue; it is not related to a schema mismatch.
What was happening in my case is that I have two streaming operations running in parallel:
1) Reading raw incoming data from an S3 bucket, doing some operations, and writing it back to S3 in output folder 'a'.
2) Reading the processed streaming data from folder 'a' (step 1), again doing some operations, and writing it back to S3 in output folder 'b'.
As per my observations, if I run the above steps individually everything works fine, but if I run them together I get the error
'pyspark.sql.utils.StreamingQueryException: u'assertion failed: Invalid batch: '
So I think it has trouble when it tries to read and write from the same location, i.e. when the destination of one stream is the source of another stream.
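For illustration, a rough PySpark sketch of the setup described above, where the second query's source is the first query's destination (paths, schema, and checkpoint locations are made up):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("chained-streams").getOrCreate()
schema = StructType([StructField("event", StringType(), True)])  # placeholder schema

# Query 1: raw S3 input -> folder 'a'
raw = spark.readStream.schema(schema).json("s3a://bucket/raw/")
q1 = (raw.writeStream.format("json")
      .option("path", "s3a://bucket/a/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/a/")
      .start())

# Query 2: folder 'a' -> folder 'b'; folder 'a' is still being written to by query 1,
# which is the situation that produced the 'Invalid batch' assertion above
staged = spark.readStream.schema(schema).json("s3a://bucket/a/")
q2 = (staged.writeStream.format("json")
      .option("path", "s3a://bucket/b/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/b/")
      .start())

spark.streams.awaitAnyTermination()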