Data ingestion into Elasticsearch with a database having a large number of tables - MySQL

input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    parameters => { "favorite_artist" => "Beethoven" }
    schedule => "* * * * *"
    statement => "SELECT * from songs where artist = :favorite_artist"
  }
}
In the above Logstash configuration file, how is the data ingested?
And what should I do when I have to select from multiple tables?

The data gets ingested based on the SELECT statement query. If you want data from
multiple tables, you can use join queries combining all the tables, and the relevant
output of the query will be ingested into ES. It all depends on your specific use case.
Here is a sample for your reference.
input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    parameters => { "favorite_artist" => "Beethoven" }
    schedule => "* * * * *"
    statement => "SELECT * FROM songs INNER JOIN song_folder USING (song_number) ORDER BY song_title;"
  }
}
output {
  elasticsearch {
    hosts => "http://XX.XX.XX.XX:9200"
    index => "song"
    document_type => "songname"
    # reference the column value, not the literal string, so each row gets its own id
    document_id => "%{song_title}"
  }
  stdout { codec => rubydebug }
}
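If you would rather keep the tables as separate documents instead of joining them, one option is to declare one jdbc input per table and route each to its own index. Below is a minimal sketch of that pattern; the second table albums and the index names are made up for illustration, not taken from your schema.
input {
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    schedule => "* * * * *"
    statement => "SELECT * FROM songs"
    type => "songs"
  }
  jdbc {
    jdbc_driver_library => "mysql-connector-java-5.1.36-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "mysql"
    schedule => "* * * * *"
    statement => "SELECT * FROM albums"
    type => "albums"
  }
}
output {
  if [type] == "songs" {
    elasticsearch { hosts => "http://XX.XX.XX.XX:9200" index => "song" }
  } else {
    elasticsearch { hosts => "http://XX.XX.XX.XX:9200" index => "album" }
  }
}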
Please let me know if you have any further queries.

Related

Logstash JDBC plugin filling more data into Elasticsearch than the actual data, keeps on running

Logstash is running in an infinite loop and I'm having to stop the process; it basically keeps filling values into the Elasticsearch index. I need exactly the same number of documents as there are rows in my DB table.
Here's my logstash config:
input {
  jdbc {
    jdbc_driver_library => "/correct_path/java/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/my_db"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_paging_enabled => true
    schedule => "*/5 * * * * *"
    statement => 'select * from my_table'
  }
}
output {
  elasticsearch {
    user => "test"
    password => "test"
    hosts => ["localhost:9200"]
    index => "my_index"
  }
  stdout { codec => "rubydebug" }
}
This is happening because the query gets all the data every time the cron job is executed. Also, you have not provided a custom id in the elasticsearch output, so it creates a dynamic id for each document, and because of that there is more data in the index (duplicate data with different unique ids).
You can use the sql_last_value parameter, which stores the last crawl time, and add a where condition on created_date or updated_date to your query. The first run will fetch all the data from the DB; from the second run onward only rows that were newly created or updated will be fetched.
input {
  jdbc {
    jdbc_driver_library => "/correct_path/java/mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/my_db"
    jdbc_user => "user"
    jdbc_password => "password"
    jdbc_paging_enabled => true
    schedule => "*/5 * * * * *"
    statement => 'select * from my_table where created_date > :sql_last_value or updated_date > :sql_last_value'
  }
}
output {
  elasticsearch {
    user => "test"
    password => "test"
    hosts => ["localhost:9200"]
    index => "my_index"
  }
  stdout { codec => "rubydebug" }
}
PS: I am not a pro in SQL, so my query might have issues, but I hope you get the idea.
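On top of that, since the extra documents come from Elasticsearch generating a new id every time, you can also set a custom document_id in the output so the same row always overwrites the same document. A minimal sketch, assuming my_table has a primary key column named id (adjust to your schema):
output {
  elasticsearch {
    user => "test"
    password => "test"
    hosts => ["localhost:9200"]
    index => "my_index"
    # "id" is assumed to be the primary key of my_table
    document_id => "%{id}"
  }
  stdout { codec => "rubydebug" }
}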

Logstash sql_last_value is not updating

I wish to migrate records from one MySQL table to Elasticsearch using a Logstash configuration. I see that the file logstash_jdbc_last_run_issued is not changing/updating, so sql_last_value is not changing either. When I add one record to the table artifact, the index emp7 keeps getting the same information inserted over and over, so the index grows and grows until I kill the Logstash process.
Logstash configuration:
input {
  jdbc {
    jdbc_driver_library => "e:\Data\logstash\bin\mysql-connector-java-5.1.18-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/kibana"
    jdbc_user => "userdb"
    jdbc_password => "passdb"
    last_run_metadata_path => "e:\Data\logstash\bin\logstash_jdbc_last_run_issued"
    tracking_column => "issuedDate"
    tracking_column_type => "numeric"
    use_column_value => true
    statement => "SELECT serial, name, issuedDate FROM artifact where issuedDate > :sql_last_value;"
    schedule => "* * * * * *"
  }
}
output {
  elasticsearch {
    hosts => "http://127.0.0.1:9200"
    index => "emp7"
    document_type => "doc"
    user => "user"
    password => "pass"
  }
  stdout {
    codec => rubydebug
  }
}
Table structure information: artifact
serial varchar(40)
name varchar(40)
issuedDate bigint(20)
Here are the Logstash debug results:
[2019-12-30T11:38:46,004][INFO ][logstash.inputs.jdbc ] (0.000000s) SELECT serial, name, issuedDate FROM artifact where issuedDate > 0;
[2019-12-30T11:38:46,004][WARN ][logstash.inputs.jdbc ] tracking_column not found in dataset. {:tracking_column=>"issuedDate"}
file logstash_jdbc_last_run_issued content:
--- 0
I'm using Logstash 6.0, Elasticsearch 6.0 and Kibana 6.0.
My question is: what is missing from the Logstash configuration?
I figured out what was happening.
The problem was related to tracking_column not found in dataset. {:tracking_column...}.
I added lowercase_column_names => false inside the jdbc section. Additionally, I added clean_run => false. Finally it started to work. I learned that Logstash lowercases the tracking_column by default, so I disabled that.
input {
  jdbc {
    jdbc_driver_library => "e:\Data\logstash\bin\mysql-connector-java-5.1.18-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/kibana"
    jdbc_user => "userdb"
    jdbc_password => "passdb"
    last_run_metadata_path => "e:\Data\logstash\bin\logstash_jdbc_last_run_issued"
    tracking_column => "issuedDate"
    use_column_value => true
    lowercase_column_names => false
    clean_run => false
    statement => "SELECT serial, name, issuedDate FROM artifact where issuedDate > :sql_last_value;"
    schedule => "* * * * * *"
  }
}
output {
  elasticsearch {
    hosts => ["http://127.0.0.1:9200"]
    index => "emp7"
    document_type => "doc"
    user => "user"
    password => "pass"
  }
  stdout {
    codec => rubydebug
  }
}

How to migrate MySQL data to Elasticsearch using Logstash

I need a brief explanation of how I can convert MySQL data to Elasticsearch using Logstash.
Can anyone explain the step-by-step process?
This is a broad question, and I don't know how familiar you are with MySQL and ES. Let's say you have a table user. You could simply dump it as CSV and load it into your ES, and that would be fine. But if you have dynamic data, where MySQL works like a pipeline, you need to write a script to do that. Anyway, you can check the links below to build your basic knowledge before you ask how.
How to dump MySQL?
How to load data into ES
Also, you will probably want to know how to convert your CSV to a JSON file, which is the format that suits ES best.
How to convert CSV to JSON
You can do it using the jdbc input plugin for logstash.
Here is a config example.
Let me provide you with a high-level instruction set.
Install Logstash and Elasticsearch.
Copy the jar ojdbc7.jar into the Logstash bin folder.
For Logstash, create a config file, e.g. config.yml
#
input {
  # Get the data from database, configure fields to get data incrementally
  jdbc {
    jdbc_driver_library => "./ojdbc7.jar"
    jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
    jdbc_connection_string => "jdbc:oracle:thin:@db:1521:instance"
    jdbc_user => "user"
    jdbc_password => "pwd"
    id => "some_id"
    jdbc_validate_connection => true
    jdbc_validation_timeout => 1800
    connection_retry_attempts => 10
    connection_retry_attempts_wait_time => 10
    # fetch the db logs using logid
    statement => "select * from customer.table where logid > :sql_last_value order by logid asc"
    # limit how many results are pre-fetched at a time from the cursor into the client’s cache before retrieving more results from the result-set
    jdbc_fetch_size => 500
    jdbc_default_timezone => "America/New_York"
    use_column_value => true
    tracking_column => "logid"
    tracking_column_type => "numeric"
    record_last_run => true
    schedule => "*/2 * * * *"
    type => "log.customer.table"
    add_field => { "source" => "customer.table" }
    add_field => { "tags" => "customer.table" }
    add_field => { "logLevel" => "ERROR" }
    last_run_metadata_path => "last_run_metadata_path_table.txt"
  }
}
# Massage the data to store in index
filter {
  if [type] == 'log.customer.table' {
    # assign values from db columns to custom fields of the index
    ruby {
      code => "event.set( 'errorid', event.get('ssoerrorid') );
               event.set( 'msg', event.get('errormessage') );
               event.set( 'logTimeStamp', event.get('date_created'));
               event.set( '@timestamp', event.get('date_created'));
              "
    }
    # remove the db columns that were mapped to custom fields of the index
    mutate {
      remove_field => ["ssoerrorid", "errormessage", "date_created"]
    }
  } # end of [type] == 'log.customer.table'
} # end of filter
# Insert into index
output {
  if [type] == 'log.customer.table' {
    amazon_es {
      hosts => ["vpc-xxx-es-yyyyyyyyyyyy.us-east-1.es.amazonaws.com"]
      region => "us-east-1"
      aws_access_key_id => '<access key>'
      aws_secret_access_key => '<secret password>'
      index => "production-logs-table-%{+YYYY.MM.dd}"
    }
  }
}
Go to the bin folder and run:
logstash -f config.yml
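The sample above uses an Oracle driver and table. Since this question is about MySQL, a rough sketch of the same incremental input adapted for MySQL could look like the following; the driver path, database, table and column names here are assumptions, not values from the original answer.
input {
  jdbc {
    jdbc_driver_library => "./mysql-connector-java-8.0.27.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "user"
    jdbc_password => "pwd"
    # fetch rows incrementally using a numeric id column
    statement => "SELECT * FROM my_table WHERE id > :sql_last_value ORDER BY id ASC"
    use_column_value => true
    tracking_column => "id"
    tracking_column_type => "numeric"
    record_last_run => true
    last_run_metadata_path => "last_run_metadata_my_table.txt"
    schedule => "*/2 * * * *"
  }
}
The filter and output sections can stay as they are; only the driver details and the SQL statement change.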

Keeping the Elasticsearch indexed data in sync with MySQL table data

I am using Logstash to index data from different MySQL DB tables.
input {
  jdbc {
    jdbc_driver_library => "/opt/logstash/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://<ip number>:3306/database"
    jdbc_user => "elastic"
    jdbc_password => "password"
    schedule => "* * * * *"
    statement => "SELECT name, id, description from user_table"
  }
}
output {
  elasticsearch {
    index => "search"
    document_type => "name"
    document_id => "%{id}"
    hosts => "127.0.0.1:9200"
  }
  #stdout { codec => json_lines }
}
The data is indexed properly, but how do we keep the data in Elasticsearch in sync with the data in the database tables, given that the data is updated continuously by the application? I just gave the example of one table, but I have multiple tables whose data I want to index. I searched for an answer but could not find the details.
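One way to keep the index in sync is to combine the sql_last_value tracking and the custom document_id shown in the other answers in this thread: schedule the query, pull only rows changed since the last run, and write them with the row's primary key as the document id so updates overwrite the existing document. A minimal sketch, assuming user_table has a numeric primary key id and an updated_at timestamp column (both assumptions about your schema):
input {
  jdbc {
    jdbc_driver_library => "/opt/logstash/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://<ip number>:3306/database"
    jdbc_user => "elastic"
    jdbc_password => "password"
    schedule => "* * * * *"
    # only pull rows created or modified since the last run (assumes an updated_at column)
    statement => "SELECT name, id, description, updated_at FROM user_table WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    index => "search"
    document_type => "name"
    # same id as the row, so an updated row overwrites its existing document
    document_id => "%{id}"
    hosts => "127.0.0.1:9200"
  }
}
This covers inserts and updates (repeat the jdbc block per table); rows deleted in MySQL are not removed from the index this way and usually need a soft-delete flag or a separate cleanup job.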

Logstash produces duplicates

My goal is to import data from a MySQL table into an Elasticsearch index. The MySQL table has about 2.5 million records, however after a while Logstash has inserted at least 3x more data and doesn't stop.
The weirdest thing is that I try to generate a SHA1 signature of each message and use it as the document_id to avoid duplicates.
input {
  jdbc {
    jdbc_driver_library => "/app/bin/mysql-connector-java-5.1.37-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://database.xxxxxxx.us-west-2.rds.amazonaws.com:3306/test"
    jdbc_page_size => 25000
    jdbc_paging_enabled => true
    statement => "SELECT * FROM Actions"
  }
}
filter {
  ruby {
    code => "
      require 'digest/sha1';
      event['fingerprint'] = Digest::SHA1.hexdigest(event.to_json);
    "
  }
}
output {
  elasticsearch {
    hosts => ["elasticbeanstalk-env:80"]
    index => "test"
    document_type => "action"
    document_id => "%{fingerprint}"
  }
}
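One thing worth checking here: event.to_json also includes fields that Logstash adds on every run, such as @timestamp and @version, so the SHA1 of the whole event changes on every import and the document_id never matches the previously indexed one. Below is a minimal sketch of an alternative that hashes only stable database columns with the fingerprint filter; the column names id and name are assumptions, and it assumes a reasonably recent version of that plugin.
filter {
  fingerprint {
    # hash only values that come from the table, not @timestamp/@version
    source => ["id", "name"]
    concatenate_sources => true
    method => "SHA1"
    target => "fingerprint"
  }
}
output {
  elasticsearch {
    hosts => ["elasticbeanstalk-env:80"]
    index => "test"
    document_type => "action"
    document_id => "%{fingerprint}"
  }
}
If Actions has a primary key, using it directly as the document_id, as in the other answers above, is simpler still.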