How to migrate Mysql data to elasticsearch using logstash - mysql

I need a brief explanation of how can I convert MySQL data to Elastic Search using logstash.
can anyone explain the step by step process about this

This is a broad question, I don't know how much you familiar with MySQL and ES. Let's say you have a table user. you may just simply dump it as csv and load it at your ES will be good. but if you have a dynamic data, like the MySQL just like a pipeline, you need to write a Script to do those stuff. anyway you can check the below link to build your basic knowledge before you ask How.
How to dump mysql?
How to load data to ES
Also, since you probably want to know how to covert your CSV to json file, which is the best suite for ES to understand.
How to covert CSV to JSON

You can do it using the jdbc input plugin for logstash.
Here is a config example.

Let me provide you with a high level instruction set.
Install Logstash, and Elasticsearch.
In Logstash bin folder copy jar ojdbc7.jar.
For logstash, create a config file ex: config.yml
#
input {
# Get the data from database, configure fields to get data incrementally
jdbc {
jdbc_driver_library => "./ojdbc7.jar"
jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
jdbc_connection_string => "jdbc:oracle:thin:#db:1521:instance"
jdbc_user => "user"
jdbc_password => "pwd"
id => "some_id"
jdbc_validate_connection => true
jdbc_validation_timeout => 1800
connection_retry_attempts => 10
connection_retry_attempts_wait_time => 10
#fetch the db logs using logid
statement => "select * from customer.table where logid > :sql_last_value order by logid asc"
#limit how many results are pre-fetched at a time from the cursor into the client’s cache before retrieving more results from the result-set
jdbc_fetch_size => 500
jdbc_default_timezone => "America/New_York"
use_column_value => true
tracking_column => "logid"
tracking_column_type => "numeric"
record_last_run => true
schedule => "*/2 * * * *"
type => "log.customer.table"
add_field => {"source" => "customer.table"}
add_field => {"tags" => "customer.table" }
add_field => {"logLevel" => "ERROR" }
last_run_metadata_path => "last_run_metadata_path_table.txt"
}
}
# Massage the data to store in index
filter {
if [type] == 'log.customer.table' {
#assign values from db column to custom fields of index
ruby{
code => "event.set( 'errorid', event.get('ssoerrorid') );
event.set( 'msg', event.get('errormessage') );
event.set( 'logTimeStamp', event.get('date_created'));
event.set( '#timestamp', event.get('date_created'));
"
}
#remove the db columns that were mapped to custom fields of index
mutate {
remove_field => ["ssoerrorid","errormessage","date_created" ]
}
}#end of [type] == 'log.customer.table'
} #end of filter
# Insert into index
output {
if [type] == 'log.customer.table' {
amazon_es {
hosts => ["vpc-xxx-es-yyyyyyyyyyyy.us-east-1.es.amazonaws.com"]
region => "us-east-1"
aws_access_key_id => '<access key>'
aws_secret_access_key => '<secret password>'
index => "production-logs-table-%{+YYYY.MM.dd}"
}
}
}
Go to bin, Run as
logstash -f config.yml

Related

Logstash Jdbc plugin filling more data in elasticsearch than the actual data, keeps on running

Logstash is running in an infinite loop and I'm having to stop the process, basically keeps filling values in the elasticsearch index. I need exact same number of documents as there are rows in my db table.
Here's my logstash config:
input {
jdbc {
jdbc_driver_library => "/correct_path/java/mysql-connector-java-8.0.27.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/my_db"
jdbc_user => "user"
jdbc_password => "password"
jdbc_paging_enabled => true
schedule => "*/5 * * * * *"
statement => 'select * from my_table'
}
}
output {
elasticsearch {
user => "test"
password => "test"
hosts => ["localhost:9200"]
index => "my_index"
}
stdout { codec => "rubydebug" }
}
This is happening because query will get all the data every time when the cron job will be executed. Also, you have not provided custom id in elasticsearch output so it will create dynamic id for each document and due to that there will be more data in index (duplicate data with different unique id).
You can use sql_last_value param which store the last crawl date and update your query with where condition on created_date or updated_date. This will get first time all the data from DB and second time onward only data which are newly created or updated.
input {
jdbc {
jdbc_driver_library => "/correct_path/java/mysql-connector-java-8.0.27.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/my_db"
jdbc_user => "user"
jdbc_password => "password"
jdbc_paging_enabled => true
schedule => "*/5 * * * * *"
statement => 'select * from my_table where created_date > :sql_last_value or updated_date > :sql_last_value'
}
}
output {
elasticsearch {
user => "test"
password => "test"
hosts => ["localhost:9200"]
index => "my_index"
}
stdout { codec => "rubydebug" }
}
PS: I am not pro in SQL so my query might have issue. But I hope you will get the idea.

notify Logstash when new data is entered in mysql databse without using parameter schedule

I am working on Elastic Stack with Mysql. everything is working fine like logstash taking data from mysql database and sending it to elasticsearch and when new entries entered in mysql data then to update elasticsearch automatically i am using parameter: Schedule but in this case logstash is checking continuously for new data from it's terminal that is my main concern.
input {
jdbc {
jdbc_connection_string => "jdbc:mysql://localhost:3306/testdb"
# The user we wish to execute our statement as
jdbc_user => "root"
jdbc_password => ""
# The path to our downloaded jdbc driver
jdbc_driver_library => "/home/Downloads/mysql-connector-java-5.1.38.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
#run logstash at an interval of on minute
schedule => "*/15 * * * *"
use_column_value => true
tracking_column => 'EVENT_TIME_OCCURRENCE_FIELD'
# our query
statement => "SELECT * FROM brainplay WHERE EVENT_TIME_OCCURRENCE_FIELD > :sql_last_value"
}
}
output {
stdout { codec => json_lines }
elasticsearch {
"hosts" => "localhost:9200"
"index" => "test-migrate"
"document_type" => "data"
"document_id" => "%{personid}"
}
}
But if data is large Logstash will check for new entries in entire data without any stopping point then this will reduce scalability and consume more power.
Is there any other method or any webhook like when new data is entered into database then mysql will notify Logstash only for new data or Logstash will check for only new entries, Please help
You can either use sql_last_start parameter in your query with any timestamp field (assuming that there is a timestamp field like last_updated).
For example, your query could be like,
WHERE last_updated >= :sql_last_start
From this answer,
For example, the first time you run this sql_last_start will be
1970-01-01 00:00:00 and you'll get all rows. The second run
sql_last_start will be (for example) 2015-12-03 10:55:00 and the query
will return all rows with a timestamp newer than that.
or you can read this answer on using :sql_last_value
WHERE last_updated > :sql_last_value

Need help in identifying TYPE of document ingested by Elasticsearch (through Logstash)

I used Logstash to ingest csv files from https://www.kaggle.com/wcukierski/the-simpsons-by-the-data and saved it to Elasticsearch. For starters, I ingested simpsons_characters.csv using the following conf:
input {
file {
path => "/Users/xyz/Downloads/the-simpsons-by-the-data/simpsons_characters.csv"
start_position => beginning
sincedb_path => "/dev/null"
}
}
filter {
csv {
columns => ["id", "name", "normalized_name", "gender"]
separator => ","
}
}
output {
stdout {
codec => rubydebug
}
elasticsearch {
hosts => "localhost"
action => "index"
index => "simpsons"
}
}
However, when I query like so: http://localhost:9200/simpsons/name/Lou
where
simpsons = index
name = type (I think ... not sure)
I get the following response back:
{
"_index": "simpsons",
"_type": "name",
"_id": "Lou",
"found": false
}
So, the question is, why am I not getting the correct response. Further, when you do bulk ingestion through csv, what is the type of the document?
Thanks!
The default type in Logstash Elasticsearch output is logs. So, no matter how you define your IDs (either take it from the csv - document_id => "%{id}" or let ES define its own), you can get those documents as http://localhost:9200/simpsons/logs/THE_ID.
If you don't know the id and want to simply check if something is there: http://localhost:9200/simpsons/logs/_search?pretty.
If you want to see what is the mapping of your index, for example to find out the _type of the index: http://localhost:9200/simpsons/_mapping?pretty.
To change the default _type:
elasticsearch {
hosts => "localhost"
action => "index"
index => "simpsons"
document_type => "characters"
document_id => "%{id}"
}
Here you haven't specified id field in your logstash output. In this case elasticsearch would asign a random id to your documents and you are searching for a document with id=Lou.
Adding document_id => "%{id}" would solve your problem.
output {
stdout {
codec => rubydebug
}
elasticsearch {
hosts => "localhost"
action => "index"
index => "simpsons"
document_id => "%{id}"
}
}

Keeping the elasticseach indexed data in sync with MySQL table data

I am using logstash to index the different data in MySQL DB tables.
input
{
jdbc { jdbc_driver_library => "/opt/logstash/mysql-connector-java-5.1.39/mysql-connector-java-5.1.39-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://<ip number>:3306/database"
jdbc_user => "elastic"
jdbc_password => "password"
schedule => "* * * * *"
statement => "SELECT name, id, description from user_table"
}
}
output
{
elasticsearch {
index => "search"
document_type => "name"
document_id => "%{id}"
hosts => "127.0.0.1:9200"
}
#stdout { codec => json_lines }
}
The data is indexed properly but how do we keep the data in elastic search in sync with the data in the database tables as the data is updated by the application continuously. I just gave the example of one table but I have multiple tables for which I want to index the data. I searched for the answer but could not find the details.

Logstash produce duplicates

My goal is to import data from MySQL table in the ElasticSearch index. MySQL table has about 2.5 million records, however after a while logstash inserts at least 3x times more data and don't stop.
The weirdest thing is that I try to generate sha1 signature of each message and use it as document_id to avoid duplicates
input {
jdbc {
jdbc_driver_library => "/app/bin/mysql-connector-java-5.1.37-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://database.xxxxxxx.us-west-2.rds.amazonaws.com:3306/test"
jdbc_page_size => 25000
jdbc_paging_enabled => true
statement => "SELECT * FROM Actions"
}
}
filter {
ruby {
code => "
require 'digest/sha1';
event['fingerprint'] = Digest::SHA1.hexdigest(event.to_json);
"
}
}
output {
elasticsearch {
hosts => ["elasticbeanstalk-env:80"]
index => "test"
document_type => "action"
document_id => "%{fingerprint}"
}
}