How to pull 300k-400k records stored in MySQL into MongoDB - mysql

I was developing a Spring Boot API to pull data from a remote MySQL database table. This table contains 300k-400k records daily. We now need to migrate this data to MongoDB. I tried the GridFS technique to store the collected JSON file in MongoDB. I was able to do this on my local machine, but when I tried the same scenario against the live server, the JVM threw an error:
2018-12-18 17:59:26.206 ERROR 4780 --- [r.BlockPoller-1] o.a.tomcat.util.net.NioBlockingSelector :
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.ArrayList.iterator(ArrayList.java:840) ~[na:1.8.0_181]
at sun.nio.ch.WindowsSelectorImpl.updateSelectedKeys(WindowsSelectorImpl.java:496) ~[na:1.8.0_181]
at sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:172) ~[na:1.8.0_181]
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) ~[na:1.8.0_181]
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) ~[na:1.8.0_181]
at org.apache.tomcat.util.net.NioBlockingSelector$BlockPoller.run(NioBlockingSelector.java:339) ~[tomcat-embed-core-8.5.14.jar:8.5.14]
2018-12-18 17:59:27.865 ERROR 4780 --- [nio-8083-exec-1] o.a.c.c.C.[.[.[.[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [/datapuller/v1] threw exception [Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: GC overhead limit exceeded] with root cause
java.lang.OutOfMemoryError: GC overhead limit exceeded
I tried to increase the heap size with -Xmx3048m via the Java utility in Control Panel, but got the same result. What should I do next to resolve this issue? I have not posted the code here because I believe it is fine, as it ran OK on my local machine with 60k to 70k records.

The most performant way is always to bypass all those abstractions.
Since you are not locked into Spring Boot, I suggest you dump the data from MySQL as a flat file, either via mysqldump or:
echo 'SELECT * FROM table' | mysql -h your_host -u user -p -B <db_schema> > dump.tsv
Then you can import this file into MongoDB. Note that mysql's -B batch output is tab-separated, so use --type tsv:
mongoimport --host=127.0.0.1 -d database_name -c collection_name --type tsv --file dump.tsv --headerline
https://docs.mongodb.com/manual/reference/program/mongoimport/
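If you would rather stay in Java, the usual cure for this OutOfMemoryError is to stream the result set and write to MongoDB in bounded batches instead of holding all 300k-400k rows (or one huge JSON file) in memory at once. Below is a minimal sketch of that idea, assuming plain JDBC with MySQL Connector/J and the MongoDB Java sync driver; the connection details, table name, and batch size are placeholders, not values from the question.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class MysqlToMongoStreamer {

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://your_host:3306/db_schema", "user", "password");
             MongoClient mongo = MongoClients.create("mongodb://127.0.0.1:27017")) {

            MongoCollection<Document> target =
                    mongo.getDatabase("database_name").getCollection("collection_name");

            // Connector/J-specific trick: TYPE_FORWARD_ONLY + CONCUR_READ_ONLY +
            // fetch size of Integer.MIN_VALUE makes the driver stream rows
            // instead of buffering the entire result set in memory.
            Statement stmt = conn.createStatement(
                    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            stmt.setFetchSize(Integer.MIN_VALUE);

            List<Document> batch = new ArrayList<>();
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM table")) {
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    Document doc = new Document();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        Object value = rs.getObject(i);
                        // java.sql date/time types are not always encodable by the
                        // default Mongo codecs, so convert them to java.util.Date.
                        if (value instanceof java.sql.Timestamp || value instanceof java.sql.Date) {
                            value = new Date(((Date) value).getTime());
                        }
                        doc.append(meta.getColumnLabel(i), value);
                    }
                    batch.add(doc);
                    if (batch.size() == 1000) {   // keep memory usage bounded
                        target.insertMany(batch);
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                target.insertMany(batch);
            }
        }
    }
}
This keeps the JVM footprint roughly constant regardless of how many rows the table holds; depending on your column types you may need a few more conversions before inserting.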

Related

Timeout error when trying to read a file from Hadoop with PySpark

I want to read a CSV file from Hadoop with PySpark using the following code:
dfcsv = spark.read.csv("hdfs://my_hadoop_cluster_ip:9000/user/root/input/test.csv")
dfcsv.printSchema()
My Hadoop cluster runs in a Docker container on my local machine and is linked with two other slave containers for the workers.
As you can see in this picture from my Hadoop cluster UI, the path is correct.
But when I submit my script with this command:
spark-submit --master spark://my_cluster_spark_ip:7077 test.py
My script gets stuck on the read, and after a few minutes I get the following error:
22/02/09 15:42:29 WARN TaskSetManager: Lost task 0.1 in stage 4.0 (TID 4) (my_slave_spark_ip executor 1): org.apache.hadoop.net.ConnectTimeoutException: Call From spark-slave1/my_slave_spark_ip to my_hadoop_cluster_ip:9000 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=my_hadoop_cluster_ip/my_hadoop_cluster_ip:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
...
For information, my CSV file is very small, just 3 lines and 64 KB.
Does anyone have a solution to fix this issue?

MyDumper/MyLoader unable to import large tables into Azure MySQL database

I'm in the process of migrating a large MySQL database to an "Azure Database for MySQL Flexible Server".
The database has a few tables that are larger than 1GB, the largest one being 200GB. All tables are InnoDB tables.
Because of the size of the tables, a normal MySQL dump didn't work, so, as suggested here, I resorted to MyDumper/MyLoader: https://learn.microsoft.com/en-us/azure/mysql/concepts-migrate-mydumper-myloader
I dumped one of the large tables (a 31GB table) with the following command:
mydumper --database mySchema \
--tables-list my_large_table \
--host database \
--user root \
--ask-password \
--compress-protocol \
--chunk-filesize 500 \
--verbose 3 \
--compress \
--statement-size 104857600
I then copied the files over to a VM in the same region/zone as the Azure database and started the import with the following command:
myloader --directory mydumpdir \
--host dbname.mysql.database.azure.com \
--user my_admin \
--queries-per-transaction 100 \
--ask-password \
--verbose 3 \
--enable-binlog \
--threads 4 \
--overwrite-tables \
--compress-protocol
MyLoader seems to start loading and produced the following output:
** Message: 08:37:56.624: Server version reported as: 5.7.32-log
** Message: 08:37:56.674: Thread 1 restoring create database on `mySchema` from mySchema-schema-create.sql.gz
** Message: 08:37:56.711: Thread 2 restoring table `mySchema`.`my_large_table` from export-20220217-073020/mySchema.my_large_table-schema.sql.gz
** Message: 08:37:56.711: Dropping table or view (if exists) `mySchema`.`my_large_table`
** Message: 08:37:56.979: Creating table `mySchema`.`my_large_table` from export-20220217-073020/mySchema.my_large_table-schema.sql.gz
** Message: 08:37:57.348: Thread 2 restoring `mySchema`.`my_large_table` part 3 of 0 from mySchema.my_large_table.00003.sql.gz. Progress 1 of 26 .
** Message: 08:37:57.349: Thread 1 restoring `mySchema`.`my_large_table` part 0 of 0 from mySchema.my_large_table.00000.sql.gz. Progress 2 of 26 .
** Message: 08:37:57.349: Thread 4 restoring `mySchema`.`my_large_table` part 1 of 0 from mySchema.my_large_table.00001.sql.gz. Progress 3 of 26 .
** Message: 08:37:57.349: Thread 3 restoring `mySchema`.`my_large_table` part 2 of 0 from mySchema.my_large_table.00002.sql.gz. Progress 4 of 26 .
When I execute a "show full processlist" command on the Azure database, I see the 4 connected threads, but I see they are all sleeping, it seems like nothing is happening.
When I don't kill the command, it errors out after a long time:
** (myloader:31323): CRITICAL **: 17:07:27.642: Error occours between lines: 6 and 1888321 on file mySchema.my_large_table.00002.sql.gz: Lost connection to MySQL server during query
** (myloader:31323): CRITICAL **: 17:07:27.642: Error occours between lines: 6 and 1888161 on file mySchema.my_large_table.00001.sql.gz: MySQL server has gone away
** (myloader:31323): CRITICAL **: 17:07:27.642: Error occours between lines: 6 and 1888353 on file mySchema.my_large_table.00003.sql.gz: Lost connection to MySQL server during query
** (myloader:31323): CRITICAL **: 17:07:27.642: Error occours between lines: 6 and 1888284 on file mySchema.my_large_table.00000.sql.gz: MySQL server has gone away
After these errors, the table is still empty.
I tried a few different settings when dumping/loading, but to no avail:
start only 1 thread
make smaller chunks (100MB)
remove --compress-protocol
I also tried importing a smaller table (400MB in chunks of 100MB), with exactly the same settings, and that did actually work.
I tried to import the tables into a MySQL database on my local machine, and there I experienced exactly the same problem: the large table (31GB) import created 4 sleeping threads and didn't do anything, while the smaller table import (400MB in chunks of 100MB) did work.
So the problem doesn't seem to be related to the Azure database.
I now have no clue what the problem is. Any ideas?
I had a similar problem; for me it turned out that the instance I was restoring into was too small, and the server kept running out of memory. Try temporarily increasing the instance to a much larger size, and once the data is imported, shrink the instance back down.

Errors while setting up a Kafka connector pipeline

Environment overview:
Docker containers
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=mypw -d mysql:latest
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.9.2
Operating System
WSL2 ON - Windows-10 Version 1909 (OS build 18363.1139)
Kafka Version
confluent-6.0.0
Hello
I'm reading Kafka: The Definitive Guide by Neha Narkhede, Gwen Shapira, and Todd Palino.
I've reached the Connector Example: MySQL to Elasticsearch chapter (p. 146) and I'm following the instructions to create a pipeline from a MySQL source to an Elasticsearch sink.
I have made some deviations from the instructions:
I created the MySQL and Elasticsearch connectors using mvn package instead of mvn build.
I placed the said connectors in a folder named C:\Users\ROY\confluent-6.0.0\share\kafka, together with some other connectors I've downloaded. I set the plugin.path variable in connect-distributed.properties to:
plugin.path=C://Users//ROY//confluent-6.0.0//share//kafka,/mnt/c/Users/ROY/confluent-6.0.0/share/kafka
I'm using MySQL and Elasticsearch as Docker containers.
The MySQL connector works fine and reads data into a topic,
but when I try to create the Elasticsearch connector I get the following error:
...
(io.confluent.connect.elasticsearch.ElasticsearchSinkConnectorConfig:354)
[2020-10-16 12:22:27,170] ERROR WorkerSinkTask{id=elastic-login-connector-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:187)
java.lang.NoClassDefFoundError: io/searchbox/action/Action
at io.confluent.connect.elasticsearch.ElasticsearchSinkTask.start(ElasticsearchSinkTask.java:74)
at io.confluent.connect.elasticsearch.ElasticsearchSinkTask.start(ElasticsearchSinkTask.java:48)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:302)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:193)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:185)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:235)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: io.searchbox.action.Action
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 11 more
...
Can someone help me resolve this issue and explain the error?
Thank you
Roy
OK, I found the problem. After I ran 'mvn package' on the Elasticsearch connector source code, I should have copied the whole output directory into Kafka's /share dir; instead I only took one .jar file.

My account on the Cosmos global instance seems to be running out of space - maybe I need to increase my quota

Trying to run a simple hdfs query failed with:
[ms#cosmosmaster-gi ~]$ hadoop fs -ls /user/ms/def_serv/def_servpath
Java HotSpot(TM) 64-Bit Server VM warning: Insufficient space for shared memory file:
/tmp/hsperfdata_ms/21066
Try using the -Djava.io.tmpdir= option to select an alternate temp location.
Exception in thread "main" java.lang.NoClassDefFoundError: ___/tmp/hsperfdata_ms/21078
Caused by: java.lang.ClassNotFoundException: ___.tmp.hsperfdata_ms.21078
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: ___/tmp/hsperfdata_ms/21078. Program will exit.
Any idea how to fix this or increase the quota?
Thanks!
ms
Your quota has not been exceeded (see command below), but this was a problem with the cluster. It should be fixed now.
$ hadoop fs -dus /user/ms
hdfs://cosmosmaster-gi/user/ms 90731

Not able to connect MySQL database to Presto - No factory for connector mysql

I always get this error when trying to start the Presto server in IntelliJ.
2015-06-05T19:30:32.293+0530 ERROR main com.facebook.presto.server.PrestoServer No factory for connector mysql
java.lang.IllegalArgumentException: No factory for connector mysql
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:145)
at com.facebook.presto.connector.ConnectorManager.createConnection(ConnectorManager.java:131)
at com.facebook.presto.metadata.CatalogManager.loadCatalog(CatalogManager.java:88)
at com.facebook.presto.metadata.CatalogManager.loadCatalogs(CatalogManager.java:70)
at com.facebook.presto.server.PrestoServer.run(PrestoServer.java:107)
at com.facebook.presto.server.PrestoServer.main(PrestoServer.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
2015-06-05T19:30:32.294+0530 INFO Thread-88 io.airlift.bootstrap.LifeCycleManager Life cycle stopping...
Process finished with exit code 1
I installed MySQL using brew.
When each Presto server starts, it logs which catalogs were loaded. My guess is that the file is not in the correct location, or you did not restart your Presto servers. Note that the file must be on every Presto server.
The 'mysql.properties' file should be present in the presto-main/etc/catalog folder.
Also, 'presto-main/etc/config.properties' should be edited: '../presto-mysql/pom.xml' needs to be appended to plugin.bundles, as shown below.
$ cat presto-main/etc/config.properties
# sample nodeId to provide consistency across test runs
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.environment=test
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://localhost:8080
exchange.http-client.max-connections=1000
exchange.http-client.max-connections-per-server=1000
exchange.http-client.connect-timeout=1m
exchange.http-client.read-timeout=1m
scheduler.http-client.max-connections=1000
scheduler.http-client.max-connections-per-server=1000
scheduler.http-client.connect-timeout=1m
scheduler.http-client.read-timeout=1m
query.client.timeout=5m
query.max-age=30m
plugin.bundles=\
../presto-raptor/pom.xml,\
../presto-hive-cdh4/pom.xml,\
../presto-example-http/pom.xml,\
../presto-kafka/pom.xml,\
../presto-tpch/pom.xml,\
../presto-mysql/pom.xml
presto.version=testversion
experimental-syntax-enabled=true
distributed-joins-enabled=true
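For reference, a minimal mysql.properties catalog file (host, port, and credentials below are placeholders for your own environment) looks roughly like this:
connector.name=mysql
connection-url=jdbc:mysql://localhost:3306
connection-user=root
connection-password=secret
After adding the catalog file and the plugin.bundles entry, restart every Presto server so the new catalog is picked up.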