How can I create a parquet file bigger than node's assigned memory? - apache-drill

I'm trying to create a parquet file from a table stored in MySQL. The source contains millions of rows, and I get a "GC overhead limit exceeded" error after a couple of minutes.
Can Apache Drill be configured in a way that allows operations to spill to disk temporarily when there is no more RAM available?
These were my steps before getting the error:
Put the mysql jdbc connector inside jars/3rdparty
Execute sqlline.bat -u "jdbc:drill:zk=local"
Navigate to http://localhost:8047/storage
Configure a new storage plugin to connect to MySQL
Navigate to http://localhost:8047/query and execute the following queries
ALTER SESSION SET `store.format` = 'parquet';
ALTER SESSION SET `store.parquet.compression` = 'snappy';
create table dfs.tmp.`bigtable.parquet` as (select * from mysql.schema.bigtable)
Then I get the error and the application exits:
Node ran out of Heap memory, exiting.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2149)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1956)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3308)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:463)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3032)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2280)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2673)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2546)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2504)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1370)
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208)
at org.apache.commons.dbcp.DelegatingStatement.executeQuery(DelegatingStatement.java:208)
at org.apache.drill.exec.store.jdbc.JdbcRecordReader.setup(JdbcRecordReader.java:177)
at org.apache.drill.exec.physical.impl.ScanBatch.<init>(ScanBatch.java:101)
at org.apache.drill.exec.physical.impl.ScanBatch.<init>(ScanBatch.java:128)
at org.apache.drill.exec.store.jdbc.JdbcBatchCreator.getBatch(JdbcBatchCreator.java:40)
at org.apache.drill.exec.store.jdbc.JdbcBatchCreator.getBatch(JdbcBatchCreator.java:33)
at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:151)
at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:131)
at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:174)
at org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:105)
at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:79)
at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:230)
at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Check drill-env.sh located in <drill_installation_directory>/conf.
By default, the values are:
DRILL_MAX_DIRECT_MEMORY="8G"
DRILL_HEAP="4G"
The default memory for a Drillbit is 8G, but Drill prefers 16G or more depending on the workload.
If you have sufficient RAM, you can raise these values, for example to 16G.
You can read about this in detail in Drill's documentation.
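For example, if the node has enough RAM you could raise both values in drill-env.sh (the figures below are illustrative, pick values that fit your machine) and restart the Drillbit:
DRILL_MAX_DIRECT_MEMORY="16G"
DRILL_HEAP="8G"
DRILL_HEAP is the one that matters for a "GC overhead limit exceeded" error, since that error comes from the JVM heap.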

Related

Change or Add ROW label

$counters = @(
    "\Processor(_Total)\% Processor Time"
    ,"\Memory\Available MBytes"
    ,"\Paging File(_Total)\% Usage"
    ,"\LogicalDisk(*)\Avg. Disk Bytes/Read"
    ,"\LogicalDisk(*)\Avg. Disk Bytes/Write"
    ,"\LogicalDisk(*)\Avg. Disk sec/Read"
    ,"\LogicalDisk(*)\Avg. Disk sec/Write"
    ,"\LogicalDisk(*)\Disk Read Bytes/sec"
    ,"\LogicalDisk(*)\Disk Write Bytes/sec"
    ,"\LogicalDisk(*)\Disk Reads/sec"
    ,"\LogicalDisk(*)\Disk Writes/sec"
)
(Get-Counter $counters).CounterSamples
I'm new to PowerShell and found this script to get server performance. When you execute this command, you get a column called "Path". How would you rename this, or add a new label, for better understanding?
Example labels:
Read Latency = "\LogicalDisk(*)\Disk Reads/sec"
Write Latency = "\LogicalDisk(*)\Disk Writes/sec"
I have tried a foreach, but then the counters are collected one at a time and the data is not accurate. They need to be captured at once so that all performance counters reflect the server at that exact moment. Our environments are still running on PowerShell 5.1 (including Windows 2016/2019).
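One way to approach this in PowerShell 5.1 (a sketch, not from the original thread): keep the single Get-Counter call so every sample shares one snapshot, then relabel the samples afterwards with a calculated property. The label names below are illustrative.
# Collect all counters in one call so the samples share a timestamp
$samples = (Get-Counter $counters).CounterSamples

# Map the raw counter path to a friendly label after collection
$samples | Select-Object `
    @{Name = 'Label'; Expression = {
        switch -Wildcard ($_.Path) {
            '*\% processor time'    { 'CPU %'; break }
            '*\available mbytes'    { 'Available Memory (MB)'; break }
            '*\avg. disk sec/read'  { 'Read Latency (s)'; break }
            '*\avg. disk sec/write' { 'Write Latency (s)'; break }
            '*\disk reads/sec'      { 'Read IOPS'; break }
            '*\disk writes/sec'     { 'Write IOPS'; break }
            default                 { $_.Path }
        }
    }},
    InstanceName,
    CookedValue
Because the renaming happens after the samples are taken, the measurement itself is still a single point in time.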

Unable to Create Extract - Tableau and Spark SQL

I am trying to create an extract from Spark SQL in Tableau. The following error message shows up while creating the extract.
[Simba][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 906 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)'.
A quick fix is just changing the setting in your execution context.
spark.sql("set spark.driver.maxResultSize = 8G")
I'm not entirely convinced about the Spark SQL Thrift Server, and it's a little awkward to distill all the facts. Tableau uses the results collected to the driver; how else could it get them from Spark?
However:
Setting spark.driver.maxResultSize to 0 in the relevant spark-thrift-sparkconf.conf file means no limit (except the physical limits of the driver node).
Setting spark.driver.maxResultSize to 8G or higher in the relevant spark-thrift-sparkconf.conf file also works. Note that not all memory on the driver can be used.
Or, use the Impala connection for Tableau, assuming a Hive/Impala source; then there are fewer such issues.
Also, the number of concurrent users can be a problem; hence the last point.
Interesting to say the least.
spark.driver.maxResultSize 0
This is the setting you can put in your advanced cluster settings. This will solve your 4 GB issue.
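For reference, a minimal sketch of what the file-based setting could look like (the exact file location depends on your distribution, and the spark.driver.memory value is an illustrative assumption, not part of the original answer):
# spark-thrift-sparkconf.conf (or spark-defaults.conf)
spark.driver.maxResultSize   8g    # or 0 for no limit, bounded only by driver memory
spark.driver.memory          16g   # the driver heap still has to hold the collected results
Values set in this file take effect when the Thrift Server is (re)started.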

Operation not allowed after ResultSet closed in solr import

I encountered an error while doing a full-import in Solr 6.6.0.
I am getting the exception below.
This happens when I set
batchSize="-1" in my db-config.xml.
If I change this value to, say, batchSize="100", then the import runs without any error.
But the recommended value for this is "-1".
Any suggestion why Solr is throwing this exception?
By the way, the data I am trying to import is not huge; it is just 250 documents.
Stack trace:
org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:464)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:377)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:133)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:516)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:745)
By the way, I am getting one more warning:
Could not read DIH properties from /configs/state/dataimport.properties :class org.apache.zookeeper.KeeperException$NoNodeException
This happens when the config directory is not writable.
How can we make the config directory writable in SolrCloud mode?
I am using ZooKeeper as a watchdog. Can we go ahead and change the permissions of the config files that are in ZooKeeper?
Your help is greatly appreciated.
Using fetchSize="-1" is only recommended if you have problems running without it. Its behaviour is up to the JDBC driver, but the cause of people assuming its recommended is this sentence from the old wiki:
DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.
Unless you're actually seeing issues with the default values, leave the setting alone and assume your JDBC driver does the correct thing (.. which it might not do with -1 as the value).
The reason for dataimport.properties having to be writable is that it writes a property for the last time the import ran to the file, so that you can perform delta updates by referencing the time of the last update in your SQL statement.
You'll have to make the directory writable for the client (solr) if you want to use this feature. My guess would be that you can ignore the warning if you're not using delta imports.
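For context, the batchSize attribute sits on the dataSource element of the DIH configuration; a minimal sketch, with placeholder connection details:
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb"
            user="user" password="pass"
            batchSize="-1" />
With -1 the handler passes Integer.MIN_VALUE to the driver as the fetch size (per the wiki text quoted above); leaving the attribute out keeps the default fetch size of 500.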

Can you create a disk and an instance with one command in Google Compute Engine?

Currently, I'm creating a disk from a snapshot. Then I wait for 60 seconds and create an instance which will use that disk as its system disk. I'm using the gcloud utility for this.
Is there any way I can create the disk and the instance in one command?
Mix of copy-pasted Python code and pseudocode below:
cmd_create_disk = [GCLOUD, 'compute', 'disks', 'create', new_instance,
'--source-snapshot', GCE_RENDER_SNAPSHOT_VERSION,
'--zone', GCE_REGION, '--project', GCE_PROJECT]
# wait for 60 seconds
cmd_make_instance = [GCLOUD, 'compute', 'instances', 'create', new_instance,
'--disk', 'name='+new_instance+',boot=yes,auto-delete=yes',
'--machine-type', instance_type, '--network', GCE_NETWORK,
'--no-address', '--tags', 'render', '--tags', 'vpn',
'--tags', proj_tag, '--zone', GCE_REGION,
'--project', GCE_PROJECT]
The instance uses the disk as its system disk. Waiting for 60 seconds is quite arbitrary and I'd rather leave this up to GCE, making sure the instance is indeed started with the system disk.
When you delete an instance you can specify that the disk should also be deleted. In the same manner, I'd like to create an instance and specify that its disk be created from an image.
The boot disk can be created automatically. You can specify the image to use for that using --image and --image-project flags in gcloud compute instances create command line. You'll need to make sure to create the image first though - your current command to create the disk seems to use a snapshot rather than an image.
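A sketch of what the single command could look like, assuming an image has been created from the snapshot first (my-render-image and the other names below are placeholders standing in for the variables in the original script):
gcloud compute instances create my-instance \
    --image my-render-image --image-project my-project \
    --machine-type n1-standard-4 --network my-network \
    --no-address --tags render,vpn,my-proj-tag \
    --zone europe-west1-b --project my-project
The boot disk is created from the image and is auto-deleted with the instance unless you pass --no-boot-disk-auto-delete, so no separate disks create step or sleep is needed.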

Not able to join postgres table through apache drill

Not able to join multiple tables of a Postgres database through Apache Drill. When trying this, the error below comes up.
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: IllegalStateException: Memory was leaked by query.
Memory leaked: (40960) Allocator(op:0:0:6:JdbcSubScan) 1000000/40960/2228224/10000000000 (res/actual/peak/limit)
Fragment 0:0
[Error Id: b05fe30e-cc3a-4e7f-b81e-46ecfd1a9466 on INBBRDSSVM300.india.tcs.com:31010]
(java.lang.IllegalStateException) Memory was leaked by query. Memory leaked: (40960) Allocator(op:0:0:6:JdbcSubScan) 1000000/40960/2228224/10000000000 (res/actual/peak/limit)
org.apache.drill.exec.memory.BaseAllocator.close():492
org.apache.drill.exec.ops.OperatorContextImpl.close():124
org.apache.drill.exec.ops.FragmentContext.suppressingClose():416
org.apache.drill.exec.ops.FragmentContext.close():405
org.apache.drill.exec.work.fragment.FragmentExecutor.closeOutResources():346
org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup():179
org.apache.drill.exec.work.fragment.FragmentExecutor.run():290
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1145
java.util.concurrent.ThreadPoolExecutor$Worker.run():615
java.lang.Thread.run():745
Try increasing DRILLBIT_MAX_PROC_MEM, which should be higher than DRILL_MAX_DIRECT_MEMORY + DRILL_HEAP.
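For example, in <drill_installation_directory>/conf/drill-env.sh (the figures below are illustrative assumptions, not values from the original answer):
DRILL_MAX_DIRECT_MEMORY="12G"
DRILL_HEAP="8G"
DRILLBIT_MAX_PROC_MEM="24G"   # must exceed DRILL_MAX_DIRECT_MEMORY + DRILL_HEAP
Restart the Drillbit after changing these values so they take effect.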