java.sql.Clob reading: weird results between MySQL and Oracle - mysql

I have unified JDBC code for reading/writing large texts. The column is CLOB on Oracle and TEXT on MySQL. The following code
java.sql.Clob aClob = resultSet.getClob(COLUMN_NAME);
java.io.InputStream aStream = aClob.getAsciiStream();
int av = aStream.available();
gives a meaningful value on MySQL (Connector/J 5.0.4) but zero on Oracle (Oracle JDBC driver 11.2.0.2). Fortunately, Clob.length() gives the correct value on both, and reading the InputStream until read() returns -1 works too, so there are other ways of obtaining the data in a unified way.
Javadoc gives this weird note:
The available method for class InputStream always returns 0.
So which driver is right? And no, I don't want to drag vendor-specific packages into the code :-) This question is JDBC neutral.
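For completeness, the unified approach that does work on both drivers is to ignore available() and simply read the stream until read() returns -1; a minimal sketch (reusing the variable names from above, and using the character stream since the column holds text):

// Works on both drivers: ignore available() and read until end of stream.
java.sql.Clob aClob = resultSet.getClob(COLUMN_NAME);
java.io.Reader aReader = aClob.getCharacterStream(); // getAsciiStream() wrapped in an InputStreamReader works too
StringBuilder text = new StringBuilder();
char[] buffer = new char[4096];
int n;
while ((n = aReader.read(buffer)) != -1) { // -1 signals end of stream, regardless of what available() reported
    text.append(buffer, 0, n);
}
aReader.close();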

I would be tempted to say that both drivers were right.
The Javadoc for the available() method appears to suggest that the value returned is an estimate of how many bytes the InputStream currently has cached and can return to you without an I/O operation. How many bytes it has cached, and how it does any caching, would seem to me to be an implementation detail. The fact that these values are different merely suggests that the two drivers are implemented differently. Nothing in the Javadoc for the available() method suggests to me that either driver is doing anything wrong.
I'd guess that the Oracle driver doesn't cache any data from the CLOB immediately after executing the query, which might be why the available() method returns 0. However, once data has been read from the stream, the available() method for the Oracle driver no longer returns 0, as it seems the Oracle JDBC driver has by then gone to the database and fetched some data out of the CLOB column. The MySQL driver, on the other hand, seems to be more proactive about fetching data out of the TEXT column as soon as the query has finished executing.
Having read the Javadoc for the available() method I'm not sure why I'd use it. What are you using it for?

Related

Unable to Create Extract - Tableau and Spark SQL

I am trying to create an extract from Spark SQL. The following error message appears while creating the extract.
[Simba][Hardy] (35) Error from server: error code: '0' error message: 'org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 906 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)'.
A quick fix is just changing the setting in your execution context.
spark.sql("set spark.driver.maxResultSize = 8G")
I'm not entirely convinced about the Spark SQL Thrift Server here, and it is a little awkward to distill all the facts. Tableau uses the results collected to the driver; how else could it get them from Spark?
However:
Set spark.driver.maxResultSize 0 in the relevant spark-thrift-sparkconf.conf file; this means no limit (except the physical limits of the driver node).
Set spark.driver.maxResultSize 8G or higher in the relevant spark-thrift-sparkconf.conf file. Note that not all memory on the driver can be used.
Or, use the Impala connection for Tableau, assuming a Hive/Impala source; then there are fewer such issues.
Also, the number of concurrent users can be a problem. Hence the last point.
Interesting to say the least.
spark.driver.maxResultSize 0
This is the setting you can put in your advanced cluster settings. This will solve your 4 GB issue.
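If you control the Spark application yourself rather than going through the Thrift Server, the same limit can also be supplied when the session is built; a minimal Java sketch (the app name and value are just examples, and the setting generally needs to be in place before the driver starts collecting results):

SparkSession spark = SparkSession.builder()
    .appName("tableau-extract")                  // illustrative name
    .config("spark.driver.maxResultSize", "8g")  // or "0" for no limit
    .getOrCreate();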

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I cannot find any option in the official documentation to enable bulk insert mode.
I'm wondering if it's possible to set bulk insert mode in a Dataflow pipeline. If so, please let me know what I need to change in the code below.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        public void setParameters(KV<Integer, String> element, PreparedStatement query) {
            query.setInt(1, element.getKey());
            query.setString(2, element.getValue());
        }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
// Count words in input file(s)
.apply(new CountWords())
// Format as text
.apply(MapElements.via(new FormatAsTextFn()))
// Make key-value pairs with the first letter as the key
.apply(ParDo.of(new FirstLetterAsKey()))
// Group the words by first letter
.apply(GroupByKey.<String, String> create())
// Get a PCollection of only the values, discarding the keys
.apply(ParDo.of(new GetValues()))
// Write the words to the database
.apply(JdbcIO.<String> writeIterable()
.withDataSourceConfiguration(
JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
.withStatement(INSERT_OR_UPDATE_SQL)
.withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference from the normal write() method of JdbcIO is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of a PCollection<RowT>. Each Iterable is written as one batch to the database.
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)
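For completeness, the FirstLetterAsKey and GetValues transforms referenced in the pipeline comments above are not part of JdbcIO; they could be plain DoFns along these lines (a rough sketch only - the key choice mirrors the word-count example, and the Iterable output may need an explicit coder such as IterableCoder.of(StringUtf8Coder.of())):

// Keys each formatted line by its first letter, so related rows end up in the same group/batch.
static class FirstLetterAsKey extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String line = c.element();
    String key = line.isEmpty() ? "" : line.substring(0, 1);
    c.output(KV.of(key, line));
  }
}

// Drops the keys produced by GroupByKey, keeping each grouped Iterable as one batch.
static class GetValues extends DoFn<KV<String, Iterable<String>>, Iterable<String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(c.element().getValue());
  }
}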

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on migrating 3.0 code into the new 4.2 framework. I am facing a few difficulties:
How do I do CDR-level deduplication in the new 4.2 framework? (Note: table deduplication is already done.)
Where should I implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list or just reject the tuples? Here I am also updating columns for a few tuples.
My file is not moving into the archive. The temporary output file is being generated, but it is empty and ends up outside the load directory. What could the possible reasons be? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it gets stuck. I have printed the TableRowGenerator stream to the logs (at the end of the DataProcessor).
1. and 2.:
You need to select the type of deduplication. It does not make a big difference whether you choose "table-" or "cdr-level-deduplication".
The ite.businessLogic.transformation.outputType parameter affects this. There is only one dedup step; you cannot have both.
Select recordStream for "cdr-level-deduplication", and do the transformation to table row format (e.g. if you want to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate handling: reject (discard) tuples, set special column values, or write them to different target tables.
3.:
Possible reasons could be:
Missing forwarding of window punctuations or the statistic tuple
An error in the BloomFilter configuration; you would see this easily because the PE is down and the error log gives hints about wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. The "debug sinks" also write punctuation markers to the debug files.

Operation not allowed after ResultSet closed in solr import

I encountered an error while doing a full-import in Solr 6.6.0.
I am getting the exception below.
This happens when I set
batchSize="-1" in my db-config.xml
If I change this value to, say, batchSize="100", then the import runs without any error.
But the recommended value for this is "-1".
Any suggestions as to why Solr is throwing this exception?
By the way, the data I am trying to import is not huge; it is just 250 documents.
Stack trace:
org.apache.solr.handler.dataimport.DataImportHandlerException: java.sql.SQLException: Operation not allowed after ResultSet closed
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:61)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:464)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:377)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:133)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:475)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:516)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:414)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:329)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:474)
at org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:457)
at java.lang.Thread.run(Thread.java:745)
By the way, I am getting one more warning:
Could not read DIH properties from /configs/state/dataimport.properties :class org.apache.zookeeper.KeeperException$NoNodeException
This happens when the config directory is not writable.
How can we make the config directory writable in SolrCloud mode?
I am using ZooKeeper as a watchdog. Can we go ahead and change the permissions of the config files which are in ZooKeeper?
Your help is greatly appreciated.
Using fetchSize="-1" is only recommended if you have problems running without it. Its behaviour is up to the JDBC driver, but the cause of people assuming its recommended is this sentence from the old wiki:
DataImportHandler is designed to stream row one-by-one. It passes a fetch size value (default: 500) to Statement#setFetchSize which some drivers do not honor. For MySQL, add batchSize property to dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.
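For context, Integer.MIN_VALUE is the value that switches MySQL Connector/J into row-by-row streaming at the plain JDBC level; roughly, the driver-side equivalent of what the wiki describes looks like this (a sketch, not actual DIH code, and the query is just an example):

// MySQL Connector/J only streams results for a forward-only, read-only statement
// whose fetch size is Integer.MIN_VALUE; any other value makes it buffer the
// whole result set in memory.
Statement stmt = connection.createStatement(
    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = stmt.executeQuery("select * from person");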
Unless you're actually seeing issues with the default values, leave the setting alone and assume your JDBC driver does the correct thing (.. which it might not do with -1 as the value).
The reason dataimport.properties has to be writable is that Solr writes the time of the last import to that file, so that you can perform delta updates by referencing the time of the last update in your SQL statement.
You'll have to make the directory writable for the client (solr) if you want to use this feature. My guess would be that you can ignore the warning if you're not using delta imports.

Returning values from InputFormat via the Hadoop Configuration object

Consider a running Hadoop job, in which a custom InputFormat needs to communicate ("return", similarly to a callback) a few simple values to the driver class (i.e., to the class that has launched the job), from within its overridden getSplits() method, using the new mapreduce API (as opposed to mapred).
These values should ideally be returned in-memory (as opposed to saving them to HDFS or to the DistributedCache).
If these values were only numbers, one could be tempted to use Hadoop counters. However, in numerous tests counters do not seem to be available at the getSplits() phase and anyway they are restricted to numbers.
An alternative could be to use the Configuration object of the job, which, as the source code reveals, should be the same object in memory for both the getSplits() and the driver class.
In such a scenario, if the InputFormat wants to "return" a (say) positive long value to the driver class, the code would look something like:
// In the custom InputFormat.
public List<InputSplit> getSplits(JobContext job) throws IOException
{
...
long value = ... // A value >= 0
job.getConfiguration().setLong("value", value);
...
}
// In the Hadoop driver class.
Job job = ... // Get the job to be launched
...
job.submit(); // Start running the job
...
while (!job.isComplete())
{
...
if (job.getConfiguration().getLong("value", -1) >= 0)
{
...
}
else
{
continue; // Wait for the value to be set by getSplits()
}
...
}
The above works in tests, but is it a "safe" way of communicating values?
Or is there a better approach for such in-memory "callbacks"?
UPDATE
The "in-memory callback" technique may not work in all Hadoop distributions, so, as mentioned above, a safer way is, instead of saving the values to be passed back in the Configuration object, create a custom object, serialize it (e.g., as JSON), saved it (in HDFS or in the distributed cache) and have it read in the driver class. I have also tested this approach and it works as expected.
Using the configuration is a perfectly suitable solution (admittedly for a problem I'm not sure I understand), but once the job has actually been submitted to the job tracker, you will not be able to amend this value (client side or task side) and expect to see the change on the opposite side of the comms (setting configuration values in a map task, for example, will not be persisted to the other mappers, nor to the reducers, nor will it be visible to the job tracker).
So communicating information from within getSplits() back to your client polling loop (to see when the job has actually finished defining the input splits) is fine in your example.
What's your greater aim or use case for using this?