Need a MySQL-compatible compress()/decompress() for Java

I'm thinking of applying the MySQL COMPRESS() function to a varchar field that tends to run from a few thousand characters to more than a million per row. The text is almost normal English, so I get 8-to-1 or even better compression. Since I have millions of records and rarely want to actually look at the data, compression seems to be a good engineering tradeoff.
I need to do most of the processing in Java, and there are nice implementations of zip, gzip and bzip2. So that is cool.
But I'd really love to be able to use the standard MySQL client to do queries such as
select decompress(longcolumn) where ...
so I'd like my Java code to use the same, or at least a compatible, compression algorithm as the built-in function. The documentation I can find just says "compiled with a compression library such as zlib", which is a bit vague. How can I know exactly what to use?
=== edited ===
To be clear, I want to be able to use the "mysql" client program for debugging, so queries like:
select decompress(longcolumn) where ...
don't use Java at all. But I want to do the updates and inserts using JDBC.
And the mainline usage has to fetch the compressed blob and then decompress it. Some sort of wrapper or stream class like ZipInputStream is fine.

I'm not certain, but I'd try just wrapping the output with an InflaterInputStream():
ResultSet resultSet = statement.executeQuery("SELECT blobfield FROM table");
resultSet.next(); // advance to the first row before reading the blob
InputStream stream = new InflaterInputStream(resultSet.getBlob(1).getBinaryStream());
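One compatibility detail worth checking: per the MySQL docs, COMPRESS() stores a four-byte, little-endian length of the uncompressed string in front of the zlib stream, so to round-trip with the server's COMPRESS()/UNCOMPRESS() the Java side has to skip those four bytes when reading and prepend them when writing. A rough sketch using java.util.zip (the class and method names here are mine, not anything standard):

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class MySqlCompress {

    // Mimics MySQL COMPRESS(): a 4-byte little-endian length of the
    // uncompressed input, followed by a zlib (RFC 1950) stream.
    // Edge case not handled: MySQL stores an empty string as an empty string.
    public static byte[] compress(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(ByteBuffer.allocate(4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(input.length)
                .array(), 0, 4);
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Mimics MySQL UNCOMPRESS(): skip the 4-byte length header and inflate the rest.
    public static byte[] uncompress(byte[] input) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(input, 4, input.length - 4);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }
}

Bytes produced by compress() and inserted over JDBC with setBytes() should then be readable in the mysql client with SELECT UNCOMPRESS(longcolumn) ..., and values written by the server's COMPRESS() should decode with uncompress() above.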
Javadoc: https://docs.oracle.com/javase/7/docs/api/java/util/zip/InflaterInputStream.html
This blogpost might be interesting to you as well:
http://www.mysqlperformanceblog.com/2012/05/30/data-compression-in-innodb-for-text-and-blob-fields/

Related

Postgresql JSONB is coming. What to use now? Hstore? JSON? EAV?

After going through the relational DB/NoSQL research debate, I've come to the conclusion that I will be moving forward with PG as my data store. A big part of that decision was the announcement of JSONB coming to 9.4. My question is what should I do now, building an application from the ground up knowing that I want to migrate to (I mean use right now!) jsonb? The DaaS options for me are going to be running 9.3 for a while.
From what I can tell (and correct me if I'm wrong), hstore would run quite a bit faster, since I'll be doing a lot of queries on many keys in the hstore column, and if I were to use plain json I wouldn't be able to take advantage of indexing/GIN etc. However, I could take advantage of nesting with json, but running any queries would be very slow and users would be frustrated.
So, do I build my app around the current version of hstore or the json data type, "good ol'" EAV, or something else? Should I structure my DB and app code a certain way? Any advice would be greatly appreciated. I'm sure others may face the same question as we await the next official release of PostgreSQL.
A few extra details on the app I want to build:
- Very relational (with one exception below)
- Strong social network aspect (groups, friends, likes, timeline etc)
- Based around a single object with variable user assigned attributes, maybe 10 or 1000+ (this is where the schema-less design need comes into play)
Thanks in advance for any input!
It depends. If you expect to have a lot of users, a very high transaction volume, or an insane number of attribute fetches per query, I would say use HSTORE. If, however, your app will start small and grow over time, or will have relatively few transactions that fetch attributes, or just fetch a few per query, then use JSON. Even in the latter case, if you're not fetching many attributes but are checking one or two keys often in the WHERE clause of your queries, you can create a functional index to speed things up:
CREATE INDEX idx_foo_somekey ON foo((bar ->> 'somekey'));
Now, when your query has WHERE bar ->> 'somekey' = ... in it, it should use the index.
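To make that concrete from application code, here is a rough JDBC sketch; the table foo, column bar and the bar ->> 'somekey' expression match the index above, while the connection string, credentials and values are made up:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonKeyLookup {
    public static void main(String[] args) throws Exception {
        // Assumes a table foo(id, bar json) and the functional index from above:
        // CREATE INDEX idx_foo_somekey ON foo((bar ->> 'somekey'));
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT id, bar FROM foo WHERE bar ->> 'somekey' = ?")) {
            ps.setString(1, "some value");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("bar"));
                }
            }
        }
    }
}

The important part is that the WHERE clause uses exactly the same expression as the index definition, so the planner can match the two.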
And of course, it will be easier to use nested data and to upgrade to jsonb when it becomes available to you.
So I would lean toward JSON unless you know for sure you're going to kick your server's ass with heavy use of key fetches before you have a chance to upgrade to 9.4. But to be sure of that, I would say: do some benchmarking with anticipated query volumes now and see what works best for you.
You probably don't give quite enough detail for a very precise answer, but I will say this... If your data is "very relational" then I believe your best course is to build it with a good relational design. If it's just one field with "variable assigned attributes", then that sounds like a good use for hstore, which is pretty tried and true at this point. I've been doing some reading on 9.4 and jsonb sounds cool, but that won't be out for a while. I suspect that a good schema design in 9.3 plus a very targeted use of hstore will probably yield a good combination of performance and flexibility.

SQL parameterization: How does this work behind the scenes?

SQL parameterization is a hot topic nowadays, and for a good reason, but does it really do anything besides escaping decently?
I could imagine a parameterization engine simply making sure the data is properly escaped before inserting it into the query string, but is that really all it does? It would make more sense to do something different at the connection level, e.g. like this:
> Sent data. Formatting: length + space + payload
< Received data
-----
> 69 SELECT * FROM `users` WHERE `username` LIKE ? AND `creation_date` > ?
< Ok. Send parameter 1.
> 4 joe%
< Ok. Send parameter 2.
> 1 0
< Ok. Query result: [...]
This approach would simply eliminate the issue of SQL injection, so you wouldn't have to avoid it through escaping. The only other way I can think of that parameterization might work is by escaping the parameters:
// $params would usually be an argument, not in the code like this
$params = ['joe%', 0];
// Escape the values
foreach ($params as $key => $value) {
    $params[$key] = mysql_real_escape_string($value);
}
// For each question mark in $query_string (another argument of the function),
// replace it with the escaped value.
$n = 0;
while (($pos = strpos($query_string, "?")) !== false && $n < count($params)) {
    // If it's numeric, don't put quotes around it.
    $param = is_numeric($params[$n]) ? $params[$n] : "'" . $params[$n] . "'";
    // Splice the escaped value in where the question mark was
    $query_string = substr($query_string, 0, $pos)
        . $param
        . substr($query_string, $pos + 1);
    $n++;
}
If the latter is the case, I'm not going to switch my sites to parameterization just yet. It has no advantage that I can see; it's just another strong vs weak typing discussion. Strong typing may catch more errors at compile time, but it doesn't really make anything possible that would be hard to do otherwise - same with this parameterization. (Please correct me if I'm wrong!)
Update:
I knew this would depend on the SQL server (and also on the client, but I assume the client uses the best possible techniques), but mostly I had MySQL in mind. Answers concerning other databases are (and were) also welcome though.
As far as I understand the answers, parameterization does indeed do more than simply escape the data. The query really is sent to the server in parameterized form, i.e. with the parameter values separated out rather than folded into a single query string.
This also enables the server to store and reuse the query with different parameters, which provides better performance.
Did I get everything? One thing I'm still curious about is whether MySQL has these features, and whether query reuse happens automatically (and if not, how it can be done).
Also, please comment when anyone reads this update. I'm not sure if it bumps the question or something...
Thanks!
I'm sure that the way that your command and parameters are handled will vary depending on the particular database engine and client library.
However, speaking from experience with SQL Server, I can tell you that parameters are preserved when sending commands using ADO.NET. They are not folded into the statement. For example, if you use SQL Profiler, you'll see a remote procedure call like:
exec sp_executesql N'INSERT INTO Test (Col1) VALUES (@p0)',N'@p0 nvarchar(4000)',@p0=N'p1'
Keep in mind that there are other benefits to parameterization besides preventing SQL injection. For example, the query engine has a better chance of reusing query plans for parameterized queries because the statement is always the same (just the parameter values change).
In response to update:
Query parameterization is so common I would expect MySQL (and really any database engine) to handle it similarly.
Based on the MySQL protocol documentation, it looks like prepared statements are handled using COM_STMT_PREPARE and COM_STMT_EXECUTE packets, which do support sending parameters separately in binary format. It's not clear whether all parameterized statements will be prepared, but it does look like unprepared statements are handled by COM_QUERY, which has no mention of parameter support.
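One way to see the difference from Java, for example, is MySQL Connector/J: out of the box it emulates prepared statements client-side (escaping the values and splicing them into a COM_QUERY), but you can switch it to real server-side prepared statements and let it cache them. A sketch; useServerPrepStmts and cachePrepStmts are real Connector/J connection properties, while the table, columns and credentials are just taken from the example query above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PreparedDemo {
    public static void main(String[] args) throws Exception {
        // useServerPrepStmts=true asks the driver to use COM_STMT_PREPARE /
        // COM_STMT_EXECUTE instead of splicing escaped values into a COM_QUERY;
        // cachePrepStmts=true lets it reuse the prepared statement handle.
        String url = "jdbc:mysql://localhost/test"
                + "?useServerPrepStmts=true&cachePrepStmts=true";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT id FROM users WHERE username LIKE ? AND creation_date > ?")) {
            ps.setString(1, "joe%");
            ps.setInt(2, 0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id"));
                }
            }
        }
    }
}

If you capture that connection with a protocol analyzer (as suggested below), you should see the statement text and the parameter values travel in separate packets when useServerPrepStmts is on.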
When in doubt: test. If you really want to know what's sent over the wire, use a network protocol analyzer like Wireshark and look at the packets.
Regardless of how it's handled internally and any optimizations it may or may not currently provide for a given engine, there's very little (nothing?) to gain from not using parameters.
Parameterized queries are passed to the SQL implementation as parameterized queries; the parameters are never concatenated into the query itself unless an implementation decides to fall back to concatenating them. Parameterized queries avoid the need for escaping and improve performance, since the query text is generic and a compiled form of it is more likely to already be cached by the database server.
The straight answer is "it's implemented whatever way it's implemented in the particular implementation in question". There's dozens of databases, dozens of access layers and in some cases more than one way for the same access layer to deal with the same code.
So, there isn't a single correct answer here.
One example would be that if you use Npgsql with a query that isn't a prepared statement, then it pretty much just escapes things correctly (though escaping in PostgreSQL has some edge cases that people who know about escaping miss, and Npgsql catches them all, so still a gain). With a prepared statement, it sends parameters as prepared-statement parameters. So one case allows for greater query-plan reuse than the other.
The SQLServer driver for the same framework (ADO.NET) passes queries through as calls to sp_executesql, which allows for query-plan re-use.
As well as that, the matter of escaping is still worth considering for a few reasons:
It's the same code each time. If you're escaping yourself, then either you're doing so through the same piece of code each time (so it's not like there's any downside to using someone else's same piece of code), or you're risking a slip-up each time.
They're also better at not escaping. There's no point going through every character in the string representation of a number looking for ' characters, for example. But does not escaping count as a needless risk, or a reasonable micro-optimisation?
Well, "reasonable micro-optimisation" in itself means one of two things. Either it requires no mental effort to write or to read for correctness afterwards (in which case you might as well), or it's hit frequently enough that tiny savings will add up, and it's easily done.
(Relatedly, it also makes more sense to write a highly optimised escaper - the sort of string replacement involved is the sort of case where the most common approach of replacing isn't as fast as some other approaches in some languages at least, but the optimisation only makes sense if the method will be called a very large number of times).
If you've a library that includes type checking the parameter (either in basing the format used on the type, or by validation, both of which are common with such code), then it's easy to do and since these libraries aim at mass use, it's a reasonable micro-opt.
If you're thinking each time about whether parameter number 7 of an 8-parameter call could possibly contain a ' character, then it's not.
They're also easier to translate to other systems if you want. To again look at the two examples I gave above, apart from the classes created, you can use pretty much identical code with System.Data.SqlClient as with Npgsql, though SQL-Server and Postgresql have different escaping rules. They also have an entirely different format for binary strings, date-times and a few other datatypes they have in common.
Also, I can't really agree with calling this a "hot topic". It's had a well-established consensus for well over a decade at the very least.

Storing JSON in an MS SQL database?

I'm developing a form generator, and wondering if it would be bad mojo to store JSON in an SQL database?
I want to keep my database & tables simple, so I was going to have
`pKey, formTitle, formJSON`
on a table, and then store
{["firstName":{"required":"true","type":"text"},"lastName":{"required":"true","type":"text"}}
in formJSON.
Any input is appreciated.
I use JSON extensively in my CMS (which hosts about 110 sites) and I find data access to be very fast. I was surprised that there wasn't more speed degradation. Every object in the CMS (Page, Layout, List, Topic, etc.) has an NVARCHAR(MAX) column called JSONConfiguration. My ORM tool knows to look for that column and reconstitute it as an object if needed. Or, depending on the situation, I will just pass it to the client for jQuery or Ext JS to process.
As for readability / maintainability of my code, you might say it's improved because I now have classes that represent a lot of the JSON objects stored in the DB.
I used JSON.net for all serialization / deserialization. https://www.newtonsoft.com/json
I also use a single query to return meta-JSON with the actual data. As in the case of Ext JS, I have queries that return both the structure of the Ext JS object as well as the data the object will need. This cuts out one post back / SQL round trip.
I was also surprised at how fast the code was to parse a list of JSON objects and map them into a DataTable object that I then handed to a GridView.
The only downside I've seen to using JSON is indexing. If you have a property of the JSON you need to search, then you have to store it as a separate column.
There are JSON DBs out there that might serve your needs better: CouchDB, MongoDB, and Cassandra.
A brilliant way to make an object database out of SQL Server. I do this for all config objects and everything else that doesn't need any specific querying. Extending your object is easy: just create a new property in your class and initialise it with a default value. Don't need a property any more? Just delete it from the class. Easy roll-out, easy upgrade. Not suitable for all objects, but as long as you extract any property you need to index on into its own column, keep using it. A very modern way of using SQL Server.
It will be slower than having the form defined in code, but one extra query shouldn't cause you much harm. (Just don't let 1 extra query become 10 extra queries!)
Edit: If you are selecting the row by formTitle instead of pKey (I would, because then your code will be more readable), put an index on formTitle
We have used a modified version of XML for exactly the purpose you describe for seven or eight years and it works great. Our customers' form needs are so diverse that we could never keep up with a table/column approach. We are too far down the XML road to change very easily, but I think JSON would work as well and maybe even better.
Reporting is no problem with a couple of good parsing functions and I would defy anyone to find a significant difference in performance between our reporting/analytics and a table/column solution to this need.
I wouldn't recommend it.
If you ever want to do any reporting or query based on these values in the future it's going to make your life a lot harder than having a few extra tables/columns.
Why are you avoiding making new tables? I say if your application requires them go ahead and add them in... Also if someone has to go through your code/db later it's probably going to be harder for them to figure out what you had going on (depending on what kind of documentation you have).
You should be able to use SisoDb for this. http://sisodb.com
I think it is not an optimal idea to store object data as a string in SQL. You have to do the transformation outside of SQL in order to parse it, which is a performance issue, and you lose the leverage of SQL's native querying capability. A better way would be to store JSON as an XML datatype in SQL. This way you kill two birds with one stone: you don't have to create a load of tables, and you still get all the native querying benefits of SQL.
XML in SQL Server 2005? Better than JSON in Varchar?

Is HBase meaningful if it's not running in a distributed environment?

I'm building an index of data, which will entail storing lots of triplets in the form (document, term, weight). I will be storing up to a few million such rows. Currently I'm doing this in MySQL as a simple table. I'm storing the document and term identifiers as string values rather than as foreign keys to other tables. I'm re-writing the software and looking for better ways of storing the data.
Looking at the way HBase works, this seems to fit the schema rather well. Instead of storing lots of triplets, I could map document to {term => weight}.
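Sketched with the plain HBase Java client (a reasonably recent client API; the table name "docs" and column family "t" are just placeholders I made up), that layout would look something like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.Map;

public class TermIndexWriter {
    // One row per document; one column per term, holding the weight.
    public static void writeDocument(Table table, String docId,
                                     Map<String, Double> termWeights) throws Exception {
        Put put = new Put(Bytes.toBytes(docId));
        for (Map.Entry<String, Double> e : termWeights.entrySet()) {
            put.addColumn(Bytes.toBytes("t"),           // column family "t"
                          Bytes.toBytes(e.getKey()),    // qualifier = term
                          Bytes.toBytes(e.getValue())); // value = weight
        }
        table.put(put);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("docs"))) {
            writeDocument(table, "doc-42", Map.of("hello", 0.8, "world", 0.2));
        }
    }
}

Reading a document back would then be a single Get on the row key, returning the whole {term => weight} map.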
I'm doing this on a single node, so I don't care about distributed nodes etc. Should I just stick with MySQL because it works, or would it be wise to try HBase? I see that Lucene uses it for full-text indexing (which is analogous to what I'm doing). My question is really: how would a single HBase node compare with a single MySQL node? I'm coming from Scala, so might a direct Java API have an edge over JDBC and MySQL's parsing of each query?
My primary concern is insertion speed, as that has been the bottleneck previously. After processing, I will probably end up putting the data back into MySQL for live-querying because I need to do some calculations which are better done within MySQL.
I will try prototyping both, but I'm sure the community can give me some valuable insight into this.
Use the right tool for the job.
There are a lot of anti-RDBMS or BASE systems (Basically Available, Soft state, Eventually consistent), as opposed to ACID (Atomicity, Consistency, Isolation, Durability), to choose from.
I've used traditional RDBMSs and, though you can store CLOBs/BLOBs, they do not have built-in indexes customized specifically for searching these objects.
You want to do most of the work (calculating the weighted frequency for each tuple found) when inserting a document. You might also want to do some work scoring the usefulness of each (documentId, searchWord) pair after each search. That way you can give better and better searches each time.
You also want to store a score or weight for each search and weighted scores for similarity to other searches. It's likely that some searches are more common than others and that the users are not phrasing their search query correctly though they mean to do a common search.
Inserting a document should also cause some change to the search weight indexes.
The more I think about it, the more complex the solution becomes. You have to start with a good design first. The more factors your design anticipates, the better the outcome.
MapReduce seems like a great way of generating the tuples. If you can get a Scala job into a jar file (not sure, since I've not used Scala before and am a JVM n00b), it'd be a simple matter to send it along and write a bit of a wrapper to run it on the MapReduce cluster.
As for storing the tuples after you're done, you also might want to consider a document based database like mongodb if you're just storing tuples.
In general, it sounds like you're doing something more statistical with the texts... Have you considered simply using lucene or solr to do what you're doing instead of writing your own?

How fast is MySQL compared to a C/C++ program running in the server?

Ok, I have a need to perform some intensive text manipulation operations, like concatenating huge strings (say 100 pages of standard text) and searching in them, etc. So I am wondering if MySQL would give me better performance for these specific operations compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, then why not give it a try? That should take much less time than coding your own tool. If the performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world one has Text Mining and Oracle Text.
There are several good responses that I voted up, but here are more considerations from my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text search is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a "blob" column. (AFAIK the InnoDB engine didn't get full-text indexing until MySQL 5.6, so on older versions you need MyISAM.) For PostgreSQL you can use tsearch.
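As a rough illustration of the MySQL route from Java (the table and column names are made up; MATCH ... AGAINST is MySQL's full-text query syntax):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class FullTextDemo {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/test", "user", "secret")) {
            try (Statement st = conn.createStatement()) {
                // One-time setup: a MyISAM table (per the advice above, or InnoDB
                // on MySQL 5.6+) with a FULLTEXT index on the text column.
                st.execute("CREATE TABLE IF NOT EXISTS documents ("
                        + " id INT AUTO_INCREMENT PRIMARY KEY,"
                        + " body MEDIUMTEXT,"
                        + " FULLTEXT KEY ft_body (body)"
                        + ") ENGINE=MyISAM");
            }
            // A full-text search uses the index instead of scanning the text.
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, MATCH(body) AGAINST(?) AS score"
                    + " FROM documents"
                    + " WHERE MATCH(body) AGAINST(?)"
                    + " ORDER BY score DESC LIMIT 10")) {
                ps.setString(1, "search terms");
                ps.setString(2, "search terms");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + " " + rs.getDouble("score"));
                    }
                }
            }
        }
    }
}

The MATCH(...) AGAINST(...) expression returns a relevance score, which is what the ORDER BY above sorts on.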
For a bit more implementation effort, though, you'll see the best performance by integrating indexing apps like Xapian, Hyper Estraier or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc., in other words real full-text query parsers that aren't limited to an SQL mindset.
Relational databases are normally not good at handling large text data. The performance-wise strength of relational DBs is indexing and the automatically generated query plan. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and trying to manipulate the data, then C/C++ should be the faster solution. Put simply, MySQL is a much bigger C program than yours, so it must be slower at simple tasks like string manipulation :-)
Of course you must use the correct algorithm to reach a good result. There is a useful e-book about string search algorithms with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that since my application requires that the text be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB be less efficient overall than processing it within the DB?
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indexes, which will be hundreds of times faster than directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or less records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed, and the same amount transferred over the network, then there probably won't be a big difference either way. If performance is critical then try both and benchmark them (if you don't have time to benchmark both then you probably want to go for whichever is easier to implement anyway).
MySQL is written in C, so it is not really correct to compare it to a C program: it is itself a C program.