MySQL implementation with CUDA - mysql

I am a senior undergrad majoring in CS. At the moment I am taking a Computer Architecture class, and we need to do a project. I want to do something related to CUDA, where the computation shows a moderate performance increase compared to a serial implementation.
I am really interested in databases, so I decided to do something related to SQL. I only have experience with MySQL, and I could not find anything about how to work with MySQL using CUDA. The only research I could find on SQL uses SQLite. I am not sure what to do or how to gather information on this subject, so I decided to ask for your opinions.
Best

Just in case someone ends up on this page: PG-Strom is a foreign data wrapper module for the PostgreSQL database.

You might want to look at Alenka, an implementation of a SQL-like language that runs on the GPU and uses CUDA.
It is open source, so you can look at its algorithms for joins, sorts and groupings.
Link:
http://sourceforge.net/projects/alenka/

Really? Google found this from NVIDIA:
http://forums.nvidia.com/index.php?showtopic=100342
They have a guide. Is that not suitable? It's certainly not for the faint of heart.
http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf

Contrary to everyone else, and while I'm not sure how, I think a GPU can work with MySQL. I don't know why everyone says it couldn't. If there is CPU workload in MySQL, then whatever that CPU is doing could, at some level, be done by a GPU if someone took the time to implement it. For example, it could update separate rows on separate threads. You could either allocate each table to a block, or leave it freeform and let the end user decide.
Or at least someone could modify the driver to make communication more efficient.

It looks like there is now a solution for querying data using SQL on GPUs in Python:
https://developer.nvidia.com/blog/beginners-guide-to-querying-data-using-sql-on-gpus-in-python/
I wonder whether there are other possibilities using different programming languages, though.

Related

How do you make real, secure benchmarks?

According to this question, a benchmark run on the same machine had very varying results.
I'm not asking about how to use microtime or whichever framework, but rather, how do you make sure that your benchmarks are not biased in any way? Any machine setup, software setup, process setup? Is there a way to make sure your benchmarks can be safely used as a reference?
Basically, benchmarking is kind of like a scientific study, so the same rules apply. A benchmark is usually done to answer some kind of question, so start by formulating a good question. After that, it takes practice and experience to eliminate the sources of bias.
Make sure you know and document the runtime environment in detail (e.g. switch off power management and other background tasks that might disturb measurements).
Make sure you repeat the experiment (benchmark run) often enough to get good and stable averages, and document the number of runs.
Make sure you know what you are measuring (e.g. use a working set that's larger than all caches if you want to measure memory performance, or use as many threads as you have cores, and so on).
In some cases this involves getting caches filled and datasets cached; in other cases you need to do the exact opposite. It depends on the question you want to answer with your benchmark.
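The advice about repetition and warm-up can be sketched in Python roughly as follows. This is only a minimal illustration: the `benchmark` helper, its `warmup` and `repeats` parameters, and the `workload` function are made-up names, and the warm-up phase assumes you want to measure the "caches filled" case rather than cold-start behaviour.

```python
import statistics
import time


def benchmark(func, *, warmup=3, repeats=20):
    """Time func() repeatedly and summarize the samples (in seconds)."""
    # Warm-up runs are executed but not recorded (assumes a "hot" measurement is wanted).
    for _ in range(warmup):
        func()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func()
        samples.append(time.perf_counter() - start)

    return {
        "runs": repeats,
        "min": min(samples),
        "median": statistics.median(samples),
        "mean": statistics.mean(samples),
        # Spread of the samples; a large value suggests a noisy environment.
        "stdev": statistics.stdev(samples) if repeats > 1 else 0.0,
    }


def workload():
    # Hypothetical workload used only for this illustration.
    return sum(range(100_000))


if __name__ == "__main__":
    print(benchmark(workload))
```

Reporting the median and the standard deviation next to the mean makes it easier to spot runs that were disturbed by background activity.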

mysql sharding case study link or paper

I have been going through the book "High Performance MySQL"; it's a really nice book. My only concern is the MySQL sharding part. There is a lot of theory, but practical implementation details are lacking, and some aspects remain a black box (such as arranging shards on nodes). It would be great if somebody could point me to a case study, article, or paper so that I can understand it properly.
Thanks in advance!!
I found one link (http://tumblr.github.com/assets/2011-11-massively_sharded_mysql.pdf). Please share more if anybody has any. Thanks.
Yes, "sharding" is really a design/development pattern. It's not a database feature of any kind; I would describe it as "the database outsourcing its scale-out capability to the application".
I work for ScaleBase (http://www.scalebase.com), which makes a complete scale-out solution, an "automatic sharding machine" if you like: it analyzes the data and SQL stream, splits the data across DB nodes, load-balances reads, and aggregates results at runtime, so you won't have to!
No code changes are needed; everything continues to work with "1 database". Your application or any other client tool (mysql, mysqldump, PHPMyAdmin...) connects to the ScaleBase controller (which looks and feels like a MySQL server), a proxy to a grid of "shards" that automates command routing, parallelizes cross-db queries, and merges results, exactly as if the result came from 1 database. ORDER, GROUP, LIMIT, and aggregate functions are supported!
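To make the general pattern concrete (key-based routing plus merging per-shard results), here is a minimal Python sketch. It is not how ScaleBase or any particular product works; the in-memory `shards`, the CRC32 routing rule, and the function names are all assumptions chosen for illustration.

```python
import heapq
import zlib

# Hypothetical in-memory "shards": each dict stands in for one database node.
NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id):
    """Route a key to a shard with a stable hash (an assumed routing rule)."""
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_SHARDS


def insert_user(user_id, score):
    shards[shard_for(user_id)][user_id] = score


def get_user(user_id):
    # A single-key lookup touches exactly one shard.
    return shards[shard_for(user_id)].get(user_id)


def top_users(limit):
    """Emulate ORDER BY score DESC LIMIT n across all shards."""
    # Each shard sorts and truncates locally, then the partial results are merged.
    partials = [
        sorted(((score, uid) for uid, score in shard.items()), reverse=True)[:limit]
        for shard in shards
    ]
    merged = heapq.merge(*partials, reverse=True)
    return [(uid, score) for score, uid in merged][:limit]


for uid in range(1, 11):
    insert_user(uid, score=uid * 10)

print(top_users(3))  # [(10, 100), (9, 90), (8, 80)]
```

The point of the sketch is that single-key operations stay cheap, while cross-shard queries need a scatter step and a merge step that the application (or a proxy) has to provide.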
Also, please visit my blog, http://database-scalability.blogspot.com/, all about scalability...
ScaleBase, my company, had a webinar not long ago specifically about sharding and data distribution. Amazingly, it's not (yet?) listed at http://www.scalebase.com/resources/webinars/. I'll see if they can upload it, or I'll attach the slides here, or similar. Stay tuned!
Hope I helped...
Doron

MySQL: which API to use?

I'm just getting started with interfacing to MySQL from a C++ app. The app is pretty simple: it's a Linux web server, and the C++ code retrieves JavaScript from a local database to return to the client via Apache and Ajax. The database will contain no more than a few thousand short JavaScript programs.
Question: any advice on which API I should use? I'm just reading through the docs on dev.mysql.com, and there doesn't seem to be any good reason to choose one or other of libmysql, Connector/C, Connector/C++, MySQL++, or Connector/ODBC. Thanks.
With no more than a few thousand rows, chances are you should pick your API according to your language preferences, not the other way round, so go ahead and choose whatever fits your mood.
If your app's performance stands or falls with the performance differences between the MySQL connectors, you should be quite busy fixing your design elsewhere.
I personally prefer portability, so I tend to use ODBC a lot, accepting the small performance hit, but others might think differently. If you never want to use a different RDBMS, stay away from ODBC; without the portability benefit it's quite ugly.
I would just use the raw C API. Seems to be the simplest way with the least overhead.

Any good threads related job-interview question?

When interviewing graduates I usually ask them questions about data structures, algorithms and complexity theory. I would really like to ask a question that will enable them to show their familiarity with multi-threaded concepts, without delving into language-specific issues.
Any good questions? The only question I could think of is how to write a Singleton that supports multi-threaded access.
I find the classic "write me a producer-consumer queue" question to be quite good. You can talk about synchronization in a handwavy way beforehand for five minutes or so (e.g. start with "What does Object.wait() do? What other methods on Object is it related to? Can you give me an example of when you might use these? What other concurrency techniques might you use in practice [because really, it's quite rare that actually using the wait/notify primitives is the best approach]?"). Make sure the candidate addresses (or at least makes clear he is aware of) both atomicity ("missed updates") and volatility (visibility of the new value on other threads).
Then after you've had a chat about the theory of these, get them to spend a few minutes actually writing the code for a primitive producer-consumer queue. This should be straightforward to anyone who actually understands what they were talking about above, yet it will weed out those who can "talk the talk" but don't actually understand it in practice (arguably the most dangerous group).
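For reference, a primitive producer-consumer queue of the kind a candidate might write could look roughly like this in Python. It is only a sketch: the answer above talks about Java's wait/notify, and `threading.Condition` is used here as the closest Python equivalent; `BoundedQueue` and `run_demo` are made-up names.

```python
import threading
from collections import deque


class BoundedQueue:
    """A primitive blocking producer-consumer queue with a fixed capacity."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._cond = threading.Condition()  # one lock plus a wait/notify channel

    def put(self, item):
        with self._cond:
            # Re-check the predicate in a loop: a woken thread may find the queue full again.
            while len(self._items) >= self._capacity:
                self._cond.wait()
            self._items.append(item)
            self._cond.notify_all()

    def get(self):
        with self._cond:
            while not self._items:
                self._cond.wait()
            item = self._items.popleft()
            self._cond.notify_all()
            return item


def run_demo(n_items=100):
    queue = BoundedQueue(capacity=4)
    received = []

    def producer():
        for i in range(n_items):
            queue.put(i)
        queue.put(None)  # sentinel: tells the consumer to stop

    def consumer():
        while True:
            item = queue.get()
            if item is None:
                break
            received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_demo(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Good follow-up questions are why the wait is inside a `while` loop rather than an `if`, and what changes with several producers or consumers.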
What I like about these mini-coding exercises is that they're often easy to extend. For instance, if the candidate completes the task easily, you can ask how they would extend it for situation XXX: invent requirements that you know will push the limits of the noddy solution you asked for. This not only lets you tailor the depth of the questions you're asking but also gives some insight into how well the candidate handles clarification of requirements and modification of an existing design (which is pretty important in this industry).
Here you can find some topics to discuss:
threads implementation (kernel vs user space)
thread local storage
synchronization primitives
deadlocks, livelocks
Differences between mutex and semaphore.
Use of condition variables.
When not to use threads. (e.g. IO multiplexing)
Talk with them about a popular topic whose internals are not widely known, where thread handling is essential.
I recommend building a web server with them, only on paper or just in words, of course. The result should look something like this: a main thread listens on a socket. When a connection arrives, it passes the socket into the pool and goes back to listening. The pool has a fixed number of slots, and the request-processing threads are dedicated to taking jobs from the pool. Discuss which is better: having the threads check the pool concurrently, or having the listener main thread select a free slot/thread for each new incoming request. Try to write a small piece of pseudocode, or draw a diagram, for both sides of the pool handling.
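One side of that design, the variant where worker threads pull jobs from a shared pool, can be sketched in Python like this. It is an illustration only: there is no real socket handling, the `serve` function and the "handled ..." reply format are made up, and plain strings stand in for incoming connections.

```python
import queue
import threading


def serve(requests, n_workers=4):
    """Hand each request to a fixed pool of worker threads and collect the replies."""
    pending = queue.Queue()  # thread-safe hand-off between listener and workers
    replies = []
    replies_lock = threading.Lock()

    def worker():
        while True:
            request = pending.get()
            if request is None:  # sentinel: no more work
                break
            reply = f"handled {request}"  # stand-in for real request processing
            with replies_lock:
                replies.append(reply)

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()

    # The "listener" role: accept a request, pass it to the pool, go back to accepting.
    for request in requests:
        pending.put(request)

    # One sentinel per worker shuts the pool down cleanly.
    for _ in workers:
        pending.put(None)
    for w in workers:
        w.join()
    return replies


if __name__ == "__main__":
    print(sorted(serve(["GET /a", "GET /b", "GET /c"], n_workers=2)))
```

The alternative, where the listener picks a specific free worker, needs extra bookkeeping about which workers are idle, which makes a good discussion point.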
Then introduce a small application: a page counter that reports how many page requests have been made since server startup. Don't tell them that the counter must be protected against concurrent modification; let them find that out for themselves, along with how to do it using mutexes, synchronization, or whatever. You could even skip the web server part, since the page counter app is easier to specify.
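A mutex-protected page counter might look like the following Python sketch. The `PageCounter` class and the `simulate` helper are hypothetical names for illustration; worker threads stand in for request handlers.

```python
import threading


class PageCounter:
    """Counts page requests since startup; safe to call from many threads."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def hit(self):
        # The read-modify-write below is not atomic, so it is guarded by the lock.
        with self._lock:
            self._count += 1

    @property
    def value(self):
        with self._lock:
            return self._count


def simulate(n_threads=8, hits_per_thread=10_000):
    counter = PageCounter()

    def worker():
        for _ in range(hits_per_thread):
            counter.hit()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(simulate())  # 80000
```

Removing the lock from `hit` is a useful follow-up: the candidate should be able to explain why updates can then be lost.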
Another example is a chat with 2+ clients and a server: work out how to ensure that all messages arrive in the same order for all clients. Or a reflex game: the server waits a random 1 to 5 seconds, then says "peek-a-boo", and the player who presses the space key first wins. Specify it with 2 players, then try to expand it to N players.
Also, be aware of NPPs. NPP stands for "non-programming programmer". There are dudes who can talk about programming issues and know all the 3- and 4-letter abbreviations (there are a lot in the Java world: EJB, JSP, XSLT, and my favourite, POJO, which means Plain Old Java Object, lol). They can understand and modify code, or make similar programs from an existing base, but they fail even at small problems if they have to solve them by themselves, e.g. finding the element nearest to a given value in an array. Sometimes it takes months until this becomes apparent. They perform well at interviews because they prepare for them. They may not even know that they're NPPs; this is a known effect: http://en.wikipedia.org/wiki/Dunning-Kruger_effect
It's hard to recognize the opposite kind of dudes, who have not heard of the trendy libraries or patterns but can learn them even during the job interview. (Personal remark: my last interview was in 1999, and it seems I will not do interviews anymore. I had never heard of dynamic web pages before, but I figured out the term "session" during the interview; the question was how to build a simple hangman web app. I was hired.)

Benefits of cross-platform development?

Are there benefits to developing an application on two or more different platforms? Does using a different compiler on even the same platform have benefits?
Yes, especially if you plan to distribute your code for multiple platforms.
But even if you don't, cross-platform development is a form of futureproofing: if it runs on multiple (diverse) platforms today, it's more likely to run on future platforms than something that was tuned, tweaked, and specialized to work on a version 7.8.3 clean install of vendor X's Q-series boxes (patch level 1452) and nothing else.
There seems to be a benefit in finding and simply preventing bugs with a different compiler and a different OS. Different CPUs can pin down endian issues early. There is the pain at the GUI level if you want to stay native at that level.
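As a small illustration of the kind of endian issue meant here, this Python snippet packs the same 32-bit integer in both byte orders and then misreads it. The value `0x12345678` is just an arbitrary example.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first

print(sys.byteorder)   # byte order of the machine running this script
print(little.hex())    # 78563412
print(big.hex())       # 12345678

# Reading bytes with the wrong byte order silently yields a different number.
misread = struct.unpack(">I", little)[0]
print(hex(misread))    # 0x78563412
```

Code that writes raw integers to files or sockets without fixing a byte order works by accident on one CPU family and breaks on another, which is exactly what testing on a second platform exposes.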
Short answer: Yes.
Short of cloning a disk, it is almost impossible to make two systems exactly alike, so you are going to end up running on "different platforms" whether you meant to or not. By specifically confronting and solving the "what if system A doesn't do things like B?" problem head on, you are much more likely to find the key assumptions your code makes.
That said, I would say you should get a good chunk of your base code working on system A, and then take a day (or a week or ...) and get it running on system B. It can be very educational.
My education came back in the 80's when I ported a source level C debugger to over 100 flavors of U*NX. Gack!
Are there benefits to developing an application on two or more different platforms?
If this is production software, the obvious reason is the lure of a larger client base. Your product's appeal is magnified the moment the client hears that you support multiple platforms. Remember, most enterprises do not use a single OS or even a single version of the OS. It is fairly typical to find one section using Windows, another using Mac, and a smaller one using some flavor of Linux.
It is also often the case that customizing a product for a single platform is far more tedious than having it run on multiple platforms. The law of diminishing returns kicks in before you know it.
Of course, all of this makes little sense, if you are doing customization work for an existing product for the client's proprietary hardware. But even then, keep an eye out for the entire range of hardware your client has in his repertoire -- you never know when he might ask for it.
Does using a different compiler on even the same platform have benefits?
Yes, again. Different compilers implement different extensions. See to it that you are not dependent on a particular version of a particular compiler.
Further, there may be a bug or two in the compiler itself. Using multiple compilers helps sort these out.
I have further seen bits of a (cross-platform) product built using two different compilers: one was used in those modules where floating point manipulation required a very high level of accuracy. (It's been a while since I've heard of anyone else doing that, but ...)
I've ported a large C++ program, originally Win32, to Linux. It wasn't very difficult. It mostly meant dealing with compiler incompatibilities, because the MS C++ compiler at the time was non-compliant in various ways. I expect that problem has mostly gone now (until C++0x features start gradually appearing). It also meant writing a simple platform abstraction library to centralize the platform-specific code in one place. How hard porting is depends on the extent to which you rely on OS services that would be hard to mimic on a new platform.
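The idea of centralizing platform-specific code in one place can be shown with a tiny Python sketch. The program in this answer was C++, so this only illustrates the pattern; `user_config_dir` is a made-up helper, and the directory conventions it encodes are the commonly used ones rather than anything from the original program.

```python
import os
import sys


def user_config_dir(app_name):
    """Return the per-user configuration directory for app_name on the current OS."""
    home = os.path.expanduser("~")
    if sys.platform.startswith("win"):
        base = os.environ.get("APPDATA", os.path.join(home, "AppData", "Roaming"))
    elif sys.platform == "darwin":
        base = os.path.join(home, "Library", "Application Support")
    else:
        # Linux and other Unix-likes: follow the XDG convention.
        base = os.environ.get("XDG_CONFIG_HOME", os.path.join(home, ".config"))
    return os.path.join(base, app_name)


if __name__ == "__main__":
    print(user_config_dir("demo"))
```

The rest of the application calls `user_config_dir` and never checks the platform itself, so porting to a new OS means touching one module.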
You don't have to build portability in from the ground up. That's why "porting" is often described as an activity you can perform in one shot after an initial release on your most important platform. You don't have to do it continuously from the very start. Purely for economic reasons, if you can avoid doing work that may never pay off, obviously you should. The cost of porting later on, when really necessary, turns out to be not that bad.
Mostly, there is an existing platform that the application is written for (individual software). But you address more developers (on both platforms) if you decide to choose a platform-independent language.
Also, products (standard software) for SMEs sell better if they run on different platforms! You gain access to both markets, Windows and Linux (and Mac OS X and so on...).
Big companies mostly buy hardware that is supported/certified by the product vendor, solely to deploy that specific product.
If you develop on multiple platforms at the same time you get the advantage of being able to use different tools. For example, I once had a memory overwrite (I still swear I didn't need the +1 for the null byte!) that caused "free" to crash. I brought the code up to speed on Windows and found the overwrite in about 1 minute with Rational Purify... it had taken me a week of chasing it under Linux (valgrind might have found it... but I didn't know about it at the time).
Using different compilers on the same or different platforms is, to me, a must, as each compiler will report different things, and sometimes the report from one compiler about an error will be gibberish while the other compiler makes it very clear.
Using things like multiple databases while developing means you are much less likely to tie yourself to a particular database, which means you can swap out the database if there is a reason to do so. If you want to integrate something that uses Oracle into an existing infrastructure that uses SQL Server, for example, it can really suck. It is much better if the Oracle or SQL Server pieces can be moved to the other system (I know of some places that have 3 different databases for their financial systems... ick).
In general, always developing for two or three things means that the odds of finding mistakes are better, and the odds of the system being more flexible are better.
On the other hand all of that can take time and effort that, at the immediate time, is seen as an unneeded expense.
Some platforms have really dreadful development tools. I once worked in an IB where, rather than use Sun's ghastly toolset, people developed code in VC++ and then ported it to Solaris.