Arel in SQLite and other databases

I am new to databases and Ruby on Rails applications.
I have a question about generating queries from ORM.
When my database is SQLite and I write code that generates queries for it, am I still able to use the same code if I change the database?
In addition, I am using Arel because it provides more ready-made methods for complex queries; before executing a query I call the method .to_sql.
If I want to use the same code with another database, am I still able to execute the query? Would I use something other than to_sql?

In general, Ruby on Rails code is portable between databases without doing anything more than adjusting your config/database.yml file (for connection details) and updating your Gemfile (to use the correct database adapter gem).
Database portability is most likely when you do not rely on specific, hardcoded SQL to invoke queries. Instead, use Rails' associations, relations, and query tools wherever possible. Specific SQL often creeps in through .where() clauses, so be thoughtful there and minimize/simplify those as much as practical (for instance, multiple simple scopes that can be chained may serve you better than larger, more complex single scopes). Also, use Arel's matches when you depend on "LIKE"-type clauses instead of hardcoding LIKE details in .where() calls, because different databases (such as PostgreSQL) handle case sensitivity differently.
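For example, a minimal sketch (assuming a User model with a name column):

    # Arel's matches renders LIKE on MySQL/SQLite but ILIKE on PostgreSQL,
    # so case-insensitive matching stays consistent across adapters.
    users = User.arel_table
    User.where(users[:name].matches("smith%"))

    # A hardcoded condition behaves differently per database: LIKE is
    # case-sensitive on PostgreSQL, usually case-insensitive on MySQL.
    User.where("name LIKE ?", "smith%")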
Your best defense against surprises upon switching databases is a robust set of automated unit tests (e.g., RSpec, Minitest) and integration tests (e.g., Capybara). These are especially important where you are unable to avoid database-specific SQL (say, for optimization or odd/complex queries).
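As an illustration, a spec along these lines (the model and the search_by_name scope are hypothetical) would fail fast after a move to PostgreSQL if case-insensitive matching had been hardcoded with LIKE:

    # spec/models/user_spec.rb
    require "rails_helper"

    RSpec.describe User, type: :model do
      it "matches names case-insensitively on every adapter" do
        User.create!(name: "Smith")
        # Passes on SQLite/MySQL even with a hardcoded LIKE, but fails on
        # PostgreSQL unless the scope uses Arel's matches (ILIKE).
        expect(User.search_by_name("smith")).not_to be_empty
      end
    end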
Since SQLite is simpler than more robust engines like MySQL or PostgreSQL, the operations you depend on are likely to be safe anyway. You're most vulnerable when you use an advanced or database-specific feature, but you're also generally more aware that you're doing so, and can write protective tests to help warn you upon switching database engines.

Related

Doctrine, SQLite and Enums

We have an application running on Symfony 2.8 with a package named "liip/functional-test-bundle". We plan on using PHPUnit to run functional tests on our application, which uses MySQL for its database.
The 'functional test bundle' package allows us to use the entities as a schema builder for an in-memory SQLite database, which is very handy because:
It requires zero configuration to run
It's extremely fast to run tests
Our tests can be run independently of each other and of the development data
Unfortunately, some of our entities use 'enums', which are not supported by SQLite, and our technical lead has opted to keep the existing enums whilst refraining from using them in anything new.
Ideally we need this in the project sooner rather than later, so the team can start writing new tests to help maintain the stability of the application.
I have 3 options at this point, but I need help choosing the correct one and performing it correctly:
Convince the technical lead that enums are a bad idea and that lookup tables could be used instead (which may cost time where the workload is already high)
Switch to using MySQL for the testing database. (This will require additional configuration for our tests to run, and may be slower)
Have Doctrine detect when enums are used on a SQLite driver, and swap them out for strings. (I would have no idea how to do this, but it is, in my opinion, the ideal solution.)
Which action is the best, and how should I carry it out?

SQLite3 database per customer

Scenario:
Building a commercial app consisting of a RESTful backend in Symfony2 and a frontend in AngularJS.
This app will never be used by many customers (if I get to sell 100, that would be fantastic; hopefully many more, but in any case it will never be massive).
I want a multi-tenant structure for the database, with one schema per customer (they store sensitive information about their own customers).
I'm aware of the problems with updating schemas, but I will have to live with that.
Today I have a MySQL demo database that I clone each time a new customer purchases the app.
There is no relationship between my customers, so I never need to query across multiple shards.
A customer may use the app from several devices at the same time, but there won't be massive write operations on the db.
My question
While setting up functional tests for the backend API, I read about using a dedicated SQLite database for loading test data, which seems to be a good idea.
However, I wonder whether it's also a good idea to switch from MySQL to SQLite3 as the main database for the application, and whether it's common practice to have one dedicated SQLite3 database PER CLIENT. I've never used SQLite, and I have no idea whether updating a schema and replicating the change across all the databases works the same way as in other RDBMSs.
Is this a correct scenario for SQLite?
Any suggestions (e.g. a tutorial) on how to achieve this?
[I wonder] whether it's common practice to have one dedicated SQLite3 database PER CLIENT
Only if the database is deployed along with the application, like on a phone. Otherwise I've never heard of such a thing.
I've never used SQLite, and I have no idea whether updating a schema and replicating the change across all the databases works the same way as in other RDBMSs
SQLite is a SQL database and responds to ALTER TABLE and the like. As for updating all the schemas, you'll have to re-run the update against each one.
Schema synching is usually handled by an outside utility; your ORM will often have something. Some are server-agnostic, some only support specific servers. There are also dedicated database change management tools such as Sqitch.
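If you do keep one SQLite file per tenant, the re-run can be a simple loop. A minimal sketch using the Ruby sqlite3 gem (any scripting language works; the db/tenants layout and the ALTER TABLE are hypothetical):

    # One-off migration runner: apply the same schema change to every
    # per-tenant SQLite file.
    require "sqlite3"

    MIGRATION = "ALTER TABLE customers ADD COLUMN phone TEXT"

    Dir.glob("db/tenants/*.db").each do |path|
      db = SQLite3::Database.new(path)
      begin
        db.execute(MIGRATION)
        puts "migrated #{path}"
      rescue SQLite3::SQLException => e
        # e.g. the column already exists from an earlier run
        warn "skipped #{path}: #{e.message}"
      ensure
        db.close
      end
    end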
However, I wonder whether it's also a good idea to switch from MySQL to SQLite3 as the main database for the application, and
SQLite's main advantage is that it doesn't require you to install and run a server. That makes sense for quick projects, or where you have to deploy the database along with the application, as on a phone. For a server-based application, there's no problem with running a database server, and SQLite's very restricted set of SQL features becomes a disadvantage. It will also likely run slower than a server database for anything but the simplest queries.
While setting up functional tests for the backend API, I read about using a dedicated SQLite database for loading test data, which seems to be a good idea.
Under no circumstances should you test with a different database than the one you use in production. Databases do not all implement SQL the same way, MySQL is particularly bad about this, and your tests will not reflect reality. Running a MySQL instance for testing is not much work.
This separate schema thing claims three advantages...
Extensibility (you can add fields whenever you like)
Security (a query cannot accidentally show data for the wrong tenant)
Parallel Scaling (you can potentially split each schema onto a different server)
What they're proposing is equivalent to having a separate, customized copy of the code for every tenant. You wouldn't do that; it's obviously a maintenance nightmare. Code at least has the advantage of version control systems with branching and merging. I know of only one database change management tool that supports branching: Sqitch.
Let's imagine you've made a custom change to tenant 5's schema. Now you have a general schema change you'd like to apply to all of them. What if the change to 5 conflicts with this? What if the change to 5 requires special data migration different from everybody else? Now let's imagine you've made custom changes to ten schemas. A hundred. A thousand? Nightmare.
Different schemas will require different queries. The application will have to know which schema each tenant is using, there will have to be some sort of schema version map you'll need to maintain. And every different possible query for every different possible schema will have to be maintained in the application code. Nightmare.
Yes, putting each tenant in a separate schema is more secure, but that only protects against writing bad queries or including a query builder (which is a bad idea anyway). There are better ways to mitigate the problem, such as the view filter suggested in the docs. And there are many other ways an attacker can access tenant data that this doesn't address: gaining a database connection, gaining access to the filesystem, sniffing network traffic. I don't see the small security gain being worth the maintenance nightmare.
As for scaling, the article is ten years out of date. There are far, far better ways to achieve parallel scaling than to coarsely put schemas on different servers. There are entire databases dedicated to this idea. Fortunately, you don't need any of this! Scaling won't be a problem for you until you have tens of thousands to millions of tenants. The idea of front-loading your design with a schema maintenance nightmare for a hypothetical big parallel scaling problem is putting the cart so far before the horse, it's already at the pub having a pint.
If you want to use a relational database, I would recommend PostgreSQL. It has a very rich SQL implementation, it's fast and scales well, and it has something that renders this whole idea of separate schemas moot: a built-in JSON type. This can be used to implement the "extensibility" mentioned in the article. Each table can have a meta column using the JSON type into which you can throw any extra data you like. The application does not need special queries; the meta column is always there. PostgreSQL's JSON operators make working with the metadata easy and efficient.
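A minimal sketch of the idea (table and column names are hypothetical; shown with the Ruby pg gem, but the SQL is the point):

    require "pg"
    conn = PG.connect(dbname: "app")

    # One shared table; per-tenant extras go in the jsonb meta column.
    conn.exec(<<~SQL)
      CREATE TABLE IF NOT EXISTS items (
        id     serial PRIMARY KEY,
        tenant integer NOT NULL,
        name   text NOT NULL,
        meta   jsonb NOT NULL DEFAULT '{}'
      )
    SQL

    conn.exec_params(
      "INSERT INTO items (tenant, name, meta) VALUES ($1, $2, $3)",
      [5, "widget", '{"color": "red", "sku": "W-100"}'])

    # PostgreSQL's ->> operator queries the extra fields directly.
    res = conn.exec_params(
      "SELECT name FROM items WHERE tenant = $1 AND meta->>'color' = $2",
      [5, "red"])
    res.each { |row| puts row["name"] }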
You could also look into a NoSQL database. There are plenty to choose from and many support custom schemas and parallel scaling. However, it's likely you will have to change your choice of framework to use one that supports NoSQL.

Performing a join across multiple heterogeneous databases e.g. PostgreSQL and MySQL

There's a project I'm working on, kind of a distributed database thing.
I started by creating the conceptual schema, and I've partitioned the tables such that I may need to perform joins between tables in MySQL and PostgreSQL.
I know I could write some sort of middleware that breaks down the SQL queries, issues sub-queries targeting the individual DBs, and then merges the results, but I'd like to do this using SQL if possible.
My search so far has yielded this (the Federated storage engine for MySQL), but it seems to work only between MySQL databases.
If it's possible, I'd appreciate some pointers on what to look at, preferably in Python.
Thanks.
It might take some time to set up, but PrestoDB is a viable open-source solution to consider.
See https://prestodb.io/
You connect to Presto with JDBC and send it the SQL; it interprets the different connections, dispatches the work to the different sources, then does the final work on the Presto node before returning the result.
From the Postgres side, you can try using a foreign data wrapper such as mysql_fdw (example). Queries with joins can then be run through various Postgres clients, such as psql, pgAdmin, psycopg2 (for Python), etc.
This is not possible with SQL.
Your options are to write your own "middleware", as you hinted at. To do that in Python, you would use the standard DB-API drivers for both databases, write individual queries, and then merge their results. An ORM like SQLAlchemy will go a long way to help with that.
The other option is to use an integration layer. There are many options out there, though none that I know of are written in Python. Mule ESB, Apache ServiceMix, WSO2, and JBoss MetaMatrix are some of the more popular ones.
You can colocate the data on a single RDBMS node (either PostgreSQL or MySQL for example).
Two main approaches:
Read-only - You might use read replicas of both source systems, then a process that copies the data to a new writable converged node; OR
Primary - You might choose one of the two databases as the primary and move the data from the other into it using a conversion process (e.g. ETL or off-the-shelf table-level replication)
Then you can just run the query on the one RDBMS with JOINs as usual.
BONUS: You can also read logs from any RDBMS that can ship them through Kafka. You can make this as complex as required.

Considering scaling on Rails, would you write hand-coded SQL? Use the Sequel gem?

Suppose I wanted to scale a Rails application by distributing its database across different machines according to its authorization rules (location and user roles), so that any resource attributed to a location sits in a database dedicated to that location.
Should I get down to writing basic SQL, use something like the Sequel gem, or keep the niceness and magic of ActiveRecord?
It is true that raw SQL executes faster than ActiveRecord's nice magical queries. However, when you talk about scaling, the real question is how manageable the queries will remain as the application grows large.
Most complicated database operations can be managed well with caching, proper indexing, and proper eager loading. In some cases MySQL views also help performance, and Rails treats MySQL views fairly well. After that, if you can corner the really slow queries, it may be worth converting them to raw SQL to save some time. Rails also offers caching of database queries, and MySQL has its own caching mechanism. Before executing raw SQL directly, I would make sure these options (and many more, like avoiding unnecessary joins, since a join is a resource-intensive operation) cannot give me what I am looking for.
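For instance, eager loading alone often removes the worst offenders before raw SQL is even on the table (a sketch with hypothetical Post/Comment models):

    # N+1: one query for the posts, then one more per post for its comments.
    Post.limit(20).each { |post| puts post.comments.size }

    # Eager loaded: two queries in total, however many posts there are;
    # size here counts the preloaded records without issuing another query.
    Post.includes(:comments).limit(20).each { |post| puts post.comments.size }

Hope this helps.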
It sounds like you are partitioning your database, which Sequel has built-in support for (http://sequel.rubyforge.org/rdoc/files/doc/sharding_rdoc.html). I recommend using Sequel, but as the lead developer, I'm biased.
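A minimal sketch of that sharding support (connection details are hypothetical):

    require "sequel"

    # One logical database, several physical servers; the URL supplies the
    # defaults and each shard entry overrides the connection options.
    DB = Sequel.connect("postgres://app@db-default/app",
      servers: {
        east: { host: "db-east" },
        west: { host: "db-west" }
      })

    # Route reads and writes to a specific shard with Dataset#server.
    admins = DB[:users].server(:east).where(role: "admin").all
    DB[:users].server(:west).insert(name: "new user", role: "member")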

We're using JDBC+XMLRPC+Tomcat+MySQL to execute potentially large MySQL queries. What is a better way?

I'm working on a Java-based project that has a client program which needs to connect to a MySQL database on a remote server. This was implemented as follows:
Use JDBC to write the SQL queries to be executed, which are then hosted as a servlet using Apache Tomcat and made accessible via XML-RPC. The client code uses XML-RPC to remotely execute these JDBC-based functions. This allows us to keep our MySQL database non-public, restricts use to the pre-defined functions, and lets Tomcat manage the database transactions (which I've been told is better than letting MySQL do it alone, but I really don't understand why). However, this approach requires a lot of boilerplate code, and Tomcat is a huge memory hog on our server.
I'm looking for a better way to do this. One way I'm considering is to make the MySQL database publicly accessible, re-write the JDBC-based code as stored procedures, and restrict public use to these procedures only. The problem I see with this is that translating all the JDBC code into stored procedures will be difficult and time-consuming. I'm also not too familiar with MySQL's permissions. Can one grant access to a stored procedure that performs SELECT statements on a table, but deny arbitrary SELECT statements on that same table?
Any other ideas are welcome, as are thoughts and/or suggestions on the stored-procedure solution.
Thank you!
You can probably get the RAM upgraded in your server for less than the cost of even a few days of development time, so don't write any code if that's all you'd gain from the exercise. Also, just because the memory is used inside of Tomcat doesn't mean that Tomcat itself is using it. The memory could be consumed by data or by technical flaws in your code.
If you've added RAM and it is still being eaten up, then that smells like a coding issue, so I'd suggest using a profiler or logging data to work out the root cause before changing anything. If the cause is large data sets, then using the database directly will only delay the inevitable; instead you'd need to look at things like paging, summarisation, client-side caching, or redesigning clients to reduce the use of expensive queries. A profiler, or simply reviewing the code base, will also tell you whether something is creating too many objects (especially strings or XML nodes) or leaking memory.
Boilerplate code can be avoided with creative refactoring, and it's good that you want to avoid repetition. It's unclear how much structure you already have, but with a little work it's easy to centralise boilerplate JDBC calls. There is no fundamental reason JDBC code should be repeated; perhaps you could tell us what code is being repeated?
Finally, I'll venture that there are many good reasons to put a web tier over your database. Flexibility (of deployment), compatibility, control (over the SQL) and security are all good reasons to keep the web tier.
MySQL 5.0.3+ does have an EXECUTE privilege that you can grant (without granting SELECT privileges), which should give you the functionality you seek.
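The grants themselves would look something like this (account, database, and procedure names are hypothetical; sketched here with the Ruby mysql2 client, though any client can issue the same statements):

    require "mysql2"
    admin = Mysql2::Client.new(host: "db-host", username: "root", password: "secret")

    # The application account may call the procedure...
    admin.query("GRANT EXECUTE ON PROCEDURE app.get_orders TO 'client'@'%'")

    # ...but with no table-level SELECT grant, arbitrary queries such as
    # SELECT * FROM app.orders are denied for 'client'@'%'.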
However, note this MySQL bug report with JDBC (and a lot of other drivers):
When calling the [procedure] with JDBC, I get "java.sql.SQLException: Driver requires declaration of procedure to either contain a '\nbegin' or '\n' to follow argument declaration, or SELECT privilege on mysql.proc to parse column types."
The workaround is:
See "noAccessToProcedureBodies" in /J 5.0.3 for a somewhat hackish, non-JDBC compliant workaround.
I am sure you could implement your solution without much boilerplate, especially using something like Spring's remoting support. Also, how much memory is Tomcat eating? I frankly believe that if it's just doing what you are describing, it could work in less than 128 MB (a conservative guess).
Your alternative is the "correct by the book" way of solving the problem. I say build a prototype and see how it works. The major problems you could have are:
MySQL having some important gotcha in this regard
MySQL's Stored Procedure support being too primitive and forcing you to do a lot of work
Some other strange hiccup
I'm probably one of those MySQL haters, so the situation might be better than I think.