Figuring out database problems - MySQL

To start, I should say that I am not an expert in this domain, so it is hard for me to describe all the nuances. I work on a Rails application that uses a MySQL database. Our DB has grown and we now have serious performance problems. The app has two features (for example, sync with mobile) that process a lot of data, and they cause our database to hang. We use New Relic for monitoring, and it confirmed that the problems come from those two parts of the app. My main question is: how do I profile my app to figure out which actions cause the biggest problems? Which tools can I use? Do you have any tips on what I can do or configure in the DB to improve performance? What should I do next (as a small step) to find out where the problem is? I know these questions are very general, but I am a junior in this domain and new to Rails. I expect more questions will come up after your answers ;)

First, what is the size of the DB, i.e. how many tables and the average number of rows per table? With the basic details available in your question, here are a few steps you can look into.
Regarding data sync with mobile: the first step should be to avoid heavy processing on the fly. Instead, use background jobs to process the data and store what actually needs to be sent to the mobile client in dedicated tables. That way you avoid running many queries per request; one or two queries should fetch the entire payload with minimal processing.
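To illustrate, here is a minimal sketch of that idea using ActiveJob; the model names (User, Order, SyncPayload) are made up and only stand in for whatever your sync actually touches:

```ruby
# Hypothetical background job: do the heavy lifting off the request path and
# store a ready-to-send payload, so the sync endpoint becomes a single read.
class BuildSyncPayloadJob < ActiveJob::Base
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    # Heavy processing happens here, not inside the web request.
    payload = user.orders.includes(:items).map do |order|
      { id: order.id, items: order.items.map(&:attributes) }
    end
    # Persist the precomputed result (SyncPayload is a placeholder model).
    SyncPayload.create!(user_id: user.id, body: payload.to_json)
  end
end

# Enqueue it whenever the relevant data changes:
# BuildSyncPayloadJob.perform_later(current_user.id)
```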
I assume you have already added indexes and that the ActiveRecord associations are properly set up. It would help if you could post some of your models and the basic relationships between them, and briefly explain how the sync is done and how the APIs are handled.
Use benchmarking to measure how long it takes to fetch the data, and keep an eye on the logs while the data is being processed. There are several good benchmarking tools available.
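For a first measurement you don't need anything fancy; a rough sketch with Ruby's built-in Benchmark module (the query inside the block is just a placeholder) already tells you which code path dominates. Tools like rack-mini-profiler or the bullet gem can then point at slow or N+1 queries.

```ruby
require 'benchmark'

# Wrap the suspect code path and log how long it takes.
elapsed = Benchmark.realtime do
  # Replace this placeholder with the real sync query, e.g.:
  #   User.includes(orders: :items).find(user_id)
  sleep 0.1
end
puts "sync fetch took #{(elapsed * 1000).round} ms"
```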

Related

MySQL(RDBMS) for order/shipping info and MongoDB(NoSQL) for product info. Can this work?

I am trying to build an integrated commerce backend, where multiple ecommerce channels can be connected. This service has two main functions:
1. It brings all the orders from all the connected commerces together.
2. A manager registers a new product with our service once. Then our service registers the product with all the connected commerces.
I am planning to use MySQL as my main database, but when it comes to product info, I'm not sure MySQL (or any other RDBMS) would be a smart choice, because each commerce will most likely require some new fields of its own. I don't want to add columns or create a new table every time we add a new commerce to our platform.
So I thought I could use MongoDB (NoSQL) for everything related to products and MySQL for everything related to orders. Would this be a good idea? I'm worried about query limitations and any other potential problems I might run into.
By the way I am using node.js for my backend.
Using more than one DB is not a bad idea at all; in fact it becomes a necessity as your application grows. Clearly, you would not want to use MySQL for products, since performing an ALTER TABLE at every commerce addition is not a good option. MongoDB is a good choice for that kind of data, but then querying becomes more difficult and slower in Mongo.
What I would suggest is not to worry too much about the database right from the start. Go with whatever you are more comfortable with, and then take further decisions based on how your app grows. Write your code in such a way that it is easy to replace the DB at any point, because it is very hard to know how the app will grow in the near future and which features will matter most.
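The question is about a Node.js backend, but the "make the DB easy to replace" advice is language-agnostic; here is a hedged Ruby sketch of the idea, with all class names invented for illustration:

```ruby
# The rest of the app only talks to ProductStore, so swapping the backend
# (MySQL today, MongoDB tomorrow) means replacing a single adapter class.
class ProductStore
  def initialize(backend)
    @backend = backend   # e.g. a MysqlProducts or MongoProducts adapter
  end

  def save(attributes)
    @backend.insert(attributes)
  end

  def find_by_channel(channel_id)
    @backend.where(channel_id: channel_id)
  end
end
```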

Database for counting page accesses

So let's say I have a site with approx. 40,000 articles.
What I'm hoping to do is record the number of page visits for each article over time.
Basically the end goal is to be able to visualize via graph the number of lookups for any article between any period of time.
Here's an example: https://books.google.com/ngrams
I've begun thinking about a MySQL data structure, but my gut tells me it's probably not the right tool for this task. It almost seems like I'd need some specific NoSQL analytics solution.
Could anyone advise which DB is the right fit for this job?
SQL is fine. It supports UPDATE statements that guarantee your count is correct rather than just eventual consistency.
That said, most people will just use a log file and process it on demand. Unless you are at Google scale, that will be fast enough.
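For concreteness, here is one hedged way to lay that out (shown as a Rails-style migration plus raw MySQL, though the question does not name a framework): one row per article per day keeps the table small and makes range queries for graphing trivial.

```ruby
# Hypothetical counter table: roughly 40,000 articles * 365 rows per year.
class CreateArticleViewCounts < ActiveRecord::Migration
  def change
    create_table :article_view_counts do |t|
      t.integer :article_id, null: false
      t.date    :day,        null: false
      t.integer :views,      null: false, default: 0
    end
    add_index :article_view_counts, [:article_id, :day], unique: true
  end
end

# On each page view, a single atomic MySQL upsert:
#   INSERT INTO article_view_counts (article_id, day, views)
#   VALUES (?, CURDATE(), 1)
#   ON DUPLICATE KEY UPDATE views = views + 1;
```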
There are many existing tools for this, often built on very efficient specialized data structures, such as round-robin databases (RRDs), that you won't find in a general-purpose database. Why don't you just use one of them?

Database and table structure for larger logging application in Rails

I am thinking of building a logging application. I was planning on making it in Ruby on Rails, since I have fiddled around with it a little and it seems like a good option.
But what I am now worried about is the database structure. As I understand it, Rails will create a table for every model.
So say I have a model like LoggingInstance that stores the time of the logging, sensorID, value, unit and some other interesting data every 10 seconds. After a while I will have a very large number of rows in this table, and as I add more sensors the row count will grow even faster.
I could make the logging entries more specific, e.g. TemperatureLoggingInstance, PressureLoggingInstance, etc., but that might lead to the same performance problems.
What I am wondering is whether there is a better way to store all the data. I was thinking it might be possible to save each sensor's logged values in a separate table, but how would I implement that in Rails? Or is there a better way of doing it?
I am afraid of getting bad performance from the database when I fetch the values of one sensor.
I was planning to use the RailsAPI gem and have one application that handles only the data, plus a front-end application that uses the API to visualize it.
The performance problem might not become an issue for years, but I want to structure the database so that it can hold a lot of data while still performing well.
All tips or references are appreciated :)
Since you want to store time series, I would suggest you take a look at InfluxDB.
There are Ruby libraries you can use:
https://github.com/influxdb/influxdb-ruby
https://github.com/influxdb/influxdb-rails
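A minimal write/read sketch with influxdb-ruby might look like the following; the measurement and tag names are made up, and the exact client API can differ between gem versions, so check the README linked above.

```ruby
require 'influxdb'

# Connect to a local InfluxDB and write one sensor reading as a point.
influxdb = InfluxDB::Client.new('sensor_logs', host: 'localhost')

influxdb.write_point('readings',
  values:    { value: 21.4 },
  tags:      { sensor_id: 'temp-01', unit: 'celsius' },
  timestamp: Time.now.to_i
)

# Reading a time range back out, e.g. for the front-end API:
# influxdb.query "SELECT mean(value) FROM readings " \
#                "WHERE sensor_id = 'temp-01' AND time > now() - 7d " \
#                "GROUP BY time(1h)"
```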

Planning for database scaling and schema changes [closed]

I'm doing research before I create my social network database, and I've found a lot of questions/resources pertaining to graph and key-value databases for social networks. I understand there are a TON of different options and ways to implement the DB. I also understand that what the big companies do is complex and way above what I currently need (1b+ users). I also know that each of the big companies has revamped its database to account for the insane scaling they go through.
I don't know how the network will grow, and I don't believe I can accurately create a model that will scale to 1m users (due to unknowns such as how people will use it, how often people post, comment, etc.). But I can at least try to create a database that will be easiest to scale when (if) the need arises.
Do most companies create a database to handle up to 1k users, then once they grow, they revamp it for 10k users, then 100k, etc? If they do, at each of these arbitrary numbers (because of the unknowns listed above), do companies typically change a few tables/nodes/etc, or do they completely recreate the database to take advantage of new technologies (such as moving from SQL to graph)?
I want to pick the best solution, but I'm finding the decision between graph, key-value, SQL, and others very difficult, especially with no data on which relationships and data are most important. I believe I can create a solid system using a graph database that supports up to 10k users, but I'm worried about potentially having to completely recreate the database as the system grows. Is this a "worry now to avoid issues later" problem, or an "implement now and adapt later" one?
Going further, if I do need to plan on complete DB restructures, does it typically make sense to use a Multi-Model NoSQL DBMS (such as OrientDB or ArangoDB)?
I personally think you are asking premature questions.
Seriously, even with a bad model, a database can handle 10k users.
You are thinking about scaling, but the hardest problem is not scaling; it is getting to the point where you need to scale.
I'm sure everybody wants 1bn users, but then you are already dreaming about a social network with 200 times more users than GitHub itself (GitHub has ~5 million users).
Also, even if you think it through ahead of time, you will definitely refactor again and again over the years, and you will end up with more than one persistence layer, be sure of it.
Code, and code well: stay lean, remain able to change quickly, deploy, show it to users, refactor, test, deploy and show it to users again in the same day. These are the things you need to do now, not asking questions about a problem you don't have yet; you definitely have plenty of other problems to solve right now ;-)
UPDATE
Based on your comment: keep in mind that there are questions we simply cannot answer, because we don't know your exact requirements.
I have a simple app that uses 4 persistence layers, and it is not yet online. I'll give you my "why" for using each one, and its use case:
Neo4j
It is the core of the application data. I use it because I love it and know it very well (it is my job), and since the concept of the app is quite new and can evolve rapidly, having a schemaless DB removes a lot of the refactoring work. Also, building the app keeps bringing up new use cases, which makes Neo4j a good choice when you need to add features without breaking what has already been done.
MySQL
I use it for user accounts and profiles. Why? Because the framework I use already has a lot of bundles integrating this kind of thing in a couple of lines of code, the bundles are well maintained, and if I used Neo4j for it (currently) I would have to reinvent the wheel. Also, all the modules I use keep evolving in stability and compatibility with the framework.
Of course the MySQL data is coupled (minimally) with the Neo4j data. But I know that this kind of data will not evolve much, so MySQL is a good choice, and if I have to refactor some parts it will not be a huge pain.
Redis
I use Redis for storing analytics data; Redis is quite flexible, and I can easily create new keys and add data on top of them.
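As a hedged illustration of that flexibility (the key names here are invented), counting a new kind of event in Redis needs no schema change at all:

```ruby
require 'date'
require 'redis'

redis = Redis.new

# Each event type/day pair is just another key; nothing to migrate.
redis.incr("analytics:signups:#{Date.today}")
redis.incr("analytics:posts_created:#{Date.today}")

# Reading the last week of signups back for a dashboard:
# (0..6).map { |i| redis.get("analytics:signups:#{Date.today - i}").to_i }
```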
RabbitMQ
I use a lot of message queues. Why? For testing refactorings: I can easily process messages with multiple consumers to test a refactoring, test multiple database layers while the app is running, test changes, test new features, and so on.
You will refactor ! Just try to keep it as simple as possible.

What would be the best DB cache to use for this application?

I am about 70% of the way through developing a web application which contains what is essentially a largeish datatable of around 50,000 rows.
The app itself is a filtering app providing various ways of filtering this table, such as range filtering by number, drag-and-drop filtering that ultimately performs regexp filtering, live text searching, and I could go on and on.
Because of this I coded my MySQL queries in a modular fashion, so that the actual query is put together dynamically depending on the type of filtering happening.
At the moment each filtering action takes between 250-350 ms in total on average. For example:
The user grabs one end of a visual slider and drags it inwards; when he/she lets go, a range-filtering query is dynamically put together by my PHP code and the results are returned as a JSON response. The total time from the user letting go of the slider until they have received all the data and the table is redrawn is between 250-350 ms on average.
I am concerned about scalability further down the line, as users can be expected to perform a huge number of these filtering actions in a short space of time in order to retrieve the data they are looking for.
I have toyed with doing some fancy cache-expiry work with memcached, but couldn't get it to play ball correctly with my dynamically generated queries. Although everything would cache correctly, I had trouble expiring the cache when the query changed and keeping the data relevant. I am, however, extremely inexperienced with memcached. My first few attempts have led me to believe that memcached isn't the right tool for this job (due to the highly dynamic nature of the queries), although this app could ultimately see very high concurrent usage.
So... My question really is, are there any caching mechanisms/layers that I can add to this sort of application that would reduce hits on the server? Bearing in mind the dynamic queries.
Or... If memcached is the best tool for the job, and I am missing a piece of the puzzle with my early attempts, can you provide some information or guidance on using memcached with an application of this sort?
Huge thanks to all who respond.
EDIT: I should mention that the database is MySQL. The site itself is running on Apache with an nginx proxy. But this question is related purely to speeding up and reducing the database hits, of which there are many.
I should also add that the quoted 250-350 ms round-trip time is fully remote, i.e. measured from a remote computer accessing the website. The time includes DNS lookup, data retrieval, etc.
If I understand your question correctly, you're essentially asking for a way to reduce the number of queries against the database, even though very few of the queries will be exactly the same.
You essentially have three choices:
Live with having a large number of queries against your database; optimise the database with appropriate indexes and normalise the data as far as you can. Make sure to avoid the usual performance pitfalls in your query building (lots of ORs in ON clauses or WHERE clauses, for instance). Provide views for mashup queries, etc.
Cache the generic queries in memcached or similar, that is, without some or all of the filters, and apply the filters in the application layer (see the sketch after this list).
Implement a search index server, like SOLR.
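As a rough illustration of the second option (in Ruby with the dalli gem for brevity, even though the question's stack is PHP; `fetch_all_rows_from_mysql` is a stand-in for your real query): cache the broad, unfiltered result once and let each slider movement filter it in memory.

```ruby
require 'dalli'
require 'json'

cache = Dalli::Client.new('localhost:11211')

# Fetch the unfiltered data set from memcached, falling back to MySQL once.
def base_rows(cache)
  cached = cache.get('datatable:all')
  return JSON.parse(cached) if cached

  rows = fetch_all_rows_from_mysql              # hypothetical helper
  cache.set('datatable:all', rows.to_json, 300) # expire after 5 minutes
  rows
end

# A range filter is then applied in the application layer, not in SQL:
filtered = base_rows(cache).select { |r| r['price'].between?(10, 50) }
```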
I would recommend you do the first, though. A round-trip time of 250~300 ms sounds a bit high even for complex queries, and it sounds like you have a lot to gain just by improving what you already have at this stage.
For much higher workloads, I'd suggest solution number 3; it will help you achieve what you are trying to do while being great at handling lots of different queries.
Use Memcache and set the key to be the filtering query or some unique key based on the filter. Ideally you would write your application to expire the key as new data is added.
You can only make good use of caches when you run the same query more than once.
A good way to work with memcached is to define a key that matches the function that populates it. For example, if the model named UserModel has a method getUser($userID), you could cache each user as USER_id. For more advanced functions (Model2::largerFunction($arg1, $arg2)) you can simply use MODEL2_arg1_arg2; this makes it easy to avoid namespace conflicts.
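A small hedged sketch of that convention (names invented; any memcached client would consume the resulting keys):

```ruby
# Build a cache key from the model, the method and its arguments so that
# different calls never collide in the shared memcached namespace.
def cache_key_for(model, method, *args)
  "#{model.to_s.upcase}_#{method}_#{args.join('_')}"
end

cache_key_for(:user, :get_user, 42)            # => "USER_get_user_42"
cache_key_for(:model2, :larger_function, 1, 2) # => "MODEL2_larger_function_1_2"
```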
For full-text searches, use a search indexer such as Sphinx or Apache Lucene. They improve your queries a LOT (I was able to do a full-text search on a 10-million-record table on a 1.6 GHz Atom processor in less than 500 ms).