Currently I am working on a database that requires me to take raw data from a third party and store it. The problem is that the raw data is obviously not optimized, and the people I'm building the database for don't want any data entry involved when uploading the raw data; they pretty much just want to upload the data and be done with it. Some of the raw data files have empty cells all over the place and many instances of duplicate names/numbers/entries. Is there a way to still optimize the data quickly and efficiently without too much data entry or rework each time data is uploaded, or is this an instance where optimization is impossible due to constraints? Does this happen a lot, or do I need to tell them their dream of just uploading the data is not realistic for long-term success?
There are many ways to optimize data, and an approach that works well in one use case may be horrible in another. There are tools that will flag columns with values that need attention, but there is no single piece of advice that works in all cases.
Without specific details, the following general advice applies:
Regarding empty entries: those should not be an issue.
Regarding duplicate data: it may be worth moving the repeated values into their own table and referencing them through a one-to-many relationship.
One thing to make sure of is to put an index on any field you are going to search on; this will speed up your queries considerably, no matter the dataset.
As far as changing the database schema goes: rare are the schemas that do not change over time.
My advice is to think your schema through, but do not try to over-optimize, because you cannot plan in advance what the exact usage will be. As long as it is working and there is no bottleneck, focus on other areas. If there is a bottleneck, then by all means rewrite the affected part, making sure indexes are present (consider composite indexes in some cases; see the example below). Consider avoiding UNIONs when possible, and remember the KISS principle (Keep It Simple and Sweet).
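As a concrete illustration of the indexing advice above (the table and column names here are hypothetical, not from the question):

    -- Index the columns your queries filter on.
    CREATE INDEX idx_orders_customer ON orders (customer_id);

    -- A composite index covers queries that filter on customer_id alone
    -- as well as on customer_id plus order_date (leftmost-prefix rule).
    CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);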
These days I've been facing performance issues when mapping database data to Java objects, especially when reading data from the database into Java code when a lot of FK-PK relationships are involved. I identified the issue and solved the slowdown by creating database views and creating POJOs mapped to those views.
I did some research online but couldn't find a good answer to this: how does the database (I am using MySQL) keep querying fast through views?
For example, if I create a view over 10 tables joined through FK-PK relationships, the view is still pretty fast to query and displays the result quickly. What exactly happens behind the scenes in the database engine?
Indexes.
MySQL implicitly creates a foreign key index (i.e. an index on columns that compose the foreign key), unless one already exists. Not all database engines do so.
A view is little more than an aliased query. As such, any view, however trivial it may seem, can kill the server if written poorly. Execution time is not proportional to the number of joined tables, but to the quality of the indexes*.
Side effect: the implicitly created index might not be the most efficient one.
* Table sizes also start to matter when the tables grow large, as in millions of records.
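A minimal sketch of both points (the table and view names are made up for illustration): InnoDB creates an index on the foreign key column below if no usable one exists, and because a view is just a stored query, EXPLAIN on the view shows the same plan as querying the underlying tables directly.

    CREATE TABLE parent (
        id INT UNSIGNED NOT NULL PRIMARY KEY
    ) ENGINE = InnoDB;

    CREATE TABLE child (
        id        INT UNSIGNED NOT NULL PRIMARY KEY,
        parent_id INT UNSIGNED NOT NULL,
        FOREIGN KEY (parent_id) REFERENCES parent (id)  -- index on parent_id created implicitly
    ) ENGINE = InnoDB;

    CREATE VIEW v_child_parent AS
    SELECT c.id AS child_id, c.parent_id
    FROM child AS c
    JOIN parent AS p ON p.id = c.parent_id;

    -- With the default MERGE algorithm, this produces the same execution plan
    -- as running the underlying join directly:
    EXPLAIN SELECT * FROM v_child_parent WHERE parent_id = 42;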
I am starting to create the first web application of my career, using MySQL.
I am going to make a table which contains users' information (like id, firstname, lastname, email, password, phone number).
Which of the following is better?
Put all data into one single table (userinfo).
Divide the data alphabetically and put it into many tables. For example, if a user's email is joe@gmail.com, put it into table userinfo_j, and if a user's email is kevin@gmail.com, put it into table userinfo_k.
I don't want to sound condescending, but I think you should spend some time reading up on database design before tackling this project, especially the concept of normalization, which provides consistent and proven rules for how to store information in a relational database.
In general, my recommendation is to build your database to be easy to maintain and understand first and foremost. On modern hardware, a reasonably well-designed database with indexes running relational queries can support millions of records, often tens or hundreds of millions of records without performance problems.
If your database has a performance problem, tune the query first, add indexes second, and buy better hardware third; if that doesn't work, you may consider a design that makes the application harder to maintain (often called denormalization).
Your second solution will almost certainly be slower in most cases.
Relational databases are really, really fast when searching by indexed fields; searching for email = 'joe@gmail.com' on a reasonably designed database will be too fast to measure, even with tens of millions of records.
However, the extra logic needed to find the right table to search in will almost certainly make the split-table approach slower than searching a single, properly indexed table.
Especially if you want to search by things other than email address - imagine finding all the users who signed up in the last week. Or who have permission to do a certain thing in your application. Or who have a @gmail.com account.
So, the second solution is bad from a design/maintenance point of view, and will almost certainly be slower.
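A minimal sketch of the single-table approach (option 1), with a unique index on email so that lookups use the index instead of scanning the table; the columns are just the ones listed in the question plus an id:

    CREATE TABLE userinfo (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
        firstname    VARCHAR(100) NOT NULL,
        lastname     VARCHAR(100) NOT NULL,
        email        VARCHAR(255) NOT NULL,
        password     VARCHAR(255) NOT NULL,  -- store a password hash, never the plain text
        phone_number VARCHAR(30),
        PRIMARY KEY (id),
        UNIQUE KEY uq_email (email)
    ) ENGINE = InnoDB;

    -- Equality lookups on the indexed column stay fast even with millions of rows:
    SELECT id, firstname, lastname FROM userinfo WHERE email = 'joe@gmail.com';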
The first one is better. With the second, you would have to write extra logic just to figure out which table to look in. To speed up searches you can add indexes. Since you will presumably do equality lookups far more often than less-than or greater-than comparisons, you could try a hash index; for comparison (range) operations, B-tree indexes are better (see the note below).
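For what it's worth, a hash index can only be requested explicitly for certain storage engines; a hypothetical example (InnoDB and MyISAM silently build a B-tree instead):

    -- USING HASH is honored by the MEMORY and NDB engines; hash indexes handle
    -- equality lookups only, not range comparisons such as <, > or BETWEEN.
    CREATE TABLE userinfo_cache (
        id    INT UNSIGNED NOT NULL,
        email VARCHAR(255) NOT NULL,
        PRIMARY KEY (id),
        INDEX idx_email (email) USING HASH
    ) ENGINE = MEMORY;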
Like others said, the first one is better, especially if you need to add other tables to your database and link them to the users table; the second approach quickly becomes impossible to work with and to build relationships on as the number of tables grows.
Okay, so I have my user table ready, with columns for all the technical information such as username, profile picture, password and so on. Now I'm at the point where I need to add superficial profile information, such as location, age, self-description, website, Facebook account, Twitter account, interests, etc. In total, I calculated this would amount to 12 new columns, and since my user table already has 18 columns, I find myself at a crossroads. Other questions I read about this didn't really give a bottom-line answer about which method is the most efficient.
I need to find out whether there is a more efficient way, and what the most efficient way to store this kind of information is. The base assumption is that my website will eventually have millions of users, so I need an option that can scale.
I have so far narrowed it down to two different options:
Option 1: Store the superficial data in the users table, taking its total column count up to 30.
Or
Option 2: Store the superficial data in a separate table, connected to the users table.
Which of these has better ability to scale? Which is more efficient? Is there a third option that is better than these two?
As a bonus question, if anyone has information about this: how do the biggest sites on the internet handle it? Thanks to anyone who answers; it is hugely appreciated.
My current database is MySQL, accessed through the mysql2 gem in Rails 4.
In your case, I would go with the second option. I suppose this would be more efficient because you would retrieve data from table 1 whenever the user logs in and only touch table 2 (the superficial data) when the user changes their preferences; you would not have to retrieve all the data every time you want to do something. Bottom line, I would suggest modelling your data according to your usage scenarios (use cases), creating data entities (e.g. tables) that match your use-case entities, and then taking the database normalization principles into account. A rough sketch of this layout follows.
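A minimal sketch of that two-table split, assuming a one-to-one relationship (the column names are illustrative, not prescribed):

    CREATE TABLE users (
        id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
        username VARCHAR(50)  NOT NULL,
        email    VARCHAR(255) NOT NULL,
        password VARCHAR(255) NOT NULL,   -- a password hash
        PRIMARY KEY (id),
        UNIQUE KEY uq_username (username)
    ) ENGINE = InnoDB;

    CREATE TABLE user_profiles (
        user_id     INT UNSIGNED NOT NULL,
        location    VARCHAR(100),
        age         TINYINT UNSIGNED,
        website     VARCHAR(255),
        twitter     VARCHAR(100),
        description TEXT,
        PRIMARY KEY (user_id),
        FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE CASCADE
    ) ENGINE = InnoDB;

    -- Login touches only users; the profile page joins in the extra columns:
    SELECT u.username, p.location, p.website
    FROM users AS u
    LEFT JOIN user_profiles AS p ON p.user_id = u.id
    WHERE u.id = 123;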
If you are interested in how these issues are handled by the biggest sites in the world, you should know that they do not rely on relational (SQL) databases alone. They actually use NoSQL databases, which run in a distributed fashion. This is a much more complicated scenario than yours. If you want to see related tools, you could start by reading about Cassandra and Hadoop.
Hope I helped!
If you will need to access these 30 columns of information frequently, you could put all of them into the same table. That's what some widely used CMSes do, because even though a row is big, it's faster to retrieve one big row than plenty of small rows from various tables (more SQL requests, more searches, more indexes, ...).
Also a good read for your problem is Database normalization.
I've always just used MyISAM for all of my projects, but I am looking for a seasoned opinion before I start this next one.
I'm about to start a project that will be dealing with hundreds of thousands of rows across many tables. (Several tables may even have millions of rows as the years go on). The project will primarily need fast-read access because it is a Web App, but fast-write obviously doesn't hurt. It needs to be very scalable.
The project will also be dealing with sensitive and important information, meaning it needs to be reliable. MySQL seems to be notorious for ignoring validation.
The project is going to use CakePHP as a framework, and I'm fairly sure it supports MySQL and PostgreSQL equally well, but if anyone disagrees with me on that, please let me know.
I was tempted to go with InnoDB, but I've heard it has terrible performance. PostgreSQL seems to be the most reliable, but is also said not to be as fast as MyISAM.
If I were able to upgrade the server's version of MySQL to 5.5, would InnoDB be a safer bet than Postgres? Or is MyISAM still a better fit for most needs and more scalable than the others?
The only answer that this really needs is "not MyISAM". Not if you care about your data. After all, /dev/null has truly amazing performance, but it doesn't meet your reliability requirement either ;-)
The rest is the usual MySQL vs PostgreSQL debate that we close every time someone asks a new flavour of it, because it really doesn't lead to much that's useful.
What's way more important than your DB choice is how you use it:
Do you cache commonly hit data that can afford to be a little stale in something like Redis or Memcached?
Do you avoid "n+1" selects from inefficient ORMs in favour of somewhat sane joins?
Do you avoid selecting lots of data you don't need?
Do you do selective cache invalidation (I use LISTEN and NOTIFY for this), or just flush the whole cache when something changes?
Do you minimize pagination, and when you must paginate, do so based on the last-seen ID rather than an offset? SELECT ... FROM ... WHERE id > ? ORDER BY id LIMIT 100 can be immensely faster than SELECT ... FROM ... ORDER BY id OFFSET ? LIMIT 100 (a short sketch follows this list).
Do you monitor query performance and hand-tune problem queries, create appropriate indexes, etc?
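As a sketch of that keyset ("last-seen ID") pagination pattern, using a hypothetical posts table:

    -- Offset pagination reads and discards every skipped row:
    SELECT id, title FROM posts ORDER BY id LIMIT 100 OFFSET 100000;

    -- Keyset pagination seeks straight to the last id the client saw,
    -- so the primary-key index does the work regardless of page depth:
    SELECT id, title FROM posts WHERE id > 100000 ORDER BY id LIMIT 100;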
(Marked community wiki because I close-voted this question and it seems inappropriate to close-vote and answer unless it's CW).
I currently have a database with a lot of many-to-many associations. I have services, which have many variations, which have many staff who can perform the variation, who in turn have details about themselves like name, role, etc.
At 10 services with 3 variations each, and up to 4 out of 20 staff attached to each service, even something as simple as getting all variations and the staff associated with them takes 4 s.
Is there a way I can reduce these queries that take a while to process? I've cut down the number of queries by doing eager loading in my ORM to reduce the problems that arise from N+1 issues, but 4 s is still a long time for a query at the testing stage.
Is there a structure out there that would make such nested many-to-many associations much quicker to select from?
Maybe combining everything past the service level into a single table with a 'TYPE' column? I'm just not knowledgeable enough to know the solution that turns this 4 s query into a 300 ms query... Any suggestions would be helpful.
It may be possible to restructure the data to make queries more efficient. This usually implies a trade-off with redundancy (repeated values), which can overly complicate the insert/update/delete logic.
Without seeing the schema and the query (or queries) you are running, it's impossible to diagnose the problem.
I think the most likely explanation is that MySQL does not have suitable indexes available to efficiently satisfy the query (or queries) being run. Running EXPLAIN on a query can be useful to show the access path and give insight into whether suitable indexes are available, whether indexes are even being considered, whether statistics are up to date, etc.
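For example, against hypothetical services / variations / staff tables like those described in the question, the check might look like this; rows showing type = ALL (a full table scan) or NULL under possible_keys point at join columns that need indexes:

    EXPLAIN
    SELECT s.id, v.id, st.name
    FROM services AS s
    JOIN variations AS v        ON v.service_id    = s.id
    JOIN staff_variations AS sv ON sv.variation_id = v.id
    JOIN staff AS st            ON st.id           = sv.staff_id;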
But you also mention N+1 performance issues and eager loading, which leads me to believe that you might be using an ORM (ADO.NET Entity Framework, Hibernate, etc.). These are notorious sources of performance problems, either issuing lots of SQL statements (the N+1 pattern) or issuing a single query that joins down several distinct paths and produces a humongous result set, where the query is essentially doing a semi cross join.
To really diagnose the performance issue, you need to see the actual SQL statements being issued; in a development environment, enabling the MySQL general query log will capture the SQL being issued along with rudimentary timing.
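A minimal way to do that (this uses the stock general_log / log_output server variables, switched on only for the duration of the test):

    -- Send the general query log to a table so it can be queried directly:
    SET GLOBAL log_output  = 'TABLE';
    SET GLOBAL general_log = 'ON';

    -- ... exercise the application, then inspect what the ORM actually sent:
    SELECT event_time, argument
    FROM mysql.general_log
    ORDER BY event_time DESC
    LIMIT 50;

    SET GLOBAL general_log = 'OFF';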
The table schemas would be nice to see for this question. As far as MySQL performance in general: make sure you research disk alignment and set the proper block sizes, and for this particular issue, check your execution plans and evaluate adding indexes.