I am working on an application which will have approximately 1 million records. To fetch data I have two options:
Join 10-12 tables (indexed) at run time and get the result
Create views from these tables and query the view at run time instead of joining the tables. I think MySQL won't let me use indexes on views (see the sketch below).
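For illustration, a minimal sketch of option 2 with hypothetical table and column names; a view has no indexes of its own, but a MERGE-algorithm view is rewritten against its base tables, so their indexes can still apply:

    -- Hypothetical example of option 2: a view over the joined tables.
    -- The view itself cannot be indexed; with ALGORITHM = MERGE the
    -- query is rewritten against the base tables, so their indexes apply.
    CREATE ALGORITHM = MERGE VIEW order_summary AS
    SELECT o.id, o.created_at, c.name AS customer_name, p.title AS product_title
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    JOIN products  p ON p.id = o.product_id;

    -- Querying the view is then equivalent to running the join directly:
    SELECT * FROM order_summary WHERE created_at >= '2024-01-01';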
Currently, for testing, I have just 10-20 records and both options take similar time. But when the real data is loaded, which option will give better performance?
I'm currently working on a query that gets data from two tables on two independent MySQL servers. Table A keeps game user data such as player GUID and game server. Table B stores logs such as the game-currency consumption log, keyed by player GUID. The number of log rows in table B is around 6 million.
What I have done so far is SELECT and CONCAT user IDs from table A into a very long string and pass it to a WHERE clause to query logs from table B. But in some special cases the number of target IDs runs into the thousands, which makes the query very slow because I'm not able to perform an INNER JOIN to filter the rows I need. Worse, sometimes the concatenated ID string is so long that it crashes my stored procedure.
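A rough sketch of that approach, with hypothetical table and column names; the ID string is built on server A, carried over by the application, and spliced into a prepared statement on server B:

    -- On server A (hypothetical names): build a comma-separated list of
    -- player GUIDs. GROUP_CONCAT is capped by group_concat_max_len, which
    -- is one way the string can get truncated or break the procedure.
    SELECT GROUP_CONCAT(player_guid) INTO @id_list
    FROM players
    WHERE game_server = 'eu-01';

    -- On server B: the application passes @id_list in, and the procedure
    -- splices it into an IN (...) clause via dynamic SQL.
    SET @sql = CONCAT(
      'SELECT * FROM currency_log WHERE player_guid IN (', @id_list, ')');
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;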
I've googled this problem for days, but the only solution I find is to use federated tables. However, I've read that they may be slow with large amounts of data and that they carry security risks, so I'm not really sure whether I should use them.
Is there any other way to efficiently query data from two independent MySQL servers? Thanks for your help.
I am working with a Hive table and executing a SQL query to fetch some records out of 230 million, but it takes 300 seconds to execute with the MapReduce process, while MySQL fetches the same information in less than 1 second. Why does Hive take more time?
I am using an Ambari cluster with the Tez engine. I am now unsure about moving the database to Hadoop.
There are numerous reasons why MySQL might perform better than Hive on a particular query. In that sense, your question is too broad.
The most likely cause is indexes in MySQL. If you have a lot of data, MySQL can use indexes to optimize queries and narrow down the data being processed, whereas Hive reads all the data and processes it.
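For example (hypothetical table and column names), an index lets MySQL touch only the matching rows, whereas Hive typically scans the full dataset:

    -- Hypothetical: with this composite index, MySQL can answer the query
    -- with an index range scan instead of reading all 230 million rows.
    CREATE INDEX idx_events_user_date ON events (user_id, event_date);

    SELECT COUNT(*)
    FROM events
    WHERE user_id = 42
      AND event_date >= '2024-01-01';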
There are other causes as well. If the data is stored in partitions, perhaps MySQL does a better job of partition pruning based on the WHERE clause.
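A sketch of partition pruning on the MySQL side, again with invented names; with range partitioning on the filtered column, the server skips partitions that cannot match:

    -- Hypothetical range-partitioned table: a WHERE clause on event_year
    -- means only the matching partition is scanned (check with EXPLAIN).
    CREATE TABLE events_by_year (
      id INT NOT NULL,
      event_year SMALLINT NOT NULL,
      payload VARCHAR(255)
    )
    PARTITION BY RANGE (event_year) (
      PARTITION p2022 VALUES LESS THAN (2023),
      PARTITION p2023 VALUES LESS THAN (2024),
      PARTITION pmax  VALUES LESS THAN MAXVALUE
    );

    SELECT COUNT(*) FROM events_by_year WHERE event_year = 2023;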
Without knowing the data and the query, it is hard to generalize; for a single query, this result is not too surprising. In general, Hive is going to be faster on queries that need to process large amounts of data.
I have a complex task here.
A client has 7 servers, all in different locations.
Each one has a database with a table of user logs.
All the server tables are the same! Only the records in them are different.
However, each user may have log records on several servers.
All 7 databases have about 10 million records.
I need to run queries using SUM, COUNT, BETWEEN dates, etc.
What will be the correct approach?
I know about federated tables, and they would probably do the job, but performance will be an issue.
I am also thinking of storing all the database records in Redis and running the queries there, which I have never done before.
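To make the federated option concrete, a minimal sketch with invented names; the FEDERATED engine (it must be enabled on the local server) exposes each remote log table as a local table, and a UNION ALL view over them can then be queried with SUM, COUNT, and date ranges:

    -- Hypothetical: one local FEDERATED table per remote server, each
    -- pointing at that server's user_logs table.
    CREATE TABLE user_logs_srv1 (
      user_id   INT NOT NULL,
      logged_at DATETIME NOT NULL,
      amount    DECIMAL(10,2) NOT NULL
    ) ENGINE=FEDERATED
      CONNECTION='mysql://report_user:secret@server1.example.com:3306/logs/user_logs';

    -- A reporting view stitching the per-server tables together.
    CREATE VIEW all_user_logs AS
      SELECT * FROM user_logs_srv1
      UNION ALL
      SELECT * FROM user_logs_srv2;   -- ...repeat for the remaining servers

    SELECT user_id, COUNT(*) AS events, SUM(amount) AS total_spent
    FROM all_user_logs
    WHERE logged_at BETWEEN '2024-01-01' AND '2024-06-30'
    GROUP BY user_id;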
I have a MySQL database with a lot of records (about 4,000,000,000 rows) that I want to process in order to reduce them to about 1,000,000,000 rows.
Assume I have the following tables:
table RawData: more than 5000 rows per second arrive that I want to insert into RawData.
table ProcessedData: processed (aggregated) storage for the rows that were inserted into RawData.
minimum row count > 20,000,000
table ProcessedDataDetail: details of the ProcessedData rows (the data that was aggregated).
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I use a lot of indexes; assume my data length is 1 GB, but my index length is 4 GB :) (I want to get rid of these indexes, they slow down my process).
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData, call it ProcessedDataShadow, then process RawData, aggregate it against ProcessedDataShadow, and finally insert the result into ProcessedDataShadow and ProcessedData. What do you think?
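A rough sketch of the aggregation step, with invented column names and assuming a unique key on (sensor_id, bucket_start) in ProcessedData; new RawData rows are rolled up and merged in one pass:

    -- Hypothetical: aggregate unprocessed RawData into hourly buckets and
    -- merge into ProcessedData (or into a ProcessedDataShadow staging table).
    INSERT INTO ProcessedData (sensor_id, bucket_start, row_count, value_sum)
    SELECT sensor_id,
           DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS bucket_start,
           COUNT(*)   AS row_count,
           SUM(value) AS value_sum
    FROM RawData
    WHERE processed = 0
    GROUP BY sensor_id, bucket_start
    ON DUPLICATE KEY UPDATE
      row_count = row_count + VALUES(row_count),
      value_sum = value_sum + VALUES(value_sum);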
(I am developing the project in C++.)
Thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB uses row-level locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but row-level locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but having many indexes makes inserts very slow. It is possible to disable index maintenance while inserting batches of data, to avoid updating the indexes on every insert.
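For example (hypothetical table and columns): with MyISAM, non-unique index maintenance can be postponed around a bulk load, and with InnoDB the usual levers are multi-row inserts, transactions, and, where safe, relaxed checks:

    -- MyISAM only: postpone maintenance of non-unique indexes until the
    -- bulk insert has finished.
    ALTER TABLE RawData DISABLE KEYS;
    INSERT INTO RawData (sensor_id, created_at, value)
    VALUES (1, NOW(), 10.5), (2, NOW(), 7.3), (3, NOW(), 0.0);
    ALTER TABLE RawData ENABLE KEYS;

    -- InnoDB: batch inserts inside a transaction and, only if safe for
    -- your data, relax per-row checks for the duration of the load.
    SET unique_checks = 0;
    SET foreign_key_checks = 0;
    START TRANSACTION;
    -- ... batched multi-row INSERTs ...
    COMMIT;
    SET unique_checks = 1;
    SET foreign_key_checks = 1;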
If you will be selecting huge amounts of data in a way that might disturb data collection, consider using a replicated slave database server that you use only for reading. Even if those reads lock rows/tables, the primary (master) database won't be affected, and the slave will catch back up as soon as it is free to do so.
Do you need to process the data in the database? If possible, maybe collect all the data in the application and insert only ProcessedData.
You've not said what the structure of the data is, how it's consolidated, how promptly the data needs to be available to users, nor how lumpy the consolidation process can be.
However, the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible, I'd recommend writing a consolidating buffer (using an in-memory hash table, not in the DBMS) to put the consolidated data into, even if it's only partially consolidated, and then updating the ProcessedData table from that buffer rather than trying to populate it directly from RawData.
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL FEDERATED engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (Hint: this script is very useful for that.)
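One standard way to do that analysis (not necessarily the script referred to above) is EXPLAIN, which shows whether a query uses an index at all; a minimal example using the question's table names and invented columns:

    -- The key column shows which index (if any) each table uses; key = NULL
    -- together with a large rows estimate usually means an index is missing.
    EXPLAIN
    SELECT p.*, d.detail
    FROM ProcessedData p
    JOIN ProcessedDataDetail d ON d.processed_id = p.id
    WHERE p.bucket_start >= '2024-01-01';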
I ran an OPTIMIZE TABLE query on one table and then didn't perform any operations on that table. I'm still running OPTIMIZE TABLE at the end of every month, but the data in the table may only change once every four or eight months. Does this create any problem for the performance of MySQL queries?
If you don't do DML operations on the table, OPTIMIZE TABLE is useless.
OPTIMIZE TABLE cleans the table of deleted records, sorts the index pages (brings the physical order of the pages into line with the logical one), and recalculates the statistics.
For the duration of the command, the table is unavailable for both reading and writing, and the command may take a long time on large tables.
Did you read the manual on OPTIMIZE? Do you have a problem you want to solve with OPTIMIZE? If not, don't use this statement at all.
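For reference, the statement itself; on InnoDB, OPTIMIZE TABLE is mapped to a table rebuild plus an analyze, and the output's Msg_text column says so:

    -- Reclaims space from deleted rows and refreshes index statistics;
    -- the table is rebuilt, so expect it to take a while on large tables.
    OPTIMIZE TABLE my_table;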
If the data hardly changes over a period of 4-8 months, it should not create any performance issue for the end-of-month report.
However, if the number of rows changed in that 4-8 month period is huge, then you would want to rebuild the indexes / analyze the tables so that queries run well after the load.
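If only the optimizer statistics need refreshing after a big load, ANALYZE TABLE is the lighter option (hypothetical table name):

    -- Recomputes index statistics so the optimizer chooses good plans
    -- after a large batch of changes; much cheaper than a full OPTIMIZE.
    ANALYZE TABLE monthly_report_data;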