I have a large MySQL table (approx. 2 million rows). I want to run searches on it that may match up to 25k rows (returned results will be paginated, e.g. 25 per page). What I wanted to do was rank these results on certain criteria and use that ranking to order them.
The solution I have so far is to create a script that goes through the table and assigns a score to each row based on my criteria. Each result would be given points depending on how it compares with my ideal result. I could then order by that score when executing a select, instead of calculating it on the fly.
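Roughly, the idea looks like this (the table, column, and criteria names below are just placeholders for illustration):

ALTER TABLE items ADD COLUMN score INT NOT NULL DEFAULT 0;

-- batch job: award points per criterion (MySQL evaluates a true
-- comparison as 1 and a false one as 0)
UPDATE items
SET score = (category = 'preferred') * 10
          + (price < 100) * 5;

-- at search time, just order by the precomputed score
SELECT *
FROM items
WHERE category = 'preferred'   -- whatever the search conditions are
ORDER BY score DESC
LIMIT 25;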
I was then thinking that I wanted other users of the system to be able to set up their own custom scoring criteria. My first thought was to create a separate table that would contain the first table's row id, a user's id, and a rank. But that table could get very large (2 million rows for each user), so I am considering alternatives. The options I have so far are:
1) Use a separate ranking table
2) Use user-specific ranking tables
3) Calculate on the fly
Does anyone have experience with a similar problem? The results will be searched in real time by users, so my primary concern is to make this part of the process as fast as possible.
Many Thanks
I prefer a separate table, as you can generate a new temporary table while re-calculating, and then rename the temp table to the real table.
This also may help you to bypass locking problems later...
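A minimal sketch of that swap (table and column names are illustrative):

CREATE TABLE rankings_new LIKE rankings;

INSERT INTO rankings_new (row_id, user_id, score)
SELECT i.id, u.id, 0        -- the real scoring expression goes here
FROM items AS i
CROSS JOIN users AS u;

-- RENAME TABLE performs both renames atomically, so readers never
-- see a half-built table
RENAME TABLE rankings TO rankings_old,
             rankings_new TO rankings;

DROP TABLE rankings_old;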
I have a fairly complex view with two queries (a view in a view): one selects users with related data, the other selects orders with related data. Both of them have filters, and now I have an issue and am looking for a proper, decent solution with good performance, because the queries involve a lot of data and relationships.
Assume I have:
Query 1 - Selects user data, with some left joins to other tables; conditions depend on provided parameters.
Query 2 - Selects orders depending on the users from Query 1, with many joins; conditions depend on parameters.
I display data from the two queries in one view: users, their data, orders, and some order data. Now I want to implement a pager, but it has to display the proper number of users depending on the filters from both Query 1 and Query 2. The issue is that I can't simply apply a limit in either query, because the other one has filters as well, so some of those users may end up not being selected for display at all.
So I guess there are two ways. One is to put those queries in a loop and collect data until I get the proper number of results.
Another way is to merge the two queries into one, but then there is an issue: I get many rows per user, so I can't set a page limit and get results for only a specific number of users, for example 30. The results look like user 1 => order 1, user 1 => order 2, so is there any way to get a specific number of unique results based on user id or something similar?
Let me know if you have any questions.
Sample data would make this clearer; I am unable to understand the whole requirement from your question. Would you be able to create some sample data and share it with us? If you are dealing with a lot of data, avoid loops, as they will just make performance worse.
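If the goal is a page of, say, 30 distinct users regardless of how many order rows each one produces, one common pattern is to page over a derived table of user ids first and then join the detail rows back on. A sketch, with placeholder table and filter names:

SELECT u.*, o.*
FROM (
    SELECT DISTINCT u.id
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.id
    WHERE /* filters from Query 1 and Query 2 */ 1
    ORDER BY u.id
    LIMIT 30 OFFSET 0
) AS page
JOIN users AS u  ON u.id = page.id
JOIN orders AS o ON o.user_id = u.id;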
Which would be the best way to find the biggest ID in MySQL?
I am working on an eCommerce website and I need to find the maximum ID.
But given the big table size and the high frequency of database use by the web application, I would like to know more about how MySQL finds the biggest ID when using MAX().
The only two methods I know are:
Sorting by id in descending order and taking the first row
MAX(id)
Databases are good at data. Correctly indexed, MySQL is no exception.
SELECT MAX(id)
FROM tablename
So keep it simple.
This will scan backwards through an id-based index to find the maximum value.
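You can confirm this with EXPLAIN; when id is the primary key (or otherwise indexed), MySQL resolves the aggregate from the index without touching the rows:

EXPLAIN SELECT MAX(id) FROM tablename;
-- the Extra column typically reads "Select tables optimized away"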
What about creating a persistent variable, so that every time a new record is added to the table, the max_tableA_id variable gets updated and is always within easy reach?
Alternatively, you could create a simple table with two columns...
table name and current max id
...and then update the appropriate record each time a new record is added to the table.
Now all you need is a simple query to get the current max id for a given table.
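A sketch of that lookup table, maintained by a trigger (the table, trigger, and column names are illustrative):

CREATE TABLE max_ids (
    table_name VARCHAR(64) PRIMARY KEY,
    current_max_id BIGINT NOT NULL
);

CREATE TRIGGER trg_tableA_max
AFTER INSERT ON tableA
FOR EACH ROW
    UPDATE max_ids
    SET current_max_id = NEW.id
    WHERE table_name = 'tableA';

SELECT current_max_id FROM max_ids WHERE table_name = 'tableA';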
I have a pretty huge MySQL database and having performance issues while selecting data. Let me first explain what I am doing in my project: I have a list of files. Every file should be analyzed with a number of tools. The result of the analysis is stored in a results table.
I have one table with files (samples). The table contains about 10 million rows. The schema looks like this:
idsample|sha256|path|...
The other (really small table) is a table which identifies the tool. Schema:
idtool|name
The third table is going to be the biggest one. The table contains all results of the tools I am using to analyze the files (The number of rows will be the number of files TIMES the number of tools). Schema:
id|idsample|idtool|result information| ...
What I am looking for is a query, which returns UNPROCESSED files for a given tool id (where no result exists yet).
The most efficient way I have found so far to query those entries is the following:
SELECT s.idsample
FROM samples AS s
WHERE s.idsample NOT IN (
    SELECT idsample
    FROM results
    WHERE idtool = 1
)
LIMIT 100
The problem is that the query is getting slower and slower as the results table is growing.
Do you have any suggestions for improvements? A further problem is that I cannot change the structure of the tables, as this is a shared database which is also used by other projects. (I think) the only avenue for improvement is to find a more efficient select query.
Thank you very much,
Philipp
A left join may perform better, especially if idsample is indexed in both tables; in my experience, these kinds of "find the missing rows" queries are better served by JOINs than by that kind of subquery.
SELECT s.idsample
FROM samples AS s
LEFT JOIN results AS r ON s.idsample = r.idsample AND r.idtool = 1
WHERE r.idsample IS NULL
LIMIT 100;
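If adding an index is acceptable (it does not change the table's columns), a composite index covering both the filter and the join should keep this anti-join fast as results grows; the index name here is just a suggestion:

CREATE INDEX idx_results_tool_sample ON results (idtool, idsample);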
Another, more involved solution would be to create a fourth table holding the full "unprocessed" list, and then use triggers on the other three tables to maintain it; i.e.:
when a new tool is added, add all the current files to that fourth table (with the new tool).
when a new file is added, add all the current tools to that fourth table (with the new file).
when a new result is entered, remove the corresponding record from the fourth table.
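A sketch of that fourth table and the third trigger (the table and trigger names are made up; the first two triggers would be analogous INSERT ... SELECT statements):

CREATE TABLE unprocessed (
    idsample INT NOT NULL,
    idtool INT NOT NULL,
    PRIMARY KEY (idsample, idtool)
);

CREATE TRIGGER trg_result_done
AFTER INSERT ON results
FOR EACH ROW
    DELETE FROM unprocessed
    WHERE idsample = NEW.idsample
      AND idtool = NEW.idtool;

-- fetching work then becomes a plain indexed lookup:
SELECT idsample FROM unprocessed WHERE idtool = 1 LIMIT 100;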
We are building an analytics engine which has to store an attribute preference score for each user. We are expecting 400 attributes, and they may change (at what frequency is not yet known). We are planning to store this in Redshift.
My question is:
Should we store 1 row per user with 400 columns (1 column for each attribute),
or should we go for a table structure like
(uid, attribute id, attribute value, preference score), which would be 20-400 rows of 4 columns per user?
Which kind of storage would lead to better performance in Redshift?
Should we really consider NoSQL for this?
Note:
1. This is a backend for a real-time application with an increasing number of users.
2. For processing, the above table has to be read with the entire information of all attributes for one user, i.e. effectively creating a 1x400 matrix at runtime.
Please advise which design would be ideal for such a use case. Thank you.
You can go for tables like the ones given in this example and then use bitwise functions, which are documented here:
http://docs.aws.amazon.com/redshift/latest/dg/r_bitwise_examples.html
For your problem, I would suggest a two-table design. It's more pain in the beginning but will help in the future.
The first table would be a key-value style table which stores all the base data. It is future-proof in the sense that you can add or remove attributes and this table keeps working.
The second table would be an N-column table (400 columns in your case), built from the first table. For the second table, you can start with a bare minimum set of columns, let's say only 50 of those 400, so that querying it is really fast. The structure of this table can be refreshed periodically to match the current reporting requirements, and you will always have the base table in case you need to backfill any data.
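A sketch of the two tables (names, types, and keys are assumptions to illustrate the idea; distributing and sorting both tables on uid keeps per-user reads and the periodic rebuild cheap in Redshift):

CREATE TABLE user_attr_scores (      -- base key-value table
    uid BIGINT,
    attr_id INT,
    score FLOAT
)
DISTKEY (uid)
SORTKEY (uid, attr_id);

CREATE TABLE user_scores_wide        -- rebuilt periodically
DISTKEY (uid)
SORTKEY (uid)
AS
SELECT uid,
       MAX(CASE WHEN attr_id = 1 THEN score END) AS attr_1,
       MAX(CASE WHEN attr_id = 2 THEN score END) AS attr_2
       -- ... one CASE per column you actually query
FROM user_attr_scores
GROUP BY uid;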
This may be a little difficult to answer given that I'm still learning to write queries and I'm not able to view the database at the moment, but I'll give it a shot.
The database I'm trying to acquire information from contains a large table (TransactionLineItems) that essentially functions as a store transaction log. This table currently contains about 5 million rows and several columns describing products which are included in each transaction (TLI_ReceiptAlias, TLI_ScanCode, TLI_Quantity and TLI_UnitPrice). This table has a foreign key which is paired with a primary key in another table (Transactions), and this table contains transaction numbers (TRN_ReceiptNumber). When I join these two tables, the query returns one row for every item we've ever sold, and each row has a receipt number. 16 rows might have the same receipt number, meaning that all of these items were sold in a single transaction. Below that might be 12 more rows, each sharing another receipt number. All transactions are broken down into multiple rows like this.
I'm attempting to build a query which returns all rows sharing a single receipt number where at least one row with that receipt number meets certain criteria in another column. For example, three separate types of gift cards all have values in the TLI_ScanCode column that begin with "740000." I want the query to return rows with values beginning with these six digits in the TLI_ScanCode column, but I would also like to return all rows which share a receipt number with any of the rows which meet the given scan code criteria. Essentially, I need the query to return all rows for every receipt number which is also paired in at least one row with a gift card-related scan code.
I attempted to use a subquery to return a column of all receipt numbers paired with gift card scan codes, using "WHERE A.TRN_ReceiptAlias IN (subquery..." to return only those rows with a receipt number which matched one of the receipt numbers returned by the subquery. This appeared to run without issue for five minutes before the server ground to a halt for another twenty while it processed the query. The query appeared to complete successfully, but given that I was working with IT to restore normal store operations during this time I failed to obtain the results of the query (apart from the associated shame and embarrassment).
I'd like to know if there is a way to write a query to obtain this information without causing the server to hang. I'm assuming that either: a) it wasn't very smart to use a subquery in this manner on such a large table, or b) I don't know enough about SQL to obtain the information I need. I'm assuming the answer is both A and B, but I'd very much like to learn how to do this the right way. Any help would be greatly appreciated. Thanks!
SELECT *
FROM a AS a1
JOIN b
  ON b.id = a1.id
JOIN a AS a2
  ON a2.id = b.id
WHERE b.some_criteria = 'something';
Include an index on (b.id, b.some_criteria).
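Applied to the tables in the question, the same shape might look like the following. The join key names are assumptions (the question doesn't give the exact FK/PK column names), and the filter moves onto the line-items side since the scan code lives there:

SELECT DISTINCT all_items.*
FROM TransactionLineItems AS gift           -- rows matching the criteria
JOIN Transactions AS t
  ON t.TRN_ID = gift.TLI_TransactionID      -- assumed PK/FK pair
JOIN TransactionLineItems AS all_items      -- every row on the same receipt
  ON all_items.TLI_TransactionID = t.TRN_ID
WHERE gift.TLI_ScanCode LIKE '740000%';

The DISTINCT guards against duplicate output when one receipt contains more than one gift card row.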
You aren't the first person, nor will you be the last to bring down your system with an inefficient query.
The most important lesson is that "Decision Support" and "Analytics" really don't co-exist with a transactional system. You really want to pull the data into a data mart, data warehouse, or some other database that isn't your transaction database, so that you don't take the business offline.
In terms of understanding why your initial query was so inefficient, you want to familiarize yourself with the EXPLAIN EXTENDED syntax, which returns plan information that should help you debug your query and work on making it perform acceptably. If you update your question with the actual explain plan output, that would help in determining what the issue is.
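For example (substitute your actual query; the one below is just a stand-in):

EXPLAIN EXTENDED
SELECT * FROM TransactionLineItems WHERE TLI_ScanCode LIKE '740000%';
SHOW WARNINGS;   -- prints the query as the optimizer rewrote it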
Just from the outline you provided, it does sound like a self join would make sense rather than the subquery.