Storing large, session-level datasets? - mysql

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.

First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Related

How should I organize user data with several rows in MySQL?

I am currently developing a quiz-app, that keeps track of user data, such as :
the sets they've studied by the ID of the specific set ([1,2,6,12])
the friends they have by the id of the user ([1,2,3,4])
their schedule(
{"2022-07-03 00:00:00":{"551":{"type":"Flashcards","setid":"1"},"552":{"type":"Flashcards","setid":"1"},"553":{"type":"Flashcards","setid":"1"},"554":{"type":"Flashcards","setid":"1"},"555":{"type":"Flashcards","setid":"1"},"556":{"type":"Flashcards","setid":"1"},"557":{"type":"Flashcards","setid":"6"},"558":{"type":"Flashcards","setid":"6"},"559":{"type":"Flashcards","setid":"6"},"560":{"type":"Flashcards","setid":"6"}}})
every individual day they've logged in (["05/15/2022","05/17/2022","05/18/2022","05/19/2022","05/22/2022","05/23/2022","05/24/2022","05/25/2022","05/28/2022","05/29/2022","05/30/2022","05/31/2022","06/02/2022","06/05/2022","06/07/2022","06/08/2022","06/10/2022","06/11/2022","06/13/2022","06/14/2022","06/15/2022","06/17/2022","06/18/2022","06/19/2022","06/20/2022","06/22/2022","06/24/2022","06/25/2022","06/26/2022","06/28/2022","06/29/2022","06/30/2022","07/01/2022","07/02/2022"])
Note: there is quite a lot of other types of information that is stored inside the table.
All of this aforementioned information is collected in a mysql table called "users", which has rows for each user, with accompanying data (as mentioned above).
It has recently come to my attention that MySQL has data limits for the amount of data that can be represented in a given row (around 65K bytes). If I continue to represent data this way, I believe that at scale (assume a user uses the app for 5 years, imagine the amount of data inside the "every individual day they've logged in" table), I will face MySQL's data limits and it may cause problems in the future.
Here's a picture, showing how the information is represented inside of the table "users"
How would I better represent this type of table? Should I use multiple tables inside an SQL database? How should I format it? Do I not have to worry about the data limit, and should I continue saving data in this way?
Thanks.
If I understand this correctly, you are packing way too much information in each row. The structure of your data is not being represented in a way that allows MySQL to do what it is good at. You are just creating big buckets for each user and stashing them in MySQL.
To make this work better, you can either create tables to store each relationship (this is, after all, a relational database) like user_login, user_friend_requests, and so on. The direct answer to your question is that each cell in your table should be a table itself.
OR, you can embrace the blob, and use something like mongodb, which is much more suited for storing and retrieving the data in a way that fits your mindset. Since you don't do any real queries on the data, a NoSQL solution would probably fit you better.
So the "right" answer to your question is "modify your schema to store this data better, or switch your database to match the way you want to store the data."
However, having said all that, since it seems you are storing JSON in those cells, you can use the JSON data type (max size 1GB but better if you don't use so much - see https://dev.mysql.com/blog-archive/how-large-can-json-documents-be), or LONGTEXT, 4GB. (Assuming you are running in a 64-bit environment - see Maximum length for MySQL type text).
The JSON data type actually has some pretty cool features.

Matching 2 databases of names, given first, last, gender and DOB?

I collect a list of Facebook friends from my users including First, Last, Gender and DOB. I am then attempting to compare that database of names (stored as a table in MySQL) to another database comprised of similar information.
What would be the best way to conceptually link these results, with the second database being the much larger set of records (>500k rows)?
Here was what I was proposing:
Iterate through Facebook names
Search Last + DOB - if they match, assume a "confident" match
Search Last + First - if they match, assume a "probable" match
Search Last + Lichtenstein(First) above a certain level, assume a "possible" match
Are there distributed computing concepts that I am missing that may make this faster than a sequential mySQL approach? What other pitfalls may spring up, noting that it is much more important to not have a false-positive rather than miss a record?
Yes, your idea seems like a better algorithm.
Assuming performance is your concern, you can use caching to store the values that are just being searched. You can also start indexing the results in a NoSQL database such that the results will be very faster, so that you will have better read performance. If you have to use MySQL, read about polyglot persistence.
Assuming simplicity is your concern, you can still use indexing in a NoSQL database, so over the time you don't have to do myriad of joins will spoil the experience of the user and the developer.
There could be much more concerns, but it all depends on where would you like to use it, to use in a website, or to such data analytic purpose.
If you want to operate on the entire set of data (as opposed to some interactive thing), this data set size might be small enough to simply slurp into memory and go from there. Use a List to hang on to the data then create a Map> that for each unique last name points (via integer index) to all the places in the list where it exists. You'll also set yourself up to be able to perform more complex matching logic without getting caught up trying to coerce SQL into doing it. Especially since you are spanning two different physical databases...

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a tomcat server which recieves different types of data from about 1000 systems out in the field. Each of these systems are unique, and will be reporting unique data.
The data sent can be categorized as frequent, and unfrequent data. The unfrequent data is only sent about once a day and doesn't change much - it is basically just configuration based data.
Frequent data, is sent every 2-3 minutes while the system is turned on. And represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs and there is currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't too future proof if I need to add columns in the future. I am also worried about insert speed if I do it that way. This might blow out to a 2000row x 200column table that gets accessed about 100 times a second so I need to cater for this in my initial design.
I am also wondering, if there is any design philosophies out there that cater for frequently changing, and seldomly changing data based on the engine. This would make sense as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from php.
I would also love to know how to split up data. I.e. if frequently changing data can be categorised in a few different ways should I have a bunch of tables, representing the data and join them on selects? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction, any help on the matter would be great. Or if someone has done something similar and can offer advise I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all needs a column. We need to be able to filter and sort all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm are open for suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. we are talking about 100 - 500 requests per hours but this will be growing. The fields will not change regularly but it could be once every 6 months. We will also be updating the datafeeds daily. So checking if items are updated and deleting old and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', then there shouldn't be any need to query the database every time. Although, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If you have all your 120 columns containing textual data, then a single row will take several K-bytes of disk space. In such situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be, will each request extract a single row, a small bunch of rows or a big portion (up to whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
Third, try to split data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are typically being used by the all queries, the rest 60-70% are evenly distributed among all of them and used partially. In this case I would recommend splitting main table accordingly: extract the fields that are always used into single "master" table, and create another one for the rest of the fields. In fact, you can have several "another ones", logically grouping data in a separate tables.
In my practice we've had a table that contained customer detailed information: name details, addresses details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive ones, as it was used in the majority of our reports (reports typically perform Full scans). Splitting this table into a set of smaller ones and building a view with rules on top of them (to make external application happy) we've managed to gain a pleasant performance boost (sorry, don't have numbers any longer).
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

high load on mysql DB how to avoid?

I have a table contain the city around the worlds it contain more than 70,000 cities.
and also have auto suggest input in my home page - which used intensively in my home page-, that make a sql query (like search) for each input in the input (after the second letter)..
so i afraid from that heavily load,,...,, so I looking for any solution or technique can help in such situation .
Cache the table, preferably in memory. 70.000 cities is not that much data. If each city takes up 50 bytes, that's only 70000 * 50 / (1024 ^ 2) = 3MByte. And after all, a list of cities doesn't change that fast.
If you are using AJAX calls exclusively, you could cache the data for every combination of the first two letters in JSON. Assuming a Latin-like alphabet, that would be around 680 combinations. Save each of those to a text file in JSON format, and have jQuery access the text files directly.
Create an index on the city 'names' to begin with. This speeds up queries that look like:
SELECT name FROM cities WHERE name LIKE 'ka%'
Also try making your auto complete form a little 'lazy'. The more letters a user enters, lesser the number of records your database has to deal with.
What resources exist for Database performance-tuning?
You should cache as much data as you can on the web server. Data that does not change often like list of Countries, Cities, etc is a good candidate for this. Realistically, how often do you add a country? Even if you change the list, a simple refresh of the cache will handle this.
You should make sure that your queries are tuned properly to make best use of Index and Join techniques.
You may have load on your DB from other queries as well. You may want to look into techniques to improve performance of MySQL databases.
Just get your table to fit in memory, which should be trivial for 70k rows.
Then you can do a scan very easily. Maybe don't even use a sql database for this (as it doesn't change very often), just dump the cities into a text file and scan that. That'd definitely be better if you have many web servers but only one db server as each could keep its own copy of the file.
How many queries per second are you seeing peak? I can't imagine there being that many people typing city names in, even if it is a very busy site.
Also you could cache the individual responses (e.g. in memcached) if you get a good hit rate (e.g. because people tend to type the same things in)
Actually you could also probably precalculate the responses for all one-three letter combinations, that's only 26*26*26 (=17k) entries. As a four or more letter input must logically be a subset of one of those, you could then scan the appropriate one of the 17k entries.
If you have an index on the the city name it should be handled by the database efficiently. This statement is wrong, see comments below
To lower the demands on your server resources you can offer autocompletion only after n more characters. Also allow for some timeout, i.e. don't do a request when a user is still typing.
Once the user stopped typing for a while you can request autocompletion.