Redis design help (From Relational to NoSQL) - mysql

I am very much a SQL developer and am new to Redis, but its performance is very interesting. I have a problem I think Redis could help me with a lot. I have a SQL table similar to this:
| CONTAINER <String><NoUnq> | PROCESS <String><NoUnq> | PROCESS_DATA <String><NoUnq> | TimeCreated <TimeStamp><NoUnq>|
This table, when populated to its max, has roughly ~450,000,000 rows. I am running this on AWS. With these rows I select all the processes within a container (~1,000,000 containers), so I would have something like this in SQL (container is of course indexed):
SELECT * FROM table WHERE container = '[CONTAINER_NAME]';
I then have a cronjob script which runs every hour and removes old processes from containers with something like this:
DELETE FROM table WHERE TimeCreated <= [SOME_TIME];
So essentially I want to keep only processes that are no older than ~4-5 hours. Looking at Redis, I feel I could engineer something similar to improve my performance, but I am having trouble converting this SQL-like design into Redis.
My first thought was to use HSET, but I found out that HSET does not allow the EXPIRE command on individual fields, so I could not automatically remove old processes. I am most concerned about performance and efficiency.

Looks like you can (and probably should) use HSET. And it looks like you do not need to expire fields; you need to expire keys, with the key name based on the container name and EXPIREAT on that key. Given the relational table structure you describe above, the closest analogue is one table row per key:
MULTI
HMSET <container name:rowId> PROCESS <value> PROCESS_DATA <value>
EXPIREAT <container name:rowId> <TimeCreated + retention period>
EXEC
You can also use a ZSET to store a time-indexed list of rows per container:
ZADD <container name> <TimeCreated> <rowId>
So you can use ZRANGE as a SELECT equivalent. You can also use Lua scripting to get the contents of a container in one request. Something like this (there may be a mistake somewhere in the Lua syntax):
local result = {}
-- the sorted set holds the rowIds of the container, ordered by TimeCreated
local tmp = redis.call('zrange', KEYS[1], ARGV[1], ARGV[2])
for _, rowId in ipairs(tmp) do
  -- tables returned to Redis must be arrays, so append each rowId and its
  -- hash contents as alternating entries
  result[#result + 1] = rowId
  result[#result + 1] = redis.call('hgetall', KEYS[1] .. ':' .. rowId)
end
return result
Where KEYS[1] is the container name, ARGV[1] the start index and ARGV[2] the end index (e.g. 0 and -1 for the whole container).
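For comparison, here is a small redis-py sketch of the same layout (the client library, the retention constant and the helper names are assumptions of mine, not part of the answer). Note that when a row hash expires, its rowId is still left behind in the sorted set, so the index needs its own trim:

import redis

r = redis.Redis(decode_responses=True)
RETENTION = 5 * 3600  # keep processes ~5 hours, as in the question

def add_process(container, row_id, process, process_data, time_created):
    key = container + ":" + str(row_id)
    pipe = r.pipeline(transaction=True)           # MULTI ... EXEC
    pipe.hset(key, mapping={"PROCESS": process, "PROCESS_DATA": process_data})
    pipe.expireat(key, int(time_created) + RETENTION)
    pipe.zadd(container, {row_id: time_created})  # per-container time index
    pipe.execute()

def processes_in_container(container):
    # SELECT equivalent: walk the time index, then fetch each row hash
    result = {}
    for row_id in r.zrange(container, 0, -1):
        row = r.hgetall(container + ":" + row_id)
        if row:                                   # the hash may already have expired
            result[row_id] = row
    return result

def purge_old_index_entries(container, cutoff_timestamp):
    # DELETE equivalent for the index: the hashes expire on their own via EXPIREAT,
    # but their members still have to be trimmed out of the sorted set
    r.zremrangebyscore(container, "-inf", cutoff_timestamp)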
P.S. You should also understand how Redis expires keys in order to understand what happens to memory on your instance.

Related

Checking multiple columns for a concatenated match MySQL

Hi, I've hit a problem with my SQL queries. I have a table that contains 3 columns: one for vehicle brands, one for models and one for model versions.
So my data is split like
BRAND || MODEL || MODEL VERSION
RENAULT || R4 || R4 1.1 GTL
I've been asked to replace our current dropdown system with an input to make it easier for users to select their vehicle.
I'm using jQuery Autocomplete and my query looks something like this.
SELECT DISTINCT CONCAT(brand, ' ', model, ' ', version) AS data FROM vehicles WHERE brand LIKE '%RENAULT%' OR model LIKE '%RENAULT%' OR version LIKE '%RENAULT%' LIMIT 5
So far so good: this will output "RENAULT R4 R4 1.1 GTL" if I type in "RENAULT". The problem comes when the user enters something like "Renault R4" instead of just "Renault".
As they've included the model name as well as the brand, it doesn't really match any of my columns in the database and my Ajax call returns no results.
I need to query the actual result set from that CONCAT instead, so that anything the users type in will match the results, but I have no idea how I can do this.
In desperation I tried WHERE data LIKE '%RENAULT R4%', but as expected this also doesn't work. What can I do in this situation? Any help would be appreciated.
Easy and slow way: Split the string by spaces and ask for each word.
SELECT ...
WHERE
(brand LIKE '%Renault%' OR model LIKE '%Renault%' OR version LIKE '%Renault%')
AND (brand LIKE '%R4%' OR model LIKE '%R4%' OR version LIKE '%R4%')
LIMIT 5
Keep in mind that a query like this one cannot use any index, so it is very slow.
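A minimal sketch of building that split-by-spaces query dynamically in Python (the DB-API cursor, function name and limit handling are assumptions for illustration; the table and column names follow the question):

def autocomplete_search(cursor, user_input, limit=5):
    # One (brand OR model OR version) group per word, all ANDed together
    words = user_input.split()
    if not words:
        return []
    group = "(brand LIKE %s OR model LIKE %s OR version LIKE %s)"
    where = " AND ".join([group] * len(words))
    params = []
    for word in words:
        params.extend(["%" + word + "%"] * 3)
    sql = ("SELECT DISTINCT CONCAT(brand, ' ', model, ' ', version) AS data "
           "FROM vehicles WHERE " + where + " LIMIT " + str(int(limit)))
    cursor.execute(sql, params)
    return [row[0] for row in cursor.fetchall()]

# e.g. autocomplete_search(cursor, "Renault R4") would match "RENAULT R4 R4 1.1 GTL"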
The more complicated but much faster approach is to use a fulltext index. You need a recent version of MySQL (5.6 or newer); older versions support fulltext indexes only on MyISAM tables, which are not really a proper database engine.
CREATE FULLTEXT INDEX idx ON vehicles(brand, model, version);
SELECT ... FROM vehicles
WHERE MATCH(brand, model, version) AGAINST('Renault R4')
LIMIT 5;
(Query not tested, but you should get the idea.)
I can only think of this one, but I believe there are better ways to do it: add a predicate on the concatenated value itself.
OR CONCAT(brand, ' ', model, ' ', version) LIKE '%RENAULT R4%'

Multiple, unknown number of fields passed into a query

Is it possible to create a generic query that would work for different types of documents? For example I have "cases" and "factories",
They have different sets of fields, e.g.:
{
  id: 'case_o1',
  name: 'Case numero uno',
  amount: 40
}
{
  id: 'factory_002',
  location: 'Venezuela',
  workers: 200,
  operating: true
}
Is it possible to create a generic query where I would pass the type of an entity (case or factory) and additional parameters and it would filter results based on those?
I could of course use a JavaScript view, but it doesn't allow me to filter by multiple fields. Let's say I want to fetch all factories located in Venezuela, with a number of workers between 20 and 55.
I started with this, but then I got stuck:
select * from `mybucket` as entity
where position(meta(entity).id, $entity_type) == 0
How do I pass multiple predicates and have the query to recognize them?
I can of course list fields like this:
where position(meta(entity).id, $entity_type) == 0
and entity.location == 'Venezuela'
and entity.workers > $workers_min
and entity.workers < $workers_max
but then I'm going to have to create a separate query for each entity type, and even then it won't solve my problem: I have no idea how to ignore predicates. What if next time $workers_min and $workers_max are not passed? Does that mean I have to create a query for every single combination of predicates (columns)?
For security reasons I cannot generate free-form queries and pass them to the Couchbase server; all the queries are already stored in the database, and our API just picks them up out of a document and executes them.
I think it's possible to create a query that would be "short-circuiting" for args that are undefined (e.g. WHERE $location IS MISSING OR entity.location == $location, or something like that).
Is it possible at all to create a query that can effectively filter and order a dataset based on arbitrary parameters? Or is there no way?
@Agzam: sorry, I was writing my comment when you said it. Anyway, what you are asking for is possible using coalesces in not-too-complex expressions, but it is a REALLY bad idea, because it defeats most internal database optimizations, including the use of any existing index. So unless you are dealing with a relatively small database (and you are sure it will stay roughly the same size), I suggest you try a different approach… This is, in fact, the reason I implemented sqlapi.
If you need to have all queries stored in the database in advance, it would probably be much better to sort the given arguments by name and precalculate and store a query for each possible combination.
You can do it by assigning a default value to the variable when it is not used. For instance, if $location is not used you can set it to -1 as the default value.
Then the where condition would be:
WHERE ($location=-1 OR entity.location = $location)
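A minimal sketch of that default-value idea on the application side (the parameter names, sentinel values and the run_stored_query helper are hypothetical, only to show how sentinels would be bound to a pre-stored statement):

# The stored WHERE clause would pair each parameter with its sentinel, e.g.
#   WHERE ($location = '-1' OR entity.location = $location)
#     AND ($workers_min = -1 OR entity.workers > $workers_min)
#     AND ($workers_max = -1 OR entity.workers < $workers_max)
DEFAULTS = {
    "location": "-1",   # "-1" means "ignore this predicate"
    "workers_min": -1,
    "workers_max": -1,
}

def bind_params(**provided):
    # Start from the sentinel defaults and overwrite only what the caller passed
    params = dict(DEFAULTS)
    params.update(provided)
    return params

params = bind_params(location="Venezuela")
# run_stored_query("filter_factories", params)   # hypothetical API-layer helper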

TOS DI variable in tMySqlInput

I'm relatively new to Talend Open Studio for Data Integration. I managed to do simple queries in MySQL with the tMySqlInput component. However, today I have a more ambitious query and am having some trouble making it work.
I need a query where the result depends on the previous row. I got it working in MySQL Workbench but not in Talend. Example: the delay between two dates.
Here is the request :
SET @var = NULL;
SELECT id, start_date, end_date, @var precedent, UNIX_TIMESTAMP(TIMEDIFF(start_date, @var)) AS diff, @var := start_date AS temp
FROM ma_table
ORDER BY start_date;
and errors are :
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT id, start_date, end_date, id_process_type, @var precedent, UNIX_TIMESTAMP' at line 2
...Not very useful. Is this syntax forbidden in Talend? Are there other ways to do such queries in Talend (for the delay between two dates, for example), or maybe another component? I am looking into tMysqlRow.
Thanks for any ideas!
As @Gabriele B mentions, you might want to consider doing this in a more "Talend" way.
I'd personally make use of the tMemorizeRows component to do this though.
To simplify this I've gone and made the start and end dates as integers but it should be trivial to handle this using proper dates.
If we have some data that shows the start and end date of a process and we want to work out the delay between finishing the last one and starting the next process we can read all of the data in and then use the tMemorizeRows component to remember the last 2 rows:
We then access the memorized data by looking at the array index. So here we go to a tJavaRow component that has an extra output column, startDelay. We then calculate it as the current process' start date minus the last process' end date:
output_row.id = input_row.id;
output_row.startdate = input_row.startdate;
output_row.enddate = input_row.enddate;

// index 0 is the current row, index 1 the previously memorized row;
// on the very first row there is no previous row, so default the delay to 0
if (id_tMemorizeRows_1[0] != 1) {
    output_row.startDelay = startdate_tMemorizeRows_1[0] - enddate_tMemorizeRows_1[1];
} else {
    output_row.startDelay = 0;
}
The conditional statement is there to avoid null pointer errors on the first run of the data, as enddate_tMemorizeRows_1[1] will be null at that point. You could of course handle the null in other ways.
This process is reasonably easy to understand and maintain (although there is that small bit of Java code in there) and has the benefit of only needing to load the data once and only keeping a small part of it in memory at any one time. It should also be very fast.
You should consider refactoring the statement to do it in a "Talend" way; it may be a little slower, but it is more portable and robust.
If your table is not huge, for example, I would recommend loading it into memory using tCacheOutput/tCacheInput (you can find them on Talend Exchange) with this design:
tMySqlLoad----->tCacheOutput_1
      |
      |
  OnSubjobOk
      |
      v
tCacheInput_1------->tMap_1--------+
                                   |
                                   |
                                 tJoin-------------->tMap_3------------>[output]
                                   |
                                   |
tCacheInput_2------->tMap_2--------'
First of all you dump your table into a memory buffer.
Then you read this buffer twice. It's in memory, so it won't hurt performance.
In tMap_1 you add an auto-increment index using a Numeric.sequence.
You do the same in tMap_2 but with a starting number of 2 (basically, you shift the index).
Then you auto-join the table using these brand new columns.
Finally, in tMap_3 you release your payload (i.e. compute the diff).
This is going to be a verbose but robust solution if your table is small. If it's not, and performance is not an issue, you can try an even more verbose solution like prepared statements.
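The shift-and-self-join idea is easier to see outside Talend. A tiny sketch with assumed sample values, just to illustrate pairing each row with the previous one, which is what the shifted index plus tJoin achieves:

# rows already sorted by start_date, as (id, start_date, end_date)
rows = [
    (1, 100, 110),
    (2, 130, 150),
    (3, 155, 170),
]

# zip(rows, rows[1:]) pairs each row with its predecessor, like the shifted join
for prev, cur in zip(rows, rows[1:]):
    start_delay = cur[1] - prev[2]   # current start_date minus previous end_date
    print(cur[0], start_delay)       # prints "2 20", then "3 5"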

Rails best way to add huge amount of records

I've got to add about 25000 records to the database at once in Rails, and I have to validate them, too.
Here is what I have for now:
# controller create action
def create
  emails = params[:emails][:list].split("\r\n")
  @created_count = 0
  @rejected_count = 0
  inserts = []
  emails.each do |email|
    @email = Email.new(:email => email)
    if @email.valid?
      @created_count += 1
      inserts.push "('#{email}', '#{Date.today}', '#{Date.today}')"
    else
      @rejected_count += 1
    end
  end
  return if emails.empty?
  sql = "INSERT INTO `emails` (`email`, `updated_at`, `created_at`) VALUES #{inserts.join(", ")}"
  Email.connection.execute(sql) unless inserts.empty?
  redirect_to new_email_path, :notice => "Successfully created #{@created_count} emails, rejected #{@rejected_count}"
end
It's VERY slow now; there is no way to add that many records because the request times out.
Any ideas? I'm using MySQL.
Three things come to mind:
You can help yourself with proper tools like zdennis/activerecord-import or jsuchal/activerecord-fast-import. The problem with your example is that you also create 25000 ActiveRecord objects; if you tell activerecord-import not to use validations, it will not create new objects (see activerecord-import/wiki/Benchmarks).
Importing tens of thousands of rows into a relational database will never be super fast; it should be done asynchronously via a background process. There are also tools for that, like DelayedJob and more: https://www.ruby-toolbox.com/
Move the code that belongs in the model out of the controller(TM).
After that, you need to rethink the flow of this part of the application. If you're using background processing inside a controller action like create, you cannot simply return HTTP 201 or HTTP 200. What you need to do is return a "quick" HTTP 202 Accepted and provide a link to another representation where the user can check the status of their request (is it done yet? how many emails failed?), since it is now being processed in the background.
It may sound a bit complicated, and it is, which is a sign that maybe you shouldn't do it like that. Why do you have to add 25000 records in one request? What's the background?
Why don't you create a rake task for the work? The following link explains it pretty well.
http://www.ultrasaurus.com/sarahblog/2009/12/creating-a-custom-rake-task/
In a nutshell, once you write your rake task, you can kick off the work by:
rake member:load_emails
If speed is your concern, I'd attack the problem from a different angle.
Create a table that copies the structure of your emails table; let it be emails_copy. Don't copy indexes and constraints.
Import the 25k records into it using your database's fast import tools. Consult your DB docs or see e.g. this answer for MySQL. You will have to prepare the input file, but it's way faster to do — I suppose you already have the data in some text or tabular form.
Create indexes and constraints for emails_copy to mimic the emails table. Constraint violations, if any, will surface; fix them.
Validate the data inside the table. It may take a few raw SQL statements to check for severe errors. You don't have to validate emails for anything but very simple format anyway. Maybe all your validation could be done against the text you'll use for import.
insert into emails select * from emails_copy to put the emails into the production table. Well, you might play a bit with it to get autoincrement IDs right.
Once you're positive that the process succeeded, drop table emails_copy.
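A minimal sketch of those steps as raw MySQL statements driven from Python (the column list, file path and the very crude format check are assumptions for illustration, not part of the answer):

STEPS = [
    # 1. Staging table without indexes or constraints
    "CREATE TABLE emails_copy (email VARCHAR(255), created_at DATETIME, updated_at DATETIME)",
    # 2. Fast bulk load (LOCAL INFILE has to be enabled on both server and client)
    "LOAD DATA LOCAL INFILE '/tmp/emails.csv' INTO TABLE emails_copy (email) "
    "SET created_at = NOW(), updated_at = NOW()",
    # 3. Add constraints; violations surface here
    "ALTER TABLE emails_copy ADD UNIQUE KEY idx_email (email)",
    # 4. Very simple validation done in SQL instead of per-object in Rails
    "DELETE FROM emails_copy WHERE email NOT LIKE '%_@_%._%'",
    # 5. Move the survivors into the production table, then clean up
    "INSERT INTO emails (email, created_at, updated_at) "
    "SELECT email, created_at, updated_at FROM emails_copy",
    "DROP TABLE emails_copy",
]

def staged_import(connection):
    # any MySQL DB-API connection (e.g. PyMySQL, mysql.connector)
    with connection.cursor() as cursor:
        for statement in STEPS:
            cursor.execute(statement)
    connection.commit()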

Query on custom metadata field?

This is a request from my client to tweak an existing Perl script. However, it is the actual database structure on their end that confuses me.
The requirement looks pretty simple:
only pull records where _X begins with 1, 2, or 9.
However, the underlying database is not that simple, here is the guideline from their DBA:
"_X is a custom metadata field. The database stores this data in rows, not columns, within the customData table. In order to query the custom data table in an efficient manner you need to know the Field_ID for the custom field you get that from the fielddef table:
SELECT Field_ID FROM FieldDef WHERE Name = "_X";
This returns:
10012
"Now you can query CustomData. For example:
SELECT Record_ID FROM CustomData where Field_ID="10012" AND StringValue="2012-04";
He also suggests that in my case, probably it would be:
"SELECT Record_ID FROM CustomData where Field_ID="10012" AND (StringValue LIKE '1%' || StringValue LIKE '2%' || StringValue LIKE '9%')
The weird thing is that the existing Perl script doesn't contain anything like "SELECT Record_ID FROM"; it is all "SELECT StringValue FROM".
So that is why I am very confused here: what does "stores this data in rows, not columns" mean? Why query the FieldDef table first and then CustomData? I won't be able to reach any of them over the weekend, but I would really like to get some idea of the whole thing; I hope experts can help me sort out the structure.
More info(Table schema):
http://pastebin.com/ZiDTCCC0
The existing Perl script (focus on lines 72-136):
http://pastebin.com/JHpikTeZ
Thanks in advance.
What they seem to be using is some kind of Entity-Attribute-Value model, with the attributes identified by integer IDs that are defined in another table (FieldDef).
You explained pretty well how you queried it (although you can do it in one query, with a join or a subquery), and your problem seems to be that you don't know how the Perl script does it. Unfortunately, without us seeing the Perl script, we can't either :]
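Since the lookup can be folded into one statement with a join, here is a minimal sketch of what that could look like (table and column names are taken from the DBA's examples; the DB-API driver and function name are assumptions):

ONE_QUERY = """
SELECT cd.Record_ID, cd.StringValue
FROM CustomData AS cd
JOIN FieldDef AS fd ON fd.Field_ID = cd.Field_ID
WHERE fd.Name = %s
  AND (cd.StringValue LIKE %s OR cd.StringValue LIKE %s OR cd.StringValue LIKE %s)
"""

def matching_records(cursor):
    # pull records where _X begins with 1, 2, or 9, in a single round trip
    cursor.execute(ONE_QUERY, ("_X", "1%", "2%", "9%"))
    return cursor.fetchall()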