Decatenate with MySQL?

I have an authors table in my database that lists an author's whole name, e.g. "Charles Dickinson". I would like to sort of "decatenate" at the space, so that I can get "Charles" and "Dickinson" separately. I know there is the explode function in PHP, but is there anything similar for a straight MySQL query? Thanks.

No, don't do that. Seriously. That is a performance killer. If you ever find yourself having to process a sub-column (part of a column) in some way, your DB design is flawed. It may well work okay on a home address book application or any of myriad other small databases but it will not be scalable.
Store the components of the name in separate columns. It's almost invariably a lot faster to join columns together with a simple concatenation (when you need the full name) than it is to split them apart with a character search.
If, for some reason, you cannot split the field, at least put in the extra columns and use an insert/update trigger to populate them. While not 3NF, this will guarantee that the data stays consistent and will massively speed up your queries. You could also ensure that the extra columns are lower-cased (and indexed, if you're searching on them) at the same time, so as not to have to fiddle around with case issues.
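As a rough sketch of that trigger approach (the authors table is from the question, but the column names and the assumption that names are exactly "First Last" are hypothetical):

ALTER TABLE authors
    ADD COLUMN first_name VARCHAR(100),
    ADD COLUMN last_name  VARCHAR(100);

-- Backfill the existing rows once.
UPDATE authors
SET first_name = SUBSTRING_INDEX(full_name, ' ', 1),
    last_name  = SUBSTRING_INDEX(full_name, ' ', -1);

-- Keep the columns consistent on inserts (a BEFORE UPDATE trigger would mirror this).
CREATE TRIGGER authors_bi BEFORE INSERT ON authors FOR EACH ROW
    SET NEW.first_name = SUBSTRING_INDEX(NEW.full_name, ' ', 1),
        NEW.last_name  = SUBSTRING_INDEX(NEW.full_name, ' ', -1);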

This is related: MySQL Split String

Strategies for large databases with changing schemas

We have a MySQL database table with hundreds of millions of rows, and we run into issues performing any kind of operation on it. For example, adding columns has become impossible to do in any kind of predictable time frame. When we want to roll out a new column, the ALTER TABLE command takes forever, so we don't have a good idea of what the maintenance window will be.
We're not tied to keeping this data in MySQL, but I was wondering if there are strategies, for MySQL or databases in general, for updating schemas on large tables.
One idea, which I don't particularly like, would be to create a new table with the old schema plus the additional column, and run queries against a view which unions the results until all the data can be moved to the new table's schema.
Right now we already run into issues where deleting large numbers of records based on a WHERE clause fails with an error.
Ideas?
In MySQL, you can create a new table using an entity-attribute-value (EAV) model. This would have one row per entity and attribute, rather than putting each attribute in a new column.
Words of caution: types are problematic (everything tends to get turned into strings) and you cannot define foreign key relationships on the values.
EAV models are particularly useful for sparse data -- when you have attributes that only apply to a small number of rows. They may prove useful in your case.
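A minimal sketch of such a table (all names here are hypothetical):

CREATE TABLE entity_attributes (
    entity_id INT          NOT NULL,  -- the row in the original table
    attribute VARCHAR(64)  NOT NULL,  -- the "column" name, e.g. 'color'
    value     VARCHAR(255),           -- everything tends to become a string
    PRIMARY KEY (entity_id, attribute)
);

-- Rolling out a "new column" is now just inserting rows; no ALTER TABLE needed.
INSERT INTO entity_attributes VALUES (42, 'color', 'red');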
In NoSQL data models, adding new attributes or lists of attributes is simpler. However, there is no relationship to the attributes in other rows.
Columnar databases (at least the one in MariaDB) are very frugal with space -- some say 10x smaller than InnoDB. The shrinkage alone may be well worth it for 100M rows.
You have not explained whether your data is sparse. If it is, JSON is not that costly in space -- you completely leave out any 'fields' that are missing, so they take zero space. With almost any other approach, there is at least some overhead for missing cells.
As you suggest, use regular columns for the common fields, and also for the main fields you are likely to search on. Then throw the rest into JSON.
I like to compress the JSON string (in the client) and store it in a BLOB. This gives about 3x shrinkage over storing uncompressed TEXT.
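A rough sketch of that layout (the answer compresses in the client; MySQL's own COMPRESS()/UNCOMPRESS() are used here only to keep the example self-contained, and all names are hypothetical):

CREATE TABLE items (
    id    INT PRIMARY KEY,
    name  VARCHAR(100),   -- common, searchable fields stay as real columns
    extra BLOB            -- everything else, as compressed JSON
);

INSERT INTO items VALUES
    (1, 'widget', COMPRESS('{"color":"red","weight_g":120}'));

SELECT id, name, UNCOMPRESS(extra) AS extra_json
FROM items
WHERE name = 'widget';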
I dislike the one-row-per-attribute EAV approach; it is very costly in space, JOINs, and so on.
[More thoughts] on EAV.
Do avoid ALTER whenever possible.

SQL Index on Strings Helpful?

So I have used MySQL a lot in small projects for school; however, I'm now taking over an enterprise-ish scale project, and now speed matters, not just getting the right information back. I have Googled around a lot trying to learn how indexes might make my website faster, and I am hoping to understand how they work, not just when to use them.
So, I find myself doing a lot of SELECT DISTINCT queries in order to get all the distinct values, so I can populate my dropdowns. I have heard that this would be faster if the column were indexed; however, I don't completely understand why. If the values in this column were ints, I would totally understand: basically, a data structure like a BST would be created, and search times could be log(n). However, if my column is strings, how can it put a string in a BST? This doesn't seem possible, since there is no metric to compare a string against another string (like there is with numbers). It seems like an index would just create a list of all the possible values for that column, and the search would still require the database to go through every single row, making the search linear, just as if the database scanned a regular table.
My second question is: what does the database do once it finds the right value in the index data structure? For example, let's say I'm doing WHERE age = 42. The database goes through the data structure until it finds 42, but how does it map that lookup back to the whole row? Does the index have some sort of row number associated with it?
Lastly, if I am doing these frequent SELECT DISTINCT statements, is adding an index going to help? This must be a common task for websites, as many sites have dropdowns where you can filter results; I'm just trying to figure out if I'm approaching it the right way.
Thanks in advance.
Your logic is good; however, your assumption that there is no metric to compare strings against other strings is incorrect. Strings can simply be compared in alphabetical order, giving them a perfectly usable comparison metric that can be used to build the index.
It takes a tiny bit longer to compare strings than it does ints; however, having an index still speeds things up, regardless of the comparison cost.
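As a concrete (and hypothetical) example, a plain B-tree index works on a string column exactly as it would on an int column, because the index keeps the values in collation order:

CREATE TABLE products (
    id    INT PRIMARY KEY,
    color VARCHAR(50)
);

-- Values are stored in alphabetical (collation) order, so lookups are O(log n).
CREATE INDEX idx_products_color ON products (color);

-- MySQL can often answer this from the index alone, skipping from one
-- distinct value to the next rather than reading every row.
SELECT DISTINCT color FROM products;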
I would like to mention, however, that if you are using SELECT DISTINCT as much as you say, there are probably problems with your database schema.
You should learn about normalizing your database. I recommend starting with this link: http://databases.about.com/od/specificproducts/a/normalization.htm
Normalization can give you query structures whose benefits vastly outweigh those you get from indexing alone.
If your strings are something small like categories, then an index will help. If you have large chunks of random text, then you will likely want a full-text index. If you are having to use SELECT DISTINCT a lot, your database may not be properly normalized for what you are doing. You could also put the distinct values in a separate table (one that only has the distinct values), but this only helps if the content does not change a lot; see the sketch below. Indexing strategies are particular to your application's access patterns, the data itself, and how the tables are normalized (or not).
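One way to maintain such a lookup table (a rough sketch, reusing the hypothetical products table from the sketch above, and assuming the values change rarely):

CREATE TABLE product_colors (
    color VARCHAR(50) PRIMARY KEY
);

-- Refresh after each data load; INSERT IGNORE skips values already present.
INSERT IGNORE INTO product_colors
SELECT DISTINCT color FROM products;

-- The dropdown query now reads a tiny table instead of scanning products.
SELECT color FROM product_colors ORDER BY color;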
HTH

sql query LIKE % on Index

I am using a MySQL database.
My website is cut into different elements (PRJ_12 for project 12, TSK_14 for task 14, DOC_18 for document 18, etc.). We currently store the references to these elements in our database as VARCHAR. The reference columns are indexed so it is faster to select.
We are thinking of splitting these columns into 2 columns (one column "element_type" with PRJ and one "element_id" with 12). We are considering this solution because we do a lot of requests containing LIKE ...% (for example, retrieving all tasks of one user, no matter the id of the task).
However, splitting these columns in 2 will increase the number of indexed columns.
So, I have two questions:
Is a LIKE ...% request on an indexed column really slower than a simple WHERE query (without LIKE)? I know that if the column is not indexed, it is not advisable to do WHERE ... LIKE % requests, but I don't really know how indexes work.
The fact that we split the reference column in two will double the number of indexed columns. Is that a problem?
Thanks,
1) A LIKE is always more costly than a full comparison (with =); however, it all comes down to the field data types and the number of records (unless we're talking about a huge table, you shouldn't have issues).
2) Multicolumn indexes are not a problem; yes, it makes the index bigger, but so what? Data types and the total number of rows matter, but that's what indexes are for.
So go for it
There are a number of factors involved, but in general, adding one more index on a table that has only one index already is unlikely to be a big problem. Some things to consider.
If the table is mostly read-only, then it is almost certainly not a problem. If updates are rare, then the indexes won't need to be modified often, meaning there will be very little extra cost (aside from the additional disk space).
If updates to existing records do not change either of the key values, then no index modification should be needed, and so again there would be no additional runtime cost.
DELETEs and INSERTs will need to update both indexes. So if those make up the majority of the operations (and far exceed reads), then an additional index might incur measurable performance degradation (but it might not be a lot, and might not be noticeable from a human perspective).
The LIKE operator, as you describe the usage, should be fully optimized. In other words, the clause WHERE combinedfield LIKE 'PRJ%' should perform essentially the same as WHERE element_type = 'PRJ', assuming an index exists in both situations. The more expensive situation is if you use the wildcard at the beginning (e.g., LIKE '%abc%'). You can think of a LIKE search as being equivalent to looking up a word in a dictionary. The search for 'overf%' is basically the same as a search for 'overflow': you can do a "manual" binary search in the dictionary and quickly find the first word beginning with 'overf'. Searching for '%low', though, is much more expensive: you have to scan the entire dictionary in order to find all the words that end with "low".
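To make the dictionary analogy concrete (the table and index names here are hypothetical):

CREATE INDEX idx_elements_ref ON elements (reference);

-- Prefix search: a range scan on the index, like the 'overf%' dictionary lookup.
SELECT * FROM elements WHERE reference LIKE 'PRJ%';

-- Leading wildcard: the index cannot be used, so every row must be examined.
SELECT * FROM elements WHERE reference LIKE '%12';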
Having two separate fields to represent two separate values is almost always better in the long run since you can construct more efficient queries, easily perform joins, etc.
So, based on the given information, I would recommend splitting it into two fields and indexing both, along the lines of the sketch below.
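A sketch of that split (the column sizes and the composite index are assumptions):

ALTER TABLE elements
    ADD COLUMN element_type CHAR(3),
    ADD COLUMN element_id   INT;

-- Backfill from the existing VARCHAR reference, e.g. 'PRJ_12'.
UPDATE elements
SET element_type = SUBSTRING_INDEX(reference, '_', 1),
    element_id   = SUBSTRING_INDEX(reference, '_', -1);

-- One composite index serves both "all elements of a type" and exact lookups.
CREATE INDEX idx_elements_type_id ON elements (element_type, element_id);

SELECT * FROM elements WHERE element_type = 'TSK';                      -- replaces LIKE 'TSK%'
SELECT * FROM elements WHERE element_type = 'TSK' AND element_id = 14;  -- one element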

Is there any way to make queries using functions in their WHERE sections relatively fast?

Let's take a table Companies with columns id, name and UCId. I'm trying to find companies where the numeric portion of the UCId matches some string of digits.
The UCIds usually look like XY123456, but they're user input, and the users seem to really love leaving random spaces in them, sometimes even not entering the XY at all, and they want to keep it that way. What I'm saying is that I can't enforce a standard pattern. They want to enter it their way, and read it their way as well. So I'm stuck having to use functions in my WHERE section.
Is there a way to make these queries not take unusably long in MySQL? I know what functions to use and all that; I just need a way to make the search at least relatively fast. Can I somehow create a custom index with the functions already applied to the UCId?
Just for reference, an example of the query I'd like to use:
SELECT *
FROM Companies
WHERE digits_only(UCId) = 'some_digits'
I'll just add that the Companies table usually has tens of thousands of rows, and in some instances the query needs to be run repeatedly; that's why I need a fast solution.
Unfortunately, MySQL doesn't have function-based (generally speaking, expression-based) indexes like Oracle or PostgreSQL do. One possible workaround is to add another column to the Companies table which holds the normalized values (i.e., digits_only(UCId)). This column can be maintained by your code or via DB triggers set on INSERT/UPDATE.
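A sketch of that workaround, assuming MySQL 8.0+ so that REGEXP_REPLACE can stand in for the digits_only() function from the question (the column, index and trigger names are made up):

ALTER TABLE Companies ADD COLUMN UCId_digits VARCHAR(32);

-- Backfill existing rows, then index the normalized value.
UPDATE Companies SET UCId_digits = REGEXP_REPLACE(UCId, '[^0-9]', '');
CREATE INDEX idx_companies_ucid_digits ON Companies (UCId_digits);

-- Keep the column in sync from now on.
CREATE TRIGGER companies_bi BEFORE INSERT ON Companies FOR EACH ROW
    SET NEW.UCId_digits = REGEXP_REPLACE(NEW.UCId, '[^0-9]', '');
CREATE TRIGGER companies_bu BEFORE UPDATE ON Companies FOR EACH ROW
    SET NEW.UCId_digits = REGEXP_REPLACE(NEW.UCId, '[^0-9]', '');

-- The search then becomes an ordinary indexed lookup:
SELECT * FROM Companies WHERE UCId_digits = '123456';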

How do I search part of a column?

I have a MySQL table containing 40 million records that is being populated by a process over which I have no control. Data is added only once a month. This table needs to be searchable by the Name column, but that column contains the full name in the format 'Last First Middle'.
In the sphinx.conf, I have
sql_query = SELECT Id, OwnersName, \
    substring_index(substring_index(OwnersName, ' ', 2), ' ', -1) AS firstname, \
    substring_index(OwnersName, ' ', 1) AS lastname \
    FROM table1
How do I use Sphinx to search by firstname and/or lastname? For example, how can I search for 'Smith' in the first name only?
Per-row functions in SQL queries are always a bad idea for tables that may grow large. If you want to search on part of a column, it should be extracted out to its own column and indexed.
I would suggest, if you have power over the schema (as opposed to the population process), adding new columns called OwnersFirstName and OwnersLastName, along with an update/insert trigger which extracts the relevant information from OwnersName and populates the new columns appropriately.
This means the expense of figuring out the first name is only incurred when a row is changed, not every single time you run your query. That is the right time to do it.
Then your queries become blindingly fast. And, yes, this breaks 3NF, but most people don't realize that it's okay to do that for performance reasons, provided you understand the consequences. And, since the new columns are controlled by the triggers, the data duplication that would otherwise be cause for concern is "clean".
Most problems people have with databases come down to the speed of their queries. Wasting a bit of disk space to gain a large amount of performance is usually okay.
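A sketch of those columns and the backfill (the batch UPDATE reflects that data only arrives once a month; a trigger as described above would cover anything changed in between, and 'Last First Middle' means word 1 is the last name and word 2 the first name):

ALTER TABLE table1
    ADD COLUMN OwnersLastName  VARCHAR(100),
    ADD COLUMN OwnersFirstName VARCHAR(100);

UPDATE table1
SET OwnersLastName  = SUBSTRING_INDEX(OwnersName, ' ', 1),
    OwnersFirstName = SUBSTRING_INDEX(SUBSTRING_INDEX(OwnersName, ' ', 2), ' ', -1);

CREATE INDEX idx_table1_lastname  ON table1 (OwnersLastName);
CREATE INDEX idx_table1_firstname ON table1 (OwnersFirstName);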
If you have absolutely no power over even the schema, another possibility is to create your own database with the "correct" schema and populate it periodically from the real database, then query yours. That may involve a fair bit of data transfer every month, however, so the first option is the better one, if allowed.
Judging by the other answers, I may have missed something... but to restrict a search in Sphinx to a specific field, make sure you're using the extended (or extended2) match mode, and then use a query string like: @firstname Smith.
You could use substring to get the parts of the field that you want to search in, but that will slow down the process: the query cannot use any kind of index to do the comparison, so it has to touch each record in the table.
The best approach is not to store several values in the same field, but to put the name components in three separate fields. When you store more than one value in a field, there are almost always problems accessing the data. I see this over and over in different forums...
This is an intractable problem, because full names can contain prefixes, suffixes, middle names or no middle name, composite first and last names with and without hyphens, etc. There is no reasonable way to do this with 100% reliability.