MySQL/RDBMS: Is it okay to index long strings? Will it do the job?

Let's suppose I have a table of books:
+------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| title | tinytext | YES | | NULL | |
| synopsis | text | YES | | NULL | |
| year | int(4) | YES | | NULL | |
| ISBN | varchar(13) | YES | | NULL | |
| category | tinytext | YES | | NULL | |
| author | tinytext | YES | | NULL | |
| theme | tinytext | YES | | NULL | |
| edition | int(2) | YES | | NULL | |
| search | text | YES | | NULL | |
+------------+---------------------+------+-----+---------+----------------+
In this example, I'm using the search column as a summary of the row. So, a possible record would look like the following:
+------------+-------------------------------------------------------------+
| Field | Value |
+------------+-------------------------------------------------------------+
| id | 1 |
| title | Awesome Book |
| synopsis | This is a cool book with a cool history |
| year | 2013 |
| ISBN | 1234567890123 |
| category | Horror |
| author | John Doe |
| theme | Programmer goes insane |
| edition | 2nd |
| search | 2013 horror john doe awesome book this is a cool book (...) |
+------------+-------------------------------------------------------------+
The search column will be the one scanned when a search is made. Notice that it contains all the words from the other fields, in lower case, and possibly some extra words to help with a search.
I have two questions about it:
1) Knowing that this column is a text field and can get really big, is it okay to index it? Will it improve the performance as expected? Why?
2) Despite the index, is it a good idea to use this method to search or is it better to try every column on my query? How can I improve it?
Note: I don't really have this table; it's just for example purposes. Please ignore any errors in datatypes or syntax I may have made.

1) Knowing that this column is a text field and can get really big, is
it okay to index it? Will it improve the performance as expected? Why?
Yes, you can index it (MySQL requires a prefix length for an index on a TEXT column), but no, it won't improve performance for this kind of search. A regular B-tree index on a string column only helps when the query matches the start of the column value. So in your case, a search for '2013 horror john' could use the index, but a search for 'horror john 2013' could not.
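A rough way to see why only the start of the value helps: a sorted Python list can stand in for the ordered index (an assumption made purely for illustration; this is not how MySQL stores its B-tree). Binary search can jump straight to a prefix, while a match in the middle of the string has to scan every entry. The function names and sample rows are made up.

```python
import bisect

def prefix_matches(sorted_values, prefix):
    """Find values starting with `prefix` by binary search on a sorted list."""
    lo = bisect.bisect_left(sorted_values, prefix)
    # "\uffff" sorts after any character used in the sample data,
    # so this finds the end of the range of values sharing the prefix.
    hi = bisect.bisect_left(sorted_values, prefix + "\uffff")
    return sorted_values[lo:hi]

def substring_matches(values, needle):
    """A '%needle%' style search: ordering is no help, every value is checked."""
    return [value for value in values if needle in value]

# Hypothetical contents of the search column, kept in sorted order.
search_column = sorted([
    "2012 comedy jane roe funny book",
    "2013 horror john doe awesome book",
    "2013 thriller john doe another book",
])

print(prefix_matches(search_column, "2013 horror john"))  # found via binary search
print(prefix_matches(search_column, "horror john 2013"))  # nothing starts with this
print(substring_matches(search_column, "horror john"))    # found, but by a full scan
```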
2) Despite the index, is it a good idea to use this method to search
or is it better to try every column on my query? How can I improve it?
As Gordon Linoff writes, the best solution is probably full-text search: it is very fast for text searches, deals with "fuzzy" matching, and generally allows you to write a search function that works much like Google's.

Indexing the search column is not helpful.
What you may want is full-text search capabilities on the column, which are described in the MySQL documentation on full-text search.
Which you use for search depends on whether the searches will be using context. If someone searches for "Clinton", do you want them to restrict the search to authors named "Clinton" or to books about "Clinton"? If you don't care about the context, then full text on one field is quite reasonable.
I need to add: you don't need to put all the search terms in a separate field to use full text search. You can create a full text index on multiple columns. This gives you the flexibility of using full text searches with context (by looking only in specific columns) or without context (by looking in all of them). Your question was about the search column in particular, but that is not the best way to implement the functionality that you are looking for.
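The difference between searching with and without context can be sketched with a toy inverted index in Python. This is only an illustration of the idea of indexing words per column; it is not MySQL's FULLTEXT implementation, and the sample rows, build_index and search are invented for the example.

```python
import re
from collections import defaultdict

def build_index(rows, columns):
    """Map each lower-cased word to the set of (row id, column) pairs containing it."""
    index = defaultdict(set)
    for row in rows:
        for column in columns:
            for word in re.findall(r"\w+", (row.get(column) or "").lower()):
                index[word].add((row["id"], column))
    return index

def search(index, word, columns=None):
    """Return sorted row ids containing `word`, optionally only in `columns`."""
    hits = index.get(word.lower(), set())
    return sorted({row_id for row_id, column in hits
                   if columns is None or column in columns})

# Hypothetical rows; only the columns being indexed are shown.
books = [
    {"id": 1, "title": "Awesome Book", "author": "John Doe"},
    {"id": 2, "title": "Clinton: A Biography", "author": "Jane Roe"},
    {"id": 3, "title": "Cooking Basics", "author": "Sam Clinton"},
]
index = build_index(books, ["title", "author"])

print(search(index, "Clinton"))                      # without context: any column
print(search(index, "Clinton", columns={"author"}))  # with context: authors only
```

Indexing the real columns instead of a concatenated search column keeps both kinds of query available.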

Pandas to_sql discarding rows when appending to mysql table

I'm working with articles scraped from online newspapers, using a MySQL database and Python. I want to use the pandas to_sql method on a dataframe to append recently scraped articles to a MySQL table. It works pretty well, but I'm having some problems with the following:
Since the articles are automatically scraped from news sites, about 1% of them have issues (encoding, texts that are too long, and so on) and don't fit in the MySQL table fields. For some reason, the pandas to_sql method IGNORES these errors and discards the rows that do not fit. For example, I have the following MySQL table:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| title | varchar(255) | YES | | NULL | |
| description | text | YES | | NULL | |
| content | text | YES | | NULL | |
| link | varchar(300) | YES | | NULL | |
+--------------+--------------+------+-----+---------+----------------+
And I also have a Dataframe that contains 15 rows and 4 columns (title, description, content, link).
If one of those rows has a title longer than 255 characters, it won't fit in the MySQL table. I expected an error when doing df.to_sql('press', con=con, index=False, if_exists='append'), so that I would know I have a problem to fix; but the actual result was that 14 ROWS were appended instead of 15.
This could work for me, but I need to know which row was discarded so I can flag it for later revision. Is it possible to tell pandas to let me know which indexes are ignored?
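One possible approach, sketched here under the assumption that over-long strings are what causes rows to be dropped, is to check the values against the column limits before calling to_sql. The helper below is hypothetical and works on plain dictionaries, such as those returned by df.to_dict("records"); the limits are copied from the varchar columns of the table above.

```python
def find_oversized_rows(records, max_lengths):
    """Return (row position, column, length) for every string longer than its limit."""
    problems = []
    for position, record in enumerate(records):
        for column, limit in max_lengths.items():
            value = record.get(column)
            if isinstance(value, str) and len(value) > limit:
                problems.append((position, column, len(value)))
    return problems

# Character limits taken from the table definition (varchar columns only).
LIMITS = {"title": 255, "link": 300}

# Hypothetical sample data standing in for the scraped articles.
records = [
    {"title": "Short title", "description": "...", "content": "...",
     "link": "http://example.com/a"},
    {"title": "x" * 300, "description": "...", "content": "...",
     "link": "http://example.com/b"},
]

for position, column, length in find_oversized_rows(records, LIMITS):
    print(f"row {position}: {column} has {length} characters")
```

Rows flagged this way can be set aside for review before the remaining rows are appended.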
Thanks!

Exclude rows in mysql which contain a year without the NOT LIKE operator

I have a table which contains tags. Almost all tags are genres (such as action and comedy). However there are also tags such as Winter 2014 and Summer 2012. These are seasonal tags.
I want to exclude those tags from a genre listing. So how do I exclude those seasonal tags in the query?
The reason I don't want to use the NOT LIKE operator is to prevent full table scans.
This is what I currently have (in Eloquent):
$genres = Tag::where('slug', 'not like', '%20%')->get()->lists('name');
Sidenote: a Laravel 4 (Eloquent) approach would be appreciated but is not necessary.
This is my table
+---------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| slug | varchar(255) | NO | MUL | NULL | |
| name | varchar(255) | NO | | NULL | |
| suggest | tinyint(1) | NO | | 0 | |
| count | int(10) unsigned | NO | | 0 | |
+---------+------------------+------+-----+---------+----------------+
If I were you, I would add a seasonal tinyint(1) column to this table; then you could simply run:
$genres = Tag::whereSeasonal(0)->get()->lists('name');
to get tags that are not seasonal.
If you cannot do that, you could store the ids of the seasonal tags in a PHP array (or in one more table). Whether that is practical depends on how many tags you have and how often you add seasonal tags. You could then get the non-seasonal tags with:
$genres = Tag::whereNotIn('id', $arrayOfSeasonalIds)->get()->lists('name');
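Either variant needs a one-off decision about which existing tags count as seasonal. A minimal Python sketch of that classification step follows; it assumes slugs look like a season name followed by a four-digit year (for example winter-2014), which the question does not confirm, and the function names are made up.

```python
import re

# Assumed slug format: season name, a separator, then a four-digit year.
SEASONAL_SLUG = re.compile(r"^(winter|spring|summer|fall|autumn)[-_ ](19|20)\d{2}$")

def is_seasonal(slug):
    """True if the slug looks like a seasonal tag such as 'winter-2014'."""
    return SEASONAL_SLUG.match(slug.lower()) is not None

def seasonal_ids(tags):
    """Return the ids of all (id, slug) pairs that look seasonal."""
    return [tag_id for tag_id, slug in tags if is_seasonal(slug)]

# Hypothetical (id, slug) rows from the tags table.
tags = [(1, "action"), (2, "winter-2014"), (3, "comedy"), (4, "summer-2012")]

print(seasonal_ids(tags))
```

The resulting ids could be used to set the seasonal flag once, so that later queries filter on the flag rather than on a pattern.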
If you need to exclude years from values stored as strings (CHAR/VARCHAR), NOT LIKE is probably the best approach, going by your description of the problem. If the values you're checking are DATE/DATETIME/TIMESTAMP, you can use MySQL's YEAR() function to extract the year for the comparison.
If not, could you provide the output of DESCRIBE tablename for the table on which you wish to perform this action?

How to improve my natural language search query

I have a query I am trying to build which I want to do some natural language searching. I am unsure of the best way to do this in MySQL. I believe MySQL has some natural language features that I can use.
I have two tables which I have shown below.
1. transaction_category...
+--------------------+--------------------+-------------------+----------+
| tran_category_code | tran_category_desc | tran_category_seq | btn_type |
+--------------------+--------------------+-------------------+----------+
| CarParking | Car Parking | 2 | default |
| Electricity | Electricity | 1 | default |
| Groceries | Groceries | 4 | default |
| HealthInsurance | Health Insurance | 5 | default |
| Other | Other | 7 | default |
| Petrol | Petrol | 3 | default |
| Phone | Phone | 6 | default |
+--------------------+--------------------+-------------------+----------+
2. transaction_category_keyword...
+---------------------------------+------------------------------+--------------------+
| transaction_category_keyword_id | transaction_category_keyword | tran_category_code |
+---------------------------------+------------------------------+--------------------+
| 6 | Telstra | Phone |
| 7 | Park | CarParking |
| 8 | Coles | Groceries |
| 9 | Bp Connect | Petrol |
| 10 | Bupa | HealthInsurance |
+---------------------------------+------------------------------+--------------------+
My query is below, and it returns the results I want, but I was wondering if anyone could advise me on whether it could be improved using MySQL's natural language functions. This would help me because the search is very simple now, but I will be building on it a lot soon.
SELECT
tck.transaction_category_keyword_id,
tck.transaction_category_keyword,
tck.tran_category_code
FROM transaction_category tc, transaction_category_keyword tck
WHERE tc.tran_category_code = tck.tran_category_code
AND 'Coles Menai Syd Au' like '%' ||UPPER(tck.transaction_category_keyword) || '%'
+---------------------------------+------------------------------+--------------------+
| transaction_category_keyword_id | transaction_category_keyword | tran_category_code |
+---------------------------------+------------------------------+--------------------+
| 7 | Park | CarParking |
| 8 | Coles | Groceries |
| 10 | Bupa | HealthInsurance |
| 9 | Bp Connect | Petrol |
| 6 | Telstra | Phone |
+---------------------------------+------------------------------+--------------------+
thanks
In general, if you have a wildcard at both the beginning and the end of your search pattern, an ordinary index cannot be used, so searches will be fairly slow on any non-trivial table: every row has to be read and the pattern tried at every character position in the field.
You would definitely benefit from full text search and match as you are searching for bags of words (and their relative frequencies in the index), rather than a specific string within some other field. I assume you have read the docs at http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html. There are a number of subtleties you need to understand such as stop words, boolean search, query expansion, etc. The comments on these pages are very good as they have the accumulated knowledge of people who have been there before and experimented.
It is also worth reading about tf-idf, which is how MySQL (and many other full-text search engines) rank results internally (see the docs): a match is scored by a combination of how rare a word is across all documents and how many times it occurs in a particular document.
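As a small worked example of that ranking idea, here is the textbook tf-idf formula in Python. It is a sketch only: MySQL's actual weighting differs in its details, and the sample documents are invented.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook tf-idf: term frequency in the document times log inverse document frequency."""
    words = document.lower().split()
    if not words:
        return 0.0
    tf = words.count(term) / len(words)
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    if containing == 0:
        return 0.0
    idf = math.log(len(corpus) / containing)
    return tf * idf

# Hypothetical documents, e.g. transaction descriptions.
corpus = [
    "coles menai syd au",
    "coles express petrol",
    "telstra phone bill",
]

# "telstra" occurs in one document, "coles" in two, so "telstra" scores higher.
print(tf_idf("telstra", corpus[2], corpus))
print(tf_idf("coles", corpus[0], corpus))
```

A word that appears in every document gets an idf of log(1) = 0, which is why very common words contribute nothing to the ranking.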
I can't give you more focused examples or performance metrics, but your question is essentially whether full-text search will outperform a double-wildcard LIKE search, and the answer is a pretty much unqualified yes.
CAVEAT: always worth mentioning, given the differences between engines: before MySQL 5.6, full-text search only works with MyISAM tables; from 5.6 onward it works with InnoDB too.
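For comparison, the intent of the query in the question (find the keywords that occur in a transaction description) can be sketched in a few lines of Python. One thing to be aware of: in MySQL's default SQL mode, || is a logical OR rather than string concatenation, which would explain why the query as written returns every keyword row; CONCAT() is the usual way to build the pattern in MySQL. The function name below is made up, and the keyword list is copied from the question.

```python
def matching_categories(description, keywords):
    """Return (keyword, category) pairs whose keyword occurs in the description."""
    text = description.lower()
    return [(keyword, category) for keyword, category in keywords
            if keyword.lower() in text]

# Rows of transaction_category_keyword as (keyword, tran_category_code).
KEYWORDS = [
    ("Telstra", "Phone"),
    ("Park", "CarParking"),
    ("Coles", "Groceries"),
    ("Bp Connect", "Petrol"),
    ("Bupa", "HealthInsurance"),
]

print(matching_categories("Coles Menai Syd Au", KEYWORDS))
```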

mysql - How to filter results without specifying the columns

It might sound silly, but I'm just curious.
I have a table named posts:
+----------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| title | varchar(50) | YES | | NULL | |
| body | text | YES | | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
+----------+------------------+------+-----+---------+----------------+
The values:
+----+-----------------------+----------------------------------------+---------------------+---------------------+
| id | title | body | created | modified |
+----+-----------------------+----------------------------------------+---------------------+---------------------+
| 2 | A title once again!!! | And the post body follows. Tralalalala | 2013-06-03 13:13:44 | 2013-06-05 09:36:51 |
| 3 | Title strikes back | This is really exciting! Not. | 2013-06-03 13:13:46 | NULL |
| 11 | Tomcat | Tommy boy!!! FFF | 2013-06-04 16:33:22 | 2013-06-04 16:48:40 |
| 12 | FFD | dsfdsf | 2013-06-04 16:48:56 | 2013-06-04 16:55:50 |
| 13 | fdf | dfdsf | 2013-06-04 16:57:47 | 2013-06-05 09:36:54 |
| 14 | GGD | dsfdsf | 2013-06-04 17:02:33 | 2013-06-04 17:02:33 |
| 15 | GG# | dsfdsfff322 | 2013-06-05 09:36:20 | 2013-06-05 09:36:28 |
+----+-----------------------+----------------------------------------+---------------------+---------------------+
Let's say I want to search for rows that contain the value Th (not case sensitive) regardless of the FIELD. This is like making a quick search function.
Normally I would do something like : SELECT * FROM posts WHERE title LIKE '%Th%' OR body LIKE '%Th%'
I did not include the other fields because obviously they are not gonna accept those values.
I wanna know if there's a shortcut to this? Like SELECT * FROM posts LIKE '%Th%'.
Please advise. Thanks.
Using plain old SQL you need to specify all the column names you wish to include.
If you want more search-box-like behavior, I'd suggest looking at MySQL's fulltext functions; see:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
The SQL language is based on the presumption of the schema being known. Thus, there is no "search any column" type of functionality. How would it work against non-text columns? What about columns of different collations? Aside from the language not having a feature, specifying the columns expresses your intent to the next developer and that as much as anything should be an overriding consideration.
Other answers have covered that you need to specify all the columns. Here is an alternative formulation that is a bit shorter (it uses CONCAT_WS rather than CONCAT because CONCAT returns NULL if either column is NULL):
SELECT *
FROM posts
WHERE CONCAT_WS(' ', title, body) LIKE '%Th%'
If you are looking for an exact match, then you can do:
select *
from posts
where 'Th' in (title, body)
No, there is no shortcut that avoids a WHERE clause that specifies the columns. The query engine cannot know what to filter, or on which columns, unless you name them in the WHERE clause.
If you want a custom shortcut - you can write a function which takes a single parameter (the search string) and returns the required fields.
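Such a helper could also live on the application side. The following Python sketch builds the WHERE clause from a list of searchable columns and returns the SQL together with its parameters. The allow-list and function name are hypothetical; the %s placeholders follow the parameter style used by common MySQL drivers for Python.

```python
# Hypothetical allow-list: identifiers cannot be bound as parameters,
# so only known table and column names are accepted.
SEARCHABLE = {"posts": ("title", "body")}

def build_search_query(table, term):
    """Return (sql, params) matching `term` anywhere in the table's searchable columns."""
    columns = SEARCHABLE[table]  # raises KeyError for unknown tables
    where = " OR ".join(f"{column} LIKE %s" for column in columns)
    sql = f"SELECT * FROM {table} WHERE {where}"
    params = [f"%{term}%"] * len(columns)
    return sql, params

sql, params = build_search_query("posts", "Th")
print(sql)
print(params)
```

The pair would then be passed to the driver's cursor.execute(sql, params), so the search term itself is never spliced into the SQL string.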
I'm afraid there isn't.
Not sure what your use case is... does this alternative approach work for your use case?
mysql -u{user} -p{password} -h{hostname} {database_name} -B -e "{query}" | grep "{search_string}"
It connects to the database and runs the specified query, returns query results in new lines, fields separated by tab stop. Then use Unix utility grep to filter returned rows.

Defining a webservice for usage analytics (desktop application)

Current situation
I have a desktop application (C++ Win32), and I wish to track users' usage analytics anonymously (actions, clicks, usage time, etc.)
The tracking is done via designated web services for specific actions (install, uninstall, click) and everything is written by my team and stored on our DB.
The need
Now we're adding more usage types and events with a variety of data, so we need to define the services.
Instead of having tons of different web services for each action, I want to have a single generic service for all usage types, that is capable of receiving different data types.
For example:
"button_A_click" event, has data with 1 field: {window_name (string)}
"show_notification" event, has data with 3 fields: {source_id (int), user_action (int), index (int)}
Question
I'm looking for an elegant & convenient way to store this sort of diverse data, so later I could query it easily.
The alternatives I can think of:
Storing the different data for each usage type as one field of JSON/XML object, but it would be extremely hard to pull data and write queries for those fields
Having extra N data fields for each record, but it seems very wasteful.
Any ideas for this sort of model? Maybe something like Google Analytics? Please advise...
Technical: The DB is MySQL running under phpMyAdmin.
Disclaimer:
There is a similar post, which brought to my attention services like DeskMetrics and Tracker bird, and the option of embedding Google Analytics in a native C++ application, but I'd rather the service be my own, and I'd like to better understand how to design this sort of model.
Thanks!
This seems like a database normalization problem.
I am going to assume that you also have a table named events where all events will be stored.
Additionally, I am going to assume you have the following data attributes (for simplicity's sake): window_name, source_id, user_action, index
To achieve normalization, we will need the following tables:
events
data_attributes
attribute_types
This is how each of the tables should be structured:
mysql> describe events;
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| event_type | varchar(255) | YES | | NULL | |
+------------+------------------+------+-----+---------+----------------+
mysql> describe data_attributes;
+-----------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| event_id | int(11) | YES | | NULL | |
| attribute_type | int(11) | YES | | NULL | |
| attribute_name | varchar(255) | YES | | NULL | |
| attribute_value | int(11) | YES | | NULL | |
+-----------------+------------------+------+-----+---------+----------------+
mysql> describe attribute_types;
+-------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| type | varchar(255) | YES | | NULL | |
+-------+------------------+------+-----+---------+----------------+
The idea is that you will have to populate attribute_types with all possible types you can have. Then, for each new event, you will add an entry in the events table and corresponding entries in the data_attributes table to map that event to one or more attribute types with the appropriate values.
Example:
"button_A_click" event, has data with 1 field: {window_name "Dummy Window Name"}
"show_notification" event, has data with 3 fields: {source_id: 99, user_action: 44, index: 78}
would be represented as:
mysql> select * from attribute_types;
+----+-------------+
| id | type |
+----+-------------+
| 1 | window_name |
| 2 | source_id |
| 3 | user_action |
| 4 | index |
+----+-------------+
mysql> select * from events;
+----+-------------------+
| id | event_type |
+----+-------------------+
| 1 | button_A_click |
| 2 | show_notification |
+----+-------------------+
mysql> select * from data_attributes;
+----+----------+----------------+-------------------+-----------------+
| id | event_id | attribute_type | attribute_name | attribute_value |
+----+----------+----------------+-------------------+-----------------+
| 1 | 1 | 1 | Dummy Window Name | NULL |
| 2 | 2 | 2 | NULL | 99 |
| 3 | 2 | 3 | NULL | 44 |
| 4 | 2 | 4 | NULL | 78 |
+----+----------+----------------+-------------------+-----------------+
To write a query for this data, you can use the COALESCE function in MySQL to get the value for you without having to check which of the columns is NULL.
Here's a quick example I hacked up:
SELECT events.event_type as `event_type`,
attribute_types.type as `attribute_type`,
COALESCE(data_attributes.attribute_name, data_attributes.attribute_value) as `value`
FROM data_attributes,
events,
attribute_types
WHERE data_attributes.event_id = events.id
AND data_attributes.attribute_type = attribute_types.id
Which yields the following output:
+-------------------+----------------+-------------------+
| event_type | attribute_type | value |
+-------------------+----------------+-------------------+
| button_A_click | window_name | Dummy Window Name |
| show_notification | source_id | 99 |
| show_notification | user_action | 44 |
| show_notification | index | 78 |
+-------------------+----------------+-------------------+
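The same three tables and query can be tried out end to end with Python's built-in sqlite3 module. SQLite is used here only as a convenient stand-in for MySQL (an assumption of this sketch; column types are simplified), and the helper attributes_for is made up. The query uses explicit JOINs, which are equivalent to the comma-separated form above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, event_type TEXT);
    CREATE TABLE attribute_types (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE data_attributes (
        id INTEGER PRIMARY KEY,
        event_id INTEGER,
        attribute_type INTEGER,
        attribute_name TEXT,
        attribute_value INTEGER
    );
    INSERT INTO events VALUES (1, 'button_A_click'), (2, 'show_notification');
    INSERT INTO attribute_types VALUES
        (1, 'window_name'), (2, 'source_id'), (3, 'user_action'), (4, 'index');
    INSERT INTO data_attributes VALUES
        (1, 1, 1, 'Dummy Window Name', NULL),
        (2, 2, 2, NULL, 99),
        (3, 2, 3, NULL, 44),
        (4, 2, 4, NULL, 78);
""")

def attributes_for(event_type):
    """Return {attribute type: value} for all rows of the given event type."""
    rows = conn.execute(
        """
        SELECT t.type, COALESCE(d.attribute_name, d.attribute_value)
        FROM data_attributes AS d
        JOIN events AS e ON d.event_id = e.id
        JOIN attribute_types AS t ON d.attribute_type = t.id
        WHERE e.event_type = ?
        ORDER BY d.id
        """,
        (event_type,),
    ).fetchall()
    return dict(rows)

print(attributes_for("button_A_click"))
print(attributes_for("show_notification"))
```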
EDIT: Bugger! I read C#, but I see you are using C++. Sorry about that. I leave the answer as-is as its principle could still be useful. Please regard the examples as pseudo-code.
You can define a custom class/structure that you use with an array. Then serialize this data and send to the WebService. For example:
[Serializable]
public class ActionDefinition {
    public string ID;
    public ActionType Action; // define an enum with the possible actions
    public List<string> Fields; // element type assumed; use a list of some class if you need more complex fields
}
List<ActionDefinition> AnalyticsCollection = new List<ActionDefinition>();
// ...
SendToWS(Serialize(AnalyticsCollection));
Now you can dynamically add as many events as you want with the needed flexibility.
On the server side you can simply parse the data:
List<ActionDefinition> AnalyticsCollection = Deserialize(GetWS());
foreach (ActionDefinition ad in AnalyticsCollection) {
    switch (ad.Action) {
        // ... handle each action type
    }
}
I would suggest adding a security mechanism such as a checksum. I imagine the de/serializer would be pretty custom in C++, so perhaps a simple Base64 encoding can do the trick, and the data can be transported as ASCII text.
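The serialize, checksum and Base64 steps can be sketched in Python as follows. The envelope layout (a JSON object with checksum and payload keys) and the function names are assumptions made for this example, not part of the answer above. Note that a plain checksum only detects accidental corruption; it does not protect against deliberate tampering.

```python
import base64
import hashlib
import json

def encode_events(events):
    """Serialize events to JSON, add a SHA-256 checksum, and Base64-encode the payload."""
    payload = json.dumps(events, sort_keys=True).encode("utf-8")
    envelope = {
        "checksum": hashlib.sha256(payload).hexdigest(),
        "payload": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(envelope)

def decode_events(message):
    """Reverse encode_events, raising ValueError if the checksum does not match."""
    envelope = json.loads(message)
    payload = base64.b64decode(envelope["payload"])
    if hashlib.sha256(payload).hexdigest() != envelope["checksum"]:
        raise ValueError("checksum mismatch")
    return json.loads(payload.decode("utf-8"))

# The two example events from the question, as generic name/data records.
events = [
    {"event": "button_A_click", "data": {"window_name": "Dummy Window Name"}},
    {"event": "show_notification",
     "data": {"source_id": 99, "user_action": 44, "index": 78}},
]

message = encode_events(events)  # ASCII text, safe to send to a web service
print(decode_events(message) == events)
```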
You could make a description table in which you declare, for each event, what each parameter means. Then you have a main table in which you only store the event's name and param1, param2, etc. An admin tool would be very easy: you go through all events and describe them using the table where each event is declared. E.g. for your event button_A_click you insert into the description table:
Name            Param1
button_A_click  WindowTitle
So you can group your events or select only one event.
This is how I would solve it.
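A minimal Python sketch of that lookup: a description table gives each positional parameter its meaning, and a helper pairs the raw values from the main table with those names. The dictionary contents and the describe_event name are illustrative assumptions.

```python
# Hypothetical description table: event name -> meaning of each positional parameter.
EVENT_PARAMS = {
    "button_A_click": ["window_name"],
    "show_notification": ["source_id", "user_action", "index"],
}

def describe_event(name, *params):
    """Pair raw positional parameters from the main table with their declared names."""
    names = EVENT_PARAMS[name]
    if len(params) != len(names):
        raise ValueError(f"{name} expects {len(names)} parameters, got {len(params)}")
    return dict(zip(names, params))

print(describe_event("button_A_click", "Dummy Window Name"))
print(describe_event("show_notification", 99, 44, 78))
```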