I have a table in my database that contains quite a lot of data. I have to run a query that should simply count the number of rows where a certain keyword is present in the Message column.
Normally I would run this:
SELECT count(1) FROM applogs WHERE Message LIKE '%level_100%'
This works and returns a count of 2972, but it is obviously very slow for larger datasets.
So I've added a FULLTEXT index to the Message column so I can use MATCH ... AGAINST like so:
SELECT count(1) FROM applogs WHERE MATCH(Message) AGAINST('level_100');
Now this runs very fast, but it gives me a count of 5672 back.
Presumably that is because the level_100 term can occur more than once per row. But I'm not interested in how many occurrences there are; I just want to know how many rows match.
Is there a way to do this using MATCH AGAINST?
I am using a MySQL database with a table named license_csv that has 18 million records and uses the MyISAM engine. When I count records without a WHERE condition, the count takes less than 1 second, but when I apply a WHERE condition it takes 2 minutes. I have added indexes on the two columns lic_state and lic_city, but the count still takes 2 minutes. Can anyone please tell me what I need to do for this query? Here is my query:
SELECT COUNT(*) AS total
FROM license_csv
WHERE lic_state LIKE '%ca%' AND lic_city LIKE '%fresno%'
The problem with your query is that you are using the % wildcard on both sides of the string. This defeats the existing indexes: because an index is ordered sequentially, MySQL cannot use it for this type of condition, since there is nothing to tell it where to start searching.
Possible options:
if possible, switch to equality instead of LIKE; you could then take advantage of an index on (lic_state, lic_city)
else, removing the wildcard on the left side (lic_state LIKE 'ca%') could help
switch to FULLTEXT search
store the list of state codes and cities in another table (which should be much smaller than the licenses table), and have a column in the licenses table referencing its primary key; then JOIN that table in the search. With this option, the range of records to scan when applying the search condition will be much smaller
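As a sketch of the FULLTEXT option, assuming the table and column names from the question (MyISAM supports FULLTEXT indexes):

```sql
-- Hypothetical sketch: add a FULLTEXT index on the two columns,
-- then search with MATCH ... AGAINST instead of double-wildcard LIKE.
ALTER TABLE license_csv
    ADD FULLTEXT INDEX ft_state_city (lic_state, lic_city);

SELECT COUNT(*) AS total
FROM license_csv
WHERE MATCH(lic_state, lic_city) AGAINST('+ca +fresno' IN BOOLEAN MODE);
```

One caveat: fulltext search ignores words shorter than the minimum indexed word length (ft_min_word_len, 4 by default on MyISAM), so a two-letter token like 'ca' would need that setting lowered and the index rebuilt.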
I am trying to take my first steps with SQL. Currently I am analysing a database and ran into a problem which I can't explain. Perhaps someone could give me a hint.
I have a MySQL table ('cap851312') which has 330.178 rows. I already imported the table to Excel and verified this number!
Every single row includes a field (column 'ID_MES_ANO') for the entry's date. For the time being, all dates are uniformly set to "201312".
Running the following query, I would expect to see the given number of rows as a result; however, the number which appears is 476.598.
SELECT movedb.cap851312.ID_MES_ANO, count(*)
FROM movedb.cap851312;
I already imported the file to Excel, and verified the number of lines. Indeed, it is 330.178!
How could I find out, what exactly is going wrong?
Update:
I've tried:
SELECT count(*) FROM movedb.cap851312
This returns as well 476.598.
As I am using Workbench, I could easily confirm the number of 330.178 table rows.
Update 2:
The Workbench Table Inspector confirms: "Table rows: 330178"
Solved - however I'm unsure why:
I changed the statement to
SELECT count(ID_MES_ANO) FROM movedb.cap851312;
This time the result is 330.178!
COUNT(*) counts all rows.
COUNT(ID_MES_ANO) counts only rows for which ID_MES_ANO is not null.
So the difference between the two are the rows where ID_MES_ANO is null.
You can verify this with
SELECT count(*) FROM movedb.cap851312 WHERE ID_MES_ANO IS NULL;
By the way:
SELECT movedb.cap851312.ID_MES_ANO, count(*) FROM movedb.cap851312;
means: aggregate all rows into one single result row (by using the aggregate function COUNT without GROUP BY). This result row shows an ID_MES_ANO and the number of all records. Standard SQL does not allow this, because you don't tell the DBMS which ID_MES_ANO of those hundreds of thousands of rows to show. MySQL violates the standard here and simply picks one ID_MES_ANO arbitrarily from the rows.
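To get a well-defined result per date value instead, the standard-compliant form adds a GROUP BY, so each ID_MES_ANO is shown with its own row count (table name as in the question):

```sql
-- Counts rows per distinct ID_MES_ANO value; NULLs form their own group.
SELECT ID_MES_ANO, COUNT(*) AS row_count
FROM movedb.cap851312
GROUP BY ID_MES_ANO;
```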
I am testing SQL and I am stuck on one query. It is a useless query but I want to understand it.
select count(*), floor(rand()*2) as x from table_name group by x;
The result is either two rows, or the error: Duplicate entry '0/1' for key 'group_key'.
What happens that leads to this error?
rand() generates a random number for every row in your table. You are then grouping by the results of all of those random numbers, so you get one row for each distinct value. The error can occur because RAND() is not deterministic: it may be re-evaluated while the grouping is carried out, so the same row can yield one value when its group is looked up and a different value when it is inserted, producing a duplicate key in the internal temporary table.
The main point here is not to group by such strange synthetic data.
Better to group by some definite fields, because MySQL has some bugs in this area.
Like this
http://bugs.mysql.com/bug.php?id=58081
Or this
https://bugs.mysql.com/bug.php?id=60808
Apparently it is trying to create a unique index on a temporary table, and that somehow fails.
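One way to sidestep the problem, assuming the goal is just to group by a random 0/1 value, is to materialize the random number in a derived table first, so it is evaluated exactly once per row before the grouping happens:

```sql
-- The inner query fixes each row's random value before grouping,
-- so GROUP BY never re-evaluates RAND().
SELECT x, COUNT(*)
FROM (SELECT FLOOR(RAND() * 2) AS x FROM table_name) AS t
GROUP BY x;
```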
Question
I'm not a comp sci major so forgive me if I muddle the terminology. What is the computational complexity for calling
SELECT DISTINCT(column) FROM table
or
SELECT * FROM table GROUP BY column
on a column that IS indexed? Is it proportional to the number of rows or to the number of distinct values in the column? I believe that would be O(1)*NUM_DISTINCT_VALUES vs O(NUM_OF_ROWS).
Background
For example, if I have 10 million rows but only 10 distinct values/groups in that column, you could intuitively just jump to the last item in each group, so the time complexity would be tied to the number of distinct groups and not the number of rows. The calculation would then take the same amount of time for 1 million rows as it would for 100. I believe the complexity would be
O(1)*Number_Of_DISTINCT_ELEMENTS
But in the case of MySQL, if I have 10 distinct groups, will MySQL still seek through every row, essentially calculating a running sum for each group, or is it set up in such a way that a group of rows with the same value can be counted in O(1) time for each distinct column value? If not, then I believe it would mean the complexity is
O(NUM_ROWS)
Why Do I Care?
I have a page in my site that lists stats for categories of messages, such as total unread, total messages, etc. I could calculate this information using GROUP BY and SUM(), but I was under the impression that this will take longer as the number of messages grows, so instead I keep a table of stats for each category. When a new message is sent or created I increment the total_messages field. When I want to view the stats page I simply select a single row:
SELECT total_unread_messages FROM stats WHERE category_id = x
instead of calculating those stats live across all messages using GROUP BY and/or DISTINCT.
The performance hit either way is not large in my case and so this may seem like a case of "premature optimization", but it would be nice to know when I'm doing something that is or isn't scalable with regard to other options that don't take much time to construct.
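The counter-table approach described above can be sketched like this (the stats table and its columns are as used in the question; the exact update points are an assumption):

```sql
-- When a new message arrives, bump the pre-aggregated counters ...
UPDATE stats
SET total_messages        = total_messages + 1,
    total_unread_messages = total_unread_messages + 1
WHERE category_id = 42;

-- ... so the stats page stays a single-row lookup regardless of
-- how many messages exist.
SELECT total_unread_messages FROM stats WHERE category_id = 42;
```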
If you are doing:
select distinct column
from table
And there is an index on column, then MySQL can process this query using a "loose index scan" (described here).
This should allow the engine to read one key from the index and then "jump" to the next key without reading the intermediate keys (which are all identical). This suggests that the operation does not require reading the entire index, so it is, in general, less than O(n) (where n = number of rows in the table).
I doubt that finding the next value requires only one operation. I wouldn't be surprised if the overall complexity were something like O(m * log(n)), where m = number of distinct values.
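Whether MySQL actually chose a loose index scan for a given query can be checked with EXPLAIN; when it did, the Extra column reports "Using index for group-by" (table and column names here are placeholders):

```sql
EXPLAIN SELECT DISTINCT column_name FROM table_name;
-- Look for "Using index for group-by" in the Extra column of the output.
```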
Is there any standard way to find which clause limited the result to zero records?
For example, I have this query:
SELECT * FROM `tb` WHERE `room` > 2 AND `keywords` LIKE 'Apartment'
If this query does not return any records, how can I find which condition limited the result to zero records?
When you try to search for something and there is no result, some search engines show you a message like this:
Try to search without keywords
Or if you are using MATCH(city) AGAINST('tegas') it shows you:
Did you mean texas?
During query execution, all criteria are evaluated together. To determine whether one specific condition caused the query to return zero records, you must run a separate statement for each criteria scenario.
I would suggest starting with all possible criteria and then working backward based on the importance of the remaining items. This way you limit the processing in the most effective manner.
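One way to check all the conditions in a single pass, using the table from the question, is to count how many rows satisfy each condition on its own; MySQL evaluates boolean expressions as 0/1, so they can simply be summed:

```sql
-- Each SUM counts rows matching one condition independently;
-- a zero reveals which condition eliminates everything.
SELECT
    COUNT(*)                       AS total_rows,
    SUM(room > 2)                  AS rows_with_enough_rooms,
    SUM(keywords LIKE 'Apartment') AS rows_matching_keyword
FROM `tb`;
```

If rows_with_enough_rooms comes back as 0, for instance, the room clause alone is responsible for the empty result.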