Search table with 30 million records - mysql

I have a sample MySQL database with only one table (InnoDB) that has the following columns:
Id int PK
Description TEXT
The table has more than 30 million records, and the Description field holds up to 1000 characters.
What is the most efficient way to search for records in this table?
I need, for example, descriptions that start with / end with / contain a given string.
When I run a query like
SELECT * FROM tbl WHERE Description LIKE '%abc'
it takes a long time, because a leading-wildcard LIKE scans all table records.
I googled and found that there is something called a FULLTEXT index.
I added the index using the following command:
ALTER TABLE tbl ADD FULLTEXT INDEX `DescriptionIndex` (`Description` ASC)
Then when I try to execute a query like this:
SELECT * FROM tbl WHERE MATCH (`Description`) AGAINST ('"The sea is awesome"')
sometimes it takes a long time, and other times it runs fine, depending on the value in the AGAINST parameter; I could not identify the problem.
I need to know if I am missing something, or whether there is a better way to implement search.
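For the two LIKE cases, ordinary B-tree indexes can help; a minimal sketch (the added column and index names are illustrative, and the 191-character prefix assumes utf8mb4):
-- "Starts with": a prefix index lets LIKE 'abc%' do an index range scan.
ALTER TABLE tbl ADD INDEX DescriptionPrefix (Description(191));
SELECT * FROM tbl WHERE Description LIKE 'abc%';
-- "Ends with": index a reversed copy so the wildcard moves to the end.
ALTER TABLE tbl ADD COLUMN DescriptionRev VARCHAR(1000);
UPDATE tbl SET DescriptionRev = REVERSE(Description);
ALTER TABLE tbl ADD INDEX DescriptionRevPrefix (DescriptionRev(191));
SELECT * FROM tbl WHERE DescriptionRev LIKE CONCAT(REVERSE('abc'), '%');
The "contains" case ('%abc%') is the one that genuinely needs the FULLTEXT index (or an external search engine), since a wildcard on both ends defeats both indexes above.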

How to optimise mysql query as Full ProcessList is showing Sending Data for over 24 hours

I have the following query that runs forever, and I am looking to see if there is any way I can optimise it. It runs on a table with 1,406,480 rows of data in total; apart from the Filename and Ref_No columns, the ID and End_Date have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
    SELECT T1.ID
    FROM master_table T1
    LEFT JOIN master_table T2
        ON  T1.Ref_No   = T2.Ref_No
        AND T1.End_Date = T2.End_Date
        AND T1.Filename = T2.Filename
        AND T1.ID       > T2.ID
    WHERE T2.ID IS NULL
      AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
Explain Results:
(The EXPLAIN output was posted as a screenshot.)
The reason for not indexing Ref_No is that it is a TEXT column, so I get a BLOB/TEXT error when I try to index it.
I would really appreciate it if somebody could advise on how I can speed up this query.
Thanks
Thanks to Bill's pointer about multi-column indexes, I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to hold the length of Ref_No, but had to change the query Bill mentioned because my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
The last step was to change the WHERE clause of my insert query to use the new length column:
AND t1.Ref_No_length between 5 and 10;
I then ran this query, and within 15 minutes I had 280k IDs inserted into my UniqueIDs table. I then changed my insert script to see if I could cover more lengths, as follows:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to also bring in the values where the length equals 13. This query took a lot longer, 2 hours 50 minutes to be precise, but the additional ask of looking for all rows with a length of 13 gave me an extra 700k unique IDs.
I am still looking at ways to optimise the query with the IN clause, but this is a big improvement over the query that kept running for over 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on LENGTH(). An index only helps you search by the indexed value, not by functions of the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
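One way to check whether the optimizer picked the virtual column up is to EXPLAIN the original predicate (a sketch; the outcome depends on your data distribution):
EXPLAIN SELECT ID FROM master_table WHERE LENGTH(Ref_No) BETWEEN 5 AND 10;
-- If the index is chosen, the key column of the output shows Ref_No_length.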
the ID and End_Date have both been indexed.
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
INDEX(Filename, End_Date, Ref_No,  -- first, in any order
      ID)                          -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)
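As a footnote on the group-wise angle: the LEFT JOIN in the question keeps exactly the rows whose ID is the smallest in their (Ref_No, End_Date, Filename) group. Assuming that reading is correct, a GROUP BY rewrite avoids the self-join entirely (a sketch, reusing the Ref_No_length column described above):
INSERT INTO UniqueIDs
SELECT MIN(ID)
FROM master_table
WHERE Ref_No_length BETWEEN 5 AND 10
GROUP BY Ref_No, End_Date, Filename;
One caveat: grouping on a TEXT column compares only the first max_sort_length bytes, so converting Ref_No to VARCHAR, as suggested above, helps here too.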

query on large size table takes up a lot of time

I have a table ArchiveArseh, 15 GB in size with 198,997 rows, using the InnoDB engine (in the future it will be 400 GB and 1,000,000 rows).
This table contains images (fields document, thumbDocument).
The field Id is the primary key, and 4 fields are indexed.
(The EXPLAIN output was posted as a screenshot.)
When I run a simple SELECT query like
SELECT *
FROM archivearseh
WHERE CONCAT(BlockCode,ArsehRow)='01011000106001'
or
SELECT *
FROM archivearseh
WHERE BlockCode='106001' and ArsehRow='01011000'
it takes 2 minutes to return the result?!
How can I decrease the query's run time?
The problem is that your WHERE isn't sargable:
WHERE CONCAT(BlockCode,ArsehRow)='01011000106001'
so it CAN'T use an index, and has to calculate the CONCAT for every row.
So either you create a trigger on insert/update to maintain an extra column, and index that column so searches are faster (but inserts/updates get slower; a trigger sketch follows these options):
ALTER TABLE archivearseh ADD COLUMN NewColumn VARCHAR(32);  -- length is a guess
UPDATE archivearseh SET NewColumn = CONCAT(BlockCode, ArsehRow);
CREATE INDEX idx_newcolumn ON archivearseh (NewColumn);
and then search with
WHERE NewColumn = '01011000106001'
or do something to reduce the search domain, like:
WHERE BlockCode LIKE '010110%'
AND CONCAT(BlockCode,ArsehRow)='01011000106001'
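For completeness, the insert-side trigger mentioned in the first option might look like this (a sketch; an equivalent BEFORE UPDATE trigger would also be needed):
CREATE TRIGGER archivearseh_bi BEFORE INSERT ON archivearseh
FOR EACH ROW SET NEW.NewColumn = CONCAT(NEW.BlockCode, NEW.ArsehRow);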
Create an index on the table with columns BlockCode and ArsehRow.
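A minimal sketch of that index (the name is illustrative); with it in place, the second form of the query in the question is sargable and can seek directly:
CREATE INDEX idx_block_arseh ON archivearseh (BlockCode, ArsehRow);
SELECT * FROM archivearseh WHERE BlockCode = '106001' AND ArsehRow = '01011000';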

mysql select order by primary key. Performance

I have a table 'tbl' something like this:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * FROM tbl ORDER BY ID LIMIT 600000, 1 takes 1.68 seconds
Query:
SELECT ID, field1 FROM tbl ORDER BY ID LIMIT 600000, 1 takes 1.69 seconds
Query:
SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1 takes 0.16 seconds
Query:
SELECT * FROM tbl WHERE ID = xxx takes 0.005 seconds
These queries were tested in phpMyAdmin.
And the result is that query 3 and query 4 together return the necessary data.
Query 1 does the same job, but much slower...
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for formatting.. I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 with all fields specified. As the time is essentially the same as Q2's, we can see that it doesn't really take measurably more time to pull more fields out of the database; any extra time is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
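A standard way to exploit the fast index-only scan from Q3 is a "deferred join": find the target ID using only the index, then fetch the full row by primary key, which is essentially what the Q5 temporary-table experiment above approximates. A sketch:
SELECT t.*
FROM tbl AS t
JOIN (SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1) AS late USING (ID);
-- The derived table walks only the PK index (as in Q3); the outer join then
-- does a single point lookup (as in Q4).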

is my large mysql table destined for failure?

I have built a MySQL table on my local computer to store stock market data. The table name is minute_data, and the structure is simple enough (the definition was posted as a screenshot).
You can see that I made the key column a combination of date and symbol -> concat(date,symbol). This way I can do an insert ignore ... query to add data to the table without duplicating a date/symbol combination.
With this table, data retrieval is very simple. Say I wanted to get all the data for the symbol CSCO, then I could simply do this query:
select * from minute_data where symbol = "CSCO" order by date;
Everything has been "working". The table now has data for over 1000 symbols, with over 22 million rows already. I am thinking that it is not even half full for all 1000 symbols yet, so I am expecting the table to keep growing.
I am starting to see serious performance problems when querying this table. For example the following query (which I often want to do, to see the latest date for a particular symbol) takes well over 1 minute to complete, and only returns 1 row!
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query (which is also very important) is also taking over 1 minute on average:
select count(*), symbol from minute_data group by symbol;
The performance problems are making it unrealistic to keep working with the data in this way. These are the questions that I would like to ask the community:
Is it futile to continue building my data set into this table?
Is MySQL a bad choice altogether for a data set like this?
What can I do to this table to improve performance?
What kind of data structure should I use for this purpose (instead of a MySQL table)?
Thank You!
UPDATE
I am providing the output from EXPLAIN, which is the same for the following 2 queries:
explain select count(*), symbol from minute_data group by symbol;
explain select * from minute_data where symbol = "CSCO" order by date desc limit 1;
UPDATE 2
Pretty simple fix. I performed this query to remove the useless key_col that I had defined above, and made a primary key on the 2 columns date and symbol:
alter table minute_data drop primary key, add primary key (date,symbol);
Now I tried the following query, and it finished in less than 1 second:
select * from minute_data where symbol = "CSCO" order by date desc limit 1;
This query still takes a long time to complete (72 seconds). I guess that's still because the query has to tabulate all 22 million rows in one pass?
select count(*), symbol from minute_data group by symbol;
Your key_col is completely useless. Did you know that you can have a primary key over multiple columns? I'd recommend that you drop that column and create a new primary key on (date, symbol), in this order, since your date column has the higher cardinality. Additionally, you can then (if there's a need for it) create another unique index on (symbol, date). Post the EXPLAINs of your most important queries. And what's the cardinality of symbol?
UPDATE:
What you can see in the EXPLAIN is that there's no index which can be used, so it scans the whole 22.5 million rows. Please have a try with the changes mentioned above. If you don't want to drop the key_col right now, you should at least add an index on the symbol column.
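Spelled out, the suggested changes might look like this (a sketch; it assumes key_col is not referenced anywhere else):
ALTER TABLE minute_data DROP PRIMARY KEY, ADD PRIMARY KEY (date, symbol);
ALTER TABLE minute_data DROP COLUMN key_col;
-- Optional, for symbol-first lookups such as the ORDER BY date DESC LIMIT 1 query:
ALTER TABLE minute_data ADD UNIQUE INDEX (symbol, date);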

telling if key exists in mysql table is taking too long

I started with this question: is my large mysql table destined for failure?
The answer I found there was satisfactory. I have a table with 22 million rows that I would like to grow to about 100 million. At this time, the structure of the minute_data table is like this (the definition was posted as a screenshot):
A problem that I am having is as follows. I need to execute this query:
select datediff(date,now()) from minute_data where symbol = "CSCO" order by date desc limit 1;
Which is very fast ( < 1 sec ) when the table contains the value "CSCO". The problem is, sometimes I will query for a symbol that is not in the table already. When I execute a query like this for, say, symbol = "ABCD":
select datediff(date,now()) from minute_data where symbol = "ABCD" order by date desc limit 1;
Then the query takes a LONG TIME... like forever (180 seconds).
A way I can get around this is by making sure that the table contains the symbol I am looking for before I execute the query. The fastest way I found to do this is with the following query, which I just use to check whether the table minute_data contains the symbol or not. Basically, I just need it to return a boolean value so I know if the symbol is in the table:
select count(1) from minute_data where symbol = "CSCO";
This query takes over 30 seconds to return 1 value, way too long for my liking, since the query above, which actually returns a DATEDIFF calculation, takes less than 1 second.
Since the symbol column is part of the primary key, I thought MySQL should be able to figure out very quickly whether a value exists there.
What am I doing wrong? Is there a fast way to do what I want to do? Should I change the structure of the data to optimize performance?
Thank You!
UPDATE
I think I found a good solution to this problem. Based on the answer below by LastCoder, I did the following:
1) Created a new table called minute_data_2 with the exact same definition as minute_data.
2)
ALTER TABLE minute_data_2 ADD PRIMARY KEY (symbol, date);
3)
INSERT IGNORE INTO minute_data_2 SELECT * FROM minute_data;
4)
DROP TABLE minute_data;
5)
RENAME TABLE minute_data_2 TO minute_data;
Now I am seeing blindingly fast speed: the same query which I described above as taking more than 180 seconds now completes in 0.001 seconds. Amazing.
Did you try using EXISTS (...)?
select datediff(date,now()) from minute_data
where EXISTS(SELECT * FROM minute_data WHERE symbol = "CSCO")
AND symbol = "CSCO" order by date desc limit 1;
Even though symbol is in the primary key, it seems you have the timestamp in the PK as well, which makes me think you are using a COMPOSITE PK, meaning the ordering is by timestamp first, then symbol. You may want to put a separate index on symbol if all you have is a composite one where the timestamp comes first.
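A sketch of that separate index, plus a cheap existence probe (the index name is illustrative):
ALTER TABLE minute_data ADD INDEX idx_symbol (symbol);
-- EXISTS can stop at the first matching index entry, unlike COUNT(*):
SELECT EXISTS(SELECT 1 FROM minute_data WHERE symbol = 'CSCO') AS symbol_present;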
I think it is better to make a table named symbols and add a reference to that table in your minute_data table:
symbols:
symbol_id (INT, Primary Key, Auto Increment)
symbol_text (VARCHAR)
minute_data:
key_col (BIGINT, Primary Key, Auto Increment)
symbol_id (INT, Index)
other_field
Use InnoDB as the table engine so that you can add foreign key references.
Try to avoid duplicate entries in your tables.
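As a concrete sketch of that layout (types and lengths are assumptions):
CREATE TABLE symbols (
    symbol_id INT AUTO_INCREMENT PRIMARY KEY,
    symbol_text VARCHAR(16) NOT NULL UNIQUE  -- length is a guess
) ENGINE=InnoDB;

CREATE TABLE minute_data (
    key_col BIGINT AUTO_INCREMENT PRIMARY KEY,
    symbol_id INT NOT NULL,
    -- other fields (date, prices, ...) go here
    INDEX (symbol_id),
    FOREIGN KEY (symbol_id) REFERENCES symbols (symbol_id)
) ENGINE=InnoDB;
A symbol lookup then joins through symbols, and the small INT key keeps the indexes on minute_data compact.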