MySQL search of 118 million rows taking 86 seconds

I was given a MySQL InnoDB database with 2 tables. One has 117+ million rows and over 340 columns, including name, address, city, state and zip. The second has 17+ million rows with name, address, city, state and zip plus email. The data in the 1st and 2nd tables will not be added to or updated. There is a primary key on an id in each table. There are no other indexes defined.
I first created a contacts table from the 117+ million row table containing just the name, address, city, state, and zip, making it significantly smaller. I then wrote a PHP script that takes each row from the smaller table of 17+ million records and tries to find a match in the contacts table; when one is found, it inserts the id and email into a separate table. I cancelled it because each search was taking approximately 86 seconds. With 17+ million records it would take forever to finish.
Here is my search query:
q= "SELECT id FROM GB_contacts
WHERE LAST_NAME=\"$LAST\" and FIRST_NAME=\"$FIRST\" and MI=\"$MIDDLE\"
and ADDRESS=\"$ADDRESS\" and ZIP=\"$ZIP\"".
My question is: how can I do this faster? Should I make one index on name, address, and zip in the contacts table, or should I index each column separately? Is there a faster way of doing this in MySQL? I have read a whole bunch of different resources and am unsure which is the best way to go. Since these are such large tables, anything I try takes a very long time, so I am hoping to get some expert advice and avoid wasting days, weeks or months trying to figure this out. Thank you for any helpful advice!

The best way to do this is to create a composite (multi-column) index on the fields that you are matching on. In this case, it might be a good idea to start with the zip code, followed by either first or last name (last names are longer, so they take longer to compare, but they are also more distinct, so they leave fewer rows to do further matching on; you will have to test which performs better). The strategy here is to tell MySQL to look only in small pockets of people rather than search the entire table, so you have to be clever about where you tell MySQL to begin narrowing things down. While you test, don't forget to use the EXPLAIN command.
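For illustration, a minimal sketch of such a composite index, assuming the column names from the query above (the index name and the literal values are only placeholders):

CREATE INDEX idx_zip_name_addr
    ON GB_contacts (ZIP, LAST_NAME, FIRST_NAME, MI, ADDRESS);
-- If the combined key is too long, a prefix such as ADDRESS(50) may be needed.

-- Check that the optimizer actually uses the index before running the full job:
EXPLAIN SELECT id
FROM GB_contacts
WHERE ZIP = '12345'
  AND LAST_NAME = 'SMITH'
  AND FIRST_NAME = 'JOHN'
  AND MI = 'Q'
  AND ADDRESS = '1 MAIN ST';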

Did you try a typical join? If your join key is indexed, it shouldn't take much time.
If it's a one-time job, you can create indexes on the join columns.
The second step would be to load the records returned into the new contacts table.
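As a rough sketch of that join approach: with a composite index like the one above in place, the whole per-row PHP loop could be replaced by a single set-based statement. GB_emails and matched_contacts are assumed names for the 17+ million row email table and the results table; adjust the column names to the real schema.

INSERT INTO matched_contacts (contact_id, email)
SELECT c.id, e.EMAIL
FROM GB_emails AS e
JOIN GB_contacts AS c
  ON  c.ZIP        = e.ZIP
  AND c.LAST_NAME  = e.LAST_NAME
  AND c.FIRST_NAME = e.FIRST_NAME
  AND c.MI         = e.MI
  AND c.ADDRESS    = e.ADDRESS;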

Related

DATABASE optimization insert and search

I was having an argument with a friend of mine. Suppose we have a db table with a userid and some other fields. This table might have a lot of rows. Let's also suppose that by design we limit the records for each userid in the table to about 50. My friend suggested that if I put every row for each userid one after another, the lookup would be faster, e.g.
userid otherfield
1 .........
1 .........
.....until 50...
2 ........
etc. So when user id 1 is created, I pre-populate the 50 rows for that user with null values... etc. The idea is that if I know the number of rows and find the first row with userid = 1, I just have to look at the next 49 and voilà, I don't have to search the whole table. Is this correct? Can this be done without indexing? Is the pre-population an expensive process? Is there a performance difference if I just inserted in the old-fashioned way, like
1 ........
2 ........
2 ........
1 ........
etc?
To answer a performance question like this, you should run performance tests on the different configurations.
But, let me make a few points.
First, although you might know that the records for a given id are located next to each other, the database does not know this. So, if you are searching for one user -- without an index -- then the engine needs to search through all the records (unless you have a limit clause in the query).
Second, if the data is fixed length (numeric and dates), then populating it with values after populating it with NULL values will occupy the same space on the page. But if the data is variable length, then a given page will be filled with empty records, and when you modify the records with real values, you will get page splits.
What you are trying to do is to outsmart the database engine. This isn't necessary, because MySQL provides indexes, which provide almost all the benefits that you are describing.
Now, having said that, there is some performance benefit from having all the records for a user co-located. If a user has 50 records, then reading them via an index could require loading up to 50 pages into memory; if the records are co-located, only one or two pages would need to be read. Typically, this is a very small performance gain, because most frequently accessed tables fit into memory. There might be some circumstances where the performance gain is worth it.
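To make the index point concrete, here is a small sketch (the table and column names are invented, since the thread gives no schema):

CREATE TABLE user_records (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    userid     INT NOT NULL,
    otherfield VARCHAR(255),
    INDEX idx_userid (userid)  -- lets the engine jump straight to one user's rows
);

-- With the index, this touches only the ~50 rows for that user, not the whole table:
SELECT * FROM user_records WHERE userid = 1;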

What works faster "longer table with less columns" or "shorter table with more columns"?

I have to decide how to design a table that will be used to store dates.
I have about 20 different dates for each user, and I guess 100 000 users right now and growing.
So the question is: for SELECT queries, what will work faster? Should I make a table with 20 fields, e.g.
"user_dates"
userId, date_registered, date_paid, date_started_working, ... date_reported, date_fired 20 total fields with 100 000 records in table
or make 2 tables, like this:
a first table "date_types" with 3 fields and 20 records for the above column names,
id, date_type_id, date_type_name
1 5 date_reported
2 3 date_registered
...
and a second table with 3 fields holding the actual records,
"user_dates"
userId, date_type, date
201 2 2012-01-28
202 5 2012-06-14
...
but then with 2 000 000 records?
I think the second option is more universal: if I need to add more dates, I can do it from the front end just by adding a record to the "date_types" table and then using it in "user_dates". However, I am now worried about performance with 2 million records in the table.
So which option you think will work faster?
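To make the two options easier to compare, a rough sketch of both layouts as CREATE TABLE statements (names and types are guesses based on the question, and the date_types lookup is collapsed here to a single key column):

-- Option 1: one wide row per user (most of the 20 date columns omitted here).
CREATE TABLE user_dates_wide (
    userId               INT PRIMARY KEY,
    date_registered      DATE,
    date_paid            DATE,
    date_started_working DATE,
    date_reported        DATE,
    date_fired           DATE
);

-- Option 2: one narrow row per (user, date type).
CREATE TABLE date_types (
    date_type_id   INT PRIMARY KEY,
    date_type_name VARCHAR(64) NOT NULL
);

CREATE TABLE user_dates (
    userId       INT NOT NULL,
    date_type_id INT NOT NULL,
    `date`       DATE NOT NULL,
    PRIMARY KEY (userId, date_type_id)
);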
A longer table will have a larger index. A wider table will have a smaller index but take more physical space and probably have more overhead. You should carefully examine your schema to see if normalization is complete.
I would, however, go with your second option. This is because you don't need to necessarily have the fields exist if they are empty. So if the user hasn't been fired, no need to create a record for them.
If the dates are pretty concrete and the users will have all (or most) of the dates filled in, then I would go with the wide table because it's easier to actually write the queries to get the data. Writing a query that asks for all the users that have date1 in a range and date2 in a range is much more difficult with a vertical table.
I would only go with the longer table if you know you need the option to create date types on the fly.
The best way to determine this is through testing. Generally, the sizes of data you are talking about (20 date columns by 100K records) are really pretty small for MySQL tables, so I would probably just use one table with multiple columns, unless you think you will be adding new types of date fields all the time and want a more flexible schema. You just need to make sure you index all the fields that will be used for filtering, ordering, joining, etc. in queries.
The design may also be informed by what type of queries you want to perform against the data. If for example you expect that you might want to query data based on a combination of fields (i.e. user has some certain date, but not another date), the querying will likely be much more optimal on the single table, as you would be able to use a simple SELECT ... WHERE query. With the separate tables, you might find yourself needing to do subselects, or odd join conditions, or HAVING clauses to perform the same kind of query.
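To illustrate that point, a sketch of the same question asked against both layouts, using the hypothetical user_dates_wide and user_dates tables sketched above (date_type_id 1 = registered and 2 = paid are assumed values):

-- Wide table: a single straightforward WHERE clause.
SELECT userId
FROM user_dates_wide
WHERE date_registered BETWEEN '2012-01-01' AND '2012-06-30'
  AND date_paid       BETWEEN '2012-01-01' AND '2012-06-30';

-- Vertical table: the same question needs a self-join (or subqueries).
SELECT r.userId
FROM user_dates AS r
JOIN user_dates AS p
  ON p.userId = r.userId AND p.date_type_id = 2
WHERE r.date_type_id = 1
  AND r.`date` BETWEEN '2012-01-01' AND '2012-06-30'
  AND p.`date` BETWEEN '2012-01-01' AND '2012-06-30';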
As long as the user ID and the date-type ID are indexed on the main tables and the user_dates table, I doubt you will notice a problem when querying .. if you were to query the entire table in either case, I'm sure it would take a pretty long time (mostly to send the data, though). A single user lookup will be instantaneous in either case.
Don't sacrifice the relation for some possible efficiency improvement; it's not worth it.
Usually I go both ways: put the basic and most often used attributes into one table, and make an additional-attributes table for rarely used attributes, which can then be fetched lazily from the application layer. This way you are not doing JOINs every time you fetch a user.

MySQL Optimization When Not All Columns Are Indexed

Say I have a table with 3 columns and thousands of records like this:
id # primary key
name # indexed
gender # not indexed
And I want to find "All males named Alex", i.e., a specific name and specific gender.
Is the naive way (select * from people where name='alex' and gender=2) good enough here? Or is there a more optimal way, like a sub-query on name?
Assuming that you don't have thousands of records matching the name, with only a few of them actually being males, the index on name is enough. Generally you should not index fields with little cardinality (only 2 possible values means that you are going to match about 50% of the rows, which does not justify using an index).
The only useful exception I can think of is if you are selecting only name and gender: if you put both of them in the index, you can perform an index-covered query, which is faster than locating rows via the index and then retrieving the data from the table.
If creating an index is not an option, or you have a large volume of data in the table (or even if there is an index, but you still want to quicken the pace) it can often have a big impact to reorder the table according to the data you are grouping together.
I have a query at work for getting KPIs together for my division, and even though everything was nicely indexed, the data being pulled still meant searching through a couple of gigs of table. That means a LOT of disk access while the query aggregates all the correct rows together. I reordered the table using alter table tableName order by column1, column2; and the query went from taking around 15 seconds to returning data in under 3. So the physical arrangement of the data can be a significant influence, even if the tables are indexed and the DB knows exactly where to get it. Arranging the data so it is easier for the database to get to everything it needs will improve performance.
A better way is to have a composite index.
i.e.
CREATE INDEX <some name for the index> ON <table name> (name, gender)
Then the WHERE clause can use it for both the name and the gender.
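A concrete version of that template, with assumed table and index names, might look like this; EXPLAIN should then show the composite index being used for both conditions:

CREATE INDEX idx_name_gender ON people (name, gender);

EXPLAIN SELECT * FROM people WHERE name = 'alex' AND gender = 2;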

30 Million Rows in MySQL

Evening,
I'm going through the long process of importing data from a battered, 15-year-old, read-only data format into MySQL to build some smaller statistical tables from it.
The largest table I have built before was (I think) 32 million rows, but I didn't expect it to get that big, and it was really straining MySQL.
The table will look like this:
surname name year rel bco bplace rco rplace
Jones David 1812 head Lond Soho Shop Shewsbury
So, small ints and varchars.
Could anyone offer advice on how to get this to work as quickly as possible? Would indexes on any of the columns help, or would they just slow queries down?
Much of the data in each column will be duplicated many times. Some fields don't have much more than about 100 different possible values.
The main columns I will be querying the table on are: surname, name, rco, rplace.
An INDEX on a column speeds up searches.
Try to index the columns that you will be using most often in queries. As you have mentioned, you will be querying on the columns surname, name, rco and rplace, so I'd suggest you index them.
Since the table has 30+ million records, indexing will take some time; however, it is worth the wait.
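As a sketch, assuming the table is called census and the columns are named as in the question, all four indexes can be added in a single ALTER TABLE, which is generally faster than running four separate statements:

ALTER TABLE census
    ADD INDEX idx_surname (surname),
    ADD INDEX idx_name    (name),
    ADD INDEX idx_rco     (rco),
    ADD INDEX idx_rplace  (rplace);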

How to write a fast counting query for a large table?

I have two tables, Table1 with 100,000 rows and Table2 with 400,000 rows. Both tables have a field called Email. I need to insert a new field into Table1 which will indicate the number of times the Email from each row in Table1 appears in Table2.
I wrote a binary count function for Excel which performs this in a few seconds on this data sample. Is it possible to perform it this fast in Access?
Thank you.
Does this query express what you want to find from Table2?
SELECT Email, Count(*) AS number_matches
FROM Table2
GROUP BY Email;
If that is what you want, I don't understand why you would store number_matches in another table. Just use this query wherever/whenever you need number_matches.
You should have an index on Email for Table2.
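For example (a sketch only; the index name is made up, and if Table1 can hold the same email more than once this returns one row per distinct email rather than per record):

CREATE INDEX idx_email ON Table2 (Email);

SELECT t1.Email, Count(t2.Email) AS number_matches
FROM Table1 AS t1 LEFT JOIN Table2 AS t2
  ON t1.Email = t2.Email
GROUP BY t1.Email;

The LEFT JOIN keeps Table1 emails that never appear in Table2, which then show a count of 0.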
Update: I offer this example to illustrate how fast Count() with GROUP BY can be for an indexed field.
SELECT really_big_table.just_some_text, Count(*) AS CountOfMatches
FROM really_big_table
GROUP BY really_big_table.just_some_text;
really_big_table contains 10,776,000 rows. That size is way beyond what you would ordinarily expect to find in a real-world Access (Jet/ACE) database. I keep it around for extreme stress testing of different database operations.
The field, just_some_text, is indexed. With that index, the query completes in well under a minute. I didn't bother to time it more precisely because I was only interested in a rough comparison with the several minutes the OP's similar query took on a table with less than 5% as many rows as mine.
I don't understand why the OP's query is so much slower by comparison. My intention here is to warn other readers not to dismiss this method. In my experience, the speed of operations like this ranges from acceptable to blazingly fast ... as long as the database engine has an appropriate index to work with. At least give it a try before you resort to copying values redundantly between tables.