Efficient lookup in MySQL table with millions of rows - mysql

I have a CSV file with about 20 million rows that I'd like to use in my web application. The data is a mapping of postal/zip codes to actual street addresses in the following format:
[zip_or_postal_code] [street_number] [street_name] [city] [state_or_province] [country]
My goal is to keep my lookups (searching by zip/postal code) under 200ms.
I'm not sure if this would make a difference, but I was planning on doing the following:
Move the state/province, country, and city columns to their own tables and reference those in my primary table in order to avoid unnecessary bloat.
Some zip/postal codes cover multiple streets and addresses, so I will consolidate the data down to one row per zip/postal code and store the multiple addresses in something like a VARCHAR. This should cut a few million rows from the table.
What are some optimizations I could make to help with lookup speed? As an example, Google's reverse geolocation API returns a result in under 300 ms with HTTP overhead included. How do they do it?
Also, I am open to using other databases, but since I'm already using MySQL, that would be preferable.
Edit: The lookups will always be done by zip/postal code, so as an example: given the zip 12345 I'd need to return the street #(s)/name(s), city, state, and country. The street #(s)/name(s) will be stored as a single string field, however, so my app will take care of parsing them.

20 million rows is not a lot for MySQL. Just index the zip/postal code and it will be fast. Way under 200ms fast. No need to split between tables. MySQL does get slow when the result set is large, but it doesn't seem like you would encounter that issue. MySQL will do just fine with hundreds of millions of records for basic queries like yours.
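As a rough sketch of what that looks like (the table and column names here are just placeholders for whatever your import produces):

    ALTER TABLE addresses ADD INDEX idx_zip (zip_code);

    -- A lookup by zip/postal code becomes a simple indexed point query.
    SELECT street_number, street_name, city, state, country
    FROM addresses
    WHERE zip_code = '12345';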
You will need to adjust the MySQL settings so that it uses more memory. The default settings are pretty low.
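For example, assuming InnoDB, the main knob is usually innodb_buffer_pool_size (settable at runtime from MySQL 5.7.5 on, otherwise in my.cnf followed by a restart); the value below is only an illustration:

    -- Give InnoDB a large share of the machine's RAM, e.g. 4 GB here.
    SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;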
MySQL does support spatial indexes. So you could pull the longitude/latitude for the postal codes and use a spatial index to do proximity searches. It doesn't seem like you are looking for that, though.
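If you ever did want that, a hedged sketch might look like this (it assumes you have added lng/lat columns for each postal code; SPATIAL indexes need MyISAM before MySQL 5.7, InnoDB from 5.7 on, and the indexed column must be NOT NULL):

    ALTER TABLE addresses ADD COLUMN location POINT NULL;
    UPDATE addresses SET location = POINT(lng, lat);  -- assumes lng/lat columns exist
    ALTER TABLE addresses MODIFY location POINT NOT NULL;
    ALTER TABLE addresses ADD SPATIAL INDEX idx_location (location);

    -- Rough proximity search: candidates inside a bounding box, refined in the app.
    SELECT zip_code, street_name
    FROM addresses
    WHERE MBRContains(
        ST_GeomFromText('POLYGON((-74 40, -73 40, -73 41, -74 41, -74 40))'),
        location);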
If you want things really, really fast, go the route you were thinking of but use memcache or Redis. You can use the zip/postal code as the lookup key. You would still need a persistent, disk-based data store to load the data from. I don't think memcache/Redis is necessary, but it's an option.

Related

Creating an API for retrieving data from a MySQL Database as efficiently as possible

I have an "architecture" question for a project I'm currently working on. I currently have a Python script that scrapes a website and then inserts the data into MySQL database. The DB has the following structure:
address                  | price    | date_created | status
Address one, main street | £150,000 | 13/10/2022   | new data
Address one, main street | £140,000 | 16/10/2022   | update data
Address two, side road   | £350,000 | 13/10/2022   | new data
Every price update has its own record. I'm going to create an API so that I can GET price information for a given address (for example: "Address one, main street"), which I would expect to return 2 records and their relevant contents. I've read some conflicting information on this approach, such as the lack of a unique identifier being a problem, or that the API would be slow with poor performance. The DB currently has around 7,000 records and, whilst it is always growing, will never get into the millions, for example.
Is the above approach short-sighted and likely to be inefficient or infeasible as the DB grows (possibly to 100,000 rows), or likely to perform poorly?
I'm really new to this, so any advice on my approach would be really appreciated.
Scraping involves multiple steps:
1. One (or several) processes, each reaching out to the target site(s) to download the pages.
2. Parse the page.
3. Clean up the data -- remove '£' and commas from the price; rearrange the date into yyyy-mm-dd format, etc.
4. INSERT into the database. (Optionally, batch several rows into a single INSERT.)
Step 1 is the slowest for elapsed time, even if you have multiple processes.
As for what the table schema should look like -- well, that depends on what the SELECTs will be. You will need some INDEXes based on the queries.
The biggest hassle will be in step 3 -- "Main Street" = "Main St." = "Main St", etc. You will need to come up with a canonical form to handle variations in spelling, spacing, abbreviations, missing parts, etc. Do not depend on the SELECT being able to handle such variation, though it may have to handle some of it.
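A hedged sketch of how that could come together, with an illustrative canonical_address column holding the cleaned-up form from step 3 and an index shaped around the API's lookup:

    CREATE TABLE price_history (
        id                INT UNSIGNED NOT NULL AUTO_INCREMENT,
        canonical_address VARCHAR(255) NOT NULL,  -- normalized spelling/abbreviations
        price             INT UNSIGNED NOT NULL,  -- '£' and commas stripped on insert
        date_created      DATE NOT NULL,          -- stored as yyyy-mm-dd
        status            VARCHAR(20) NOT NULL,
        PRIMARY KEY (id),
        INDEX idx_addr_date (canonical_address, date_created)
    );

    -- The GET endpoint's query: every price record for one address, newest first.
    SELECT price, date_created, status
    FROM price_history
    WHERE canonical_address = 'address one, main street'
    ORDER BY date_created DESC;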
With those tips, 100K rows should not be a problem. (Unless you expect a million Selects per hour.)

Geo-location search with MYSQL InnoDB

I am working on a geo-enabled application with an obvious use case: searching for users within some distance of a given user's location. I am currently using a MySQL DB. As the User table is expected to grow very large over time, the time to get results will get longer (too long if it needs to traverse the entire table).
I am using InnoDB, since my table needs many things MyISAM can't do. I have tried Mongo and did a test drive, adding 5 million users and running some tests over them. Now I am curious what MySQL can offer in the same situation, as I would prefer MySQL if it gives results even close to Mongo's.
My user table has other fields plus lat and lng fields (both indexed). Still, queries take a long time. Can anyone suggest a better design approach for faster results?
Mongo has a bunch of very useful built-in geospatial commands and aggregations that are ideal for your case of finding users near a given user's point. Others include within, which finds points inside a bounding box or polygon. In your case the geoNear aggregation is perfect and can provide the calculated distance from the given point.
You would have to code a lot of that functionality yourself with MySQL. Then you also have PostGIS, an add-on for Postgres. Postgres is the classic open-source MySQL competitor, and PostGIS has been around longer than Mongo; it is presumably the database behind OpenStreetMap, government GIS systems, and the like.
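For reference, "coding it yourself" in plain MySQL with the indexed lat/lng columns you already have usually means a bounding-box pre-filter plus an exact distance check, roughly like this (table and column names assumed):

    -- Users within ~10 km of a given point; the BETWEEN clauses let the lat/lng
    -- indexes shrink the candidate set before the exact distance is computed.
    SET @lat = 48.8566, @lng = 2.3522, @radius_km = 10;

    SELECT id,
           6371 * ACOS(COS(RADIANS(@lat)) * COS(RADIANS(lat))
                 * COS(RADIANS(lng) - RADIANS(@lng))
                 + SIN(RADIANS(@lat)) * SIN(RADIANS(lat))) AS distance_km
    FROM users
    WHERE lat BETWEEN @lat - @radius_km / 111.045
                  AND @lat + @radius_km / 111.045
      AND lng BETWEEN @lng - @radius_km / (111.045 * COS(RADIANS(@lat)))
                  AND @lng + @radius_km / (111.045 * COS(RADIANS(@lat)))
    HAVING distance_km <= @radius_km
    ORDER BY distance_km;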
But back to the problem: you need to use GeoJSON format and a 2dsphere index, which you might not be using. Post a single record of your data.

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id BIGINT UNSIGNED, item_id CHAR(20), PRIMARY KEY (search_id, item_id).) That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
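A concrete sketch of that table (names illustrative; BINARY(20) is used here instead of CHAR(20) so it holds the raw SHA1 bytes from the first trick):

    CREATE TABLE saved_searches (
        search_id BIGINT UNSIGNED NOT NULL,
        item_id   BINARY(20) NOT NULL,   -- raw SHA1 bytes, not the 40-char hex string
        PRIMARY KEY (search_id, item_id)
    ) ENGINE=InnoDB;

    -- Save one result; UNHEX turns the hex digest into 20 bytes.
    INSERT INTO saved_searches (search_id, item_id)
    VALUES (42, UNHEX(SHA1('placeholder-object-id')));

    -- Recover the dataset later to feed the IDs back to Solr.
    SELECT HEX(item_id) FROM saved_searches WHERE search_id = 42;

    -- Expire the whole dataset when the session (or deferred job) is done.
    DELETE FROM saved_searches WHERE search_id = 42;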
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Normalize database or not? Read only MyISAM table, performance is the main priority (MySQL)

I'm importing data to a future database that will have one, static MyISAM table (will only be read from). I chose MyISAM because as far as I understand it's faster for my requirements (I'm not very experienced with MySQL / SQL at all).
That table will have various columns such as ID, Name, Gender, Phone, Status... and Country, City, Street columns. Now the question is: should I create tables (e.g. Country: Country_ID, Country_Name) for the last 3 columns and reference them in the main table by ID (i.e. normalize), or just store them as VARCHAR in the main table (with duplicates, obviously)?
My primary concern is speed: since the table won't be written to, data integrity is not a priority. The only actions will be selecting a specific row or searching for rows that match certain criteria.
Would searching by the Country, City and/or Street columns (and possibly other columns in the same search) be faster if I simply use VARCHAR?
EDIT: The table has about 30 columns and about 10m rows.
It can be faster to search if you normalize as the database will only have to compare an integer instead of a string. The table data will also be smaller which makes it faster to search as more can be loaded into memory at once.
If your tables are indexed correctly then it will be very fast either way - you probably won't notice a significant difference.
You might also want to look at a full text search if you find yourself writing LIKE '%foo%', since a LIKE with a leading wildcard can't use an index and will result in a full table scan.
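For example (MyISAM supports FULLTEXT indexes natively; the table and column names are assumed):

    ALTER TABLE people ADD FULLTEXT INDEX ft_street (Street);

    -- Uses the full text index instead of a full table scan.
    SELECT ID, Name, City
    FROM people
    WHERE MATCH(Street) AGAINST('maple');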
I'll try to give you something more than the usual "It Depends" answer.
#1 - Everything is fast for small N - if you have less than 100,000 rows, just load it flat, index it as you need to and move on to something higher priority.
Keeping everything flat in one table is faster for reading everything (all columns), but to seek or search into it you usually need indexes. If your data is very large, with redundant City and Country information, it might be better to have surrogate foreign keys into separate tables, but you can't really say hard and fast.
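As a minimal sketch of that normalized shape (all names assumed), compared with simply keeping Country/City/Street as VARCHAR in the flat table:

    CREATE TABLE countries (
        country_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(64) NOT NULL UNIQUE
    ) ENGINE=MyISAM;

    CREATE TABLE cities (
        city_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        country_id SMALLINT UNSIGNED NOT NULL,
        name       VARCHAR(64) NOT NULL,
        INDEX idx_name (name)
    ) ENGINE=MyISAM;

    CREATE TABLE people (
        ID      INT UNSIGNED NOT NULL PRIMARY KEY,
        Name    VARCHAR(128) NOT NULL,
        Street  VARCHAR(128) NOT NULL,
        city_id INT UNSIGNED NOT NULL,   -- 4-byte key instead of two VARCHARs
        INDEX idx_city (city_id)
        -- ... the remaining columns from the question
    ) ENGINE=MyISAM;

    -- A search by country and city now compares small integers in the join:
    SELECT p.ID, p.Name, p.Street
    FROM people p
    JOIN cities c    ON c.city_id    = p.city_id
    JOIN countries n ON n.country_id = c.country_id
    WHERE n.name = 'Canada' AND c.name = 'Toronto';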
This is why some kind of data modeling principles are almost always used - either traditional normalized (e.g. Entity-Relationship) or dimensional (e.g. Kimball) is usually used - the rules or methodologies in both cases are designed to help you model the data without having to anticipate every use case. Obviously, knowing all the usage patterns will bias your data model towards supporting them - so a lot of aggregations and analysis is a strong indicator to use a denormalized dimensional model.
So it really depends a lot on your data profile (row width and row count) and usage patterns.
I don't have much more than the usual "It Depends" answer, unfortunately.
Go with as much normalization as you need for the searches you actually do. If you never actually search for people who live on Elm Street in Sacramento or on Maple Avenue in Denver, any effort to normalize those columns is pretty much wasted. Ordinarily you would normalize something like that to avoid update errors, but you've stated that data integrity is not a risk.
Watch your slow query log like a hawk! That will tell you what you need to normalize. Do EXPLAIN on those queries and determine whether you can add an index to improve it or whether you need to normalize.
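For example, to switch the slow query log on at runtime and then look at a suspect query's plan (column names here are from the flat variant of the table and purely illustrative):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 0.5;  -- log anything slower than 500 ms

    -- 'ref' with a usable key is good; type 'ALL' means a full table scan.
    EXPLAIN SELECT ID, Name
    FROM people
    WHERE City = 'Sacramento' AND Street = 'Elm Street';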
I've worked with some data models that we called "hyper-normalized." They were in all the proper normal forms, but often for things that just didn't need it for how we used the data. Those kinds of data models are difficult to understand at a casual glance, and they can be very annoying.

high load on mysql DB how to avoid?

I have a table containing the cities around the world; it contains more than 70,000 cities.
I also have an auto-suggest input on my home page (which is used intensively) that makes a SQL query (a LIKE search) for each keystroke after the second letter.
So I'm afraid of that heavy load, and I'm looking for any solution or technique that can help in such a situation.
Cache the table, preferably in memory. 70,000 cities is not that much data. If each city takes up 50 bytes, that's only 70000 * 50 / (1024 ^ 2) ≈ 3.3 MB. And after all, a list of cities doesn't change that fast.
If you are using AJAX calls exclusively, you could cache the data for every combination of the first two letters in JSON. Assuming a Latin-like alphabet, that would be around 680 combinations. Save each of those to a text file in JSON format, and have jQuery access the text files directly.
Create an index on the city 'names' to begin with. This speeds up queries that look like:
SELECT name FROM cities WHERE name LIKE 'ka%'
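The index itself can be created like this (assuming the table is called cities); note that a leading-prefix LIKE such as 'ka%' can use it, while '%ka%' cannot:

    CREATE INDEX idx_city_name ON cities (name);

    -- Cap what the autosuggest returns per keystroke.
    SELECT name FROM cities WHERE name LIKE 'ka%' ORDER BY name LIMIT 10;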
Also try making your autocomplete form a little 'lazy'. The more letters a user enters, the fewer records your database has to deal with.
You should cache as much data as you can on the web server. Data that does not change often like list of Countries, Cities, etc is a good candidate for this. Realistically, how often do you add a country? Even if you change the list, a simple refresh of the cache will handle this.
You should make sure that your queries are tuned properly to make best use of Index and Join techniques.
You may have load on your DB from other queries as well. You may want to look into techniques to improve performance of MySQL databases.
Just get your table to fit in memory, which should be trivial for 70k rows.
Then you can do a scan very easily. Maybe don't even use a sql database for this (as it doesn't change very often), just dump the cities into a text file and scan that. That'd definitely be better if you have many web servers but only one db server as each could keep its own copy of the file.
How many queries per second are you seeing peak? I can't imagine there being that many people typing city names in, even if it is a very busy site.
Also you could cache the individual responses (e.g. in memcached) if you get a good hit rate (e.g. because people tend to type the same things in)
Actually you could also probably precalculate the responses for all one-three letter combinations, that's only 26*26*26 (=17k) entries. As a four or more letter input must logically be a subset of one of those, you could then scan the appropriate one of the 17k entries.
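A hedged sketch of that precalculation in SQL (assuming a cities table with a name column): build a small prefix table once, then serve each suggestion from an exact indexed lookup.

    CREATE TABLE city_prefixes (
        prefix CHAR(3)     NOT NULL,
        name   VARCHAR(64) NOT NULL,
        INDEX idx_prefix (prefix)
    );

    INSERT INTO city_prefixes (prefix, name)
    SELECT LOWER(LEFT(name, 3)), name
    FROM cities
    WHERE CHAR_LENGTH(name) >= 3;

    -- For input 'lon' (or 'lond', 'londo', ...), scan only the matching bucket.
    SELECT name FROM city_prefixes WHERE prefix = 'lon' ORDER BY name LIMIT 10;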
If you have an index on the city name it should be handled by the database efficiently. (This statement is wrong; see the comments below.)
To lower the demands on your server resources, you can offer autocompletion only after n or more characters. Also allow for some timeout, i.e. don't issue a request while the user is still typing.
Once the user has stopped typing for a while, you can request the autocompletion.