Evening,
I'm going through the long process of importing data from a battered, 15-year-old, read-only data format into MySQL to build some smaller statistical tables from it.
The largest table I have built before was (I think) 32 million rows, but I didn't expect it to get that big, and it really strained MySQL.
The table will look like this:
surname  name   year  rel   bco   bplace  rco   rplace
Jones    David  1812  head  Lond  Soho    Shop  Shewsbury
So, small ints and varchars.
Could anyone offer advice on how to get this to work as quickly as possible? Would indexes on any of the columns help, or would they just slow queries down?
Much of the data in each column will be duplicated many times. Some fields don't have much more than about 100 different possible values.
The main columns I will be querying the table on are: surname, name, rco, rplace.
An INDEX on a column speeds up searches.
Try to index the columns that you will use most often in queries. Since you mention that you will be querying on surname, name, rco and rplace, I'd suggest you index those.
With 32 million records the indexing will take some time, but it is worth the wait.
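For example, you could add the indexes like this (just a sketch; the table name records is a placeholder, so substitute whatever you actually called your table):

-- records is a placeholder table name
ALTER TABLE records
  ADD INDEX idx_surname (surname),
  ADD INDEX idx_name    (name),
  ADD INDEX idx_rco     (rco),
  ADD INDEX idx_rplace  (rplace);

-- If you usually filter on surname and name together, a composite
-- index can serve those queries with a single index lookup:
ALTER TABLE records ADD INDEX idx_surname_name (surname, name);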
Related
I have a basic question about database design.
I have a lot of files that I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: small int, int, big int, varchar, json). Of course I use multiple threads along with bulk inserting to increase insert speed (in the end I will have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and that performance is important to me.
Once I have each line's fields and am ready to insert, I have 3 approaches:
1- One big table:
In this case I can create one big table with 30 columns and store all of the files' fields in it. The result is a single huge table that I then want to run a lot of analysis on.
2- A fairly large table (A) and some small tables (B)
In this case I can create some small tables holding the columns whose values repeat heavily once they are separated from the other columns. These small tables have only a few hundred or a few thousand records each instead of 30 million. In the fairly large table (A) I omit the columns that I moved into the other tables and use a foreign key in their place. In the end I have a table (A) with 20 columns and 30 million records, plus some tables (B) with 2-3 columns and 100-50000 records each. To analyse table A I then have to use joins in my SELECTs and so on (a rough sketch of this layout follows below).
3- Just a fairly large table
In this case I can create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I use a mapping between source values and destination values (something like a foreign key, but with a small difference). For example, take the 3 columns c1, c2, c3 that in case 2 I moved into table B and accessed via a foreign key; now, at insert time, I assign a specific number to each distinct combination of c1, c2, c3 and store the relation between the record and its assigned value in the program code. This table is therefore just like table A in case 2, but there is no need to use a join in the SELECTs and so on.
While the inserting time is important, the analysis time afterwards matters more to me, so I'd like to know which of these cases you think is better, and I'd also be glad to see other solutions.
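To make approach 2 concrete, this is roughly the kind of schema I mean (table and column names are only placeholders):

-- Small lookup table (B): one row per distinct (c1, c2, c3) combination.
CREATE TABLE b_lookup (
  b_id INT AUTO_INCREMENT PRIMARY KEY,
  c1 VARCHAR(100),
  c2 VARCHAR(100),
  c3 VARCHAR(100)
);

-- Large table (A): the remaining ~20 columns plus a foreign key to B.
CREATE TABLE a_main (
  id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  b_id INT NOT NULL,
  -- ... the other columns ...
  FOREIGN KEY (b_id) REFERENCES b_lookup (b_id)
);

-- Analysis then needs a join, e.g.:
SELECT b.c1, COUNT(*)
FROM a_main a
JOIN b_lookup b ON b.b_id = a.b_id
GROUP BY b.c1;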
From a design perspective, 30 to 40 million is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table in a separate database filegroup. I have worked on one similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data using a partition scheme based on your needs. One example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The other major considerations are de-normalization/normalization based on your needs, and of course clustered/non-clustered indexing of the data. Again, this depends on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
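For reference, a LOAD DATA call looks something like this (a sketch; the file path, table name and delimiters are assumptions you would adapt to your files):

-- Bulk-load a tab-delimited file; far faster than row-by-row INSERTs.
-- '/path/to/file.tsv' and raw_data are placeholders.
LOAD DATA LOCAL INFILE '/path/to/file.tsv'
INTO TABLE raw_data
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;  -- skip a header row, if the files have one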
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
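To illustrate the summary-table idea, a minimal sketch (the column names here are invented, since we don't know your real ones):

-- Aggregate the raw data once; analytics then hit the small summary table.
CREATE TABLE daily_summary (
  dt        DATE          NOT NULL,
  category  VARCHAR(50)   NOT NULL,   -- placeholder grouping column
  row_count INT           NOT NULL,
  total_amt DECIMAL(14,2) NOT NULL,
  PRIMARY KEY (dt, category)
);

INSERT INTO daily_summary (dt, category, row_count, total_amt)
SELECT DATE(created_at), category, COUNT(*), SUM(amount)
FROM raw_data                         -- placeholder raw-data table
GROUP BY DATE(created_at), category;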
I am working on finalising a site that will go live soon. It will process up to 1 million files per week and store all the information from these files in multiple tables in a database.
The main table will have 10 records per file, so it will gain about 10 million records per week. Currently that table has 85 columns storing about 1.6 KiB of data per row.
I'm obviously worried about having 85 columns, it seems crazy, but I'm more worried about the joins if I split the data into multiple tables... If I end up with 4 tables of 20-odd columns and over 500,000,000 records in each of them, won't those joins take massive amounts of time?
The joins would all take place on 1 column (traceid) which will be present in all tables and indexed.
The hardware this will run on is an i7 6700 with 32 GB of RAM. The storage engine is InnoDB.
Any advice would be greatly appreciated!
Thanks!
The answer to this depends on the structure of your data. 85 columns is a lot. It will be inconvenient to add an 86th column. Your queries will be verbose. SELECT *, when you use it for troubleshooting, will splat a lot of stuff across your screen and you'll have trouble interpreting it. (Don't ask how I know this. :-)
If every item of data you process has exactly one instance of all 85 values, and they're all standalone values, then you've designed your table appropriately.
If most rows have a subset of your 85 values, then you should figure out the subsets and create a table for each one.
For example, if you're describing apartments and your columns have meanings like these:
livingroom yes/no
livingroom_sq_m decimal (7,1)
diningroom yes/no
diningroom_sq_m decimal (7,1)
You may want an apartments table and a separate rooms table with columns like these (see the sketch after the list).
room_id pk
apt_id fk
room_name text
room_sq_m decimal (7,1)
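Concretely, the rooms table might be created something like this (a sketch; names and sizes are illustrative, and it assumes an apartments table keyed on apt_id):

CREATE TABLE rooms (
  room_id   INT AUTO_INCREMENT PRIMARY KEY,
  apt_id    INT NOT NULL,            -- FK to the apartments table (assumed name)
  room_name VARCHAR(50) NOT NULL,    -- e.g. 'livingroom', 'diningroom'
  room_sq_m DECIMAL(7,1),
  FOREIGN KEY (apt_id) REFERENCES apartments (apt_id)
);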
In another example, if some of your columns are
cost_this_week
cost_week_1
cost_week_2
cost_week_3
etc.
you should consider normalizing that information into a separate table.
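For example, the weekly cost columns could become rows in a child table, something like this sketch (names are assumptions):

CREATE TABLE weekly_costs (
  item_id     INT NOT NULL,      -- FK to the main table (placeholder name)
  week_offset INT NOT NULL,      -- 0 = this week, 1 = one week back, ...
  cost        DECIMAL(12,2),
  PRIMARY KEY (item_id, week_offset)
);

-- "cost three weeks ago" then becomes a simple lookup:
SELECT cost FROM weekly_costs WHERE item_id = 123 AND week_offset = 3;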
I was given a MySQL InnoDB database with 2 tables. One has 117+ million rows and over 340 columns, including name, address, city, state and zip. The second table has 17+ million rows with name, address, city, state and zip plus email. The data in the 1st and 2nd tables will not be added to or updated. There is a primary key on an id in each table. There are no other indexes defined.
I first created a contacts table from the 117+ million row table which has just the name, address, city, state, and zip making it significantly smaller. I wrote a php script to perform a search using each row from the smaller table of 17+ million records, trying to find a match in the contacts table. When one is found, I insert the id and email into a separate table. I cancelled it because it was taking approximately 86 seconds per search. With 17+ million records it will take forever to finish.
Here is my search query:
q= "SELECT id FROM GB_contacts
WHERE LAST_NAME=\"$LAST\" and FIRST_NAME=\"$FIRST\" and MI=\"$MIDDLE\"
and ADDRESS=\"$ADDRESS\" and ZIP=\"$ZIP\"".
My question is how can I do this faster? Should I make an index on name, address, and zip in the contacts table or should I index each column in the contacts table? Is there a faster way of doing this through mysql? I have read a whole bunch of different resources and am unsure which is the best way to go. Since these are such large tables, anything I try to do takes a very long time, so I am hoping to get some expert advice and avoid wasting days, weeks and months trying to figure this out. Thank-you for any helpful advice!
The best way to do this is to create a clustered index on the fields that you are matching on. In this case it might be a good idea to start with the zip code, followed by either the first or the last name (last names are longer, so they take longer to match, but they are also more distinct, so fewer rows are left for further matching; you will have to test which performs better). The strategy here is to tell MySQL to look just in small pockets of people rather than search the entire database, so you have to be clever about where you tell MySQL to begin narrowing things down. While you test, don't forget to use the EXPLAIN command.
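For example, something along these lines (a sketch; in InnoDB a true clustered index means making these columns the primary key, so this shows an ordinary composite index instead, and the prefix length on ADDRESS is an assumption):

ALTER TABLE GB_contacts
  ADD INDEX idx_match (ZIP, LAST_NAME, FIRST_NAME, MI, ADDRESS(40));

-- Then check how MySQL plans the lookup (dummy literal values):
EXPLAIN SELECT id
FROM GB_contacts
WHERE ZIP='12345' AND LAST_NAME='Smith' AND FIRST_NAME='John'
  AND MI='Q' AND ADDRESS='1 Main St';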
Did you try a typical join? If your join key is indexed, it shouldn't take much time.
If it's a one-time task, you can create indexes on the join columns.
The second step would be to load the records returned into the new contacts table.
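Something along these lines (a sketch; GB_emails and matched_emails are assumed names for the 17-million-row table and the results table):

-- One set-based pass instead of 17+ million separate SELECTs.
INSERT INTO matched_emails (contact_id, email)
SELECT c.id, e.email
FROM GB_contacts AS c
JOIN GB_emails   AS e
  ON  c.LAST_NAME  = e.LAST_NAME
  AND c.FIRST_NAME = e.FIRST_NAME
  AND c.MI         = e.MI
  AND c.ADDRESS    = e.ADDRESS
  AND c.ZIP        = e.ZIP;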
Here's the scenario: the old database has this kind of design
dbo.Table1998
dbo.Table1999
dbo.Table2000
dbo.table2001
...
dbo.table2011
and I merged all the data from 1998 to 2011 into this table: dbo.TableAllYears.
Both are indexed by "application number" and have the same number of columns (56 columns, actually).
Now when I tried
select * from Table1998
and
select * from TableAllYears where Year=1998
the first query returned 139,669 rows in about 13 seconds,
while the second query returned the same number of rows but took about 30 seconds.
So, guys, am I just missing something, or are multiple tables better than a single table?
You should partition the table by year; this is almost equivalent to having different tables for each year. That way, when you query by year, the query runs against a single partition and performance will be better.
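In MySQL that could look something like this (a sketch; it assumes Year is an integer column and that it is included in the primary/unique keys, which MySQL partitioning requires):

ALTER TABLE TableAllYears
PARTITION BY RANGE (Year) (
  PARTITION p1998 VALUES LESS THAN (1999),
  PARTITION p1999 VALUES LESS THAN (2000),
  -- ... one partition per year ...
  PARTITION p2011 VALUES LESS THAN (2012),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- A query with WHERE Year = 1998 then only touches partition p1998.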
Try adding an index on each of the columns that you're searching on (in the WHERE clause). That should speed up querying dramatically.
So in this case, add a new index on the Year column.
I believe that you should use a single table. Inevitably, you'll need to query data across multiple years, and separating it into multiple tables is a problem. It's quite possible to optimize your query and your table structure such that you can have many millions of rows in a table and still have excellent performance. Be sure your year column is indexed, and included in your queries. If you really hit data size limitations, you can use partitioning functionality in MySQL 5 that allows it to store the table data in multiple files, as if it were multiple tables, while making it appear to be one table.
Regardless of that, 140k rows is nothing, and it's likely premature optimization to split it into multiple tables, and even a major performance detriment if you need to query data across multiple years.
If you're looking for data from 1998, then having only 1998 data in one table is the way to go. This is because the database doesn't have to "search" for the records; it knows that all of the records in this table are from 1998. Try adding the "WHERE Year=1998" clause to the query against Table1998 and you should get a fairer comparison.
Personally, I would keep the data in multiple tables, especially if it is a particularly large data set and you don't have to do queries on the old data frequently. Even if you do, you might want to look at creating a view with all of the table data and running the reports on that instead of having to query several tables.
I need to build a new table system. It will store an id and 10 VARCHAR(255) fields per id; that is all it needs to store. Other than the obvious insert/delete/update on whole rows only, the only other query that will be run is SELECT * FROM table WHERE id='id'. There are 7 million records.
I have 2 structures I came up with, which are:
(1) - a single table: the id, then the 10 varchars; no joins, nothing fancy; id is the primary key, simple SELECT *.
(2) - 2 tables: the first has the id, then 10 integer fields; the second has an integer (auto increment) and the varchars. This would use a join; hence, I would guess, 10 joins per query.
Clearly 2 is better as a formal structure and for later table structural changes, BUT, in terms of SPEED of querying alone, which is better?
If you will always be storing exactly 10 VARCHAR fields and each VARCHAR has its own meaning (like it's always a name, or always an address etc.), then just create a table with 10 fields.
Your second solution is called EAV (entity-attribute-value), which is mostly used for sparse matrices (when you have lots of possible attributes with only few of them being set for a given entity). It is scalable and maintainable, but will be less efficient for the query like yours.
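For comparison, the second structure would look roughly like this (a sketch with invented names), which is why every lookup ends up costing roughly ten joins:

-- Shared lookup table holding the actual strings.
CREATE TABLE strings (
  str_id INT AUTO_INCREMENT PRIMARY KEY,
  val    VARCHAR(255) NOT NULL
);

-- Main table: the id plus ten integer references into strings.
CREATE TABLE items (
  id INT PRIMARY KEY,
  f1 INT, f2 INT, f3 INT, f4 INT, f5 INT,
  f6 INT, f7 INT, f8 INT, f9 INT, f10 INT
);

-- Rebuilding one full row needs a join per field:
SELECT i.id, s1.val, s2.val  -- ... up to s10.val
FROM items i
JOIN strings s1 ON s1.str_id = i.f1
JOIN strings s2 ON s2.str_id = i.f2
-- ... eight more joins ...
WHERE i.id = 42;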
Definitely, in terms of speed the first one is better. Joins are really slow on big tables.