Using SQL Server for data mining - sql-server-2008

I am working on a project where I am storing data in a SQL Server database for data mining. I'm at the first step of data mining: collecting data.
All the data is currently stored in a SQL Server 2008 database, spread across a couple of different tables. The main table grows by about 100,000 rows per day.
At this rate the table will have more than a million records in about a month's time.
I am also running certain SELECT statements against these tables to get up-to-the-minute, real-time statistics.
My question is how to handle such a large volume of data without impacting query performance. I have already added some indexes to help with the SELECT statements.
One idea is to archive the database once it hits a certain number of rows. Is this the best solution going forward?
Can anyone recommend the best way to handle such data, keeping in mind that down the road I want to do some data mining if possible?
Thanks
UPDATE: I have not researched enough to decide what tool I would use for data mining. My first order of business is to collect the relevant information, and then do the data mining.
My question is how to manage the growing table so that running SELECTs against it does not cause performance issues.

What tool will you be using to do the data mining? If you use a tool that works off a relational source, then you can check the workload it submits to the database and optimise based on that. So you won't know what indexes you'll need until you actually start doing the data mining.
If you are using the SQL Server data mining tools, then they pretty much run off SQL Server cubes (which pre-aggregate the data). So in this case you want to consider which data structure will allow you to build cubes quickly and easily.
That data structure would be a star schema. But there is additional work required to get the data into a star schema, and in most cases you can build a cube off a normalised/OLTP structure OK.
So assuming you are using the SQL Server data mining tools, your next step is to build a cube from the tables you have right now and see what challenges you hit.
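The pre-aggregation a cube performs can be sketched in miniature: roll the raw rows up into one row per grouping, and run the "up to the minute" statistics off that small rollup instead of the raw table. The sketch below uses SQLite purely for illustration; the table and column names are hypothetical, not from the question.

```python
import sqlite3

# Hypothetical flat table of collected events; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_date TEXT, category TEXT, amount REAL);
    INSERT INTO events VALUES
        ('2023-01-01', 'A', 10.0),
        ('2023-01-01', 'A', 5.0),
        ('2023-01-01', 'B', 2.0),
        ('2023-01-02', 'A', 7.0);
""")

# Pre-aggregate into a summary table -- the same idea a cube applies
# automatically: one row per (date, category) instead of one per event.
conn.executescript("""
    CREATE TABLE events_daily AS
    SELECT event_date, category,
           COUNT(*)    AS n_events,
           SUM(amount) AS total_amount
    FROM events
    GROUP BY event_date, category;
""")

# Real-time-style stats now scan the small rollup, not the raw table.
rows = conn.execute(
    "SELECT category, SUM(total_amount) FROM events_daily "
    "GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('A', 22.0), ('B', 2.0)]
```

The trade-off is that the rollup only answers questions along the dimensions you grouped by, which is exactly why the choice of star-schema dimensions matters.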

Related

Redshift design or configuration issue? - My Redshift data warehouse seems much slower than my MySQL database

I have a Redshift data warehouse that pulls data in from multiple sources.
One is my MySQL database and the others are some cloud-based databases that get pulled in.
When querying in Redshift, the response is significantly slower than against the same MySQL table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In MySQL this takes 0.4 seconds. In Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in MySQL; in Redshift it is not, since Redshift is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and MySQL is a relational database that is able to utilize indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems, and it is currently at 90 GB.
We have a reporting tool sitting on top that does native queries to pull data. They are pretty slow but are also pulling a ton of data from multiple tables. I would expect some slowness with these, but with a simple statement like above, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run vacuum and analyze with no improvement.
We have 4 nodes, dc2.large. Currently only using 14% storage. CPU utilization is frequently near 100%. Database connections averages about 10 at any given time.
The data warehouse just has exact copies of the tables from our integration with the other sources. We are trying to do near real-time reporting with this.
I'm just looking for advice on how to improve the performance of our Redshift cluster via configuration changes, some sort of view or dimension-table architecture, or any other tips to help me get the most out of Redshift.
I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question: why this size of cluster for a table of only 11M rows? I'd guess there are other, much larger data sets in the database and that this table isn't what's driving the cluster size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load - is the cluster busy when this query runs? WLM is one place where things can interfere, but disk I/O bandwidth is a shared resource, and if some other queries are abusing the disks this will make every query's access to disk slow. (The same is true of network bandwidth and leader-node workload, but these don't seem to be central to your issue at the moment.)
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a UNIQUE (or PRIMARY) key, 0.4 sec sounds like a long network delay. I would expect 0.004 as a worst case (with SSDs and `PRIMARY KEY(id)`).
(If leads is a VIEW, then let's see the underlying tables; 0.4s may well be reasonable!)
That query works well for an RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.
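The summary-table suggestion above can be sketched concretely: keep a small pre-aggregated table in step with the fact table as rows arrive, so reports never scan the big table at all. The sketch below uses SQLite and a trigger purely for illustration (in a real warehouse this would more likely be the load job's responsibility); all table names are hypothetical.

```python
import sqlite3

# Illustrative sketch: a trigger keeps a summary table in step with the
# fact table, so "real time" reports read the small table only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, region TEXT, amount REAL);
    CREATE TABLE summary_sales (
        sale_date TEXT, region TEXT, total REAL,
        PRIMARY KEY (sale_date, region)
    );
    CREATE TRIGGER trg_fact_sales_ins AFTER INSERT ON fact_sales
    BEGIN
        -- Ensure the summary row exists, then accumulate into it.
        INSERT OR IGNORE INTO summary_sales (sale_date, region, total)
        VALUES (NEW.sale_date, NEW.region, 0);
        UPDATE summary_sales SET total = total + NEW.amount
         WHERE sale_date = NEW.sale_date AND region = NEW.region;
    END;
""")

conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("2023-06-01", "east", 100.0),
    ("2023-06-01", "east", 50.0),
    ("2023-06-01", "west", 25.0),
])

totals = conn.execute(
    "SELECT region, total FROM summary_sales ORDER BY region"
).fetchall()
print(totals)  # [('east', 150.0), ('west', 25.0)]
```

This is also why a summary table can sometimes supplant one copy of the fact data: if every report reads the summary, the raw copy only needs to exist upstream.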

SQL Server reporting database updated with data from MySQL by scheduled job

I am an ex-MultiValue developer who over the last 6 months has been thrust into the world of SQL, and apologies in advance for the length of the question. So far I have got by on general instinct (maybe some ignorance!) and a lot of help from the good people on this site answering previously asked questions.
First some background …
I have an existing reporting database (SQL Server) and a new application (using MySQL) that I am looking to copy data from at either 30 min, hourly or daily intervals (will be based on reporting needs). I have a linked server created so that I can see the MySQL database from SQL Server and have relevant privileges on both databases to do read/writes/updates etc.
The data that I am looking to move to reporting on the 30-minute or hourly schedule is typically header/transaction in nature and has both created and modified date/time stamp columns available for use.
Looking at the reporting DB's other feeds, MERGE is the statement used most frequently across linked servers, but to other SQL Server databases. The MERGE statements also seem to do a full table-to-table comparison, which in some cases takes a while (>5 mins) to complete. Whilst the MERGE seems to be a safe option, I do notice a performance hit on reporting whilst the larger tables are being processed.
In looking at delta loads only, using dynamic date ranges (e.g. between -1 hour:00:00 and -1 hour:59:59) on the created and modified timestamps, my concern would be that the failure of any one job execution could leave the databases out of sync.
Rather than initially asking for specific SQL statements, what I am looking for is a general approach/statement design for the more regularly (hourly) executed statements, the ideal being to safely perform delta loads of just the new or modified rows over a SQL Server to MySQL connection.
I hope the information given is sufficient and any help/suggestions/pointers to reading material gratefully accepted.
Thanks in advance
Darren
I have done a bit of “playing” over the weekend.
The approach I have working pulls the data (inserts and updates) from MySQL via openquery into a CTE. I then merge the CTE into the SQL Server table.
The openquery seems slow (by comparison to other linked tables) but the merge is much faster due to limiting the amount of source data.
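The safety concern above - a failed job leaving the databases out of sync - is usually addressed by making the delta load idempotent: key the merge on the primary key, and re-run with an overlapping timestamp window after a failure, since re-applying the same rows is then harmless. A minimal sketch, using SQLite for illustration and hypothetical table names:

```python
import sqlite3

# Sketch of an idempotent delta load: rows selected by modified-timestamp
# window are upserted on the primary key, so re-running an overlapping
# window after a failed job cannot create duplicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, status TEXT, modified TEXT);
    CREATE TABLE report_orders (id INTEGER PRIMARY KEY, status TEXT, modified TEXT);
    INSERT INTO source_orders VALUES
        (1, 'open',    '2023-01-01 09:15:00'),
        (2, 'shipped', '2023-01-01 09:45:00');
""")

def load_delta(conn, since):
    """Upsert every source row modified on or after `since`."""
    delta = conn.execute(
        "SELECT id, status, modified FROM source_orders WHERE modified >= ?",
        (since,),
    ).fetchall()
    conn.executemany(
        """INSERT INTO report_orders (id, status, modified)
           VALUES (?, ?, ?)
           ON CONFLICT (id) DO UPDATE
           SET status = excluded.status, modified = excluded.modified""",
        delta,
    )

load_delta(conn, "2023-01-01 09:00:00")
# Re-running the same window (e.g. after a failed job) is harmless:
load_delta(conn, "2023-01-01 09:00:00")

rows = conn.execute("SELECT id, status FROM report_orders ORDER BY id").fetchall()
print(rows)  # [(1, 'open'), (2, 'shipped')]
```

This is essentially what MERGE gives you, restricted to a small windowed delta instead of a full table-to-table comparison.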

MySQL: Database Plan For maintaining EOD data of stocks

Broader View: Database plan to maintain Stock EOD data.
Tools at hand: I am planning to use a MySQL 4.1+ database along with PHP. A custom DBAL (based on mysqli) is implemented in PHP to work with MySQL. However, I am open to other database engines provided they are available for free and work with SQL statements :P
Problem domain: I need to plan out a database for my project to maintain the EOD data for stocks. Since the number of stocks maintained in the database is going to be huge, the end-of-day update process for them is going to be pretty heavy.
Needless to say, I am on shared hosting and have to avoid MySQL performance bottlenecks, at least during the initial startup. However, I may move to a VPS later.
Questions:
1. Can normalized schemas take the heavy update process without creating a performance issue?
2. Analysis based on popular algorithms like MACD and CMF has to be done on the EOD data to spot trends in the stocks, and the resulting analysis data will again have to be stored for further reference. The analysis data will be calculated once the EOD data is updated for the day. So is going with normalized schemas fine here, keeping the performance issue in view? I will also need to fetch both the EOD and analysis data quite often!
Elaboration:
1 INSERT statement (to insert EOD data) + 1 INSERT statement (to insert analysis data)
= 2 INSERT statements * 1500 stocks (for startup)
= 3000 INSERT statements done back to back!
I am further planning to add more stocks as the project grows, so I am looking here at scalability as well.
Although I only know of the concept of a DW (data warehouse) by name, if from a performance point of view it is more viable than OLTP, I am ready to give it a shot.
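One thing worth knowing about the "3000 INSERTs back to back" scenario: the cost is usually dominated by per-statement commits, not by the row count. Batching the whole load into one transaction with a single prepared statement is the standard fix. A minimal sketch, using SQLite for illustration and a made-up schema:

```python
import sqlite3

# Sketch of the nightly EOD load done the cheap way: one transaction and
# executemany, instead of 1500+ individually auto-committed INSERTs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eod (symbol TEXT, close REAL, trade_date TEXT)")

# Fabricated rows standing in for the 1500 stocks mentioned above.
eod_rows = [(f"STK{i:04d}", 100.0 + i, "2023-06-30") for i in range(1500)]

with conn:  # one transaction for the whole batch; commits once on success
    conn.executemany("INSERT INTO eod VALUES (?, ?, ?)", eod_rows)

count = conn.execute("SELECT COUNT(*) FROM eod").fetchone()[0]
print(count)  # 1500
```

The same pattern applies in PHP/mysqli: prepare once, bind and execute per row inside an explicit transaction, commit once. At 3000 rows a day a normalized schema handles this comfortably.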

Run analytics on huge MySQL database

I have a MySQL database with a few (five, to be precise) huge tables. It is essentially a star-topology data warehouse. The table sizes range from 700 GB (the fact table) down to 1 GB, and the whole database comes to about 1 terabyte. Now I have been given the task of running analytics on these tables, which might even include joins.
A simple analytical query on this database would be "find the number of smokers per state and display it in descending order"; this requirement could be converted into a simple query like
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of the same nature) takes a lot of time to execute on this database; the time taken is on the order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked into Cassandra, which seemed easy to implement, but I am not sure it is going to be as easy for running analytical queries, especially when I have to use WHERE clauses and GROUP BY constructs.
I have also looked into Hadoop, but I am not sure how I can implement RDBMS-type queries there. I am not too sure I want to invest right away in at least three machines for the name node, ZooKeeper and the data nodes! Above all, our company prefers Windows-based solutions.
I have also thought of pre-computing all the data in simpler summary tables, but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the MySQL environment setup:
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option, but the requirements for this kind of ad-hoc aggregated value keep changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
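The suggestion above about moving the strings out of the fact table can be sketched concretely: the repeated string column becomes a small dimension table keyed by an integer, and the big table stores only the key. A minimal illustration using SQLite, with made-up data:

```python
import sqlite3

# Sketch of pushing a repeated string column out of the fact table into a
# small dimension table keyed by an integer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_smoking (id INTEGER PRIMARY KEY, status TEXT UNIQUE);
    INSERT INTO dim_smoking (status) VALUES
        ('current smoker'), ('recently quit'), ('never smoked');

    -- The fact table now stores a small integer instead of a string,
    -- shrinking both the table and any index on the column.
    CREATE TABLE fact (state TEXT, smoking_id INTEGER REFERENCES dim_smoking(id));
    INSERT INTO fact VALUES ('CA', 1), ('CA', 1), ('NY', 1), ('NY', 3);
""")

# The analytical query resolves the string once against the tiny dimension,
# instead of comparing strings row by row across the big table.
rows = conn.execute("""
    SELECT f.state, COUNT(*) AS smokers
    FROM fact f
    JOIN dim_smoking d ON d.id = f.smoking_id
    WHERE d.status = 'current smoker'
    GROUP BY f.state
    ORDER BY smokers DESC
""").fetchall()
print(rows)  # [('CA', 2), ('NY', 1)]
```

At 700 GB of fact data, replacing even one string column with a 4-byte key can save a substantial fraction of the scan and index volume.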
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ and based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the Infobright DB, then point your analytical queries at the Infobright server, keep the rest of the system as it is, job done) - or at Vertica, if it ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - it's pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of your system it may take more care and feeding than the other options.
1 TB is not that big. MySQL should be able to handle that. At the very least, simple queries like that shouldn't take hours! I can't be very helpful without knowing the larger context, but I can suggest some questions you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many reads do you do per day, and how many writes? Can you live with some lag, e.g. write to a new table each day and merge it into the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate the number of new smokers every day?
Can you use Hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically, use Hadoop just for daily or batch processing and store the results in the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on string columns? Can you make them ints etc.?
Hope that helps
MySQL has a serious limitation that prevents it from performing well in such scenarios: the lack of parallel query capability - it cannot utilize multiple CPUs in a single query.
Hadoop has an RDBMS-like addition called Hive. It is an application capable of translating your queries, written in HiveQL (an SQL-like language), into MapReduce jobs. Since it is actually a small addition on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating the daily data to it, and running the heavy aggregations against it. That will offload a serious part of the load from MySQL. You still need MySQL for the short interactive queries, usually backed by indexes, since Hive is inherently non-interactive - each query will take at least a few dozen seconds.
Cassandra is built for key-value access and does not have a scalable GROUP BY capability built in. There is DataStax's Brisk, which integrates Cassandra with Hive/MapReduce, but it might not be trivial to map your schema onto Cassandra, and you still don't get the flexibility and indexing capabilities of an RDBMS.
As a bottom line - Hive alongside MySQL should be a good solution.

How to import 100 million rows table into database?

Can anyone guide me with my question? I am making an application for the banking sector with fuzzy logic, and I have to import a table with 100 million rows daily. I am using MySQL for this application, and it is processing slowly. Is there another database server that can handle this faster?
We roughly load about half that many rows a day in our RDBMS (Oracle) and it would not occur to me to implement such a thing without access to DBA knowledge about my RDBMS. We fine-tune this system several times a month and we still encounter new issues all the time. This is such a non-trivial task that the only valid answer is:
Don't play around, have your managers get a DBA who knows their business!
Note: Our system has been in place for 10 years now. It hasn't been built in a day...
100 million rows daily?
You have to be realistic. I doubt any single instance of any database out there can handle this type of throughput efficiently. You should probably look at clustering options and other optimisation techniques, such as splitting the data across two different DBs (sharding).
MySQL Enterprise has a bunch of features built in that could ease and monitor the clustering process, but I think the MySQL community edition supports clustering too.
Good-luck!
How are you doing it?
One huge transaction?
Perhaps try to make small transactions in chunks of 100 or 1,000 rows.
Is there an index on that table? Drop the index before starting the import (if that is possible given unique constraints etc.) and rebuild the index after the import.
Another option would perhaps be an in-memory database.
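The drop-the-index-then-rebuild idea above can be sketched as follows: maintaining an index row by row during a huge import is often slower than one bulk rebuild at the end. A minimal illustration using SQLite, with a made-up schema and a small row count standing in for the real volume:

```python
import sqlite3

# Sketch of "drop index, bulk load, rebuild": skip per-row index upkeep
# during the import and pay for one bulk index build instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_txn_id ON txn (id)")

rows = [(i, float(i % 100)) for i in range(10_000)]

conn.execute("DROP INDEX idx_txn_id")               # no index to maintain
with conn:                                          # one transaction
    conn.executemany("INSERT INTO txn VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_txn_id ON txn (id)") # one bulk rebuild

n = conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0]
print(n)  # 10000
```

Note the caveat already mentioned: this only works when no unique constraint on the dropped index must be enforced during the load.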
Well, it seems your business' main logic does not depend on importing those 100 million rows into a database; otherwise you wouldn't be stuck doing this by yourself, right? (Correct me if I'm wrong.)
Are you sure you need to import those rows into a database when the main business doesn't need it? What kind of questions are you going to ask of the data? Couldn't you keep the log files on a bunch of servers and query them with e.g. Hadoop? Or can you aggregate the information contained in the log files and only store a condensed version?
I'm also asking this because it sounds like you're planning to perform some at least moderately sophisticated analysis on the data, and the trouble with this amount of data won't stop once you have it in a DB.