Tools to preprocess big data for dashboards? [closed] - open-source

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I have a complex data set from the pharmaceutical industry with more than 16M rows, stored in a SQL Server database with more than 400 relational tables.
The data has several levels of hierarchy, such as province, city, postal code, person, and antigen measures.
I would like to create a number of dashboards to observe changes and trends. I could use Pentaho, R (Shiny), or Tableau for this, but the data is so large that processing it inside the dashboard software takes too long. One option is to build a cube and connect it to the dashboards.
My question is whether there are other solutions I can use instead of building a cube. I don't want the hassle of building and maintaining one.
I would like a tool where I specify the relationships between tables, the aggregation/amalgamation happens smoothly, and the output is a set of processed tables that can be connected to dashboards. I hear Alteryx is one product that can do this (I haven't tried it myself, and it is expensive!).
I understand this task may need two or more tools. Please share your input and experience: which tools you use, the size of your data, how fast and efficient the overall system is, and any other relevant details.

It depends a lot on how big your dataset is (not just the number of rows) and how fast your SQL server is.
I have loaded datasets with >20M rows (>4 GB in size) directly into Tableau (though this was on 64-bit Windows machines or Macs with >8 GB of RAM), and they work well.
If the amount of data is large (meaning tens of GB of disk space), you are better off connecting Tableau directly to the SQL server and letting the server do the heavy lifting. This also works fine: I have billion-row datasets on (fast and powerful) SQL servers where this runs at reasonable speed, provided the server is optimized for fast analytics rather than transaction processing.
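One way to let the server do the heavy lifting, and to get the kind of pre-processed tables the question asks about, is a scheduled aggregation job on the SQL Server side that the dashboards connect to instead of the raw tables. A minimal sketch, assuming SQL Server; the table and column names (fact_antigen_results, dim_geography, rpt_antigen_by_region) are hypothetical stand-ins for your own schema:
    -- Rebuild a small reporting table from the large fact table; run on a schedule.
    -- All object names below are invented examples, not the actual schema.
    IF OBJECT_ID('dbo.rpt_antigen_by_region', 'U') IS NOT NULL
        DROP TABLE dbo.rpt_antigen_by_region;
    SELECT
        g.province,
        g.city,
        g.postal_code,
        r.antigen_code,
        COUNT(*)           AS test_count,
        AVG(r.measure)     AS avg_measure,
        MAX(r.measured_at) AS last_measured_at
    INTO dbo.rpt_antigen_by_region          -- materialised aggregate the dashboard reads
    FROM dbo.fact_antigen_results AS r
    JOIN dbo.dim_geography AS g
        ON g.geo_id = r.geo_id
    GROUP BY g.province, g.city, g.postal_code, r.antigen_code;
    CREATE CLUSTERED INDEX cix_rpt_region
        ON dbo.rpt_antigen_by_region (province, city, postal_code);
Pointing Tableau (or Shiny) at tables like this keeps the interactive queries small, while the 16M-row joins run once per refresh on the server.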
If your local server power or capacity is limited then I'd also suggest putting your data onto something like Google's BigQuery (or Amazon's Redshift) as these are ridiculously cheap to set up and offer astonishing analytics power. Tableau has connectors for both so you can often achieve interactive speeds even with monster datasets. I have a test dataset of 500m rows and about 100GB of data where I get typical query responses for most queries in 15-30s even if I am driving them directly from Tableau.

Related

Multitenancy: Single database or database per tenant? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
We are in the planning stages for a new multi-tenant SaaS app and have hit a deciding point. When designing a multi-tenant application, is it better to go for one monolithic database that holds all customer data (using a customer_id column), or is it better to have an independent database per customer? Regardless of the database decision, all tenants will run off the same codebase.
It seems to me that having separate databases makes backups and restores MUCH easier, but at the cost of increased complexity in development and upgrades (it is much easier to upgrade 1 database than 500). It also makes it easier, or at least possible, to split individual customers off to separate dedicated servers if the situation warrants the move. At the same time, aggregating data becomes much more difficult when trying to get a broad overview of how customers are using the software.
We expect to have fewer than 250 customers for at least a year after launch, but they will be large customers, and more will follow afterward.
As this is our first leap into SaaS, we are definitely looking to do this right from the start.
This is a bit long for a comment.
In most cases, you want one database with a customer id column in the appropriate tables. This makes it much easier to maintain the application. For instance, it is much easier to replace a stored procedure in one database than in 250 databases.
In terms of scalability, there is probably no issue. If you really wanted to, you could partition your tables by client.
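For concreteness, here is a minimal sketch of that shared-schema approach; the table, columns, and partition count are invented, and the partition syntax assumes MySQL:
    -- Every tenant-owned table carries a customer_id column, and every query filters on it.
    CREATE TABLE orders (
        customer_id INT         NOT NULL,   -- tenant discriminator
        order_id    BIGINT      NOT NULL,
        created_at  DATETIME    NOT NULL,
        total       DECIMAL(12,2),
        PRIMARY KEY (customer_id, order_id),              -- leading customer_id keeps a tenant's rows together
        KEY idx_customer_created (customer_id, created_at)
    )
    -- Optional: spread data across partitions by tenant without changing the application.
    PARTITION BY HASH (customer_id) PARTITIONS 16;
    -- The application scopes every query to the current tenant:
    SELECT order_id, total
    FROM orders
    WHERE customer_id = 42                  -- supplied by the application's tenant context
      AND created_at >= '2016-01-01';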
There are some reasons why you would want a separate database per client:
Access control: maintaining access control at the database level is much easier than at the row level.
Customization: customizing the software for a client is much easier if you can just work in a single environment.
Performance bottlenecks: if the data is really large and/or there are really large numbers of transactions on the system, it might be simpler (and cheaper) to distribute databases on different servers rather than maintain a humongous database.
However, I think the default should be one database because of maintainability and consistency.
By the way, as for backup and restore: if a client requires this functionality, you will probably want to write custom scripts anyway. Although you could use database-level backup and restore, you might have particular needs, such as maintaining consistency with data not stored in the database.
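As a purely hypothetical illustration of such a script (invented table names, MySQL-style syntax), a per-tenant "logical backup" might simply copy one client's rows into pre-created export tables:
    -- Hypothetical per-tenant export: the export_* tables are created beforehand with
    -- the same structure as the source tables. Wrapping the copies in one transaction
    -- means the export either completes as a whole or not at all.
    START TRANSACTION;
    INSERT INTO export_orders   SELECT * FROM orders   WHERE customer_id = 42;
    INSERT INTO export_invoices SELECT * FROM invoices WHERE customer_id = 42;
    COMMIT;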

Database modeling troubleshooting [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Hi, I recently came across a situation where I was asked to optimize the data model for a client's already developed and running product. The main reason for this exercise is that the product suffers from slow performance due to too many locks and too many slow-running queries. I am not a DBA, but after a first look at the data model and some query tracing, I realized that the whole data model suffers from improper design and storage. The database is MySQL 5.6 and we are running the InnoDB engine.
I want to know whether there is any tool out there that can analyze the whole data model and point out possible issues, including data structure definitions, indexes, and so on.
I have tried lots of profiling tools, including MySQL Workbench, MySQL Enterprise Monitor (paid version), and Jet Profiler, but they all seem to be limited to identifying slow queries. What I am interested in is a tool that can analyze the existing data model, report problems with it, and suggest possible solutions.
You cannot look at the data model in isolation. You need to consider the data model together with the requirements and the actual data access/update patterns.
I recommend you identify the top X slowest queries and perform a root-cause analysis on each.
Make sure you focus on the parts of the application that matter, i.e. the performance problems that negatively affect the usefulness of the application.
By data access/update patterns I mean, for example:
High vs. low number of concurrent accesses
Mostly reads or mostly updates?
Single-record reads vs. reading a large number of records at once
Access evenly spread out during the day vs. in bulk at certain times
Random access (every record is equally likely to be selected or updated) vs. mostly recent records
Are all tables used equally, or are some used more than others?
Are all columns of a table read at once, or are there clusters of columns that are used together?
Which tables are frequently used together?
The slow queries are the most important to look at. Show us a few of them, together with SHOW CREATE TABLE, EXPLAIN, and how big the tables are.
Also, how many queries per second are you running?
SHOW VARIABLES LIKE '%buffer%';
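To make that concrete, this is the sort of information to gather for each slow query on MySQL 5.6 / InnoDB; the table, query, and index below are purely illustrative, not from the actual schema:
    -- Current definition and indexes, plus a rough row count, for one suspect table:
    SHOW CREATE TABLE orders\G
    SELECT COUNT(*) FROM orders;
    -- Execution plan for one hypothetical slow query:
    EXPLAIN
    SELECT customer_id, SUM(amount)
    FROM orders
    WHERE status = 'OPEN'
      AND created_at >= '2016-01-01'
    GROUP BY customer_id;
    -- If EXPLAIN reports a full table scan (type: ALL, no usable key), a composite
    -- index matching the WHERE clause is often the first thing to try:
    ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);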
There are no tools like the ones you are looking for, so I guess you'll have to do the homework yourself and propose another data model that "follows the rules".
You could begin by getting familiar with the first three normal forms.
You could also try to detect SQL antipatterns (there are books talking about these) in your database. This should give you some leads to work on.
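As one concrete example of the kind of antipattern to look for (the tables here are invented): a comma-separated list of IDs stored in a single column defeats indexing and foreign keys, and the usual fix is a junction table:
    -- Antipattern: related IDs crammed into one string column.
    --   CREATE TABLE products (product_id INT PRIMARY KEY, account_ids VARCHAR(255));  -- e.g. '12,34,56'
    -- Normalized replacement: one row per relationship, enforceable and indexable.
    CREATE TABLE product_accounts (
        product_id INT NOT NULL,
        account_id INT NOT NULL,
        PRIMARY KEY (product_id, account_id),
        FOREIGN KEY (product_id) REFERENCES products (product_id),
        FOREIGN KEY (account_id) REFERENCES accounts (account_id)
    );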

Is there an open source alternative to MS SQL Compact Edition for Remote Data Application? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
We currently use SQL CE databases on the client machines, which then synchronise their data to a central server using the merge replication/RDA functionality of MS SQL. The amount of data involved is small, and the central server is idle ~95% of the time; it is really only active when data is incoming, which typically happens on a daily or weekly basis.
The SQL Server Standard licensing costs for this are large relative to the workload and the amount of data we're talking about (on the order of hundreds of MB at most). What I'd like to know is whether there is an open-source alternative (MySQL or similar) that we could use as the back-end data storage for our .NET application. My background is Windows Server administration, so I'm relatively new to Linux, but I'm happy to give it a go and learn some new skills, as long as it won't be prohibitively difficult. If there are any other alternatives, that would be great too.
Well, this is quite an open-ended question, so I am going to give you some guidelines on what you can start researching.
Client-side embedded databases.
MySQL can be embedded, but in my understanding MySQL as an embedded server might be overkill for a client. There are, however, a stack of alternatives; one such option is the Berkeley DB system. Keep in mind you don't want a full SQL server on the client side; you are looking for something lightweight. You can read about Berkeley DB here: http://en.wikipedia.org/wiki/Berkeley_DB and about alternatives here: Single-file, persistent, sorted key-value store for Java (alternative to Berkeley DB). They mention SQLite, which might be right up your alley. In short, there is a whole stack of open-source tools you can use here.
Back-end databases. MySQL will do the job very well, and so will PostgreSQL. PostgreSQL seemed to support more enterprise features the last time I looked, though that may have changed. These two are the main open-source players in the SQL server market, and either one will do fine in your scenario. Both PostgreSQL and MySQL run on Windows as well, so you don't have to install Linux, though I would suggest investing the time in Linux as I have; it is well worth the effort, and the peace of mind is good.
There is one major sticking point if you switch over to MySQL/PostgreSQL: the RDA/merge-replication technology you currently use is not supported by these databases, and you will need to look at how to implement the synchronisation yourself, probably from scratch. So while the back-end and even the client-side databases can be replaced, replicating the data will be a little more problematic, but NOT impossible.
Go play with these technologies, do some tests, and then decide how you will replace that replication.
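To give a feel for what "from scratch" might look like (all names invented, MySQL-style syntax on the server side): one simple scheme is a last_modified column on every synchronised table, with each client periodically pushing rows changed since its last successful sync and the server upserting them.
    -- Server-side table keyed by (client_id, row id) so pushes from different
    -- clients never collide; last_modified drives the incremental sync.
    CREATE TABLE readings (
        client_id     INT NOT NULL,
        reading_id    INT NOT NULL,
        value         DECIMAL(10,2),
        last_modified TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                                ON UPDATE CURRENT_TIMESTAMP,
        PRIMARY KEY (client_id, reading_id)
    );
    -- Upsert applied for each changed row pushed by, say, client 7:
    INSERT INTO readings (client_id, reading_id, value, last_modified)
    VALUES (7, 1001, 42.5, '2016-03-01 08:00:00')
    ON DUPLICATE KEY UPDATE
        value = VALUES(value),
        last_modified = VALUES(last_modified);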

What about writing a large data volume at 70,000 records/second? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
Perhaps someone can share an opinion on this? I'm currently looking into a solution for storing between 500 million and 4 billion records per day into a single table (or two) in a database, with a minimum write rate of 70,000 records/second. A record contains roughly 30 variables. We want to load data hourly and in parallel (the data is partitioned), up to the machine's maximum capacity in terms of CPU, memory and I/O. Queries must be possible during writing and should remain acceptably performant while write operations are in progress.
I've been browsing the web to see whether others have attempted parallel writes of these quantities to a MySQL db but have not found anything specific. Most discussions look at transactions per second, but that is not what we're dealing with here. We're loading raw data and we need to do it fast, in parallel, and with zero downtime (i.e. users must be able to query what's available). Is it worth looking into MySQL for this job, or should we not even consider it unless we spend a HUGE amount (what do you reckon?) on hardware?
Note: disk space is not an issue; we have SAN storage over gigabit Fibre Channel on a multi-core, 64-bit server with 128 GB of RAM. I'm not looking for detailed technical solutions, rather feasibility from an expert's point of view, with perhaps a few hints/tips to point me in the right direction.
Appreciate your insights.
In response to comments:
Each record counts individually and each variable is a candidate search criterion. More info:
Yesterday's and older data (up to 10 days) has to be queryable (SQL would be great because it's simple).
Data access preferably not through a custom API; we would much rather use an open standard such as ODBC or a vendor client (such as the Oracle client).
Data consumption involves summarization (after midnight, and partly also every hour for min/max/avg stats) and storage in higher-level history tables for end-user reporting, plus the previously mentioned searching of the raw data for problem/ad-hoc analysis.
It needs to be easy to drop a full day's worth of data at the end of the 10-day cycle.
Just to highlight it once more: writing takes place every hour to keep up with the delivery and avoid creating a backlog, since the midnight summaries cannot be postponed for long.
Search results don't need to be instant, but should preferably not exceed roughly 15 minutes over the entire 10-day volume of 300 billion records.
With that amount of data I think you should look into NoSQL (MongoDB, Cassandra, HBase, etc.). With MySQL you would have to scale your servers a lot. We tried doing ~1,200 inserts/sec and MySQL failed (or the hardware did); the solution was XCache (memcached also failed at the time). Try looking into NoSQL; you'll like it.
4B rows x 30 variables x 4 bytes is about half a terabyte a day. I don't think you're going to be able to keep this on a single machine, and your SAN may have trouble too. I would look at Cassandra, as it's built for high write volumes.
If I were you, I'd separate the solution into data capture and data analysis servers; this is a fairly common pattern. Queries and reporting run against your data warehouse (where you may be able to use a different schema to your data collection system). You load data into your data warehouse using an ETL (extract, transform, load) process, which in your case could be very simple.
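Whichever store ends up doing the capture, the 10-day retention maps naturally onto date-based partitioning, where expiring a day is a metadata operation rather than a huge DELETE. A rough sketch of that pattern if you do benchmark MySQL; the column list, file path, and partition names are all invented, and only two of the daily partitions are shown:
    -- Raw capture table partitioned by day; a scheduled job pre-creates future partitions.
    CREATE TABLE raw_records (
        recorded_at DATETIME NOT NULL,
        source_id   INT      NOT NULL,
        v1 DOUBLE, v2 DOUBLE, v3 DOUBLE     -- ...roughly 30 measurement columns in reality
    )
    PARTITION BY RANGE (TO_DAYS(recorded_at)) (
        PARTITION p20140101 VALUES LESS THAN (TO_DAYS('2014-01-02')),
        PARTITION p20140102 VALUES LESS THAN (TO_DAYS('2014-01-03'))
    );
    -- Hourly bulk load of one delivery file:
    LOAD DATA INFILE '/data/incoming/2014-01-01_13.csv'
    INTO TABLE raw_records
    FIELDS TERMINATED BY ',';
    -- End of the 10-day cycle: drop the oldest day almost instantly.
    ALTER TABLE raw_records DROP PARTITION p20140101;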
As to how you would support 70K writes per second - I'd say this is well beyond the capabilities of most RDBMS servers unless you have a dedicated team and infrastructure. It's not something you want to be learning on the job.
NoSQL seems a better match.