I am currently working on a web-application that would allow users to analyze & visualize data. For example, one of the use-cases is that the user will perform a Principal Component Analysis and store it. There can be other such analysis like a volcano plot, heatmap etc.
I would like to store these analysis and visualizations in a database in the back-end. The challenge that I am facing is how to design a relational database schema which will do this efficiently. Here are some of my concerns:
The data associated with the project will already be stored in a normalized manner so that it can be recalled. I would not like to store it again with the visualization.
At the same time, the user should be able to see what is the original data behind a visualization. For eg. what data was fed to a PCA algorithm? The user might not use all the data associated with the project for the PCA. He/she could just be doing this on a subset of the data in the project.
The number of visualizations associated with the webapp will grow with time. If I need to design an invoved schema everytime a new visualization is added, it could make overall development slower.
With these in mind, I am wondering if I should try to solve this with a relational database like MySQL at all. Or should I look at MongoDB? More generally, how do I think about this problem? I tried looking for some blogs/tutorials online but couldn't find much that was useful.
The first step you should do before thinking about technical design, including a relational or non-SQL platform, is a data model that clearly describes the structure and relations between your data in a platform independent way. I see the following interesting points to solve there:
How is a visualisation related to the data objects it visualizes? When the visualisation just displays the data of one object type (let's say the number of sales per month), this is trivial. But if it covers more than one object type (the number of sales per month, product category, and country), you will have to decide to which of them to link it. There is no single correct solution for this, but it depends on the requirements from the users' view: From which origins will they come to find this visualisation? If they always come from the same origin (let's say the country), it will be enough to link the visuals to that object type.
How will you handle insertions, deletes, and updates of the basic data since the point in time the visualisation has been generated? If no such operations relevant to the visuals are possible, then it's easy: Just store the selection criteria (country = "Austria", product category = "Toys") with the visual, and everyone will know its meaning. If, however, the basic data can be changed, you should implement a data model that covers historizing those data, i.e. being able to reconstruct the data values on which the original visual was based. Of course, before deciding on this, you need to clarify the requirements: Will, in case of changed basic data, the original visual still be of interest or will it need to be re-generated to reflect the changes?
Both questions are neither simplified nor complicated by using a NOSQL database.
No matter what the outcome of those requirements and data modeling efforts are, I would stick to the following principles:
Separate the visuals from the basic data, even if a visual is closely related to just one set of basic data. Reason: The visuals are just a consequence of the basic data that can be re-calculated in case they get lost. So the requirements e.g. for data backup will be more strict for the basic data than for the visuals.
Don't store basic data redundantly to show the basis for each single visual. A timestamp logic with each record of basic data, together with the timestamp of the generated visual will serve the same purpose with less effort and storage volume.
Related
I have many IoT devices sending data currently to MySQL Database.
I want to port it to some other Database, which will be Open Source and provide me with:
JSON support
Scalability
Flexibility to add multiple columns automatically as per payload
Python and PHP Support
Extremely Fast Read, Write
Ability to export at least 6 months of data in CSV format
Please revert back soon.
Any help will be appreciated.
Thanks
Shaping your database based on input data is a mistake. Think of tomorrow your data will be CSV or XML, in a slight different format. Design your database based on your abstract data model, normalize it and apply existing data to your model. Shape your structure based on what input you have and what output you plan to get. If you retrieve the same content as the input, storing data in files will be sufficient, you don't need a database.
Also, you don't want to store "raw" records the database. Even if your database can compose a data record out of the raw element at run time, you cannot run a selection based on a certain extracted element, without visiting all the records.
Most of the databases allow you to connect from anywhere (there is not such thing as better support of PostgreSQL in Java as compared to Python, but the quality and level of standardization for drivers may vary). The question is what features shall your driver support. For example, you may require support for bulk import (don't issue large INSERT sets to the database).
What you actually look for is:
scalability: can your database grow with your data? Would the DB benefit of adding additional CPUs (MySQL particularly doesn't for large queries). Can you shard your database on multiple instances? (MySQL again fails to handle that).
does your model looks like a snowflake? If yes, you may consider NoSQL, otherwise stay away of it. If you manage to model as a snowflake (and this means you are open for compromises) you may use anything like Lucene based search products, Mongo, Cassandra, etc. The fact you have timeseries doesn't qualify you for NoSQL. For example, you may have 10K devices issuing 5k message types. Specific data is redundantly recorded at device level and at message type level. In that case, because of the n:m relation, you don't have the snowflake anymore.
why do you store the data? What queries are you going to issue?
Why do you want to move away from MySQL? It is open source and can meet all of the criteria you listed above. This is a very subjective question so it's hard to give a good answer, but MySQL is not a bad option
I've researched a lot for the best multi-tenant setup of an e-commerce project but could not find a fitting answer. I am leaning to use a shared database separate schema setup with either MySQL or PostgreSQL. The structure of the tables is the same for all the tenants. I really like that with that setup the application code doesn't need to provide an extra WHERE clause for every query, so it is very developer friendly!
Now you also have the shared schema approach and that is what I am currently using, but I feel this is bad for security and isolation purposes. I would like to move to a different solution.
The app will be used by +- 100 webshops (a tenant) this year and I expect it to grow to into the thousands. Webshops ranges from small to large, so it is important that I can later pick out a specific shops data and put it in it's own database server.
Since I don't have any experience with a separate schema setup I would like to know if this would benefit me. What issues might I have when walking that path? Especially with changes in the structure of the tables, this is what bothers me the most. What is the limit of schema's to use in a separate schema approach using PostgreSQL (I will have 100-1000 schemas) before it would be a pain to manage?
The following is my take
Blockquote
When using the SharedDatabase with same table for all tenants, isolation happening with tenantid column is easy because you always add a filter that says tenantid = LoggedInTenantId [LoggedInTenantId => set during login]. When you have a base method in your ORM like EF [I'm from .Net], this would auto append to any query that goes out of the code.
When you opt for shared Schema, if there is a use-case like sharing data between tenants [webshops], it is not feasible. Else, if you have an accountant that wants to audit a collection of tenant's and wants a dashboard to view the a/c statistics etc, it becomes impossible
With the scaling point of view, you can better go for a separate db per tenant if a single tenant or a collection of them wants to scale out. This will be better than managing schemas.
Consider the use-cases that you may have for your product and share here so that we can take this discussion forward.
HTH
Even though I read another question here advising not to use both neo4j and mysql (neo4j - graph database along with a relational database?), I was wondering what approach would be the best for dataset that has some data which can be modeled like a graph and the rest looks relational. For some reasons, I can't post the kind of data I'm using.
I can shoehorn the relational part into neo4j but it looks ugly and complex, something I would want to avoid.
On the other hand, if I use both together, I'll have to do double the amount of queries to get the result, decreasing performance (assume the DBs are on cloud in separate machines).
I can't use mysql alone because one of the queries requires a depth of around 20-30 which I assume can't be handled by mysql.
Have any of you encountered such a situation before ? If so, how did you solve it ?
As everyone else says: "give us a better idea of what data you are trying to model so we can best give you a suggestion".
That being said, dealing with 2 DBs is not an issue and its more common than people think: often-times you use a Full-Text store for searches and then get back a list of Document IDs which you then hit the relational DB for additional metadata. Or hitting Redis to get a list of IDs which you also hit the relational DB for more data.
I proof-of-concepted a system of Neo4j+MySQL for targeted searching based on your social network ("show me all restaurants my network has recommended ordered by depth (e.g. 1st level friend recs are weighted higher than 2nd level, and so on) and it didn't feel awkward. But I also didn't take it to scale.
You will be having to keep both datastores in sync. So in my case when a user recommends a place on the web app (which inserts it into MySQL) you then need to turn around and do the same insert into Neo. You probably want to do this asynchronously as well, so you'll need to setup a message queue with workers.
I can't discuss things in great detail due to an NDA, but I'm hoping an overview of the system being built can help you in aiding me in making a decision concerning our databases.
I'm building an app that will help vendors compete to gain clientele by making strategic offers based on records of inventory/purchase from the storefronts.
One side of the app is for the store owners to see presented offers, network, etc. I've got that going with a standard php/MySQL setup.
My question is concerning the records of inventory. We are talking millions of records here nearly immediately. The sample data I'm using is roll up of four of their managers (they have dozens) over the course of a year or two and it had over 500k rows with about 30 or more columns. When we get scores of stores with all of their managers it will be massive, at least compared to anything I've worked with as of yet.
The vendors will have a side of the product in which they can search through these records and make competitive offers based off of it.
Is the sheer size a good reason to use something like mongo? Or is it more a matter of how the data is laid out / what it consists of? Or some other element that I'm not considering?
And, if not mongo/nosql, then is there some other methodology or technology that such large data stores would benefit from me using (sharding, amazon cloud database, etc).
Thanks
Answers ...
Q: Is the sheer size a good reason to use something like mongo?
A: I think so. Mongo was built from the ground up to scale in a massive way. You have replica sets and sharding that can help you scale. They also have features to make sure your data gets stored in the appropriately geographically distributed data centers.
Q: Or is it more a matter of how the data is laid out / what it consists of?
A: Mongo is a document database and you're right, the data models will be different. You have to think of data in a denormalized way instead of normalized. Just like any technology, there are pros and cons to storing things as documents.
Some pros: Schema management is a breeze. Data more naturally fits objects in your application. Don't have to pay the price of complicated/slow joins.
Some cons: Schemas can be inconsistent - you have to manage it. Data is repeated, which is not managed means it can become inconsistent.
In general I think Mongo would be a good choice to deal with that scale. Mongo has a new aggregation framework that brings a lot of SQL concepts to queries on documents. Easier to make complex queries. Also Mongo has map/reduce to run any kind of query you might have.
After using Mongo daily for about a year, I've really enjoyed the support around it as a product and the general ease of setting it up and working with it.
I'm developing an application that will require me to create my first large-scale MySQL database. I'm currently having a difficult time wrapping my mind around the best way to organize it. Can anyone recommend any reading materials that show different ways to organize a MySQL database for different purposes?
I don't want to try getting into the details of what I imagine the database's main components will be because I'm not confident that I can express it clearly enough to be helpful at this point. That's why I'm just looking for some general resources on MySQL database organization.
The way I learned to work these things out is to stop and make associations.
In more object oriented languages (I'm assuming you're using PHP?) that force OO, you learn to think OO very quickly, which is sort of what you're after here.
My workflow is like this:
Work out what data you need to store. (Customer name etc.)
Work out the main objects you're working with (e.g. Customer, Order, Salesperson etc), and assign each of these a key (e.g. Customer ID).
Work out which data connects to which objects. (Customer name belongs to a customer)
Work out how the main objects connect to each other (Salesperson sold order to Customer)
Once you have these, you have a good object model of what you're after. The next step is to look at the connections. For example:
Each customer has only one name.
Each product can be sold multiple times to anybody
Each order has only one salesperson and one customer.
Once you've worked that out, you want to try something called normalization, which is the art of getting this collection of data into a list of tables, still minimizing redundancy. (The idea is, a one-to-one (customer name) is stored in the table with the customer ID, many to one, one to many and many to many are stored in separate tables with certain rules)
That's pretty much the gist of it, if you ask for it, I'll scan an example sheet from my workflow for you.
Maybe I can provide some advices based on my own experience
unless very specific usage (like fulltext index), use the InnoDB tables engine (transactions, row locking etc...)
specify the default encoding - utf8 is usually a good choice
fine tune the server parameters (*key_buffer* etc... a lot of material on the Net)
draw your DB scheme by hand, discuss it with colleagues and programmers
define data types based not only on the programs usage, but also on the join queries (faster if types are equal)
create indexes based on the expected necessary queries, also to be discussed with programmers
plan a backup solution (based on DB replication, or scripts etc...)
user management and access, grant only the necessary access rights, and create a read-only user to be used by most of queries, that do not need write access
define the server scale, disks (raid?), memory, CPU
Here are also some tips to use and create a database.
I can recomend you the first chapter of this book: An Introduction to Database Systems, it may help you organize your ideas, and I certainly recomend not using 5th normal form but using 4th, this is very important.
If I could only give you one piece of advice, that would be to generate test data at similar volumes as production and benchmark the main queries.
Just make sure that the data distribution is realistic. (Not all people are named "John", and not all people have unique names. Not all people give their phone nr, and most people won't have 10 phone numbers either).
Also, make sure that the test data doesn't fit into RAM (unless you expect the production data volumes to do too).