Could it make sense to schedule an export of an SQL database to NoSQL for graphical data mining? - mysql

Would it make sense for me to schedule an export of my SQL database to a graph database (such as Neo4j) in order to generate interactive graphics of relationships, such as here?
UPDATE: Or by extension, should I even be looking to move over to a graph database altogether?
My graph database would not need to be a live reflection of the relational database - an extract every few days would be more than sufficient.
In my case, I currently have a relational database (MySQL) where I’m recording stock items as they pass between individuals/depots. The concept is as follows:
Items:
STOCKID DISPATCHDATE
0001 2014-01-01
0002 2015-06-03
Individuals:
USERID FIRSTNAME
0001 Tom
0002 Jones
Depots:
DEPOTID ZIPCODE
0001 50421
0002 71028
Owners:
STOCK_ID USER_ID RECEIVED DISPATCHED
0001 0001 2015-05-01 2015-05-10
0001 0002 2015-05-11 2015-05-20
From the NoSQL database I would like to be able to visually see things such as:
The flow of which people an item has passed through (and dates of each relationship)
Which items are at each individual/depot (on a given date)
Which individuals are at which depots (on a given date)

As N.B. says in the comments, if the tool is useful then use it - worst case is you find that the tool isn't useful after all and you stop using it (having wasted some time in setting it up, but such is life).
In general, there are three ways to keep the two databases in sync:
Two Phase Commit: modify MySQL in one transaction and Neo4j in another; if either transaction fails, roll back both - neither commits until both signal that they can commit. This provides the highest data integrity but is very expensive.
Loosely synchronized transactions: modify MySQL in one transaction and Neo4j in another; if one succeeds and the other fails, retry the failed transaction a few times, and if it still fails then decide what to do (e.g. undo the successful transaction, which is complicated by the fact that it has already committed and its values may have been used; or log the error and ask a database administrator to manually sync the databases; or some other recovery strategy). This offers decent data integrity and is cheaper than two phase commit, but is harder to recover from if something goes horribly wrong.
Batch synchronization: modify MySQL, and then after a time interval (five minutes, an hour, whatever's appropriate) sync the changes to Neo4j based on a row version number or a timestamp (it's not much of a problem if you sync a bit too much data, since you'll just be overwriting a value with the same value, so err on the side of syncing too much per batch). This solution is easy to program, and is appropriate if Neo4j doesn't need the latest and greatest data.
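A minimal sketch of the batch approach, assuming the MySQL tables carry an updated_at timestamp and that a small sync_state table records the last successful run (both the column and the table are assumptions added for illustration, not part of the schema above):
-- Assumed bookkeeping table tracking the last successful sync.
CREATE TABLE IF NOT EXISTS sync_state (
    job_name     VARCHAR(50) PRIMARY KEY,
    last_sync_at DATETIME NOT NULL
);
-- Pull every row touched since the last run; erring on the side of too much is harmless,
-- because rewriting an unchanged value into Neo4j is idempotent.
SELECT o.STOCK_ID, o.USER_ID, o.RECEIVED, o.DISPATCHED
FROM Owners o
JOIN sync_state s ON s.job_name = 'owners_to_neo4j'
WHERE o.updated_at >= s.last_sync_at;
-- Once the exported rows have been written to Neo4j, advance the watermark.
UPDATE sync_state SET last_sync_at = NOW() WHERE job_name = 'owners_to_neo4j';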
I worked on a similar project where we were syncing MySQL with a key-value NoSQL database (caching expensive queries), using loosely synchronized transactions. We wrote a customized Transaction wrapper that contained a concurrent queue of side-effects (i.e. changes to be made to the key-value database). If the MySQL transaction succeeded, we committed all of the side-effects in the queue to the key-value database (with three retries in the case of transient network failure, after which we logged the error, invalidated the key-value database entry - which would result in a fallback to MySQL - and notified a database admin; this happened once, when the key-value database crashed for an extended period, and was solved by running a batch synchronization); otherwise we discarded them.

I think before starting with the migration there are some questions worth asking yourself:
Can I do the graphical representation without migrating/adding a new data source (using MySQL)?
What degree of efficiency do I want when using such a graphical interface?
How easy would it be, if needed, to add a new data source?
What you see in that video is done by a visual component on top of data from either databases or flat files, so I'd say the answer to the first question is likely to be yes.
Depending on how many people, and what kind of users, are going to use such a graphical representation (internal or external, analysts or not, etc.), this can be another driver for the decision.
As for the third question, without duplicating the other answer, I think #Zim-Zam O'Pootertoot has already covered it. As usual, with multiple data sources the problems are keeping everything in sync and entity resolution (which you minimise by using the same dataset).
In my experience, what Neo4j is very good at is "pattern" querying: given a specific network pattern (expressed in the Cypher language), it will match it against the network dataset.
When it comes to neighbour querying, an SQL solution can also achieve the same result in small projects without too many problems. Of course, if your solution has to scale to hundreds of analysts and hundreds of thousands of queries per day, consider moving.
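For example, a one-hop "neighbours" question such as "who has held item 0001, and when?" is a plain join in MySQL (column names taken from the tables in the question):
SELECT i.FIRSTNAME, o.RECEIVED, o.DISPATCHED
FROM Owners o
JOIN Individuals i ON i.USERID = o.USER_ID
WHERE o.STOCK_ID = '0001'
ORDER BY o.RECEIVED;
It is the multi-hop path queries (item to person to depot to person, and so on) where the SQL version grows into chains of self-joins while the Cypher version stays a short pattern.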
Anyway, given your dataset, it looks to me like you are working with time-based data. In this kind of scenario it could be worth looking at the dynamic behaviour of your network to find temporal patterns as well, not just structural ones.
From the same author as the video you've posted, also have a look at this other graphical representation.
If you want to model a time-based graph, just note that there isn't a bulletproof solution with any data source yet.
Here's a Neo4j tutorial on how to model and represent the data for a time-based dataset.
I bet you can do similar things with MySQL too (probably with less efficiency and elegance in querying), but I haven't done it myself, so I can't give numbers - maybe someone else has and can add some benchmarks here.
Disclaimer: I work in the KeyLines team.

Related

One database or multiple databases for statistical architecture

I currently already have a website running using CodeIgniter and MySQL. The MySQL database contains around 110 tables and holds mainly website-specific data, like user data, vacancy data, etc.
Now I want to extend this website to include a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gathered on our own website, and would also pull in some data from the Google Analytics API to use in our statistics (we will generate a report in Excel but also show statistical graphs and numbers on a page, using chart.js).
We are not thinking (in the foreseeable future) of using this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300,000-350,000 data points gathered per day, but this amount will of course keep growing every day, the more users we get.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of ideas on how to start doing this, but I am not aware whether there is a performance impact from one solution to another, or other things to take into consideration.
My main idea boils down to having a table containing all "events", into which we insert a row every time an action is performed, e.g. "user is registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
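A minimal sketch of what that nightly job could look like, assuming an events table with (event_type, created_at) columns and a daily_stats summary table (all names here are made up for illustration):
-- Aggregate yesterday's raw events into the summary table...
INSERT INTO daily_stats (stat_date, event_type, event_count)
SELECT DATE(created_at), event_type, COUNT(*)
FROM events
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), event_type;
-- ...then clean up the raw rows that have just been summarized.
DELETE FROM events WHERE created_at < CURDATE();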
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few CRON jobs running on our current database, updating statuses and dates, sending emails, ...
Note 2: sync issues between databases are not an issue, since we will never be storing statistics on a per-user level.
MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the-road approach. Using IODKU gives you 'immediate' aggregation, but is probably more burden on the system.
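For the "immediate" variant, IODKU (INSERT ... ON DUPLICATE KEY UPDATE) bumps the counter as each event happens; sketched against the same hypothetical daily_stats table as in the question, assuming a UNIQUE key on (stat_date, event_type):
INSERT INTO daily_stats (stat_date, event_type, event_count)
VALUES (CURDATE(), 'user_registered', 1)
ON DUPLICATE KEY UPDATE event_count = event_count + 1;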
My blog on Summary Tables.
350K rows inserted per day is about 5/second, which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.
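To put rough, purely illustrative numbers on that: 350,000 events per day is about 128 million rows per year, and at an assumed ~50 bytes per row that is already 6+ GB for the raw event table alone (before indexes), which is exactly why summarizing and tossing pays off.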

Database (OLTP) and Reporting

I am working on a trading platform that has reporting as a big portion of its business.
The set up is the following:
SQL OLTP database (about 200 tables) - rather small in number of records (20,000 records in the biggest table, but it keeps growing every week)
For reporting services, SQL views are used to query the live transaction database. Think of the result set of the views as a de-normalized one, in the spirit of a data warehouse approach. These data sets are then passed to a third-party reporting platform (like Tableau, Power BI or Sisense), which takes them and throws them into cubes (probably some columnar structure, like MongoDB, Hadoop, etc). From there the reports are generated.
Current challenges.
The SQL views (about 8) are huge and very hard to maintain. To give you an example, one of the views outputs 100 fields, but each of these fields is a calculated field with complicated CASE statements, nested IF statements, inline functions, and what not, which makes this view as big as 700 lines of SQL code. I inherited these from another employee and now, sadly, I have to maintain them.
Because the data grows weekly by several hundred records (through migration and transactions) and the number of fields in the views also grows (a few every week), the cube build takes longer and longer. To give you an example, a few months ago we set the cube to rebuild every 10 minutes to refresh the data (the build was taking 5 minutes). It currently takes 12-15 minutes to build, so we set it to every 30 minutes. As you can imagine, this will get worse as the data and the number of fields keep growing, and we kind of need the data as current as possible.
The only good thing is that once the cube is built, the reports load fast because they are being pulled from the 3rd-party platform, so no concerns there.
What I have in mind
I would like to get rid of the views so I could ease maintenance and also keep the duration of the cube rebuild to a minimum.
Options:
To build a data warehouse, and then build SSIS packages to populate this structure with the live transactional data. The de-normalized structure would probably look very similar to the views mentioned above. The drawback here is that I don't really feel I'd be simplifying much; I'd actually be adding one more layer, the data migration from the OLTP to the OLAP (data warehouse). And I would still have to rebuild the cube.
To turn the current views into SQL indexed views (materialized views), but in their current state I simply cannot do it, because of the aggregates and inline functions used heavily across the entire views.
Another option I read about is to build an ODS (Operational Data Store - a database that would contain the necessary tables, similar to the SQL views I have now, and be refreshed constantly), maybe using triggers or transaction logs? But I am not sure what building such a thing involves and how hard it is to maintain.
Question:
What approach should I take?
Do any of the 3 above make any sense?
Of course, I am interested in other ideas or suggestions, as well.
Thank you!
From my experience your best approach will be option 1. It is costly, but will give you the most benefit. Create a ROLAP DWH (I recommend Kimball's "The Data Warehouse Toolkit" for best practices and design patterns) and, if you have the opportunity, use a columnar data store (like Amazon Redshift or SAP Sybase IQ). All the CASE statements, nested IFs and other operations that you mentioned would be applied at ETL time, so in the ROLAP everything is precalculated and optimized for consumption. And don't forget about applying indexes (depending on the underlying technology you use). Some database vendors have already published "indexing best practices" for ROLAP, which tell you which type of index to apply depending on the type of table (dimension) and data type, for example.
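To make "applied at ETL time" concrete, one of those CASE-heavy view columns would be computed once per load into a warehouse fact table instead of on every report query; every table, column and variable name below is invented for illustration:
-- Incremental load into the ROLAP fact table: the CASE logic moves out of the view
-- and is evaluated once here, so reports read a plain precalculated column.
INSERT INTO dwh.fact_trades (trade_id, trade_date, risk_band, notional_usd)
SELECT t.trade_id,
       t.trade_date,
       CASE
           WHEN t.notional >= 1000000 THEN 'HIGH'
           WHEN t.notional >= 100000  THEN 'MEDIUM'
           ELSE 'LOW'
       END AS risk_band,
       t.notional * fx.rate AS notional_usd
FROM oltp.trades t
JOIN oltp.fx_rates fx ON fx.currency = t.currency AND fx.rate_date = t.trade_date
WHERE t.updated_at > @last_load_time;   -- only rows changed since the previous load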

Choosing MySQL over MongoDB [closed]

From mongoDB docs:
When would MySQL be a better fit?
A concrete example would be the booking engine behind a travel reservation system, which also typically involves complex transactions. While the core booking engine might run on MySQL, those parts of the app that engage with users – serving up content, integrating with social networks, managing sessions – would be better placed in MongoDB.
Two things I don't understand in this (not even a little) concrete example:
What kind of queries are complex enough to be better suited for MySQL (a concrete example of such a query would be of great help)?
Where is the line that separates the "core booking engine" from the "parts of the app that engage with users"?
My concern is not theoretical, as we use both MySQL and MongoDB in our app, and a better understanding of the above would really help us design our DB models for future features.
MySQL is ACID compliant (assuming you're using InnoDB or similar), MongoDB is not. Read the MongoDB docs about atomicity here:
MongoDB Atomicity
Think about going to the grocery store checkout, and that the POS system is using MySQL. What steps might take place in a single transaction?
Item scanned, price retrieved
Inventory updated, quantity on hand is subtracted by 1
Department metrics updated (add dollar amount, quantity, item type, etc)
Is the item on sale? Show how much money the customer saved on the receipt
Customer used a coupon, make sure we notify the vendor so we get reimbursed
Send receipt total to accounting, update month / year / week stats
Now it's time to pay. OOPS! Customer left wallet at home, and says he'll come back later. We've made all these changes to many database tables, now what do we do? If we were using MySQL and had all these updates in a single transaction, we could just rollback that one transaction and no harm is done. All changes will be reverted automatically, and in the correct order.
Doing that in a non-transactional database means writing code to backtrack through all those changes, in the correct order.
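As a hedged sketch of that checkout in MySQL/InnoDB terms (all table and column names invented for illustration), one ROLLBACK undoes every step in order, with no compensating code:
START TRANSACTION;
-- steps from the list above, each a normal statement inside the same transaction
UPDATE inventory SET qty_on_hand = qty_on_hand - 1 WHERE sku = 'ABC123';
UPDATE dept_metrics SET total_sales = total_sales + 4.99, item_count = item_count + 1 WHERE department_id = 7;
INSERT INTO coupon_redemptions (coupon_id, vendor_id, redeemed_at) VALUES (42, 9, NOW());
INSERT INTO receipt_lines (receipt_id, sku, price, discount) VALUES (1001, 'ABC123', 4.99, 0.50);
-- customer left the wallet at home:
ROLLBACK;   -- every change above is reverted atomically (a successful payment would COMMIT instead)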
MongoDB is good for document storage and retrieval. It wouldn't be my first choice for creating small pieces of a document a little at a time, where you want to store bits and pieces of information in separate places.
How do we use MongoDB in our grocery store example? We could use it as part of an inventory system.
Our MySQL inventory could have a schema of things we absolutely MUST have --- SKU, price, department. We don't necessarily want to clutter it up with things that we don't often need to know, however, by adding columns such as 'Easter_2016_Promotion'. In MongoDB, since we don't have a schema that's set in stone, this isn't a problem.
Something like
db.inventory.update(
{ _id: 1 },
{ $set: { "Easter_2016": "y" } }
)
Could add the "Easter_2016" field to a single inventory item without affecting any of the others. In MySQL, you affect every row in a table by adding a single column --- not so in MongoDB. Additionally, when querying Mongo, you can search all records (documents) for a field that MAY or MAY not exist. In MySQL, the field either exists or it doesn't.
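For comparison, a rough MySQL equivalent of that Mongo update (table and column names assumed): the column has to be added to the whole table before any single row can carry the value, and "not set" becomes NULL rather than "field absent":
-- The schema change touches the whole table, not one row.
ALTER TABLE inventory ADD COLUMN Easter_2016 CHAR(1) NULL;
-- Only now can the single item be flagged.
UPDATE inventory SET Easter_2016 = 'y' WHERE id = 1;
-- Querying for "rows where the flag was ever set" means testing for NULL:
SELECT * FROM inventory WHERE Easter_2016 IS NOT NULL;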
MongoDB is built for schemas that are fluid, dynamic, and (potentially) somewhat unknown. Its speed partially relies on the fact that there aren't monolithic transactions that it may have to undo, and partly on the fact that there isn't a schema to constantly validate against when inserting.
Need to analyze 100,000 receipt JSON files from our POS system? Just run mongoimport and start querying for what you want.
Need to add some special data for just a few inventory items, or flag a handful of customers as 'special handling'? MongoDB for this as well.
Need to import and query tax returns from 20 different states (think: different field names, different number of fields, with a few overlaps)? Mongo wins here, hands down.
For anything that has several known, concrete steps that MUST work, and work in the proper sequence (think: ATM), MySQL is a better fit.
A query with multiple joins would be a good example. The main idea behind this point is that in a relational DB m:n relations are symmetrical, whilst in a document-oriented DB they are not. Since v3.2, MongoDB has $lookup, which addresses this issue to some degree.
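As a concrete illustration (table names invented), this is the kind of multi-join query a booking core runs constantly, relying on those symmetrical relations:
-- "Which seats on this flight are still free, and at what fare?"
SELECT s.seat_number, f.fare_amount
FROM flights fl
JOIN seats s ON s.flight_id = fl.flight_id
JOIN fares f ON f.flight_id = fl.flight_id AND f.booking_class = s.booking_class
LEFT JOIN reservations r ON r.seat_id = s.seat_id AND r.status = 'CONFIRMED'
WHERE fl.flight_number = 'XY123'
  AND fl.departure_date = '2016-04-01'
  AND r.reservation_id IS NULL;   -- seat not yet taken
In a document store you would either embed fares and reservations inside each flight document or fan out several queries, which is the asymmetry referred to above.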
The line between the core booking engine and the presentation engine is drawn by the CAP theorem. The core part must be consistent, while the client-facing part can be implemented with eventual consistency. A recommended workaround for the lack of atomic transactions in MongoDB should shed some light on this statement. Alternatively, your core booking part can use event sourcing to keep state consistent without transactions.

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. This would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e. if the frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on SELECTs? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction, any help on the matter would be great. Or if someone has done something similar and can offer advise I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
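A sketch of that table as MySQL DDL (the table name and the column sizes are illustrative; widen or narrow them as needed):
CREATE TABLE system_config (
    system_id VARCHAR(32)  NOT NULL,   -- SYSTEM_ID: system identifier
    posttime  DATETIME     NOT NULL,   -- POSTTIME: when the information was posted
    name      VARCHAR(64)  NOT NULL,   -- NAME: the name of the parameter
    value     VARCHAR(255) NOT NULL,   -- VALUE: the value of the parameter
    PRIMARY KEY (system_id, posttime, name)
);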
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
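A sketch of that arrangement, with invented names: a MEMORY table for the live state, plus a periodic copy into an ordinary on-disk history table (assumed to exist with the same columns):
-- Fast, in-RAM table holding only the current value per system and parameter.
CREATE TABLE current_state (
    system_id VARCHAR(32)  NOT NULL,
    name      VARCHAR(64)  NOT NULL,
    value     VARCHAR(255) NOT NULL,
    posttime  DATETIME     NOT NULL,
    PRIMARY KEY (system_id, name)
) ENGINE=MEMORY;
-- Occasional job: snapshot the in-memory state to disk for history.
INSERT INTO state_history (system_id, name, value, posttime)
SELECT system_id, name, value, posttime FROM current_state;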
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which wouldn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add flag fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling with such practices, if this has ever been done? Does this sound like a viable option? I am happy to try (which I have already started), and I am seeking advice as to whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slow due to too much data, then you should add some indexes to your database. I also think there is another con: it will be much harder to build reports later on - when you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.