Data warehouse: fetch data directly from the DB or through an API? (MySQL)

We need to pull data into our data warehouse. One of the data sources is internal.
We have two options:
1. Ask the data source team to expose the data through an API.
2. Ask the data source team to dump the data on a daily basis and grant us a read-only DB credential to access the dump.
Could anyone please give some suggestions?
Thanks a lot!

A lot of this really depends on the size and nature of the data, what kind of tools you are using, whether the data source team has experience building APIs, etc.
I think we need a lot more information to make an informed recommendation here. I'd really recommend you have a conversation with your DBAs, see what options they have available to you, and seriously consider their recommendations. They probably have far more insight than we do into what would be most effective for your problem.

API solution cons:
Cost. Your data source team will have to build the API. Then you will have to build the client application to read the data from the API and insert it into the DB. You will also have to host the API somewhere and design the deployment process. It's quite a lot of work, and I don't think it's worth it.
Performance. Not necessarily an issue, but data warehouses usually mean dealing with a lot of data. With an API you will most likely have to transform the data first before you can use the bulk-insert features of your database.
The daily DB dump solution looks much better to me, but I would change it slightly if I were you: I would use a flat file. Most databases have a feature to bulk-insert data from a file, and it usually turns out to be the fastest way to accomplish the task.
So, from what I can tell from your question, I think you should do the following:
1. Agree with your data source team on the data file format. This way you can work independently and even use different RDBMSs.
2. Choose a share that both the data source team's DB and your DB have fast access to.
3. Ask the data source team to implement the export logic to a file.
4. Implement the import logic from a file.
Please note that items 3 and 4 should take only a few lines of code. As I said, most databases have built-in and optimized functionality to export/import data to and from a file; a minimal import sketch follows below.
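For item 4, something along these lines is usually enough (a sketch assuming MySQL with the mysql2 gem; the host, credentials, file path, and table name are placeholders):

```ruby
require "mysql2"

# Connection details are placeholders; local_infile must be enabled on both
# the client and the server for LOAD DATA LOCAL INFILE to work.
client = Mysql2::Client.new(
  host: "warehouse-db.internal",
  username: "etl_user",
  password: ENV.fetch("ETL_DB_PASSWORD"),
  database: "warehouse",
  local_infile: true
)

dump_path = "/shared/exports/orders.csv"  # the agreed-upon share and file name

# Bulk-load the flat file into a staging table; the built-in loader is far
# faster than row-by-row INSERTs.
client.query(<<~SQL)
  LOAD DATA LOCAL INFILE '#{dump_path}'
  INTO TABLE staging_orders
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\\n'
  IGNORE 1 LINES
SQL
```

The export side is the mirror image (SELECT ... INTO OUTFILE or an equivalent client-side dump), so each team only maintains a few lines.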
Hope it helps!

Related

Best practice to generate cached request results

I'm building a custom back-office dataviz tool for a client. This tool generates a lot of graphs, maps and tables to help the marketing team analyze the client's needs.
The tool uses filters that the client will use to navigate the data. Each time a filter is set, the tool generates the filtered base tables on which the rest of the requests are based: extract positions for a map, extract times for a time series...
This is not working: the requests are awfully slow. The client is OK with sacrificing the freshness of the data for the usability of the tool, so I'm building a script that prepares pre-filtered tables during the night.
This works fine, but I have a feeling I'm trying to reinvent the wheel here.
I'm wondering what the best practices are for this:
I have chosen to run an external script instead of triggers, because I've been told it's easier to debug.
I am creating a ton of tables. I didn't want to use views, because they are virtual tables and as such don't fit my purpose of pre-processing as much as possible in advance.
I am overwriting the tables, but maybe it would be better to update them? (A simplified sketch of the overwrite approach is below.)
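To make the setup concrete, the nightly rebuild is essentially this kind of thing (a sketch assuming a MySQL backend; table and column names are illustrative):

```ruby
require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "report_user",
                            password: ENV.fetch("REPORT_DB_PASSWORD"),
                            database: "backoffice")

# Overwrite approach: drop and rebuild the pre-filtered table from the raw data.
client.query("DROP TABLE IF EXISTS positions_map_cache")
client.query(<<~SQL)
  CREATE TABLE positions_map_cache AS
  SELECT id, latitude, longitude, recorded_at
  FROM positions
  WHERE recorded_at >= CURDATE() - INTERVAL 90 DAY
SQL
```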
Do you have any comments or suggestions? Thanks in advance.

How do I set up the architecture for a "big data" analysis project?

A friend of mine and I are in our senior year and will be starting a senior project soon. We had the idea to do a data analysis and data visualization project for it. Our project involves reading a CSV file that is updated every 2 minutes, parsing that data, then storing it in a database. Once that data is stored, we want to run some analysis on it and provide an API through which we could access that data to visualize in some way. Our end goal would be to build an Android app that displays some of the raw data from the CSV and the analysis in a user-friendly format.
I talked to another CS major and he explained that I would need a few different servers to accomplish this: one for storage, another for analysis, and another for some type of queue that would make sure things don't get screwy while we are doing scraping and analysis. The problem is, I don't really know where to start with this. I've done some work with a SQL database before and a PHP front end, but nothing with multiple servers. I've heard of tools to use with big data projects, like Hadoop, but I'm not exactly sure where it fits in. If someone could point me to a resource of some kind to explain, or explain themselves, how I would start to structure this kind of project, that would be awesome!
Since you don't have much experience with these things you'll probably want to look at projects like Cloudera. Specifically their resources page has a nice set of videos and articles.
Another source of solid information (that I personally use) is to browse a Stack Overflow tag and sort by votes. Many good questions on a plethora of big data topics already exist.
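To make the starting point concrete, the ingest step you describe can begin as a single small script rather than a multi-server setup (a sketch assuming the mysql2 gem; the file path, table, and column names are illustrative):

```ruby
require "csv"
require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "project",
                            password: ENV.fetch("DB_PASSWORD"),
                            database: "sensor_data")

loop do
  CSV.foreach("/data/feed.csv", headers: true) do |row|
    # INSERT IGNORE plus a unique key on recorded_at keeps re-reads of the
    # same file from duplicating rows.
    client.query(
      "INSERT IGNORE INTO readings (recorded_at, value) " \
      "VALUES ('#{client.escape(row["timestamp"])}', '#{client.escape(row["value"])}')"
    )
  end
  sleep 120  # the feed is updated every two minutes
end
```

Analysis and the API can then read from the same database; splitting things onto separate servers (or bringing in Hadoop) can wait until the data volume actually demands it.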

ETL between a MySQL primary Data Store and a MongoDB secondary Data Store

We have a Rails app with a MySQL backend; each client has one DB and the schema is identical. We use a custom gem to switch the DB based on the URL of the request (this is some legacy code that we are trying to move away from).
We need to capture some changes from those MySQL databases (changes in inventory, some order information, etc.), transform them, and store them in a single MongoDB database (a multi-tenant data store). This data will be used for analytics at first, but our idea is to move everything there.
There was something in place to do this, using AR callbacks and Rabbit, but to be honest it wasn't working correctly and it looked like it was more trouble to fix it than to start over with a fresh approach.
We did some research and found some tools to do ETL but they are overkill for our needs.
Does anyone have experience with a similar problem?
Any recommendations on how to architect and implement this simple ETL?
Pentaho provides a change-data-capture option, which can solve data-synchronization problems.
If by overkill you mean setup and configuration, then yes, that is a common problem with ETL tools, and Pentaho is the easiest among them.
If you can provide more details, I'll be glad to provide an elaborate answer.
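If you do decide to hand-roll it instead, the per-tenant sync can stay quite small. Here is a sketch, assuming an updated_at column on the source tables and the mysql2 and mongo gems; the tenant list, table, and collection names are illustrative:

```ruby
require "mysql2"
require "mongo"

mongo = Mongo::Client.new(["localhost:27017"], database: "analytics")
last_run = Time.now - 24 * 3600  # in practice, persist the time of the last successful sync

%w[client_a client_b].each do |tenant|
  mysql = Mysql2::Client.new(host: "localhost", username: "etl",
                             password: ENV.fetch("ETL_PASSWORD"), database: tenant)

  rows = mysql.query("SELECT id, sku, quantity, updated_at FROM inventories " \
                     "WHERE updated_at > '#{last_run.strftime("%Y-%m-%d %H:%M:%S")}'")
  rows.each do |row|
    # Upsert keyed on (tenant, source id) so the job can be re-run safely.
    mongo[:inventories].update_one(
      { tenant: tenant, source_id: row["id"] },
      { "$set" => row.merge(tenant: tenant, source_id: row["id"]) },
      upsert: true
    )
  end
end
```

A scheduled job like this sidesteps the AR-callback/RabbitMQ plumbing entirely, at the cost of sync latency equal to how often you run it.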

Ruby on Rails - database or Excel

I am currently doing a project in Ruby on Rails and I have been presented with a dilemma.
The dilemma is that the users of my system will be uploading an Excel spreadsheet. The issue is whether I should read straight from this Excel spreadsheet into my front end, or load the spreadsheet into my MySQL database and from there into my front end.
I have asked numerous people about this issue and have researched on-line to no avail.
Any help would be much appreciated.
An Excel file is not a database. If you need to accept it as source input, parse it, copy the data into a real database, and connect to that.
The database is more flexible and efficient for querying and processing information.
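For example, the parse-and-copy step can be quite small. A sketch, assuming the roo and mysql2 gems; the file, table, and column names are illustrative:

```ruby
require "roo"
require "mysql2"

client = Mysql2::Client.new(host: "localhost", username: "app",
                            password: ENV.fetch("DB_PASSWORD"),
                            database: "app_development")

sheet = Roo::Excelx.new("upload.xlsx")
sheet.each_row_streaming(offset: 1) do |row|  # offset: 1 skips the header row
  name  = row[0].value.to_s
  price = row[1].value.to_f
  client.query("INSERT INTO products (name, price) VALUES ('#{client.escape(name)}', #{price})")
end
```

In a Rails app you would more likely go through an ActiveRecord model than raw SQL, but the shape of the job is the same.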
I can think of two benefits, or rather options, of having them upload the Excel spreadsheet for processing by your back end.
1) Tracking purposes (who sent what, and what the back end did with it). Also consider that other formats/versions could be introduced; would it be important to keep the originals to identify what went wrong, or to answer "how can we handle this new format?"
2) On the other side, the front-end way, you offload processing from the back end, but the browser app could get fairly complex, and depending on your Excel file (if it has many relationships), sending that data up to the server could be complex too. However, if it is simply a flat spreadsheet (simple rows without totals/tax calculations), then loading it into the browser and sending those rows up to the server might be an advantage, if offloading processing matters to you.
However, point 2 is really diluted by point 1, which to me is of greater importance for future migration of this service. So I would personally choose uploading it and processing it on the back end.
Update
As you clarified in the comments, you are asking about using Excel on the back end as a database. I would agree with Simone Carletti's answer here, and just add that a real database gives you much more flexibility, more tools, and more performance. With Excel you are loading a file, parsing it into some structure, and then saving it back (unless you are using some framework, e.g. in .NET, to handle that for you); even then, a database (MySQL, MongoDB, ...) gives you much more flexibility in structuring and querying, which is well worth the overhead of managing DB connections. You might want to write a sample of both to evaluate them; the DB solution will probably win you over.

Data dump filetype for a not-yet-existent SQL database

A friend wants to start scraping data for a data-heavy site he wants me to try to build. I'm a (relatively new) Rails developer and don't know much about the data side of all this. If he's contracting out the scraping, any idea what sort of format I can/should ask for the data in, so it's easy to import into a PostgreSQL database once I get the site started up?
Hope this isn't too vague a question. I don't know where to start looking for this.
The CSV file format is compatible with almost any database system, and it is quite a good starting point. Even if you change your mind later about which database system you'll use, you won't have to worry too much about changing the format.
If you are thinking about data mining, then NoSQL database systems (MongoDB, CouchDB, etc.) can probably be a better solution. In that case, the file format can be JSON as well.
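If you go the CSV route, the PostgreSQL import side can stay very small; here is a sketch using the pg gem (the table, columns, and file name are illustrative):

```ruby
require "pg"

conn = PG.connect(dbname: "site_development", user: "app",
                  password: ENV.fetch("DB_PASSWORD"))

# COPY is PostgreSQL's bulk loader and is much faster than row-by-row INSERTs.
conn.copy_data("COPY listings (name, price, scraped_at) FROM STDIN WITH (FORMAT csv, HEADER true)") do
  File.foreach("listings.csv") { |line| conn.put_copy_data(line) }
end
```

As long as the scraper agrees with you on column order and a header row, the import stays a few lines per table.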