Data definition assistance:
So I was talking to some data analyst acquaintances of mine the other day and described my setup:
I have set up mechanisms where large invoices received every month are imported into an Access database, processed by about 10 procedures, then loaded into a SQL database for permanent storage, which people run queries against for long-term stats, CPRAs, short-term reports, etc.
I also have long SAS procedures that harvest information from a very large Oracle database, process it, then upload it to a different SQL database for demographics and statistical mapping.
One of them said they were rudimentary "data pipelines" (the other laughed). I looked up the term when I got home and it looks like, on the face of it, they are data pipelines. They extract data, transform it, and load it for long term storage. Many people use the outputs and they are reported to internal company managers and public agencies (municipal, state, federal).
So, are they data pipelines, or are they too rudimentary to be classified as such? The reason it's important to me is that I'd like to be able to describe them that way in interviews. Thanks in advance.
Related
I have a video surveillance project running on cloud infrastructure and using a MySQL database.
We are now integrating some artificial intelligence into the project, including face recognition, plate recognition, tag search, etc., which implies a huge amount of data every day.
All the photos, and the images derived from those photos by image processing algorithms, are stored in cloud storage, but their references and tags are stored in the database.
I have been thinking about the best way to integrate this: should I stick with MySQL or use another system? The options I have considered are:
1- Use another database, MongoDB, to store the photo references and tags. This would cost me another database server, as well as integrating a new database system alongside the existing MySQL server.
2- Use Elasticsearch to retrieve data and perform tag searches. This raises the question of whether MySQL can store this amount of data.
3- Stick with MySQL only, but will the user experience be impacted?
Could you guide me to the best option, or give me another proposal?
EDIT:
For more information:
The physical pictures are stored in cloud storage, only the URLs are stored in the database.
In the database, we will store the picture metadata: id, client id, URL, tags, creation date, etc.
Operations are of the type:
Generally SELECTs based on different criteria and searches by tags.
How big is the data?
Imagine a camera placed outdoors in the street that sends an image each time it detects a face.
Imagine thousands of cameras doing so. Then we are talking about millions of images per client.
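To make that concrete, here is a rough sketch of the kind of MySQL schema I have in mind (table and column names are placeholders, not our actual design):

    CREATE TABLE image (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        client_id  INT UNSIGNED    NOT NULL,
        url        VARCHAR(2048)   NOT NULL,  -- reference to the object in cloud storage
        created_at DATETIME        NOT NULL,
        PRIMARY KEY (id),
        KEY idx_client_created (client_id, created_at)
    ) ENGINE=InnoDB;

    -- tags kept in a separate table so an image can carry any number of them
    CREATE TABLE image_tag (
        image_id BIGINT UNSIGNED NOT NULL,
        tag      VARCHAR(64)     NOT NULL,
        PRIMARY KEY (image_id, tag),
        KEY idx_tag_image (tag, image_id),
        FOREIGN KEY (image_id) REFERENCES image (id)
    ) ENGINE=InnoDB;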
MySQL can handle billions of rows. You have not provided enough other information to comment on the rest of your questions.
Large blobs (images, videos, etc) are probably best handled by some large, cheap, storage. And then, as you say, a URL to the blob would be stored in the database.
How many rows? How frequently inserting? Some desired SELECT statements? Is it mostly just writing to the database? Or will you have large, complex, queries?
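(By "desired SELECT statements" I mean something like the following, written against the hypothetical image/image_tag tables sketched above:)

    -- latest images for one client carrying a given tag (illustrative values)
    SELECT i.id, i.url, i.created_at
    FROM image AS i
    JOIN image_tag AS t ON t.image_id = i.id
    WHERE i.client_id = 42
      AND t.tag = 'face'
    ORDER BY i.created_at DESC
    LIMIT 50;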
I would like to get some advice on our BI architecture, which is pretty complex to maintain.
I work at an e-shopping company; our production environment runs on a LAMP stack (Linux, Apache, MySQL, PHP).
Every night:
data from our production DB (MySQL) is extracted with Talend and then inserted into another MySQL database named DWH, for data warehouse
data from this DWH is then extracted by Jedox/Palo to load OLAP cubes, which are consumed in Excel through a plugin for reporting
data from this DWH is also accessed by one-off Access/Excel reports, but this is not working very well
Each time we need to modify an existing workflow or create a new report, there are a lot of steps and different technologies involved, which leaves us with a pretty complicated platform.
What can I do to simplify this process?
You should be able to load the Palo OLAP tables with Talend as part of the data warehouse load process, using the Palo components provided, which should cut down on doing ETL separately from your main ETL process.
Roberto Machetto's blog has some interesting tutorials on how to do this, and I'm sure there are plenty more examples across the internet; of course, you could also ask here about any specific issues you're having.
Once all of your data is properly loaded into the data warehouse and any OLAP cubes, your users should be able to run any queries, bespoke or otherwise, against the data as it's stored. If you're seeing instances where users don't have access to the right data for their analysis, that should be resolved in the initial data warehouse/OLAP cube load through a proper understanding of dimensional modelling.
It's a little bit difficult to give advice about what to do or not to do; it depends on your final target or objective. What I can recommend is to separate your data into stages before delivering your OLAP cubes.
For example, you could create fact and dimension tables in the DW database you already have, so you could separate your data into subjects, which would make building reports much easier, since anyone could group the data as needed. You could have a fact table just for sales, another for churn, another for new customers, and so on...
Take a look at fact and dimension tables, or dimensional modeling in general; it will make your daily work a lot easier.
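As a rough illustration of the idea (table and column names below are invented, not taken from your system), a sales star schema in MySQL could be sketched like this:

    -- minimal star schema sketch: one fact table plus two dimensions
    CREATE TABLE dim_date (
        date_key   INT      NOT NULL PRIMARY KEY,  -- e.g. 20240131
        full_date  DATE     NOT NULL,
        year_num   SMALLINT NOT NULL,
        month_num  TINYINT  NOT NULL
    );

    CREATE TABLE dim_customer (
        customer_key INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        customer_id  INT NOT NULL,          -- natural key from the prod DB
        country      VARCHAR(64)
    );

    CREATE TABLE fact_sales (
        order_id     INT NOT NULL PRIMARY KEY,
        date_key     INT NOT NULL,
        customer_key INT NOT NULL,
        amount       DECIMAL(12,2) NOT NULL,
        KEY (date_key),
        KEY (customer_key)
    );

Reports then just join the fact table to whichever dimensions they need and group by the attributes of interest.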
Some links:
http://en.wikipedia.org/wiki/Dimensional_modeling
http://www.learndatamodeling.com/ddm.php
If SSAS in MOLAP mode stores data, what is the application of SSIS and why do we need a Data Warehouse and the ETL process of SSIS?
I have a SQL Server OLTP database. I am using SSIS to transfer data from the OLTP database to a Data Warehouse database that contains fact and dimension tables.
After that, I want to create cubes with SSAS from the Data Warehouse data.
I know that MOLAP stores data. Do I need a Data Warehouse with fact and dimension tables at all?
Isn't it better to avoid creating a Data Warehouse and create cubes directly from the OLTP database?
This might be a candidate for "Too Broad" but I'll give it a go.
Why would I want to store my data 3 times?
I have my data in my OLTP (online transaction processing) system; why would I want to move that data into a completely new structure (data warehouse) and then move it again into an OLAP system?
Let's start simple. You have only one system of record and it's not amazingly busy. Maybe you can get away with an abstraction layer (views in the database or named queries in SSAS) and skip the data warehouse.
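That abstraction layer can be as thin as a few views over the OLTP tables presenting them in a dimensional shape for SSAS; a hypothetical sketch (object names invented):

    -- view exposing OLTP orders in a fact-like shape, read directly by SSAS
    CREATE VIEW dbo.vw_FactOrders
    AS
    SELECT
        o.OrderID,
        o.CustomerID,
        CAST(o.OrderDate AS date) AS OrderDate,
        o.TotalAmount
    FROM dbo.Orders AS o;   -- OLTP table; no warehouse involved yet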
So, you build out your cubes and dimensions, people start using them, and they love it.
"You know what'd be great? If we could correlate our Blats to the Foos and Bars we already have in there" Now you need to integrate your simple app with data from a completely unrelated app. Customer id 10 in your app is customer id {ECA67697-1200-49E2-BF00-7A13A549F57D} in the CRM app. Now what? You're going to need to present a single view of the Customer to your users or they will not use the tool.
Maybe you rule with an iron fist and say No, you can't have that data in the cube and your users go along with it.
"Do people's buying habits change after having a child?" We can't answer that because our application only stores the current version of a customer. Once they have a child, they've always had a child so you can't cleanly identify patterns before or after an event.
"What were our sales like last year" We can't answer that because we only keep a rolling 12 weeks of data in the app to make it manageable.
"The data in the cubes is stale, can you refresh it?" Egads, it's the middle of the day. The SSAS processing takes table locks and would essentially bring our app down until it's done processing.
Need I go on with these scenarios?
Summary
The data warehouse serves as an integration point for diverse systems. It has conformed dimensions (everyone has a common definition for what a thing is). The data in the warehouse may exceed the lifetime of the data in the source systems. The business needs might drive the tracking of data that the source application does not support. The data in the DW supports business activities while your OLTP system supports itself.
SSIS is just a tool for moving data. There are plenty out there, some better, some worse.
So No, generally speaking, it is not better to avoid creating a DW and build your cubes based on your OLTP database.
I've gotten confused with the warehousing process... I'm in the process of building a data mart, but the part I don't really understand relates to cubes. I've read some tutorials about SSAS, but I don't see how I can use that data in other applications. What I need is the following:
A warehouse (data mart) that contains all data needed for analysis (drill-down and aggregated data, like daily revenue and YTD revenue)
A .NET web service that can take this data so that many different apps can use it
The part I don't understand is cubes. I see that many people use SSAS to build cubes. What are these cubes in SSAS? Are they objects? Are they tables where data is stored? How can my web service access the data in cubes?
Are there any alternatives to SSAS? Would it be practical to just build cubes in the data mart and load them during the ETL process?
Cubes are pre-aggregated stores of data, in a format that makes reporting much more efficient than is possible in a relational database store. In SSAS you have several choices for how your data is ultimately stored, but generally it is stored in files in the OS file system. Cubes can be queried similarly to SQL (using a specialized query language called MDX) or by several other methods, depending on your version. You can set up connections to the data for your web service using the appropriate drivers from Microsoft.
I am unsure what you mean by data mart. Are you referring to relational tables in a star schema format? If so, these are generally precursors to the actual cube. You will not get as much benefit from a reporting standpoint by using these relational sources as you would from a cube, since a cube stores the aggregates of each node (or tuple) within the dimensional space defined by your star schema tables.
To explain this: if I have a relational store (even in star schema format) and I want to get sales dollars for a particular location on a particular date, I have to run a query against a very large sales fact table and join the location and date dimension tables (which may also be very large). If I want the same data from a cube, I define my cube filters and the query pulls that single tuple from the cube and returns it much more quickly.
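To make the relational side of that example concrete, the "sales dollars for a location on a date" query against a star schema might look something like this (table and column names are assumed, not from your mart):

    -- the star join a cube would have pre-aggregated for you
    SELECT SUM(f.SalesAmount) AS SalesDollars
    FROM FactSales   AS f
    JOIN DimLocation AS l ON l.LocationKey = f.LocationKey
    JOIN DimDate     AS d ON d.DateKey     = f.DateKey
    WHERE l.LocationName = 'Downtown Store'   -- illustrative values
      AND d.FullDate     = '2015-06-01';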
There are many alternatives to SSAS, but each would be some form of cube if you are using a data warehouse. If you have a large data set, a properly designed cube will outperform a relational data mart for multidimensional queries.
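On the "build cubes in the data mart during ETL" idea: in practice that usually means pre-aggregated summary tables refreshed by the ETL job, which your .NET web service can then read with plain SQL instead of MDX. A sketch, with invented names:

    -- daily revenue summary rebuilt by the nightly ETL run
    CREATE TABLE AggDailyRevenue (
        RevenueDate DATE          NOT NULL PRIMARY KEY,
        Revenue     DECIMAL(14,2) NOT NULL
    );

    INSERT INTO AggDailyRevenue (RevenueDate, Revenue)
    SELECT d.FullDate, SUM(f.SalesAmount)
    FROM FactSales AS f
    JOIN DimDate   AS d ON d.DateKey = f.DateKey
    GROUP BY d.FullDate;

YTD revenue is then just a SUM over this table filtered to the current year, but you lose the flexible slicing a real cube gives you.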
What do you think is an appropriate data store for sensor data such as temperature, humidity, velocity, etc.? Users will have different sets of fields, and they need to be able to run basic queries against the data.
Relational databases like MySQL are not flexible in terms of schema, so I was looking into NoSQL approaches, but I'm not sure whether any particular project is especially suitable for sensor data. Most NoSQL systems seem to be geared toward log output and the like.
Any feedback is appreciated.
Thanks!
I still think I would use an RDBMS, simply because of the flexibility you have to query the data. I am currently developing applications that log approximately one thousand remote sensors to SQL Server, and though I've had some growing pains due to the "inflexible" schema, I've been able to expand it in ways that provide for multiple customers. I provide enough data columns that, collectively, they can handle a vast assortment of sensor values, and each user just queries the fields that interest them (or that their sensors have features for).
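To illustrate what I mean by generic columns (this is just the shape of the idea, not my actual schema):

    -- one row per reading; generic value columns cover different sensor types
    CREATE TABLE SensorReading (
        ReadingID   BIGINT IDENTITY(1,1) PRIMARY KEY,
        SensorID    INT       NOT NULL,
        CustomerID  INT       NOT NULL,
        ReadingTime DATETIME2 NOT NULL,
        Value1      FLOAT     NULL,  -- e.g. temperature for one sensor model
        Value2      FLOAT     NULL,  -- e.g. humidity or velocity for another
        Value3      FLOAT     NULL
    );
    -- each customer queries only the columns their sensors populate, e.g.
    --   SELECT ReadingTime, Value1 FROM SensorReading WHERE SensorID = 7;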
That said, I originally began this project by writing to a comma separated values (CSV) file, and writing an app that would parse it for graphing, etc. Each sensor stored its data in a separate CSV, but by opening multiple files I was able to perform comparisons between two or more sets of sensor data. Also CSV is highly compressible, opens in major office applications, and is fairly user-editable (such as trimming sections that you don't need).