Am I missing something? It seems to be the case that there is just master data. I have researched a lot, but all I find is very detailed information.
Could it be true that MM contains just master data, and that the transactional data is stored in SD tables?
I'm looking for data about, for example, how many products of type X are in plant Y (I need key figures for an analysis in SAP Analytics Cloud as preparation for projects).
We have a SaaS solution in which each tenant has its own MySQL database. Now I'm designing the dashboards of this SaaS system, and they require some analytical charts. To get the data needed for the charts we could query each tenant's transactional data from its database in real time and get updated charts with acceptable performance, since so far the data volume is not that big. However, because the data volume will keep growing, we decided to separate each tenant's analytical and transactional data: we will compute the analytical data for the charts in the background, save/cache it, and refresh it periodically. My question is:
What questions or factors should we consider before deciding whether we need to introduce a data warehouse and data modeling from the beginning, or simply cache the analytical chart data produced by our API in JSON columns in a new charts table in each tenant's MySQL database?
Instead of reaching into the "Fact" table for millions of rows, build and maintain a Summary table, then fetch from that. It may run 10 times as fast.
This does require code changes because of the extra table, but it may be well worth it.
Summary Tables
In other words, if the dataset will become bigger than X, summary tables are the best solution. Caching will not help. Hardware won't be sufficient. JSON only gets in the way.
Building a year-long graph from a year's worth of data points (one per second) is slow and wasteful. Building a year-long graph from daily subtotals is much more reasonable.
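As a minimal sketch of the idea, with every table and column name below being an assumption rather than your actual schema:

-- Hypothetical summary table: one row per tenant per day
CREATE TABLE daily_order_summary (
    tenant_id   INT  NOT NULL,
    order_date  DATE NOT NULL,
    order_count INT  NOT NULL,
    revenue     DECIMAL(18,2) NOT NULL,
    PRIMARY KEY (tenant_id, order_date)
);

-- Periodic (e.g. nightly) roll-up of yesterday's fact rows
INSERT INTO daily_order_summary (tenant_id, order_date, order_count, revenue)
SELECT tenant_id, DATE(created_at), COUNT(*), SUM(amount)
FROM   orders
WHERE  created_at >= CURDATE() - INTERVAL 1 DAY
  AND  created_at <  CURDATE()
GROUP BY tenant_id, DATE(created_at)
ON DUPLICATE KEY UPDATE order_count = VALUES(order_count),
                        revenue     = VALUES(revenue);

-- Charts read the small summary table instead of scanning the fact table
SELECT order_date, revenue
FROM   daily_order_summary
WHERE  tenant_id = 42
ORDER BY order_date;

The roll-up can run off-peak or against a replica, so the dashboards never have to touch the raw transactional rows.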
I have graph-like data in SQL. The data can be described as:
products table - list of SKUs classified into two (2) classes:
Class 1: non-vehicle-specific (universally fits all vehicles)
Class 2: vehicle-specific (custom-fit to a specific set of vehicles)
1 SKU fits one or more vehicles (YMMSE)
vehicle master table (year, make, model, submodel, engine) aka YMMSE
e.g.
2014 Ford Fiesta S 4 Cylinder, 1.6L
applications table - relationship between custom-fit products and the corresponding vehicle YMMSEs.
I have an applications table that runs into gigabytes, with approximately 85 million records.
The problem is that querying for the vehicle YMMSEs of a specific SKU takes a long time in SQL, especially for SKUs that have a lot of application mappings, aka "almost-universal" SKUs.
The applications table gets updated frequently, so I need to re-run these expensive queries every time, to the point where the MySQL server is almost giving up or replication delays result.
The question is:
Would a distributed processing framework like Hadoop or Spark be able to help me speed up the discovery of SKU-specific vehicle mappings?
Regards,
Jun
Frameworks like Hadoop or Spark can help take some pressure off your database, but they are not designed for low-latency operations. If the data is graph-like and the queries represent some type of graph traversal, you'd be better off with dedicated tools such as a graph database.
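For context, the lookup in question boils down to a join along these lines (table and column names are guesses); before adopting another engine it is worth confirming that a composite index covers it:

-- Vehicles fitted by a given SKU (assumed schema)
SELECT v.year, v.make, v.model, v.submodel, v.engine
FROM   applications a
JOIN   vehicles v ON v.vehicle_id = a.vehicle_id
WHERE  a.sku = 'ABC-123';

-- A composite index lets MySQL satisfy the WHERE and the join from the index alone
CREATE INDEX idx_applications_sku_vehicle ON applications (sku, vehicle_id);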
I have searched everywhere on the web to find out how I can import data into a star schema data warehouse. A lot of the material online explains the design of the star schema and the data warehouse, but none of it explains how exactly data is loaded into the DW. Here is what I've done so far:
I am trying to make an application for high school basketball statistics for each player.
I have:
A list of all of the players (name, height, position, and number)
A list of all of the high schools
A list of all of the schedules
A list of conferences
Statistics (points, rebounds, steals, games played, etc.) for each player for the current year
I assume the stats would be my fact table and the rest are my dim tables.
Now the million-dollar question -- how in the world do I get the data into that format appropriately?
I tried simply importing them into their respective tables but don't know how they connect.
Example: there are 800 players and 400 schools. Each school has a unique id (primary key). I upload the players into dim players and the schools into dim schools. Now how do I connect them?
Please help. Thanks in advance. Sorry for the rambling :)
There are many ways of importing data into a database: using built-in loaders, scripts, or, what is most commonly used in DW environments, an ETL tool.
About your fact table: I think the stats are metrics, not the transaction. In other words, you measure a transaction, not a metric itself.
Using an ETL tool (E - extract your data from your sources, T - transform your data, i.e. manipulate it into the shape you want, L - load the data into your DW) you can safely and reliably get your data loaded into your DW.
You can use ETL tools such as SSIS, Talend, etc.
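If you end up scripting the L step yourself, the core of the load is usually just an INSERT ... SELECT that joins the staged data to the dimension tables to pick up their surrogate keys. A rough sketch, with all table and column names assumed for illustration:

-- Load the fact table by joining the staged stats to the dimensions
-- to resolve their surrogate keys (all names here are assumptions)
INSERT INTO fact_player_stats (player_key, school_key, points, rebounds, steals, games_played)
SELECT p.player_key,
       s.school_key,
       st.points, st.rebounds, st.steals, st.games_played
FROM   staging_stats st
JOIN   dim_player p ON p.player_name = st.player_name
JOIN   dim_school s ON s.school_name = st.school_name;

An ETL tool such as SSIS does essentially the same thing for you with its Lookup transformation.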
Yes, "star", "dim", "fact", and "data warehouse" are appropriate terms, but I would rather approach it from "entities" and "relationships"...
You have essentially defined 5 "Entities". Each Entity is (usually) manifested as one database table. Write the CREATE TABLEs. Be sure to include a PRIMARY KEY for each; it will uniquely identify each row in the table.
Now think about relationships. Think about 1:many, such as 1 high school has 'many' players. Think about many:many.
For 1:many, you put, for example, the id of the high school as a column in the player table.
For many:many you need an extra table. Write the CREATE TABLEs for any of those you may need.
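A rough sketch of what those CREATE TABLEs could look like for the school/player case (all names are placeholders, and the per-game stats table is only an assumed example of the many:many pattern):

CREATE TABLE dim_school (
    school_id   INT PRIMARY KEY,
    school_name VARCHAR(100) NOT NULL
);

CREATE TABLE dim_player (
    player_id  INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    height_in  INT,
    position   VARCHAR(20),
    number     INT,
    school_id  INT NOT NULL,  -- 1:many: one school has many players
    FOREIGN KEY (school_id) REFERENCES dim_school (school_id)
);

-- many:many example: if stat lines were kept per player per game,
-- the stats table itself would be the "extra table" linking them
CREATE TABLE fact_player_game_stats (
    player_id INT NOT NULL,
    game_id   INT NOT NULL,   -- would reference a schedule/game table (not shown)
    points    INT,
    rebounds  INT,
    steals    INT,
    PRIMARY KEY (player_id, game_id),
    FOREIGN KEY (player_id) REFERENCES dim_player (player_id)
);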
Now, read the data, and do INSERTs into the appropriate table.
After that, you can think about the SELECTs to extract interesting data. At the same time, decide what INDEX(es) will be useful. But that is another discussion.
When you are all finished, you will have learned a bunch about SQL, and may realize that some things should have been done a different way. So, be ready to start over. Think of it as a learning exercise.
You can use SQL Server Data Tools for this project.
SQL Server Data Tools consists of SSIS, SSAS, and SSRS.
Use SSIS to create an ETL process for the data in your database.
Use SSAS to create dimensions, fact tables, and cubes (you can do a lot more with this).
Use SSRS to present the data in a user-friendly way.
Plenty of videos are available on YouTube.
I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource with every collection to which it is relevant. Assume that typically a resource is relevant to about half of the 1,000 collections (200 users x 5 collections each), so that's 30,000 x (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage is a 12-hour window. That's 0.23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
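Fully normalized here amounts to little more than a table per entity plus a junction table for the many-to-many side, along these lines (column names are assumptions based on your description):

CREATE TABLE collection (
    collection_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    user_id       BIGINT NOT NULL,
    name          VARCHAR(100),  -- plus the other descriptive fields you mention
    INDEX (user_id)              -- collections are always fetched per user
);

CREATE TABLE resource (
    resource_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    title       VARCHAR(100),    -- plus the other descriptive fields
    consumed_at DATETIME NOT NULL
);

-- Junction table for the many-to-many side: a resource can sit in many collections
CREATE TABLE collection_resource (
    collection_id BIGINT NOT NULL,
    resource_id   BIGINT NOT NULL,
    PRIMARY KEY (collection_id, resource_id),
    FOREIGN KEY (collection_id) REFERENCES collection (collection_id),
    FOREIGN KEY (resource_id)   REFERENCES resource (resource_id)
);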
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data. Write some benchmark transactions. Run them under load to benchmark the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like any of the requirements of your problem.
MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported.
The general idea of the problem is that the data is arranged in the following three columns in a table:
"Entity" "parent entity" "value"
A001 B001 .10
A001 B002 .15
A001 B003 .2
A001 B004 .3
A002 B002 .34
A002 B003 .13
..
..
..
A002 B111 .56
There is a graph of entities, and the values can be seen as weights of directed edges from parent entity to entity. I have to calculate how many different subsets of the parent entities of a particular entity have a value sum greater than .5 (say), in order to calculate something further (the later part is easy, not computationally complex).
The point is that the data is huge (Excel says data was lost :( ). Which language or tool can I use? Some people have suggested SAS or Stata.
Thanks in advance
You can do this in SQL. Two options for the desktop (without having to install a SQL server of some kind) are MS Access or OpenOffice Database. Both can read CSV files into a database.
In there, you can run SQL queries. The syntax is a bit odd but this should get you started:
select ParentEntity, sum(Value)
from Data
group by ParentEntity
having sum(Value) > .5
Data is the name of the table into which you loaded the data; ParentEntity and Value are the names of columns in the Data table. (Note that a filter on an aggregate belongs in HAVING, after GROUP BY, rather than in WHERE.)
If you're considering SAS you could take a look at R, a free language / environment used for data mining.
I'm guessing that the table you refer to is actually in a file, and that the file is too big for Excel to handle. I'd suggest that you use a language that you know well. Of those you know, select the one with these characteristics:
-- able to read files line by line;
-- supports data structures of the type that you want to use in memory;
-- has good maths facilities.
SAS is an excellent language for quickly processing huge datasets (hundreds of millions of records in which each record has hundreds of variables). It is used in academia and in many industries (we use it for warranty claims analysis; many clinical trials use it for statistical analysis & reporting).
However, there are some caveats: the language has several deficiencies, in my opinion, which make it difficult to write modular, reusable code (there is a very rich macro facility, but no user-defined functions until version 9.2). Probably a bigger caveat is that a SAS license is very expensive; thus, it probably wouldn't be practical for a single individual to purchase a license for their own experimentation, though the price of a license may not be prohibitive for a large company. Still, I believe SAS sells a learning edition, which is likely less expensive.
If you're interested in learning SAS, here are some excellent resources:
Official SAS Documentation: http://support.sas.com/documentation/onlinedoc/base/index.html
SAS White Papers / Conference Proceedings: http://support.sas.com/events/sasglobalforum/previous/online.html
SAS-L Newsgroup (much, much more activity regarding SAS questions than here on Stack Overflow): http://listserv.uga.edu/cgi-bin/wa?A0=sas-l&D=0
There are also regional and local SAS user groups, from which you can learn a lot (for example, in my area there are MWSUG (Midwest SAS Users Group) and MISUG (Michigan SAS Users Group)).
If you don't mind really getting into a language and using some operating-system-specific calls, C with memory-mapped files is very fast.
You would first need to write a converter that translates the textual data you have into a memory-mapped file, and then a second program that maps the file into memory and scans through the data.
I hate to do this, but I would recommend simply using C. What you need is to figure out your problem in the language of math, then implement it in C. How to store a graph in memory is a large research area: you could use an adjacency matrix if the graph is dense (highly connected), or an adjacency list if it is not. Each of the subtree searches will be some fancy code, and it might be a hard problem.
As others have said, SQL can do it, and the code has even been posted. If you need help putting the data from a text file into a SQL database, that's a different question. Look up bulk data inserts.
The problem with SQL is that even though it is a wonderfully succinct language, it is parsed by the database engine and the underlying code might not be the best method. For most data access routines, the SQL database engine will produce some amazing code efficiencies, but for graphs and very large computations like this, I would not trust it. That's why you go to C. Some lower level language that makes you do it yourself will be the most efficient.
I assume you will need efficient code due to the bulk of the data.
All of this assumes the dataset fits into memory. If your graph is larger than your workstation's RAM (get one with 24 GB if you can), then you should find a way to partition the data so that it does fit.
Mathematica is quite good in my experience...
Perl would be a good place to start; it is very efficient at handling file input and string parsing. You could then hold the whole set in memory, or only the subsets.
SQL is a good option. Database servers are designed to manage huge amounts of data, and are optimized to use every resource available on the machine efficiently to gain performance.
Notably, Oracle 10 is optimized for multi-processor machines, automatically splitting requests across processors if possible (with the correct configuration, search for "oracle request parallelization" on your favorite search engine).
This solution is particularly efficient if you are in a big organization with good database servers already available.
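For illustration, assuming the data were loaded into the same kind of Data table as in the SQL answer above and that parallel query is enabled on the instance, a parallel hint on Oracle looks roughly like this (the alias and the degree of parallelism are arbitrary choices):

-- Ask Oracle to scan and aggregate the assumed Data table with 4 parallel slaves
SELECT /*+ PARALLEL(d, 4) */ ParentEntity, SUM(Value)
FROM   Data d
GROUP BY ParentEntity
HAVING SUM(Value) > 0.5;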
I would use Java's BigInteger library and something functional, like say Hadoop.
At least a simple SQL statement won't work (please read the problem carefully); I need to find the sum of all subsets and check whether the sum of the elements of each set is > .5 or not. thanks – asin Aug 18 at 7:36
Since SAS and Stata were suggested, here is the code to do what you ask in Stata (paste this code into your do-file editor):
//input the data
clear
input str10 entity str10 parent_entity value
A001 B001 .10
A001 B002 .15
A001 B003 .2
A001 B004 .3
A002 B002 .34
A002 B003 .13
A002 B111 .56
end
//create a var. for sum of all subsets
bysort entity : egen sum_subset = total(value)
//flag the sets that sum > .5
bysort entity : gen indicator = 1 if sum_subset>.5
recode indicator (.=0)
lab def yn 1 "Yes", modify
lab def yn 0 "No", modify
lab val indicator yn
li *, clean
Keep in mind that when using Stata, your data is kept in memory, so you are limited only by your system's memory resources. If you try to open your .dta file and it says 'op. sys refuses to provide mem', then you need to use the command -set mem- to increase the memory available before loading the data.
Ultimately, StefanWoe's question:
May you give us an idea of HOW huge the data set is? Millions? Billions of records? Also an important question: Do you have to do this only once? Or every day in the future? Or hundreds of times each hour? – StefanWoe Aug 18 at 13:15
This really drives your question more than which software to use... Automating this with Stata, even on an immense amount of data, wouldn't be difficult, but you could max out your resource limits quickly.