IF I wanted to predict future purchases in online shopping using historical data, do I need data science or data analysis or big data? - data-analysis

I wanted to learn to predict future events like......being able to predict number of plane crashes in 2018 using past two decades of plane crash data.....or.....predict how many tee-shirts with justin beibers face on it will be sold by 2018 depending upon fan base from previuos data..........or how many iphones 8's and samsungs s9's will be sold if they decide to launch on the same exact date....predicting somewhat accurate whole sale market.....stuff like that....please suggest a book...i really love head first series....is head first data analysis right for me? ....I dont lnow if i can ask questions other than programming here or not.....but here i am.....By the way does big data have anything to do with this?

it all falls in the category of data science (which is big data and data analysis). What you need for predictions and such stuff is some machine learning approach to data you have or can access about stuff you want to predict.
I'd recommend this, newest series of articles: https://medium.com/machine-learning-for-humans/why-machine-learning-matters-6164faf1df12
Apart from really nice intro, you'll find lots of resources for further learning there.
Also I highly recommend deeplearning.ai and machine learning course from Stanford you can find on Coursera.
Cheers!

I think most of the scenarios that you have asked are a case of Supervised Learning which is a type of machine learning, wherein you have previous data to train your machine learning model with the input and output values and once you have trained a model you feed new input values and it gives you the output which is the prediction.
I would highly recommend the following Machine Learning course by Andrew NG which on Coursera which covers all the basics of ML including Supervised and Unsupervised Learning.
https://www.coursera.org/learn/machine-learning
As for the books the following link from Analytics Vidya is a great place to start with, you can go through the books as they can give you some good basics of statistics and data sciences.
https://www.analyticsvidhya.com/blog/2015/10/read-books-for-beginners-machine-learning-artificial-intelligence/
As for the differences between Data Science, Data Analytics and Big Data. Data science and data analytics are similar in the sense that they both try to find patterns in data and based on those pattern you derive some insights.
Big data on the other hand is basically Data of huge size which is distributed across multiple machines, so you can store and compute large amount of data simultaneously and in parallel.
So you may ask how is big data and machine learning related? well the answer lies in the training of machine learning model, since the accuracy of prediction is to a certain extent depends on the amount of data you train it on. So more the training data better the predictions and in terms of quantity big data way ahead of others, hence the relation.

Related

How to analyze quantitative research data using SPSS

My team is doing a quantitative research about factors affecting university students when opting for secondhand clothes via the e-commerce platforms like Shopee, Lazada, etc. I'm taking charge of dealing with the data generated into Google Sheet from people filling the research form. Thus, my questions are:
How to load data into SPSS
What aspects do I have to analyze?
What to conclude after testing the hypothesises?
I haven't tried anything yet, and don't really know which direction I would opt for when analyzing the data.

IR Offline Metric Precision#k in Topic Modeling

Good day everyone! I'm currently learning LDA and I curious on how to validate the result of the topic model.
I've read the statement on this paper https://arxiv.org/pdf/2107.02173.pdf p.10
To the extent that our experimentation accurately represents current practice, our results do suggest that topic model evaluation—both automated and human—is overdue for a careful reconsideration. In this, we agree with Doogan and Buntine (2021), who write that “coherence measures designed for older models [. . . ] may be incompatible with newer models” and instead argue for evaluation paradigms centered on corpus exploration and labeling. The right starting point for this reassessment is the recognition that both automated and human evaluations are abstractions of a real-world problem. The familiar use of precision-at-10 in information retrieval, for example, corresponds to a user who is only willing to consider the top ten retrieved documents.
Suppose you have a corpus of a play store app user reviews, after a topic model processes this corpus, let's say it generates K=10 topics. How are we gonna use precision#10 offline metric to evaluate the result(10 topics)?

Azure Machine Learning Data Transformation

Can machine learning be used to transform/modifiy a list of numbers.
I have many pairs of binary files read from vehicle ECUs, an original or stock file before the vehicle was tuned, and a modified file which has the engine parameters altered. The files are basically lists of little or big endian 16 bit numbers.
I was wondering if it is at all possible to feed these pairs of files into machine learning, and for it to take a new stock file and attempt to transform or tune that stock file.
I would appreciate it if somebody could tell me if this is something which is at all possible. All of the examples I've found appear to make decisions on data rather than do any sort of a transformation.
Also I'm hoping to use azure for this.
We would need more information about your specific problem to answer. But, supervised machine learning can take data with a lot of inputs (like your stock file, perhaps) and an output (say a tuned value), and learn the correlations between those inputs and output, and then be able to predict the output for new inputs. (In machine learning terminology, these inputs are called "features" and the output is called a "label".)
Now, within supervised machine learning, there is a category of algorithms called regression algorithms. Regression algorithms allow you to predict a number (sounds like what you want).
Now, the issue that I see, if I'm understanding your problem correctly, is that you have a whole list of values to tune. Two things:
Do those values depend on each other and influence each other? Do any other factors not included in your stock file affect how the numbers should be tuned? Those will need to be included as features in your model.
Regression algorithms predict a single value, so you would need to build a model for each of the values in your stock file that you want to tune.
For more information, you might want to check out Choosing an Azure Machine Learning Algorithm and How to choose algorithms for Microsoft Azure Machine Learning.
Again, I would need to know more about your data to make better suggestions, but I hope that helps.

Interesting NLP/machine-learning style project -- analyzing privacy policies

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify core characteristics of them. For example, do they take the user's location?, do they share/sell with third parties?, etc.
I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:
First, read a lot of privacy and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line: "We will take your location.", that line could be a cue with 100% confidence that that privacy policy includes taking of the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic.. For example, the presence of the word "location" might increase the likelihood that the user's location is store by 25%.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using. Any input is welcome. This is my first time doing a project which touches on artificial intelligence, specifically machine learning and NLP.
The idea would be to keep developing these cues, and their appropriate confidence intervals to the point where I could categorize all privacy policies with a high degree of confidence. An analogy here could be made to email-spam catching systems that use Bayesian filters to identify which mail is likely commercial and unsolicited.
This is text classification. Given that you have multiple output categories per document, it's actually multilabel classification. The standard approach is to manually label a set of documents with the classes/labels that you want to predict, then train a classifier on features of the documents; typically word or n-gram occurrences or counts, possibly weighted by tf-idf.
The popular learning algorithms for document classification include naive Bayes and linear SVMs, though other classifier learners may work too. Any classifier can be extended to a multilabel one by the one-vs.-rest (OvR) construction.
A very interesting problem indeed!
On a higher level, what you want is summarization- a document has to be reduced to a few key phrases. This is far from being a solved problem. A simple approach would be to search for keywords as opposed to key phrases. You can try something like LDA for topic modelling to find what each document is about. You can then search for topics which are present in all documents- I suspect what will come up is stuff to do with licenses, location, copyright, etc. MALLET has an easy-to-use implementation of LDA.
I would approach this as a machine learning problem where you are trying to classify things in multiple ways- ie wants location, wants ssn, etc.
You'll need to enumerate the characteristics you want to use (location, ssn), and then for each document say whether that document uses that info or not. Choose your features, train your data and then classify and test.
I think simple features like words and n-grams would probably get your pretty far, and a dictionary of words related to stuff like ssn or location would finish it nicely.
Use the machine learning algorithm of your choice- Naive Bayes is very easy to implement and use and would work ok as a first stab at the problem.

What "other features" could be incorporated into a train database?

This is a mini project for DBMS course. My task is to develop a Database for management of passenger trains.
I'm designing tables for Customers, Trains, Ticket Booking (via Telephone & Internet), Origins and Destinations.
He said, we are free to incorporate other features in our Database Model. Some of the features that we can include are as listed:
Ad-hoc Querying
Data Mining
Demographic Passenger Mapping
Origin and Destination Mapping
I've no clue about what these features mean. I know about datamining but unable to apply it in this context. Can any one kindly expand these features or suggest new ideas?
EDIT: What is Ad-hoc Querying? Give an example in this context.
Data mining would incorporate extracting useful facts/figures out of the data obtained by your system & stored in the database. For example, data mining might discover that trains between city x and y are always 5 minutes late, or is never at more than 50% capacity, etc. So you may wish to develop some tools or scripts that automatically run and generate statistics (graphs are best) which display this information and highlight unusual trends. In the given example, the schedulers could then analyse why the trains are always late (e.g., maybe the train speedos are wrong?).
Both points 3. and 4. are a subset of data mining imo. There is a huge amount of metrics you could try to measure, it is just really whatever you can think of. If you specify what type of data you are going to collect, that will make making suggestions easier.
Basically, data mining just means "sort the data to find interesting facts".
Based on comment below you could look for,
% of internet vs. phone sales
popular destinations & origins
customers age/sex/location
usage vs. time of day
...