Alternate methods to supply data for machine learning (Other than using CSV files)

I have a question relating to machine learning applications in the real world. It might sound stupid lol.
I've been self-studying machine learning for a while, and most of the exercises use a CSV file as the data source (both processed and raw). I would like to ask: are there any methods other than importing a CSV file to channel/supply data for machine learning?
Example: streaming Facebook/Twitter live feed data for machine learning in real time, rather than collecting old data and storing it in a CSV file.

The data source can be anything. Usually it's provided as a CSV or JSON file. But in the real world, say you have a website such as Twitter, as you're mentioning: you'd be storing your data in a relational DB such as a SQL database, and some of the data you'd be putting in an in-memory cache.
You can use both of these to retrieve your data and process it. The catch is that when you have too much data to fit in memory, you can't just query everything and process it at once; in that case you'll need to process the data in chunks.
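For illustration, here's a minimal Python sketch of reading training data out of a relational database in chunks instead of loading a CSV. The database name, table, and columns are placeholders I made up; the point is just the chunked-iterator pattern that pandas and SQLAlchemy support:

```python
# Minimal sketch: stream rows out of a relational DB in chunks instead of a CSV.
# The database "social", the "tweets" table, and the credentials are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/social")

# chunksize makes read_sql return an iterator of DataFrames,
# so the full table never has to fit in memory at once.
for chunk in pd.read_sql("SELECT text, label FROM tweets", engine, chunksize=10_000):
    features = chunk["text"]
    labels = chunk["label"]
    # Feed this chunk to a model that supports incremental training,
    # e.g. a scikit-learn estimator with a partial_fit() method.
    ...
```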
A good thing about SQL databases is that they provide a set of functions you can invoke right in your SQL query to efficiently calculate things over the data. For example, you can get the sum of a column across the whole table with the SUM() function, which allows for efficient and easy data manipulation.
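A tiny, self-contained sketch of pushing that aggregation into the database instead of pulling every row into your own code (the table and values below are invented; the same idea applies to MySQL, PostgreSQL, etc.):

```python
# Sketch: let the database engine compute the aggregate; only one number comes back.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny throwaway table so the example is self-contained.
cur.execute("CREATE TABLE purchases (amount REAL)")
cur.executemany("INSERT INTO purchases (amount) VALUES (?)", [(9.99,), (24.50,), (3.25,)])

# SUM() runs inside the database engine.
cur.execute("SELECT SUM(amount) FROM purchases")
print("total purchase amount:", cur.fetchone()[0])

conn.close()
```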

Related

Connect Adobe Analytics to MYSQL

I am trying to connect the data collected from Adobe Analytics to my local instance of MySQL. Is this possible? If so, what would be the method of doing so?
There isn't a way to directly connect your MySQL DB with AA, make queries against it, or anything like that.
The following is just some top level info to point you in a general direction. Getting into specifics is way too long and involved to be an answer here. But below I will list some options you have for getting the data out of Adobe Analytics.
Which method is best largely depends on what data you're looking to get out of AA and what you're looking to do with it within your local db. But in general, I've listed them in order of how much work it takes to set something up and to process the file(s) you receive in order to get them into your database.
The first option is to schedule data to be FTP'd to you on a regular basis from within the AA interface. This can be a scheduled report from the report interface or from Data Warehouse, and it can be delivered in a variety of formats, though most commonly as a CSV file. This exports data that has already been processed by AA, meaning aggregated metrics, etc. Overall, this is pretty easy to set up, and the exported CSV files are easy to parse. But there are a number of caveats/limitations to it; it largely depends on what specifically you're aiming to do.
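As a rough illustration of the ingest side of that first option: once the scheduled CSV lands on your server, loading it into MySQL is a few lines with pandas. The file name, table name, and columns here are placeholders; the real columns depend on the report you schedule:

```python
# Sketch: load a scheduled AA CSV export into a MySQL table.
# File name, table name, and column layout are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/analytics")

report = pd.read_csv("aa_scheduled_report.csv")  # the FTP'd export
report.to_sql("aa_daily_metrics", engine, if_exists="append", index=False)
```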
The second option is to make use of their API endpoint to make requests and receive responses in JSON format. (You can also receive XML, but I recommend against it.) You will get similar data as above, but it's more on-demand than scheduled. This method requires a lot more effort on your end to actually get the data, but it gives you a lot more power/flexibility for getting the data on demand, building interfaces (if relevant to you), etc. It also comes with some of the same caveats/limitations as the first option, since the data is already processed/aggregated.
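To show the general shape of that JSON-over-HTTP flow only: the URL, headers, params, and response structure below are placeholders, not the real Adobe Analytics API, so consult the AA API docs for the actual request format:

```python
# Sketch only: endpoint, auth, params, and response shape are hypothetical.
# It just shows "make a request, parse JSON, iterate rows for your DB".
import requests

resp = requests.get(
    "https://example.com/analytics/api/report",           # hypothetical endpoint
    headers={"Authorization": "Bearer <access-token>"},    # hypothetical auth
    params={"metric": "pageviews", "granularity": "day"},  # hypothetical params
    timeout=30,
)
resp.raise_for_status()
data = resp.json()  # parsed JSON, ready to insert into MySQL

for row in data.get("rows", []):
    print(row)
```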
Third option is to schedule Data Feed exports from the AA interface. This will send you CSV files with non-aggregated, mostly non-processed, raw hit data. This is about the closest you will get to the data sent to Adobe collection servers without Adobe doing anything to it, but it's not 100% like a server request log or something. Without knowing any details about what you are ultimately looking to do with the data, other than put it in a local db, at face value, this may be the option you want. Setting up the scheduled export is pretty easy, but parsing the received files can be a headache. You get files with raw data and a LOT of columns with a lot of values for various things, and then you have these other files that are lookup tables for both columns and values within them. It's a bit of a headache piecing it all together, but it's doable. The real issue is file sizes. These are raw hit data files and even a site with moderate traffic will generate files many gigabytes large, daily, and even hourly. So bandwidth, disk space, and your server processing power are things to consider if you attempt to go this route.
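To give a feel for the third option's "piecing it together" step, here's a toy sketch of joining a raw data feed file against one of its lookup files. The file names, delimiter, and columns are invented; real data feeds have far more columns and separate column-header lookups:

```python
# Toy sketch of stitching a raw hit-data file to a lookup file.
# File names, delimiters, and columns are placeholders, not Adobe's real layout.
import pandas as pd

# Raw hits: one row per hit, with numeric codes instead of readable values.
hits = pd.read_csv("hit_data.tsv", sep="\t",
                   names=["timestamp", "visitor_id", "browser_id"])

# Lookup table mapping browser_id codes to human-readable names.
browsers = pd.read_csv("browser.tsv", sep="\t",
                       names=["browser_id", "browser_name"])

# Replace the code with the readable value before loading into your DB.
hits = hits.merge(browsers, on="browser_id", how="left")
print(hits.head())
```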

What to use: a JSON file or SQL?

I want to create a website that will have an AJAX search. It will fetch the data either from a JSON file or from a database. I do not know which technology to use to store the data: a JSON file or MySQL. Based on some quick research, it is going to be about 60,000 entries. So if I use JSON the file size will be around 30-50 MB, and if I use MySQL it will have 60,000 rows. What are the limitations of each technique and what are the benefits?
Thank you
I can't seem to comment since I need 50 rep. for commenting, so I will give it as an answer:
MySQL will be preferable for many reasons, not the least of which being you do not want your web server process to have write access to the filesystem (except for possibly logging) because that is an easy way to get exploited.
Also, the MySQL team has put a lot of engineering effort into things such as replication, concurrent access to data, ACID compliance, and data integrity.
Imagine if, for instance, you add a new field that is required in whatever data structure you are storing. If you store in JSON files, you will have to have some process that opens each file, adds the field, then saves it. Compare this to the difficulty of using ALTER TABLE with a DEFAULT value for the field. (A bit of a contrived example, but how many hacks do you want to leave in your codebase for dealing with old data?) To be really blunt about it, MySQL is a database while JSON is not, so the correct answer is MySQL, without hesitation. JSON is just a data format, and barely even that. JSON was never designed to handle anything like concurrent connections or any sort of data manipulation, since its only function is to represent data, not to manage it.
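A tiny runnable sketch of that ALTER TABLE point (table and column names are invented; shown here with SQLite so it's self-contained, but the same statement works in MySQL):

```python
# Sketch: adding a new required field to every existing record is one SQL
# statement, versus rewriting every JSON file by hand. Names are placeholders.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO entries (title) VALUES ('first'), ('second')")

# Every existing row immediately gets the default value for the new column.
conn.execute("ALTER TABLE entries ADD COLUMN status TEXT NOT NULL DEFAULT 'draft'")

print(conn.execute("SELECT id, title, status FROM entries").fetchall())
conn.close()
```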
So go with MySQL for storing the data. Then you should use some programming language to read that database, and send that information as JSON, rather than actually storing anything in JSON.
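A minimal sketch of that "read the database, return JSON" layer, here in Python with Flask and PyMySQL (the database, table, and column names are assumptions; PHP or Node would work just as well):

```python
# Minimal sketch of an AJAX search endpoint: query MySQL, return JSON.
# Database, table, and column names are placeholders.
from flask import Flask, jsonify, request
import pymysql

app = Flask(__name__)

@app.route("/search")
def search():
    term = request.args.get("q", "")
    conn = pymysql.connect(host="localhost", user="user",
                           password="password", database="site")
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        # Parameterized query: the database does the filtering, not the browser.
        cur.execute("SELECT id, title FROM entries WHERE title LIKE %s LIMIT 20",
                    ("%" + term + "%",))
        rows = cur.fetchall()
    conn.close()
    return jsonify(results=rows)

if __name__ == "__main__":
    app.run()
```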
If you store the data in files, whether in JSON format or anything else, you will have all sorts of problems that people stopped worrying about once databases started being used for the same thing: size limitations, locks, you name it. It's good enough when you have one user, but the moment you add more of them, you'll start solving so many problems that you would probably end up writing an entire database engine just to handle the files for you, while all along you could have simply used an actual database. Do note: don't take my word for granted, I am not an expert in this field, so let others post their answers and then judge by that. I think enough people here on Stack Overflow have more experience than I do haha. These are NOT entirely my words, but I have taken the parts that were true from what I knew and added some of my own knowledge :) Have a great time making your website.
For MySQL, benefits: you can select specific rows or specific columns using queries, filter data based on a key, and order alphabetically.
Downside: you need a REST API to fetch the data because it can't be accessed directly; you have to use PHP, Python, or whatever programming language for the backend code.
For a JSON file, benefits: no backend code; it can be accessed directly with a GET HTTP request.
Downside: no filtering, ordering, or queries of any kind; you have to do all of that manually.

Parse.com query database for business intelligence and statistics purposes

I would love to know how I can periodically run some complex queries on my app's Parse database. We are currently exporting the entire database as JSON, converting the JSON to CSV, putting it in Excel, and getting business intelligence data from that. It is not very efficient because the database is growing every day and the process of converting the file to CSV is taking longer every day. Any advice or good practices that you guys used?
Parse's own "Background Jobs": http://blog.parse.com/announcements/introducing-background-jobs/
A meaningful answer would probably require more specifics but it is normal that bigger data = longer processing times.
One way of keeping a lid on it would be to keep processed data processed :P and just append your new results to that (differentials).
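A hedged, generic sketch of that "differentials" idea, nothing Parse-specific: remember how far you got last time, only pull records created since then, and fold them into the stored aggregates. The checkpoint file, query, and field names are all placeholders:

```python
# Generic sketch of incremental ("differential") processing:
# remember the last processed timestamp and only process newer records.
# Storage, query, and field names are placeholders, not Parse specifics.
import json
from datetime import datetime, timezone

CHECKPOINT_FILE = "last_run.json"

def load_checkpoint():
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_processed"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00+00:00"  # first run: process everything

def save_checkpoint(ts):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_processed": ts}, f)

def fetch_new_rows(since):
    # Placeholder: in practice, query your datastore filtered on
    # createdAt > since, so only the new records come back.
    return []

def main():
    since = load_checkpoint()
    new_rows = fetch_new_rows(since)
    # ... update running totals / append to the BI extract here ...
    save_checkpoint(datetime.now(timezone.utc).isoformat())

if __name__ == "__main__":
    main()
```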

How to approach frequent mass updates to SQL database by non-programmer. Should the DB just import from Excel each time?

I am new to database development so any advice is appreciated.
I am tasked with creating a SQL database. This database will store data that is currently on an Excel file. The data in about 50 rows changes every single day (different rows each day).
The person updating the data is a non-programmer and the updating of info needs to be simple and fast.
I'm thinking my client could just update the Excel file, and then this Excel file would be uploaded to the database. Is that feasible? Is there a better way? My client spends enough time just updating the Excel file, so anything that takes a significant amount of extra time for inputting data is not feasible.
Any feedback or ideas are appreciated!
Context: I haven't made any decisions about which SQL DBMS I will use (or maybe I'll use NoSQL or Access). I'm still in the how-should-I-approach-this stage of development.
If your data all fits in an Excel file, there's nothing wrong with that approach. You need to spend your time thinking about how you want to get the data from Excel into the DB, as you have a ton of options as far as programming languages/tools to do that.
I'm personally a huge fan of Node.js (there are npm modules that already exist for reading Excel files, writing to MySQL, etc., so you would end up writing almost nothing yourself), but you could do it using just about anything.
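To show how little code this is, here's the same idea sketched in Python with pandas rather than Node (the file name, sheet, table, and connection string are all assumptions):

```python
# Sketch: load the client's Excel file into a MySQL table.
# File name, sheet name, table name, and credentials are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/company")

# Reading .xlsx files via pandas requires the openpyxl package.
df = pd.read_excel("daily_update.xlsx", sheet_name="Sheet1")

# Replace the table's contents with the current state of the spreadsheet.
df.to_sql("daily_data", engine, if_exists="replace", index=False)
```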
Do yourself a favor and use a simple database (like MySQL) and don't mess with NoSQL for this. The amount of data you have is tiny (if it's coming from an Excel file), and you really don't need to start worrying about the complexities of NoSQL until you have a TON of data or data that is changing extremely rapidly.
If the business process currently updates Excel, then that is probably the best approach. Modify the Excel file; a little bit of VBA in Excel can copy it and store it somewhere for the database. Schedule a job in the evening, load the data in, and do whatever processing you like. This is not an unusual architecture.
Given your choice of tools, I would advise you not to use either Access or a NoSQL database, unless you have a particular reason for choosing them. A free database such as Postgres or SQL Server Express should do a good job meeting your needs.

Options for transferring data between MySQL and SQLite via a web service

I've only recently started to deal with database systems.
I'm developing an iOS app that will have a local database (SQLite) and that will have to periodically update the internal database with the contents of a database stored on a web server (MySQL). My question is, what's the best way to fetch the data from the web server and store it in the local database? There are some options that came to me; I don't know if all of them are possible:
Webserver->XML/JSON->Send it->Locally convert and store in local database
Webserver->backupFile->Send it->Feed it to the SQLite db
Are there any other options? Which one is better in terms of the amount of data transferred?
Thank you
The XML/JSON route is by far the simplest while providing sufficient flexibility to handle updates to the database schema/older versions of the app accessing your web service.
In terms of the second option you mention, there are two approaches: either an SQL statement dump or a CSV dump. However:
The "default" (i.e.: mysqldump generated) backup files won't import into SQLite without substantial massaging.
Using a CSV extract/import will mean you have considerably less flexibility in terms of schema changes, etc. so it's probably not a sensible approach if the data format is ever likely to change.
As such, I'd recommend sticking with the tried and tested XML/JSON approach.
In terms of the amount of data transmitted, JSON may be smaller than the equivalent XML, but it really depends on the variable/element names used, etc. (See the existing How does JSON compare to XML in terms of file size and serialisation/deserialisation time? question for more information on this.)
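To make the recommended JSON route concrete, here's a rough sketch of the client-side step: pull JSON from the web service and store it in a local SQLite database. It's written in Python for brevity; on iOS you'd do the equivalent with the native networking and SQLite/Core Data APIs, and the URL, table, and columns below are placeholders:

```python
# Sketch: fetch rows as JSON from the web service and upsert them into local SQLite.
# The URL, table, and column names are placeholders; an iOS app would perform the
# same steps with its native networking and SQLite APIs.
import sqlite3
import requests

resp = requests.get("https://example.com/api/items.json", timeout=30)  # hypothetical endpoint
resp.raise_for_status()
items = resp.json()  # e.g. [{"id": 1, "name": "foo", "updated_at": "..."}, ...]

conn = sqlite3.connect("local_cache.db")
conn.execute("""CREATE TABLE IF NOT EXISTS items (
                    id INTEGER PRIMARY KEY,
                    name TEXT,
                    updated_at TEXT)""")

# INSERT OR REPLACE keeps the local copy in sync with the server rows.
conn.executemany(
    "INSERT OR REPLACE INTO items (id, name, updated_at) VALUES (?, ?, ?)",
    [(row["id"], row["name"], row["updated_at"]) for row in items],
)
conn.commit()
conn.close()
```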