Why does this model fail? - regression

Here is the data set
https://gist.github.com/kirkstrobeck/d8b768867890807f9dc9
When using the Google Prediction API, it stays in RUNNING for about 30 minutes, then fails with ERROR: INTERNAL ERROR.
Why does it fail? It seems to be a standard consumable regression model data set.

When attempting to answer this question, I looked at the API you mention and its requirements, which come down to the file format and how the text in that file is structured. The first thing I will point out is that the Google Prediction API expects training data that "is uploaded to Google Cloud Storage as a CSV (comma-separated value) file." Your file is a TXT (at least on GitHub), but it appears to have the correct structure of a CSV. However, if you look at the standards for this file type, almost everyone has a different way they want it done. In the case of Google, they have very strict requirements on the file format (they also have some good examples here: cloud.google.com/prediction/docs/developer-guide#examples). Long story short: you shouldn't have spaces between your columns. That may well be what causes the error in processing, seeing how it matches neither the common CSV conventions nor Google's requirements.
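If it helps, here is a minimal sketch (my own, not from the Prediction API docs) that rewrites the file without spaces after the commas, assuming none of the fields are quoted strings containing commas:

import csv

# skipinitialspace makes ', 1.2' parse as '1.2'; the rows are then
# rewritten with plain comma separators and no padding.
with open('model.txt', newline='') as src, open('model.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src, skipinitialspace=True):
        writer.writerow([field.strip() for field in row])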

Related

Converting CSV into GRIB2 data for mapping in Leaflet

TL;DR: I'm looking for some resources on generating GRIB2 data sets on the fly, ideally using in-house-generated wind data in a CSV format.
We have a bunch of data for a series of localized weather stations monitoring wind information around our city. They report in at ~2-3 minute intervals (far more frequent than standard weather data), and from their reports we have latitude, longitude, wind speed, and wind direction. Someone went and told the boss about some really slick visualizations, like this one, that can display wind speed and direction, and it's my job to make it happen.
The above plug-in for Leaflet (GitHub here), as well as several others, all use GRIB2 data, which from my research involves a left/right (east-west, U) set of values and an up/down (north-south, V) set of values for a series of points plotted out across a region.
The problem I'm having is that I've only found a handful of tools that interact with GRIB2 data, and most of them decode GRIB2 datasets; only one tool, written in Fortran, seems to exist that encodes GRIB2 data.
So, is there any way to generate GRIB2 data on the fly using proprietary data at 2-3 minute intervals?
I've gone through this resource on NOAA's website, which is where I found a few tools.
I know how frustrating it can be to work with GRIB and some of the other science/weather-related formats. This may not be the best answer, but it might be your only answer, as I find these types of questions tend to just gather dust because of the general lack of familiarity with the formats and tools.
From what I remember, the CDO tools (link here) can do some magical things, but I am not that experienced with them. I do use CDO for converting satellite data to plain text and it's been an absolute lifesaver! So I will explain:
My suggestion is to first convert the CSV to netCDF. I had a link saved for this a long time ago but never really came to need it (discussion here). Essentially, some Python code should be able to do the conversion for you; there may be several ways to do it, but I have never looked into it beyond initial research.
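As a rough sketch of that first step, assuming a flat CSV with hypothetical column names (time, lat, lon, wind_speed, wind_dir -- adjust to your real headers), pandas and xarray can build the netCDF, and you can split speed/direction into the U/V components wind visualizations expect:

import numpy as np
import pandas as pd

# Hypothetical column names -- adjust to match the real station CSV.
df = pd.read_csv('stations.csv', parse_dates=['time'])

# GRIB2 wind layers carry U (east-west) and V (north-south) components;
# meteorological direction is 'degrees the wind blows from', hence the minus signs.
theta = np.deg2rad(df['wind_dir'])
df['u'] = -df['wind_speed'] * np.sin(theta)
df['v'] = -df['wind_speed'] * np.cos(theta)

# Index by the coordinate columns so xarray treats them as dimensions.
# Scattered stations give a sparse grid, so you would still need to
# interpolate onto a regular grid before encoding GRIB2.
ds = df.set_index(['time', 'lat', 'lon']).to_xarray()
ds[['u', 'v']].to_netcdf('stations.nc')  # needs the netCDF4 or scipy package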
Next, you should be able to convert .nc to .grib using CDO; I know it can do quite a lot. Here is a discussion regarding this, so it must be possible.
I also see at this link that someone converts GRIB to netCDF, but you should be able to do it in reverse as well; I just don't know the exact commands. From the link:
As an example of use of CDO, converting from GRIB to netCDF can be as simple as
cdo -f nc copy file.grb file.nc
I would suspect it's just the reverse, so probably something like:
cdo -f grb copy file.nc file.grb
Hopefully you can put things together for it to work without being too hack-y.
You can do this in a simple Python script using pandas, xarray and cfgrib:
import pandas as pd
import cfgrib

# Read the flat CSV, then let pandas build an xarray Dataset from it.
data = pd.read_csv('your_csv_data.csv')
xarray_data = data.to_xarray()

# Write the dataset out with cfgrib's (experimental) GRIB writer.
cfgrib.to_grib(xarray_data, 'out2.grib')
Please note that you have to define the GRIB specifications (grid type, level type, and so on) before you store the data as GRIB.
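If I read the cfgrib docs correctly, those specifications are set as 'GRIB_'-prefixed attributes on the dataset's variables before calling to_grib; treat the key names below as assumptions to verify, and note that 'u' is just a placeholder variable name:

# Assumed attribute keys -- verify against the cfgrib documentation.
xarray_data['u'].attrs['GRIB_gridType'] = 'regular_ll'
xarray_data['u'].attrs['GRIB_typeOfLevel'] = 'surface'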

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an Excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for them (approx. 300 MB).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer to this varies massively on a few things.
1. What subject is the Master's covering? (Mainly to appease the many people who will probably assume you're hoping for someone to do your homework for you! This might explain why the thread has been down-voted already.)
2. You mention your programming skills are limited... what programming skills do you have, and what language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in a language you know, you might not be able to compile it!
3. What kind of information do you want from the JSON file?
With regard to question 3, I've looked in the JSON file and it contains hierarchical data, which is pretty difficult to replicate in a flat file, i.e. an Excel or CSV file (I should know; we had to do this a lot at a previous job of mine).
But I would look at the following plan of action to achieve what you're after:
1. Use a JSON parser to deserialize the data into a class structure (Visual Studio can create the classes for you; see this S/O thread: How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?).
2. Once you've got the objects in memory, step through them one by one, pick out the data you want, append it to a comma-separated string (in C# I'd use the StringBuilder), and write the rows of data out to a file on disk.
3. Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part, as you'd need to step into the different levels of the data hierarchy.
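Since you say your programming skills are limited, here is a minimal Python sketch of that plan, assuming the dump is one big JSON array of project objects and using hypothetical field names (inspect the real file and adjust):

import csv
import json

# Load the whole dump; a ~300 MB file fits in memory on most machines.
with open('Kickstarter_2015-10-22.json', encoding='utf-8') as f:
    projects = json.load(f)

# Hypothetical field names -- check what the real objects contain.
fields = ['name', 'goal', 'pledged', 'state', 'backers_count']

with open('kickstarter.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for project in projects:
        # Only flat fields are copied; nested objects (creator, location, ...)
        # would need to be flattened by hand here.
        writer.writerow({k: project.get(k) for k in fields})

Excel can then open kickstarter.csv directly.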
Hope this points you in the right direction?
You may want to look at this blog post:
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

Google Keyword Tool Results

I want to understand what this data is after I make a search query for "potato" in Google's Keyword Tool.
It's interesting because at the bottom it contains cost per click, names, and suggestions, but the actual values for things like global searches are not definite.
Since it's a bit long, I pasted it here:
http://pastebin.com/UCTEhdB1
Any ideas are welcome.
Do not recommend any APIs
I am assuming that the values from your pastebin are coming from this tool.
I understand that this is not what you requested, but unfortunately there does not appear to be any documentation on the data returned by the Keyword Tool, which should be expected considering that Google offers the Google AdWords API (unfortunately, a paid service).

How do I convert an HDB file? ... believed to be from an ACT! source

Any ideas?
I think the original source was a GoldMine database; looking around, it appears the file was likely built using an application called ACT!, which I gather is a huge product I don't really want to be deploying for a one-off file with a total size of less than 5 MB.
So ...
Does anyone know of a simple tool that I can run this file through to convert it to a standard CSV or something?
It does appear (when looking at it in Notepad and Excel) to be in some sort of CSV-type format, but it's like the data is encrypted somehow.
OK, this is weird.
I got a little confused because the data looked a complete mess; in actual fact, the mess was the data. That's what it was meant to look like.
Simply put, I opened the file in Notepad, it seemed to have a sort of pattern, so I dropped it into Excel.
Apparently Excel has no issues reading these files ... strange, huh!
I am unaware of any third-party tooling for opening these files specifically, although there is an SDK available for C# which could resolve your problem with a little elbow grease.
The SDK can be acquired for free Here
There is also a developer forum which could provide some valuable resources, including training material with sample code, Here
Resources will be provided with the SDK.
Also, out of interest, since ACT! is a Sage product, have you any Sage software floating about which you could attempt to access the data with? Most offices have!
Failing all of the above, there is a trial available for ACT! Here!
Good luck with your problem!

Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Different fonts, including OCR-A and OCR-B
In all cases the output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking that will be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers differ by only a single character, like "123G" and "123C".
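For what it's worth, a minimal sketch of that fuzzy-logic fallback using Python's standard difflib (the identifiers are just the examples from this question) shows exactly why near-identical IDs defeat it:

import difflib

known_ids = ['GG-9192', 'EPN/245G', '123G', '123C']  # known database values

# A noisy OCR read ('l' in place of '1') still finds the right identifier...
print(difflib.get_close_matches('GG-9l92', known_ids, n=3, cutoff=0.6))
# ... but identifiers that differ by one character both come back as
# strong candidates, so the fallback cannot disambiguate them.
print(difflib.get_close_matches('123C', known_ids, n=3, cutoff=0.6))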
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so any recommendations for which there are free or cheap Java-based APIs for would help.
Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results. None of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR code or something similar would be best: it is marked for orientation and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
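As an illustration, here is a minimal sketch with the third-party Python qrcode package (my example, not something from the question; since your application is in Java, ZXing is the usual equivalent there):

import qrcode

# Encode the batch/item identifier as a QR code instead of printed text;
# the scanning side then decodes it with any QR reader library.
img = qrcode.make('GG-9192')
img.save('gg-9192.png')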
I started digging for products, beginning with Tomato's suggestion. I tried ABBYY and CVISION; both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure one is necessary for my project: all I need is to extract the text from incoming PDFs. I did, however, try CVISION's server product, and with the OCR on its most accurate settings it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would too.