What software is availible for data quality checking

What software is availible for data quality checking - csv

I'm looking to identify some possible software options that will allow for custom rules to manipulate bulk data files (.csv) For example, proper capitalization (allowing for states to remain capital and unique surnames), identifying the word count of specific words in a field, and some other custom rules. Any guidance would be appreciated.

You could use Talend Open Studio for this task. It is an Opensource ETL tool for data manipulation and integration. You can for example ImportCSV >> DATABASE >> perform transformations >> ExportCSV. The possibilities are endless.
You can find it here: http://www.talend.com/products-data-integration/talend-open-studio.php
It also sounds like you might be looking to create a profile of the data. For this you can use Talend Open Profiler, they recently added support for flat files such as your .csv. It is simple to use and you should be up and running in 30 mins.
You can find the download here: http://www.talend.com/products-data-quality/talend-open-profiler.php
You can find some tutorials here:http://www.talendforge.org/tutorials/menu.php
On the tutorials choose the Data Quality tab, and scroll down until 'Talend Open Profiler'
It is my first step in assessing data quality on a new dataset.

A quick google "data scrubbing utilities" turned up this:
http://data-scrubbing.qarchive.org/
They look to be very close to what you're looking for.
It'll really depend on how complex the rules get. Much more complex than simple stuff, and you'd probably be ahead by just coding something up (or having it coded).

Related

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for it (approx 300 mb).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!

I guess the answer for this varies massively on a few things.
What subject is the masters covering? (mainly to appease many people who will probably assume you're hoping for people to do your homework for you! This might explain why the thread has been down-voted already)
You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in the language you know, you might not be able to compile it!
What kind of information do you want from the JSON file?
With regards to question 3, I've looked in the JSON file and it contains hierarchical data which is pretty difficult to replicate in a flat file i.e. an Excel or CSV file (I should know, we had to do this a lot in a previous job of mine).
But, I would look at the following plan of action to achieve what you're after:
Use a JSON parser to serialize the data into a class structure (Visual Studio can create the classes for you... See this S/O thread - How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?)
Once you've got the objects in memory, you can then step through them one by one and pick out the data you want and append them to a comma-separated string (in C# I'd use the StringBuilder) and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part as you'd need to step into the different levels of the data hierarchy.
Hope this points you in the right direction?

You may want to look at this Blog.
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML format for viewing and publishing on the web. I'd like to have separate html files for each individual and perhaps additional files for other content as well. I know there are some tools out there but I was wondering if anyone used any tools and could advise on this. I'm not sure what format to look for such applications. They could be Python or php files that one can edit, or even JavaScript (maybe) or just executable files.
The next issue might be appropriate for a topic in itself. Export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL which both are extended from FOAF.
Thanks,
Bruce

Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.

Several years ago, I created a simple Java program to turn gedcom into xml. I then used xslt to generate html and rdf. The html I generate is pretty rudimentary, so it would probably be better to look elsewhere for that, but the rdf might be useful to you:
http://jay.askren.net/Projects/SemWeb/

There are a number of these. All listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.

Family Historian http://www.family-historian.co.uk will create exactly what you are looking for, eg one file per person using the built in Web Site creator. As will a couple of the other Major genealogy packages. I have not seen anything for the RDF part of your question.

I have since tried to produce a Genealogy application using Semantic MediaWiki - MediaWiki, the software behind Wikipedia, and Semantic MediaWiki includes various extensions related to the Semantic Web. I thought it is very easy to use with the forms and the ability to upload a GEDCOM but some feedback from people into genealogy said that it appeared too technical and didn't seem to offer anything new.
So, now the issue is whether to stay with MediaWiki and make it more user friendly or create an entirely new application that allows for adding and updating data in a triple store as well as displaying. I'm not sure how to generate a family tree graphical view of the data, like on sites like ancestry.com, where one can click on a box to see details about the person and update that info or one could click on a right or left arrow around a box to navigate the tree. The data comes from SPARQL queries sent to the data set/triple store both when displaying the initial view and when navigating the tree, where an Ajax call is needed to get more data.
Bruce

How do i convert hdb file? ... believed to be from act! source

Any ideas ?
I think the original source was a goldmine database, looking around it appears that the file was likely built using an application called ACT which I gather is a huge product I don't really want to be deploying for a one off file total size less than 5 meg.
So ...
Anyone know of a simple tool that I can run this file through to convert it to a standard CSV or something?
It does appear to be (when looking at it in notepad and excel) in some sort of csv type format but it's like the data is encrypted somehow.

Ok this is weird,
I got a little confused because the data looked a complete mess, in actual fact the mess was the data, that's what it was meant to look like.
Simply put, i opened the file in notepad, seemed to have a sort of pattern so i droppped it on excel.
Apparently excel has no issues reading these files ... strange huh !!!

I am unaware of any third party tooling for opening these files specifically, although there is an SDK available for C# which could resolve your problem with a little elbow grease.
The SDK can be aquired for free Here
Also there is a developer forum which could provide some valuable resources including training material with sample code Here
Resources will be provided with the SDK
Also, out of interest since ACT is a Sage product have you any Sage software floating about which you could attempt to access the data with? Most offices have!
Failing all of the above there is a trial available for ACT! Here!
Good luck with your problem!

How do you database access (I/O) to/from Magento Commerce?

So, I want to import, export and modify the database. I have read that I have to do that by XML, but I don't really understand their doc system and I haven't found any good tutorials out there that explain this. I am slowly reading the very expensive and short book which is somewhat answering my questions, but I crave more.
As a second question, I want to have a order system where I can send out information or emails with my own code. I assume this would be some type of plug-in that would override or be called at a certain time. Any info would be helpful.

Some parts of the magento data can be imported/exported via the backend (System->Import/Export), namely products and customers.
If you want to deal with the complete DB - use your DB tool of choice (I prefer mysqldump).
When dealing with exported CSV.. use OpenOffice, from my experience it deals better with the separation characters than Excel.
As for your second question - as far as I understood, you will have to develop a module if you want to do something different than the existing functionality and keep the original mail functions. If you don't want to/have to keep the original functions, you can opt to overwrite the module, which is much easier as far as I can see. Google search for "overriding magento module" should turn up atleast one decent tutorial.

I found what I was looking for here:
(on magento site: Resources -> Magento Core API -> Product API or whichever API you want)
The problem is there is no Order API yet (or none that I've seen)
http://www.magentocommerce.com/wiki/doc/webservices-api/api/catalog_product#examples
This details how you'd write an external php script and obtain,edit or delete products (or anything else with an API).
Modules still look daunting, but I am reading through the (very thin) magento book (the only one available).
I hope this helps someone else.

How to reverse engineer binary file formats for compatibility purposes

I am working of a file preparation software to enable translators work easily and efficiently on a wide range of file formats.
As far as text-based formats (xml, php, resource files,...) are concerned, my small preparation utility works fine, but a major problem for most translators is to handle all kinds of proprietary binary formats (Framemaker, Publisher, Quark...).
These files are rarely requested and need to be opened in expensive applications (few freelance can afford to buy $20,000 worth of software just to handle a few projects per year), and even then it is not convenient to work directly in those applications anyway.
I would like to be able to read these files and extract the text in such a way that it can be translated and then re-imported in the original application with minimal effort, or even better, to recreate a valid native binary file.
Does that sound doable?
Where can I find more information on handling binary file formats and are there useful tools for these kind of jobs (besides regular hex editors)?
Thanks in advance.

Of course reverse engineering is possible, but without format specs it will take a lot of work. I would look at the return on effort regarding supporting these 'rarely requested, very expensive' formats. You may be better off spending that effort improving the core functionality of your app.
Another angle is to contact the companies with these formats, explain your goal, explain that it helps their product, and if they don't see you as competition they might be willing to help.

I know that you want to reverse engineer them - but since these may be propriety file formats you are looking at a very steep curve trying to decode them...
Some (as I have written some propritety formats for interal use before) have specific methods and objects written into them that serve some alternative process than the file contents themselves. Stuff that would prove the new file is illegal.
Just my 2 cents and I am no lawyer =>

Maybe you could pick a cheaper application which has import features for QuarkXPress. For example InDesign should be able to read Quark documents. Then use the importing application to export to whatever format you need - maybe with a help of plug-in.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008