How to train Document AI to get specific fields? - ocr

I have approximately 6000 documents in PDF format. They have different structures, but they all contain the same date and code (by different structure I mean that the location of these values changes in each document). I am working with Document AI, which extracts all the information, but I would like to know whether it is possible to extract only the fields that I need. Would Document AI Workbench be the best option?

Did you mean creating a custom document extractor? You can do this in Document AI; visit this link for this feature.
TL;DR: you will have to do this in Document AI Workbench and train your own extractor (upload files and train the processor to extract the data you specify). I suggest visiting this documentation for the detailed steps.
Also, please note that this feature is in the Preview stage at the moment. Preview offerings are often publicly announced, but are not necessarily feature-complete, and no SLAs or technical support commitments are provided for them.
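If a general processor is already returning more than is needed, its output can also be filtered after the fact. The following is a minimal sketch, assuming the result has been saved as Document JSON with an `entities` list whose items carry `type`, `mentionText`, and `confidence`. The labels `date` and `code` and the helper name `pick_fields` are hypothetical and depend on how the processor schema was defined.

```python
from typing import Any


def pick_fields(
    document: dict[str, Any],
    wanted: set[str],
    min_confidence: float = 0.0,
) -> dict[str, str]:
    """Keep only the highest-confidence entity for each wanted label."""
    best: dict[str, tuple[float, str]] = {}
    for entity in document.get("entities", []):
        label = entity.get("type")
        if label not in wanted:
            continue  # ignore every field we did not ask for
        score = float(entity.get("confidence", 0.0))
        if score < min_confidence:
            continue
        if label not in best or score > best[label][0]:
            best[label] = (score, entity.get("mentionText", ""))
    return {label: text for label, (_score, text) in best.items()}


# Hypothetical usage: `doc` is the parsed JSON for one processed PDF.
# fields = pick_fields(doc, {"date", "code"}, min_confidence=0.5)
```

This only trims the output; it does not replace training a custom extractor when the default processor fails to find the fields at all.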

Related

Autodesk Forge randomly loses object and room information

I'm using Autodesk Forge to integrate with our remodeling tool. In particular, I need to count objects of different families and types and determine to what room they actually belong. I use Model Derivative API for this purpose. To keep the room/area information I convert .rvt files to .nwc files as suggested here. However, when I retrieve data with GET /modelderivative/v2/designdata/{urn}/metadata/{guid}/properties I face the following problems from time to time:
Room information sometimes disappears from Objects for some reason
Objects disappear from result data for some reason (but they seem to exist when I browse them in A360)
I have no idea what the reason for this could be.
I have no explanation for the disappearance of room data or objects for you.
If you can provide a reproducible case demonstrating that, I will gladly pass it on to the development team for analysis.
If you are interested in an immediate reliable solution and full control, which I assume is the case, I would suggest following the second bullet item in the advice provided by Eason in the previous answer that you refer to above:
Extract all the room information and object relationships you are interested in via the Revit API, store that data somewhere yourself, and use it later on wherever you like to your heart's content.
Then you will be completely safe and independent of all other components and their unpredictable behaviour.
If the only information that you need is the room containing each family instance, I can even implement a suitable Revit add-in for you.
Another suggestion that might help, if that is indeed the data you require: determine that information in a Revit add-in and attach it to each family instance in your own personal shared parameter. That will ensure that it remains intact through the translation process. Afaik, all shared parameter data is retained, independent of other behaviour.
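To make the intermittent loss easier to spot and report, the properties response can be tallied per room, with objects that carry no room value listed separately. This is a rough sketch, assuming the response body has the shape `{"data": {"collection": [{"objectid": ..., "name": ..., "properties": {...}}]}}`. The property group `Constraints` and key `Room` are hypothetical placeholders; the actual names depend on the translated model and on any shared parameter you add.

```python
from collections import Counter
from typing import Any


def count_by_room(
    response: dict[str, Any],
    group: str = "Constraints",  # hypothetical property group name
    key: str = "Room",           # hypothetical property key
) -> tuple[Counter, list[Any]]:
    """Count objects per (room, name) and list object ids without a room."""
    counts: Counter = Counter()
    missing: list[Any] = []
    for obj in response.get("data", {}).get("collection", []):
        props = obj.get("properties", {}).get(group, {})
        room = props.get(key) if isinstance(props, dict) else None
        if room:
            counts[(room, obj.get("name", ""))] += 1
        else:
            missing.append(obj.get("objectid"))
    return counts, missing
```

Comparing the `missing` list between two runs on the same model would give a concrete, reproducible case to pass on.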

Converting large JSON file to XLS/CSV file (Kickstarter campaigns)

As part of my Master's thesis, I'm trying to run some statistics on which factors affect whether crowdfunding campaigns get funded or not. I've been trying to get data from the largest platform Kickstarter.com. Unfortunately, they have removed all the non-successful campaigns from their website (unless you have the direct link).
Luckily, I'm not the only one looking for this data.
Webrobots.io have a scraper robot which crawls all Kickstarter projects and collects data in JSON format (http://webrobots.io/kickstarter-datasets/).
The latest dataset can be found on:
http://webrobots.io/wp-content/uploads/2015/10/Kickstarter_2015-10-22.json_.zip
However, my programming skills are limited, and I don't know how to convert it into an Excel file where I can manipulate the data and run my analysis. I found a few online converters, but the file is far too big for them (approx. 300 MB).
Can someone please help me get the file converted?
It will earn you an acknowledgement in my Master's thesis when it gets published :)
Thanks in advance!!!
I guess the answer for this varies massively on a few things.
1. What subject is the masters covering? (mainly to appease many people who will probably assume you're hoping for people to do your homework for you! This might explain why the thread has been down-voted already)
2. You mention your programming skills are limited... What programming skills do you have? What language would you be using to achieve this goal? Bear in mind that even with a fully coded solution, if it's not in the language you know, you might not be able to compile it!
3. What kind of information do you want from the JSON file?
With regards to question 3, I've looked in the JSON file and it contains hierarchical data which is pretty difficult to replicate in a flat file i.e. an Excel or CSV file (I should know, we had to do this a lot in a previous job of mine).
But, I would look at the following plan of action to achieve what you're after:
1. Use a JSON parser to deserialize the data into a class structure (Visual Studio can create the classes for you... See this S/O thread - How to show the "paste Json class" in visual studio 2012 when clicking on Paste Special?)
2. Once you've got the objects in memory, you can then step through them one by one, pick out the data you want, append it to a comma-separated string (in C# I'd use the StringBuilder), and write the rows of data out to a file on disk.
Once this is complete, you'll have the data you want.
Depending on what data you want from the JSON file, step 2 could be the most difficult part as you'd need to step into the different levels of the data hierarchy.
Hope this points you in the right direction?
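The plan above can be sketched in a few lines. The answer describes it in C#; the version below swaps in Python with only the standard library, as a minimal sketch rather than a finished converter. Nested objects are flattened into dotted column names, lists are kept as JSON text in a single cell, and the column names in the usage comment are hypothetical since they depend on which fields you want from the dataset.

```python
import csv
import json
from typing import Any, Iterable, TextIO


def flatten(value: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into one dict with dotted keys."""
    flat: dict[str, Any] = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        # A list has no natural single column; keep it as JSON text.
        flat[prefix.rstrip(".")] = json.dumps(value)
    else:
        flat[prefix.rstrip(".")] = value
    return flat


def write_csv(records: Iterable[dict], columns: list[str], handle: TextIO) -> None:
    """Write the chosen columns of each flattened record as one CSV row."""
    writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))


# Hypothetical usage, assuming the file holds a JSON list of project objects:
# with open("projects.json", encoding="utf-8") as src, \
#         open("projects.csv", "w", newline="", encoding="utf-8") as dst:
#     write_csv(json.load(src), ["id", "name", "creator.name"], dst)
```

Loading a 300 MB file with `json.load` needs enough memory to hold it all at once; if that is a problem, the file would have to be split or read with a streaming parser first.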
You may want to look at this Blog.
http://jdunkerley.co.uk/2015/09/04/downloading-and-parsing-met-office-historic-station-data-with-alteryx/
He uses a process with Alteryx that may line up with what you are trying to do. I am looking to do something similar, but haven't tried it yet. I'll update this answer if I get it to work.

What software is available for data quality checking?

I'm looking to identify some possible software options that will allow for custom rules to manipulate bulk data files (.csv). For example, proper capitalization (allowing for states to remain capitalized and for unusual surnames), counting occurrences of specific words in a field, and some other custom rules. Any guidance would be appreciated.
You could use Talend Open Studio for this task. It is an open-source ETL tool for data manipulation and integration. You can for example ImportCSV >> DATABASE >> perform transformations >> ExportCSV. The possibilities are endless.
You can find it here: http://www.talend.com/products-data-integration/talend-open-studio.php
It also sounds like you might be looking to create a profile of the data. For this you can use Talend Open Profiler, which recently added support for flat files such as your .csv. It is simple to use, and you should be up and running in 30 minutes.
You can find the download here: http://www.talend.com/products-data-quality/talend-open-profiler.php
You can find some tutorials here: http://www.talendforge.org/tutorials/menu.php
On the tutorials page, choose the Data Quality tab and scroll down to 'Talend Open Profiler'.
It is my first step in assessing data quality on a new dataset.
A quick Google search for "data scrubbing utilities" turned up this:
http://data-scrubbing.qarchive.org/
They look to be very close to what you're looking for.
It'll really depend on how complex the rules get. If they go much beyond simple transformations, you'd probably be better off just coding something up (or having it coded).
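For a sense of what "coding something up" might look like, here is a small sketch of the two rules mentioned in the question. The state codes and surname exceptions are hypothetical sample lists that would need to be filled in for real data; note that some state codes (for example IN or OR) are also ordinary words, so a real rule would need to look at the column or context as well.

```python
import re

# Hypothetical sample lists; extend them for real data.
STATE_CODES = {"NY", "CA", "TX"}
SURNAME_EXCEPTIONS = {"mcdonald": "McDonald", "o'brien": "O'Brien"}


def _fix_word(match: re.Match) -> str:
    word = match.group(0)
    if word.upper() in STATE_CODES:
        return word.upper()  # keep state codes fully capitalized
    lowered = word.lower()
    if lowered in SURNAME_EXCEPTIONS:
        return SURNAME_EXCEPTIONS[lowered]  # surname with unusual casing
    return lowered.capitalize()


def proper_case(text: str) -> str:
    """Capitalize each word, honoring the state and surname exceptions."""
    return re.sub(r"[A-Za-z']+", _fix_word, text)


def word_count(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
```

Both functions work on a single field value, so they can be applied row by row with Python's `csv` module.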

GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML format for viewing and publishing on the web. I'd like to have separate html files for each individual and perhaps additional files for other content as well. I know there are some tools out there but I was wondering if anyone used any tools and could advise on this. I'm not sure what format to look for such applications. They could be Python or php files that one can edit, or even JavaScript (maybe) or just executable files.
The next issue might be appropriate for a topic in itself. Export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL which both are extended from FOAF.
Thanks,
Bruce
Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.
Several years ago, I created a simple Java program to turn gedcom into xml. I then used xslt to generate html and rdf. The html I generate is pretty rudimentary, so it would probably be better to look elsewhere for that, but the rdf might be useful to you:
http://jay.askren.net/Projects/SemWeb/
There are a number of these, all listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.
Family Historian (http://www.family-historian.co.uk) will create exactly what you are looking for, e.g. one file per person, using the built-in Web Site creator, as will a couple of the other major genealogy packages. I have not seen anything for the RDF part of your question.
I have since tried to produce a genealogy application using Semantic MediaWiki. MediaWiki is the software behind Wikipedia, and Semantic MediaWiki includes various extensions related to the Semantic Web. I thought it was very easy to use, with the forms and the ability to upload a GEDCOM, but some feedback from people into genealogy said that it appeared too technical and didn't seem to offer anything new.
So, now the issue is whether to stay with MediaWiki and make it more user friendly or create an entirely new application that allows for adding and updating data in a triple store as well as displaying. I'm not sure how to generate a family tree graphical view of the data, like on sites like ancestry.com, where one can click on a box to see details about the person and update that info or one could click on a right or left arrow around a box to navigate the tree. The data comes from SPARQL queries sent to the data set/triple store both when displaying the initial view and when navigating the tree, where an Ajax call is needed to get more data.
Bruce

gartrip calibration file format/generation

I need to generate calibration files* for the Gartrip program. Does anyone know where I can find the file format it uses or, better yet, code for generating them?
I'm not too picky about the language.
If no one points me at anything better, I'm working on getting some examples to reverse engineer. It shouldn't be too bad, as I have reason to expect them to be text.
Edit:
What I need is to be able to generate the files needed to tell gartrip: "for file xyz.jpg, point A is at location B, point C is at location D"
*calibration files tell gartrip how to use image files as a background so that tracks and whatnot can be overlaid on them
I searched for Gartrip-related entries in your blog, but in vain.
One possibility is to reverse engineer the calibration file's format and write a utility based on the knowledge extracted.
Gartrip seems to be proprietary software, so as a registered user, are you entitled to do so?
The only thing I am sure of is that its author doesn't disclose any file format.