How to get cell values from a table in a PDF scanned by Kofax to Excel - OCR

I am new to Kofax Capture and I am working on retrieving data from a basic scanned invoice copy (PDF) that contains a table with a list of items, into an index file. The steps I followed are as follows:
Created a document class and added an index field of type table, with table columns such as Date as fields. A screenshot of the Date column values in the PDF is as follows:
During validation, the Date field values are all displayed in one field, as follows:
Date: 12/01/2018 12/02/2018 12/03/2018 12/04/2018
The values exported to the index file are also in the above format.
Is there a way to retrieve the value of every cell as a separate entry, or comma separated, using Kofax Capture?

Plain vanilla Kofax Capture (KC) can't extract data organized in tables. KC can extract static data, i.e. simple key-value pairs (e.g. invoice number, invoice date, total amount).
Sure, you could try to extract a column like this:
However, this could lead to potential issues down the line. What if the data isn't always in the same place? What if data continues on subsequent pages? What if your zone is smaller than the entire column? What if there is overlapping text? What if you want another column with additional data, essentially creating rows, but there are huge gaps in some columns (as in my screenshot)?
If table extraction is a requirement, you might want to use Kofax Transformation Modules (KTM), which is available as an add-on to Kofax Capture. KTM has more sophisticated methods of extracting tables that are not limited to individual form layouts.

Related

Is there a way to provide schema or auto-detect schema when uploading csv from GCS to BigQuery?

I am trying to upload a csv file from Google Cloud Storage (GCS) to BigQuery (BQ) and auto-detect schema.
What I tried to do is enable auto-detect schema and enter the number of rows to skip in "Header rows to skip" option. I have 6 rows which contain descriptive information about the data which I need to skip. The 7th row is my actual header row.
According to Google's documentation at https://cloud.google.com/bigquery/docs/schema-detect#auto-detect:
"The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field."
The problem with my CSV is that the above condition is not met in the sense that I have nulls in the rows.
Also, my CSV contains many rows which do not include any numerical values, which I think adds extra complexity for Google's schema auto-detection.
Auto-detect is not detecting the correct column names or the correct field types. All field types are detected as strings and column names are assigned like this: string_field_0, string_field_1, string_field_3, ... etc. It is also passing the column names of my CSV through as a row of data.
I would like to know what I can do to correctly upload this CSV to BQ with skipping the leading unwanted rows and having the correct schema (field names and field types).
You can try using tools like bigquery-schema-generator to generate the schema from your csv file and then use it in a bq load job for example.
After reading some of the documentation, specifically the CSV header section, I think what you're observing is the expected behavior.
An alternative would be to manually specify the schema for the data.
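If you go the manual route, a minimal sketch with the BigQuery Python client is below; the bucket, dataset, table and column names are placeholders, and the schema would of course mirror your real columns.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Explicit schema instead of autodetect; column names/types are placeholders
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=7,  # 6 descriptive rows + the header row itself
        schema=[
            bigquery.SchemaField("measurement_date", "DATE"),
            bigquery.SchemaField("value", "FLOAT"),
        ],
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/my_file.csv",        # placeholder GCS URI
        "my_project.my_dataset.my_table",    # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish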
Solved this by including my actual header row in the number of rows to skip.
I had 6 rows I actually needed to skip. The 7th row was my header (column names). I was entering 6 in the Header rows to skip.
When I entered 7 instead of 6, the schema was auto detected correctly.
Also, I realized that in this sentence from Google's documentation: "The field types are based on the rows having the most fields. Therefore, auto-detection should work as expected as long as there is at least one row of data that has values in every column/field.", nulls are considered values, so they were not actually causing a problem in the upload to BQ.
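For anyone loading programmatically rather than through the console, the same fix looks roughly like this with the Python client (placeholder names for bucket, dataset and table):

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=7,  # 7, not 6: count the header row itself, as described above
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/my_file.csv",        # placeholder GCS URI
        "my_project.my_dataset.my_table",    # placeholder destination table
        job_config=job_config,
    )
    load_job.result()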
Hope this helps someone facing the same issue!

How can I create a table that uses an equation to average data from another table?

I have a table that contains data from repeated experiments (for example, site A has one sample, and the lab processed the sample three times, obtaining slightly different values). I need to average these results in a separate table, but what I have read on the Microsoft support site is that a query that pulls data into another table with a calculated field is not possible in Access.
Can I query multiple data points from one table into a single calculated field in another table? Thank you.
UPDATE
I ended up doing a lot of manual adjustment of the file format to create a calculated field in the existing table that averages each site's data, so my problem is, for my current purposes, solved. However, I would still like to understand. Following up with you both, I think the problem was that I had repeated non-unique IDs between rows, when I probably should have made data columns with unique variable names so that I could query each variable name for an average.
So, instead of putting each site separately on the y-axis, I formatted it by putting the sample number for each site on the x-axis:
I was able to at least create a calculated field using this second format in order to create an average value for each site.
Would there have been a way to write a query using the first method? Luckily, my data set was not very hefty, so I could handle the reformat manually, but with thousands of data entries I couldn't have done that.
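For what it's worth, with the first (one row per measurement) layout this is normally done with a totals (GROUP BY) query rather than a calculated field. A minimal sketch, assuming a hypothetical table Results with columns SiteID and Value, run here through pyodbc; the same SQL can also be saved directly as an Access query:

    import pyodbc

    # Hypothetical database path, table and column names
    conn = pyodbc.connect(
        r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\lab.accdb"
    )

    # One output row per site, averaging the repeated lab measurements
    sql = """
        SELECT SiteID, AVG([Value]) AS AvgValue
        FROM Results
        GROUP BY SiteID
    """

    for row in conn.execute(sql):
        print(row.SiteID, row.AvgValue)

    conn.close()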
Also, here is the link to the site I mentioned originally: https://support.office.com/en-ie/article/add-a-calculated-field-to-a-table-14a60733-2580-48c2-b402-6de54fafbde3
Thanks all.

SSIS - Reuse OLE DB source when matching fact against lookup table twice

I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a csv, so I want to match the ids against the surrogate keys in lookup tables.
The data structure in the csv is like this
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table. So what I am doing is
Reading Lookup data using OLE DB Source
Reading my csv file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (aka can't find the surrogate key) I can redirect that to a rejected csv and handle it later.
Something like this (sorry for the Spanish!):
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there's a way to avoid reading the data from the table twice (I mean, not using two OLE DB sources to read the same table twice).
I tried adding a second output to the Sort but I am not allowed to. The same goes for adding another output to the OLE DB Source.
I see there's a "cache option"; is that the best way to go? (Although it would still imply creating another OLE DB source, right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field), I am getting an error when I try to map both columns from my csv against the same column in my lookup table:
There are columns missing with the sort order 2 to 2
What is the best way to go about this?
Or am I thinking about this incorrectly?
If something is not clear, let me know and I'll update my question.
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to split a Data Flow stream into multiple outputs.
Gonzalo, I have just used this article on how to derive columns when building a data warehouse:- How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
Here is the Data Flow:-
Note that the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the two derived Ids, CodeTableFirstId and CodeTableSecondId:-
Hopefully I understand your problem and this is of use to you.
Cheers John
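Outside of SSIS, the underlying logic John describes (joining the same lookup/dimension table twice, once per id column) can be sketched like this in Python with pandas; the table and column names are hypothetical:

    import pandas as pd

    # Hypothetical station dimension: surrogate key plus the natural/business id
    station_dim = pd.DataFrame({
        "StationSK": [101, 102, 103],
        "StationId": [1, 2, 3],
    })

    # Hypothetical fact rows from the CSV
    fact = pd.DataFrame({
        "userId": [7, 8],
        "OriginStationId": [1, 9],      # 9 has no match -> would go to the rejected file
        "DestinyStationId": [3, 2],
    })

    def lookup_sk(df, id_col, sk_col):
        """Left-join the station dimension against one id column, like one Lookup task."""
        dim = station_dim.rename(columns={"StationSK": sk_col})
        df = df.merge(dim, left_on=id_col, right_on="StationId", how="left")
        return df.drop(columns="StationId")

    # The same dimension is reused for both joins; it is only "read" once above
    fact = lookup_sk(fact, "OriginStationId", "OriginStationSK")
    fact = lookup_sk(fact, "DestinyStationId", "DestinyStationSK")

    # Rows with a missing surrogate key can be redirected, as in the question
    rejected = fact[fact[["OriginStationSK", "DestinyStationSK"]].isna().any(axis=1)]
    print(fact)
    print(rejected)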

MS Lightswitch Application, add temporary fields in data table

I have a requirement to add a temporary field to a data table. There are two types of fields you can add to a data table: DATA fields, which are actual fields, and COMPUTED fields, which have some limitations (not discussed here).
Let me explain my scenario. I have a table with the fields Qty, Rate and Amount. Now I need to add one more field, say InputRate, which will be just a temporary field and has no role in the database. The reason I need this is that I need to input the rate, let's say in US Dollars, and then convert it to my own currency, say SAR. So I want this temporary field not to be saved in the database, while the actual one (the rate converted to SAR) is stored.
We can do this easily in .NET Windows applications or web applications. But how can we do this in MS LightSwitch, given that it will not allow adding fields on a screen unless they are part of the data table? Even if you add a custom field (as I experimented) only on the screen, it repeats the same field's value for all rows (since this table is a DETAILS table). That is, if my table has 5 rows and in the 6th row I enter anything in the custom field (scoped only to the screen), it shows the same value in all the other rows as well; e.g. if I enter 10, all the other rows also start displaying 10.
Any idea how to do this?

DB Structure for saving form data for dynamically created forms

I am considering a use case wherein every user of the application can create any number of forms, each with different form fields of different data types. Each user can then receive form data in their respective forms. This data can then be updated, edited, searched, sorted etc.
I wonder, how do I design the DB architecture for this in MySQL.
My initial approach was to save the form structure in serialized form in one table, with form_id as the primary key. Another table, data, would hold the records with the columns: form_id, record_id, order, value. Here, order is a number denoting the position of the field in the form structure, and the value column holds the value for that field. This approach leaves me with 10 rows (for a form with 10 fields) for 1 set of records. Also, I don't think a query can be written against this design to search the records of a particular form.
I did think of using MongoDB for this use case, wherein the form structure would be stored in an array and, within that array, all records for that particular form would be stored. I have never used MongoDB, but I guess it has some sort of restriction on size if I store documents within a document. So, what is the best way to do this using MySQL (or any other DB that is a good fit for such a use case)?
Each MongoDB document is restricted to 16 MB in size. If you needed each document to be bigger, then you would need to adapt your design appropriately. MongoDB also includes GridFS, which allows you to break larger files into chunks.
In your example, you could store the form fields as an array in MongoDB, but you could also store them as an embedded document. Whichever way you choose to do it, your schema design in MongoDB needs to match your data access patterns.
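As a rough illustration of that embedded-document approach (the database, collection and field names below are made up), using the Python driver pymongo: one document holds the form definition with its fields embedded, and each submission stores its values keyed by field name, so searching the records of a particular form is a single query.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    db = client["forms_app"]                           # hypothetical database name

    # One document per form, with its fields embedded as an array
    db.forms.insert_one({
        "_id": 1,
        "owner": "user_42",
        "fields": [
            {"name": "email",    "type": "string", "order": 1},
            {"name": "quantity", "type": "int",    "order": 2},
        ],
    })

    # Each submission stores its values keyed by field name, so one set of
    # records from the EAV design becomes a single document
    db.submissions.insert_one({
        "form_id": 1,
        "values": {"email": "someone@example.com", "quantity": 3},
    })

    # Searching records for a particular form is then one query
    for doc in db.submissions.find({"form_id": 1, "values.quantity": {"$gte": 1}}):
        print(doc)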
It's easy to add new key:value pairs to MongoDB documents through your application. Pseudocode would be:
if user adds form field:
    db.collection.update({'id': 1}, {$addToSet: {'newField': 'value'}})
I know your question was about MySQL structure, but I thought it worth showing how it could work in MongoDB (obviously it would need to be adapted to your chosen language).