Let's say you need to funnel random, related data given to you into more succinct categories.
Example - You're given the following data. NOTE - there could be any number of other related columns:
Customer Product Category
========== ========= =================================
Customer A Product A Cat 1
CustomerA Product B Category 1
Cust-A Product C Totally Lame & Unrelated Grouping
Task - Consolidate and normalize the above into clean, pre-defined groupings:
CustomerA
    Category1
        ProductA
        ProductB
        ProductC
Please don't worry about how the finished data will be persisted; rather, focus on how you'll persist and manage the rules for grouping.
Only one constraint: you can't use a database to persist your grouping rules. So when we say "normalize", we're not speaking in terms of relational database normalization rules; rather, we want to remove inconsistencies from data inputs (as seen above) to bring the random data into a consistent state.
So what are the available options? Remain technology agnostic:
XML?
Config files?
Settings file (compiled or not)?
Ini File?
Code?
etc.
List pros & cons for each answer. And though this is indeed an exercise, it's a real-world problem. So assume your client/employer has tasked you with this.
This seems like a data cleansing exercise, and perfection is pretty much impossible. Issues:
1). Can you specify the categories up front, or must you deduce them from the data?
2). What rules can we use to accept equivalence? Is "Cat 1" the same as "Category 1"? And "Category one"? Is "Cat 1." also "Cat 1"? What about "Cat 1?" and "Cat 12"? Just getting a good set of rules is a challenge.
3). How would you capture those rules? Code or config? If config, how would you express it? Do you end up just writing a new specialised programming language?
A dictionary mapping for each value: 'Cat1' => 'Category1', 'Category 2' => 'Category2'. This is easy to store and has no unintended consequences. The disadvantage is that creating all those mappings by hand is actual work.
A series of regular expressions. That way, you're able to capture nearly all the rules with relatively little work. The disadvantage is that regular expressions 'misfire' relatively easily, and the order of evaluation matters (i.e. when values match more than one 'rule').
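To make the two approaches concrete, here is a minimal Perl sketch that tries the dictionary first and falls back to an ordered list of regexes (the mappings and patterns are illustrative, not a complete rule set):

use strict;
use warnings;

# Exact-match dictionary: unambiguous, but every variant must be listed.
my %map = (
    'Cat 1'      => 'Category1',
    'Category 1' => 'Category1',
    'Customer A' => 'CustomerA',
    'Cust-A'     => 'CustomerA',
);

# Regex fallbacks: broader coverage with less typing, but order matters
# when a value could match more than one pattern.
my @rules = (
    [ qr/^cat(egory)?\s*1$/i    => 'Category1' ],
    [ qr/^cust(omer)?[\s-]*a$/i => 'CustomerA' ],
);

sub normalize {
    my ($value) = @_;
    return $map{$value} if exists $map{$value};
    for my $rule (@rules) {
        my ($pattern, $target) = @$rule;
        return $target if $value =~ $pattern;
    }
    return undef;   # unrecognized: flag for human review
}

print normalize('Cat 1'), "\n";   # Category1 (dictionary hit)
print normalize('CAT1'),  "\n";   # Category1 (regex fallback)

In practice the dictionary doubles as an audit trail: every mapping a human has approved lives in one flat structure that is trivial to persist in any of the formats listed above.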
As for how to persist them? I can't think of a more uninteresting question. You just use whatever's easiest in your preferred programming language.
Related
Let's say I have a text file that looks like this:
<number> <name> <type> <inputs...>
1 XOR1 XOR A B
2 SUM XOR 1 C
What would be the best approach to generate the truth table for this circuit?
That depends on what you have available, and how big your file is.
Perl is optimized for reading files and generating simple text output. It doesn't have a library of boolean operators, but they're easy enough to write. I'd use that if I just wanted text-in, text-out.
If I wanted to display the data online AND generate a results file, I'd use PHP to read the data and write the table to a CSV file that could either be opened in Excel, or posted online in an HTML table.
If your data is in a REALLY BIG data file, I'd use SQL.
If your data is in a really huge file that you want to be accessible to authorized users online, and you want THEM to be able to create truth tables, I'd use Oracle's APEX to create an easy interface for them to build their own truth tables and play around with the data without altering it.
If you're in an electrical engineering environment, use the tools designed for your problem -- Verilog or similar.
Whatcha got? Whatcha wanna do with it?
-- Ada
I prefer using C#. I already have the code to 'parse' the input text file. I just don't know where to start in terms of actually 'simulating' it. The output can simply be a text file with inputs and output values. – Don 12 mins ago
How many inputs and how many outputs are in the circuit you want to simulate?
The size of the simulation determines how it can most easily be run. If the circuit is small(ish), you can enter the inputs and circuit values into vector arrays, then cross them to get the output matrix.
Matlab is ideal for this, as it was written for processing arrays.
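If you'd rather stay text-in/text-out, here's a minimal Perl sketch of the brute-force enumeration (the two-gate netlist from the question is hard-coded; a real version would build @gates from the parsed file):

use strict;
use warnings;

# Netlist from the question, hard-coded for brevity.
my @gates = (
    { num => 1, type => 'XOR', inputs => [ 'A', 'B' ] },   # 1 XOR1 XOR A B
    { num => 2, type => 'XOR', inputs => [ '1', 'C' ] },   # 2 SUM  XOR 1 C
);
my @primary = qw( A B C );

print "A B C | SUM\n";
for my $row ( 0 .. 2**@primary - 1 ) {
    my %signal;
    # Peel the row number apart into one bit per primary input.
    for my $i ( 0 .. $#primary ) {
        $signal{ $primary[$i] } = ( $row >> ( $#primary - $i ) ) & 1;
    }
    # Gates appear in dependency order, so a single pass suffices;
    # an input may name a primary input or an earlier gate's number.
    for my $g (@gates) {
        my ( $x, $y ) = map { $signal{$_} } @{ $g->{inputs} };
        $signal{ $g->{num} } = $x ^ $y;   # XOR; add cases for AND, OR, NOT...
    }
    print join( ' ', map { $signal{$_} } @primary ), " | $signal{2}\n";
}

The same enumeration loop translates directly to C#, since you already have the parsing done.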
Again: Whatcha got, and whatcha wanna do with it?
-- Ada
I was checking one share-trading site's AJAX response, and below is what showed up in Firebug's Response tab in the XHR section. Can anyone explain what format this is and how it is parsed?
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>
<3199v=Ashram Online.com Ltdv=ASHONL>
<4866v=Kreon Finnancial Services Ltdv=KREFIN>
<552v=Tata Chemicals Ltdv=TATCHE>
<554v=Tata Power Company Ltdv=TATPOW>
<2986v=Tata Metaliks Ltdv=TATMET>
<300v=Tata Sponge Iron Ltdv=TATSPO>
<121v=Tata Coffee Ltdv=TATCOF>
<2295v=Tata Communications Ltdv=TATCOM>
<0v=Time In Milli-Secondsv=0>
I think what we are dealing with here is some proprietary format, likely an Eldritch SGML Horror of some sort.
Banking in general has all sorts of Eldritch horrors running about.
On a related note, this is very much not XML.
Edit:
A quick analysis* indicates that this is a format consisting of a series of statements bracketed by < and >, with the parts of the statements separated by = or v=. = seems to indicate a parameter to a control statement identified by a two-letter code (<ST=tat>), while v= seems to indicate an assignment or coupling of some kind (short for "value"?), or perhaps just a field separator.
<ST appears to be short for "search term"; <TB appears to be short for "(source) table". The meaning of <SI eludes me. It is possible that <TB terminates the metadata section, but it's equally possible that the metadata section has a fixed number of terms.
As nothing refers to the number of fields in each statement in the data section, and they are all of the same length (3 fields), it is likely that the number of fields is fixed, but it might derive from the value of <TB, or even <SI, in some way.
What is abundantly clear, however, is that this data is not intended for consumption by any application other than the one that supplies it.
*Caveat: Without a much larger sample it's impossible to tell if this analysis is valid.
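If you were forced to consume it anyway, a rough Perl sketch under the assumptions above (statements bracketed by < and >, two-letter control codes using =, data fields split on v=; entirely speculative given the small sample):

use strict;
use warnings;

my (%meta, @rows);
while (my $line = <DATA>) {
    next unless $line =~ /^<(.*)>\s*$/;      # keep only <...> statements
    my $body = $1;
    if ($body =~ /^([A-Z]{2})=(.*)$/) {
        $meta{$1} = $2;                      # control statement, e.g. ST=tat
    } else {
        push @rows, [ split /v=/, $body ];   # data row: id, name, symbol
    }
}
print "search term: $meta{ST}\n";
printf "%-6s %-32s %s\n", @$_ for @rows;

__DATA__
<ST=tat>
<SI=0>
<TB=txtSearch>
<560v=Tata Motors Ltdv=TATMOT>
<566v=Tata Steel Ltdv=TATSTE>

Note that this breaks the moment a company name happens to contain "v=", which is exactly why guessing at proprietary formats is a losing game.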
It is not a commonly used "web format".
It is probably a proprietary format used by that site and will be parsed by their custom JavaScript.
I am storing a series of events in a CSV file; each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%, ,
, , 12
16mph, 20%, ,
14mph, 20%, ,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
CSV data is to be of use to other developers (I will/could/should/won't use either structure).
The 'other developers' are toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. Although, if the application is providing a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a machine. Neither method is more complex to parse with a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications, et al.? With Option A, future changes/versions to the data set of an individual event may break the CSV structure (zombie commas maintained for forwards compatibility); whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like OpenFrameworks, Plask, Processing, et al., where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readbility is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e. don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented, and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.
Good morning.
First of all: this is the most impressive community I ever saw!
For several days I have mused over the three-fold job of:
a. getting
b. parsing
c. storing a number of pages.
Two days ago I thought that getting the pages would be the major task. No, this isn't the case - I guess that the parsing job will be the heroic task. Each of the pages that is intended to be parsed is a PNG image.
So the question is - after getting all of them, how do I parse them!? This seems to be the issue. I guess that there are some Perl modules out there that can help in doing this...
Well - I think that this job can only be done with some OCR embedded! Question: is there a Perl module that can be used here to support this task?
BTW: see the result pages.
BTW: as I thought I could find all 790 result pages within a certain range between Id=0 and Id=100000, I thought that I could go the way of a loop:
http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html
I thought I could go the Perl way, but I am not very sure:
I was trying to use LWP::UserAgent on the same URLs [see above] with different query arguments, and I am wondering if LWP::UserAgent provides a way for us to loop through the query arguments? I am not sure that LWP::UserAgent has a method for us to do that. Well - I sometimes hear that it is easier to use Mechanize. But is it really easier!?
But - to be frank: the first task, GETTING all the pages, is not very difficult if we compare it with the parsing... How can that be done!?
Any ideas or suggestions?
I look forward to hearing from you...
zero
You do not need a special Perl OCR module; you only need the system function to shell out to Tesseract:
use File::Slurp;   # provides read_file
system qw[ tesseract.exe foo.png foo ];   # Tesseract appends .txt to the output base itself
my $text = read_file('foo.txt');
You may need to preprocess the images to help Tesseract, say using ImageMagick like:
system qw[ convert.exe -resize 200% image.jpg foo.png ];   # low-resolution scans OCR poorly; upscaling helps
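As for the GETTING part of your question: LWP::UserAgent has no special method for looping over query arguments; you simply build each URL in an ordinary loop. A minimal sketch (URL pattern taken from your question; the Id range and file naming are assumptions):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
for my $id (0 .. 100_000) {
    my $url = "http://www.foundationfinder.ch/ShowDetails.php"
            . "?Id=$id&InterfaceLanguage=1&Type=Html";
    my $res = $ua->get($url);
    next unless $res->is_success;            # most Ids won't exist
    open my $fh, '>', "page_$id.html" or die $!;
    print {$fh} $res->decoded_content;       # save for the parsing/OCR stage
    close $fh;
}

WWW::Mechanize is a convenience wrapper around LWP::UserAgent, mainly useful for following links and filling in forms; for a flat Id loop like this, plain LWP is enough.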
I am doing a Rails 3 app that replaces a paper form for a company. The paper form spans two pages and contains a LOT of fields, checkboxes, drop-downs, etc.
I am wondering how to model that in the DB. One approach is to just create a field in the DB for every field on the form (normalized, of course). That will make it somewhat difficult to add or remove fields, since a migration will be needed. Another approach is to do some kind of key/value store (no - MongoDB/CouchDB is not an option - MySQL is required). Doing key/value will be very flexible but will be a pain to query. And won't it work directly against ActiveRecord?
Anyone have a great solution for this?
Regards,
Jacob
I would recommend that you model the most common attributes as separate database fields. Once you have set up as many fields as possible, fall back to using a key-value setup for your pseudo-random attributes. I'd recommend the simple approach of storing a Hash through the ActiveRecord method serialize. For example:
class TPS < ActiveRecord::Base
  serialize :custom, Hash
end

tps = TPS.create(:name => "Kevin", :ssn => "123-456-789", :custom => { :abc => 'ABC', :def => 'DEF' })
tps.name          # Kevin
tps.ssn           # 123-456-789
tps.custom[:abc]  # ABC
tps.custom[:def]  # DEF
If your form is fairly static, go ahead and make a model for it, that's a reasonable approach even if it seems rather rudimentary. It's not your fault the form is so complicated, you're just coming up with a solution that takes that into account. Migrations to make adjustments to this are really simple to implement and easy to understand.
Splitting it up into a key/value version would be better but would take a lot more engineering. If you expect that this form will be subject to frequent and radical revisions it may make more sense to build for the future in this regard. You can see an example of the sort of form-builder you might want to construct at something like WuFoo but of course building form builders is not to be taken lightly.