What is the technical programming term for concatenating two variables in order to fit a constraint, only to separate them later down the line? - terminology

I discovered it when working with PHP and an API. I haven't seen it used for years as languages have become more programmer-friendly.
Context:
I have a program that takes in X pieces of data of arbitrary size. This data is passed through to an API that returns some information based on the data sent.
Now, the API only has fields for X-1 pieces of data, so to get around this I would concatenate two of the X pieces, say x1 and x2, with a known character in between. I would generally use ":", so the API would be sent X-1 pieces of data, one of which was x1:x2. This x1:x2 piece would be part of what was returned, so in the receiving script on my server I would write a simple function (possibly a method) to split the data on the ":" character, assigning one side to one variable and the other side to another.
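For illustration, here is a minimal sketch of the pattern in Python (the original was PHP, and the separator and field names here are made up):

SEPARATOR = ":"  # a character known never to appear in the data itself

def pack_fields(x1, x2):
    # Squeeze two values into the one remaining API field.
    return f"{x1}{SEPARATOR}{x2}"

def unpack_fields(combined):
    # On the receiving side, split the round-tripped value back apart.
    x1, x2 = combined.split(SEPARATOR, 1)
    return x1, x2

# pack_fields("user42", "session9")   -> "user42:session9"
# unpack_fields("user42:session9")    -> ("user42", "session9")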
How would a professional computer scientist describe this technique?

This is usually referred to as marshalling.

Related

Bulk export of binary waveform data from oscilloscope to data points (CSV preferred)

I'm working with some binary waveform files from various early to mid-90s HP scopes. I am trying to do a bulk conversion (we have over 5000) of the files to CSVs and then upload them into a database. I've tried hexdump, xxd, od, strings, etc. and none of them seem to work. I did hunt down a programmer's manual but it's not making a whole lot of sense.
The files have a preamble line as ASCII text, but the data points are in binary, and for some reason nothing I try can decode them. The preamble gives the information necessary to interpret the binary values and calculate the correct results. It also states that the data is in WORD format.
:WAV:PRE 2,1,32768,1,+4.000000E-08,-4.9722700001108E-06,0,+2.460630E-04,+2.500000E+00,16384;:WAV:DATA #800065536^W�^W�^W�^
I'm pretty confused.
Have a look at
http://www.naic.edu/~phil/hardware/oscilloscopes/9000A_Programmer_Reference.pdf
specifically page 1-21. After ":WAV:DATA", I think the rest of the chunk above will have 65536 8-bit data bytes (the start of which is represented above by �). The ^W is probably a delimiter, so you would have to parse that out. Just a thought.
UPDATE: I'm new to oscilloscope data collection and am trying to figure the whole thing out from scratch. So, on further digging, it looks like the data you have provided shows this:
PREamble:
- WORD format (16-bit signed integers split into 2 8-bit bytes)
- If there is a WAV:BYT section, that would specify byte order for each pair
- RAW data
- 32768 data points
- COUNT = 1 (I'm not clear on the meaning of this)
- Next 3 should be X increment, origin, reference
- Next 3 should be Y increment, origin, reference, although the manual that I pointed you at above has many more fields than just these, so you might want to consult your specific scope manual.
DATA:
- On closer examination, I don't think the ^W is a delimiter; I think it is the first byte of the pair (00010111). The � character is apparently a standard "I don't know how to represent this character" web representation. You would need to look at that character as 8 bits also.
- 65536 byte pairs of data
I'm not finding a utility that will do this for you. I think you're going to have to write or acquire some code (Perl, C, Java, Python, VB, etc.) to get this done.
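If it helps, here is a rough Python sketch of the sort of code I mean, assuming a layout like the sample above (an ASCII ":WAV:PRE ..." preamble, then ":WAV:DATA" with a definite-length block header, then big-endian 16-bit words). The field positions and the count-to-voltage formula should be checked against your scope's manual:

import struct

def read_waveform(path):
    raw = open(path, "rb").read()

    # Split the ASCII preamble from the binary block.
    pre_start = raw.index(b":WAV:PRE")
    data_start = raw.index(b":WAV:DATA")
    preamble = raw[pre_start + len(b":WAV:PRE"):data_start].decode("ascii").strip(" ;")
    fields = preamble.split(",")
    points = int(fields[2])
    x_inc, x_org, x_ref = float(fields[4]), float(fields[5]), float(fields[6])
    y_inc, y_org, y_ref = float(fields[7]), float(fields[8]), float(fields[9])

    # "#800065536" is a definite-length block: '#', one digit N, then N digits of byte count.
    block = raw[data_start + len(b":WAV:DATA"):]
    hash_pos = block.index(b"#")
    n_digits = int(block[hash_pos + 1:hash_pos + 2])
    n_bytes = int(block[hash_pos + 2:hash_pos + 2 + n_digits])
    payload = block[hash_pos + 2 + n_digits:hash_pos + 2 + n_digits + n_bytes]

    # WORD format: 16-bit signed integers; try big-endian first, little-endian if values look wrong.
    samples = struct.unpack(">%dh" % (n_bytes // 2), payload)

    # Convert raw counts to (time, voltage) rows ready for CSV.
    rows = []
    for i, s in enumerate(samples[:points]):
        t = (i - x_ref) * x_inc + x_org
        v = (s - y_ref) * y_inc + y_org
        rows.append((t, v))
    return rows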

What is an Abstract Syntax Tree/Is it needed?

I've been interested in compiler/interpreter design/implementation for as long as I've been programming (only 5 years now), and it's always seemed like the "magic" behind the scenes that nobody really talks about (I know of at least 2 forums for operating system development, but I don't know of any community for compiler/interpreter/language development). Anyway, recently I've decided to start working on my own, in hopes of expanding my knowledge of programming as a whole (and hey, it's pretty fun :). So, based on the limited amount of reading material I have, and Wikipedia, I've developed this concept of the components of a compiler/interpreter:
Source code -> Lexical Analysis -> Abstract Syntax Tree -> Syntactic Analysis -> Semantic Analysis -> Code Generation -> Executable Code.
(I know there's more to code generation and executable code, but I haven't gotten that far yet :)
And with that knowledge, I've created a very basic lexer (in Java) to take input from a source file, and output the tokens into another file. A sample input/output would look like this:
Input:
int a := 2
if(a = 3) then
print "Yay!"
endif
Output (from lexer):
INTEGER
A
ASSIGN
2
IF
L_PAR
A
COMP
3
R_PAR
THEN
PRINT
YAY!
ENDIF
Personally, I think it would be really easy to go from there to syntactic/semantic analysis, and possibly even code generation, which leads me to the question: why use an AST, when it seems that my lexer is doing just as good a job? However, 100% of the sources I've used to research this topic seem adamant that this is a necessary part of any compiler/interpreter. Am I missing the point of what an AST really is (a tree that shows the logical flow of a program)?
TL;DR: Currently en route to developing a compiler; I've finished the lexer, and it seems to me like its output would make for easy syntactic/semantic analysis, rather than building an AST. So why use one? Am I missing the point of one?
Thanks!
First off, one thing about your list of components does not make sense. Building an AST is (pretty much) the syntactic analysis, so that stage either shouldn't be in there as a separate step, or should at least come before the AST.
What you've got there is a lexer. All it gives you are individual tokens. In any case, you will need an actual parser, because regular languages aren't any fun to program in. You can't even (properly) nest expressions. Heck, you can't even handle operator precedence. A token stream doesn't give you:
1. An idea where statements and expressions start and end.
2. An idea of how statements are grouped into blocks.
3. An idea of which part of the expression has which precedence, associativity, etc.
4. A clear, uncluttered view of the actual structure of the program.
5. A structure which can be passed through a myriad of transformations, without every single pass needing to know, and have code to accommodate, that the condition in an if is enclosed by parentheses.
6. ... more generally, any kind of comprehension above the level of a single token.
Suppose you have two passes in your compiler which optimize certain kinds of operators applied to certain arguments (say, constant folding and algebraic simplifications like x - x -> 0). If you hand them tokens for the expression x - x * 1, these passes are cluttered with figuring out that the x * 1 part comes first. And they have to know that, or the transformation will be incorrect (consider 1 + 2 * 3).
These things are tricky enough to get right as it is, so you don't want to be pestered by parsing problems as well. That's why you solve the parsing problem first, in a separate parsing step. Then you can, say, replace a function call with its definition, without worrying about adding parentheses so the meaning remains the same. You save time, you separate concerns, you avoid repetition, you enable simpler code in many other places, etc.
A parser figures all that out, and builds an AST which consequently holds all that information. Without any further data on the nodes, the shape of the AST alone gives you no. 1, 2, 3, and much more, for free. None of the bazillion passes that follow have to worry about it anymore.
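To make that concrete, here is a rough Python sketch (node types and names invented for illustration) of what an algebraic-simplification pass looks like once it works on tree nodes instead of tokens:

from dataclasses import dataclass

# Minimal AST node types; a real compiler has many more.
@dataclass
class Num:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def simplify(node):
    # Algebraic simplification over the tree: x - x -> 0, e * 1 -> e.
    if isinstance(node, BinOp):
        left, right = simplify(node.left), simplify(node.right)
        if node.op == "-" and left == right:
            return Num(0)
        if node.op == "*" and right == Num(1):
            return left
        return BinOp(node.op, left, right)
    return node

# The parser has already encoded that '*' binds tighter than '-', so
# "x - x * 1" arrives as BinOp('-', Var('x'), BinOp('*', Var('x'), Num(1))).
tree = BinOp("-", Var("x"), BinOp("*", Var("x"), Num(1)))
print(simplify(tree))   # Num(value=0)

The pass never has to think about parentheses, precedence, or where the expression ends; the shape of the tree already says all of that.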
That's not to say you always have to have an AST. For sufficiently simple languages, you can do a single-pass compiler. Instead of generating an AST or some other intermediate representation during parsing, you emit code as you go. However, this becomes harder for less simple languages and you can't reasonably do a lot of stuff (such as 70% of all optimizations and diagnostics -- and yes I just made that number up). Generally, I wouldn't advise you to do this. There are good reasons single-pass compilers are mostly dead. Even languages which permit them (e.g. C) are nowadays implemented with multiple passes and ASTs. It's a simple way to get started, but will severely limit you (and the language, if you design it) later.
You've got the AST at the wrong point in your flow diagram. Typically, the output of the lexer is a series of tokens (as you have in your output), and these are fed to the parser/syntactic analyzer, which generates the AST. So the output of your lexer is different from an AST because they are used at different points in the compilation process and fulfill different purposes.
The next logical question is: What, then, is an AST? Well, the purpose of parsing/syntactic analysis is to turn the series of tokens generated by the lexer into an AST (or parse tree). The AST is an intermediate representation that captures the relationship between syntactical elements in a way that is easier to work with programmatically. One way of thinking about this is that a text program is a one dimensional construct, and can only represent ideas as a sequence of elements, while the AST is freed from this constraint, and can represent the underlying relationships between those elements in 2 dimensions (as typically drawn), or any higher dimension space if you so choose to think about it that way.
For instance, a binary operator has two operands, let's call them A and B. In code, this may be spelled 'A * B' (assuming an infix operator - another advantage of an AST is to hide such distinctions that may be important syntactically, but not semantically), but for the compiler to "understand" this expression, it must read 5 characters sequentially, and this logic can quickly become cumbersome, given the many possibilities in even a small language. In an AST representation, however, we have a "binary operator" node whose value is '*', and that node has two children, values 'A' and 'B'.
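A tiny sketch of that step in Python (tuples standing in for AST nodes, only single-token operands, no parentheses), just to show how precedence gets baked into the parser once rather than into every later pass:

def parse_expression(tokens, pos=0):
    # expression := term (('+'|'-') term)*
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (op, node, right)          # a binary-operator node with two children
    return node, pos

def parse_term(tokens, pos=0):
    # term := operand (('*'|'/') operand)*
    node, pos = tokens[pos], pos + 1
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = tokens[pos + 1], pos + 2
        node = (op, node, right)
    return node, pos

print(parse_expression(["A", "*", "B"])[0])            # ('*', 'A', 'B')
print(parse_expression(["A", "+", "B", "*", "C"])[0])  # ('+', 'A', ('*', 'B', 'C'))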
As your compiler project progresses, I think you will begin to see the advantages of this representation.

On-the-fly parser/pre-generation space/time tradeoff considerations

Do the space-related benefits of using an on-the-fly parser outweigh the time-related benefits of a pre-generated lookup table?
Long version:
I am authoring a chemistry reference tool, and am including a feature that will automatically name formulae conforming to a specific pattern; e.g. C[n]H[2n+2] => [n]ane, where [n] is an integer on the LHS and an index into an array of names (meth, eth, …) on the RHS.
As far as I can see, this can be implemented in one of two ways:
1. I pre-generate a key/value dual lookup dictionary of formula <=> name pairs, either when the application starts (slower startup) or as a static list which is published with the application (slower download).
2. Formulae are evaluated on the fly by a custom-built parser.
In approach 1, name => formula lookup becomes simpler by an order of magnitude, but the generator will, unless I want to ship dozens of megabytes of data with the application, have to have a preset, and fairly low, value for n.
Compounding this is the fact that formulae can have several terms, such as C[n]H[2n+1]OC[n']H[2n'+1], and for each of these the number of possible matches increases geometrically with n. Additionally, using this approach would eat RAM like nobody's business.
Approach 2 lets me support fairly large values of n using a fairly small lookup table, but makes name => formula lookup somewhat more complex. Compared to pre-generating a file to ship with the application, it also lets me correct errors in the generation logic without having to ship new data files.
This also requires that each formula be matched against a cursory test for several rules to determine whether it could fit, which, if there are a lot of rules, takes time and might lead to noticeable slowdowns in the interface.
The question, then, is:
Are there any considerations in the tradeoff I have failed to account for, or approaches that I haven't considered?
Do the benefits of using an on-the-fly parser justify the increased implementation complexity?
You should go with the second approach.
One possible solution is a greedy algorithm. Define your set of transformations as a regular expression (used to test the pattern) and a function which is given the regexp match object and returns the transformed string.
Regular expressions aren't quite powerful enough to handle what you want directly. Instead you'll have to do something like:
m = re.match(r"C\[(\d+)\]H\[(\d+)]\]", formula)
if m:
C_count, H_count = int(m.group(1)), int(m.group(2))
match_size = len(m.group(0))
if C_count*2+2 == H_count:
replacement = alkane_lookup[C_count]
elif C_count*2 == H_count:
replacement = alkene_lookup[C_count]
...
else:
replacement = m.group(0) # no replacement available
(plus a lot more for the other possibilities)
then embed that in a loop which looks like:
formula = "...."
new_formula = ""
while formula:
match_size, replacement = find_replacement(formula)
new_formula += replacement
formula = formula[match_size:]
(You'll need to handle the case where nothing matches. One possible way is to include a list of all possible elements at the end of find_replacement(), which only returns the next element and counts.)
This is a greedy algorithm, which doesn't guarantee the smallest solution. Doing that is more complicated, but since chemists themselves have different ideas of the right form, I wouldn't worry so much about it.

Generating unique codes that differ in at least two digits

I want to generate unique code numbers (composed of exactly 7 digits). The code number is generated randomly and saved in a MySQL table.
I have another requirement. All generated codes should differ in at least two digits. This is useful to prevent errors when typing a user code. Hopefully, it will prevent a typo from referring to another user's code during some operation, as it is unlikely that mistyping two digits will match another existing user code.
The generation algorithm works simply like this:
1. Retrieve all previous codes, if any, from the MySQL table.
2. Generate one code at a time.
3. Subtract the generated code from each of the previous codes.
4. Check the number of non-zero digits in each subtraction result.
5. If it is > 1, accept the generated code and add it to the previous codes.
6. Otherwise, jump to 2.
7. Repeat steps 2 to 6 for the number of requested codes.
8. Save the generated codes in the DB table.
The algorithm works fine, but the problem is related to performance. It takes a very long time to finish generating the codes when a large number of codes is requested, e.g. 10,000.
The question: Is there any way to improve the performance of this algorithm?
I am using Perl + MySQL on an Ubuntu server, if that matters.
Have you considered a variant of the Luhn algorithm? Luhn is used to generate a check digit for strings of numbers in lots of applications, including credit card account numbers. It's part of the ISO-7812-1 standard for generating identifiers. It will catch any number that is entered with one incorrect digit, which implies any two valid numbers differ in at least two digits.
Check out Algorithm::LUHN on CPAN for a Perl implementation.
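For reference, the check digit computation itself is only a few lines; here is a small Python sketch of the standard Luhn scheme (the CPAN module does this for you):

def luhn_check_digit(digits):
    # Walk the payload right to left, doubling every second digit and
    # subtracting 9 when the doubled value exceeds 9.
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:        # these positions get doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

payload = "123456"                                  # 6 random digits
code = payload + str(luhn_check_digit(payload))     # 7-digit code with check digit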
Don't retrieve the existing codes, just generate a potential new code and see if there are any conflicting ones in the database:
SELECT code FROM table WHERE abs(code-?) regexp '^[1-9]?0*$';
(where the placeholder is the newly generated code).
Ah, I missed the part about generating lots of codes at once. Do it like this (completely untested):
my @codes = existing_codes();
my $frontwards_index = {};
my $backwards_index = {};

for my $code (@codes) {
    index_code($code, $frontwards_index);
    index_code(scalar reverse($code), $backwards_index);
}

my @new_codes = map generate_code($frontwards_index, $backwards_index), 1 .. 10000;

# Bucket each code by the first half of its digits.
sub index_code {
    my ($code, $index) = @_;
    push @{ $index->{ substr($code, 0, length($code) / 2) } }, $code;
    return;
}

# Return true if any indexed code differs from $code in at most one digit.
sub check_index {
    my ($code, $index) = @_;
    my $bucket = $index->{ substr($code, 0, length($code) / 2) } || [];
    my $found  = grep { ( ($_ ^ $code) =~ y/\0//c ) <= 1 } @$bucket;
    return $found;
}

sub generate_code {
    my ($frontwards_index, $backwards_index) = @_;
    my $new_code;
    do {
        $new_code = sprintf("%07d", rand(10000000));
    } while check_index($new_code, $frontwards_index)
        || check_index(scalar reverse($new_code), $backwards_index);
    index_code($new_code, $frontwards_index);
    index_code(scalar reverse($new_code), $backwards_index);
    return $new_code;
}
Put the numbers 0 through 9,999,999 in an augmented binary search tree. The augmentation is to keep track of the number of sub-nodes to the left and to the right. So for example when your algorithm begins, the top node should have value 5,000,000, and it should know that it has 5,000,000 nodes to the left, and 4,999,999 nodes to the right. Now create a hashtable. For each value you've used already, remove its node from the augmented binary search tree and add the value to the hashtable. Make sure to maintain the augmentation.
To get a single value, follow these steps.
1. Use the top node to determine how many nodes are left in the tree. Let's say you have n nodes left. Pick a random number k between 0 and n. Using the augmentation, you can find the kth node in your tree in log(n) time.
2. Once you've found that node, compute all the values that would make the value at that node invalid. Let's say your node has value 1,111,111. If you already have 2,111,111 or 3,111,111 or... then you can't use 1,111,111. Since there are 9 other options per digit and 7 digits, you only need to check 63 possible values. Check to see if any of those values are in your hashtable. If you haven't used any of those values yet, you can use your random node. If you have used any of them, then you can't.
3. Remove your node from the augmented tree. Make sure that you maintain the augmented information.
4. If you can't use that value, return to step 1.
5. If you can use that value, you have a new random code. Add it to the hashtable.
Now, checking to see if a value is available takes O(1) time instead of O(n) time. Also, finding another available random value to check takes O(log n) time instead of... ah... I'm not sure how to analyze your algorithm.
Long story short, if you start from scratch and use this algorithm, you will generate a complete list of valid codes in O(n log n). Since n is 10,000,000, it will take a few seconds or something.
Did I do the math right there everybody? Let me know if that doesn't check out or if I need to clarify anything.
Use a hash.
After generating a successful code (one not conflicting with any existing code), put that code in the hash table, and also put the 63 other codes that differ from it in exactly one digit into the hash.
To see if a randomly generated code will conflict with an existing code, just check if that code exists in the hash.
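In Python-ish pseudocode (the question uses Perl, but the idea is the same; names are made up):

import random

used = set()       # every accepted code plus all of its one-digit variants
accepted = []

def one_digit_variants(code):
    # The 63 codes that differ from `code` in exactly one digit.
    for pos in range(len(code)):
        for d in "0123456789":
            if d != code[pos]:
                yield code[:pos] + d + code[pos + 1:]

while len(accepted) < 10000:
    candidate = "%07d" % random.randrange(10000000)
    if candidate in used:
        continue       # identical to, or one digit away from, an accepted code
    accepted.append(candidate)
    used.add(candidate)
    used.update(one_digit_variants(candidate))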
How about:
Generate a 6-digit code by autoincrementing the previous one.
Generate a 1-digit code by incrementing the previous one mod 10.
Concatenate the two.
Presto, guaranteed to differ in two digits. :D
(Yes, being slightly facetious. I'm assuming that 'random', or at least quasi-random, is necessary. In which case, generate a 6-digit random key, repeat until it's not a duplicate (i.e. make the column unique, repeat until the insert doesn't fail the constraint), then generate a check digit, as someone already said.)

Emulating IBM floating point multiplication/addition in VBA

I am attempting to emulate a (no longer existing) mainframe report generator in an Access 2003 or Access 2010 environment. The data it generates must exactly match paper reports from the early 70s. Unfortunately, the earliest years' data were run on hardware that used IBM floating point representation instead of IEEE. With the help of Google, I've found a library of VBA functions that will convert a float from decimal to the IEEE 754 32-bit binary format. I had to modify the library to accept either 32-bit or 64-bit floats, so I have a modest working knowledge of floating point formats; however, I'm having trouble making the conversion from IEEE to IBM binary format, as well as trouble multiplying and adding either the IBM or the IEEE numbers.
I haven't turned up any other libraries for performing this conversion and arithmetic operations in VBA - is there an easier way to go about this, or an existing library that I'm not finding? Failing that, a clear and straightforward explanation of the relevant algorithms?
Thanks in advance.
To be honest you'd probably do better to start by looking at the Hercules emulator.
http://www.hercules-390.org/ Other than that, in theory with VBA you can use the Decimal type to get good results (note you have to use CDec to create these); it uses 12 bytes with a variable power-of-ten scalar.
A quick Google search shows this post from the Hercules group, which confirms Albert's point about needing to know the hardware:
---Snip--
In theory, but rather less so in practice. S/360 and S/370 had a
choice of Scientific or Commercial instruction sets. The former added
the FP instructions and registers to the base; the latter the decimal
instructions, including Edit and Edit & Mark. But larger 360 (iirc /65
and up) and 370 (/155 and up) models had the union of the two, called
the Universal instruction set, and at some point the S/370 dropped the
option.
---snip---
I have to say that, having looked at the Hercules source code, you'll probably need to figure out exactly which floating point operation codes (in terms of precision: single, long, extended) are being performed.
The problem here is that you're confusing the Decimal type in Access with the Single and Double floating point types available in Access.
If you use the Currency data type in Access, this is a scaled integer, and will not produce rounding errors (that is what most of us use for financial calculations and reports). You can also use Decimal values in Access, and again they don't round at all, as they are packed decimals.
However, both the Single and Double values available inside of Access do in fact conform to the IEEE floating point standard.
For an Access Single variable, this is a 32-bit number, and the range is:
-3.402823E38 to -1.401298E-45 for negative values, and
1.401298E-45 to 3.402823E38 for positive values.
That looks to be the same to me as the IEEE 754 standard.
So, if you add up values in Access as a Single, you should get the same results, including rounding.
So, on Intel-based hardware, Access Singles and Doubles are, I believe, the same as this IEEE standard.
The only real issue here is: what is the format of the original data you're pulling into Access, and what kind of text, string, or conversion process occurs when that data is pulled in and stored?
Access can convert numbers. Try typing these values at the Access command line prompt (debug window):
? hex(255)
Above will show FF
? csng(&hFF)
Above will show 255
Edit:
Ah, OK, I see now I have this reversed; my mistake here. The problem is that, assuming you convert a number to the older IBM format (excess-64?), you will THEN have to get your hands on the code they used for adding those numbers. In fact, even back then, different IBM models, depending on what you purchased, actually produced different results (more money = more precision).
So, not only do you need conversion routines to convert to the internal representation, you THEN need the routines that add/subtract/multiply those numbers. So, just having conversion routines is not going to get you very far, since you also have to duplicate their exact routines that do math. Those types of routines are likely not all created equal in terms of how they round numbers etc.
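For the conversion half of the problem, the classic S/360 single-precision layout (1 sign bit, a 7-bit excess-64 base-16 exponent, and a 24-bit fraction) is simple enough to code by hand. A rough sketch in Python rather than VBA, just to show the bit manipulation; as noted above, this does nothing to reproduce the rounding behaviour of the original hardware's add/multiply routines:

def ibm32_to_float(word):
    # Decode a 32-bit IBM hexadecimal float into a native float.
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = ((word >> 24) & 0x7F) - 64            # power of 16
    fraction = (word & 0x00FFFFFF) / float(1 << 24)  # 0 <= fraction < 1
    return sign * fraction * (16.0 ** exponent)

def float_to_ibm32(value):
    # Encode a native float as the nearest (truncated) 32-bit IBM hex float.
    if value == 0.0:
        return 0
    sign = 0x80000000 if value < 0 else 0
    value = abs(value)
    exponent = 64
    # Normalise so that 1/16 <= value < 1, adjusting the base-16 exponent.
    while value >= 1.0:
        value /= 16.0
        exponent += 1
    while value < 1.0 / 16.0:
        value *= 16.0
        exponent -= 1
    return sign | (exponent << 24) | (int(value * (1 << 24)) & 0x00FFFFFF)

# 1.0 is 0x41100000 in IBM hex float.
assert ibm32_to_float(0x41100000) == 1.0
assert float_to_ibm32(1.0) == 0x41100000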