I am looking at the German HBCI/FinTS protocol. One peculiarity of this protocol is that it can contain binary blobs, which are prefixed by #NUM_OF_BINARY_CHARS#. Otherwise the protocol is quite simple; a grammar could be described as follows (a bit simplified, terminals are quoted with "):
message = segment+
segment = elements "'"
elements = element "+" elements | element
element = items
items = item ":" items | item
item = [a-zA-Z0-9,._-]* | escaped item
escaped = ?[-#?_-a-zA-Z0-9,.]
The # is missing here!
A sample message could look something like this
FirstSegment+Elem1+Item1:Item2+#4#:'+#+The_last_four_chars_are_binary+Elem4'SecondSegment+Elem5'
Can this language (with the escaping of binary strings) be described by a context free grammar?
No, this language is not context-free. The format you're describing is essentially equivalent to this language:
{ #n#w | n is a natural number and |w| = n }
You can show that this isn't context-free by using the context-free pumping lemma. Let the pumping length be p and consider the string #1^p#a^N, where 1^p is the digit 1 repeated p times and the payload a^N is the character a repeated N times, N being the number written with p ones in decimal. This is a string encoding of a binary piece of data whose declared length is 11...1 (p ones). Now split the string into u, v, x, y, z where |vy| ≥ 1 and |vxy| ≤ p. If v or y contains a # sign, then uv^0xy^0z (that is, uxz) isn't in the language because it doesn't have enough # signs. If v and y are purely contained in the length field 1^p, then pumping up produces a string not in the language because the binary data no longer has the declared size. Similarly, if v and y are purely contained in the payload a^N, pumping up or down will make the payload the wrong size. Finally, if v is in the length field and y is in the payload, pumping both up simultaneously will still make the payload the wrong length, because v is written in decimal (so each extra digit multiplies the declared length by roughly a factor of ten) while y only adds a fixed number of characters to the payload on each pump.
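This is also why parsers for such a format handle the blobs with a small context-sensitive step bolted onto an otherwise simple tokenizer: read the decimal count between the two # signs, then consume exactly that many characters verbatim. A rough sketch in Python (my own illustration of the question's grammar; the function name is made up, and the ? escapes and error handling are ignored):

def read_item(msg, i):
    """Read one item of msg starting at index i; return (item, next_index)."""
    if msg[i] == '#':
        # Context-sensitive step: parse the decimal length between the # signs ...
        j = msg.index('#', i + 1)
        length = int(msg[i + 1:j])
        # ... then take exactly that many characters, whatever they are.
        return msg[j + 1:j + 1 + length], j + 1 + length
    # Ordinary item: read up to the next delimiter.
    j = i
    while j < len(msg) and msg[j] not in "+:'":
        j += 1
    return msg[i:j], j

msg = "FirstSegment+Elem1+Item1:Item2+#4#:'+#+The_last_four_chars_are_binary+Elem4'SecondSegment+Elem5'"
print(read_item(msg, msg.index('#')))  # (":'+#", 38) -- the four binary characters are consumed verbatim

That counting-and-consuming step is exactly what the pumping argument above shows a context-free grammar cannot express.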
Hope this helps!
For working with countable sets I have to define a coding function for all finite subsets of N (the natural numbers). How can I do this?
I started by finding a function for all natural numbers: f(n) = 1 + 2 + ... + (n-1) + n. But how can I express a coding function for all possible subsets of f? And how can I say that f contains all finite natural numbers? I cannot say n = infinity - 1, because infinity - 1 is still infinity. Is there a formal way to denote all finite natural numbers?
If I understand you correctly, you wish to define a function that would count through all finite subsets of N. One way to achieve this is to use the 1s in the binary representation of a number n to encode the elements of f(n), that is
f(n) = { k ∈ N | the k-th binary digit of n is 1 }.
In programming terms, say for instance in Python (here I'm using lists to represent subsets of N) this would look like
def f(n):
    result = []
    k = 1
    while n != 0:
        if n % 2 == 1:
            result.append(k)
        k += 1
        n //= 2
    return result
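For example, a quick sanity check of the function above:

print(f(0))   # []        -- n = 0 encodes the empty set
print(f(11))  # [1, 2, 4] -- 11 is 1011 in binary, so the 1st, 2nd and 4th binary digits are 1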
I've used the Facebook feature to download all my data. The resulting zip file contains meta information in JSON files. The problem is that unicode characters in strings in these JSON files are escaped in a weird way.
Here's an example of such a string:
"nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
When I try to parse the string, for example with JavaScript's JSON.parse(), and print it out I get:
"nejnižšà bod: 0 mnm Benátky\n"
While it should be
"nejnižší bod: 0 mnm Benátky\n"
I can see that \u00c5\u00be should somehow correspond to ž but I can't figure out the general pattern.
I've been able to figure out these characters so far:
'\u00c2\u00b0' : '°',
'\u00c3\u0081' : 'Á',
'\u00c3\u00a1' : 'á',
'\u00c3\u0089' : 'É',
'\u00c3\u00a9' : 'é',
'\u00c3\u00ad' : 'í',
'\u00c3\u00ba' : 'ú',
'\u00c3\u00bd' : 'ý',
'\u00c4\u008c' : 'Č',
'\u00c4\u008d' : 'č',
'\u00c4\u008f' : 'ď',
'\u00c4\u009b' : 'ě',
'\u00c5\u0098' : 'Ř',
'\u00c5\u0099' : 'ř',
'\u00c5\u00a0' : 'Š',
'\u00c5\u00a1' : 'š',
'\u00c5\u00af' : 'ů',
'\u00c5\u00be' : 'ž',
So what is this weird encoding? Is there any known tool that can correctly decode it?
The encoding is valid UTF-8. The problem is that JavaScript doesn't use UTF-8, it uses UTF-16. So you have to convert from the valid UTF-8 to JavaScript's UTF-16:
function decode(s) {
let d = new TextDecoder;
let a = s.split('').map(r => r.charCodeAt());
return d.decode(new Uint8Array(a));
}
let s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n";
s = decode(s);
console.log(s);
https://developer.mozilla.org/docs/Web/API/TextDecoder
You can use a regular expression to find runs of these mis-decoded characters, encode them back to bytes as Latin-1 and then decode those bytes as UTF-8.
The following code should work in Python 3.x:
import re
re.sub(r'[\xc2-\xf4][\x80-\xbf]+',lambda m: m.group(0).encode('latin1').decode('utf8'), s)
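For example, applied to the string from the question (here s is just that sample literal; in practice it would come from the parsed JSON):

import re

s = "nejni\u00c5\u00be\u00c5\u00a1\u00c3\u00ad bod: 0 mnm Ben\u00c3\u00a1tky\n"
fixed = re.sub(r'[\xc2-\xf4][\x80-\xbf]+',
               lambda m: m.group(0).encode('latin1').decode('utf8'), s)
print(fixed)  # nejnižší bod: 0 mnm Benátky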
The JSON file itself is UTF-8, but the strings inside it are UTF-8 byte sequences in which every byte has been written out as a \u00XX escape sequence.
This command fixes a file like this in Emacs:
(defun k/format-facebook-backup ()
  "Normalize a Facebook backup JSON file."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    (let ((inhibit-read-only t)
          (size (point-max))
          bounds str)
      (while (search-forward "\"\\u" nil t)
        (message "%.f%%" (* 100 (/ (point) size 1.0)))
        (setq bounds (bounds-of-thing-at-point 'string))
        (when bounds
          (setq str (--> (json-parse-string (buffer-substring (car bounds)
                                                              (cdr bounds)))
                         (string-to-list it)
                         (apply #'unibyte-string it)
                         (decode-coding-string it 'utf-8)))
          (setf (buffer-substring (car bounds) (cdr bounds))
                (json-serialize str))))))
  (save-buffer))
Thanks to Jen's excellent question and Shawn's comment.
Basically Facebook seems to take each individual byte of the UTF-8 representation of the string and export it to JSON as if each byte were an individual Unicode code point.
What we need to do is take the last two characters of each six-character escape (e.g. c3 from \u00c3), treat them as bytes, concatenate those bytes and decode the result as a UTF-8 string.
This is how I do it in Ruby (see gist):
require 'json'
require 'uri'
bytes_re = /((?:\\\\)+|[^\\])(?:\\u[0-9a-f]{4})+/
txt = File.read('export.json').gsub(bytes_re) do |bad_unicode|
  $1 + eval(%Q{"#{bad_unicode[$1.size..-1].gsub('\u00', '\x')}"}).to_json[1...-1]
end
good_data = JSON.load(txt)
With bytes_re we catch all sequences of bad Unicode characters.
Then, for each sequence, we replace '\u00' with '\x' (e.g. \xc3), put quotes around it and use Ruby's built-in string parsing so that the \xc3\xbe... sequences are converted to actual bytes; these later either remain as Unicode characters in the JSON or are properly escaped by the #to_json method.
The [1...-1] removes the quotes inserted by #to_json.
I wanted to explain the code because the question is not Ruby-specific and readers may be using another language.
I guess somebody can do it with a sufficiently ugly sed command..
Just adding the general rule for how to get from something like '\u00c5\u0098' to 'Ř'. Putting together the last two letters from the \u parts gives you c5 and 98, which are the two bytes of the UTF-8 representation. For characters in this range, UTF-8 encodes the code point in two bytes like this: 110xxxxx 10xxxxxx, where the x's are the actual bits of the character code. You can take the two bytes, use & to mask out the x parts, put them one after the other and read that as a number, and you get 0x158, which is the code point for 'Ř'.
My JavaScript implementation:
function fixEncoding(s) {
  // Matches two consecutive \u00xx escapes in the raw, still-escaped JSON text.
  var reg = /\\u00([a-f0-9]{2})\\u00([a-f0-9]{2})/gi;
  return s.replace(reg, function(a, m1, m2) {
    var b1 = parseInt(m1, 16);
    var b2 = parseInt(m2, 16);
    var maskedb1 = b1 & 0x1F;
    var maskedb2 = b2 & 0x3F;
    var result = (maskedb1 << 6) | maskedb2;
    return String.fromCharCode(result);
  });
}
Imagine I have a custom type and two functions:
type MyType = Int -> Bool
f1 :: MyType -> Int
f3 :: MyType -> MyType -> MyType
I tried to pattern match as follows:
f1 (f3 a b i) = 1
But it failed with the error: Parse error in pattern: f1. What is the proper way to do the above? Basically, I want to know how many f3s there are (as a and b may be f3 or some other functions).
You can't pattern match on a function. For (almost) any given function, there are an infinite number of ways to define the same function. And it turns out to be mathematically impossible for a computer to always be able to say whether a given definition expresses the same function as another definition. This also means that Haskell would be unable to reliably tell whether a function matches a pattern; so the language simply doesn't allow it.
A pattern must be either a single variable or a constructor applied to some other patterns. Remembering that constructors start with upper-case letters and variables start with lower-case letters, your pattern f3 a b i is invalid; the "head" of the pattern, f3, is a variable, but it's also applied to a, b, and i. That's the error message you're getting.
Since functions don't have constructors, it follows that the only pattern that can match a function is a single variable; that matches all functions (of the right type to be passed to the pattern, anyway). That's how Haskell enforces the "no pattern matching against functions" rule. Basically, in a higher order function there's no way to tell anything at all about the function you've been given except to apply it to something and see what it does.
The function f1 has type MyType -> Int. This is equivalent to (Int -> Bool) -> Int. So it takes a single function argument of type Int -> Bool. I would expect an equation for f1 to look like:
f1 f = ...
You don't need to "check" whether it's an Int -> Bool function by pattern matching; the type guarantees that it will be.
You can't tell which one it is; but that's generally the whole point of taking a function as an argument (so that the caller can pick any function they like knowing that you'll use them all the same way).
I'm not sure what you mean by "I want to know how many f3 is there". f1 always receives a single function, and f3 is not a function of the right type to be passed to f1 at all (it's a MyType -> MyType -> MyType, not a MyType).
Once a function has been applied, its syntactic form is lost. There is no way, should I hand you 2 + 3, to distinguish what you get from just 5. It could have arisen from 2 + 3, or 3 + 2, or the mere constant 5.
If you need to capture syntactic structure then you need to work with syntactic structure.
data Exp = I Int | Plus Exp Exp
justFive :: Exp
justFive = I 5
twoPlusThree :: Exp
twoPlusThree = I 2 `Plus` I 3
threePlusTwo :: Exp
threePlusTwo = I 3 `Plus` I 2
Here the data type Exp captures numeric expressions and we can pattern match upon them:
isTwoPlusThree :: Exp -> Bool
isTwoPlusThree (Plus (I 2) (I 3)) = True
isTwoPlusThree _ = False
But wait, why am I distinguishing between "constructors" which I can pattern match on and.... "other syntax" which I cannot?
Essentially, constructors are inert. The behavior of Plus x y is... to do nothing at all, to merely remain as a box with two slots called "Plus _ _" and plug the two slots with the values represented by x and y.
On the other hand, function application is the furthest thing from inert! When you apply a function (\x -> ...) to an expression, that function replaces the x's within its body with the applied value. This dynamic reduction behavior means that there is no way to get hold of "function applications". They vanish into thin air as soon as you look at them.
Following is a minimal example of an observation (that kind of astonished me):
type Vector = V of float*float
// complete unfolding of type is OK
let projX (V (a,_)) = a
// also works
let projX' x =
match x with
| V (a, _) -> a
// BUT:
// partial unfolding is not Ok
let projX'' (V x) = fst x
// consequently also doesn't work
let projX''' x =
match x with
| V y -> fst y
What is the reason that makes it impossible to match against a partially deconstructed type?
Some partial deconstructions seem to be ok:
// Works
let f (x,y) = fst y
EDIT:
Ok, I now understand the "technical" reason for the behavior described (thanks for your answers & comments). However, I think that, language-wise, this behavior feels a bit "unnatural" compared to the rest of the language:
"Algebraically", to me, it seems strange to distinguish a type "t" from the type "(t)". Brackets (in this context) are used for giving precedence, e.g. in "(t * s) * r" vs "t * (s * r)". Also fsi answers accordingly: whether I send
type Vector = (int * int)
or
type Vector = int * int
to fsi, the answer is always
type Vector = int * int
Given those observations, one concludes that "int * int" and "(int * int)" denote exactly the same types and thus that all occurrences of one could in any piece of code be replaced with the other (ref. transparency)... which as we have seen is not true.
Further, it seems significant that in order to explain the behavior at hand, we had to resort to talking about how some code looks after compilation, rather than about semantic properties of the language, which imo indicates that there is some tension between the language semantics and what the compiler actually does.
In F#
type Vector = V of float*float
is just a degenerate union (you can see that by hovering over it in Visual Studio), so it's equivalent to:
type Vector =
| V of float*float
The part after of creates two anonymous fields (as described in the F# reference) and a constructor accepting two parameters of type float.
If you define
type Vector2 =
| V2 of (float*float)
there's only one anonymous field which is a tuple of floats and a constructor with a single parameter. As it was pointed out in the comment, you can use Vector2 to do desired pattern matching.
After all of that, it may seem illogical that the following code works:
let argsTuple = (1., 1.)
let v1 = V argsTuple
However, if you take into account that there's a hidden pattern matching, everything should be clear.
EDIT:
The F# language spec (p. 122) states clearly that parentheses matter in union definitions:
Parentheses are significant in union definitions. Thus, the following two definitions differ:
type CType = C of int * int
type CType = C of (int * int)
The lack of parentheses in the first example indicates that the union case takes two arguments. The parentheses
in the second example indicate that the union case takes one argument that is a first-class tuple value.
I think that such behavior is consistent with the fact that you can define more complex patterns at the definition of a union, e.g.:
type Move =
| M of (int * int) * (int * int)
Being able to use a union case with multiple arguments also makes a lot of sense, especially in interop situations, where using tuples is cumbersome.
The other thing that you used:
type Vector = int * int
is a type abbreviation, which simply gives a name to a certain type. Placing parentheses around int * int does not make a difference, because those parentheses are treated as grouping parentheses.
I've another issue with my Haskell. I'm given the following data declaration from a problem,
type Point = (Int, Int)
data Points = Lines Int Int
            | Columns Int Int
            | Union Points Points
            | Intersection Points Points
It's about points on a grid starting (0,0) and (x,y) has x as the horizontal distance from the origin and y as the vertical distance from the origin.
I tried to define a function "Lines" from this which, given Lines x y, would evaluate to all points on the grid with vertical distance between x and y.
e.g.
> Lines 2 4
(0,2)(1,2)(2,2)(3,2)....
(0,3)(1,3)(2,3)(3,3)....
(0,4)(1,4)(2,4)(3,4)....
and so on. Well what I did, was,
Lines :: Int -> Int -> Points
Lines lo hi = [ (_, y) | lo <= y && y <= hi ]
But Haskell complains that;
Invalid type signature Lines :: Int -> Int -> Points.
Should be of the form ::
What does this mean? "Points" is defined above already... surely "Int" and "Points" are regarded as "types"? I don't see the problem, why is Haskell confused?
Function names must not start with a capital letter. So you need to use lines, not Lines. This is probably the source of the error message you're seeing.
The syntax [ ... ] is for creating a list of results, but your type signature claims that the function returns Points, which isn't any kind of list. If you meant to return a list of Point values, that's the [Point] type.
I have literally no idea what your implementation of Lines is even trying to do. The syntax makes no sense to me.
OK, so taking your comments into account...
You can generate a list of numbers between lo and hi by writing [lo .. hi].
You say an "arbitrary" value can go in X, but you need to pin down exactly what that means. Your example seems to suggest you want the numbers from 0 upwards, forever. The way to generate that list is [0 .. ]. (Not giving an upper limit makes the list endless.)
Your example suggests you want a list of lists, with the inner list containing all points with the same Y-coordinate paired with all possible X-coordinates.
So here is one possible way to do that:
type Point = (Int, Int)
lines :: Int -> Int -> [[Point]]
lines lo hi = [ [(x,y) | x <- [0..]] | y <- [lo .. hi] ]
That's perhaps a teeny bit hard to read, with all those opening and closing brackets, so perhaps I can make it slightly cleaner:
lines lo hi =
  let
    xs = [0..]
    ys = [lo .. hi]
  in [ [(x,y) | x <- xs] | y <- ys]
If you run this, you get
> lines 2 4
[[(0,2), (1,2), (2,2), ...],
[(0,3), (1,3), (2,3), ...],
[(0,4), (1,4), (2,4), ...]]
In other words, the outer list has 3 elements (Y=2, Y=3 and Y=4), and each of the three inner lists is infinitely long (every possible positive X value).