random line in file - language-agnostic

This question was given to me during an interview. The interview is long over, but I'm still thinking about the problem and it's bugging me:
You have a language that contains the following tools: a rand() function, while and for loops, if statements, and a readline() method (similar to Python's readline()). Given these tools, write an algorithm that returns a random line in the file. You don't know the size of the file, and you can only loop over the file's contents once.

I don't know the desired answer, but my solution would be the following:
chosen_line = ""
lines = 0
while (current_line = readline()):
    if (rand(0, lines) == 0):
        chosen_line = current_line
    lines++
return chosen_line
Edit: A good explanation why this works was posted in this comment.
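For anyone who wants to try the idea, here is a minimal Python sketch of the same single-pass selection. It assumes rand(0, lines) is inclusive on both ends, so the k-th line read replaces the current choice with probability 1/k. The names random_line and rng are mine, not part of the interview question, and the in-memory "file" is only a stand-in for a real one.

```python
import io
import random
from collections import Counter


def random_line(stream, rng=random):
    """Return one line chosen uniformly at random in a single pass."""
    chosen = None
    for seen, line in enumerate(stream, start=1):
        # Keep the k-th line with probability 1/k.
        if rng.randrange(seen) == 0:
            chosen = line
    return chosen


# Rough empirical check on a hypothetical four-line "file".
rng = random.Random(0)
counts = Counter(
    random_line(io.StringIO("a\nb\nc\nd\n"), rng) for _ in range(20000)
)
print(counts)
```

Line k of n is picked with probability 1/k and then has to survive every later draw, which happens with probability k/(k+1) * (k+1)/(k+2) * ... * (n-1)/n = k/n, so each line ends up chosen with probability 1/n.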

One method, guaranteeing a uniform distribution:
(1) Read the file line by line into an array (or similar, e.g. a Python list).
(2) Use rand() to select a number between 0 and the largest index in the array.
Another, not guaranteeing a uniform distribution:
Read each line. On each read, also call rand(). If the result is over a threshold, return the line.

Although similar to Marcin's third option, Luc's implementation always returns the first line, while parsing the whole file.
It should be something like:
chosen_line = ""
threshold = 90
max = 100
while chosen_line == "":
    current_line = readline()
    if (rand(0, max) > threshold):
        chosen_line = current_line
print chosen_line
If you reach the end of the file without having chosen a line, you could also return current_line.

Lua - Match pattern for CSV import to array, that factors in empty values (two commas next to each other)

I have been using the following Lua code for a while to do simple CSV-to-array conversions. Previously every column had a value, but this time I have a CSV-formatted bank statement with empty values, which the code does not handle.
Here’s an example CSV, with debits and credits.
Transaction Date,Transaction Type,Sort Code,Account Number,Transaction Description,Debit Amount,Credit Amount,Balance
05/04/2022,DD,'11-70-79,6033606,Refund,,10.00,159.57
05/04/2022,DEB,'11-70-79,6033606,Henry Ltd,30.00,,149.57
05/04/2022,SO,'11-70-79,6033606,NEIL PARKS,20.00,,179.57
01/04/2022,FPO,'11-70-79,6033606,MORTON GREEN,336.00,,199.57
01/04/2022,DD,'11-70-79,6033606,WORK SALARY,,100.00,435.57
01/04/2022,DD,'11-70-79,6033606,MERE BC,183.63,,535.57
01/04/2022,DD,'11-70-79,6033606,ABC LIFE,54.39,,719.20
I’ve tried different patterns (https://www.lua.org/pil/20.2.html), but none seem to work. I’m beginning to think I can’t fix this via the pattern, as that would break how it works for the rest. I’d appreciate it if anyone could share how they would approach this…
local csvfilename = "/mnt/nas/Fireflyiii.csv"
local MATCH_PATTERN = "[^,]+"
local function create_array_from_file(csvfilename)
    local file = assert(io.open(csvfilename, "r"))
    local arr = {}
    for line in file:lines() do
        local row = {}
        for match in string.gmatch(line, MATCH_PATTERN) do
            table.insert(row, match)
        end
        table.insert(arr, row)
    end
    return arr
end

Need help rounding Mysql results that are returned from a python function

I am relatively new to Python and, in my spare time, I am working on a program that automatically generates a sales sheet. It has several functions that pull the necessary data from a database, and it uses reportlab and a few other tools to place the results onto the generated PDF. I am trying to round the results coming from the MySQL server. However, I am stuck: every way I have tried to round the results throws an error. I would like a few examples to look at so I can see how this would work, plus any relevant feedback that would help me learn.
I have tried to use the MySQL ROUND function to round the results, but that failed. I have also tried to round the results inside the function that generates the unit cost itself, but that failed as well.
A large amount of the code has been deleted due to the security hole it would generate. Code provided is to show what I have done so far. Print result line is to verify that the code is working during development. It is not throwing any erroneous results and will be removed during the last stage of the project.
def upcpsfunc(self, upc):
    mycursor = self.mydb.cursor()
    command = "Select Packsize from name where UPC = %(Upc)s"
    mycursor.execute(command, {'Upc': upc})
    result = mycursor.fetchone()
    print(result[0])
    return result[0]
def unitcost(self, upc):
    # function to generate unit cost
    mycursor = self.mydb.cursor()
    # the query is split over two adjacent string literals
    command = ("Select Concat((Cost - Allow)/Packsize) as total from name "
               "where UPC = %(Upc)s")
    mycursor.execute(command, {'Upc': upc})
    result = mycursor.fetchone()
    print(result[0])
    return result[0]
As for the expected results, I would prefer that the MySQL command round the results before they are sent to Reportlab for placement. So far the results have 4 or 5 digits, which is not ideal. I want the results to have two decimal places, since they are money. The desired output is 7.50 instead of 7.5025.
The round function can be used to round numbers:
>>> round(7.5025, 2)
7.5
To get the extra 0 on the end, you can use the following code:
>>> def round_money(n):
	s = str(round(n, 2))
	if "." not in s: # exact dollar amount passed as an int
		return s + ".00"
	elif len(s.split(".")[1]) == 1: # only one decimal digit, e.g. 7.5
		return s + "0"
	return s
>>> round_money(6)
'6.00'
>>> round_money(7.5025)
'7.50'
Note that this function returns a string, because a Python float does not keep trailing zeros: 7.50 is displayed as 7.5.
As an alternative to the approach already posted, you can do the same thing with string formatting (it rounds to the requested number of decimals itself, so no separate round() call is needed):
>>> '{:,.2f}'.format(0)
'0.00'
>>> '{:,.2f}'.format(15342.62412)
'15,342.62'
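Since the values are money, another option is to do the rounding with the decimal module, which keeps the trailing zero and lets you pick the rounding mode. The sketch below is only an illustration: format_money and UNIT_COST_SQL are names I made up, and the query text is an untested guess at how ROUND(..., 2) could replace the Concat(...) call in the question.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_money(value):
    """Return value as a string with exactly two decimal places."""
    # Go through str() so a float such as 7.5025 is read as written.
    amount = Decimal(str(value))
    return str(amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


# Hypothetical query text: ROUND(..., 2) replaces the Concat(...) call.
UNIT_COST_SQL = (
    "Select Round((Cost - Allow)/Packsize, 2) as total from name "
    "where UPC = %(Upc)s"
)

print(format_money(7.5025))
print(format_money(6))
print(format_money(Decimal("17.5")))
```

Formatting on the Python side works whether the driver hands back a float, an int, or a Decimal, so it can be used even if the query itself is left unchanged.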

How to debug/dump Go variable while building with cgo?

I'm trying to write a MySQL UDF in Go with cgo. I have a basic one working, but there are bits and pieces that I can't figure out because I have no idea what some of the C variables are in terms of Go.
This is an example that I have written in C that forces the type of one of the MySQL parameters to an int:
my_bool unhex_sha3_init(UDF_INIT *initid, UDF_ARGS *args, char *message) {
    if (args->arg_count != 2) {
        strcpy(message, "`unhex_sha3`() requires 2 parameters: the message part, and the bits");
        return 1;
    }
    args->arg_type[1] = INT_RESULT;
    initid->maybe_null = 1; //can return null
    return 0;
}
And that works fine, but then I try to do the same/similar thing with this other function in Go like this
//export get_url_param_init
func get_url_param_init(initid *C.UDF_INIT, args *C.UDF_ARGS, message *C.char) C.my_bool {
	if args.arg_count != 2 {
		message = C.CString("`get_url_param` require 2 parameters: the URL string and the param name")
		return 1
	}
	(*args.arg_type)[0] = C.STRING_RESULT
	(*args.arg_type)[1] = C.STRING_RESULT
	initid.maybe_null = 1
	return 0
}
With this build error
./main.go:24: invalid operation: (*args.arg_type)[0] (type uint32 does not support indexing)
And I'm not totally sure what that means. Shouldn't this be a slice of some sort, not a uint32?
And this is where it'd be super helpful to have some way of dumping the args struct somewhere (ideally even in Go syntax) so that I can tell what I'm working with.
Well I used spew to dump the variable contents to a tmp file inside the init function (commenting out the lines that made it not compile) and I got this
(string) (len=3) "%#v"
(*main._Ctype_struct_st_udf_args)(0x7ff318006af8)({
 arg_count: (main._Ctype_uint) 2,
 _: ([4]uint8) (len=4 cap=4) {
  00000000 00 00 00 00 |....|
 },
 arg_type: (*uint32)(0x7ff318006d18)(0),
 args: (**main._Ctype_char)(0x7ff318006d20->0x7ff3180251b0)(0),
 lengths: (*main._Ctype_ulong)(0x7ff318006d30)(0),
 maybe_null: (*main._Ctype_char)(0x7ff318006d40)(0),
 attributes: (**main._Ctype_char)(0x7ff318006d58->0x7ff318006b88)(39),
 attribute_lengths: (*main._Ctype_ulong)(0x7ff318006d68)(2),
 extension: (unsafe.Pointer) <nil>
})
Alright, so with huge help from @JimB, who stuck with me even though I'm clearly less adept with Go (and especially cgo), I've got a working version of my UDF. It is an easy, straightforward (and fast) function that pulls a single parameter out of a URL string and decodes it correctly (e.g. %20 gets returned as a space, basically how you would expect it to work).
This seemed incredibly tricky with a pure C UDF because I don't know C as well as I know other languages, and there's a lot that can go wrong with URL parsing and URL parameter decoding. Native MySQL functions are slow (and there's not really a good, clean way to do the decoding either), so Go seemed like the better-than-perfect candidate for this kind of problem: strong performance, ease of writing, and a wide variety of easy-to-use built-ins and third-party libraries.
The full UDF and its installation/usage instructions are here: https://github.com/StirlingMarketingGroup/mysql-get-url-param/blob/master/main.go
First problem was debugging output. And I did that by Fprintfing to a tmp file instead of the standard output, so that I could check the file to see variable dumps.
t, err := ioutil.TempFile(os.TempDir(), "get-url-param")
fmt.Fprintf(t, "%#v\n", args.arg_type)
And then after I got my output (I was expecting args.arg_type to be an array like it is in C, but instead it was a pointer to the first element of the C array), I needed to view the data behind that pointer as a Go array so I could set its values.
// Keep a pointer to the array: dereferencing it here would copy the array,
// and the writes below would never reach the C memory.
argsTypes := (*[2]uint32)(unsafe.Pointer(args.arg_type))
argsTypes[0] = C.STRING_RESULT
argsTypes[1] = C.STRING_RESULT

EOF Error During Dict Slice

I am trying to compile monthly data into an existing JSON file that I loaded via import json. Initially, my JSON data had just one property, 'name':
json_data['features'][1]['properties']
>>{'name':'John'}
But the end result with the monthly data I want is like this:
json_data['features'][1]['properties']
>>{'name':'John',
'2016-01': {'x1':0, 'x2':0, 'x3':1, 'x4':0},
'2016-02': {'x1':1, 'x2':0, 'x3':1, 'x4':0}, ... }
My monthly data are on separate tsv files. They have this format:
John 0 0 1 0
Jane 1 1 1 0
so I loaded them via import csv and parsed through a list of urls and set about placing them in a collective dictionary like so:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for i in file_strings:
    with open(i) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
I checked how things turned out by slicing collective_dict like so:
collective_dict['2016-01']['John'][0]
>>'0'
Which is correct; it just needs to be cast into an integer.
For my next feat, I attempted to assign all of the monthly data to the respective json members as part of their external properties:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
Here I got an arrow pointing at the last few characters:
Syntax Error: unexpected EOF while parsing
It is a pretty complicated slice, so I suppose user error is not to be ruled out. However, I did double and triple check things. I also looked up this error, and it seems to come up with input()-related calls. I'm left a bit confused: I don't see how I made a mistake (although I'm already mentally prepared to accept that I did).
My only guess was that something somewhere was not a string. When I checked collective_dict and json_data, everything that was supposed to be a string was a string ('John', 'Jane' et al.). So, I guess it's something else.
I made the problem as simple as I could while keeping the original structure of the data and for loops and so forth. I'm using Python 3.6.
Question
Why am I getting the EOF error? How can I build my external properties data without encountering such an error?
Here I have rewritten your last code block to:
for i in file_strings:
    file_name = i[:-4]
    for j in range(len(json_data['features'])):
        name = json_data['features'][j]['properties']['name']
        file_dict = json_data['features'][j]['properties'][file_name] = {}
        for x in range(4):
            x_string = 'x{}'.format(x+1)
            file_dict[x_string] = int(collective_dict[file_name][name][x])
from:
for i in file_strings:
    for j in range(len(json_data['features'])):
        json_data['features'][j]['properties'][i[:-4]] = {}
        json_data['features'][j]['properties'][i[:-4]]['x1'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][0])
        json_data['features'][j]['properties'][i[:-4]]['x2'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][1])
        json_data['features'][j]['properties'][i[:-4]]['x3'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][2])
        json_data['features'][j]['properties'][i[:-4]]['x4'] = int(collective_dict[i[:-4]][json_data['features'][j]['properties']['name']][3])
That is just to make it a bit more readable, but that shouldn't change anything.
A thing I noticed in your other part of code is the following:
collective_dict[i[:-4]] = rows[0]:rows[1:5] for rows in tsv_object
The thing I refer to is the = rows[0]:rows[1:5] for rows in tsv_object part. In my IDE, that does not work. I'm not sure if that is a typo in your question or if it is actually in your code, but I imagine you want it to be
collective_dict[i[:-4]] = {rows[0]:rows[1:5] for rows in tsv_object}
or something like that. I'm not sure whether that could confuse the parser into thinking that there is an error at the end of the file.
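To make the braces version concrete, here is a small self-contained sketch. It reads from an in-memory string instead of your real files (sample_tsv is a made-up stand-in for 2016-01.tsv) and, as an extra step that is not in your code, casts the values to int while building the dictionary.

```python
import csv
import io

# Hypothetical in-memory stand-in for one monthly file such as 2016-01.tsv.
sample_tsv = "John\t0\t0\t1\t0\nJane\t1\t1\t1\t0\n"

collective_dict = {}
with io.StringIO(sample_tsv) as f:
    tsv_object = csv.reader(f, delimiter="\t")
    # Braces make this a dict comprehension: name -> four integer values.
    collective_dict["2016-01"] = {
        row[0]: [int(value) for value in row[1:5]] for row in tsv_object
    }

print(collective_dict["2016-01"]["John"])
```

With the values already stored as integers, the later int(...) calls in the assignment loop become unnecessary.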
The ValueError: Invalid literal for int()
If your tsv-data is
John 0 0 1 0
Jane 1 1 1 0
Then it should be no problem to do int() of the string value. E.g.: int('42') will become an int with value 42. However, if you have an error in one, or several, lines of your files, then use something like this block of code to figure out which file and line it is:
file_strings = ['2016-01.tsv', '2016-02.tsv', ... ]
collective_dict = {}
for file_name in file_strings:
    print('Reading {}'.format(file_name))
    with open(file_name) as f:
        tsv_object = csv.reader(f, delimiter='\t')
        for line_no, (name, *x_values) in enumerate(tsv_object):
            if len(x_values) != 4:
                print('On line {}, there are {} values instead of 4!'.format(line_no, len(x_values)))
            try:
                intx = [int(x) for x in x_values]
            except ValueError as e:
                # Catch "Invalid literal for int()"
                print('Line {}: {}'.format(line_no, e))

Confused about this nested function

I am reading the Python Cookbook 3rd Edition and came across the topic discussed in 2.6 "Searching and Replacing Case-Insensitive Text," where the authors discuss a nested function that is like below:
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
If I have some text like below:
text = 'UPPER PYTHON, lower python, Mixed Python'
and I print the value of 'text' before and after, the substitution happens correctly:
x = matchcase('snake')
print("Original Text:",text)
print("After regsub:", re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE))
The last "print" command shows that the substitution correctly happens but I am not sure how this nested function "gets" the:
PYTHON, python, Python
as the word that needs to be substituted with:
SNAKE, snake, Snake
How does the inner function replace get its value 'm'?
When matchcase('snake') is called, word takes the value 'snake'.
Not clear on what the value of 'm' is.
Can anyone help me understand this clearly, in this case?
Thanks.
When you pass a function as the second argument to re.sub, according to the documentation:
it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string.
The matchcase() function itself returns the replace() function, so when you do this:
re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
what happens is that matchcase('snake') returns replace, and then every non-overlapping occurrence of the pattern 'python' as a match object is passed to the replace function as the m argument. If this is confusing to you, don't worry; it is just generally confusing.
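If it helps, you can watch this happen by recording what replace receives. The sketch below is the cookbook function with one extra line that appends each matched text to a list; the seen list is my addition for illustration only.

```python
import re

seen = []  # what re.sub handed to the callback, in call order


def matchcase(word):
    def replace(m):
        # m is the re.Match object for the current occurrence.
        text = m.group()
        seen.append(text)
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        return word
    return replace


text = 'UPPER PYTHON, lower python, Mixed Python'
result = re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
print(seen)
print(result)
```

The list ends up holding PYTHON, python and Python in that order, one entry per call that re.sub made to replace, while word stayed 'snake' the whole time.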
Here is an interactive session with a much simpler nested function that should make things clearer:
In [1]: def foo(outer_arg):
   ...:     def bar(inner_arg):
   ...:         print(outer_arg + inner_arg)
   ...:     return bar
   ...:
In [2]: f = foo('hello')
In [3]: f('world')
helloworld
So f = foo('hello') is assigning a function that looks like the one below to a variable f:
def bar(inner_arg):
    print('hello' + inner_arg)
f can then be called like this f('world'), which is like calling bar('world'). I hope that makes things clearer.