ASCII number pattern matching in a binary file in Python 2

I'm trying to read a binary file and match number(s) in each of its records.
If the number matches then the record is to be copied to another file.
The number should be present between the 24th and 36th byte of each record.
The script takes the numbers as arguments. Here's the script I'm using:
#!/usr/bin/env python
# search.py
import sys
import re
import glob
import os
import binascii

list = sys.argv[1:]
list.sort()
rec_len = 452
filelist = glob.glob(os.getcwd() + '*.bin')
print('Input File : %s' % filelist)
for file in filelist:
    outfile = file + '.out'
    f = open(file, "rb")
    g = open(outfile, "wb")
    for pattern in list:
        print pattern
        regex_search = re.compile(pattern).search
        while True:
            buf = f.read(rec_len)
            if len(buf) == 0:
                break
            else:
                match = regex_search(buf)
                match2 = buf.find(pattern)
                # print match
                # print match2
                if ((match2 != -1) | (match != None)):
                    g.write(buf)
    f.close()
    g.close()
print("Done")
I'm running it like:
python search.py 1234 56789
I'm using Python 2.6.
The code is not matching the number.
I also tried using binascii to convert the number to binary before matching but even then it didn't return any record.
If I give a string it works correctly, but if I give a number as the argument it doesn't match.
Where am I going wrong?

You are exhausting the file by reading all of its bytes while checking for the first pattern. Hence, the second pattern will never be matched (an attempt will not even be made), because you have already reached the end of the file by reading every record during the for-loop pass for the first pattern.
That means that if your first pattern is nowhere to be found in those records, your script will not give you any output.
Consider adding f.seek(0) at the end of the while loop, or changing the order of the two loop constructs so that you first read a record from the file and then match the regex for each of the patterns in the argument list.
Also, try not to shadow the Python builtins by using list as the name of an array. It won't cause problems in the code you've shown, but it's definitely something you should be aware of.
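For illustration, here is a minimal sketch of the restructured loops described above: each record is read exactly once and every pattern is tested against it. It is only a sketch of the suggestion, not the original script; it avoids shadowing the builtin list by naming the arguments patterns, and it joins the directory and glob pattern with os.path.join, since the plain concatenation in the original drops the path separator.
#!/usr/bin/env python
# sketch: read each 452-byte record once, then test every pattern against it
import sys
import glob
import os

patterns = sys.argv[1:]   # plain ASCII digit strings, e.g. '1234'
rec_len = 452

for fname in glob.glob(os.path.join(os.getcwd(), '*.bin')):
    with open(fname, 'rb') as f:
        with open(fname + '.out', 'wb') as g:
            while True:
                buf = f.read(rec_len)
                if not buf:
                    break
                # the number is expected between the 24th and 36th byte of each record
                if any(p in buf[23:36] for p in patterns):
                    g.write(buf)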

Related

Save json list in a text file

I have a JSON log file that I rearrange to be correct; after this I am trying to save the results to the same file. The results are a list, but the problem is that I am unable to save it, and I get the following error:
write() argument must be str, not list
Here is the code itself:
import regex as re
import re

f_name = 'test1.txt'
splitter = r'"Event\d+":{(.*?)}'  # a search pattern to capture the stuff in braces

# Open the file as Read.
with open(f_name, 'r') as src:
    data = src.readlines()

# tokenize the data source...
tokens = re.findall(splitter, str(data))
# print(tokens)

# now we can operate on the tokens and split them up into key-value pairs and put them into a list
result = []
for token in tokens:
    # make an empty dictionary to hold the row elements
    line_dict = {}
    # we can split the line (token) by comma to get the key-value pairs
    pairs = token.split(',')
    for pair in pairs:
        # another regex split needed here, because the timestamps have colons too
        splitter = r'"(.*)"\s*:\s*"(.*)"'  # capture two groups of things in quotes on opposite sides of colon
        parts = re.search(splitter, pair)
        key, value = parts.group(1), parts.group(2)
        line_dict[key] = value
    # add the dictionary of line elements to the result
    result.append(line_dict)

with open(f_name, 'w') as src:
    for line in result:
        src.write(result)
i.e. the code itself was not written by me -> Log file management with python (thanks AirSquid)
Thanks for the assistance, new at Python here.
I tried to import json and use json.dump, and also tried appending the text, but in most cases I end up with just [] or an empty file.
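For reference, the error means write() was handed the list object itself. A minimal sketch of serializing the result list (reusing f_name and result from the code above) would be:
import json

with open(f_name, 'w') as src:
    json.dump(result, src, indent=2)
    # or, one JSON object per line:
    # for line_dict in result:
    #     src.write(json.dumps(line_dict) + '\n')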

Python Spark- How to output empty DataFrame to csv file (Only output header)?

I want to output an empty DataFrame to a csv file. I use this code:
df.repartition(1).write.csv(path, sep='\t', header=True)
But because there is no data in the DataFrame, Spark won't output the header to the csv file.
Then I modified the code to:
if df.count() == 0:
    empty_data = [f.name for f in df.schema.fields]
    df = ss.createDataFrame([empty_data], df.schema)
    df.repartition(1).write.csv(path, sep='\t')
else:
    df.repartition(1).write.csv(path, sep='\t', header=True)
It works, but I want to ask whether there is a better way that avoids the count function.
df.count() == 0 will make your driver program retrieve the count of all your dataframe partitions across the executors.
In your case I would use len(df.take(1)) == 0 instead. Still slow, but preferable to a raw count().
Only header:
cols = '\t'.join(df.columns)
with open('./cols.csv', 'w') as f:
    f.write(cols)
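Putting the two pieces together, a minimal sketch (the output file name './cols.csv' is just an assumption) could be:
if len(df.take(1)) == 0:
    # empty DataFrame: write only the header line ourselves
    with open('./cols.csv', 'w') as f:
        f.write('\t'.join(df.columns) + '\n')
else:
    df.repartition(1).write.csv(path, sep='\t', header=True)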

Python 2.7: Load a JSON file search for a value, replace it, and save as new JSON

As mentioned in the title, I'm trying to make a simple Python script that can be run from the terminal to do the following:
Find all JSON files in current working directory and nested folders (this part works well)
Load said files
Recursively search them for a specific value or a substring
If the value is matching, replace it with a new established value by the user
Once finished, save all modified json files to a "converted" folder in the current directory.
That said, the issue appears when I try the recursive search method posted below. Since I'm pretty new to Python, I would appreciate any help with what I suppose is the problem: either the JSON files I'm using or the search method I'm employing.
Simplifying the issue: the value I search for never matches anything inside the object, be that a key or a plain string value. I've tried multiple methods to perform a recursive search but can't get a match.
For example, taking the sample JSON into account, I want to replace the value "selectable_parts" or "static_parts", or even deeper in the structure "1h_mod310_door_00", but it seems my method of searching can't reach this value in object[object][children][0][children][5][name] (hope this helps).
Sample JSON: (https://drive.google.com/open?id=0B2-Bn2b0ujjVdW5YVGg3REg3OWs)
"""KEYWORD REPLACING MODULE."""
import os
import json
# functions
def get_files():
"""lists files"""
exclude = set(['.vscode', 'sample'])
json_files = []
for root, dirs, files in os.walk(os.getcwd(), topdown=True):
dirs[:] = [d for d in dirs if d not in exclude]
for name in files:
if name.endswith('.json'):
json_files.append(os.path.join(root, name))
return json_files
def load_files(json_files):
"""works files"""
for js_file in json_files:
with open(js_file) as json_file:
loaded_json = json.load(json_file)
replace_key_value(loaded_json, os.path.basename(js_file))
def write_file(data_file, new_file_name):
"""writes the file"""
if not os.path.exists('converted'):
os.makedirs('converted')
with open('converted/' + new_file_name, 'w') as json_file:
json.dump(data_file, json_file)
def replace_key_value(js_file, js_file_name):
"""replace and initiate save"""
recursive_replace(js_file, SKEY, '')
# write_file(js_file, js_file_name)
def recursive_replace(data, match, repl):
"""search for needed value and replace its value"""
for key, value in data.items():
if value == match:
print data[key]
print "AHHHHHHHH"
elif isinstance(value, dict):
recursive_replace(value, match, repl)
# main
print "\n" + '- on ' + os.getcwd()
NEW_DIR = raw_input('Work dir (leave empty if current): ')
if not NEW_DIR:
print NEW_DIR
NEW_DIR = os.getcwd()
else:
print NEW_DIR
os.chdir(NEW_DIR)
# get_files()
JS_FILES = get_files()
print '- files on ' + os.getcwd()
# print "\n".join(JS_FILES)
SKEY = raw_input('Value to search: ')
RKEY = raw_input('Replacement value: ')
load_files(JS_FILES)
The issue was the way I navigated the JSON object: the method didn't consider whether a value was a dict or a list (I believe...).
So, to answer my own question, here's the recursive search I'm using to check the values:
def get_recursively(search_dict, field):
    """
    Takes a dict with nested lists and dicts,
    and searches all dicts for a key of the field
    provided.
    """
    fields_found = []
    for key, value in search_dict.iteritems():
        if key == field:
            print value
            fields_found.append(value)
        elif isinstance(value, dict):
            results = get_recursively(value, field)
            for result in results:
                if SEARCH_KEY in result:
                    fields_found.append(result)
        elif isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    more_results = get_recursively(item, field)
                    for another_result in more_results:
                        if SEARCH_KEY in another_result:
                            fields_found.append(another_result)
    return fields_found

# write_file(js_file, js_file_name)
Hope this helps someone.
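Building on that, a minimal sketch of the matching replace step, walking both dicts and lists (the name replace_recursively is my own, not from the original script), could look like:
def replace_recursively(data, match, repl):
    """Walk nested dicts and lists, replacing any value equal to match."""
    if isinstance(data, dict):
        for key, value in data.items():
            if value == match:
                data[key] = repl
            else:
                replace_recursively(value, match, repl)
    elif isinstance(data, list):
        for i, item in enumerate(data):
            if item == match:
                data[i] = repl
            else:
                replace_recursively(item, match, repl)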

Reading NaN values from .csv files with decode_csv()

I have a .csv file with integer values that can contain an NA value, which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws the following error later, on sess.run(_), on the 2nd iteration:
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret string "NA" while reading csv as NaN or similar value in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, and then converting to float:
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
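To actually pull rows out of this graph you still need a session with the queue runners started; a minimal TF 1.x style sketch, matching the code above, might be:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(float_row))  # one parsed row, with every "NA" replaced by -1
    coord.request_stop()
    coord.join(threads)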
But long term I plan to switch to TFRecords, because reading CSV is too slow for my needs.

Counting non blank and sum of length of lines in python

I'm trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program and the sum of the lengths of those lines. Here is my current attempt:
def code_metric(file_name):
    with open(file_name) as f:
        lines = f.read().splitlines()
        char_count = sum(map(len, (map(str.strip, filter(None, lines)))))
        return len(lines), char_count
I'm supposed to use the functionals map, filter, and reduce for this. I had asked the question previously and improved on my answer, but it's still giving me the wrong result. Here is the link to the previous version of the question:
Old program code
When I run the file cmtest.py, which has the following content:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
the result should be
(3,85)
but I keep getting:
(4,85)
Another file to be tested, for example collatz.py:
the result should be:
(73, 2856)
but I keep getting:
(59, 2796)
Here is a link to the collatz.py file:
Collatz.py file link
Can anyone help me correct the code? I'm fairly new to Python and any help would be great.
Try this:
def code_metric(file_name):
    with open(file_name) as f:
        lines = [line.rstrip() for line in f.readlines()]
        nonblanklines = [line for line in lines if line]
        return len(nonblanklines), sum(len(line) for line in nonblanklines)
Examples:
>>> code_metric('collatz.py')
(73, 2856)
>>> code_metric('cmtest.py')
(3, 85)
Discussion
I was able to achieve the desired result for collatz.py only by removing the trailing newline and trailing blanks off the end of the lines. That is done in this step:
lines = [line.rstrip() for line in f.readlines()]
The next step is to remove the blank lines:
nonblanklines = [line for line in lines if line]
We want to return the number of non-blank lines:
len(nonblanklines)
We also want to return the total number of characters on the non-blank lines:
sum(len(line) for line in nonblanklines)
Alternate Version for Large Files
This version does not require keeping the file in memory all at once:
def code_metric2(file_name):
    with open(file_name) as f:
        # iterate over the file lazily instead of reading every line up front
        lengths = [len(line) for line in (line.rstrip() for line in f) if line]
        return len(lengths), sum(lengths)
Alternate Version Using reduce
Python's creator, Guido van Rossum, wrote this about the reduce builtin:
So now reduce(). This is actually the one I've always hated most,
because, apart from a few examples involving + or *, almost every time
I see a reduce() call with a non-trivial function argument, I need to
grab pen and paper to diagram what's actually being fed into that
function before I understand what the reduce() is supposed to do. So
in my mind, the applicability of reduce() is pretty much limited to
associative operators, and in all other cases it's better to write out
the accumulation loop explicitly.
Accordingly, reduce is no longer a builtin in Python 3. For compatibility, though, it remains available in the functools module. The code below shows how reduce can be used for this particular problem:
from functools import reduce

def code_metric3(file_name):
    with open(file_name) as f:
        lengths = [len(line) for line in (line.rstrip() for line in f.readlines()) if line]
        return len(lengths), reduce(lambda x, y: x + y, lengths)
Here is yet another version which makes heavier use of reduce:
from functools import reduce

def code_metric4(file_name):
    def fn(prior, line):
        nlines, length = prior
        line = line.rstrip()
        if line:
            nlines += 1
            length += len(line)
        return nlines, length
    with open(file_name) as f:
        nlines, length = reduce(fn, f.readlines(), (0, 0))
        return nlines, length
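For a quick sanity check, the reduce-based version should report the same numbers as code_metric on the sample files above:
>>> code_metric4('cmtest.py')
(3, 85)
>>> code_metric4('collatz.py')
(73, 2856)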