Converting a ctypes.c_void_p() and ctypes.c_size_t() to bytearray or string? - ctypes

I can't seem to find any simple examples of converting a ctypes.c_void_p() to a string or byte array. Is there any simple one-liner that will do this?

Here you go:
import ctypes as ct
# set up some void pointers to valid string and byte array
data = b'some string\0with null in it'
data1 = ct.c_char_p(data)
size1 = ct.c_size_t(len(data))
data2 = (ct.c_ubyte * 8)(1, 2, 3, 4, 5, 6, 7, 8)
size2 = ct.c_size_t(8)
void1 = ct.cast(data1, ct.c_void_p) # void* to nul-terminated string
void2 = ct.cast(data2, ct.c_void_p) # void* to eight bytes of data
print(void1, size1)
print(void2, size2)
# cast void* to the data being pointed to, and retrieve its contents
s = ct.cast(void1, ct.POINTER(ct.c_char * size1.value)).contents
print(s)
print(s.value) # up to nul-termination
print(s.raw) # full array
# cast void* to the data being pointed to, and retrieve its contents
b = ct.cast(void2, ct.POINTER(ct.c_ubyte * size2.value)).contents
print(b)
print(b[2])
print(list(b))
Output:
c_void_p(1678408544912) c_ulonglong(27)
c_void_p(1678408467720) c_ulonglong(8)
<__main__.c_char_Array_27 object at 0x00000186C8F0CBC0>
b'some string'
b'some string\x00with null in it'
<__main__.c_ubyte_Array_8 object at 0x00000186C8F0C840>
3
[1, 2, 3, 4, 5, 6, 7, 8]

I did this:
data = bytearray(ctypes.string_at(data_ptr, data_size.value))
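For reference, a minimal sketch of that one-liner in context (reusing the void pointer and size from the answer above; the names data_ptr and data_size stand for whatever c_void_p and c_size_t your API hands back). ctypes.string_at(address, size) copies exactly size bytes starting at the address, so embedded NULs are preserved:
import ctypes as ct

data = b'some string\0with null in it'
data_ptr = ct.cast(ct.c_char_p(data), ct.c_void_p)   # pretend this came from a C API
data_size = ct.c_size_t(len(data))

raw = ct.string_at(data_ptr, data_size.value)   # bytes, including the embedded NUL
buf = bytearray(raw)                            # mutable copy, if a bytearray is needed
print(raw)   # b'some string\x00with null in it'
print(buf)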

Related

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a Python script that can read scanned and tabular .pdfs, extract some important data, and insert it into a JSON file to later be implemented into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is that I have never worked with JSON files before, but that was the format I was recommended to output to. The scraping script works, and while the pre-processing could be a lot cleaner, for now it does the job. The issue I run into is that the keys and values end up in the same list, and some of the values are split across two list items because they contain a decimal point. I'm not really sure where to even start.
I suppose that since I know what the indexes of the list are, I could easily assign keys and values, but then it may not be applicable to any .pdf; that is, the script cannot be coded explicitly.
import re
import PyPDF2 as pdf2
import textract

filename = "TestSpec.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# Fall back to OCR (textract + tesseract) for scanned, image-only PDFs
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")

def cleanText(x):
    '''
    This function takes the byte data extracted from scanned PDFs and cleans it of all
    unnecessary data.
    Requires re
    '''
    stringedText = str(x)
    cleanText = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', cleanText)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
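For illustration, a minimal sketch of the explicit-index approach mentioned above (the positions used here are assumptions based on the sample output); as noted, this does not generalize to arbitrary PDFs, but it shows the mechanics of turning the flat list into a dictionary that can then be dumped to JSON:
# Trimmed example taken from the cleaned output above
cleaned = ['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp']

record = {
    "Date": "".join(cleaned[0:3]).lstrip('n'),   # '21feb2019'
    "Sequence ID": "-".join(cleaned[4:6]),       # 'lacz-rp'
}
print(record)   # {'Date': '21feb2019', 'Sequence ID': 'lacz-rp'}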
Here is a screenshot of the data from my sample PDF.
So, I have figured out some of this. I am still having issues with grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working, I will worry about optimizing and condensing it.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json

# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()
# Checks if the extracted data is in string form or a picture; if a picture, textract reads the data.
# It then closes the pdf file.
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()
# Converts text to string from byte data for preprocessing
stringedText = str(text)
# Removes escaped newlines and replaces them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()
# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)
# Creating the PrimerData dictionary
with open("PrimerData.json", 'w') as file:
    primerDataSlice = clean[clean.index("molecular"): -1]
    primerData = re.split(": |\n", primerDataSlice)
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys, primerValues))}
    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)
# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall(r'(\d{2}[/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[/\- ]\d{2,4})', clean2)
Without the input data it is difficult to give you working code; a minimal working example with input would help. As for JSON handling, Python dictionaries can be dumped to JSON easily. See the examples here:
https://docs.python-guide.org/scenarios/json/
Get a JSON string from a dictionary and write it to a file, then figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
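A minimal sketch of that last step, writing the dictionary straight to a file (the file name here is hypothetical); json.dump serializes directly into the open file handle:
import json

d = {"Date": "21feb2019", "Sequence ID": "lacz-rp", "Sequence 5'-3'": "gat"}

with open("primer.json", "w") as f:    # hypothetical output path
    json.dump(d, f, ensure_ascii=False)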
So, I did figure this out. The problem was really that, because of the way my pre-processing worked, pulling all the data into a single list wasn't a great idea, considering that the keys for the dictionary never change.
Here is the semi-finished result for making the Dictionary and JSON file.
# Imports implied by the calls below (assumed: Biopython and datetime)
from datetime import date
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqUtils import MeltingTemp as mt
from Bio.SeqUtils import molecular_weight as mw

# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]
# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
    r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
    clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") + 10: clean2.index("Order No.") + 18]
# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])

def gc_content(x):
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
    return round((count / bases) * 100, 1)

gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") + 10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10
primerDict = {
    "Primer Data": {
        "Sequence": sequence,
        "Bases": bases,
        "TM (50mM NaCl)": tm,
        "% GC content": gc,
        "Molecular weight": moleWeight,
        "ug/OD260": dilWeight,
        "Dilution volume (uL)": dilution
    },
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": str(dateReceived.strftime("%d-%b-%Y"))
    }
}
# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

python pandas: assigning a json data to a data frame entry returns error "Incompatible indexer with Series"

As a newbie to Python, I'm struggling with the error "Incompatible indexer with Series".
I'm reading an entry from a PostgreSQL database:
df_postgresDB = pd.read_sql_query('SELECT * FROM public.json_view',con=<...>)
exampleKey = 'FPB-83160'
jsonCol = 'efforts'
AreasDict = df_postgresDB.loc[exampleKey, jsonCol]
print('AreasDict=', AreasDict)
print('type(AreasDict)=', type(AreasDict))
...output:
AreasDict= {'4G NeVe': 0, '4G FT ET': 400, '4G C-Plane': 800, 'MANO BTSSM': 0}
type(AreasDict)= <class 'dict'>
The column in the PostgreSQL database has type 'jsonb'.
This 'AreasDict' is used in a function from another project that I want to call and re-use for my project. But in my project, I need to build up the data from another source. So I create a data frame and try to assign that 'AreasDict'...
column_names = ['issue_key', jsonCol]
df = pd.DataFrame(index=range(1,2), columns=column_names)
df.iloc[0, 0] = exampleKey
df.iloc[0, 1] = AreasDict
... and with the last code line I get that error
ValueError: Incompatible indexer with Series
What am I doing wrong?
In pandas, non-scalar values are poorly supported - many functions will fail on them.
The solution is to wrap the dictionary in a list, i.e. assign a one-element list of dictionaries:
jsonCol = 'j'
exampleKey = 'key'
AreasDict= {'4G NeVe': 0, '4G FT ET': 400, '4G C-Plane': 800, 'MANO BTSSM': 0}
column_names = ['issue_key', jsonCol]
df = pd.DataFrame(index=range(1,2), columns=column_names)
df.iloc[0, 0] = exampleKey
df.iloc[0, 1] = [AreasDict]
print (df)
issue_key j
1 key [{'4G NeVe': 0, '4G FT ET': 400, '4G C-Plane':...
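For completeness, a small sketch (following the answer's own setup, so behaviour may vary with the pandas version) of how to get the dictionary back out of the cell afterwards, since it is now wrapped in a one-element list:
import pandas as pd

jsonCol = 'j'
AreasDict = {'4G NeVe': 0, '4G FT ET': 400, '4G C-Plane': 800, 'MANO BTSSM': 0}

df = pd.DataFrame(index=range(1, 2), columns=['issue_key', jsonCol])
df.iloc[0, 0] = 'key'
df.iloc[0, 1] = [AreasDict]      # wrap the dict in a list, as above

areas = df.iloc[0, 1][0]         # unwrap the one-element list to recover the dict
print(areas['4G FT ET'])         # 400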

Using Microsoft.FSharpLu to serialize JSON to a stream

I've been using the Newtonsoft.Json and Newtonsoft.Json.Fsharp libraries to create a new JSON serializer and stream to a file. I like the ability to stream to a file because I'm handling large files and, prior to streaming, often ran into memory issues.
I stream with a simple function:
open Newtonsoft.Json
open Newtonsoft.Json.FSharp
open System.IO
let writeToJson (path: string) (obj: 'a) : unit =
    let serialized = JsonConvert.SerializeObject(obj)
    let fileStream = new StreamWriter(path)
    let serializer = new JsonSerializer()
    serializer.Serialize(fileStream, obj)
    fileStream.Close()
This works great. My problem is that the JSON string is then absolutely cluttered with stuff I don't need. For example,
let m =
    [
        (1.0M, None)
        (2.0M, Some 3.0M)
        (4.0M, None)
    ]
let makeType (tup: decimal * decimal option) = {FieldA = fst tup; FieldB = snd tup}
let y = List.map makeType m
Default.serialize y
val it : string =
"[{"FieldA": 1.0},
{"FieldA": 2.0,
"FieldB": {
"Case": "Some",
"Fields": [3.0]
}},
{"FieldA": 4.0}]"
If this is written to a JSON and read into R, there are nested dataframes and any of the Fields associated with a Case end up being a list:
library(jsonlite)
library(dplyr)
q <- fromJSON("default.json")
x <-
q %>%
flatten()
x
> x
FieldA FieldB.Case FieldB.Fields
1 1 <NA> NULL
2 2 Some 3
3 4 <NA> NULL
> sapply(x, class)
FieldA FieldB.Case FieldB.Fields
"numeric" "character" "list"
I don't want to have to handle these things in R. I can do it but it's annoying and, if there are files with many, many columns, it's silly.
This morning, I started looking at the Microsoft.FSharpLu.Json documentation. This library has a Compact.serialize function. Quick tests suggest that this library will eliminate the need for nested dataframes and the lists associated with any Case and Field columns. For example:
Compact.serialize y
val it : string =
"[{
"FieldA": 1.0
},
{
"FieldA": 2.0,
"FieldB": 3.0
},
{
"FieldA": 4.0
}
]"
When this string is read into R,
q <- fromJSON("compact.json")
x <- q
x
> x
FieldA FieldB
1 1 NA
2 2 3
3 4 NA
> sapply(x, class)
FieldA FieldB
"numeric" "numeric
This is much simpler to handle in R. and I'd like to start using this library.
However, I don't know if I can get the Compact serializer to serialize to a stream. I see .serializeToFile, .deserializeStream, and .tryDeserializeStream, but nothing that can serialize to a stream. Does anyone know if Compact can handle writing to a stream? How can I make that work?
The helper to serialize to a stream is missing from the Compact module in FSharpLu.Json, but you should be able to do it by following the C# example from
http://www.newtonsoft.com/json/help/html/SerializingJSON.htm. Something along the lines of:
let writeToJson (path: string) (obj: 'a) : unit =
    let serializer = new JsonSerializer()
    serializer.Converters.Add(new Microsoft.FSharpLu.Json.CompactUnionJsonConverter())
    use sw = new StreamWriter(path)
    use writer = new JsonTextWriter(sw)
    serializer.Serialize(writer, obj)

How to decode a csv file with long lines in tensorflow with tf.decode_csv?

How can I decode a CSV file with long lines (e.g., with so many items per line that it is not realistic to list them one by one as outputs) using tf.TextLineReader() and tf.decode_csv?
The typical usage is:
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [1,1,1,1,1]
a,b,c,d,e = tf.decode_csv(records=value,record_defaults=record_defaults, field_delim=" ")
When we have thousands of items in a line, it's impossible to assign them one by one as (a, b, c, d, e) above. Can all the items be decoded into a list or something like that?
Let's say you have 1800 columns of data. You can use this as the record defaults:
record_defaults=[[1]]*1800
and then use
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
to read them.
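A minimal sketch of how this might look end to end (TF 1.x style, matching the question; the file name and the whitespace delimiter are assumptions). tf.stack packs the per-column scalars back into a single rank-1 tensor, so the 1800 values never have to be handled individually:
import tensorflow as tf

filename_queue = tf.train.string_input_producer(["data.csv"])   # hypothetical file
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)

record_defaults = [[1]] * 1800                    # one default per column
all_columns = tf.decode_csv(value, record_defaults=record_defaults, field_delim=" ")
row = tf.stack(all_columns)                       # shape (1800,) instead of 1800 scalars

# Queue runners must be started before evaluating, or reader.read will hang:
# with tf.Session() as sess:
#     coord = tf.train.Coordinator()
#     threads = tf.train.start_queue_runners(sess=sess, coord=coord)
#     print(sess.run(row))
#     coord.request_stop()
#     coord.join(threads)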
Well, tf.decode_csv returns a list, so you can simply do:
record_defaults = [[1], [1], [1], [1], [1]]
all_columns = tf.decode_csv(value, record_defaults=record_defaults)
all_columns
Out: [<tf.Tensor 'DecodeCSV:0' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:1' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:2' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:3' shape=() dtype=int32>,
<tf.Tensor 'DecodeCSV:4' shape=() dtype=int32>
]
You can then evaluate it as usual:
sess = tf.Session()
sess.run(all_columns)
Out: [1, 1, 1, 1, 1]
Note that you need to pass a rank-1 record_defaults (a list of one-element lists, as above). If you have problems with a hanging queue, that is usually a separate input-pipeline issue, most often queue runners that were never started.
Here is the way I am mixing different dtypes in the record_defaults:
record_defaults = [tf.constant(.1, dtype=tf.float32) for count in range(100)]  # 100 float32 features
record_defaults.extend([tf.constant(1, dtype=tf.int32) for count in range(2)])  # 2 int32 features

find function matlab in numpy/scipy

Is there an equivalent of matlab's find(A>9,1) in numpy/scipy? I know that there is the nonzero function in numpy, but what I need is the first index so that I can use it in another extracted column.
Ex: A = [ 1 2 3 9 6 4 3 10 ]
find(A>9,1) would return index 4 in matlab
The equivalent of find in numpy is nonzero, but it does not support a second parameter.
But you can do something like this to get the behavior you are looking for.
B = nonzero(A >= 9)[0]
But if all you are looking for is finding the first element that satisfies a condition, you are better off using max.
For example, in matlab, find(A >= 9, 1) would be the same as [~, idx] = max(A >= 9). The equivalent function in numpy would be the following.
idx = (A >= 9).argmax()
matlab's find(X, K) is roughly equivalent to numpy.nonzero(X)[0][:K] in Python. @Pavan's argmax method is probably a good option if K == 1, but unless you know a priori that there will be a value in A >= 9, you will probably need to do something like:
idx = (A >= 9).argmax()
if (idx == 0) and (A[0] < 9):
    # No value in A is >= 9
    ...
I'm sure these are all great answers, but I wasn't able to make use of them. However, I found another thread that partially answers this:
MATLAB-style find() function in Python
John posted the following code, which accounts for the first argument of find (in your case A>9, i.e. find(A>9,1)) but not the second argument.
I altered John's code in a way that I believe accounts for the second argument ",1":
def indices(a, func):
    return [i for (i, val) in enumerate(a) if func(val)]

a = [1, 2, 3, 9, 6, 4, 3, 10]
threshold = indices(a, lambda y: y >= 9)[0]
This returns threshold = 3. My understanding is that Python's indexing starts at 0, so it's the equivalent of matlab saying 4. You can change which match is returned by changing the number in the brackets, i.e. [1], [2], etc. instead of [0].
John's original code:
def indices(a, func):
    return [i for (i, val) in enumerate(a) if func(val)]

a = [1, 2, 3, 1, 2, 3, 1, 2, 3]
inds = indices(a, lambda x: x > 2)
which returns:
>>> inds
[2, 5, 8]
Consider using argwhere in Python to replace MATLAB's find function. For example,
import numpy as np
A = [1, 2, 3, 9, 6, 4, 3, 10]
np.argwhere(np.asarray(A)>=9)[0][0] # Return first index
returns 3.
import numpy
A = numpy.array([1, 2, 3, 9, 6, 4, 3, 10])
index = numpy.where(A >= 9)
You can do this by first converting the list to an ndarray, then using numpy.where() to get the desired index.
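Note that numpy.where returns a tuple of index arrays, so pulling out the first matching position takes one more step; a quick sketch:
import numpy as np

A = np.array([1, 2, 3, 9, 6, 4, 3, 10])
index = np.where(A >= 9)    # tuple of arrays: (array([3, 7]),)
first = index[0][0]         # first index where the condition holds -> 3
print(first)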