Open heavy string-text dictionaries in python - json

I am trying to convert a heavy .txt file whose content is the string representation of a dictionary, like this (an excerpt):
"{'A45171': {'Gen_n': 'Putative uncharacterized protein', 'Srce': 'UniProtKB', 'Ref': 'GO_REF:0000033', 'Tax': 'NCBITaxon:684364', 'Gen_bl': 'BATDEDRAFT_15336', 'Gen_id': 'UniProtKB:F4NTD6', 'Ev_n': 'IBA', 'GO_n': 'ergosterol biosynthetic process', 'GO': 'GO:0006696', 'Org': 'Batrachochytrium dendrobatidis JAM81', 'Type': 'protein', 'Ev_e': 'ECO:0000318', 'Con': 'GO_Central'}, 'A43886': {'Gen_n': 'Uncharacterized protein', 'Srce': 'UniProtKB', 'Ref': 'GO_REF:0000002', 'Tax': 'NCBITaxon:9823', 'Gen_bl': 'RDH8', 'Gen_id': 'UniProtKB:F1S3H8', 'Ev_n': 'IEA', 'GO_n': 'estrogen biosynthetic process', 'GO': 'GO:0006703', 'Org': 'Sus scrofa', 'Type': 'protein', 'Ev_e': 'ECO:0000501', 'Con': 'InterPro'}}"
I've tried the ast module:
import ast

with open("Gene_Ontology/output_data/dic_gene_definitions.txt", "r") as handle:
    dic_gene_definitions = ast.literal_eval(handle.read())
The file weighs 22 MB, and when it doesn't crash, it runs very slowly.
I really want to open 500 MB files...
I've looked at the json module, which can load much faster, but on a heavy dictionary string it also crashes (not with short examples).
Any solution...?
Thank you so much.

I've looked for a method that doesn't consume so much RAM.
By running, in Ubuntu's terminal:
sudo swapon -s
we can see the swap space used by different operations:
Filename Type Size Used Priority
/swapfile file 2097148 19876 -1
To work with this example file (500 MB) and build a dictionary, for example, the best way is to read it from a normal tab-separated text format and process it with minor RAM consumption:
with open("Gene_Ontology/output_data/GO_annotations_dictionary.txt", "r") as handle:
for record in handle.read().splitlines():
anote = record.split("\t")
The ast module is fine, but not for large files.
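If keeping the dictionary format is preferable, another option (my own sketch, not from the original post) is a one-off conversion: json.loads cannot read the single-quoted file directly, but you can parse it once with ast.literal_eval, dump it as real JSON, and then use the much faster json.load on every later run. The paths are the ones from the question.

import ast
import json

# One-off conversion: parse the single-quoted dict text and save it as real JSON.
with open("Gene_Ontology/output_data/dic_gene_definitions.txt", "r") as handle:
    data = ast.literal_eval(handle.read())

with open("Gene_Ontology/output_data/dic_gene_definitions.json", "w") as out:
    json.dump(data, out)

# Every later run can load the JSON version much faster:
with open("Gene_Ontology/output_data/dic_gene_definitions.json", "r") as handle:
    data = json.load(handle)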

How to incorporate projected columns in scanner into new dataset partitioning

Let's say I load a dataset
myds=ds.dataset('mypath', format='parquet', partitioning='hive')
myds.schema
# On/Off_Peak: string
# area: string
# price: decimal128(8, 4)
# date: date32[day]
# hourbegin: int32
# hourend: int32
# inflation: string rename to Inflation
# Price_Type: string
# Reference_Year: int32
# Case: string
# region: string rename to Region
My end goal is to resave the dataset with the following projection:
projection = {
    'Region': ds.field('region'),
    'Date': ds.field('date'),
    'isPeak': pc.equal(ds.field('On/Off_Peak'), ds.scalar('On')),
    'Hourbegin': ds.field('hourbegin'),
    'Hourend': ds.field('hourend'),
    'Inflation': ds.field('inflation'),
    'Price_Type': ds.field('Price_Type'),
    'Area': ds.field('area'),
    'Price': ds.field('price'),
    'Reference_Year': ds.field('Reference_Year'),
    'Case': ds.field('Case'),
}
I make a scanner
scanner=myds.scanner(columns=projection)
Now I try to save my new dataset with
ds.write_dataset(scanner, 'newpath',
                 partitioning=['Reference_Year', 'Case', 'Region'], partitioning_flavor='hive',
                 format='parquet')
but I get
KeyError: 'Column Region does not exist in schema'
I can work around this by changing my partitioning to ['Reference_Year', 'Case', 'region'] to match the non-projected columns (and then later changing the name of all those directories) but is there a way to do it directly?
Suppose my partitioning needed the compute for more than just the column name changing. Would I have to save a non-partitioned dataset in one step to get the new column and then do another save operation to create the partitioned dataset?
EDIT: this bug has been fixed in pyarrow 10.0.0
It looks like a bug to me. It's as if write_dataset is looking at the dataset_schema rather than the projected_schema
I think you can get around it by calling to_reader on the scanner.
table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)
ds.write_dataset(
    scanner.to_reader(),
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')
I've reported the issue here
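Applied to the original example, the workaround would look something like this (my sketch, for pyarrow versions before 10.0.0 where the bug is present):

scanner = myds.scanner(columns=projection)
ds.write_dataset(scanner.to_reader(), 'newpath',
                 partitioning=['Reference_Year', 'Case', 'Region'], partitioning_flavor='hive',
                 format='parquet')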

Useful way to convert string to dictionary using python

I have the below string as input:
'name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'
I wrote a function which converts it to a dictionary using Python:
def str_2_json(string):
    str_arr = string.split(',')
    # str_arr[0] = 'name SP2'
    # str_arr[1] = ' status Online'
    json_data = {}
    for i in str_arr:
        # remove extra whitespace
        stripped_str = " ".join(i.split())  # i.strip()
        subarray = stripped_str.split(' ')
        # subarray[0] = 'name'
        # subarray[1] = 'SP2'
        key = subarray[0]    # key: 'name'
        value = subarray[1]  # value: 'SP2'
        json_data[key] = value
        # json_data['name'] = 'SP2'
        # json_data['status'] = 'Online'
    return json_data
The caller then turns the returned dictionary into JSON (it uses jsonify).
Is there a simple/elegant way to do it better?
You can do this with a regex:
import re

def parseString(s):
    return dict(re.findall(r'(?:(\S+) ([^,]+)(?:, )?)', s))
sample = "name SP1, status Offline, size 4764771 MB, free 2406182 MB, path /dev/sdb, log 230 MB, port 5660, guid a48134c00cda2c37005b30b0e40e3ed6, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sdb /dev/sdc /dev/sdd, dare 0"
parseString(sample)
Output:
{'name': 'SP1',
'status': 'Offline',
'size': '4764771 MB',
'free': '2406182 MB',
'path': '/dev/sdb',
'log': '230 MB',
'port': '5660',
'guid': 'a48134c00cda2c37005b30b0e40e3ed6',
'clusterUuid': '-8650609094877646407--116798096584060989',
'disks': '/dev/sdb /dev/sdc /dev/sdd',
'dare': '0'}
Your approach is good, except for a couple of weird things:
You aren't creating a JSON anything, so to avoid any confusion I suggest you don't name your returned dictionary json_data or your function str_2_json. JSON, or JavaScript Object Notation, is just that -- a standard for denoting an object as text. The objects themselves have nothing to do with JSON.
You can use i.strip() instead of joining the split string (not sure why you did it this way, since you commented out i.strip()).
Some of your values contain multiple spaces (e.g. "size 4764771 MB" or "disks /dev/sde /dev/sdf /dev/sdg"). With your code, you end up dropping everything after the second space in such strings. To avoid this, do stripped_str.split(' ', 1), which limits how many times the string is split.
Other than that, you could create a dictionary in one line using the dict() constructor and a generator expression:
def str_2_dict(string):
    data = dict(item.strip().split(' ', 1) for item in string.split(','))
    return data
print(str_2_dict('name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'))
Outputs:
{
'name': 'SP2',
'status': 'Online',
'size': '4764771 MB',
'free': '2576353 MB',
'path': '/dev/sde',
'log': '210 MB',
'port': '5660',
'guid': '7478a0141b7b9b0d005b30b0e60f3c4d',
'clusterUuid': '-8650609094877646407--116798096584060989',
'disks': '/dev/sde /dev/sdf /dev/sdg',
'dare': '0'
}
This is probably the same (practically, in terms of efficiency / time) as writing out the full loop:
def str_2_dict(string):
    data = dict()
    for item in string.split(','):
        key, value = item.strip().split(' ', 1)
        data[key] = value
    return data
Assuming these fields cannot contain internal commas, you can use re.split to both split and remove surrounding whitespace. It looks like you have different types of fields that should be handled differently. I've added a guess at a schema handler based on field names that can serve as a template for converting the various fields as needed.
And as noted elsewhere, there is no JSON here, so don't use that name.
import re

test = 'name SP2, status Online, size 4764771 MB, free 2576353 MB, path /dev/sde, log 210 MB, port 5660, guid 7478a0141b7b9b0d005b30b0e60f3c4d, clusterUuid -8650609094877646407--116798096584060989, disks /dev/sde /dev/sdf /dev/sdg, dare 0'

def decode_data(string):
    str_arr = re.split(r"\s*,\s*", string)
    data = {}
    for entry in str_arr:
        values = re.split(r"\s+", entry)
        key = values.pop(0)
        # schema processing
        if key in ("disks",):  # multivalue keys
            data[key] = values
        elif key in ("size", "free"):  # convert to int bytes using the 2nd value
            multiplier = {"MB": 10**6, "MiB": 2**20}  # todo: expand as needed
            data[key] = int(values[0]) * multiplier[values[1]]
        else:
            data[key] = " ".join(values)
    return data

decoded = decode_data(test)
for kv in sorted(decoded.items()):
    print(kv)
import json
json_data = json.loads(string)

CSV Reader works, Trouble with CSV writer

I am writing a very simple python script to READ a CSV (no problem) and to write to another CSV (issue):
System info:
Windows 10
Powershell
Python 3.6.5 :: Anaconda, Inc.
Sample Data: Office Events
The purpose is to filter events based on criteria, and to write to another CSV with desired criteria.
For Example:
I would like to read from this CSV and write the events where Registrations (or column 4) is Greater than 0 (remove rows with registrations = 0)
# SCRIPT TO FILTER EVENTS TO BE PROCESSED
import os
import time
import shutil
import os.path
import fnmatch
import csv
import glob
import pandas

# Location of file containing ALL events
path = r'allEvents.csv'

# Writes to writer
writer = csv.writer(open(r'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv', "wb"))
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
#writer.writerow([r'Event Name', r'Start Date', r'End Date', r'Registrations', r'Total Revenue', r'ID', r'Status'])
#writer.writerow([b'Event Name', b'Start Date', b'End Date', b'Registrations', b'Total Revenue', b'ID', b'Status'])

def checkRegistrations(file):
    reader = csv.reader(file)
    data = list(reader)
    for row in data:
        #if row[3] > str(0):
        if row[3] > int(0):
            writer.writerow(([data]))
The Error I continue to get is:
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
TypeError: a bytes-like object is required, not 'str'
I have tried using the various commented out statements
For Example:
"" vs r"" vs r'' vs b''
if row[3] > int(0) **vs** if row[3] > str(0)
Every time I execute my script, it creates the file... so the first csv.writer line works (it creates and opens the file)... the second line (which writes the headers) is when the error appears...
Perhaps I am getting mixed up with syntax due to Python versions, or perhaps I am misusing the csv library, or (more than likely) I have a great deal to learn about data type IO and conversion... someone please help!!
I am aware of the excess of imported libraries -- the script came from another basic script to move files from one location to another based on filename and output a row counter for each file being moved.
With that being said, I may be unaware of any missing/needed libraries.
Please let me know if you have any questions, concerns or clarifications
Thanks in advance!
It looks like you are calling:
writer = csv.writer(open('file.csv', 'wb'))
The 'wb' argument is the file mode. The 'b' means that you are opening the file that you are writing to in binary mode. You are then trying to write a string which isn't what it is expecting.
Try getting rid of the 'b' in the 'wb'.
writer = csv.writer(open('file.csv', 'w'))
Let me know if that works for you.
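For completeness, here is a minimal sketch of the whole filter with that fix applied (my own version, not from the original answer; it assumes the input has a header row, that column 4 at index 3 holds Registrations, and it also compares the value as an integer rather than a string):

import csv
import time

out_name = 'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv'

with open('allEvents.csv', newline='') as src, open(out_name, 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
    next(reader)  # skip the input header row
    for row in reader:
        if int(row[3]) > 0:  # keep only events with at least one registration
            writer.writerow(row)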

How to specify gunicorn log max size

I'm running gunicorn as:
gunicorn --bind=0.0.0.0:5000 --log-file gunicorn.log myapp:app
Seems like gunicorn.log keeps growing. Is there a way to specify a max size of the log file, so that when it reaches the max size, it'll just overwrite it?
Thanks!!
TLDR;
I believe there might be a "Python only" solution using the rotating file handler provided in Python's standard library (tested on at least 3.10).
To test
I created a pet project for you to fiddle with:
Create the following python file
test_logs.py
import logging
import logging.config
import time

logging.config.fileConfig(fname='log.conf', disable_existing_loggers=False)

while True:
    time.sleep(0.5)
    logging.debug('This is a debug message')
    logging.info('This is an info message')
    logging.warning('This is a warning message')
    logging.error('This is an error message')
    logging.critical('This is a critical message')
Create the following config file
log.conf
[loggers]
keys=root
[handlers]
keys=rotatingHandler
[formatters]
keys=sampleFormatter
[logger_root]
level=DEBUG
handlers=rotatingHandler
[handler_rotatingHandler]
class=logging.handlers.RotatingFileHandler
level=DEBUG
formatter=sampleFormatter
args=('./logs/logs.log', 'a', 1200, 10, 'utf-8')
[formatter_sampleFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
Create the ./logs directory
Run python test_logs.py
To Understand
As you may have noticed already, the setting that allows for this behaviour is logging.handlers.RotatingFileHandler and the provided arguments args=('./logs/logs.log', 'a', 1200, 10, 'utf-8').
RotatingFileHandler is a stream handler writing to a file. It allows for 2 parameters of interest:
maxBytes set arbitrarily at 1200
backupCount set arbitrarily to 10
The behaviour is that upon reaching 1200 bytes in size, the file is closed, renamed to ./logs/logs.log.<a number up to 10>, and a new file is opened.
BUT if either maxBytes or backupCount is 0, no rotation is done!
In Gunicorn
As per the documentation you can feed a config file.
This could look like:
gunicorn --bind=0.0.0.0:5000 --log-config log.conf myapp:app
You will need to tweak it to your existing setup.
On Ubuntu/Linux, I suggest using logrotate to manage your logs; see: https://stackoverflow.com/a/55643449/6705684
For Python > 3.3, with RotatingFileHandler, here is my solution (macOS/Windows/Linux/...):
import os
import logging
from logging.handlers import RotatingFileHandler

fmt_str = '[%(asctime)s]%(module)s - %(funcName)s - %(message)s'
fmt = logging.Formatter(fmt_str)

def rotating_logger(name, fmt=fmt,
                    level=logging.INFO,
                    logfile='.log',
                    maxBytes=10 * 1024 * 1024,
                    backupCount=5,
                    **kwargs
                    ):
    logger = logging.getLogger(name)
    logger.setLevel(level)  # ensure the logger itself passes records at this level
    hdl = RotatingFileHandler(logfile, maxBytes=maxBytes, backupCount=backupCount)
    hdl.setLevel(level)
    hdl.setFormatter(fmt)
    logger.addHandler(hdl)
    return logger
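A usage sketch (my own; the logger name and file name are just examples):

log = rotating_logger('myapp', logfile='myapp.log')
log.info('application started')  # rotates through myapp.log.1 ... myapp.log.5 once 10 MB is exceeded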
For more, refer to:
https://docs.python.org/3/library/logging.handlers.html#rotatingfilehandler

The speed of mongoimport while using -jsonArray is very slow

I have a 15 GB file with more than 25 million rows, which is in this JSON format (which is accepted by MongoDB for importing):
[
    {"_id": 1, "value": "\u041c\..."},
    {"_id": 2, "value": "\u041d\..."},
    ...
]
When I try to import it into MongoDB with the following command, I get a speed of only 50 rows per second, which is really slow for me.
mongoimport --db wordbase --collection sentences --type json --file C:\Users\Aleksandar\PycharmProjects\NLPSeminarska\my_file.json -jsonArray
When I tried to insert the data into the collection using Python with pymongo, the speed was even worse. I also tried increasing the priority of the process, but it didn't make any difference.
The next thing I tried was the same import but without -jsonArray, and although I got a big speed increase (~4000/sec), it said that the BSON representation of the supplied JSON is too large.
I also tried splitting the file into 5 separate files and importing them from separate consoles into the same collection, but the speed of all of them decreased to about 20 documents/sec.
While searching all over the web I saw that people had speeds of over 8K documents/sec, and I can't see what I'm doing wrong.
Is there a way to speed this thing up, or should I convert the whole JSON file to BSON and import it that way? If so, what is the correct way to do both the converting and the importing?
Huge thanks.
I have the exact same problem with a 160 GB dump file. It took me two days to load 3% of the original file with -jsonArray, and 15 minutes with these changes.
First, remove the initial [ and trailing ] characters:
sed -i 's/^\[//; s/\]$//' filename.json
Then import without the -jsonArray option:
mongoimport --db "dbname" --collection "collectionname" --file filename.json
If the file is huge, sed will take a really long time and maybe you run into storage problems. You can use this C program instead (not written by me, all glory to #guillermobox):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    FILE *f;
    const size_t buffersize = 2048;
    size_t length, filesize, position;
    char buffer[buffersize + 1];

    if (argc < 2) {
        fprintf(stderr, "Please provide file to mongofix!\n");
        exit(EXIT_FAILURE);
    }

    f = fopen(argv[1], "r+");

    /* get the full filesize */
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    /* ignore the first character */
    fseek(f, 1, SEEK_SET);

    while (1) {
        /* read chunks of buffersize size */
        length = fread(buffer, 1, buffersize, f);
        position = ftell(f);

        /* write the same chunk, one character before */
        fseek(f, position - length - 1, SEEK_SET);
        fwrite(buffer, 1, length, f);

        /* return to the reading position */
        fseek(f, position, SEEK_SET);

        /* we have finished when not all the buffer is read */
        if (length != buffersize)
            break;
    }

    /* truncate the file, with two less characters */
    ftruncate(fileno(f), filesize - 2);

    fclose(f);
    return 0;
}
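If a C compiler isn't handy, the same in-place rewrite can be sketched in Python (my own version, not from the original answer; it assumes the file's very first byte is '[' and its very last byte is ']'):

import os

def strip_json_array_brackets(path, bufsize=1 << 20):
    # Shift the whole file one byte to the left (dropping the leading '['),
    # then truncate the trailing ']' plus the duplicated last byte,
    # mirroring the C program above. The file is modified in place.
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        filesize = f.tell()
        read_pos, write_pos = 1, 0  # start reading after the leading '['
        while True:
            f.seek(read_pos)
            chunk = f.read(bufsize)
            if not chunk:
                break
            read_pos = f.tell()
            f.seek(write_pos)
            f.write(chunk)
            write_pos = f.tell()
        f.flush()
        f.truncate(filesize - 2)

strip_json_array_brackets("my_file.json")  # example path from the question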
P.S.: I don't have the power to suggest a migration of this question but I think this could be helpful.