Can you import a file by using a variable - json

I have a Python script that uses json to store data. The data also contains file names, so I was wondering if I could import a file using a variable. Example:
file = "apps/messenger"
import file as msg
If this isn't possible, that would confirm my hypothesis and I'll just import all of my files separately. But if it is possible, I would like to know how, just because it would make my life easier.
Thanks for any help!
-Jester

I'm not too good with Python, but when you handle files you normally use
file = open("path to file", 'r')  # use 'r' for read, 'w' for write
file.close()  # when you are done with the file you must close it
If you are going to name it msg, then change the variable from file to msg, like
msg = open("apps/messenger", 'r')
msg.close()  # when finished with the file
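For what the question actually asks (importing a module whose name is stored in a variable), the standard library's importlib can do it. A minimal sketch, assuming apps is a package on the import path and that messenger.py defines something callable (send() below is just a placeholder):
import importlib

module_name = "apps.messenger"              # a dotted module path, not a file path
msg = importlib.import_module(module_name)  # roughly equivalent to "import apps.messenger as msg"
msg.send("hello")                           # placeholder call; use whatever messenger.py actually defines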

Related

How can I write to JSON file, without deleting all the content in it?

import json
f = open("filename.json", "w")
data = {"username": "justausername"}
json.dump(data, f)
When I run this code, all the data in the "filename.json" is replaced by "{'username': 'justausername'}". Please help!
Read the file
Parse the JSON to a data structure
Modify the data structure instead of creating a new one
Serialise the data structure back to JSON
Write it to the file
Consider using a real database instead so that you get benefits like automatic protection for concurrent edits. SQLite is a good choice if you want a single file to store the data in.
import json

with open("filename.json", "r") as f:  # reading a file
    data = json.load(f)  # deserialization

data["username"] = "justausername"  # modifying the Python object

with open("filename.json", "w") as f:
    json.dump(data, f)  # serializing back to the original file
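If you do take the SQLite suggestion above, a minimal sketch using the standard library's sqlite3 module (the database, table and column names are only illustrative):
import sqlite3

conn = sqlite3.connect("data.db")  # a single file on disk
conn.execute("CREATE TABLE IF NOT EXISTS users (username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("justausername",))
conn.commit()  # SQLite handles locking for concurrent writers
conn.close()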

I am trying to parse 350 files on my local disk and store the data into database as json objects

I am parsing 350 txt files containing JSON data using Python. I am able to retrieve 62 of those objects and store them in a MySQL database, but after that I get an error saying JSONDecodeError: Extra data.
Python:
import os
import ast
import json
import mysql.connector as mariadb
from mysql.connector.constants import ClientFlag

mariadb_connection = mariadb.connect(user='root', password='137800000', database='shaproject', client_flags=[ClientFlag.LOCAL_FILES])
cursor = mariadb_connection.cursor()
sql3 = """INSERT INTO shaproject.alttwo (alttwo_id,responses) VALUES """

os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')
current_list_dir = os.listdir()
print(current_list_dir)
cur_cwd = os.getcwd()
cur_cwd = cur_cwd.replace('\\', '/')
twoid = 1

for every_file in current_list_dir:
    file = open(cur_cwd + "/" + every_file)
    utffile = file.read()
    data = json.loads(utffile)
    for i in range(0, len(data['response'])):
        data123 = json.dumps(data['response'][i])
        tup = (twoid, data123)
        print(sql3 + str(tup))
        twoid += 1
        cursor.execute(sql3 + str(tup) + ";")
        print(tup)
        mariadb_connection.commit()
I have searched online and found that multiple dump statements are resulting in this error. But I am unable to resolve it.
You want to use glob. Rather than os.listdir(), which is too permissive, use glob to focus on just the *.json files. Print out the name of the file before asking .loads() to parse it. Rename any badly formatted files to .txt rather than .json, in order to skip them.
Note that you can pass the open file directly to .load(), if you wish. Closing open files would be a good thing. Rather than a direct assignment (with no close()!) you would be better off with with:
with open(cur_cwd + "/" + every_file) as file:
    data = json.load(file)
Talking about "current current working directory" seems both repetitive and redundant. It would suffice to call it cwd.
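Putting those suggestions together, a rough sketch of the reading loop (the directory path comes from the question; the database inserts are unchanged and only hinted at here):
import glob
import json
import os

os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')

for filename in sorted(glob.glob('*.json')):   # only the .json files, not everything in the directory
    print(filename)                            # so you know which file fails before parsing it
    with open(filename) as f:                  # the file is closed automatically
        data = json.load(f)                    # pass the open file straight to load()
    # ... build the tuple and run the INSERT for each item in data['response'], as before ...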

Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution could be to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fit into the driver. It works for real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the Python standard library, this is a simple function that writes to a single file. You don't have to mess with temp files or go through another directory.
import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a CSV file."""
    with open(file_path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()                  # write the column names once
        for row in df.toLocalIterator():      # stream rows through the driver
            writer.writerow(row.asDict())
If the result size is comparable to the Spark driver node's free memory, you may have problems converting the dataframe to pandas.
I would tell Spark to save to some temporary location and then copy the individual CSV file into the desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with Databricks, Spark operates on files like dbfs:/mnt/..., and to use Python's file operations on them you need to change the path to /dbfs/mnt/... or (more native to Databricks) replace shutil.copyfile with dbutils.fs.cp.
A more Databricks-native version:
TEMPORARY_TARGET = "dbfs:/my_folder/filename"
DESIRED_TARGET = "dbfs:/my_folder/filename.csv"

spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)

# pick the part-* file by name rather than relying on a fixed index in the listing
part_name = next(f.name for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith('part-'))
dbutils.fs.cp(TEMPORARY_TARGET + "/" + part_name, DESIRED_TARGET)
Note that if you are working with a Koalas dataframe, you can replace spark_df with koalas_df.to_spark().
For pyspark, you can convert to a pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark dataframe API that writes/creates a single file instead of a directory as the result of a write operation.
Both options below will create one single file inside a directory along with the standard files (_SUCCESS, _committed, _started):
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1) and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need to write a function in the driver that combines all the data file parts into a single file (cat part-00000* singlefilename) once the write operation is done.
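A minimal sketch of that merge step on the driver, in plain Python (it assumes the part files ended up on a local filesystem and that every part was written with the same header row; the function name is only illustrative):
import glob
import shutil

def merge_parts(spark_output_dir, single_file_path):
    parts = sorted(glob.glob(spark_output_dir + "/part-*"))
    with open(single_file_path, "w") as out:
        for i, part in enumerate(parts):
            with open(part) as f:
                if i > 0:
                    next(f, None)           # drop the duplicated header line from later parts
                shutil.copyfileobj(f, out)  # stream the rest of the part into the single file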
I had the same problem and used Python's NamedTemporaryFile (from the standard library's tempfile module) to solve this.
from tempfile import NamedTemporaryFile
import boto3

s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    # S3_BUCKET and S3_FOLDER are assumed to be defined elsewhere
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file()
Create a temp folder inside the output folder, copy the part-00000* file to the output folder under the desired file name, then delete the temp folder. A Python snippet that does this in Databricks:
fpath = output + '/' + 'temp'

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

if file_exists(fpath):
    dbutils.fs.rm(fpath)
    df.coalesce(1).write.option("header", "true").csv(fpath)
else:
    df.coalesce(1).write.option("header", "true").csv(fpath)

fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides a file handle for the HDFS file system. You can then write your content to that handle like an ordinary file. Code example:
import pyarrow.fs as fs

HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'

hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialise pyarrow you have to define the CLASSPATH environment variable properly: set it to the output of hadoop classpath --glob.
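One way to set that up from the Python side, before constructing the filesystem; a sketch, assuming the hadoop binary is on PATH:
import os
import subprocess

# pyarrow reads CLASSPATH when the JVM behind HadoopFileSystem is started
os.environ['CLASSPATH'] = subprocess.check_output(['hadoop', 'classpath', '--glob']).decode().strip()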
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to give the CSV name every time, you can write a UDF or build an array of the CSV file names and pass it in; it will work.

Extracting text from plain HTML and write to new file

I'm extracting a certain part of an HTML document (to be fair: the basis for this is an iXBRL document, which means there is a lot of formatting code inside) and writing my output, the original file without the extracted part, to a .txt file. My aim is to measure the difference in document size (how many KB of the original document the extracted part accounts for). As far as I know there shouldn't be any difference between HTML and text format, so my difference should be reliable although I am comparing two different document formats. My code so far is:
import glob
import os
import contextlib
import re

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def extractor():
    os.chdir(r"F:\Test")
    with stdout2file("FileShortened.txt"):
        for file in glob.iglob('*.html', recursive=True):
            with open(file) as f:
                contents = f.read()
                extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
                cut = extract.sub('', contents)
                print(file.split(os.path.sep)[-1], end="| ")
                print(cut, end="\n")

extractor()
Note: I am NOT using BS4 or lxml because I am not only interested in the HTML text but actually in ALL lines between my start and end RegEx, including all formatting code lines.
My code works without problems; however, as I have a lot of files, my FileShortened.txt quickly becomes massive. My problem is not with the files or the extraction, but with redirecting my output to separate txt files. For now I am getting everything into one file; what I need is some kind of a "for each file searched, create a new txt file with the same name as the original document" condition (arcpy module?!).
Something like:
File1.html --> File1Short.txt
File2.html --> File2Short.txt
...
Is there also an easy way (without changing my code too much) to invert my code, in the sense of printing the "RegEx match" to a new .txt file instead of "everything except my RegEx match"?
Any help appreciated!
Ok, I figured it out.
Final Code is:
import glob
import os
import re
from os import path

def extractor():
    os.chdir(r"F:\Test")  # the directory containing my html
    for file in glob.glob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file) as f, open((file.rsplit(".", 1)[0]) + ".txt", "w") as out:
            contents = f.read()
            extract = re.compile(r'Start.*?End', re.I | re.S)
            cut = extract.sub('', contents)
            out.write(cut)
            out.close()  # redundant: the with statement already closes the file

extractor()
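For the inverted variant asked about above (writing the matched span instead of everything around it), a small variation on the same structure; a sketch, assuming the same placeholder pattern:
import glob
import os
import re

def match_extractor():
    os.chdir(r"F:\Test")
    pattern = re.compile(r'Start.*?End', re.I | re.S)
    for file in glob.glob("*.html"):
        with open(file) as f, open(file.rsplit(".", 1)[0] + "Match.txt", "w") as out:
            match = pattern.search(f.read())
            if match:  # write only the matched span, if the pattern was found
                out.write(match.group(0))

match_extractor()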

How to check encoding of a CSV file

I have a CSV file and I wish to understand its encoding. Is there a menu option in Microsoft Excel that can help me detect it,
or do I need to use a programming language like C# or PHP to deduce it?
You can use Notepad++ to evaluate a file's encoding without needing to write code. The evaluated encoding of the open file will display on the bottom bar, far right side. The encodings supported can be seen by going to Settings -> Preferences -> New Document/Default Directory and looking in the drop down.
On Linux systems, you can use the file command. It will give the correct encoding.
Sample:
file blah.csv
Output:
blah.csv: ISO-8859 text, with very long lines
If you use Python, you can print() the file object to see the encoding it is opened with. For example:
with open('file_name.csv') as f:
    print(f)
The output is something like this:
<_io.TextIOWrapper name='file_name.csv' mode='r' encoding='utf8'>
You can also use the Python chardet library.
# install the chardet library
!pip install chardet

# import the chardet library
import chardet

# use the detect method to find the encoding
# 'rb' means read in the file as binary
with open("test.csv", 'rb') as file:
    print(chardet.detect(file.read()))
Use chardet https://github.com/chardet/chardet (the documentation is short and easy to read).
Install Python, then pip install chardet, and finally use the chardetect command-line tool.
I tested it under GB2312 and it's pretty accurate. (Make sure you have at least a few characters; a sample with only 1 character can easily fail.)
file is not reliable, as you can see.
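For reference, the command-line tool that the chardet package installs is called chardetect; usage is simply:
chardetect blah.csv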
Or you can execute in python console or in Jupyter Notebook:
import csv
data = open("file.csv","r")
data
You will see information about the data object like this:
<_io.TextIOWrapper name='arch.csv' mode='r' encoding='cp1250'>
As you can see, it contains encoding information.
CSV files have no headers indicating the encoding.
You can only guess by looking at:
the platform / application the file was created on
the bytes in the file (for example, a byte order mark; one such check is sketched below)
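One concrete way to "look at the bytes" is to check for a byte order mark at the start of the file; a minimal sketch (it only identifies encodings that actually write a BOM, everything else remains a guess):
import codecs

BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),   # check the 4-byte BOMs first:
    (codecs.BOM_UTF32_BE, 'utf-32-be'),   # the UTF-16 LE BOM is a prefix of the UTF-32 LE one
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
]

def sniff_bom(path):
    with open(path, 'rb') as f:
        head = f.read(4)                  # the longest BOM is 4 bytes
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None                           # no BOM: the encoding has to be guessed another way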
In 2021, emoticons are widely used, but many import tools fail to import them. The chardet library is often recommended in the answers above, but the lib does not handle emoticons well.
icecream = '🍦'

import csv
with open('test.csv', 'w') as f:
    wf = csv.writer(f)
    wf.writerow(['ice cream', icecream])

import chardet
with open('test.csv', 'rb') as f:
    print(chardet.detect(f.read()))
{'encoding': 'Windows-1254', 'confidence': 0.3864823918622268, 'language': 'Turkish'}
This gives UnicodeDecodeError while trying to read the file with this encoding.
The default encoding on Mac is UTF-8. It's included explicitly here but that wasn't even necessary... but on Windows it might be.
with open('test.csv', 'r', encoding='utf-8') as f:
    print(f.read())
ice cream,🍦
The file command also picked this up
file test.csv
test.csv: UTF-8 Unicode text, with CRLF line terminators
My advice in 2021, if the automatic detection goes wrong: try UTF-8 before resorting to chardet.
In Python, you can try:
import pandas as pd
from encodings.aliases import aliases

alias_values = set(aliases.values())
for encoding in alias_values:
    try:
        df = pd.read_csv("test.csv", encoding=encoding)
        print('successful', encoding)
    except Exception:
        pass
As mentioned by Jitender Kumar (#3724913), use the file command (it also works in WSL on Windows). I was able to get the encoding information of a CSV file by executing file --exclude encoding blah.csv, using the info available in man file, since a plain file blah.csv won't show the encoding info on my system.
import os
import pandas as pd
import chardet

def read_csv(path: str, size: float = 0.10) -> pd.DataFrame:
    """
    Reads the CSV file located at path and returns it as a Pandas DataFrame,
    detecting the file's encoding from a sample of its bytes first.

    Args:
        path (str): The path to the CSV file.
        size (float): The fraction of the file to be used for detecting the
            encoding. Defaults to 0.10.

    Returns:
        pd.DataFrame: The CSV file as a Pandas DataFrame.

    Raises:
        UnicodeError: If the encoding of the file cannot be detected with the
            initial size, the function retries with a larger size (increased by
            0.20) until the encoding can be detected or an error is raised.
    """
    try:
        byte_size = int(os.path.getsize(path) * size)
        with open(path, "rb") as rawdata:
            result = chardet.detect(rawdata.read(byte_size))
        return pd.read_csv(path, encoding=result["encoding"])
    except UnicodeError:
        return read_csv(path=path, size=size + 0.20)
I just added a function that finds the correct encoding and reads the CSV at the given file path. Thought it would be useful.
Just add the encoding argument that matches the file you're trying to upload.
open('example.csv', encoding='UTF8')