How to get paperspace of each entity in a DXF file? - dxf

I would like to know on which paperspace each entity in a DXF file would be located.
I tried to find this with the code below, but the layout name I get is always the same for every entity, even though there are multiple paperspaces in the DXF file.
How can I get the paperspace for each entity in a DXF file?
import ezdxf

# Read a DXF file
doc = ezdxf.readfile(file_path, encoding='shift-jis')

# Print the names of the paperspaces
print(doc.layout_names_in_taborder())
# -> ['Model', '0A_Cover', '0B_Number', '0C_List', '01', '01H', '02', '02H', '03', '03H', '04', '04H', '05', '05H', '06', '06H', '07', '07H']

# Print the name of the paperspace for each entity in the DXF file
for e in doc.entities:
    # Paperspace name
    print(e.drawing.layout().name)
# -> 0A_Cover
# 0A_Cover
# 0A_Cover
# ...
# The paperspace name is always '0A_Cover' for all entities.

Get the layout you want by the name as shown in tabs:
layout = doc.layout('0A_Cover')
Iterate over the entities of this layout:
for e in layout:
    print(str(e))
or use an entity query:
lines = layout.query('LINE')
for line in lines:
    print(str(line))
doc.entities contains only the entities of the model space and the currently active paper space, also known as the ENTITIES section of the DXF file.
To iterate over all entities in layouts (including 'Model') use:
for name in doc.layout_names_in_taborder():
    for e in doc.layout(name):
        print(str(e))
To iterate over all entities in layouts and blocks use:
for e in doc.chain_layouts_and_blocks():
    print(str(e))
To iterate over all DXF entities of a DXF document, use the entity database:
for e in doc.entitydb.values():
    print(str(e))
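Putting these pieces together, a minimal sketch (reusing file_path from the question) that records, for every entity, the name of the layout that owns it:
import ezdxf

doc = ezdxf.readfile(file_path, encoding='shift-jis')

# Walk every layout once and remember, per entity handle, the name of the
# layout (paperspace or 'Model') that owns the entity.
owner_layout = {}
for name in doc.layout_names_in_taborder():
    for e in doc.layout(name):
        owner_layout[e.dxf.handle] = name

for handle, layout_name in owner_layout.items():
    print(handle, '->', layout_name)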

Related

Loading Multiple CSV files across all subfolder levels with Wildcard file name

I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, creating a list of file names, loading those CSVs into a list of dataframes, and then concatenating that list into a single dataframe.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner kind of approach.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
|
  Subfolder1
  |
    Subfolder 2
    |
      X01.csv
      Y01.csv
      Z01.csv
  |
  Subfolder3
  |
    Subfolder4
    |
      X01.csv
      Y01.csv
  |
  Subfolder5
  |
    X01.csv
    Y01.csv
I want to read all "X01.csv" files while reading from Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFileLookup", "true").option("header", "true").load(filepath)
This code works fine for a single level of subfolders; is there any equivalent of this for multi-level folders? I thought the "recursiveFileLookup" option would look across all levels of subfolders, but apparently this is not the way it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help, please?
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv capability of reading a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[X|Y|Z][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
To return a list of all csv files inside any subdirectory of rootpath.
Additionally, to consider only the files in the immediate subdirectories of rootpath, you could use something like:
glob.glob(rootpath + "*/*.csv", recursive=True)
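As a hedged alternative to globbing on the driver (assuming Spark 3.0+, where recursiveFileLookup and pathGlobFilter are documented data source options), Spark itself can recurse through all subfolder levels and filter on the file name:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# recursiveFileLookup makes Spark descend through every subfolder level,
# while pathGlobFilter keeps only files whose *name* matches the pattern.
df = (spark.read
      .option("header", "true")
      .option("recursiveFileLookup", "true")
      .option("pathGlobFilter", "X*.csv")
      .csv("rootpath"))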
Edit
Based on your comments to this answer, does something like this work on Databricks?
from notebookutils import mssparkutils as ms
from py4j.protocol import Py4JJavaError

# Databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls

def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = '',
):
    """Scan a directory and its subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str
        The name of the account used to access the container folders.
        This value is only used when `initial_path` doesn't conform
        to the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'
    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc
    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            found = [*found, *scan_dir(p, search_str, account_name)]
        if search_str.lower() in path.name.lower():
            # print(p.split('.net')[-1])
            found = [*found, p.replace(path.name, "")]
    return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
The method above worked on Azure Synapse.
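For Databricks specifically, here is an untested sketch based on the dbutils.fs.ls documentation linked in the code comments above (its FileInfo objects expose .path, .name and an isDir() method); unlike scan_dir, this variant collects the matching file paths themselves:
def scan_dir_databricks(initial_path: str, search_str: str):
    """Recursively collect paths of files whose name contains `search_str`.

    Relies on `dbutils`, which is only available inside Databricks notebooks.
    """
    found = []
    for info in dbutils.fs.ls(initial_path):
        if info.isDir():
            found.extend(scan_dir_databricks(info.path, search_str))
        elif search_str.lower() in info.name.lower():
            found.append(info.path)
    return found

# Example (placeholders as in the Synapse call above):
# spark.read.csv(scan_dir_databricks("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".csv"), header=True)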

Removing data from a json file on bases of their value

I produced a script to parse some BLAST files from different samples. As I wanted to know which genes all the samples have in common, I created a list and a dictionary to count them. I also produced a JSON file from the dictionary. Now I want to remove the genes whose counts are less than 100 (the number of samples), either from the dictionary or from the JSON file, but I don't know how.
This is part of the code:
### to produce a dictionary with the genes and their repetitions
for extracted_gene in matches:
    if extracted_gene in matches_counts:
        matches_counts[extracted_gene] += 1
    else:
        matches_counts[extracted_gene] = 1
print matches_counts  # check point
#if matches_counts[extracted_gene] == 100:
#    print extracted_gene

# to convert the dictionary into a txt file and format it with json
with open('my_gene_extraction_trial.txt', 'w') as file:
    json.dump(matches_counts, file, sort_keys=True, indent=2, separators=(',', ':'))
print 'Parsing has finished'
I have tried different ways to do so:
a) ignoring the else statement, but then it gives me an empty dict
b) trying to print only the ones whose value is 100, but nothing gets printed
c) reading the json documentation, but I only see how to delete elements by key, not by value.
Can anyone help me with this issue, please? This is driving me mad!
This is what it should look like:
# matches (list) and matches_counts (dict) already defined
for extracted_gene in matches:
    if extracted_gene in matches_counts:
        matches_counts[extracted_gene] += 1
    else:
        matches_counts[extracted_gene] = 1
print matches_counts  # check point

# Create a copy of the dict of matches to remove items from
counts_100 = matches_counts.copy()
for extracted_gene in matches_counts:
    if matches_counts[extracted_gene] < 100:
        del counts_100[extracted_gene]
print counts_100
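Alternatively, the same filtering can be done with a dict comprehension and the result dumped straight back to the JSON file (a minimal sketch, assuming matches_counts has already been built as above):
import json

# Keep only the genes that were counted in all 100 samples.
counts_100 = {gene: count for gene, count in matches_counts.items() if count >= 100}

with open('my_gene_extraction_trial.txt', 'w') as outfile:
    json.dump(counts_100, outfile, sort_keys=True, indent=2, separators=(',', ':'))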
Let me know if you still get errors.

Extracting greek characters from technical PDF documents when using Python 3

I'm currently trying to construct a database of chemicals used in a university department, and their hazard classes. I then wish to output to a csv file. One step is to pull all the synonyms for the various chemicals from standard PDFs, such as this for gamma hexalactone:
sample PDF
At the moment, the code I'm using to extract the text just loses the greek characters which I need to transfer. It looks like this:
pdfReader = PyPDF2.PdfFileReader(inpathf)
txtObj = ''
for pageNum in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    txtObj += str(pageObj.extractText())
inpathf.close()
outputf.write(txtObj)
outputf.close()
return txtObj
Parameters are extracted from ~2000 PDFs and stored in a dictionary before being transferred to a csv file:
def Outfile_csv(outfile, dict1, length):
    outputfile = open((outfile) + '.csv', 'w', newline='')
    output_list = []
    outputWriter = csv.writer(outputfile)
    outputWriter.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                           'TWA limits /ppm', 'STEL limits /ppm'])
    for r in range(0, length):
        output_list = []
        for s in range(0, 7):
            if s == 0 or s == 3:
                output_list.append(str((dict1[s][r])).encode('utf-8'))
            else:
                output_list.append(str(dict1[s][r]))
        outputWriter.writerow(output_list)
    outputfile.close()
I also can't write out to the CSV in cases where there are Greek characters; those data are simply not placed in the CSV file. Many thanks for any help - a day playing with codecs and the contents of Stack Exchange has not helped yet. I'm using Python 3.4 and Windows 8.
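One thing worth checking on the CSV side (a hedged sketch reusing outfile, dict1 and length from the code above; the function name is illustrative): in Python 3 you can open the file with an explicit UTF-8 encoding and write the strings themselves instead of str(value.encode('utf-8')), which stores literal b'...' text. Whether the Greek glyphs survive at all still depends on what PyPDF2's extractText() returns for the PDF.
import csv

def outfile_csv_utf8(outfile, dict1, length):
    # Open the CSV with an explicit UTF-8 encoding so non-ASCII characters
    # can be written, and write the strings directly instead of encoded bytes.
    with open(outfile + '.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                         'TWA limits /ppm', 'STEL limits /ppm'])
        for r in range(length):
            writer.writerow([str(dict1[s][r]) for s in range(7)])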

Parse txt file with shell

I have a txt file containing the output from several commands executed on a piece of networking equipment. I want to parse this txt file so I can sort the data and print it on an HTML page.
What is the best/easiest way to do this? Export every command to an array and then print the sorted arrays in the HTML code?
Command outputs sit between ruler lines and contain tabular data. Example:
*********************************************************************
# command 1
*********************************************************************
Object column1 column2 Total
-------------------------------------------------------------------
object 1 526 9484 10010
object 2 2 10008 10010
Object 3 0 20000 20000
*********************************************************************
# command 2
*********************************************************************
(... tabular data ...)
Can someone suggest any code or an example file where I can see how to make this work?
Thanks!
This can be easily done in Python with this example code:
f = open('input.txt')

rulers = 0
table = []
for line in f.readlines():
    if '****' in line:
        rulers += 1
        if rulers == 2:
            # second ruler: the table rows start on the next line
            table = []
        elif rulers > 2:
            # third ruler: the table is complete; this ruler also
            # opens the next command block, so count it as the first one
            print(table)
            rulers = 1
        continue
    if line == '\n' or '----' in line or line.startswith('#'):
        continue
    table.append(line.split())
print(table)
f.close()
It just prints a list of lists of the tabular values, but this can be formatted into whatever HTML or other output you need.
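For instance, a small helper (just a sketch; the function name is illustrative) can turn one parsed table into an HTML <table> string, treating the first row as the header:
def table_to_html(rows):
    # `rows` is one parsed table: a list of rows, each a list of cells;
    # the first row is rendered as the header.
    html = ['<table>']
    for i, row in enumerate(rows):
        tag = 'th' if i == 0 else 'td'
        cells = ''.join('<{0}>{1}</{0}>'.format(tag, cell) for cell in row)
        html.append('<tr>{0}</tr>'.format(cells))
    html.append('</table>')
    return '\n'.join(html)

# print(table_to_html(table))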
Import into your spreadsheet software. Export to HTML from there, and modify as needed.

How do I feed in my own data into PyAlgoTrade?

I'm trying to use PyAlgoTrade's event profiler.
However, I don't want to use data from Yahoo! Finance; I want to use my own, but I can't figure out how to parse in the CSV. It is in this format:
Timestamp Low Open Close High BTC_vol USD_vol [8] [9]
2013-11-23 00 800 860 847.666666 886.876543 853.833333 6195.334452 5248330 0
2013-11-24 00 745 847.5 815.01 860 831.255 10785.94131 8680720 0
The complete CSV is here
I want to do something like:
def main(plot):
instruments = ["AA", "AES", "AIG"]
feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv
with open( 'FinexBTCDaily.csv', 'rb' ) as csvfile:
    data = csv.reader( csvfile )

def main( plot ):
    feed = data
But it throws an attribute error. Any ideas how to do this?
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and clean up the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it so it looks like a Yahoo feed. In your case you would have to adapt the columns and the timestamp.
Step A: follow PyAlgoTrade doc on GenericBarFeed class
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.16
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.17
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
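For example, once your file has been rewritten to match that header, the GenericBarFeed from Step A can load it directly (a minimal sketch; the instrument name and the reformatted file name are only placeholders):
from pyalgotrade import bar
from pyalgotrade.barfeed import csvfeed

# "FinexBTCDaily_formatted.csv" stands for the original file rewritten with the
# "Date Time, Open, High, Low, Close, Volume, Adj Close" header shown above.
feed = csvfeed.GenericBarFeed(bar.Frequency.DAY)
feed.addBarsFromCSV("BTC", "FinexBTCDaily_formatted.csv")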
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need a bit of sanitizing before it can be used in PyAlgoTrade methods; however, this is doable, and you can create an easy transformer either by hand or with the help of the powerful lambda-based converters facility of numpy.genfromtxt().
This sample code is intended for illustration purposes, so that you can immediately see the power of converters for your own transformations, since every CSV structure differs.
with open( getCsvFileNAME( ... ), "r" ) as aFH:
    numpy.genfromtxt( aFH,
                      skip_header = 1,   # Ref. pyalgotrade
                      delimiter   = ",",
                      # Sample rows the converters below are written for:
                      # 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
                      # 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
                      # 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
                      # 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
                      # 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
                      converters = {
                          # column 0: "YYYY.MM.DD" -> matplotlib date number, as float ( 1.0+ )
                          0: lambda aString: mPlotDATEs.date2num( datetime.datetime.strptime( aString, "%Y.%m.%d" ) ),
                          # column 1: "HH:MM" -> fraction of a day, as float in [ 0.0, 1.0 ), e.g. ( 15*60 + 00 ) / 60. / 24.
                          1: lambda aString: ( int( aString[0:2] ) * 60 + int( aString[3:] ) ) / 60. / 24.,
                      }
                      )
You can use pyalgotrade.barfeed's addBarsFromSequence with a list comprehension to feed in data from a CSV row by row / bar by bar. Basically, you create a bar from each row, pass OHLCV as init parameters, and put the extra columns with additional data in a dictionary. You can try something like this (with all the required imports):
import numpy as np
import pandas as pd
from pyalgotrade.bar import BasicBar, Frequency
from pyalgotrade.barfeed import yahoofeed

data = pd.DataFrame(
    index=pd.date_range(start='2021-11-01', end='2021-11-05'),
    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
    data=np.random.rand(5, 10))

feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce, but the part where the bars are added to the feed should work the same for data frames from CSVs given that the valid column names are used.