File corruption when inserting images into an existing Word file - python-docx

The following code, applied to an existing file, works for up to 2 images, but beyond that the resulting file is flagged as corrupted (though Word can recover it perfectly):
import docx
docTemplate = "TestTemplate.docx"
# docx job: add test subsections + images
doc_docx = docx.Document(docTemplate)
#doc_docx = docx.Document()
p = doc_docx.add_paragraph()
wp = p.add_run()
wp.add_picture('image.png')
wp.add_break()
wp.add_picture('image.png')
wp.add_break()
wp.add_picture('image.png')
doc_docx.save('TestFile2.docx')
The content of doc_docx.part.blob is available on pastebin

It turns out the docTemplate document already contained some drawing objects. Thanks to the answer from #Tores76 on the python-docx GitHub, I was able to solve the "file corruption" issue, which was likely due to a duplicate docPr id:
# fix ids: shift every docPr id into a high range so none collide
from docx.oxml.ns import qn

doc_element = doc_docx._part._element
docPrs = doc_element.findall('.//' + qn('wp:docPr'))
for docPr in docPrs:
    docPr.set('id', str(int(docPr.get('id')) + 100000))
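If a fixed offset could still collide with ids already present in the template, the same idea works by renumbering every docPr sequentially; a minimal sketch, not from the original answer:
# renumber all docPr elements 1..n so every id is unique
for new_id, docPr in enumerate(docPrs, start=1):
    docPr.set('id', str(new_id))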

Related

How can I update a csv file through code that creates a folder holding that csv file, without facing FileExistsError?

I have written code that creates a folder to hold the code's own output as a csv file. But when I amend the code to modify the output in the csv file, I don't want to run into FileExistsError. Is there any way I can do that? Sorry if the query is a foolish one, as I am just beginning to learn Python. Here's my code:
from pathlib import Path
import csv

path = Path(r'file\location\DummyFolder')
path.mkdir(parents=True)
fpath = (path / 'example').with_suffix('.csv')
colours = ['red', 'blue', 'green', 'yellow']
colour_count = 0
with fpath.open(mode='w', newline='') as csvfile:
    fieldnames = ['number', 'colour']
    thewriter = csv.DictWriter(csvfile, fieldnames=fieldnames)
    thewriter.writeheader()
    for colour in colours:
        colour_count += 1
        thewriter.writerow({'number': colour_count, 'colour': colour})
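For what it's worth, the FileExistsError on re-runs can be avoided by telling mkdir to tolerate an existing folder; a one-line sketch against the code above:
# exist_ok=True makes mkdir a no-op when the folder already exists,
# so re-running the script no longer raises FileExistsError
path.mkdir(parents=True, exist_ok=True)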

Create a footer in a Word docx by creating a footer.xml file in the word/ folder of the docx zip via Python?

I have no idea how to do this, considering the (randomly?) generated rsids in the XML code. Does anyone have a solution?
from docx.text.run import Run
from docx import Document

doc = Document('/Users/cezi/Desktop/ME.docx')
p = doc.sections[0].footer.paragraphs[0]
for run in p.runs:
    if ' ' in run.text:
        # insert a new run before this one
        new_run_element = p._element._new_r()
        run._element.addprevious(new_run_element)
        new_run = Run(new_run_element, run._parent)
        new_run.text = "left"
        new_run.add_tab()
        new_run.add_text("Page")
p.add_run().add_tab()
p.add_run("right")
doc.save("HOW.docx")

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4 and requests to scrape some information from the UK version of supremenewyork.com. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping its information automatically, so it scrambles the text, which I think makes .text unusable.
My question: is there a way to get the text without using .text, and/or a way to make the script skip a special character like # or, when it sees &, skip until it sees ;? That is basically how this website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they scramble only with numbers and special keys, not letters).
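As an aside, if those numbers and special keys are HTML character references (e.g. &#116; for "t"), they can be decoded rather than skipped; a minimal sketch using Python's built-in html module, with a hypothetical sample string:
import html

# hypothetical entity-scrambled string as it might appear in the page source
scrambled = 'supreme &#116;-shirt'
print(html.unescape(scrambled))  # -> 'supreme t-shirt'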
This scrambled text is highlighted in a box automatically when you inspect the element using a VPN on the UK supreme website, and it differs from the plain text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine, but only because the code isn't scrambled on my local site, and I want to pull this info from the UK website. Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0
# use a new proxy every so often for testing (will add something that pulls proxies and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}
for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + ' --- ' + name + ' --- ' + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So I uncommented the lines that are commented out above and commented out the similar ones, and got some kind of str error, which is why I tried print(str(name)). I can print alt fine on every run (the alt is never scrambled), but when it comes to printing the name and style, all that gets printed is None under every alt.
I have been working on fixing this for days and have come up with no solutions. Can anyone help me solve this?
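As an aside, the AttributeError itself just means find() returned no match; guarding the lookup makes that explicit (a sketch, not the eventual fix):
# find() returns None when nothing matches, so check before taking .text
name_tag = item_soup.find('h1', itemprop='name')
name = name_tag.text if name_tag is not None else '(not found)'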
I have since solved it myself using this solution:
# soup5 is the BeautifulSoup object for the category page (bSoup above)
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)

Search files recursively using the Google Drive REST API

I am trying to grab all the files created under a parent directory. The parent directory has many subdirectories, with files inside those directories.
parent
--- sub folder1
    --- file1
    --- file2
Currently I am grabbing all the ids of the subfolders and constructing a query such as q: 'subfolder1id' in parents or 'subfolder2id' in parents to find the list of files, then issuing these queries in batches. If I have 100 folders and a batch size of 10, I issue 10 search queries.
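In other words, something along these lines (the folder ids are hypothetical):
subfolder_ids = ['subfolder1id', 'subfolder2id']  # hypothetical ids
q = ' or '.join("'%s' in parents" % fid for fid in subfolder_ids)
# -> "'subfolder1id' in parents or 'subfolder2id' in parents"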
Is there a better way of querying the files using google drive rest api that will get me all the files with one query?
Here is an answer to your question, using the same idea as your scenario:
folderA
--- folderA1
    --- folderA1a
--- folderA2
    --- folderA2a
    --- folderA2b
There are 3 alternative approaches that I think you can get an idea from.
Alternative 1. Recursion
The temptation would be to list the children of folderA, and for any children that are folders, recursively list their children, rinse, repeat. In a very small number of cases this might be the best approach, but for most it has the following problem: it is woefully time-consuming to do a server round trip for each subfolder. This does of course depend on the size of your tree, so if you can guarantee that your tree is small, it could be OK.
Alternative 2. The common parent
This works best if all of the files are being created by your app (i.e. you are using the drive.file scope). As well as the folder hierarchy above, create a dummy parent folder called, say, "MyAppCommonParent". As you create each file as a child of its particular folder, you also make it a child of MyAppCommonParent. This becomes a lot more intuitive if you remember to think of folders as labels. You can now easily retrieve all descendants by simply querying MyAppCommonParent in parents.
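With that layout a single files.list call returns everything; a sketch, assuming service is an authorised Drive v3 client as in the code further below, with the folder ID as a placeholder:
# one query returns every descendant, since they all share the common parent
results = service.files().list(
    q="'COMMON_PARENT_ID' in parents and trashed = false",
    fields="nextPageToken, files(id, name)",
).execute()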
Alternative 3. Folders first
Start by getting all folders. Yep, all of them. Once you have them all in memory, you can crawl through their parents properties and build your tree structure and list of folder IDs. You can then do a single files.list?q='folderA' in parents or 'folderA1' in parents or 'folderA1a' in parents.... Using this technique you can get everything in two HTTP calls.
Alternative 2 is the most efficient, but only works if you have control of file creation. Alternative 3 is generally more efficient than Alternative 1, but there may be certain small tree sizes where 1 is best.
from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials

scope = ['https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('your JSON credentials', scope)
service = build('drive', 'v3', credentials=credentials)

# id and display name of the folder you want to start your search from
folder_id = 'ID OF THE FOLDER YOU WANT TO START YOUR SEARCH'
folder_tree = "NAME OF THE FOLDER YOU WANT TO START YOUR SEARCH"
folder_ids = {}
folder_ids[folder_tree] = folder_id

def check_for_subfolders(folder_id):
    new_sub_patterns = {}
    folders = service.files().list(
        q="mimeType='application/vnd.google-apps.folder' and parents in '" + folder_id + "' and trashed = false",
        fields="nextPageToken, files(id, name)", pageSize=400).execute()
    all_folders = folders.get('files', [])
    old_folder_tree = folder_tree
    if all_folders:
        for folder in all_folders:
            folder_name = folder['name']
            subfolder_pattern = old_folder_tree + '/' + folder_name
            new_sub_patterns[subfolder_pattern] = folder['id']
            print('New Pattern:', subfolder_pattern)
            all_files = check_for_files(folder['id'])
            if all_files:
                for file in all_files:
                    file_name = file['name']
                    new_file_tree_pattern = subfolder_pattern + "/" + file_name
                    new_sub_patterns[new_file_tree_pattern] = file['id']
                    print("Files added :", file_name)
            else:
                print('No Files Found')
    else:
        # no subfolders: record the files directly under this folder
        all_files = check_for_files(folder_id)
        for file in all_files:
            file_name = file['name']
            new_file_tree_pattern = folder_tree + "/" + file_name
            new_sub_patterns[new_file_tree_pattern] = file['id']
            print("Files added :", file_name)
    return new_sub_patterns

def check_for_files(folder_id):
    other_files = service.files().list(
        q="mimeType!='application/vnd.google-apps.folder' and parents in '" + folder_id + "' and trashed = false",
        fields="nextPageToken, files(id, name)", pageSize=400).execute()
    return other_files.get('files', [])

def get_folder_tree(folder_id):
    global folder_tree
    sub_folders = check_for_subfolders(folder_id)
    for i, sub_folder_id in enumerate(sub_folders.values()):
        folder_tree = list(sub_folders.keys())[i]
        print('Current Folder Tree : ', folder_tree)
        folder_ids.update(sub_folders)
        print('****************************************Recursive Search Begins**********************************************')
        try:
            get_folder_tree(sub_folder_id)
        except Exception:
            print('---------------------------------No furtherance----------------------------------------------')
    return folder_ids

folder_ids = get_folder_tree(folder_id)

Storing a BLOB in MySQL Database

I have an image uploaded to a server at a location like: opt/glassfish/domains/domain1/applications/j2ee-modules/SmartbadgePortal/images/2badge.jpg
I am trying to read the contents of the image rather than the image information. I searched a lot and came up with the following:
File uploadedFile = new File(path);
System.out.println("Uploaded File is *** : " + uploadedFile);
item.write(uploadedFile); // item: the uploaded file item (presumably from Commons FileUpload)
Image image = null;
image = ImageIO.read(uploadedFile); // decodes the image into an in-memory BufferedImage
System.out.println("Image Contents is *** : " + image);
However, when I use System.out to print image, I get:
Image Contents is *** : BufferedImage#10d7a81: type = 5 ColorModel: #pixelBits = 24 numComponents = 3 color space = java.awt.color.ICC_ColorSpace#722270 transparency = 1 has alpha = false isAlphaPre = false ByteInterleavedRaster: width = 418 height = 387 #numDataElements 3 dataOff[0] = 2
But this is not what I need. I need the raw contents of the image so I can store them in a BLOB column in MySQL. Please help; I have been trying various methods like ByteArrayInputStream, but every time I only see this info rather than the image bytes themselves.
Although it's not the answer you're looking for, my recommendation is to store the image in your server's file system and save the file name (and maybe the directory) in your DB. Storing an image in a BLOB column is usually not a good idea unless there's a specific reason.