python-docx (version 0.8.11) Inline Picture - python-docx

I am using Windows 10 and python-docx (version 0.8.11).
1) How do I add_picture and apply the text wrap format "square"?
2) Can we insert a floating picture?
I tried this:
from docx import Document
document = Document()
# Add a picture to the document
picture = document.add_picture('picture1.png')
# Set the text wrap type to 'square'
picture.text_wrap = True
# Save the document
document.save('document06Jan.docx')
No errors, but the picture's wrap format is still "In Line with Text".
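For what it's worth (an observation, not an official feature): add_picture() returns a docx.shape.InlineShape, and as far as I know python-docx 0.8.11 has no public text-wrap or floating-picture API, so the text_wrap assignment above only sets an ordinary Python attribute on the wrapper object and never changes the document XML. A minimal sketch to confirm this, assuming picture1.png exists locally:
from docx import Document

document = Document()
picture = document.add_picture('picture1.png')

# python-docx 0.8.11 has no text-wrap / floating-picture API, so this line
# only attaches an ordinary attribute to the InlineShape wrapper object; it
# never touches the document XML.
picture.text_wrap = True

# Inspect the underlying <wp:inline> element (a private attribute): it is
# unchanged, which is why Word still shows "In Line with Text".
print(picture._inline.xml)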

How to hyperlink text files into an HTML file using R?

I have an HTML output file and there is a column named "Description" in this file. I want to link locally saved text files to some of the entries of this column when the value is "Report data does not match".
In the snapshot of the HTML file, there are dedicated text files for rows 12, 16, 17, 18, 19 and 20, which I want to link to the Description column.
The lines of code generating the HTML file are:
library(xtable)
extract1 <- result[,list(TestCaseID, breadcrumb, Discription),]
print(xtable(extract1), type = "html", file = "extracted.html")
How do I link the text files? Please let me know if any modification to the question is required. Thanks in advance!
I recommend that you do some pre-processing according to your requirements. Because the names of the text files may change later, they should be provided in a separate column.
If a text file link is not required for some rows, think about conditional handling of NAs later.
The sample below is based on a main list; the text files reside in a subfolder.
The trick is to wrap the description in an HTML href tag and to use sanitize.text.function, as shown below for your test cases.
You'll need to create some dummy text files like gauge-D00.txt, gauge-D01.txt, etc. in a subfolder to try the example.
# --------------------------------------------------------
# gauge main ID list
#---------------------------------------------------------
# ID,location,description,textfile
# D00,nature reserve,Otternhagener Moor,../gauge-D00.txt
# D01,nature reserve,Helstorfer Moor,../gauge-D01.txt
# FER,benchmark,Negenborner Weg,../gauge-FER.txt
#----------------------------------------------------------
# text files reside in /data-develop-text-file-link/
# ---------------------------------------------------------
library (xtable)
gaugelist <- structure(list(
  ID = structure(1:3, .Label = c("D00", "D01", "FER"), class = "factor"),
  location = structure(c(2L, 2L, 1L), .Label = c("benchmark", "nature reserve"), class = "factor"),
  description = structure(c(3L, 1L, 2L), .Label = c("Helstorf", "Negenborn", "Otternhagen"), class = "factor"),
  textfile = structure(c(2L, 3L, 1L), .Label = c("../gauge-FER.txt", "../gauge-D00.txt", "../gauge-D01.txt"), class = "factor")),
  class = "data.frame", row.names = c(NA, -3L))
head(gaugelist)
# set HTML tag for linking to local file --------------------------------------------
gaugelist$description <- paste0("<a href='", gaugelist$textfile, "'>", gaugelist$description, "</a>")
head(gaugelist)
# remove textfile column from data.frame --------------
gaugelist$textfile <- NULL
head(gaugelist)
# print HTML table and sanitize by using your own function (add subfolder) ---------------------------------------
print(xtable(gaugelist), type = "html",
      sanitize.text.function = function(str) gsub("..", "./data-develop-text-file-link", str, fixed = TRUE),
      file = "gauge-list.html")
Edit:
It is slightly better to reference the subfolder relative to the current directory, i.e. ./data-develop-text-file-link with a leading ./. I edited the gsub call accordingly, but it makes no functional difference.
The structure of the HTML and text files described in my answer above, and only hinted at as an example, is based on a website layout: the HTML table sits at some root node and the text files are in a directory below it. That leaves the option of later uploading everything to a server or keeping it locally on the PC.
That's why I worked with relative links, which work for me in all browsers.
Please note that absolute paths to text files seem to be a problem in Microsoft Edge and Internet Explorer. As a test: copy the link with the right mouse button, paste it into Edge's address bar, and the text file will open. I couldn't find any problems with Firefox and Chrome when testing with e.g. C:\Users\%USERNAME%\Documents or D:\_working\. For example:
# print HTML table and sanitize by using your own function (add subfolder) ---------------------------------------
print(xtable(gaugelist), type = "html",
      sanitize.text.function = function(str) gsub("..", "file:///C:/Users/webma/Documents/data-develop-text-file-link", str, fixed = TRUE),
      file = "gauge-list.html")

Understanding DetectedBreak in google OCR full text annotations

I am trying to convert the full-text annotations of a Google Vision OCR result to line level and word level; the result is organized in a Block, Paragraph, Word and Symbol hierarchy.
However, when assembling symbols into word text and words into line text, I need to understand the DetectedBreak property.
I went through this documentation, but I did not understand a few of the break types.
Can somebody explain what the following breaks mean? I only understood LINE_BREAK and SPACE.
EOL_SURE_SPACE
HYPHEN
LINE_BREAK
SPACE
SURE_SPACE
UNKNOWN
Can they be replaced by either a newline character or a space?
The link you provided has the most detailed explanation available of what each of these stands for. I suppose the best way to get a better understanding is to run OCR on different images and compare the response with what you see on the corresponding image. The following Python script runs DOCUMENT_TEXT_DETECTION on an image saved in GCS and prints all detected breaks except for the ones you have no trouble understanding (LINE_BREAK and SPACE), along with the word immediately preceding them, to enable comparison.
import sys
import os

from google.cloud import storage
from google.cloud import vision


def detect_breaks(gcs_image):
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/json'
    client = vision.ImageAnnotatorClient()
    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
    image_source = vision.types.ImageSource(
        image_uri=gcs_image)
    image = vision.types.Image(
        source=image_source)
    request = vision.types.AnnotateImageRequest(
        features=[feature], image=image)
    annotation = client.annotate_image(request).full_text_annotation
    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    word_text = ""
    for page in annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    for symbol in word.symbols:
                        word_text += symbol.text
                        if symbol.property.detected_break.type:
                            if symbol.property.detected_break.type == breaks.SPACE or symbol.property.detected_break.type == breaks.LINE_BREAK:
                                word_text = ""
                            else:
                                print(word_text, symbol.property.detected_break)
                                word_text = ""


if __name__ == '__main__':
    detect_breaks(sys.argv[1])
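As for replacing them with a newline or a space: based on the descriptions in that documentation, a reasonable (though unofficial) convention when reconstructing plain text is to treat SPACE and SURE_SPACE as a space, EOL_SURE_SPACE and LINE_BREAK as a newline, HYPHEN as a re-inserted hyphen plus a newline (the hyphen itself is not included in the symbol text), and UNKNOWN as nothing. A small sketch of that idea, reusing the breaks enum from the script above:
def break_to_text(break_type, breaks):
    # breaks is vision.enums.TextAnnotation.DetectedBreak.BreakType, as above.
    if break_type in (breaks.SPACE, breaks.SURE_SPACE):
        return " "
    if break_type in (breaks.EOL_SURE_SPACE, breaks.LINE_BREAK):
        return "\n"
    if break_type == breaks.HYPHEN:
        return "-\n"  # end-of-line hyphen: restore it, then break the line
    return ""  # UNKNOWN or no detected break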

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4 and requests to scrape some information from supremenewyork.com UK. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping this information automatically, so they have decided to scramble the text, which I think makes it unusable via .text.
My question: is there a way to get the text without using .text, and/or is there a way to get the script to read the text so that when it sees a special character like # it skips over it, or when it sees & it skips until it sees ;?
That is basically how this website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they don't use letters to scramble, only numbers and special characters).
This is highlighted in a box automatically when you inspect the element using a VPN on the UK Supreme website, and is different from the plain text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine (but only because the text is not scrambled on my local site, and I want to pull this info from the UK website). Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0

# use a new proxy every so often for testing (will add something that pulls proxies and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + ' --- ' + name + ' --- ' + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So I uncommented the commented-out lines above and commented out the similar lines that use .text, and then I got some kind of str error, so I tried print(str(name)). I am able to print the alt fine (with every script; the alt is not scrambled), but when it comes to printing the name and style, all that is printed is None under every alt.
I have been working on fixing this for days and have come up with no solution. Can anyone help me solve this?
I have solved it myself using this solution:
# soup5 is the BeautifulSoup object for the category listing page
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)
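For completeness, here is a defensive variant of the original inner loop (just a sketch reusing bSoup, requests and BeautifulSoup from the script above; it does not address the scrambling itself): it checks that find() actually returned a tag before touching .text, which is what caused the AttributeError, and skips items whose detail page lacks the itemprop markup.
for item in bSoup.find_all('div', class_='inner-article'):
    alt = item.find('img')['alt']
    req = requests.get('http://www.supremenewyork.com' + item.a['href'])
    item_soup = BeautifulSoup(req.text, 'lxml')
    name_tag = item_soup.find('h1', itemprop='name')
    style_tag = item_soup.find('p', itemprop='model')
    if name_tag is None or style_tag is None:
        # The detail page did not contain the expected markup; skip it
        # instead of crashing on None.text.
        print(alt, '--- (name/style not found)')
        continue
    print(alt, '---', name_tag.text, '---', style_tag.text)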

How can I display a TemporaryUploadedFile from Django in HTML as an image?

In Django, I have programmed a form in which you can upload one image. After uploading, the image is passed to another method as a TemporaryUploadedFile; after that method has run, the result is handed to the HTML page.
What I would like to do is display that TemporaryUploadedFile as an image in HTML. It sounds quite simple to me, but I could not find an answer on Stack Overflow or Google to the question of how to display a TemporaryUploadedFile in HTML without having to save it first, hence my question.
All help is appreciated.
Edit 1:
To give some more information about the code and the variables while debugging:
input_image = next(iter(request.FILES.values()))
output_b64 = (input_image.content_type, str(base64.b64encode(input_image.read()), 'utf8'))
Well, you can encode the image to base64 and use a data URL as the value of src.
A base64 data URL looks like this:
<img src="data:image/png;base64,SGLAFdsfsafsf098sflf">
               \_______/        \__________________/
                   |                      |
               File type        base64 encoded data
Read the Mozilla docs for more on data URLs.
Here's some relevant code:
import base64

from django.shortcuts import render


def my_view(request):
    # assuming `image` is a TemporaryUploadedFile object
    image_b64 = base64.b64encode(image.read())
    image_b64 = image_b64.decode('utf8')  # convert bytes to string
    image_type = image.content_type  # png or jpeg or something else
    return render(request, 'template', {'image_b64': image_b64, 'image_type': image_type})
Then in your template:
<img src="data:{{ image_type }};base64,{{ image_b64 }}">
I want to thank xyres for pushing me in the right direction. As you can see, I used some parts of his solution in the code below:
import base64
from io import BytesIO

from PIL import Image as pil_image

# As input I take the one image from the form.
temp_uploaded_file = next(iter(request.FILES.values()))
# The TemporaryUploadedFile is converted to a Pillow Image.
input_image = pil_image.open(temp_uploaded_file)
# The input image does not have a name, so I set it afterwards. (This step, of course, is not mandatory.)
input_image.filename = temp_uploaded_file.name
# The image is saved to an in-memory BytesIO file.
output = BytesIO()
input_image.save(output, format=input_image.format)
# Then the in-memory file is encoded.
img_data = str(base64.b64encode(output.getvalue()), 'utf8')
output_b64 = ('image/' + input_image.format, img_data)
# Pass it to the template.
return render(request, 'visualsearch/similarity_output.html', {
    "output_image": output_b64
})
In the template:
<img id="output_image" src="data:{{ image.0 }};base64,{{ image.1 }}">
The current solution works, but I don't think it is perfect: I expect it can be done with less code and faster, so if you know how this can be done better, you are welcome to post your answer here.
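For what it's worth, a shorter variant along the lines of the Edit 1 snippet above would skip the Pillow round-trip and reuse the browser-supplied content type directly; a minimal sketch (the view name is hypothetical, the template name is taken from the code above):
import base64

from django.shortcuts import render


def image_preview(request):  # hypothetical view name
    # Take the single uploaded image from the form, as in Edit 1 above.
    uploaded = next(iter(request.FILES.values()))
    # Browser-supplied MIME type plus the base64-encoded raw bytes.
    output_b64 = (uploaded.content_type, base64.b64encode(uploaded.read()).decode('utf8'))
    return render(request, 'visualsearch/similarity_output.html', {"output_image": output_b64})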

How to set font size and font family in a document using docx4j

FileReader fr = new FileReader("E://HtmlToDoc//LETTER.html");
BufferedReader br = new BufferedReader(fr);
while ((s = br.readLine()) != null) {
    html = html + s;
}
html = "<html><body>" + html.substring(html.indexOf("<body>"));
/************************ Setting Page Size **********************************/
Docx4jProperties.getProperties().setProperty("docx4j.PageSize", "B4JIS");
String papersize= Docx4jProperties.getProperties().getProperty("docx4j.PageSize", "B4JIS");
String landscapeString = Docx4jProperties.getProperties().getProperty("docx4j.PageOrientationLandscape", "true");
boolean landscape= Boolean.parseBoolean(landscapeString);
WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.createPackage(PageSizePaper.valueOf(papersize), landscape);
AlternativeFormatInputPart afiPart = new AlternativeFormatInputPart(new PartName("/hw.html"));
afiPart.setBinaryData(html.getBytes());
//afiPart.setBinaryData(fileContent);
afiPart.setContentType(new ContentType("text/html"));
Relationship altChunkRel = wordMLPackage.getMainDocumentPart().addTargetPart(afiPart);
// .. the bit in document body
CTAltChunk ac = Context.getWmlObjectFactory().createCTAltChunk();
ac.setId(altChunkRel.getId() );
wordMLPackage.getMainDocumentPart().addObject(ac);
// .. content type
wordMLPackage.getContentTypeManager().addDefaultContentType("html", "text/html");
wordMLPackage.save(new java.io.File("E://HtmlToDoc//" + "test.docx"));
This is my code, which converts HTML to a Word document. How do I set the font size and font family for this Word document?
You're pulling HTML content into a Word document as a collection of AltChunk instances, which means the HTML exists in the docx 'bundle' as a separate file, not as part of the flow of the actual Word document.
If you want to manipulate the imported content as native MS Word content, you need to import the source XHTML instead. That way docx4j takes the mark-up and (some) related styles and converts them into the various constituent parts of a docx file (for example table, text, run and paragraph elements). Once you have content imported that way, you can style it as you would any other docx entities.