I want to read the header and footer text from a docx file in Python. I am using the python-docx module.
I found this documentation - http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
But I do not think it has been implemented yet. I also see that there is a "feature-headers" branch on GitHub for python-docx - https://github.com/danmilon/python-docx/tree/feature-headers
It seems this feature never made it into the master branch. Has anyone used it? Can you help me figure out how to use it?
Thank you very much.
There is a better solution to this problem:
Method used to extract: the MS Word document's underlying XML.
Just unzip the Word document with the zipfile module. That gives you access to the document's XML parts, and you can then pull the text out with simple XML node extraction.
The following working code extracts the header, footer, and body text from a docx file.
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'

def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    # Part names vary between documents (header1.xml, header2.xml, ...);
    # adjust this list to match the parts actually present in your file.
    contentToRead = ["header2.xml", "document.xml", "footer2.xml"]
    paragraphs = []
    for xmlfile in contentToRead:
        xml_content = document.read('word/{}'.format(xmlfile))
        tree = XML(xml_content)
        # iter() replaces getiterator(), which is deprecated and removed in Python 3.9+
        for paragraph in tree.iter(PARA):
            texts = [node.text
                     for node in paragraph.iter(TEXT)
                     if node.text]
            if texts:
                textData = ''.join(texts)
                if xmlfile == "footer2.xml":
                    extractedTxt = "Footer : " + textData
                elif xmlfile == "header2.xml":
                    extractedTxt = "Header : " + textData
                else:
                    extractedTxt = textData
                paragraphs.append(extractedTxt)
    document.close()
    return '\n\n'.join(paragraphs)

print(get_docx_text("E:\\path_to.docx"))
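Note that the header and footer part names are not fixed: depending on how the document was created you may see header1.xml, header2.xml, header3.xml, and so on. As a small sketch (using only the standard zipfile API, not python-docx), you can list the parts actually present instead of hardcoding them:

import zipfile

def list_header_footer_parts(path):
    # Return the header/footer XML part names present in the package,
    # e.g. ['word/header1.xml', 'word/footer1.xml'].
    with zipfile.ZipFile(path) as document:
        return [name for name in document.namelist()
                if name.startswith('word/header') or name.startswith('word/footer')]

# Strip the leading 'word/' before feeding these names into contentToRead above.
print(list_header_footer_parts("E:\\path_to.docx"))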
I'm still green at programming and trying to adjust and learn Python, but I am struggling with reading a CSV file and using the content of the file as a value/property.
I have looked and googled to death, and all the solutions put out
['Deon:app2018:value:1538402685271'] or a vertical result.
Example:
session = file content of the csv
Here is the closest I got.
Code:
import urllib.request
import csv

with open('F:\test\session\main\data\credentials\session_id.csv','r') as file:
    session_ID = csv.reader(file)
    for row in session_ID:
        session = "".join(row)
        print(session)

url = 'http://webrates.app.com/rates/connect.html?id=' + session
print(url)
What I get:
Deon:app2018:value:1538402685271
http://webrates.app.com/rates/connect.html?id=
What I want:
Deon:app2018:value:1538402685271
http://webrates.app.com/rates/connect.html?id=Deon:app2018:value:1538402685271
Kind Regards
Deon
After lots of trial and error, this solved my problem:

import urllib.request  # only needed if you go on to request the URL

# use a raw string so the backslashes in the Windows path are not treated as escapes
with open(r'F:\test\session\main\data\credentials\session_id.csv', 'r') as file:
    id = file.read().strip()  # strip the trailing newline so it does not end up in the URL

url = 'http://webrates.app.com/rates/connect.html?id=' + id
print(url)
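If you would rather keep the csv module (for example, if the file could ever hold more than one id), a rough sketch along these lines should also work. The path and the assumption that each row holds one id come from the question; urllib.parse.quote is only there to keep the id safe inside a query string:

import csv
import urllib.parse

with open(r'F:\test\session\main\data\credentials\session_id.csv', newline='') as file:
    for row in csv.reader(file):
        if not row:
            continue  # skip blank trailing lines, which would otherwise give an empty id
        session = "".join(row)
        # quote() escapes characters that are not safe in a query string; ':' is left as-is
        url = 'http://webrates.app.com/rates/connect.html?id=' + urllib.parse.quote(session, safe=':')
        print(url)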
I am trying to adapt some Python code from an awesome guide for dark web scanning/graph creation.
I have thousands of JSON files created with Onionscan, and I have this code that should wrap everything into a Gephi graph. Unfortunately, the code is old: the JSON files are now formatted differently, and it does not work anymore.
code (partial):
import glob
import json
import networkx
import shodan
file_list = glob.glob("C:\\test\\*.json")
graph = networkx.DiGraph()

for json_file in file_list:
    with open(json_file, "rb") as fd:
        scan_result = json.load(fd)

    edges = []

    if scan_result['linkedOnions'] is not None:
        edges.extend(scan_result['linkedOnions'])
In fact, at this point I get "KeyError", because linkedOnions is one-level nested like this:
"identifierReport": {
"privateKeyDetected": false,
"foundApacheModStatus": false,
"serverVersion": "",
"relatedOnionServices": null,
"relatedOnionDomains": null,
"linkedOnions": [many urls here]
Could you please help me fix the code above?
I would be VERY grateful :)
Lorenzo
This is the correct way to read the nested JSON:

if scan_result['identifierReport']['linkedOnions'] is not None:
    edges.extend(scan_result['identifierReport']['linkedOnions'])
Try this; it will work for you as long as your JSON file is in the correct format:

try:
    scan_result = json.load(fd)
    edges = []
    if scan_result['identifierReport']['linkedOnions'] is not None:
        edges.extend(scan_result['identifierReport']['linkedOnions'])
except Exception as e:
    # print your message or log it here
    print(e)
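As a small variation (not from the original guide), dict.get() avoids both the KeyError and the broad try/except when a report is missing the key entirely; a sketch, assuming the same file layout as in the question:

import glob
import json

file_list = glob.glob("C:\\test\\*.json")
edges = []

for json_file in file_list:
    with open(json_file, "rb") as fd:
        scan_result = json.load(fd)

    # .get() returns a default instead of raising KeyError when a key is absent
    report = scan_result.get('identifierReport', {})
    linked = report.get('linkedOnions')

    if linked:  # covers both None and an empty list
        edges.extend(linked)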
I am using a python3.x script to save a string to a text file:
nN = "hello"
f = open("file.txt", "w")
f.write(nN)
f.close()
Now I am trying to save the content of an h2 element from a website (the page scraping itself works fine), and I am getting an error when I try this:
nN = driver.find_element_by_id("title")
f = open("file.txt", "w")
f.write(nN)
f.close()
where the html line is:
<h2 id="title">hello</h2>
The error is:
write() argument must be str, not WebElement
I tried converting nN into a string using the following:
f.write(str(nN))
and the new error is:
invalid syntax
It looks like you are using Selenium and then trying to write the WebElement the webdriver returns straight to a file.
The string conversion does not do what you want because nN is a Selenium WebElement object, not a string. Try simply f.write(nN.text); according to the Selenium documentation, the .text attribute holds the element's visible text, so that should work (see the short sketch below).
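A minimal sketch of that fix, assuming driver is an already-initialised Selenium webdriver and the page contains the <h2 id="title"> element from the question:

nN = driver.find_element_by_id("title")  # returns a WebElement, not a string

with open("file.txt", "w") as f:
    f.write(nN.text)  # .text is the element's visible text, e.g. "hello"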
For the larger issue of parsing HTML, though, I would recommend Beautiful Soup. Install it with pip3 install beautifulsoup4 and import it with from bs4 import BeautifulSoup. Then, as an example:
from bs4 import BeautifulSoup

with open('file.html', 'r') as f:
    htmltext = f.read()  # BeautifulSoup needs a string, not the open file object

# 'lxml' requires the lxml package to be installed; 'html.parser' is a built-in alternative
soup = BeautifulSoup(htmltext, 'lxml')
h2found = soup.find('h2', id="title")
print(h2found)
print(h2found.text)
Beautiful Soup has great documentation and is the de facto standard choice for parsing HTML in Python.
I am trying to read the JSON response from this link, but it's not working. I get the following error:
ValueError: No JSON object could be decoded
Here is the code I've tried:
import urllib2, json
a = urllib2.urlopen('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false&callback=_callbacks_._DElanZU7Xh1K')
data = json.loads(a)
I made these changes:
import requests, json
r=requests.get('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false')
json_data = json.loads(r.text)
print json_data['ruleGroups']['USABILITY']['score']
A quick follow-up question about constructing the image link.
I got this far:
from selenium import webdriver
txt = json_data['screenshot']['data']
txt = str(txt).replace('-','/').replace('_','/')
#then in order to construct the image link i tried : -
image_link = 'data:image/jpeg;base64,'+txt
driver = webdriver.Firefox()
driver.get(image_link)
The problem is that I am not getting the image; also, len(object_original) differs from len(image_link). Could anybody please advise what is missing from my constructed image link? Thank you.
Here is the API link - https://www.google.co.uk/webmasters/tools/mobile-friendly/ (sorry, added it late).
Two corrections need to be made to your code:
The URL needed correcting (as mentioned by Felix Kling here): you have to remove the callback parameter from the GET request you were sending.
Also, if you check the type of the response you were fetching earlier, you'll notice it wasn't a string; it was <type 'instance'>. Since json.loads() expects a string, you would have got another error there. Therefore, use a.read() to fetch the response body as a string.
Hence, this should be your code:
import urllib2, json
a = urllib2.urlopen('https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false')
data = json.loads(a.read())
The answer to your second query (regarding the image) is:

from base64 import decodestring

arr = json_data['screenshot']['data']
# the API returns URL-safe base64, so map it back to the standard alphabet
arr = arr.replace("_", "/")
arr = arr.replace("-", "+")

fh = open("imageToSave.jpeg", "wb")
fh.write(decodestring(arr))  # decode the base64 string into the raw JPEG bytes
fh.close()
Here is the image you were trying to fetch - Link
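If you are on Python 3 (where decodestring is deprecated and strings and bytes are distinct types), a hedged sketch of the same idea would use base64.urlsafe_b64decode, which understands the - and _ characters directly; json_data is assumed to be the parsed response from earlier, and the padding fix-up is only a precaution in case the API returns an unpadded string:

import base64

b64data = json_data['screenshot']['data']
# urlsafe_b64decode handles '-' and '_' natively; pad to a multiple of 4 just in case
b64data += '=' * (-len(b64data) % 4)

with open("imageToSave.jpeg", "wb") as fh:
    fh.write(base64.urlsafe_b64decode(b64data))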
Felix Kling is right about the address, but I also created a variable that holds the URL. You can try this out too, and it should work:
import urllib2, json
url = "https://www.googleapis.com/pagespeedonline/v3beta1/mobileReady?key=AIzaSyDkEX-f1JNLQLC164SZaobALqFv4PHV-kA&screenshot=true&snapshots=true&locale=en_US&url=https://www.economicalinsurance.com/en/&strategy=mobile&filter_third_party_resources=false"
response = urllib2.urlopen(url)
data = json.loads(response.read())
print data
From Strip HTML from strings in Python, I got help with this code:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # initialise HTMLParser itself (needed on Python 3)
        self.reset()
        self.convert_charrefs = True  # convert character references so they arrive via handle_data
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
For strip_tags(html), do I put an HTML file name as the parameter? I have a local HTML file called CID-Sync-0619.html at C:\Python34.
This is my code so far:
Extractfile = open("ExtractFile.txt" , "w")
Extractfile.write(strip_tags(CID-Sync-0619.html))
The entire HTML file is actually really long, but its contents are irrelevant to my question. I want to open another file and write the text extracted from the HTML file into that text file. How do I pass the HTML file as a parameter? Any help would be appreciated.
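strip_tags() expects the HTML as a string, not a file name, so the usual approach is to read the file first and pass its contents in. A minimal sketch, assuming the file really is at C:\Python34\CID-Sync-0619.html as described:

# read the HTML file into a string, then pass that string to strip_tags()
with open(r"C:\Python34\CID-Sync-0619.html", "r", encoding="utf-8") as html_file:
    html_text = html_file.read()

with open("ExtractFile.txt", "w", encoding="utf-8") as extract_file:
    extract_file.write(strip_tags(html_text))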