Python-docx how to identify the type of data in a given run - python-docx

Using python-docx, I need to know which kind of content is embedded in a given run. If len(myrun.text) > 0, I know for sure that myrun has nothing else but text. But when a run contains no text, it can be a ghost run (a former text object whose content has been deleted), an image, or any other object.
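One possible way to approach this is to look at the child elements of the run's underlying `<w:r>` XML rather than at its text. The sketch below is not part of the python-docx API: `classify_run_xml` and the `CHILD_KINDS` table are hypothetical names, the table is illustrative rather than exhaustive, and the function works on an XML string using only the standard library. It returns a set because a single run may hold more than one kind of child.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the children of a <w:r> element.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative (not exhaustive) map from child element name to a coarse label.
CHILD_KINDS = {
    "tab": "tab",
    "br": "break",
    "cr": "break",
    "drawing": "drawing",          # DrawingML content, e.g. an inline picture
    "pict": "vml-picture",         # legacy VML picture
    "object": "embedded-object",
    "fldChar": "field",
    "instrText": "field",
    "footnoteReference": "footnote-reference",
    "endnoteReference": "endnote-reference",
}


def classify_run_xml(run_xml):
    """Return the set of content kinds found among a run's direct children."""
    root = ET.fromstring(run_xml)
    kinds = set()
    for child in root:
        # ElementTree tags look like "{namespace}local".
        namespace, _, local = child.tag.rpartition("}")
        in_w = namespace == "{" + W_NS
        if in_w and local == "rPr":
            continue  # run properties (formatting) are not content
        if in_w and local == "t":
            if child.text:
                kinds.add("text")
            continue
        if in_w and local in CHILD_KINDS:
            kinds.add(CHILD_KINDS[local])
        else:
            kinds.add("other:" + local)
    # A run with no content children at all is what the question calls a ghost run.
    return kinds or {"empty"}
```

With python-docx, the XML string could probably be obtained from the private `_element` attribute, for example `classify_run_xml(myrun._element.xml)`. That relies on internals and is an assumption here, so it may need adjusting for the installed version.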

Related

How to extract text from a Google drawing?

I am given a Google drawing containing an application component architecture. The drawings contain text used to manually populate a fairly long parameter file. The parameter file is then used to create an AWS database instance. I'm hoping to automate this tedious and error prone process by extracting the desired values from the drawing and populating the parameter file.
I'm using Python and just getting started with this effort. I've been able to download the file using the mime type "image/svg+xml" but it appears the text is rendered as a vector drawing. I've also downloaded the file as a PDF, but I still can't seem to get the text.
I'm not a master of Google drawings. From what I've read the drawings are very simple and don't support anything like a tag that one might use to find important data. I suspect I'm barking up the wrong tree.
Is it possible to extract text from a Google drawing? If so, what would be the general process flow and what mime type would I use?

getting data from nutch in plain text format

I am using Apache Nutch to crawl websites. When I use the readseg command to read the content of a segment, I get output in a format like the one below:
Is there any way to get the web data in plain text format?
When I use the readseg command on the parse text, I get output like this:
The readseg command dumps (by default) the raw content fetched from the URLs. This is the entire HTML content transferred. If you want to get the textual content, you need to wait until after the content is parsed. This means that you need to execute an entire crawl cycle (or the ./bin/nutch parse command).
Check the different options of the readseg command (https://wiki.apache.org/nutch/bin/nutch_readseg). If you're already executing the parse step, you probably only care about the parsed content, so you can avoid printing everything else.
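As a rough sketch of "avoid printing everything else", the readseg dump can be told to skip every segment sub-directory except parse_text by passing the corresponding `-no...` general options. The helper names below (`readseg_dump_command`, `dump_parse_text`) and the default `bin/nutch` path are assumptions for illustration; the segment and output paths are whatever your crawl uses.

```python
import subprocess

# General options of `nutch readseg`; each one excludes a segment sub-directory.
# Leaving out -noparsetext keeps only the extracted plain text in the dump.
SKIP_ALL_BUT_PARSE_TEXT = [
    "-nocontent",
    "-nofetch",
    "-nogenerate",
    "-noparse",
    "-noparsedata",
]


def readseg_dump_command(segment_dir, output_dir, nutch="bin/nutch"):
    """Build the argument list for a parse-text-only segment dump."""
    return [nutch, "readseg", "-dump", segment_dir, output_dir,
            *SKIP_ALL_BUT_PARSE_TEXT]


def dump_parse_text(segment_dir, output_dir, nutch="bin/nutch"):
    """Run the dump; requires a Nutch installation and a parsed segment."""
    subprocess.run(readseg_dump_command(segment_dir, output_dir, nutch),
                   check=True)
```

The dump only contains text if the segment has already been through the parse step, as described above.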

Angular 5 : How to integrate html data (which is a formatted text) in a .docx file?

I'm still a bit of a newbie in the code game, and I would like some advice from senpai.
Context:
I'm making an Angular 5 app with a form that also uses QuillJS, a rich text editor, for only one question (the previous questions are simple input fields for strings or numbers). My goal is to let my users download the form, along with the text they entered in QuillJS, as a .docx file (Word). And of course I'm doing this because I want to keep the formatted text from QuillJS; otherwise I would have just taken a good ol' string.
Issue:
The point is, I'm already building a docx file for the first questions of the form, and the only method I have found so far to put my HTML string from QuillJS into a Word-readable data type is to use the html-docx-js library.
This post even explains how. But, BUT, I don't want to use the saveAs function (see the post), which creates a file and puts the content in it. I want to put the content in the docx file I'm already creating.
So here is my question: how would you, senpai, do it?
The thing is that I've got a Blob file (cf. post), but I don't know how to put it in my docx file. I tried to see if the FileReader function could do the job, but well... I don't get how to integrate this special Blob file type (which is application/vnd.openxmlformats-officedocument.wordprocessingml.document) in the docx file.
Maybe there is another way. I'm open to any suggestions, and I don't mind at all changing my way of doing it.
Thank you. Save internet, give me a tip.
The official documentation for html-docx-js does not mention any option other than the asBlob method. I suggest two options:
Decoding the DOCX:
The Blob file type is not special. The blob is just a binary representation of the docx. I found in an SE question that a .docx is in fact a zipped set of XML documents. You could unzip it using JSZip or another JS solution, then read it using FileReader and try to deal with it in a DOM manner. I'm not qualified to go into detail about how that could work.
Adding HTML to the user input first and then outputting it as a whole:
This changes the way you want to do it. Here, I would first create formatted HTML from the data you collected in the other parts of the questionnaire. Then you append the rich data from the rich text editor. Finally, you take this HTML and save it into a single file using the asBlob function.
The second solution may strip some customization from your original approach, but it seems much faster to implement.
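To make the first option a little more concrete, the "a docx is a zip of XML" observation can be checked outside the browser. The sketch below uses Python's standard zipfile module purely as an illustration of the container format (in the Angular app the equivalent step would be done with JSZip, as suggested above). The function names are hypothetical, and the assumption that the main body lives in a part named word/document.xml follows the usual layout of Word documents.

```python
import io
import zipfile


def list_docx_parts(docx_bytes):
    """Return the names of all parts stored inside the .docx container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def read_main_document_xml(docx_bytes):
    """Return the XML of the main document part as a string."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Assumes the conventional part name used by Word documents.
        return archive.read("word/document.xml").decode("utf-8")
```

Listing the parts of a file produced by html-docx-js is a cheap way to see what would actually have to be merged into the other document before committing to this route.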

Method of identifying plaintext files as scripts

I am creating a filter for files coming onto a Unix machine. I only want to allow plain text files that do not look like scripts to pass through.
To check for plain text, I am checking the executable bit of the file and using the -T file test from Perl. (I understand this is not 100% reliable, but it will catch the binary files I most want to avoid.) I think this will be sufficient, but any suggestions are welcome.
My main question is in recognizing when a plain text file is a script. Every script I've ever written has started out with a #! line, so my first thought is to read in the file's first line and block any containing that. Are there common non-script plain text files that start with the #! line that I will flag with a false-positive? Are there better/additional methods of identifying a script?
That's what the file command (see Wikipedia) is for. It recognizes much more than just the shebang (#!), and can tell you what kind of script it is, if any.
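A minimal sketch of both ideas, assuming the filter can be written in Python: a shebang check on the first bytes of the file, plus a thin wrapper that asks the file command for its description. The function names are hypothetical, and `describe_with_file` assumes a `file` binary is on the PATH; what it prints for a given script depends on the installed magic database.

```python
import subprocess


def shebang_interpreter(first_bytes):
    """Return the text after '#!' on the first line, or None if there is none."""
    if not first_bytes.startswith(b"#!"):
        return None
    first_line = first_bytes.split(b"\n", 1)[0]
    return first_line[2:].strip().decode("utf-8", errors="replace")


def looks_like_script(path):
    """True if the file at `path` starts with a shebang line."""
    with open(path, "rb") as handle:
        return shebang_interpreter(handle.read(256)) is not None


def describe_with_file(path):
    """Ask the external `file` command what it thinks the file is."""
    result = subprocess.run(["file", "--brief", path],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

The shebang check alone misses scripts that are run as `sh script.txt` or sourced, which is one reason to combine it with the file command's broader heuristics.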

how to save text in mysql for dynamic site

I am making a site, and in the middle of the screen I want to display some text. I want the owner of the site to be able to change the text dynamically whenever he wants.
I want the owner to save the text in MySQL. The question is which procedure is correct:
let the owner write a doc file -> upload it to the server -> save the doc file inside MySQL with PHP (with the help of load data infile ???)
or
have a textarea in a form -> send the text with the POST method -> save it directly in MySQL (but that way I can't have formatted text)
Sorry if my question is vague, but I don't have any idea how to approach the whole issue.
Any advice would be really helpful.
First off, you generally won't get formatted text by saving, for instance, a Word doc to a database and retrieving it. Word uses a different format than HTML.
If you want formatted text, I would suggest using a WYSIWYG editor that allows a user to input formatted text into a normal form textarea, and saving the resulting text to the SQL server. You can find many of these editors searching for "wysiwyg html editor" on Google.