From "Strip HTML from strings in Python", I got help with this code:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # Python 3: initialise the parser (this also calls reset())
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
For strip_tags(html), do I pass an HTML file name as the parameter? I have a local HTML file called CID-Sync-0619.html at C:\Python34.
This is my code so far:
Extractfile = open("ExtractFile.txt" , "w")
Extractfile.write(strip_tags(CID-Sync-0619.html))
The entire script is actually really long, but the rest of it is irrelevant to my question. I want to open another file, extract the text from the HTML file, and write that text into the new file. How do I pass the HTML file as a parameter? Any help would be appreciated.
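A minimal sketch of one way to do this, assuming the file is UTF-8 encoded: strip_tags() expects the HTML markup itself as a string, not a file name, so the file has to be opened and read first. The helper name html_file_to_text is hypothetical and only there for illustration.

```python
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    """Collects only the text nodes of the HTML fed to it."""

    def __init__(self):
        super().__init__()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text the parser is still buffering
    return s.get_data()


def html_file_to_text(html_path, text_path, encoding="utf-8"):
    """Hypothetical helper: read html_path, strip the tags, write text_path."""
    with open(html_path, "r", encoding=encoding) as src:
        markup = src.read()  # the file contents, not the file name
    with open(text_path, "w", encoding=encoding) as dst:
        dst.write(strip_tags(markup))


# With the paths from the question (raw string so the backslashes stay literal):
# html_file_to_text(r"C:\Python34\CID-Sync-0619.html", "ExtractFile.txt")
```

The file name must be a quoted string; written bare, CID-Sync-0619.html is parsed by Python as an expression rather than a path.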
I have a view that should generate a temporary JSON file and save it to the database. The content of this file, a dictionary named assets, is built with DRF serializers. The file should be written to the database in a model called CollectionSnapshot.
class CollectionSnapshotCreate(generics.CreateAPIView):
    permission_classes = [MemberPermission, ]
    def create(self, request, *args, **kwargs):
        collection = get_collection(request.data['collection_id'])
        items = Item.objects.filter(collection=collection)
        assets = {
            "collection": CollectionSerializer(collection, many=False).data,
            "items": ItemSerializer(items, many=True).data,
        }
        fp = tempfile.TemporaryFile(mode="w+")
        json.dump(assets, fp)
        fp.flush()
        CollectionSnapshot.objects.create(
            final=False,
            created_by=request.user,
            collection_id=collection.id,
            file=ContentFile(fp.read(), name="assets.json")
        )
        fp.close()
        return JsonResponse({}, status=200)
Printing assets shows the dictionary correctly, so the dictionary itself is fine.
Following the solution in "copy file from one model to another", I do get the file saved to the db, but without any content.
It seems that json.dump(assets, fp) is failing silently, or I am missing a step needed to actually save the content to the temp file before sending it to the database.
The question is: why are the files in the db empty?
I found out that fp.read() returns content starting from the current position in the file. At least, that is my understanding. So after I dump the assets dict as JSON to the temp file, I have to bring the cursor back to the beginning using fp.seek(0). This way, when I call fp.read() inside file=ContentFile(fp.read(), ...), it actually reads all the content. It was giving me an empty file because there was nothing left to read: the cursor was at the end of the file.
fp = tempfile.TemporaryFile(mode="w+")
json.dump(assets, fp)
fp.flush()
fp.seek(0)  # important: rewind before calling fp.read()
# CollectionSnapshot.objects.create(...) stays the same
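The effect of the file position can be reproduced outside Django. The following is a small sketch with a stand-in assets dictionary (made-up data, not the serializer output from the question): reading right after writing yields an empty string, while reading after seek(0) yields the full JSON.

```python
import json
import tempfile

# Stand-in data; in the view this would come from the DRF serializers.
assets = {"collection": {"id": 1}, "items": [{"id": 10}, {"id": 11}]}

with tempfile.TemporaryFile(mode="w+") as fp:
    json.dump(assets, fp)
    fp.flush()
    before_seek = fp.read()  # position is at the end of the file
    fp.seek(0)               # rewind to the beginning
    after_seek = fp.read()   # now the whole document is returned

print(repr(before_seek))
print(json.loads(after_seek) == assets)
```

If the temporary file is not needed for anything else, passing json.dumps(assets) straight to ContentFile might be a simpler option, since it avoids the file position issue entirely.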
I've been trying to figure out how to do this for a while without using a framework (I can make this work with Flask for example) but I haven't found anything as of yet. I have two html scripts and a python cgi script. In essence I have the first html file wherein the user enters a string that I read into my python cgi script which in turn does a number of things to finally give me a bunch of strings and json file that I need to pass to another html file and be able to read them there as well.
So far the first half works. I can open the second HTML file with a redirect, which is not elegant, but nothing else has worked. This is the code:
#!/Users/<username>/opt/anaconda3/bin/python
import pandas as pd
import numpy as np
import cgi
import cgitb
import sys
cgitb.enable()
# Create instance of FieldStorage
form = cgi.FieldStorage()
# Get data from fields
protein_name = form.getvalue('protein_name')
####### function search_results takes in protein_name and gives me the data ###
####### I need to pass to the html file: results.html #########################
if ((search_results(protein_name)!="No protein entered")&(search_results(protein_name)!="No results found")):
    all_vars = search_results(protein_name)
    ##### all_vars is a tuple of strings like gene_name, json files and integers
    print("Content-type: text/html","\n\n")
    print ('''
<head><meta http-equiv="refresh" content="0;URL='http://localhost/results.html'" /></head>
''')
Any suggestions on how to proceed? Any help is appreciated, thanks!
As you have probably discovered, your current technique issues a redirect to 'results.html', but otherwise discards any results. I don't know exactly what your goals are, but one approach would be to treat 'results.html' as a simple template. Your script will populate it and return it in response to each request. In the example below, 'results.html' can contain arbitrary HTML, along with the line '##RESULTS##', which will be replaced with your output.
#!/Users/<username>/opt/anaconda3/bin/python
import sys
import html
import pandas as pd
import numpy as np
import cgi
import cgitb
cgitb.enable()
def process_results(results):
    if results=='No protein entered' or results=='No results found':
        return results
    # else do something with results, e.g., format into an HTML table
    buf = '<table>\n'
    for result in results:
        # html.escape replaces cgi.escape, which was removed in Python 3.8
        buf += f'<tr><td>{html.escape(str(result))}</td></tr>\n'
    buf += '</table>\n'
    return buf
form = cgi.FieldStorage()
protein_name = form.getvalue('protein_name')
# search_results is the function from the question; it is not defined here
results = search_results(protein_name)
print("Content-type: text/html\n")
with open('results.html') as template:
    for line in (x.rstrip() for x in template):
        if line == '##RESULTS##':
            print(process_results(results))
        else:
            print(line)
Good luck.
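The placeholder substitution can be tried on its own, without a web server. This is a minimal sketch, assuming the template is held in a string rather than read from results.html; the names render_results and fill_template and the sample tuple are made up for illustration.

```python
import html

PLACEHOLDER = "##RESULTS##"


def render_results(results):
    """Turn a message string or an iterable of values into HTML."""
    if isinstance(results, str):
        # Status messages such as "No results found" are shown as-is.
        return "<p>" + html.escape(results) + "</p>"
    rows = "".join(
        f"<tr><td>{html.escape(str(r))}</td></tr>\n" for r in results
    )
    return "<table>\n" + rows + "</table>"


def fill_template(template_text, results):
    """Replace the placeholder line with the rendered results."""
    out = []
    for line in template_text.splitlines():
        if line.strip() == PLACEHOLDER:
            out.append(render_results(results))
        else:
            out.append(line)
    return "\n".join(out)


# Stand-in template and data, not the real results.html or search output.
template = "<html><body>\n<h1>Results</h1>\n##RESULTS##\n</body></html>"
page = fill_template(template, ("BRCA1", 42, "<unsafe>"))

print("Content-Type: text/html\n")
print(page)
```

Escaping each value before inserting it keeps characters such as < and & in the data from being interpreted as markup.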
I am using Python3 and Scrapy.
I have a simple spider (shown below) and I want to save response.url and response.text as items. I would like to save the response.text as JSON and open it in Notepad++. Is there any way it can be saved with a nested structure, such as the one that appears in the native HTML of the page?
class Spider1(scrapy.Spider):
    name = "Spider1"
    allowed_domains = []
    start_urls = ['http://www.uam.es/']
    def parse(self, response):
        items = Spider1Item()
        items['url'] = response.url
        items['body'] = response.text
        yield items
        pass
EDIT:
Here is a snippet of the target structure I would like when I export to a JSON.
HTML snippet of target structure
I'm uploading a json file via flask, but I'm having trouble actually reading what is in the file.
# named fJson b/c of other json imports
from flask import json as fJson
@app.route('/upload', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        file = request.files['file']
        # data = fJson.load(file)
        # myfile = file.read()
I'm trying to deal with this by using the 'file' variable. I looked at http://flask.pocoo.org/docs/0.10/api/#flask.json.load, but I get the error "No JSON object could be decoded". I also looked at "Read file data without saving it in Flask", which recommended using file.read(), but that didn't work either: it returns "None" or "".
Request.files
A MultiDict with files uploaded as part of a POST or PUT request. Each file is stored as FileStorage object. It basically behaves like a standard file object you know from Python, with the difference that it also has a save() function that can store the file on the filesystem.
http://flask.pocoo.org/docs/0.10/api/#flask.Request.files
You don't need to use json; just use read(), like this:
if request.method == 'POST':
    file = request.files['file']
    myfile = file.read()
For some reason the position in the file was at the end. Doing
file.seek(0)
before doing a read or load fixes the problem.
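The behavior described above can be reproduced without Flask. This is a sketch that uses io.BytesIO as a stand-in for request.files['file'], on the assumption that the uploaded file behaves like an ordinary binary file object (as the quoted documentation suggests); the JSON payload is made up.

```python
import io
import json

# Stand-in for the uploaded file; the payload is invented for the example.
upload = io.BytesIO(b'{"name": "example", "values": [1, 2, 3]}')

upload.read()            # something has already consumed the stream
at_end = upload.read()   # nothing left to read from the current position

upload.seek(0)           # rewind to the start
data = json.load(upload) # now the parser sees the whole document

print(at_end)
print(data)
```

An empty read like at_end is also what makes a JSON parser complain that there is nothing to decode.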
Say I have a function that reads a .txt file and creates arrays based on the columns of the data within that file. What I have right now inside the function looks like:
data = open("some_file_name.txt","r")
But if I want to change the .txt file that the function reads I have to manually go into the code and type in the new file name before running it again. Instead, how can I pass any file name to the function so it looks like:
my_function(/filepath/some_file_name.txt):
data = open("specified_file_name.txt","r")
I think you want
def my_function(filepath):
    data = open(filepath, "r")
    ...
and then
my_function("/filepath/some_file_name.txt")
or better:
def my_function(data):
    ...
and then
with open("/filepath/some_file_name.txt", "rb") as data:
    my_function(data)
The latter version lets you pass in any file-like object to my_function().
Update: if you want to get fancy and allow file names or file handles:
def my_func(data):
    if isinstance(data, str):  # use basestring instead of str on Python 2
        with open(data, 'rb') as f:
            return my_func(f)
    ...
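For a runnable illustration of the same pattern, here is a sketch for Python 3 that assumes the file holds whitespace-separated numeric columns. That layout, the function name read_columns, and the sample data are assumptions, since the question does not show the file format.

```python
import os
import tempfile


def read_columns(source):
    """Accept a path or an open text file; return each column as a list of floats."""
    if isinstance(source, (str, os.PathLike)):
        # A path was given: open it and recurse with the file object.
        with open(source, "r") as f:
            return read_columns(f)
    rows = [[float(x) for x in line.split()] for line in source if line.strip()]
    return [list(col) for col in zip(*rows)]


# Write a small sample file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2\n3 4\n5 6\n")

columns_from_path = read_columns(tmp.name)      # pass a file name
with open(tmp.name) as fh:
    columns_from_handle = read_columns(fh)      # or pass an open file

os.remove(tmp.name)
print(columns_from_path)
```

Switching to a different input file then only means changing the argument, not the function body.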