Selenium, Python 3, simple scraping text from Erowid LSD experiences? - html

Based off of an answer on here about a similar thing, I tried to scrape the text of Erowid trip experiences. The URL has a bunch of trip links. I want to click each link and then print the 'report-text-surround' element, which is the trip text.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.erowid.org/experiences/exp.cgi?S1=2&S2=-3&C1=9&Str=')
# I tried to get hrefs by xpath, knowing that each trip link starts with 'exp.php?ID'.
view_links = driver.find_elements_by_xpath("""//*[contains(text(), 'exp.php?ID')]""")
for index, view in enumerate(view_links):
    html = view.get_attribute('innerHTML')
    href = html.split('"')[1]
    view_links[index] = href
# And then visit each href and get the data
for href in view_links:
    driver.get(href)
    # I know this is the element containing the trip text.
    trip_text = driver.find_elements_by_class_name('report-text-surround')
    for trip in trip_text:
        print (trip.text.encode('utf-8'))

So you are pretty close but there are just 2 small mistakes.
trip_text = driver.find_elements_by_class_name('report-text-surround')
for trip in trip_text:
    print (trip.text.encode('utf-8'))
Your driver.find_elements_by_class_name call should be the singular find_element_by_class_name, as only one element on the page has that class. That element has a lot of child elements, but the class ('report-text-surround') appears only once. This means you're going to get all the text at once; you could change this, but you'd have to go through the child elements or get the elements separately.
You can change that entire section to this:
text = (driver.find_element_by_class_name('report-text-surround').text).encode('utf-8')
print(text)
That will give you all of the text in the entire article. An easy way to split this up after would be to split each part of the text by \n\n.
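As a rough illustration of the \n\n split, here is a minimal sketch. It assumes the report text is still a str (that is, taken before any .encode('utf-8') call, which returns bytes in Python 3). The helper name split_paragraphs and the sample string are made up for the example.

```python
def split_paragraphs(report_text: str) -> list[str]:
    """Split scraped text into paragraphs on blank lines, dropping empty pieces."""
    pieces = report_text.split("\n\n")
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical stand-in for driver.find_element_by_class_name(...).text
sample = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
for number, paragraph in enumerate(split_paragraphs(sample), start=1):
    print(number, repr(paragraph))
```

Stripping each piece and skipping empty ones keeps runs of three or more newlines from producing blank entries.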

Docs Inserted Image always before all text

I'm making a simple Apps Script that puts images and text into a Google Doc, separated into 2 columns. For whatever reason, no matter how I try it, the images always end up above the text in the Doc, even though they should be layered with the text (inline).
//Replace QR Code
let qrText = editLocalBody.findText("{{qrCode}}");
let setImagePlace = qrText.getElement().asText().replaceText("{{qrCode}}", "");
let qrCodeImage = setImagePlace.getParent().asParagraph().insertInlineImage(0, qrCodeBlob);
From what I've seen, this should insert an image wherever the text was previously located, but when it runs the image is always in the wrong spot, somehow above the text it was supposed to be in!
//Edit - To Show The Progression Of What Is Supposed To Happen And What Actually Happens:
I'm making QR code badges for a proprietary system that is tightly integrated with Google, so I'm using Apps Script to get an entry from a Google Form containing a number of badges (with relevant data) and autofill a Google Doc accordingly.
// Loop Start
I fill my template with a text line that has keywords in it that I can select and replace later, plus a keyword it can use to insert another such line (this part works).
I first find the QR code keyword (findText("{{qrCode}}");) and replace it (.replaceText) with nothing ("").
I then get the parent of the element from the step above, which is a block of text, as a paragraph, and insert an inline image at child index 0 (insertInlineImage(0, qrCodeBlob)). I think that block is all the text in the Doc, and I think this is where the issue lies: the image goes above the text because everything is one 'paragraph' rather than multiple 'bodies' of text. If I could separate this, I think it would work!
I've debugged this script quite a bit, so I know it's that final line, the one inserting the image, that fails: it sees all the text as 'one'.
// I want this (In Descending Order, each on its own full line):
Image
Text
Image
Text
//What It Gives Me (In Descending Order, each on its own full line):
Image
Image
Text
Text
let qrCodeImage = setImagePlace.getParent().asParagraph().insertInlineImage(0, qrCodeBlob);

Selenium, using find_element but end up with half the website

I finished the linked tutorial and tried to modify it to get something else from a different website. I am trying to get the margin table for HHI, but the website is coded in such a strange way that I am quite confused.
I can find the child element of the parent that holds the text with the XPath //a[@name="HHI"]. Its parent is <font size="2"></font> and contains the text I want, but there are a lot of tags that look exactly like <font size="2"></font>, so I can't just use the XPath //font[@size="2"].
Attempting to use the full XPath prints out half of the website's content.
the full xpath:
/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[3]/td/pre/font/table/tbody/tr/td[2]/pre/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font/font
Is there any way to select that particular font tag and print its text?
website:
https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm
Tutorial
https://www.youtube.com/watch?v=PXMJ6FS7llk&t=8740s&ab_channel=freeCodeCamp.org
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pandas as pd
# prepare it to automate
from datetime import datetime
import os
import sys
import csv
application_path = os.path.dirname(sys.executable) # export the result to the same file as the executable
now = datetime.now() # for modify the export name with a date
month_day_year = now.strftime("%m%d%Y") # MMDDYYYY
website = "https://www.hkex.com.hk/eng/market/rm/rm_dcrm/riskdata/margin_hkcc/merte_hkcc.htm"
path = "C:/Users/User/PycharmProjects/Automate with Python – Full Course for Beginners/venv/Scripts/chromedriver.exe"
# headless-mode
options = Options()
options.headless = True
service = Service(executable_path=path)
driver = webdriver.Chrome(service=service, options=options)
driver.get(website)
containers = driver.find_element(by="xpath", value='') # or find_elements
hhi = containers.text # if using find_elements, = containers[0].text
print(hhi)
Update:
Thank you to Conal Tuohy; I learned a few new tricks in XPath. The website is written in such a strange way that even with an XPath that locates the exact font tag, the result still includes all the text in every following tag.
I tried to make a list of the different products with .split("Back to Top"), then took the first item and used .split("\n"). I will keep calling .split() on the lists within the list until the data fits neatly into a dataframe with strike prices as the index and maturity dates as the columns.
Probably not the most efficient way, but it works for now.
product = "HHI"
containers = driver.find_element(by="xpath", value=f'//font[a/@name="{product}"]')
hhi = containers.text.split("Back to Top")
# print(hhi)
hhi1 = hhi[0].split("\n")
df = pd.DataFrame(hhi1)
# print(df)
df.to_csv(f"{product}_{month_day_year}.csv")
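The splitting steps described in the update can be sketched without a browser. This is only an illustration under assumptions: the sample text, its column layout, and the helper name rows_for_first_product are invented for the example, and the real page text will look different.

```python
# Hypothetical text, shaped like what containers.text might return:
# one block per product, each block ending with a "Back to Top" link.
SAMPLE = (
    "HHI\n"
    "Jan-25 6000 100\n"
    "Feb-25 6100 110\n"
    "Back to Top\n"
    "HSI\n"
    "Jan-25 20000 300\n"
    "Back to Top\n"
)


def rows_for_first_product(page_text: str, marker: str = "Back to Top") -> list[list[str]]:
    """Keep the block before the first marker and split it into rows of fields."""
    first_block = page_text.split(marker)[0]
    # Skip blank lines, then split each remaining line on whitespace.
    return [line.split() for line in first_block.split("\n") if line.strip()]


for row in rows_for_first_product(SAMPLE):
    print(row)
```

A list of rows like this can be passed straight to pd.DataFrame, which may be easier to shape than a single column of raw lines.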
You're right that HTML is just awful! But if you're after the text of the table, it seems to me you ought to select the text node that follows the B element that follows the a[@name="HHI"]; something like this:
//a[@name="HHI"]/following-sibling::b/following-sibling::text()[1]
EDIT
Of course that XPath won't work in Selenium because it identifies a text node rather than an element. So your best result is to return the font element that directly contains the //a[@name="HHI"], which will include some cruft (the Back to Top link, etc) but which will at least contain the tabular data you want:
//a[@name="HHI"]/parent::font
i.e. "the parent font element of the a element whose name attribute equals HHI"
or equivalently:
//font[a/@name="HHI"]
i.e. "the font element which has, among its child a elements, one whose name attribute equals HHI"
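To try the "parent of the named anchor" idea without a browser, here is a small sketch using Python's standard xml.etree.ElementTree. Two assumptions apply: the markup below is a tiny, well-formed stand-in invented for the example (the real page is far messier), and ElementTree only supports a subset of XPath, so the sketch uses the .. parent step in place of parent::font or the font[a/@name=...] predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page: one <font> per product.
SNIPPET = """
<body>
  <font size="2"><a name="HSI"/><b>HSI</b> 100 200 <a href="#top">Back to Top</a></font>
  <font size="2"><a name="HHI"/><b>HHI</b> 300 400 <a href="#top">Back to Top</a></font>
</body>
"""


def font_text_for(product: str, markup: str = SNIPPET) -> str:
    """Return the whitespace-normalised text of the element that contains <a name=product>."""
    root = ET.fromstring(markup)
    # ".." selects the parent of the matched <a>, here the enclosing <font>.
    font = root.find(f".//a[@name='{product}']/..")
    if font is None:
        raise LookupError(f"no anchor named {product!r}")
    return " ".join("".join(font.itertext()).split())


print(font_text_for("HHI"))
```

As in the answer, the result still includes the "Back to Top" cruft, because the text is collected from everything inside the selected element.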

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(But currently I have working code that scrapes and stores that, I just haven't been able to get the gender part to work.)
Sounds like you could use a CSS attribute = value selector with the $ (ends with) operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
gender = soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text if soup.select_one("[href$='typeid=19'], [href$='typeid=15']") is not None else 'Not found'
print(gender)
Multiple values (use select, which returns every match, rather than select_one):
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
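The HTML snippet in the question appears to have lost its a tags, so the sketch below runs on a hypothetical reconstruction of the markup (the thread links and titles are invented). It also swaps BeautifulSoup for the standard library's html.parser, purely to show the "href ends with" matching that the [href$='...'] selector performs. The class and function names are made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup: each gender label is a link whose href ends in typeid=19 or typeid=15.
SAMPLE_HTML = (
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> '
    '<a href="thread-1.html">first title</a>'
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> '
    '<a href="thread-2.html">second title</a>'
)


class GenderLinkParser(HTMLParser):
    """Collect the text of every <a> whose href ends with one of SUFFIXES."""

    SUFFIXES = ("typeid=19", "typeid=15")

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.genders = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.endswith(self.SUFFIXES):
                self._in_link = True
                self.genders.append("")

    def handle_data(self, data):
        if self._in_link:
            self.genders[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


def extract_genders(markup: str) -> list[str]:
    parser = GenderLinkParser()
    parser.feed(markup)
    parser.close()
    return parser.genders


print(extract_genders(SAMPLE_HTML))
```

With BeautifulSoup available, the select call in the answer covers the same ground in one line; this version only makes the matching rule explicit.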
Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

position oriented text and image save in Django/mysql

I'm creating a blog application similar to scoopwhoop or mensxp.
Basically, I want to design my database in such a way that I can assign an image a particular position in the article.
Look at this page: https://www.scoopwhoop.com/Move-Over-Tony-Stark-Marvels-New-Iron-Man-Is-A-15YearOld-Black-Girl/
or http://www.mensxp.com/health/weight-loss/31372-5-rules-of-fat-loss-that-most-people-ignore.html
As you can see, these pages have some text, then a relevant image, then more text followed by an image relevant to the text just above it, and so on.
I mean it should make sense that a particular image comes just before or after the related text.
Currently I'm doing this way
class Post(TimeStamp):
    title = models.CharField(max_length=127)
    text = models.TextField(verbose_name='full text description')
    # some more fields

class Pictures(TimeStamp):
    image = models.ImageField(upload_to=upload_to_image_path)
    post = models.ForeignKey(Post, related_name="picture")
This schema creates two tables, one for blog posts and the other for the images used in posts.
With this I can only place images in a post arbitrarily: for example, count the number of words in a post and the number of images associated with it, then use basic math to divide the text into equal-length blocks and put an image after every block in the frontend. But that won't solve the problem.
I also tried django_summernote, but it created some other problems, so I discarded that option.
How do you think I should design my schema so that I can solve this problem, ideally while still being able to use the Django admin smoothly?
I would throw out the Pictures() class altogether. The images should be a part of your post body. So an example entry would be...
title = "How to program an awesome site with Django!"
text = "<p>This is some text.</p><img src='image link'><p>This is some text after the image</p>"
Basically, the images are just as much a part of the post body as the text. There is no need to create a separate database table.
So in summary, this should be your database setup...
class Post(TimeStamp):
    title = models.CharField(max_length=127)
    text = models.TextField(verbose_name='full text description')
    # some more fields
And the post body as a whole should all be contained within the text field.

(AS3) Getting an HTML-specific character index in a textfield after word wrap

I didn't know how to phrase the title, so sorry about that. If you have a better title suggestion, let me know and I'll change it.
I've got a chunk of text that is displayed as HTML in a TextField. An example of this text is this:
1
<font size="30" color="#FF0000">When your only tool is a hammer, all problems start looking like nails.</font>
</br>
2
<i>99 percent of lawyers give the rest a bad name.</i>
<b>Artificial intelligence is no match for natural stupidity.</b>
<u>The last thing I want to do is insult you. But it IS on the list.</u>
</br>
3<showimage=Images/image1.jpg>
I don't have a solution, but I do admire the problem.
The only substitute for good manners is fast reflexes.
Support bacteria - they're the only culture some people have.
</br>
4
Letting the cat out of the bag is a whole lot easier than putting it back in.
Well, here I am! What are your other two wishes?
Most of the tags are basic, meant to display what I can do formatting wise. However, since Adobe Air has a sandbox that prevents inline images (via the <img src='foo.png'> tag), I've had to come up with another way to display images.
Basically, I intend on having an image displayed somewhere on the screen, and as the user scrolls the image will change based on where in the text they have scrolled to. The image can be a background image, a slideshow on the right, anything really.
In the snippet above, look for my custom tag <showimage=Images/image1.jpg>. I want to get the local y position of that tag once the TextField is rendered as HTML and word wrapped. The trouble is, when I query the y position of the tag (using getCharBoundaries), I can only find the tag when I render the text as .text instead of .htmlText. If I search for the tag in the TextField after rendering it as .htmlText, it isn't found because the tags are hidden and replaced with formatting.
The trouble with the y value I get before rendering the HTML is that the y value will be different due to font sizes, tags being hidden and word wrap changing the line and y value that the tag is located at.
How do I get the correct y value of an HTML tag once the HTML has been rendered?
I've considered using a different style tag, maybe something like &&&&&showImage=Images/image1.jpg&&&&, but that seems like a cop-out and I'd still run into problems if multiple of those tags were in a block of text and the tags were removed, followed by word wrap that shifts lines in a pretty unpredictable way.
myTextField.textHeight tells you the height of the text in pixels. So you can split the string on whatever you're looking for, put the text before your target in the textField and get the textHeight, then put the rest of the text in.
Here's some example code - tMain is the name of the textField:
var iTextHeight: int = 0;
var sText: String = '<font size="30" color="#FF0000">When your only tool is a hammer, all problems start looking like nails.</font></br><i>99 percent of lawyers give the rest a bad name.</i><b>Artificial intelligence is no match for natural stupidity.</b><u>The last thing I want to do is insult you. But it IS on the list.</u></br><showimage=Images/image1.jpg> I don\'t have a solution, but I do admire the problem. The only substitute for good manners is fast reflexes. Support bacteria - they\'re the only culture some people have. </br>Letting the cat out of the bag is a whole lot easier than putting it back in. Well, here I am! What are your other two wishes?';
var aStringParts: Array = sText.split("<showimage=Images/image1.jpg>");
for (var i = 0; i < aStringParts.length; i++) {
    if (i == 0) {
        tMain.htmlText = aStringParts[i];
        trace("height of text: " + tMain.textHeight);
    } else {
        tMain.appendText(aStringParts[i]);
    }
}
sText gets split on the tag you're looking for (removes the text you're looking for and breaks remaining text into an array). The text leading up to the tag is put in the textField and the textHeight is traced. Then the rest of the text is put in the textField. This gives you the y pixel number you need to arrange things.
Let me know of any questions you have.
Instead of going through the trouble of parsing your image tag, have you tried playing with HTMLLoader and using the loadString method? This should load everything in its proper place including the image using the img tag.
private var htmlLoader:HTMLLoader;

private function loadHtml(content:String):void
{
    htmlLoader = new HTMLLoader(); //Constructor
    htmlLoader.addEventListener(Event.COMPLETE, handleHtmlLoadComplete); //Handle complete
    htmlLoader.loadString(content); //Load html from string
}

private function handleHtmlLoadComplete(e:Event):void
{
    htmlLoader.removeEventListener(Event.COMPLETE, handleHtmlLoadComplete); //Always remove event listeners!
    htmlLoader.width = htmlLoader.contentWidth; //Set width and height container
    htmlLoader.height = htmlLoader.contentHeight;
    addChild(htmlLoader); //Add to stage
}
Another approach is to search your HTML string for <showImage ..> tags and replace them with shortcodes, e.g. [showImage ..], before inserting the string into a TextField. The shortcode is then NOT markup but plain text, so you can retrieve its y value (that is, if I understand your issue correctly).
Then the rest of your code can take it from there.
(P.S. Using HTMLLoader seems like a nice alternative, though.)