How to extract a description part from website with proper spacing? - html

I accessed the website with Beautiful Soup and retrieved the description part (a div class), but because it contains bulleted points, the output has no spacing between the points and is not readable:
DESCRIPTION:
COVID-19 ProjectionsGovernment-mandated social distancingHospital resource useAll bedsICU bedsInvasive ventilatorsDeaths per dayTotal deaths
The description contains both normal paragraphs and bullet points, so I cannot use li or ul to retrieve the bullet points alone.
This is my program for this description part:
def DESCRIPTION(self):
    print('\n' + "DESCRIPTION: ")
    for j in Data_Set_Info.soup.select('.iH9v7b'):
        k = j.get_text()
        print('\n' + k)
The HTML code for this webpage is:
<div class="iH9v7b"><p>COVID-19 Projections</p><ul><li>Government-mandated social distancing</li><li>Hospital resource use</li><ul><li>All beds</li><li>ICU beds</li><li>Invasive ventilators</li></ul><li>Deaths per day</li><li>Total deaths</li></ul><p></p></div>
The webpage is: https://datasetsearch.research.google.com/search?query=health&docid=B2%2BtssYi2L2wvQwVAAAAAA%3D%3D
On this website there are different datasets, and each dataset has a different description. I need to get every description with proper spacing using a single program. Thanks in advance.

If you just want to get all the text with spaces in between, you can specify the character used to join text from different elements as an argument to get_text, like so:
k = j.get_text(' ')
If you want to be able to preserve (potentially nested) lists in the output then you'll need to recursively search through j.contents. A one-size-fits-all solution is unlikely to work for that purpose and will probably need a bit of experimentation.
Documentation links:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
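For the nested-list case, here is a minimal sketch of the idea. It uses Python's standard html.parser rather than Beautiful Soup's j.contents, so it is an illustration of the approach and not the Beautiful Soup API. The names ListAwareText and html_to_outline are made up for this example, and it assumes only p, li, ul and ol matter for layout.

```python
from html.parser import HTMLParser


class ListAwareText(HTMLParser):
    """Collect text one line per <p>/<li>, indenting by list nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <ul>/<ol> we are currently inside
        self.lines = []   # finished output lines
        self._buf = []    # text fragments of the line being built

    def _flush(self):
        # Emit the buffered text (if any) as one output line.
        text = "".join(self._buf).strip()
        self._buf = []
        if text:
            indent = "  " * max(self.depth - 1, 0)
            bullet = "- " if self.depth else ""
            self.lines.append(indent + bullet + text)

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush()
            self.depth += 1
        elif tag in ("p", "li"):
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self._flush()
            self.depth -= 1
        elif tag in ("p", "li"):
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)


def html_to_outline(html):
    """Return the text of an HTML fragment as an indented outline."""
    parser = ListAwareText()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

With Beautiful Soup you could pass str(j) for each selected div to html_to_outline. Other block-level tags (headings, tables, br) would need their own handling.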

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no < or > elsewhere or entity-encoded, an idea using a lookbehind.
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
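To see what the pattern returns outside of Pipelines, here is a small Python sketch that applies it to the sample input from the question. The function name text_only and the whitespace normalisation step are assumptions added for illustration. Pipelines itself may join the matches differently.

```python
import re

# The pattern from the answer: a run of non-"<" characters that starts
# either at the beginning of the string or right after a ">".
TEXT_RUN = re.compile(r"(?:(?<=>)|^)[^<]+")


def text_only(html):
    """Return the text between tags, with whitespace collapsed to single spaces."""
    pieces = TEXT_RUN.findall(html)
    # Joining with a space keeps "link" and "to" apart across </p><p>;
    # split()/join() then collapses newlines and repeated spaces.
    return " ".join(" ".join(pieces).split())
```

Note that joining the pieces with a space will also split a word that is interrupted by an inline tag, for example <b>H</b>ello becomes "H ello".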

R, Regex, and Matching the Choice of a Qualtrics Response Column

When you export response data from Qualtrics as a CSV, the 2nd row of the data contains strings with the question stem (shortened if necessary), followed by a dash, followed by that response column's corresponding choice. As an example, if my question were "Please select all of the fruit you enjoy:", in my response data the second row of a response column to this question might contain something like "Please select all of the fruit you enjoy:-Blueberries".
Qualtrics shortens the question stem if it is longer than 100 characters. If it is more than 100 characters, the stem is cut off after the 99th character, "..." is appended, and then the dash, and then the choice text.
I am trying to retrieve the text that is after this dash. However, that's difficult, because both the choice text and the question text could contain dashes. I have thought of two different approaches I could take in attempting to select just the choice text:
1. I have the question text, and can reliably programmatically retrieve it based on the response column name. However, the question text doesn't always match exactly, because Qualtrics removes any HTML styling in the question text in the response data, but not in the Qualtrics survey file (QSF) that I am getting the question text from. For questions that don't have any HTML styling, I was thinking about using the question text to match up to and including the dash between the question text and the choice text. I think regex could handle this case fine, but this clearly doesn't work without heavy modification for any questions that have HTML components.
2. The alternative, which I think might be more reliable: strip any HTML tags from the question text in the QSF file, and then count how many "-" characters appear in the question text. Call that n, then match the 2nd-row response entry up to the (n+1)th dash, remove it, and what remains is my choice text.
I think the 2nd option is much more likely to work consistently, since the first option leaves me with a case where I have to try and strip html from the question text in exactly the same way Qualtrics does, unless I use fuzzy matching (which I know nothing about). However, the second option is also unclear to me.
an example csv response set
For example, the first question's question text looks like this in the QSF:
"<div style=\"text-align: center;\">Click to write the question text
<span style=\"font-size: 10.8333px;\">thsi<sup>tasdf<em>werasfd</em></sup>
<em>sdfad</em></span><br />\n </div>"
I would appreciate both of the following: advice on which option (or a suggestion for another) you think has the most chance for success, and help with the regex in R for matching the text up to the n+1th "-" character.
Here's a solution that counts the dashes in the question, locates the nth dash in the text (if any) and drops the preceding characters, and then keeps the substring that follows the next dash in the text.
stem_text <- "Please--select your extracurriculars"
s <- "<em>Please</em>--select your extracurriculars-student-athletics"
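# count dashes in question stem (gregexpr returns -1 when nothing matches)
stem_dash_n <- sum(gregexpr("-", stem_text, fixed = TRUE)[[1]] > 0)
# locate dashes in string
s_dashes <- gregexpr("-", s, fixed = TRUE)[[1]]
# start after the stem's last dash, or at the beginning if the stem has none
sub_start <- if (stem_dash_n > 0) s_dashes[stem_dash_n] else 0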
s_sub <- substr(s, sub_start + 1, nchar(s))
sub("[^\\-]*\\-(.*)", "\\1", s_sub, perl = TRUE)
# [1] "student-athletics"
Assumptions: based on your description, length(s_dashes) >= stem_dash_n, so s_dashes[stem_dash_n] exists; the same number of dashes appear in the known stems and their representations in the text; and there is always a dash separating the stem and response choice.
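The same dash-counting idea can be sketched in Python, under the assumptions stated in the question: the QSF stem may contain HTML tags, and stems longer than 100 characters are cut after the 99th character before "..." is appended. The function name choice_text and its limit parameter are hypothetical, and HTML entities in the stem are not decoded here.

```python
import re


def choice_text(stem_html, header, limit=100):
    """Return the choice part of a Qualtrics second-row header.

    stem_html: question text from the QSF, possibly containing HTML tags.
    header:    the second-row string, "<stem>-<choice>".
    """
    # Remove tags so the stem resembles what appears in the export.
    stem = re.sub(r"<[^>]*>", "", stem_html)
    # Mimic the truncation described in the question (assumption).
    if len(stem) > limit:
        stem = stem[:limit - 1]
    n = stem.count("-")
    # Split on the first n + 1 dashes; the last piece is the choice text.
    parts = header.split("-", n + 1)
    if len(parts) < n + 2:
        raise ValueError("header has fewer dashes than the stem implies")
    return parts[-1]
```

Because only the number of dashes is used, small differences in how tags or whitespace are stripped do not matter, as long as no dash is gained or lost.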

What is regex expression for a string after I use nokogiri to scrape

I have this string and it is in an html document of 100 other names that are formatted the same:
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
And I want to save 'Physical education sed' under a name column and '15184745922' under a number column.
I was wondering how do you do this in Ruby.
In nokogiri I can get only the li's by doing this:
puts page.css("ul li").text
but then it comes out all in one word: "Physical education sed15184745922"
I was thinking regex is the way to go but I am stumped with that.
I did split it on the li
full_contact = page.css("ul li")[22]
split_contact_on_li = full_contact.to_s.split(/(\W|^)li(\W|$)/).map(&:to_sym)
puts split_contact_on_li
and I get this
<
>
Physical education sed<span class="meta"><ul>
<
>
15184745922<
/
>
</ul></span>
<
/
>
The same number of lines will be shown for each contact_info and the name is always the third line before the span class and the number is always the 6th line.
There is an instance where there might be an email address instead on the 6th line, but not often.
So should I match the second and the third angular bracket and pull the information up to the third and fourth bracket then shove it into an array called name and number?
You shouldn't use a regex to parse xhtml since the regex engine might mess up things, you should use a html parser instead. However, if you want to use a regex, you can use a regex like this:
<li>(.*?)<.*?<li>(.*?)<
Working demo
The idea behind this regex is to use capturing groups (using parentheses) to capture the content you want. So, for your sample input the match information is:
MATCH 1
Group 1. [4-26] `Physical education sed`
Group 2. [53-64] `15184745922`
For example:
#!/usr/bin/env ruby
string = "<li>Physical education sed<span class=\"meta\"><ul><li>15184745922</li></ul></span></li>"
one, two = string.match(/<li>(.*?)<.*?<li>(.*?)</i).captures
p one #=> "Physical education sed"
p two #=> "15184745922"
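For comparison, the same two-group pattern behaves the same way in Python's re module. This is only a sketch of the regex technique, not a replacement for parsing with Nokogiri. The name extract_contacts and the second sample entry are invented for the example.

```python
import re

# Group 1: text after the outer <li>; group 2: text after the inner <li>.
CONTACT = re.compile(r"<li>(.*?)<.*?<li>(.*?)<", re.IGNORECASE)


def extract_contacts(html):
    """Return a list of (name, number) tuples, one per outer <li>."""
    return CONTACT.findall(html)
```

Because "." does not match newlines by default, this relies on each entry keeping its name and number on one line, as in the sample.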
Why don't you just do a regex on the string "physical education sed15184745922"? You can match on the first digit, and get back the number and the preceding text.
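A minimal Python sketch of that idea, assuming the name itself never contains a digit. The helper name split_name_number is hypothetical, and the occasional email-instead-of-number case from the question is not handled beyond returning None for the number.

```python
import re


def split_name_number(text):
    """Split flattened text at the first digit into (name, number)."""
    match = re.search(r"\d", text)
    if match is None:
        # No digit at all, e.g. an entry without a number.
        return text.strip(), None
    start = match.start()
    return text[:start].strip(), text[start:].strip()
```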
I don't know how to use Ruby, but if I understand your question correctly I would take advantage of the gsub function (or Ruby's equivalent). It might not be the prettiest approach, but since we just want the text in one variable and the numbers in another, we can just replace the characters we don't want with empty values.
v1 = page.css('ul li').text
v2 = gsub('\\d+', '', v1)
v3 = gsub('\\D+', '', v1)
v1 gets the full text value, v2 replaces all numeric characters with '', and v3 replaces all non-numeric characters with '', giving us two new variables to put wherever we please.
Again, I don't know how to use Ruby, but in R I know that I could get all of the values from the page using the xpath you provided ("ul li") into a vector and then loop across the vector performing the above steps on each element. I'm not sure if that adequately answers your question, but hopefully the gsub function gets you closer to what you want.
You need to use your HTML parser (Nokogiri) and regular expressions together. First, use Nokogiri to traverse down to the first parent node that contains all the text you need, then regex the text to get what you need.
Also, consider using .xpath instead of .css; it provides much more functionality to search and scrape for just what you want. Given your example, you could do it like so:
page.xpath("//span[@class='meta']/parent::li").map do |i|
  i.text.scan(/^([a-z\s]+)(\d+)$/i).flatten
end
#=> [['Physical education sed', '15184745922'], ['the next string', '1234567890'], ...]
And now you have a two-dimensional array you can iterate over and save each pair.
This bit of xpath business: "//span[@class='meta']/parent::li" is doing what .css can't do, returning the parent node that has the text and the specific child nodes you want to scrape against.

How to generate hash from ~200k text/html that would match/compare to similar text?

I would like to make a sort of hash key out of a text (in my case HTML) that would match/compare to the hash of other similar text.
ex of matching texts:
"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
...
"2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..
So far I've thought of removing numbers and tags, but that would still leave the random words.
Is there anything out there that does this?
I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.
The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.
A function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).
Any advice is welcomed!
UPDATE:
I'm referring to a hash function as it is described on Wikipedia:
A hash function is any algorithm or subroutine that maps large data
sets of variable length to smaller data sets of a fixed length.
the fixed length part is not necessary in my case.
It sounds like you need to utilize a program like diff.
If you are just trying to compare text, a hash is not the way to go, because slight differences in input cause complete differences in output (which is why hashes are used to store passwords and secure text). Character-difference programs are pretty complicated; unless you are really interested in how they work and want to write your own, I would just use a solution like the one shown here, which uses sdiff to get a percentage:
Percentage value with GNU Diff
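As a rough in-process analogue of a diff-based percentage, Python's standard difflib module can score two texts line by line. This is a stand-in for sdiff, not the same algorithm, and the function name percent_similar is made up for the sketch. Like any pairwise comparison, it does not by itself solve the problem of comparing millions of pages.

```python
from difflib import SequenceMatcher


def percent_similar(text1, text2):
    """Return a 0-100 similarity score computed over lines, not characters."""
    lines1 = text1.splitlines()
    lines2 = text2.splitlines()
    # ratio() is 2 * matching_lines / total_lines across both inputs.
    return 100.0 * SequenceMatcher(None, lines1, lines2).ratio()
```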
You could use some sort of Levenshtein distance algorithm. This works for small pieces of text, but I'm fairly sure that something similar can be applied to large chunks of text.
Ref: http://en.m.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
I've found out that tag order in webpages can create a very distinctive pattern, that remains the same even if portions of text / css / script change. So I've made a string generated by the tag order (ex: html head meta title body div table tr td span bold... => "hhmtbdttsb...") and then I just do exact matches between these strings. I can even apply the Levenshtein distance algorithm and get accurate results.
If I didn't have html, I would have used the punctuation/end-lines for splitting, or something similar.
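The tag-order idea can be sketched as follows, using Python's standard html.parser and a plain dynamic-programming Levenshtein distance. The names tag_signature, levenshtein and similarity are illustrative, and taking the first letter of each opening tag is just one possible encoding (different tags such as table, tr and td collapse to the same letter).

```python
from html.parser import HTMLParser


class _TagLetters(HTMLParser):
    """Record the first letter of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.letters = []

    def handle_starttag(self, tag, attrs):
        self.letters.append(tag[0])


def tag_signature(html):
    """Return a string such as "hhtbdps" describing the tag order."""
    parser = _TagLetters()
    parser.feed(html)
    parser.close()
    return "".join(parser.letters)


def levenshtein(a, b):
    """Classic edit distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(sig1, sig2):
    """Return a score in [0, 1]; 1.0 means identical signatures."""
    longest = max(len(sig1), len(sig2))
    return 1.0 if longest == 0 else 1 - levenshtein(sig1, sig2) / longest
```

Signatures are short and can be stored and indexed for exact matching, with the edit distance reserved for the candidates that do not match exactly.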

Formatting a String Array to Display to Users

What is the best format to communicate an array of strings in one string to users who are not geeks?
I could do it like this:
Item1, Item2, Item3
But that becomes meaningless when the strings contain spaces and commas.
I could also do it this way:
"Item1", "Item2", "Item3"
However, I would like to avoid escaping the array elements because escaped characters can be confusing to the uninitiated.
Edit: I should have clarified that I need the formatted string to be one-line. Basically, I have a list of lists displayed in a .Net Winforms ListView (although this question is language-agnostic). I need to show the users a one-line "snapshot" of the list next to the list's name in the ListView, so they get a general idea of what the list contains.
You can pick a character like the pipe (|), which is not used much outside programs. It is also used in wiki markup for tables, so it may be intuitive to those who are familiar with wiki markup.
Item1| Item2| Item3
In a GUI or color TUI, shade each element individually. In a monochrome TUI, add a couple of spaces and advance to the next tab position (\t) between each word.
Using JSON, the above list would look like:
'["Item1", "Item2", "Item3"]'.
This is unambiguous and a syntax in widespread use. Just explain the nested syntax a little bit and they'll probably get it.
Of course, if this is to be displayed in a UI, then you don't necessarily want unambiguous syntax as much as you want it to actually look like something intended for the end user. In that case it would depend exactly how you are displaying this to the user.
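The question is language-agnostic, so as one illustration, this is how the one-line JSON form could be produced with Python's standard json module. The function name to_json_line is hypothetical. Note that quotes inside an item are backslash-escaped, which is exactly the kind of escaping the question hoped to avoid.

```python
import json


def to_json_line(items):
    """Format a list of strings as a single-line JSON array."""
    # ensure_ascii=False keeps non-ASCII characters readable for users.
    return json.dumps(list(items), ensure_ascii=False)
```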
Display each element as a cell in a table.
How about line breaks after each string? :>
Display each string on a separate line, with line numbers:
1. Make a list
2. Check it twice
3. Say something nice
It's the way people write lists in the real world, y'know :)
Use some kind of typographical convention, for example a bold hashmark and space between strings.
milk # eggs # bread # apples # lettuce # carrots
CSV. Because the very first thing your non-technical user is going to do with delimited data is import it into a spreadsheet.
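As a sketch of the CSV option, Python's standard csv module quotes only the items that need it, so simple lists stay free of punctuation noise. The function name to_csv_line is made up for this example.

```python
import csv
import io


def to_csv_line(items):
    """Format a list of strings as one CSV row without a trailing newline."""
    buffer = io.StringIO()
    # Default quoting is minimal: only fields containing commas, quotes
    # or line breaks are wrapped in double quotes.
    csv.writer(buffer, lineterminator="").writerow(items)
    return buffer.getvalue()
```

An item that itself contains a line break would still span lines, so for a strict one-line snapshot such items would need to be cleaned first.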