Using htmlParse to clean text from dataframe


I have a data frame with HTML text in one column and a binary variable in the other. The first column looks like this:
1 <p>I am trying to build a simple map app with Shiny and ggplot2. It works as follow: </p>\n\n<ul>\n<li>user selects a country </li>\n<li>the app loads a shape file and gives a list of input fields for adm1 country regions</li>\n<li>user inputs a numeric value for each region (fields are initially filled with random values) </li>\n<li>all values from input fields are collected in a vector, merged to the map data and given as a <code>fill</code> argument to the <code>ggplot()</code> function</li>\n</ul>\n\n<p>The problem is that ggplot doesn't seem to interpret correctly the input values for each regions. Plus, colors on the map don't change when input values are modified through the UI. I believe the <code>indicator</code> vector fed to the <code>fill</code> argument is not correctly interpreted/passed.</p>\n\n<p>Thank you for your suggestions.</p>\n\n<p><em>Note: in the code below, the shapefiles are sourced on the UCDavis website for reproducibility. I usually store them locally.</... <truncated>
I'm trying to strip the HTML tags using a for loop, but R complains that the input isn't XML:
for (i in 1:nrow(dataFrame)) {
row <- dataFrame[i,]
htmlParse(dataFrame)
}
Any suggestions?

Related

Map multiple objects in an array in JOLT

There is a field called "item" that I originally mapped as a single object, because the source sent it only once. Now the source sends it multiple times and my mapping breaks. How should I handle this?
The field name is item.
Current output:
Expected output:
My old input, new input, and current spec are at the link below.
FYI: the items tag contains more fields; this is just a reduced example.
Link: https://drive.google.com/drive/folders/1vvdFBPwaHRVvjttUTQP0jzQlqYlGfjFZ

How to scrape text based on a specific link with BeautifulSoup?

I'm trying to scrape text from a website, but specifically only the text that's linked to with one of two specific links, and then additionally scrape another text string that follows shortly after it.
The second text string is easy to scrape because it includes a unique class I can target, so I've already gotten that working, but I haven't been able to successfully scrape the first text (with the one of two specific links).
I found this SO question ( Find specific link w/ beautifulsoup ) and tried to implement variations of that, but wasn't able to get it to work.
Here's a snippet of the HTML code I'm trying to scrape. This pattern recurs repeatedly over the course of each page I'm scraping:
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
The two parts I'm trying to scrape and then store together in a list are the two Chinese-language text strings.
The first of these, 女孩, which means female, is the one I haven't been able to scrape successfully.
This is always preceded by one of these two links:
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19 (Female)
forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=15 (Male)
I've tested a whole bunch of different things, including things like:
gender_containers = soup.find_all('a', href = 'forum.php?mod=forumdisplay&fid=191&filter=typeid&typeid=19')
print(gender_containers.get_text())
But for everything I've tried, I keep getting errors like:
ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I think that I'm not successfully finding those links to grab the text, but my rudimentary Python skills thus far have failed me in figuring out how to make it happen.
What I want to have happen ultimately is to scrape each page such that the two strings in this code (女孩 and 寻找2003年出生2004年失踪贵州省...)
<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179
...are scraped as two separate variables so that I can store them as two items in a list and then iterate down to the next instance of this code, scrape those two text snippets and store them as another list, etc. I'm building a list of lists in which I want each row/nested list to contain two strings: the gender (女孩 or 男孩) and then the longer string, which has a lot more variation.
(I already have working code that scrapes and stores the longer string; I just haven't been able to get the gender part to work.)

Sounds like you could use an attribute = value CSS selector with the $ (ends-with) operator.
If there can only be one occurrence per page:
soup.select_one("[href$='typeid=19'], [href$='typeid=15']").text
This is assuming those typeid=19 or typeid=15 only occur at the end of the strings of interest. The "," between the two in the selector is to allow for matching on either.
You could additionally handle possibility of not being present as follows:
from bs4 import BeautifulSoup
html ='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(html,'html.parser')
match = soup.select_one("[href$='typeid=19'], [href$='typeid=15']")
gender = match.text if match is not None else 'Not found'
print(gender)
Multiple values:
# select() returns every match; select_one() returns only the first
genders = [item.text for item in soup.select("[href$='typeid=19'], [href$='typeid=15']")]
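As a rough, stdlib-only sketch of the same ends-with match: this uses Python's html.parser rather than BeautifulSoup, and the sample markup (the <a> tags, the second row, and the unrelated link) is assumed, because the snippet in the question lost its links. The class and function names are hypothetical as well.

```python
from html.parser import HTMLParser

# The two href endings named in the question.
GENDER_SUFFIXES = ("typeid=19", "typeid=15")


class GenderLinkParser(HTMLParser):
    """Collect the text of every <a> whose href ends with a gender typeid."""

    def __init__(self):
        super().__init__()
        self.genders = []
        self._in_gender_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(GENDER_SUFFIXES):
            self._in_gender_link = True
            self.genders.append("")

    def handle_data(self, data):
        if self._in_gender_link:
            self.genders[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_gender_link = False


def extract_genders(html):
    parser = GenderLinkParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.genders]


# Assumed markup: the real page structure may differ.
SAMPLE = (
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=19">女孩</a>]</em> first row\n'
    '<em>[<a href="forum.php?mod=forumdisplay&amp;fid=191&amp;filter=typeid&amp;typeid=15">男孩</a>]</em> second row\n'
    '<a href="other.php?x=1">unrelated link</a>'
)

print(extract_genders(SAMPLE))  # ['女孩', '男孩']
```

With BeautifulSoup itself, the equivalent is a list built from soup.select(...) with the same two-part selector.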

Try the following code.
from bs4 import BeautifulSoup
data='''<em>[女孩]</em> 寻找2003年出生2004年失踪贵州省黔西南布依族苗族自治州贞丰县珉谷镇锅底冲 黄冬冬289179'''
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('em').text)
Output:
[女孩]

How to generate sequence numbers in SSRS that renumber when a field in between is missing

I'm having trouble showing sequence numbers in an SSRS report, as shown in the image.
When I searched for this issue, the solution I found was RowNumber("DataSetName"),
but it generates numbers like 1, 2, 3..., and I want numbers in the form 1.1, 1.2 or 1.1.1, 1.2.1.
Another problem is that this function works only when the dataset has multiple rows and is bound to a table. In my case, all the data arrives in a single row; I show the values in textboxes using expressions, and I hide a textbox when its value is empty.
So I can't find a way to show a sequence number in the textbox along with the text, or to renumber when a textbox in between is hidden because it has no data.
Please suggest a solution.
Image for reference:
Example data:
In the table above, the values from "Subheading1" and "Subheading2" are shown inside "Heading first", and the values from "Subheading3" and "Subheading4" are shown inside "Heading second".

You can concatenate row numbers as strings as follows:
=RowNumber("HEADING") & "." & RowNumber("SUBHEADING")
To ensure the numbers are consecutive, remove the relevant rows in the source dataset instead of hiding them in the tablix.
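The renumbering logic can be illustrated outside SSRS. The sketch below is plain Python, not an SSRS expression; the function name and the sample values (including the empty entry) are hypothetical. It drops empty subheadings first and numbers only what remains, which is why removing rows at the source keeps the sequence consecutive.

```python
def number_outline(sections):
    """Return numbered labels, skipping empty subheadings so numbers stay consecutive."""
    labels = []
    for heading_no, (heading, subheadings) in enumerate(sections, start=1):
        labels.append(f"{heading_no} {heading}")
        # Filter before numbering: hidden entries never consume a number.
        visible = [text for text in subheadings if text]
        for sub_no, text in enumerate(visible, start=1):
            labels.append(f"{heading_no}.{sub_no} {text}")
    return labels


# Hypothetical data: the empty string stands for a value that would be hidden.
sections = [
    ("Heading first", ["Subheading1", "", "Subheading2"]),
    ("Heading second", ["Subheading3", "Subheading4"]),
]

for label in number_outline(sections):
    print(label)
# 1 Heading first
# 1.1 Subheading1
# 1.2 Subheading2
# 2 Heading second
# 2.1 Subheading3
# 2.2 Subheading4
```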

Why is the year value not showing in a D3plus scatter plot?

I am using D3plus for data visualization, but the x-axis shows the wrong data instead of the "year" values I specified with .x("year").
http://jsfiddle.net/MituVinci/a77kz0dr/
var visualization = d3plus.viz()
.container("#viz") // container DIV to hold the visualization
.data(sample_data) // data to use with the visualization
.type("scatter") // visualization type
.id("Reason") // key for which our data is unique on
.x("year") // key for x-axis
.y("Female") // key for y-axis
.draw()
I also want to resize the width and height of the visualization and load its data from an external JSON file. How can I do that?

Since you have "Reason" as your .id() variable, D3plus is aggregating all data points that have the same "Reason". So the "x" position for "Family Feud" is 2010+2011+2012+2013+2014 or 10060, which is where all of your bubbles are located.
If you want to display each bubble individually, you could create a separate variable called "ReasonYear", concat the text of the Reason and Year fields together and then use .id("ReasonYear") for your visualization.
Use .width() and .height() to control the width and height of your visualization, respectively.
Use .data() to load data from an external JSON file.
Documentation can be found here: https://github.com/alexandersimoes/d3plus/wiki/Visualizations
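The "ReasonYear" suggestion above could also be applied while preparing the data, before it reaches D3plus. This is a minimal Python sketch under assumptions: the field names come from the question, but the row values and the helper name add_reason_year are made up for illustration.

```python
# Hypothetical rows: field names come from the question, the values are invented.
sample_data = [
    {"Reason": "Family Feud", "year": year, "Female": count}
    for year, count in zip(range(2010, 2015), [12, 9, 14, 11, 8])
]


def add_reason_year(rows):
    """Return copies of the rows with a unique "ReasonYear" key added."""
    return [{**row, "ReasonYear": f"{row['Reason']} {row['year']}"} for row in rows]


# Grouping by "Reason" alone collapses the five rows into one and sums the years.
print(sum(row["year"] for row in sample_data))  # 10060

rows = add_reason_year(sample_data)
print(rows[0]["ReasonYear"])  # Family Feud 2010

# Every row now has its own id, so no aggregation across years takes place.
print(len({row["ReasonYear"] for row in rows}) == len(rows))  # True
```

The resulting rows can then be written out as JSON and used with .id("ReasonYear").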

D3.js: How to combine 2 datasets in order to create a map and show values on mouseover?

I would like to combine two datasets on a map in D3.js.
For example:
1st dataset: spatial data in a .json file.
2nd dataset: data about the areas in a .csv file.
--> When you hover over the map, a tooltip should show a sentence with some data from the 2nd dataset.
I am able to make the map and show a tooltip with data from the .json file, but how do I bring in the 2nd dataset?
Do I need a new function inside the function that creates the map?
Do I need a completely different approach?
Should I merge the .json file with my 2nd dataset before using d3.js?
I appreciate any thoughts! :)

So, I think what you're asking is how to take spatial data from json and join it with some csv data that is loaded separately?
I did something similar with a choropleth map. I created a map from topology element ids to data objects, then looked up each topology element's id to get whatever I wanted to associate with the drawn map element (I used this method to set the choropleth color based on the FIPS country code).
So basically, draw the map so that you have an id associated with each map element that you want to be able to hover over. Then, in your mouseover/mouseout handlers, you will use that id to lookup the data you want to show in the tooltip and either use the svg title element or tipsy or manually draw an svg text element or whatever to show the tooltip.
Here's a couple useful references for drawing tooltips:
https://gist.github.com/biovisualize/1016860
http://jsfiddle.net/reblace/6FkBd/2/
From the fiddle:
function mouseover(d) {
d3.select(this).append("text")
.attr("class", "hover")
.attr('transform', function(d){
return 'translate(5, -10)';
})
.text(d.name + ": " + d.id);
}
// Remove the hover label on mouseout.
function mouseout(d) {
d3.select(this).select("text.hover").remove();
}
Basically, it's appending an SVG text element and offsetting it from the position of the element being hovered over.
And here's a sample of how I look up data in an external map:
// Update the bound data
data.svg.selectAll("g.map path").transition().duration(750)
.style("fill", function(d) {
// Get the feature data from the mapData using the feature code
var val = mapData[d.properties.code];
// Return the colorScale value for the state's value
return (val !== undefined)? data.settings.colorScale(val) : undefined;
});
If your data is static, you can join it into your topojson file (if that's what you're using). https://github.com/mbostock/topojson/wiki/Command-Line-Reference
The client could change my data, so I kept it separate and redrew the map each time the data changed so that the colors would update. Since my data was topojson, I could access the feature id from the map data using d.properties.code (because I had joined the codes into the topojson file using the topojson tool I reference above... but you could use whatever unique id is in the spatial data file you have).
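The id-based lookup described above is independent of D3, so it can be sketched in Python as well. Everything specific here is an assumption: the "code" property, the CSV columns, the sample values, and the helper names build_lookup and tooltip_text. The same idea applies whether the join happens in the browser at hover time or in a preprocessing step.

```python
import csv
import io
import json

# Hypothetical spatial data: each feature carries a unique "code" property.
GEOJSON = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"code": "A1", "name": "North"}, "geometry": null},
  {"type": "Feature", "properties": {"code": "B2", "name": "South"}, "geometry": null}
]}
""")

# Hypothetical second dataset, keyed by the same code; "B2" is deliberately missing.
CSV_TEXT = "code,population\nA1,1200\n"


def build_lookup(csv_text, key="code"):
    """Map each row's key column to the full row for constant-time lookups."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def tooltip_text(feature, lookup):
    """Build the tooltip sentence for one map feature."""
    props = feature["properties"]
    row = lookup.get(props["code"])
    if row is None:
        return f"{props['name']}: no data"
    return f"{props['name']}: population {row['population']}"


lookup = build_lookup(CSV_TEXT)
for feature in GEOJSON["features"]:
    print(tooltip_text(feature, lookup))
# North: population 1200
# South: no data
```

Handling the missing key explicitly mirrors the undefined check in the fill function above.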
