Let's say there is a following graph created in R igraph:
ed <- c(1,2,2,3,3,1,2,4,3,5,4,5,5,6,6,4)
gr <- make_undirected_graph(ed)
plot(gr)
I'm trying to separate edges of the graph into two groups: "supported", i.e. belonging to connected triangles (in the aforementioned example: 1-2, 2-3, 3-1, 4-5, 5-6, 6-4) and "unsupported", i.e. not belonging to connected triangles (2-4, 3-5). Is there any way to do that in igraph?
Here is a solution I found combining triangles() and #GaborCsardi's approach mentioned in my last comment:
tr <- triangles(gr)
edges <- tapply(tr, rep(1:(length(tr)/3), each = 3), function(x) E(gr,path=c(x,x[1])))
However, this solution is still rather inefficient in case of large network.
Related
I am working on a data prep tutorial, using data from this article: https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#
None of the text is hard-coded, everything is dynamic and I don't know where to start. I've tried a few things with packages rvest and xml2 but I can't even tell if I'm making progress or not.
I've used copy/paste ang regexes in notepad++ to get a tabular structure like this:
Target
Attack
AAA News
Fake News
AAA News
Fake News
AAA News
A total disgrace
...
...
Mr. ZZZ
A real nut job
but I'd like to show how to do everything programmatically (no copy/paste).
My main question is as follows: is that even possible with reasonable effort? And if so, any clues on how to get started?
PS: I know that this could be a duplicate, I just can't tell of which question since there are totally different approaches out there :\
I used my free articles allocation at The NY Times for the month, but here is some guidance. It looks like the web page uses several scripts to create and display the page.
If you uses your browser's developer tools and look at the network tab, you will find 2 CSV files:
tweets-full.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-full.csv
tweets-reduced.csv located here: https://static01.nyt.com/newsgraphics/2021/01/10/trump-insult-complete/8afc02d17b32a573bf1ceed93a0ac21b232fba7a/tweets-reduced.csv
It looks like the reduced file creates the table quoted above and the tweets-full is the full tweet. You can download these files directly with read.csv() and the process this information as needed.
Be sure to read the term of service before scraping any webpage.
Here's a programatic approach with RSelenium and rvest:
library(RSelenium)
library(rvest)
library(tidyverse)
driver <- rsDriver(browser="chrome", port=4234L, chromever ="87.0.4280.87")
client <- driver[["client"]]
client$navigate("https://www.nytimes.com/interactive/2021/01/19/upshot/trump-complete-insult-list.html#")
page.source <- client$getPageSource()[[1]]
#Extract nodes for each letter using XPath
Letters <- read_html(page.source) %>%
html_nodes(xpath = '//*[#id="mem-wall"]/div[2]/div')
#Extract Entities using CSS
Entities <- map(Letters, ~ html_nodes(.x, css = 'div.g-entity-name') %>%
html_text)
#Extract quotes using CSS
Quotes <- map(Letters, ~ html_nodes(.x, css = 'div.g-twitter-quote-container') %>%
map(html_nodes, css = 'div.g-twitter-quote-c') %>%
map(html_text))
#Bind the entites and quotes together. There are two letters that are blank, so fall back to NA
map2_dfr(Entities, Quotes,
~ map2_dfr(.x, .y,~ {if(length(.x) > 0 & length(.y)){data.frame(Entity = .x, Insult = .y)}else{
data.frame(Entity = NA, Insult = NA)}})) -> Result
#Strip out the quotes
Result %>%
mutate(Insult = str_replace_all(Insult,"(^“)|([ .,!?]?”)","") %>% str_trim) -> Result
#Take a look at the result
Result %>%
slice_sample(n=10)
Entity Insult
1 Mitt Romney failed presidential candidate
2 Hillary Clinton Crooked
3 The “mainstream” media Fake News
4 Democrats on a fishing expedition
5 Pete Ricketts illegal late night coup
6 The “mainstream” media anti-Trump haters
7 The Washington Post do nothing but write bad stories even on very positive achievements
8 Democrats weak
9 Marco Rubio Lightweight
10 The Steele Dossier a Fake Dossier
The xpath was obtained by inspecting the webpage source (F9 in Chrome), hovering over elements until the correct one was highlighted, right clicking, and choosing copy XPath as shown:
I'm trying to scrape a web page.
I want to get a dataset from two different html nodes; ".table-grosse-schrift" and "td.zentriert.no-border".
url<-paste0("https://www.transfermarkt.co.uk/serie-a/spieltag/wettbewerb/IT1/saison_id/2016/spieltag/4")
tt<-read_html(url[[x]]) %>%html_nodes(".table-grosse-schrift")%>%html_text()%>%as.matrix()
temp1=data.frame(as.character(gsub("\r|\n|\t|\U00A0", "", tt[,])))
temp2<-(read_html(url[[x]]) %>%html_nodes("td.zentriert.no-border") %>% html_text() %>% data.frame())
The problem is that the order of the nodes of ".table-grosse-schrift" on the web page keep changing, so that I cannot match the data from the two nodes.
I found that the solution can be getting two nodes' data at the same time, like this:
tt<-read_html(url[[x]]) %>%html_nodes(".table-grosse-schrift")%>%html_nodes("td.zentriert.no-border")%>%html_text()%>%as.matrix()
But this code does not work.
If I understand correctly, you should be able to use following-sibling to select the next corresponding sibling in the pair of nodes that you need.
The following-sibling axis indicates all the nodes that have the same
parent as the context node and appear after the context node in the
source document. (Source: https://developer.mozilla.org/en-US/docs/Web/XPath/Axes/following-sibling)
I am relatively new to web scraping.
I am having problems with child numbers when web scraping for multiple patents. The child number changes accordingly to the the location of the table in the web page. Sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
IPCs <-sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
tryCatch(url1 %>%
as.character() %>%
read_html() %>%
html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
html_table(),
error = function(e){NA}
)
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents so I would need a code that work for both child numbers. Is there a CSS selector which allows me to select more children? I have tried the "div:nth-child(n)" and "div:nth-child(*)" but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try this pseudo classes :
It's a range between 17 and 18.
nth-child(17):nth-child(-n+18)
I am interested in generating weighted, directed random graphs with node constraints. Is there a graph generator in R or Python that is customizable? The only one I am aware of is igraph's erdos.renyi.game() but I am unsure if one can customize it.
Edit: the customizations I want to make are 1) drawing a weighted graph and 2) constraining some nodes from drawing edges.
In igraph python, you can use link the Erdos_Renyi class.
For constraining some nodes from drawing edges, this is controlled by the p value.
Erdos_Renyi(n, p, m, directed=False, loops=False) #these are the defaults
Example:
from igraph import *
g = Graph.Erdos_Renyi(10,0.1,directed=True)
plot(g)
By setting the p=0.1 you can see that some nodes do not have edges.
For the weights you can do something like:
g.ecount() # to find the number of edges
g.es["weights"] = range(1, g.ecount())
g.es["label"] = weights
plot(g)
Result:
I have plots of 3-axis accelerometer time-series data (t,x,y,z) in separate subplots I'd like to zoom together. That is, when I use the "Zoom to Rectangle" tool on one plot, when I release the mouse all 3 plots zoom together.
Previously, I simply plotted all 3 axes on a single plot using different colors. But this is useful only with small amounts of data: I have over 2 million data points, so the last axis plotted obscures the other two. Hence the need for separate subplots.
I know I can capture matplotlib/pyplot mouse events (http://matplotlib.sourceforge.net/users/event_handling.html), and I know I can catch other events (http://matplotlib.sourceforge.net/api/backend_bases_api.html#matplotlib.backend_bases.ResizeEvent), but I don't know how to tell what zoom has been requested on any one subplot, and how to replicate it on the other two subplots.
I suspect I have the all the pieces, and need only that one last precious clue...
-BobC
The easiest way to do this is by using the sharex and/or sharey keywords when creating the axes:
from matplotlib import pyplot as plt
ax1 = plt.subplot(2,1,1)
ax1.plot(...)
ax2 = plt.subplot(2,1,2, sharex=ax1)
ax2.plot(...)
You can also do this with plt.subplots, if that's your style.
fig, ax = plt.subplots(3, 1, sharex=True, sharey=True)
Interactively this works on separate axes
for ax in fig.axes:
ax.set_xlim(0, 50)
fig.draw()