Read text from webpage without html - html

I'm trying to extract the text of a web page from its URL using WebClient in C#.
But the content contains HTML tags and I only want the raw text.
My code is as follows:
string webURL = "https://myurl.com";
WebClient wc = new WebClient();
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);
I get the following error with the above code:
The remote server returned an error: (403) Forbidden.
So I changed my code to:
string webURL = "https://myurl.com";
WebClient wc = new WebClient();
wc.Headers.Add("user-agent", "Only a Header!");
byte[] rawByteArray = wc.DownloadData(webURL);
string webContent = Encoding.UTF8.GetString(rawByteArray);
The above code runs without error, but the result contains HTML tags. The tags can be removed using Regex:
var result= Regex.Replace(webContent, "<.*?>", String.Empty);
But this method is not accurate and does not perform well. Is there a better way to extract just the text, without the HTML tags, from a URL?

The Navigate function doesn't block execution. You need to register for the DocumentCompleted event, then you should be able to grab the contents within that.

That's not the right way to do it. First of all, you should use WebClient.
Then you can try this code:
WebClient client = new WebClient();
string content = client.DownloadString("https://stackoverflow.com/search?q=web+browser+c%23");
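An HTML parser is generally a more reliable way to get the visible text than a regular expression such as "<.*?>", which can trip over script contents, comments and entities. The sketch below illustrates the idea in Python, using only the standard library's html.parser. It is not taken from the thread, and the names TextExtractor and html_to_text are made up for this example; in C# the same approach would go through an HTML parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join the pieces with spaces and collapse runs of whitespace.
        # Note: this also puts a space where a tag splits a single word.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, html_to_text("<p>Hello <b>world</b></p>") gives "Hello world".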

Related

GetCapabilities Query for TileServer Returns Malformed JSON

I have installed TileServer.php. When I navigate to it, I can see my tiles (so it's working).
My issue is that when I query for the getCapabilities file, the resulting JSON is malformed.
The JSON response is prefixed with part of the query string.
Here is the full query string:
http://<=my ip=>/tileserver/index.html?service=wmts&request=getcapabilities&version=1.0.0
Actual Json Response I Receive
(Notice wmts&request is prefixed to the otherwise valid json)
====JSON===============================
wmts&request([{"name":"190322","type":"overlay","description":"190322","version":"1.1","format":"png","bounds":[174.92249449474565,-36.991878207885335,174.93635413927785,-36.98244705946717],"maxzoom":22,"minzoom":14,"basename":"1313_190322","profile":"mercator","scale":1,"tiles": ...
==================================================
I have tried removing part of the query string to test the results. Oddly enough, the response is again prefixed with part of the query string.
Here is the full query string I tested with:
http://<=my ip=>/tileserver/index.html?request=getcapabilities&version=1.0.0
(Actual Json Response I Receive)
====JSON===============================
getcapabilities&version([{"name":"190322","type":"overlay","description":"190322","version":"1.1","format":"png","bounds":[174.92249449474565,-36.991878207885335,174.93635413927785,-36.98244705946717],"maxzoom":22,"minzoom":14,"basename":"1313_190322","profile":"mercator","scale":1,"tiles": ...
=======================================================
I could parse this out I suppose but I would like to find the cause for this issue.
I am using ASP.Net 5.0.
Here is roughly my code:
private static readonly string _tileserver_ip = "http://<my ip>/tileserver/";
HttpClient client = new HttpClient();
client.DefaultRequestHeaders.Accept.Clear();
client.DefaultRequestHeaders.Accept.Add(
new MediaTypeWithQualityHeaderValue("application/json"));
var query = new Dictionary<string, string>
{
["service"] = "wmts",
["request"] = "getcapabilities",
["version"] = "1.0.0"
};
var response = await client.GetAsync(QueryHelpers.AddQueryString(_tileserver_ip, query));
var capabilitiesString = await response.Content.ReadAsStringAsync();
// the result of the query string => "http://<my ip>/tileserver/?service=wmts&request=getcapabilities&version=1.0.0"
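If the prefix cannot be avoided at the source, the payload can be recovered by stripping the wrapper before parsing. The sketch below is in Python and assumes the body has the shape prefix(JSON), as in the two responses shown above, which looks like a JSONP-style callback wrapper. That reading is an assumption, and unwrap_padded_json is a hypothetical helper name.

```python
import json
import re

# Matches "<anything without ( [ {>( ... )" with an optional trailing semicolon.
# Assumption: the wrapper prefix never contains "(", "[" or "{".
_WRAPPER = re.compile(r"^\s*[^(\[{]*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_padded_json(body):
    """Parse JSON that may be wrapped as prefix(<json>) (hypothetical helper)."""
    match = _WRAPPER.match(body)
    # Fall back to the raw body when there is no wrapper to strip.
    payload = match.group(1) if match else body
    return json.loads(payload)
```

This only works around the symptom; the cause is still the request URL.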
EDIT
Oops! It turns out I was requesting the getCapabilities file from the TileServer in completely the wrong way.
I will leave this here in case it helps someone in the future.
Here is the correct URL: http://<= my url =>/tileserver/1.0.0/WMTSCapabilities.xml/wmts
I found the answer, and I will leave this post here in case it helps someone in the future.
In my URL I was using index.html as the index page; however, I should have been using index.json instead.
As soon as I switched to .json I received the JSON response I was expecting.
Full URL with query string:
http://<=my ip=>/tileserver/index.json?service=wmts&request=getcapabilities&version=1.0.0
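The corrected request can be put together as follows. This is a Python sketch that mirrors what QueryHelpers.AddQueryString does in the question; the host tileserver.example used in the usage note is a placeholder, and capabilities_url is a name invented for the example.

```python
from urllib.parse import urlencode


def capabilities_url(base):
    """Build the index.json request described above (hypothetical helper)."""
    # Query parameters taken from the question.
    params = {"service": "wmts", "request": "getcapabilities", "version": "1.0.0"}
    # Request index.json rather than index.html.
    return base.rstrip("/") + "/index.json?" + urlencode(params)
```

Calling capabilities_url("http://tileserver.example/tileserver/") yields the same path and query string as the URL above, with the placeholder host.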

No content when using PostAsync to post JSON

Using the code below, I've managed to create a JsonArray with the following format: [{"id":3},{"id":4},{"id":5}]
var jArray = new JsonArray();
int numOfChildren = 10;
for (int i = 0; i < numOfChildren; i++)
{
    if (CONDITION == true)
    {
        var jObj = new JsonObject();
        int id = SOMEID;
        jObj.SetNamedValue("id", JsonValue.CreateNumberValue(id));
        jArray.Add(jObj);
    }
}
I am now trying to send the JsonArray to a server using PostAsync, as can be seen below:
Uri postUri = new Uri("http://MYURI");
HttpContent content = new StringContent(jArray.ToString(), Encoding.UTF8, "application/json");
System.Net.Http.HttpResponseMessage response = await client.PostAsync(postUri, content);
On the server side, though, I can see that the POST request contains no content. After digging around on the interwebs, it would seem that using jArray.ToString() within StringContent is the culprit, but I don't understand why, or whether that is even the problem in the first place. So, why is my content missing? Note that I'm writing this for a UWP application that does not use JSON.net.
After much digging, I eventually Wiresharked two different applications, one with my original jArray.ToString() and another using JSON.net's JsonConvert.SerializeObject(). In Wireshark, I could see that the content of the two packets was identical, which told me that my issue was on the server side. I eventually figured out that my PHP script that handled incoming POST requests was too literal and would only accept JSON posts with the content type 'application/json'. My UWP application sent packets with the content type 'application/json; charset=utf-8'. After loosening the content-type check on the server side, all was well.
For those who are looking to serialize json without the use of JSON.net, jsonArray.ToString() or jsonArray.Stringify() both work well.
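The server-side fix described above amounts to comparing only the media type and ignoring parameters such as charset. The original check was a PHP script that is not shown, so the following is only a Python sketch of the idea, with made-up function names.

```python
def media_type(content_type):
    """Return the media type of a Content-Type value, without parameters."""
    # "application/json; charset=utf-8" -> "application/json"
    return content_type.split(";", 1)[0].strip().lower()


def is_json_request(content_type):
    """True for JSON bodies regardless of charset or other parameters."""
    return media_type(content_type) == "application/json"
```

With this check, both 'application/json' and 'application/json; charset=utf-8' are accepted.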
You should use a serializer to convert it to a string.
Use the Newtonsoft.Json NuGet package.
string str = JsonConvert.SerializeObject(jArray);
HttpContent content = new StringContent(str, Encoding.UTF8, "application/json");
System.Net.Http.HttpResponseMessage response = await client.PostAsync(postUri, content);

Javascript does not accept HTML String from MVC?

This is my MVC controller, which reads the HTML template from a text file and sends it to the view as a string:
using (StreamReader sr = new StreamReader(@"D:\Templates\NewGridTemplate.txt"))
{
// Read the stream to a string, and write the string to the console.
obj.sGridTemplate = sr.ReadToEnd().Replace(Environment.NewLine, " ");
//Console.WriteLine(line);
}
return View(obj);
The actual JavaScript code in the cshtml view:
var sHTML=$(@Model.sGridTemplate);
The screenshot below shows the error: the HTML string is not accepted by JavaScript, and the error points at characters such as "<". Please let me know what I missed.
You need to put your content within quotes so that it becomes a JavaScript string; otherwise it will be interpreted as syntax, and you get the error because it is not valid JS syntax.
var sHTML=$('@Model.sGridTemplate');
And the problematic line in the resulting client-side code should look more like this:
var sHTML=$(' < ...
... than:
var sHTML=$( < ...
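Wrapping the value in single quotes still breaks if the template itself contains a single quote or a line break. A more robust option is to emit the HTML as an encoded string literal. The Python sketch below shows the idea using JSON encoding with json.dumps default settings, whose output is also a valid JavaScript string literal. to_js_string_literal is a hypothetical helper and not something from the thread.

```python
import json


def to_js_string_literal(text):
    """Encode text as a quoted JavaScript string literal (hypothetical helper)."""
    # json.dumps adds the surrounding double quotes and escapes quotes,
    # backslashes and control characters such as newlines.
    literal = json.dumps(text)
    # Escape "</" so the literal cannot close an enclosing <script> element.
    return literal.replace("</", "<\\/")
```

The result already includes its own quotes, so it would be written into the page without adding any around it.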

VB 2010: How can I get the content of a specific span in a specific class of a webpage?

Let me explain: I want to make a currency converter form that uses online rates so that it's up to date. I plan to use Google, which has a currency converting function built in - type in:
[Any currency symbol][Any numeric amount] + in + [Any currency symbol]
I know how to format the URLs I will be searching for, and I know which class contains the span whose text I want. I just don't know how to programmatically extract the result and use it in my form.
Here's the applicable HTML code from a "£1 in $" conversion:
<div class="vk_ans vk_bk curtgt" style="padding-bottom: 4px">
<span style="word-break:break-all">1.68</span>
<span>US Dollar</span></div>
The class is called:
vk_ans vk_bk curtgt
The span I want is the first one inside that class (the one that contains "1.68").
BTW, I fully understand that there are easier-to-use API websites for this purpose, but I want to use Google because:
It will always be up
It's a good chance for me to learn how to grab a specific part of a webpage.
I personally use http://www.nuget.org/packages/ScrapySharp/2.2.63 for HTML scraping; it allows you to use CSS3 selectors to nicely grab what you need.
In your case it would be as simple as this:
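' Needs: Imports HtmlAgilityPack, Imports ScrapySharp.Extensions, Imports System.Linq
Dim doc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
doc.LoadHtml(GetRawHtml(theUrl))
Dim body As HtmlNode = doc.DocumentNode.SelectSingleNode("//body")
' The class is on the div, so select the div first, then take its first span.
Dim answerDiv As HtmlNode = body.CssSelect(".vk_ans.vk_bk.curtgt").First()
Dim yourSpan As HtmlNode = answerDiv.SelectSingleNode(".//span")
Dim yourValue As Double = Double.Parse(yourSpan.InnerText, System.Globalization.CultureInfo.InvariantCulture)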
Function GetRawHtml(url As String) As String
Dim html As String = String.Empty
Dim request As WebRequest = WebRequest.Create(url)
Try
Using response As WebResponse = request.GetResponse()
Using data As Stream = response.GetResponseStream()
Using sr As New StreamReader(data)
html = sr.ReadToEnd()
End Using
End Using
End Using
Catch e As Exception
yourLogger.Error("WebRequest failed at url `{0}`. Error: {1}", url, e.ToString())
End Try
Return html
End Function
Also, if the request is slow, you might want to set the request's Proxy property to Nothing; see the question "HttpWebRequest is extremely slow!"
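The selection logic itself (find the div carrying all three classes, then take the text of its first span) can be sketched without any scraping library. The following Python example uses only the standard library's html.parser and the HTML fragment from the question; FirstSpanText and first_span_text are names invented here, and it makes no claim about what Google's result pages currently look like.

```python
from html.parser import HTMLParser


class FirstSpanText(HTMLParser):
    """Finds the text of the first <span> inside a <div> that has all wanted classes."""

    def __init__(self, wanted_classes):
        super().__init__(convert_charrefs=True)
        self._wanted = set(wanted_classes)
        self._div_depth = 0      # > 0 while inside the target div
        self._in_span = False
        self._chunks = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if self.result is not None:
            return
        if tag == "div":
            classes = set((dict(attrs).get("class") or "").split())
            # Enter the target div, or track divs nested inside it.
            if self._div_depth > 0 or self._wanted <= classes:
                self._div_depth += 1
        elif tag == "span" and self._div_depth > 0:
            self._in_span = True

    def handle_endtag(self, tag):
        if self.result is not None:
            return
        if tag == "span" and self._in_span:
            # Simplification: nested spans end at the first closing tag.
            self.result = "".join(self._chunks).strip()
            self._in_span = False
        elif tag == "div" and self._div_depth > 0:
            self._div_depth -= 1

    def handle_data(self, data):
        if self._in_span and self.result is None:
            self._chunks.append(data)


def first_span_text(html, wanted_classes):
    """Return the first span's text, or None if nothing matches (hypothetical helper)."""
    parser = FirstSpanText(wanted_classes)
    parser.feed(html)
    parser.close()
    return parser.result
```

Applied to the fragment from the question with the classes vk_ans, vk_bk and curtgt, this returns the string "1.68", which can then be converted to a number.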

Replacing string in html dynamically in Android

I am using loadDataWithBaseURL(...) to load an HTML file, stored in assets, into a WebView. The file contains the string "Loading..." and a rotating GIF. The string "Loading..." is hard-coded, so it will not be localized. How can I replace that string dynamically so that it can be localized?
Please help me to resolve this.
There are various solutions I can think of:
Load a different asset file according to the current language (get the current language using Locale.getDefault()). This way you can translate your HTML files independently.
Use placeholders in your HTML file (for instance #loading_message#), then load the asset file into a String, replace all occurrences of the placeholder with the appropriate localized message (String.replaceAll("#loading_message#", getText(R.string.loading_message).toString())), and finally load the processed HTML into the WebView using the loadData(String data, String mimeType, String encoding) function.
To load the asset file, you can do something like this:
// Assets are not regular files, so open them through the AssetManager
// (assumes a Context named context is in scope).
InputStream is = context.getAssets().open("my_file.html");
BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
StringBuilder sb = new StringBuilder();
String eachLine = br.readLine();
while (eachLine != null) {
    sb.append(eachLine);
    sb.append("\n");
    eachLine = br.readLine();
}
br.close();
// sb.toString() is your HTML file as a String
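The placeholder approach from the second option can be sketched as follows. This is Python rather than Android code, and localize_html and the messages dictionary are names invented for the example; on Android the localized text would come from getText(R.string.loading_message) as described above.

```python
import html


def localize_html(template, messages):
    """Replace #key# placeholders in an HTML template (hypothetical helper)."""
    for key, value in messages.items():
        # Assumption: messages are plain text, so they are HTML-escaped
        # before being placed into the markup.
        template = template.replace("#" + key + "#", html.escape(value))
    return template
```

For instance, a template containing #loading_message# and a messages dictionary with the key loading_message produce the page with the translated text in place of the placeholder.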
I had a similar problem when using the WebView to show help text that should be translated.
My solution was to add multiple translated HTML files in assets and loading them with:
webView.loadUrl("file:///android_asset/" + getResources().getString(R.string.help_file));
For more details go to: Language specific HTML Help in Android
String str = "Loading ...";
String newStr = str.substring("Loading ".length());
newStr = context.getString(R.string.loading) + newStr;
I hope the code is clear enough to convey the idea: extract the string without "Loading " and concatenate it with the localized version of the "Loading" string.