Extract money value from euro sign using regex in R

I have a column which gives the average daily rate of a room in euros.
FYI: When I load the CSV into R, the euro sign turns into \u0080.
Question: how do I apply a regex to extract the numeric value for the entire column? My column is called: train$average_daily_rate
This is what it looks like:
"99 \u0080"
"113.53 \u0080"
"81.82 \u0080"
I want my output to be:
99
113.53
81.82
I have no idea
I thought something like this... "(\d+.\d)\s\s[0-9a-fxA-FX]+"

You should apply this expression to each value: ^.+?(?= \u0080|$), where (?= ...) is a positive lookahead, " \u0080" is the part you don't want to capture (if the backslash is written literally inside a string it has to be escaped, hence the doubled \\), and |$ lets the match also stop at the end of the string when that suffix is missing.
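To make the lookahead concrete, here is a minimal sketch in Python rather than R (the regex itself is the one from the answer). The sample list and the helper name extract_rate are assumptions for illustration only; "\u0080" stands for the mangled euro sign from the question.

```python
import re

# Hypothetical sample values mirroring the question.
rates = ["99 \u0080", "113.53 \u0080", "81.82 \u0080"]

# Lazily match from the start of the string and stop right before
# " \u0080", or at the end of the string if that suffix is absent.
pattern = re.compile("^.+?(?= \u0080|$)")

def extract_rate(value):
    """Return the numeric part as a float, or None if nothing matches."""
    match = pattern.match(value)
    return float(match.group(0)) if match else None

values = [extract_rate(r) for r in rates]
print(values)
```

The lookahead only checks for the suffix without consuming it, so the matched text is just the number, which can then be converted to a numeric type.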
PS: Make sure to use meaningful tags for future questions, in this case for example #r and/or #regex instead of "extract"

Related

Julia box plots: columns whose CSV header contains spaces and parentheses are not read, but a one-word column title works fine

So here's the code in Julia
using CSV
using DataFrames
using PlotlyJS
df= CSV.read("path", DataFrame)
plot(df, x=:Age, kind="box")
#I DO get the box plot for this one, because in the csv that column is headed with "Age"
plot(df, x=:Annual Income (k$), kind="box")
ERROR: syntax: missing comma or ) in argument list
Stacktrace:
[1] top-level scope
# none:1
#here I get an error asking about syntax, but I don't understand since the x= part is exactly what the column is labeled. If I try 'x=:Annual' I get a box plot of nothing, but the column title is "Annual Income (k$)".
Help is greatly appreciated!
Reference: https://plotly.com/julia/box-plots/
Try:
plot(df, x=Symbol("Annual Income (k\$)"), kind="box")
The : syntax constructs a Symbol, but only up to the next space. So :Annual Income (k$) says to build the Symbol Symbol("Annual"), but then leaves the Income (k$) part dangling. Instead you can explicitly construct the Symbol yourself, as above.
The backslash before the $ symbol is because Julia uses $ usually for interpolation, and here we want to use the raw $ character itself. You can also do plot(df, x=Symbol(raw"Annual Income (k$)"), kind="box") instead, as no interpolation happens inside raw"" strings.

Function which removes html color from a string with sscanf

I have a dilemma: how can I write a format that removes this kind of color code from my string (e.g. {dd2e22}) using sscanf, which is the only function I want to use? The string provided will be some random text:
Te{dd2e22}xt is {3f53ec}here
The format I tried:
sscanf(buf,"%[^\{[0-9a-fA-F]{6,8}\}]s",output);
This isn't working; the result is only the first character, "T".
Try using the format specifier:
"%*6X"
Analysis:
% -- starts a format specifier.
* -- tells scanf not to assign field to variable.
6X -- says that the field is at most 6 hex digits.
See scanf format specifier
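Regarding "the result is only the first character, 'T'":
The next character is 'e', which belongs to the scanset (a scanset ends at the first ], so here it is effectively the characters \, {, [, 0-9, a-f and A-F) and therefore does not match the inverted set specified by '^'. scanf formats have no regex-style quantifiers such as {6,8}.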
This task can be achieved with a regular expression. The standard library provides you with appropriate tools in the <regex> header.
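As an illustration of the regular-expression route mentioned above, here is a minimal sketch. Python is swapped in purely for brevity (the question is about C and sscanf), and the function name strip_color_codes is hypothetical. It assumes a color code is a brace-enclosed run of 6 to 8 hex digits, as in the question's attempted pattern.

```python
import re

# A color code: "{", 6 to 8 hexadecimal digits, "}".
COLOR_CODE = re.compile(r"\{[0-9a-fA-F]{6,8}\}")

def strip_color_codes(text):
    """Remove every {rrggbb} / {rrggbbaa} style code from text."""
    return COLOR_CODE.sub("", text)

cleaned = strip_color_codes("Te{dd2e22}xt is {3f53ec}here")
print(cleaned)
```

Braces that do not enclose 6 to 8 hex digits are left untouched, which is the behavior the {6,8} quantifier in the question was aiming for.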

Select JSON values with special characters

I am looking to detect anomalies in my JSON values.
Here's an example of the data queried via jq:
"2014-03-26 01:58:00"
"9019549360"
"109092812_20150626"
"134670164"
""
"97695498"
"680561513"
I would like to display all the values that contain a - or a _, or are blank.
In other words, I'd like to display the following output:
"2014-03-26 01:58:00"
"109092812_20150626"
""
Now, I have tried the following:
select (. | contains("-","_"," "))'
This appears to work, but in order to make it more robust, I'd like to expand this to include all special characters.
Your query won't detect empty strings, and will possibly emit the same string more than once. It would be easier to use test, e.g.:
select( length==0 or test("[-_ ]") )
Note also that the preliminary '.' in your query is unnecessary.
Addendum
From one of the comments, it would appear that you will want to specify "[^a-zA-Z0-9]" or similar as the argument of test.
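The same selection logic (empty string, or at least one character outside a-z, A-Z, 0-9) can be sketched outside jq as follows. This is only an illustrative Python equivalent, not a jq program; the list values copies the sample data from the question and the name flagged is an assumption.

```python
import re

# Sample values from the question.
values = [
    "2014-03-26 01:58:00",
    "9019549360",
    "109092812_20150626",
    "134670164",
    "",
    "97695498",
    "680561513",
]

# Any single character that is not an ASCII letter or digit.
special = re.compile(r"[^a-zA-Z0-9]")

# Keep a value if it is empty or contains at least one special character.
flagged = [v for v in values if len(v) == 0 or special.search(v)]
print(flagged)
```

Each value is tested once, so nothing is emitted twice, and the explicit length check covers the empty string that a character class alone cannot match.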

Regex that allows numbers with commas and two decimals

I'm trying to make a number input field using the pattern attribute since the regular type number didn't support the validations I needed.
Essentially, I want to allow any numbers that make sense, including $, + or - at the start and a % at the end. Also, users should be able to separate their numbers with commas to avoid mistakes on long numbers, but this is not necessary and they should still be able to submit a long number without any type of separation. The field should also allow for decimals.
<input required pattern="[+-]?\$?\d+(,\d{3})*(\.\d+)?%?" type="text" />
I need to allow for the following examples:
Pass:
2000
-20%
2,000
$2,000.00
999,999,999,999,999,999,999.99
Fail:
123e9
Anything that has letters on it
This is the regex that I have so far, but it doesn't seem to work, even for the most basic numbers. I've been using scriptular to test my regex, but that doesn't seem to reflect the results of the actual HTML validation.
Regex: [+-]?\$?\d+(,\d{3})*(\.\d+)?%?
EDIT: For any Ruby on Rails devs, I realized one of my mistakes is that you must escape any backslashes in your regex when you are generating your text_field. So for example, the regex in the answer should look like (?:\\+|\\-|\\$)?\\d{1,}(?:\\,?\\d{3})*(?:\\.\\d+)?%?
Try with following regex.
Regex: (?:\+|\-|\$)?\d{1,}(?:\,?\d{3})*(?:\.\d+)?%?
Explanation:
(?:\+|\-|\$)? matches one of +, - or $ in front of the number; it is optional because the ? quantifier is used.
\d{1,} matches integer part even if it doesn't have ,.
(?:\,?\d{3})* matches multiple occurrences of comma separated digits if present.
(?:\.\d+)? matches optional decimal part.
%? matches optional % character in the end.
?: stands for non-capturing groups. It will match but won't store it for back-referencing.
Regex101 Demo
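A quick way to check the answer's regex against the examples from the question is a whole-string match, which mirrors how the HTML pattern attribute requires the entire value to match. The sketch below uses Python's re.fullmatch for that; the lists should_pass and should_fail and the helper is_valid are names made up for this illustration ("abc" stands in for "anything that has letters").

```python
import re

# The regex from the answer, unchanged.
NUMBER = re.compile(r"(?:\+|\-|\$)?\d{1,}(?:\,?\d{3})*(?:\.\d+)?%?")

should_pass = ["2000", "-20%", "2,000", "$2,000.00",
               "999,999,999,999,999,999,999.99"]
should_fail = ["123e9", "abc"]

def is_valid(text):
    """True if the entire string matches, as the pattern attribute requires."""
    return NUMBER.fullmatch(text) is not None

for text in should_pass + should_fail:
    print(text, is_valid(text))
```

Note that the optional prefix group accepts only one of +, - or $, so an input combining a sign and a dollar symbol (such as -$5) would be rejected by this pattern.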

Finding a string between two strings in a file

This is a bit of a .json file I need to find information in:
"title":
"Spring bank holiday","date":"2012-06-04","notes":"Substitute day","bunting":true},
{"title":"Queen\u2019s Diamond Jubilee","date":"2012-06-05","notes":"Extra bank holiday","bunting":true},
{"title":"Summer bank holiday","date":"2012-08-27","notes":"","bunting":true},
{"title":"Christmas Day","date":"2012-12-25","notes":"","bunting":true},
{"title":"Boxing Day","date":"2012-12-26","notes":"","bunting":true},
{"title":"New Year\u2019s Day","date":"2013-01-01","notes":"","bunting":true},
{"title":"Good Friday","date":"2013-03-29","notes":"","bunting":false},
{"title":"
The file is much longer, but it is one long line of text.
I would like to display what bank holiday it is after a certain date, and also if it involves bunting.
I've tried grep and sed but I can't figure it out.
I'd like something like this:
[command] between [date] and [}] display [title] and [bunting]/[no bunting]
[title] should be just "Christmas Day" or something else
Forgot to mention:
I would like to achieve this in bash shell, either from the prompt or from a short bit of code.
You should use a proper JSON parser in a decent programming language, then you can do a lot of work in a safe way without too much code. How about this little Python code:
#!/usr/bin/env python3
import json

# Parse the file with a real JSON parser (assumes the top level is a list of holidays).
with open('my.json') as jsonFile:
    holidays = json.load(jsonFile)

for holiday in holidays:
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if holiday['date'] > '2012-05-06':
        print(holiday['date'], ':', holiday['title'],
              "bunting" if holiday['bunting'] else "no bunting")
        break  # in case you only want one line of output
I could not figure out what exactly the output should be; if you can be more specific, I can adjust my example.
You can try this with awk:
awk -F"}," '{for(i=1;i<=NF;i++){print $i}}' file.json | awk -F"\"[:,]\"?" '$4>"2013-01-01"{printf "%s:%s:%s\n" ,$2,$4,$8}'
Since the JSON file is one long string, we first split this line into multiple JSON records on },. Then each individual record is split on a combination of ":, characters with an optional closing ". We then only output the line if it is after a certain date.
This will find all records after Jan 1 2013.
EDIT:
The 2nd awk splits each individual json record into key-value pairs using a sub-string starting with ", followed by either a : or ,, and an optional ending ".
So in your example it will split on either ",", ":" or ":.
All odd fields are keys, and all even fields are values (hence $4 being the date in your example). We then check if $4(date) is after 2013-01-01.
I noticed I made a mistake on the optional " in the split (it should be followed by ? instead of *), which I have now corrected, and I also used the printf function to display the values.
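To see which field ends up where, the second split can be reproduced on a single record. The sketch below uses Python's re.split with the same separator as the awk -F argument; it is only meant to illustrate the field numbering, and the variable names are assumptions. The record is one of the lines from the question with the trailing }, already removed by the first split.

```python
import re

# One record, as it looks after the first split on "},".
record = '{"title":"Christmas Day","date":"2012-12-25","notes":"","bunting":true'

# Same separator as awk's -F: a quote, then ":" or ",", then an optional quote.
fields = re.split(r'"[:,]"?', record)

# awk fields are 1-based and Python lists are 0-based:
# $2 -> fields[1] (title), $4 -> fields[3] (date), $8 -> fields[7] (bunting).
title, date, bunting = fields[1], fields[3], fields[7]
print(title, date, bunting)
```

The empty notes value still produces its own (empty) field, which is why the bunting flag stays at position 8 in awk regardless of whether notes are present.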