I am trying to scrape prices from any given URL. I am using CsQuery and, for the life of me, I cannot figure out the best way to find all items on a page that might be a price. A bonus would be figuring out the most likely price by the size / color of the text and how close it is to the top of the page. I was thinking of maybe looking at a Regex solution, but I am not sure if that is the correct way to go with CsQuery.
Well, if a currency sign is present, you might do something like:
(?:\$|\£)(\d+(?!\d*,\d)|\d{1,3}((, ?)\d{3}?)?(\3\d{3}?){0,4})(\.\d{1,2})?(?=[^\d,]|, (?!\d{3,})|$)
(?:\$|\£) -- matches literal currency symbols. You can remove this
if you can't count on the presence of currency symbols,
but it's a great anchor if you can
(\d+ -- matches one or more digits
(?!\d*,\d) as long as they are not followed by a comma and a digit
|
\d{1,3} -- otherwise matches between 1 and 3 digits
(
(, ?) -- looks for a comma followed by a possible space
captures as \3
\d{3}?) -- followed by 3 digits
? -- zero or one times
(\3 -- looks for the same pattern of comma with or without space
\d{3}? -- followed by 3 digits
){0,4}) -- between 0 and 4 times, more on that below
(\. -- literal period
\d{1,2} -- followed by one or two digits
)? -- zero or one times (so, optional)
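(?=[^\d,]|, (?!\d{3,})|$) -- lookahead: the match must be followed by a character that is neither a digit nor a comma, by a comma and space not followed by three or more digits, or by the end of the string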
Another thing you might do is limit how many comma groups can repeat, which may help weed out high numbers that aren't likely prices. If you're not expecting anything over 999,999, you might do this (but if you're dealing with foreign currencies, inflation has made some astronomically high--a loaf of bread in Zimbabwe costs fifty million).
For easy reading, I'll show you how to limit the repetitions to 7.
Change the 4 (the only 4 in the whole regex) to 6 (the number you want minus 1, because one comma group is matched beforehand to establish the comma pattern).
(?:\$|\£)(\d+(?!\d*,\d)|\d{1,3}((, ?)\d{3}?)?(\3\d{3}?){0,6})(\.\d{1,2})?(?=[^\d,]|, (?!\d{3,})|$)
You can see this in action at: https://regex101.com/r/oU2nW2/1
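If you want to try the pattern outside regex101, here is a minimal sketch in Python rather than C#/CsQuery. It assumes you have already extracted the page text into a string; `find_prices` and the sample text are made up for illustration, and the only change to the regex is writing `£` without the backslash.

```python
import re

# The 4-repetition pattern from above, split over several raw strings.
PRICE_RE = re.compile(
    r"(?:\$|£)"
    r"(\d+(?!\d*,\d)|\d{1,3}((, ?)\d{3}?)?(\3\d{3}?){0,4})"
    r"(\.\d{1,2})?"
    r"(?=[^\d,]|, (?!\d{3,})|$)"
)


def find_prices(text):
    """Return every substring of text that the pattern accepts as a price."""
    # finditer + group(0) yields whole matches; findall would return the
    # capture groups instead.
    return [m.group(0) for m in PRICE_RE.finditer(text)]


sample = "Was $1,234.56, now £999.99 only"  # hypothetical page text
print(find_prices(sample))
```

Ranking the candidates by font size, color, or position on the page would still need information from the DOM, which the regex alone cannot give you.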
I have column in tableau with following values:
1234
3456
6789
camp-1
camp-2
camp-3
I only want to show filter with values
camp-1
camp-2
camp-3
How can I only select the alphabetic values in filter in tableau?
Your example is not clear about what you want to include and what you want to exclude. To explain better, I put together a more elaborate example.
Case-1 If you want to search/filter for digits at start, use this calculated field
REGEXP_MATCH([Field1], '^[0-9]')
Case-2 If you want to search for numbers anywhere, use this
REGEXP_MATCH([Field1], '(.*)[0-9]')
Case-3 If digits only are required
REGEXP_MATCH([Field1], '^[0-9]+$')
Case-4 If you want a letter at the start, use this
REGEXP_MATCH([Field1], '^[:alpha:]')
Results of all matches are shown below
Note: By combining "numbers anywhere" AND "alphabet at start", you can filter out Case-1, Case-2 and Case-3 only.
Good Luck
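To sanity-check the four cases outside Tableau, here is a rough Python sketch run against the values from the question. It assumes `REGEXP_MATCH` behaves like a substring search (Python's `re.search`), and it swaps `[:alpha:]` for `[A-Za-z]` because Python's `re` module has no POSIX classes. The names `PATTERNS` and `matching` are made up for illustration.

```python
import re

# Assumed Python counterparts of the four Tableau patterns above.
PATTERNS = {
    "digit at start": r"^[0-9]",
    "digit anywhere": r"[0-9]",
    "digits only": r"^[0-9]+$",
    "letter at start": r"^[A-Za-z]",  # stands in for [:alpha:]
}

values = ["1234", "3456", "6789", "camp-1", "camp-2", "camp-3"]


def matching(name):
    """Return the values for which the named pattern finds a match."""
    return [v for v in values if re.search(PATTERNS[name], v)]


print(matching("letter at start"))
```

For the data in the question, the "letter at start" case keeps exactly the three camp-N values.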
If the Tableau column contains a mixture of numbers and text, the column will be a text column and all content will be considered as text. This reduces the problem to that of identifying specific rows that contain non-numeric values.
This requires some string manipulation and comparison. If you know that the structure of the content in those rows is predictable (eg the first character is always a letter when there are non numeric characters in the row) then a simple equation will filter on those rows:
IF ASCII(LEFT([Text And Numbers], 1)) > 57 THEN 'text' ELSE 'number' END
This exploits the observation that the ASCII decimal code for the digit 9 is 57 and most of the ASCII characters with higher codes are letters or punctuation (which is a fair assumption if nothing other than numbers, letters or punctuation are present in your data).
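The same first-character test can be sketched in Python to see where it holds and where it does not. `classify` is a made-up name, and the sketch assumes non-empty strings.

```python
def classify(value):
    """Label a value 'text' if its first character sorts after '9' in ASCII."""
    # ord('9') == 57; uppercase letters start at 65 and lowercase at 97.
    return "text" if ord(value[0]) > 57 else "number"


print(classify("camp-1"), classify("1234"))
```

Note that characters such as ':' (code 58) also land on the 'text' side, and a leading '-' (code 45) lands on the 'number' side, which is why the approach depends on the data being predictable.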
Obviously, if letters and numbers could appear anywhere in the string, you need a more complex function. Tableau provides the option to use regular expressions, which can express much more complex text analysis, such as whether any alphabetic character is present in a string (see this for some ideas of the appropriate regular expressions).
I'm trying to cleanse a data set from erroneous phone number entries. Having trouble making the regular expression for the filter in MySQL.
The structure is the following:
First digit is in 2-9
Second and third digits can be any numeral except they may not be the same number
Fourth digit is in 2-9
Fifth and sixth digits can be any numeral except '11'
I've landed on a few rather elaborate regular expressions which didn't quite work, but I'm sure there is a simpler approach.
A "valid" number might look like:
2028658680
7137038891
My filter usually misses cases such as:
6778914351
7777777777
6178116678
Note that these numbers are completely made up.
This is possible, but it will be long and ugly. With a more robust regex engine you can do lookaround and even conditional statements, but MySQL doesn't support such things as far as I know.
^[2-9](?:0[1-9]|1[02-9]|2[013-9]|3[0-24-9]|4[0-35-9]|5[0-46-9]|6[0-57-9]|7[0-689]|8[0-79]|9[0-8])[2-9](?:1[02-9]|[02-9]1|[02-9]{2})[0-9]{4}$
https://regex101.com/r/qPuS5W/1
Explanation:
[2-9] First digit is any number from 2 to 9.
(?:0[1-9]|1[02-9]|2[013-9]|3[0-24-9]|4[0-35-9]|5[0-46-9]|6[0-57-9]|7[0-689]|8[0-79]|9[0-8]) Non capturing group that contains 10 alternatives starting with each number 0 to 9 followed by any number except that number.
(?:1[02-9]|[02-9]1|[02-9]{2}) Non capturing group that matches either 1 followed by a number that isn't 1, a number that isn't 1 followed by 1, or two numbers that aren't 1.
[0-9]{4} 4 of any number.
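A quick way to check the pattern against the numbers from the question is a small script. This is only a sketch in Python, not MySQL; `is_valid` is a made-up helper name, and the regex is the one above split across lines.

```python
import re

PHONE_RE = re.compile(
    r"^[2-9]"                                        # first digit 2-9
    r"(?:0[1-9]|1[02-9]|2[013-9]|3[0-24-9]|4[0-35-9]"
    r"|5[0-46-9]|6[0-57-9]|7[0-689]|8[0-79]|9[0-8])"  # two different digits
    r"[2-9]"                                         # fourth digit 2-9
    r"(?:1[02-9]|[02-9]1|[02-9]{2})"                 # anything except 11
    r"[0-9]{4}$"                                     # last four digits
)


def is_valid(number):
    """Return True if the 10-digit string passes all rules from the question."""
    return PHONE_RE.match(number) is not None


for n in ["2028658680", "7137038891", "6778914351", "7777777777", "6178116678"]:
    print(n, is_valid(n))
```

The first two numbers pass and the last three are rejected, which matches the "valid" and "missed" examples in the question.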
I am trying to write one single formula to identify all the patterns in a column/field. For example: Below are the five different patterns
AG 5643 895468 UWEB
7546 695321 IJJK
PE 45612384
8642567921
16724385
Formula for
First pattern: Contains a 4-digit number followed by a 6-digit number
'*[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] *' This is not working. Can we specify the length? Something like this [0-9]{4} - 4 digit number?
First pattern should pick second one also.
3rd one: first 2 characters are letters, followed by an 8- or 10-digit number
4th one: 10 digit number
5th one 8 digit number
Thanks in advance!
If you're working in MySQL you can use regular expressions with the RLIKE filter operator.
For example, WHERE text RLIKE '[0-9]{8}' finds all the rows with any consecutive sequence of eight digits in them anywhere. (http://sqlfiddle.com/#!9/44996/1/0)
WHERE text RLIKE '^[0-9]{8}$' finds the rows consisting of nothing but an eight-digit sequence. (http://sqlfiddle.com/#!9/44996/2/0)
WHERE text RLIKE '^[0-9A-Z]{2} ' finds the rows starting with two letters or digits and then a space. (http://sqlfiddle.com/#!9/44996/3/0)
You get the idea. Regular expressions have a lot of power to them, generally beyond the scope of a SO answer to explain. Beware, though. There is a common saying: if you solve a problem with a regular expression, now you have two problems. You need to be careful with them.
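If you'd like to experiment with patterns before putting them into a query, a small script over the five sample rows from the question works well. This sketch uses Python's `re.search` as a stand-in for RLIKE, with the whole-row pattern anchored by `^` and `$`; `rows_matching` is a made-up helper, and the last pattern (`{4}` then a space then `{6}`) is my guess at what the "first pattern" in the question needs.

```python
import re

rows = [
    "AG 5643 895468 UWEB",
    "7546 695321 IJJK",
    "PE 45612384",
    "8642567921",
    "16724385",
]


def rows_matching(pattern):
    """Return the rows in which the pattern matches somewhere."""
    return [r for r in rows if re.search(pattern, r)]


print(rows_matching(r"[0-9]{8}"))           # eight digits in a row, anywhere
print(rows_matching(r"^[0-9]{8}$"))         # nothing but eight digits
print(rows_matching(r"^[0-9A-Z]{2} "))      # two letters/digits, then a space
print(rows_matching(r"[0-9]{4} [0-9]{6}"))  # 4 digits, space, 6 digits
```

The `{n}` repetition count answers the "can we specify the length?" part of the question: `[0-9]{4}` means exactly four digits.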
For an application I'm currently building I need a database to store books. The schema of the books table should contain the following attributes:
id, isbn10, isbn13, title, summary
What data types should I use for ISBN10 and ISBN13? My first thought was a biginteger, but I've read some unsubstantiated comments that say I should use a varchar.
You'll want a CHAR/VARCHAR (CHAR is probably the best choice, as you know the length - 10 and 13 characters). Numeric types like INTEGER will remove leading zeroes in ISBNs like 0-684-84328-5.
ISBN numbers should be stored as strings, varchar(17) for instance.
You need 17 characters for ISBN13 (13 digits plus 4 hyphens) and 13 characters for ISBN10 (10 digits plus 3 hyphens).
ISBN10
ISBN10 numbers, though called "numbers", may contain the letter X. The last number in an ISBN number is a check digit that spans from 0-10, and 10 is represented as X. Plus, they might begin with a double 0, such as 0062472100, and as a numeric format, it might get the leading 00 removed once stored.
84-7844-453-X is a valid ISBN10 number, in which 84 means Spain, 7844 is the publisher's number, 453 is the book number and X (i.e., 10) is the control digit. If we remove the hyphens, we mix the publisher with the book id. Is that really important? It depends on how you'll use the number. Bibliographic researchers (I've found myself in that situation) might need it for many reasons that I won't go into here, since it has nothing to do with storing data. I would advise against removing hyphens, but the truth is everyone does it.
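To make the role of the X concrete, here is a small sketch of the standard ISBN10 check: each character is weighted 10 down to 1 and the weighted sum must be divisible by 11. The function name `isbn10_is_valid` is made up for illustration, and the input is assumed to be a plain string with optional hyphens.

```python
def isbn10_is_valid(isbn):
    """Check an ISBN10 string (hyphens allowed) against its check digit."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10          # X is only allowed as the check digit
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(isbn10_is_valid("84-7844-453-X"))
print(isbn10_is_valid("0062472100"))
```

Both examples only work because the value is kept as a string: the first needs the X, the second needs its two leading zeroes.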
ISBN13
ISBN13 faces the same issues regarding meaning, in that, with the hyphens you get 4 blocks of meaningful data, without them, language, publisher and book id would become lost.
Nevertheless, the control digit will only be 0-9; there will never be a letter. But should you feel tempted to store only ISBN13 numbers (since ISBN10 can automatically and without fail be upgraded to ISBN13) and use int for that, you could run into some issues in the future. All ISBN13 numbers begin with 978 or 979, but in the future some 078 might be added.
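For reference, the upgrade from ISBN10 to ISBN13 can be sketched as follows: prefix 978, drop the old check digit, and compute a new one with alternating weights 1 and 3 modulo 10. The function name is made up, and the sketch assumes the input is already a valid ISBN10.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN10 string (hyphens allowed) to an unhyphenated ISBN13."""
    digits = "978" + isbn10.replace("-", "")[:9]  # drop the old check digit
    total = sum(
        int(d) * (1 if i % 2 == 0 else 3)        # weights 1, 3, 1, 3, ...
        for i, d in enumerate(digits)
    )
    return digits + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0-684-84328-5"))
print(len(str(int("0684843285"))))  # an integer keeps only 9 of the 10 digits
```

The second line shows the leading-zero problem in isolation: once the ISBN10 has been through an integer type, one digit is gone.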
A light explanation about ISBN13
A deeper explanation of ISBN numbers
Sorry for the difficult question.
I have a large set of sequences to be corrected by either adding digits or replacing them (never removing anything) that looks like this:
1,2,,3 => 1,7,4,3
4,,5,6 => 4,4,5,6
4,7,8,9 => 4,7,8,9,1
4,7 => 4,8
4,7,1 => 4,7,2
It starts with a padded original sequence, and a sample correction.
I'd like to be able to work on correcting the sequences automatically by calculating the frequencies of the different n-grams being corrected, the first sample would become
1=>1
2=>7
3=>3
1,2=>1,7
2,3=>7,4,3
1,2,3=>1,7,4,3
I'd collect the frequency of these n-grams corrections, and I'm looking for a way to calculate the best way to correct a new input that may or may not be in the sample data.
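To make the n-gram extraction concrete, here is a rough sketch of how one padded sample could be turned into correction pairs and counted. It assumes the padded original and the correction line up position by position, and that any items appended past the padded length belong to spans reaching the last original position; both are assumptions on my part, and `ngram_corrections` is a made-up name.

```python
from collections import Counter


def ngram_corrections(padded, corrected):
    """Yield (source n-gram, replacement) pairs for one aligned sample."""
    src = padded.split(",")
    dst = corrected.split(",")
    last = len(src) - 1
    for i in range(len(src)):
        if src[i] == "":
            continue  # an n-gram must start on an original item
        for j in range(i, len(src)):
            if src[j] == "":
                continue  # and it must end on one as well
            source = tuple(x for x in src[i:j + 1] if x != "")
            # Assumption: appended items attach to spans ending at the
            # last original position.
            end = len(dst) if j == last else j + 1
            yield source, tuple(dst[i:end])


counts = Counter()
for padded, corrected in [("1,2,,3", "1,7,4,3"), ("4,7", "4,8")]:
    counts.update(ngram_corrections(padded, corrected))
print(counts)
```

For the first sample this produces the six pairs listed above.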
This seems to be similar to statistical machine translation (SMT).
Assign known replacements a score, based on the length of the replacement and the number of occurrences. Naively, I would suggest making this score proportional to the square of the length (longer matches being rarer, in most scenarios I can think of) and the square root of the number of occurrences, such that a 4-item sequence has as much weight as a 2-item sequence that occurs 16 times as often. This would need to be adjusted based on your actual situation.
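Given a sequence of length M, there are N substrings of lengths 1 to M, where N=M*(M+1)/2, so if the strings are reasonably short then you could iterate over every substring and look up possible replacements. The number of ways to compose the whole string out of these substrings is 2^(M-1), which grows exponentially, but because the total score is a sum over substrings, the best composition can be found with dynamic programming in time proportional to M^2 without enumerating them all.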
For every possible composition of the original string by substrings, add up the total score of the best (highest score) replacement for each substring.
The composition with the highest total score will be (potentially, given my assumptions about the process) the "best" post-replacement result.
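A minimal sketch of this scoring, under the naive weighting suggested above (length squared times the square root of the count). It uses dynamic programming over split points instead of listing every composition, which gives the same winner because the total is a sum of per-substring scores. The table layout, the function name, and the rule that unknown single items pass through unchanged with score 0 are all assumptions made for illustration.

```python
import math


def best_correction(seq, table):
    """seq: tuple of items; table: {source n-gram: {replacement: count}}."""

    def best_replacement(gram):
        options = table.get(gram)
        if options:
            repl, count = max(options.items(), key=lambda kv: kv[1])
            return len(gram) ** 2 * math.sqrt(count), repl
        if len(gram) == 1:
            return 0.0, gram  # assumption: unknown items are left as they are
        return None           # unknown longer n-grams cannot be used

    n = len(seq)
    # best[k] holds (total score, corrected prefix) for the first k items.
    best = [(0.0, ())] + [None] * n
    for end in range(1, n + 1):
        for start in range(end):
            candidate = best_replacement(seq[start:end])
            if candidate is None:
                continue
            total = best[start][0] + candidate[0]
            if best[end] is None or total > best[end][0]:
                best[end] = (total, best[start][1] + candidate[1])
    return best[n][1]


table = {
    ("4",): {("4",): 1},
    ("7",): {("8",): 1},
    ("4", "7"): {("4", "8"): 16},
    ("1", "2", "3"): {("1", "7", "4", "3"): 1},
}
print(best_correction(("4", "7", "5"), table))
print(best_correction(("1", "2", "3"), table))
```

The weights are the part to tune: with real data you would adjust the exponents until long, rare matches and short, frequent ones balance the way you expect.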