Get text from URL list to CSV

I need to retrieve text from a list of URLs.
I have a CSV (about 150,000 rows) with an ID and a URL. Each URL returns just plain text, without HTML markup.
I need to write this text into an output CSV together with the ID from the input CSV.
Is this possible with wget, for example?
Input CSV
9788075020536|http://pemic-books.cz/ASPX/Annotation.aspx?kod=0180853
Output CSV
9788075020536|Učebnice je dílem kolektivu autorů katedry ústavního práva Právnické fakulty Univerzity Karlovy v Praze a externích spolupracovníků. V souladu s tradičním pojetím ústavního práva je obecná státověda podávána jako jeho vstupní a neoddělitelná součást. Kniha je reprintem původního vydání z roku 1998, v nakladatelství Leges vychází poprvé. Na učebnici navazuje Ústavní právo a státověda, 2. díl, Ústavní právo České republiky, který byl vydání nakladatelstvím Leges v roce 2011
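For the two-column input above, a minimal sketch using plain bash and curl (input.csv and output.csv are placeholder names; it assumes the separator is | and that each URL returns a single line of text):
while IFS='|' read -r id url; do
    text=$(curl -s "$url")          # fetch the plain-text body for this ID
    printf '%s|%s\n' "$id" "$text"  # write ID|text to the output CSV
done < input.csv > output.csv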

Suppose you have the following columns in a file called curlcsv:
0001|columnbefore1|https://www.random.org/integers/?num=1&min=1&max=2&col=1&base=10&format=plain&rnd=new|columnafter1
0002|columnbefore2|https://www.random.org/integers/?num=1&min=3&max=4&col=1&base=10&format=plain&rnd=new|columnafter2
0003|columnbefore3|https://www.random.org/integers/?num=1&min=5&max=6&col=1&base=10&format=plain&rnd=new|columnafter3
Here are the "one-liners" you could use:
gawk:
gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
print arr[1]""res""arr[4];
}
' curlcsv >result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column; see the adapted sketch after the result example below)
([^|]+) - get the contents of the url column
([|][^|]+)* - save the rest column values
result file would look like this:
0001|columnbefore1|2|columnafter1
0002|columnbefore2|3|columnafter2
0003|columnbefore3|5|columnafter3
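Adapted to the two-column input from the original question (ID|URL with nothing after the URL), the same idea would look roughly like this (a sketch; the {2} becomes {1}, the trailing group is dropped, and input.csv stands for the question's file):
gawk ' {
match($0, /^(([^|]+[|]){1})([^|]+)$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
print arr[1]""res;
}
' input.csv >result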
This approach would hit limits on open files (see JaromírHeimlich's comment below).
A solution to this limitation could be:
split -l 100 curlcsv && ls | grep -v curlcsv | xargs -n 1 gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
print arr[1]""res""arr[4];
}
' >>../result
Place curlcsv in an empty folder, since split will create a lot of partial list files in that directory.
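Another way around the open-files limit (a sketch, not part of the original answer) is to close the curl pipe inside gawk after each read, so the file descriptor is released before the next row:
gawk ' {
match($0, /^(([^|]+[|]){2})([^|]+)([|][^|]+)*$/, arr);
req = "curl -s \""arr[3]"\"";
req | getline res;
close(req);   # release the pipe before starting the next curl
print arr[1]""res""arr[4];
}
' curlcsv >result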
sed:
cat curlcsv | sed -e 's/^\(\([^|]\+[|]\)\{2\}\)\([^|]\+\)\([|][^|]\+\)*$/echo "\1"$(curl -s "\3")"\4"/' | bash >result
/^(([^|]+[|]){2}) - 2 here means skip 2 columns (in your case skip 1 column)
In this example sed constructs the bash script to get the result.
Since this solution generates bash commands, it doesn't have the limitation problem of the gawk solution.
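For the first line of curlcsv, the bash command that sed generates (before bash executes it) would look roughly like this:
echo "0001|columnbefore1|"$(curl -s "https://www.random.org/integers/?num=1&min=1&max=2&col=1&base=10&format=plain&rnd=new")"|columnafter1"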

Related

Data extraction for specific string

I have a long list of JSON data, with repeats of contents similar to the following.
Because the original JSON file is too long, I will just share the hyperlink here. This is a result generated from a database called RegulomeDB.
Direct link to the JSON file
I would like to extract specific data (eQTLs) from "method": "eQTLs" and "value": "xxxx", and put them into 2 columns (tab delimited) exactly like below.
Note: "value":"xxxx" is extracted right after "method": "eQTLs"is detected.
eQTLs firstResult, secondResult, thirdResult, ...
In this example, the desired output is:
eQTLs EIF3S8, EIF3CL
I've tried using a python script but was unsuccessful.
import json
with open('file.json') as f:
    f_json = json.load(f)
print 'f_json[0]['"method": "eQTLs"'] + "\t" + f_json[0]["value"]
Thank you for your kind help.
Maybe you'll find the JSON parser xidel useful. It can open URLs and can manipulate strings any way you want:
$ xidel -s "https://regulomedb.org/regulome-search/?regions=chr16:28539847-28539848&genome=GRCh37&format=json" \
-e '"eQTLs "||join($json("#graph")()[method="eQTLs"]/value,", ")'
eQTLs EIF3S8, EIF3CL
Or with the XPath/XQuery 3.1 syntax:
-e '"eQTLs "||join($json?"#graph"?*[method="eQTLs"]?value,", ")'
Try this:
cat file.json | grep -iE '"method":\s*"eQTLs"[^}]*' -o | cut -d ',' -f 1,5 | sed -r 's/"|:|method|value//gi' | sed 's/\s*eqtls,\s*//gi' | tr '\n' ',' | sed 's/,$/\n/g' | sed 's/,/, /g' | xargs echo -e 'eQTLs\x09'

Arithmetic in web scraping in a shell

So, I have the example code here:
#!/bin/bash
clear
curl -s https://www.cnbcindonesia.com/market-data/currencies/IDR=/USD-IDR |
html2text |
sed -n '/USD\/IDR/,$p' |
sed -n '/Last updated/q;p' |
tail -n-1 |
head -c+6 && printf "\n"
exit 0
This should print out a number in the range 14000~15000.
Let's start with the very basic one: what do I have to do in order to print the result + 1? So if the printout is 14000, incrementing it by 1 gives 14001. I suppose the result of html2text is not directly usable for calculation, since it is string output rather than an integer.
The more advanced thing I want to know is how to calculate with the results of 2 curl calls.
What I would do, bash + xidel:
$ num=$(xidel -se '//div[@class="mark_val"]/span[1]/text()' 'https://url')
$ num=$((${num//,/}+1)) # num was 14050
$ echo $num
Output
14051
Explanations
$((...))
is an arithmetic substitution. After doing the arithmetic, the whole thing is replaced by the value of the expression. See http://mywiki.wooledge.org/ArithmeticExpression
Command Substitution: "$(cmd "foo bar")" causes the command 'cmd' to be executed with the argument 'foo bar' and "$(..)" will be replaced by the output. See http://mywiki.wooledge.org/BashFAQ/002 and http://mywiki.wooledge.org/CommandSubstitution
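To combine the results of 2 curl calls (the more advanced part of the question), the same two mechanisms nest; a sketch, with placeholder URLs, assuming each returns a single number (strip commas first, as above, if they contain any):
$ a=$(curl -s 'https://url1')
$ b=$(curl -s 'https://url2')
$ echo $(( ${a//,/} + ${b//,/} ))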
Bonus
You can compute directly in xidel (thanks Reino) using XQuery syntax:
$ xidel -s <url> -e 'replace(//div[@class="mark_val"]/span[1],",","") + 1'
And to do arithmetic with 2 values:
$ xidel -s <url> -e '
let $num:=replace(//div[@class="mark_val"]/span[1],",","")
return $num + $num
'

Extract column data from csv file based on row values

I am trying to use awk/sed to extract specific column data based on row values. My actual files have 15 columns and over 1,000 rows (from a .csv file).
Simple example: Input: a csv file with a total of 5 columns and 100 rows. Output: data from columns 2 through 5, based on specific row values from column 2. (I have a specific list of the row values I want to filter on. The values are numbers.)
File looks like this:
"Date","IdNo","Color","Height","Education"
"06/02/16","7438","Red","54","4"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
Recently Tried in AWK:
#!/usr/bin/awk -f
#I need to extract a full line when column 2 has a specific 5 digit value
awk '\
BEGIN { awk -F "," \
{
if ( $2 == "19650" ) { \
{print $1 "," $6} \
}
exit }
chmod u+x PPMDfUN.AWK
The operator response:
/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.998.AWK.command ; exit;
/usr/bin/awk: syntax error at source line 3 source file /private/var/folders/1_/drk_nwld48bb0vfvdm_d9n0h0000gq/T/PPMDfUN- 489939602.997.AWK
context is
awk >>> ' <<<
/usr/bin/awk: bailing out at source line 17
logout
Output example: I want the full lines where column 2 equals 7439 or 7500.
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"
here you go...
$ awk -F, -v q='"' '$2==q"7439"q' file
"06/02/16","7439","Yellow","57","3"
There is not much to explain, other than that the convenience variable q, defined to hold a double quote, helps to eliminate escaping.
awk -F, 'NR<2;$2~/7439|7500/' file
"Date","IdNo","Color","Height","Education"
"06/02/16","7439","Yellow","57","3"
"06/03/16","7500","Red","55","3"

I just want the last 3 characters of a column returned to the original file

First 2 lines of my data:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
I only want the last 3 characters of column 2 and column 3; I don't want the column headers affected.
I'm happy with a solution that does column 2 first and then column 3.
I am fiddling with sed and awk at the minute but have had no joy yet.
This is what I want:
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
Edit 1: this gives me the last 3 digits (plus a "); I just need to write this back to the original file?
$ awk -F"," 'NR>1{ print $2}' head_test_real.csv | sed 's/.*\(....\)/\1/'
427"
592"
007"
592"
409"
742"
387"
731"
556"
Edit 2: this works, but I lose the double quotes ("123427" becomes 427); I would like to keep the double quotes.
* NR>1 works on the rows after the 1st row.
$ awk -F, 'NR>1{$2=substr($2,length($2)-3,3)}1' OFS=, head_test_real.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06",427,"456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
Edit 3: @Mark, thanks for the correct answer; the following is just for my reference on the quoting options.
$ ####csv.QUOTE_ALL
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ####csv.QUOTE_MINIMAL
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
$ ###csv.QUOTE_NONNUMERIC
$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
$ ###csv.QUOTE_NONE
$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan
While awk seems like a natural fit for comma-separated data, it doesn't deal well with the quoted-fields version. I would recommend using a dedicated CSV-processing library like the one that ships with Python (both 2 and 3):
import csv
with open('in.csv','r') as infile:
    reader = csv.reader(infile)
    with open('out.csv','w') as outfile:
        writer = csv.writer(outfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_ALL)
        writer.writerow(next(reader))
        for row in reader:
            row[1] = row[1][-3:]
            row[2] = row[2][-3:]
            writer.writerow(row)
Put the above code into a file named e.g. fixcsv.py and make the filenames match what you have and want, then just run it with python fixcsv.py (or python3 fixcsv.py).
I set it to quote everything in the output (QUOTE_ALL); if you don't want it to do that, you can set it to QUOTE_MINIMAL, QUOTE_NONNUMERIC or QUOTE_NONE.
The row assignments replace the second and third fields (row[1] and row[2], since the first field is row[0]) with their last three characters ([-3:]). You could also do it arithmetically with e.g. row[1] = int(row[1]) % 1000.
$ awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' file
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"
As with any command, to write back to the original file is just:
command file > tmp && mv tmp file
Perl to the rescue!
perl -pe 's/",".*?(...",")/","$1/ if $. > 1' < input > output
-p reads the input line by line and prints the result
s/regex/replacement/ is a substitution
.*? matches anything (like .*), but the question mark makes it "frugal", i.e. it matches the shortest string possible
(...",") creates a capture group starting three characters before ",", it can be referenced as $1.
$. is the line number, no replacement happens on line 1.
Make sure the first two columns are always quoted and the second column is never shorter than 3 characters.
To modify the third column, you can modify the regex to
perl -pe 's/^("(?:.*?","){2}).*?(...",")/$1$2/ if $. > 1'
Modify the number in the {2} quantifier to handle any column you like.
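Applied to the sample line above, that column-3 variant should leave column 2 untouched and trim column 3 to its last three characters, producing:
"2015-10-06","123427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"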

package to query tab separated files in bash

I often have to run very simple queries on tab-separated files in bash, for example summing/counting/max/min of all the values in the n-th column. I usually do this in awk on the command line, but I've grown tired of re-writing the same one-line scripts over and over, and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory; obviously this will be necessary in some cases, but it doesn't work for me, as I often work with large files.
Wintermute's answer: Tools like q that can run SQL queries directly on CSVs.
Ed Morton's answer: see https://stackoverflow.com/a/15765479/1745001
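For reference, the three example queries from the question would map onto q roughly like this (a sketch using q's -t flag and its c1..cN column naming, as in the timing test above, and assuming test.txt really is tab-separated):
$ q -t "select sum(c3) from test.txt"
$ q -t "select sum(c3) from test.txt where c1 = 'apples'"
$ q -t "select c1, sum(c3) from test.txt group by c1"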