Count number of lines of code over time in sourcegraph? - sourcegraph

I want to build nice insights graph depicting how the lines count in a file changed over time.

This can be accomplished using the following query:
file:^path/to/file.ext$ \n \n patterntype:regexp
Shoutout to the Code Insights team for their help on this.

Related

Bug in appending CSV file by write.table?

I am running some simulations using R, which at the very end produce a matrix with the outcome. Since I run it under different conditions, I use append in the write.table command which writes it into a CSV file.
For a while, everything seemed to work fine. But then yesterday, something in the output CSV file seemed wrong: the order of the simulations was upside down, and one column looked unrealistic. Additionally, the column title written underneath the troublesome column got a "3" instead of the column name.
Since this was the result of the last simulation, and hence was still saved in the last matrix created, I could check it. The first columns of the output were fine, but then there was lack of correspondence between the last two columns in the file and the real output.
Writing the matrix into a file is the very last command on my program. Here is the command I use:
write.table(gen.dist, "FST2.csv", col.names=TRUE, row.names=F,sep=",", append=T)
gen.dist is a numeric matrix.
Has anyone encountered a similar problem?
Thank you

I need to remove a piece of every line in my json file

I have a json output on my notepad and i know it is not in the correct format. At the end of each line there is a time stamp which is causing the bad format. I want to get rid of it using find and replace since the file is pretty big. The format is as follows :
"eventtimestamp": "05 23 2017 04:01:02"}
The above piece comes in at the end of every line. How can i get rid of it using find a replace or any other way.
All help is appreciated.
Thank you
If you need to alter every line in a consistent way then regex find/replace is a good option. Free tools like atom.io, Notepad++, and plenty of others offer this feature.
Assuming "eventtimestamp" is constant, then a simple regex that says "find everything starting with "eventtimestamp" and up to a '}'" will work.
"eventtimestamp".*(?=})
And "replace" that with an empty string.
ps) here's a demo of the regex in regexr.com--hovering over the parts of the pattern will explain what they do.
If you are not sure that the eventtimestamp field always comes in at the end of a line and/or as the last element of the object, prefer that kind of pattern: "eventtimestamp":\s*"[^"]+",?.
Note the useful surrounded excepted character class pattern "[^"]+" that can be declined with any other delimiter.

creating a list from json csv file using python

I am sorry for asking this question, but i already look through but could not find the answer. I am honestly newbie.I am trying to generate a list of whole word from a json csv file. I already created a list of lines, but then i cannot use split() to generate new list containing separate word (later i need to count word occurrence).
My input file contains twitter information:
twitter data
i tried to write simple code:
myfile=open('fileName','r')
words=[]
for line in myfile:
words.append(line.split())
len(words)=82
I also tried reader=csv.reader(myFile) and reader=csv.DictReader(myFile)
but in all I can get each line, but how to further split the string/line into independent word. Sorry and thank you in advanced.
My data #I change to a different example as maybe last one was bad formatted:
id,flags,expiration,cas,value
493926581610364928,0,0,2635740904247446,"{""contributors"":null,""truncated"":false,""text"":""#xaaronh #blueredandgold If Namco Bandai's One Piece Unlimited World is anything to go by, no local retail release means no eShop either =\\"",""in_reply_to_status_id"":493925918998425600,""id"":493926581610364928,""favorite_count"":0,""source"":""Twitter Web Client"",""retweeted"":false,""coordinates"":null,""entities"":{""symbols"":[],""user_mentions"":[{""id"":139852376,""indices"":[0,8],""id_str"":""139852376"",""screen_name"":""xaaronh"",""name"":""Aaron""},{""id"":74393990,""indices"":[9,24],""id_str"":""74393990"",""screen_name"":""blueredandgold"",""name"":""Leigh""}],""hashtags"":[],""urls"":[]},""in_reply_to_screen_name"":""xaaronh"",""in_reply_to_user_id"":139852376,""retweet_count"":0,""id_str"":""493926581610364928"",""favorited"":false,""user"":{""follow_request_sent"":false,""profile_use_background_image"":true,""default_profile_image"":false,""id"":42302246,""profile_background_image_url_hp"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""verified"":false,""profile_text_color"":""333333"",""profile_image_url_https"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""profile_sidebar_fill_color"":""DDEEF6"",""entities"":{""description"":{""urls"":[{""url"":""hp:tttt"",""indices"":[67,89],""expanded_url"":""hp://infernalmonkey.com"",""display_url"":""infernalmonkey.com""}]}},""followers_count"":506,""profile_sidebar_border_color"":""000000"",""id_str"":""42302246"",""profile_background_color"":""1A1B1F"",""listed_count"":22,""is_translation_enabled"":false,""utc_offset"":36000,""statuses_count"":8676,""description"":""I probably tweet about video games and onaholes. Let's be friends! (NSFW)"",""friends_count"":261,""location"":""Sydney, Australia"",""profile_link_color"":""2FC2EF"",""profile_image_url"":""hp://pbs.twimg.com/profile_images/490791031487463424/udSldTQ3_normal.png"",""following"":false,""geo_enabled"":false,""profile_banner_url"":""hp://pbs.twimg.com/profile_banners/42302246/1406105444"",""profile_background_image_url"":""hp://pbs.twimg.com/profile_background_images/464279459932020736/v1xnMcrV.jpeg"",""screen_name"":""infernal_monkey"",""lang"":""en"",""profile_background_tile"":false,""favourites_count"":2018,""name"":""Lance McGill"",""notifications"":false,""url"":null,""created_at"":""Sun May 24 23:20:25 +0000 2009"",""contributors_enabled"":false,""time_zone"":""Sydney"",""protected"":false,""default_profile"":false,""is_translator"":false},""geo"":null,""in_reply_to_user_id_str"":""139852376"",""lang"":""en"",""_id"":""493926581610364928"",""created_at"":""Tue Jul 29 01:10:48 +0000 2014"",""in_reply_to_status_id_str"":""493925918998425600"",""place"":null,""metadata"":{""iso_language_code"":""en"",""result_type"":""recent""}}"
This is not the best solution, just an effort from a noob (me), definitely need further editing for better output. I am using windows OS.
import csv
import json
abc=[]
myList=[]
myDict={}
myFile=open('fileName.csv','r',encoding='utf-8')
myReader=csv.reader(myFile)
header=next(myReader)
for line in myReader:
abc=json.loads(line[4])
myDict=abc
myList.append(myDict['text'])
dct={}
for eachLine in myList:
item=eachLine.split()
for one in item:
if one in dct:
dct[one]+=1
else:
dct[one]=1
finalList=list(dct.items())
finalList.sort()

Search and Replace Text in CSV file using Python

I just started with Python 3.4.2 and trying to find and replace text in csv file.
In Details, Input.csv file contain below line:
0,0,0,13,.\New_Path-1.1.12\Impl\Appli\Library\Module_RM\Code\src\Exception.cpp
0,0,0,98,.\Old_Path-1.1.12\Impl\Appli\Library\Prof_bus\Code\src\Wrapper.cpp
0,0,0,26,.\New_Path-1.1.12\Impl\Support\Custom\Vital\Code\src\Interface.cpp
0,0,0,114,.\Old_Path-1.1.12\Impl\Support\Custom\Cust\Code\src\Config.cpp
I maintained my strings to be searched in other file named list.csv
Module_RM
Prof_bus
Vital
Cust
Now I need to go through each line of Input.csvand replace the last column with the matched string.
So my end result should be like this:
0,0,0,13,Module_RM
0,0,0,98,Prof_bus
0,0,0,26,Vital
0,0,0,114,Cust
I read the input files first line as a list. So text which i need to replace came in line[4]. I am reading each module name in the list.csv file and checking if there is any match of text in line[4]. I am not able to make that if condition true. Please let me know if it is not a proper search.
import csv
import re
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
reader=csv.reader(source)
module=csv.reader(module_names)
writer=csv.writer(result)
#lines=source.readlines()
for line in reader:
for mod in module_names:
if any([mod in s for s in line]):
line.replace(reader[4],mod)
print ("YES")
writer.writerow("OUT")
print (mod)
module_names.seek(0)
lines=reader
Please guide me to complete this task.
Thanks for your support!
At-last i succeeded in solving this problem!
The below code works well!
import csv
with open("D:\\My_Python\\New_Python_Test\\Input.csv") as source, open("D:\\My_Python\\New_Python_Test\\List.csv") as module_names, open("D:\\My_Python\\New_Python_Test\\Final_File.csv","w",newline="") as result:
reader=csv.reader(source)
module=csv.reader(module_names)
writer=csv.writer(result)
flag=False
for row in reader:
i=row[4]
for s in module_names:
k=s.strip()
if i.find(k)!=-1 and flag==False:
row[4]=k
writer.writerow(row)
flag=True
module_names.seek(0)
flag=False
Thanks for people who tried to solve! If you have any better coding practices please do share!
Good Luck!

Tesseract receipt scanning advice needed

I have struggled off and on again with Tesseract for various OCR projects and I found a use case today which I thought would be a slam dunk for it but after many hours I am still coming away unsatisfied. I wanted to pose the problem here and see if anyone else has advice on how to solve this task.
My wife came to me this morning and asked if there was anyway she could easily scan her receipts from Wal-Mart and over time build a history of prices spent in categories and for specific items so that we could do some trending and easily deep dive on where the spending is going. At first I felt like this was a very tall order, but after doing some digging I found a few things that make me feel this is within reach:
Wal-Mart receipts are in general, very well structured and easy to read. They even include the UPC for every item (potential for lookups against a UPC database?) and appear to classify food items with an F or I (not sure what the difference is) and have a tax code column as well that may prove useful should I learn the secrets of what the codes mean.
I further discovered that there is some kind of Wal-Mart item lookup API that I may be able to get access to which would prove useful in the UPC lookup.
They have an app for smart phones that lets you scan a QR code printed on every receipt. That app looks up a "TC" code off the receipt and pulls down the entire itemized receipt from their servers. It shows you an excellent graphical representation of the receipt including thumbnail pictures of all the items and the cost, etc. If this app would simply categorize and summarize the receipt, I would be done! But alas, that's not the purpose of the app ....
The final piece of the puzzle is that you can export a computer generated PNG image of the receipt in case you want to save it and throw away the paper version. This to me is the money shot, as these PNGs are computer created and therefore not subject to the issues surrounding taking a picture or scanning a paper receipt
An example of one of these (slightly edited to white out some areas but otherwise exactly as obtained from the app) is here:
https://postimg.cc/image/s56o0wbzf/
You can see that the important part of the text is perfectly aligned in 5 columns and that is ultimately what this question is about. How to get Tesseract to accurately OCR this into text. I have lots of ideas where to take it from here, but it all starts with the OCR!
The closest I have come myself is this example here:
http://pastebin.com/nuZJBVg8
I used psm6 and a character limiting set to force it to do uppercase + numbers + a few symbols only:
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ#()/*#%-.
At first glance, the OCR seems to almost match. But as you dig deeper you will see that it fails pretty horribly overall. 3s and 8s are almost always wrong. Same with 6s and 5s. Then there are times it just completely skips over characters or just starts to fall apart (like line 31+ in the example). It starts seeing 2s as 1s, or even just missing characters. The SO PIZZA on line 33 should be "2.82" but comes out as "32".
I have tried doing some pre-processing on the image to thicken up the characters and make sure it's pure black and white but none of my efforts got any closer than the raw image from Wal-Mart + the above commands.
Ideally since this is such a well structured PNG which is presumably always the same width I would love if I could define the columns by pixel widths so that Tesseract would treat each column independently. I tried to research this but the UZN files I've seen mentioned don't translate to me as far as pixel widths and they seem like height is a factor which wouldn't work on these since the height is always going to be variable.
In addition, I need to figure out how to train Tesseract to recognize the numbers 100% accurately (the letters aren't really important). I started researching how to train the program but to be honest it got over my head pretty quickly as the scope of training in the documentation is more for having it recognize entire languages not just 10 digits.
The ultimate end game solution would be a pipeline chain of commands that took the original PNG from the app and gave me back a CSV with the 5 columns of data from the important part of the receipt. I don't expect that out of this question, but any assistance guiding me towards it would be greatly appreciated! At this point I just don't feel like being whipped by Tesseract once again and so I am determined to find a way to master her!
I ended up fully flushing this out and am pretty happy with the results so I thought I would post it in case anyone else ever finds it useful.
I did not have to do any image splitting and instead used a regex since the Wal-mart receipts are so predictable.
I am on Windows so I created a powershell script to run the conversion commands and regex find & replace:
# -----------------------------------------------------------------
# Script: ParseReceipt.ps1
# Author: Jim Sanders
# Date: 7/27/2015
# Keywords: tesseract OCR ImageMagick CSV
# Comments:
# Used to convert a Wal-mart receipt image to a CSV file
# -----------------------------------------------------------------
param(
[Parameter(Mandatory=$true)] [string]$image
) # end param
# create output and temporary files based on input name
$base = (Get-ChildItem -Filter $image -File).BaseName
$csvOutfile = $base + ".txt"
$upscaleImage = $base + "_150.png"
$ocrFile = $base + "_ocr"
# upscale by 150% to ensure OCR works consistently
convert $image -resize 150% $upscaleImage
# perform the OCR to a temporary file
tesseract $upscaleImage -psm 6 $ocrFile
# column headers for the CSV
$newline = "Description,UPC,Type,Cost,TaxType`n"
$newline | Out-File $csvOutfile
# read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
$lines = Get-Content "$ocrFile.txt"
Foreach ($line in $lines) {
# This wraps the 12 digit UPC code and the price with commas, giving us our 5 columns for CSV
$newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
$newline | Out-File -Append $csvOutfile
}
# clean up temporary files
del $upscaleImage
del "$ocrFile.txt"
The resulting file needs to be opened in Excel and then have the text to columns feature run so that it won't ruin the UPC codes by auto converting them to numbers. This is a well known problem I won't dive into, but there are a multitude of ways to handle and I settled on this slightly more manual way.
I would have been happiest to end up with a simple .csv I could double click but I couldn't find a great way to do that without mangling the UPC codes even more like by wrapping them in this format:
"=""12345"""
That does work but I wanted the UPC code to be just the digits alone as text in Excel in case I am able to later do a lookup against the Wal-mart API.
Anyway, here is how they look after importing and some quick formating:
https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png
I still need to do some garbage cleaning on the rows that aren't line items but that all only takes a few seconds so doesn't bother me too much.
Thanks for the nudge in the right direction #RevJohn, I would not have thought to try simply scaling the image but that made all the difference in the world with Tesseract!
Text recognition on receipts is one of the hardest problems for OCR to handle.
The reasons are numerous:
receipts are printed on cheap paper with cheap printers - to make them cheap, not readable!
they have very large amount of dense text (especially Wall-Mart receipts)
existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
receipt structure, which is something between tabular and freeform, is hard for any layouting engine to handle.
Your best bet is to perform the following:
Analyse the input images. If they are hard to read by eyes, they are hard to read to tesseract as well.
Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes help a lot. Cleaning existing noise also helps.
Tesseract training. It's not that hard to do :)
OCR result postprocessing to ensure layouting.
Layouting is best performed by analysing the geometry of the results, not by regexes. Regexes have problems if the OCR has errors. Using geometry, for example, you find a good candidate for UPC number, draw a line through the centers of the characters, and then you know exactly which price belongs to that UPC.
Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.
Company I'm working with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods
pod try PPBlinkOCR