python - remove a known exact row in a huge csv
I have a ~220 million row, 7 column CSV file. I need to remove row 2636759. The file is 7.7 GB, more than will fit in memory. I'm most familiar with R, but could also do this in Python or Bash.
I can't read or write the file in one operation. What is the best way to build the file incrementally on disk, instead of trying to do it all in memory?
I've tried to find an answer, but have only been able to find approaches for files small enough to read/write in memory, or for rows at the beginning of the file.
A Python solution:
import os

with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        for linenumber, line in enumerate(infile):
            if linenumber != 10234:   # enumerate is zero-based
                tmp.write(line)

# Copy back over the original file. You can skip this if you don't
# mind (or prefer) having both files lying around.
with open('tmp.csv', 'r') as tmp:
    with open('file.csv', 'w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv')  # remove the temporary file
This duplicates the data, which may not be optimal if disk space is an issue. An in-place write would be more complicated without loading the whole file into RAM first.
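As an aside, since the question also mentions Bash (this is not part of the answer above): GNU sed can express the same skip-one-line idea in a single command. Note that sed's line addresses are 1-based, and that -i still writes a temporary copy behind the scenes, so the disk-space cost is the same as the Python version:

sed -i '2636759d' file.csv   # delete line 2636759 (1-based), editing "in place"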
The key is that Python naturally supports handling files as iterables. This means they can be lazily evaluated, and you never need to hold the entire thing in memory at one time.
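As a minimal sketch of that point (reusing the same hypothetical filenames and line number as above): file objects compose with generator expressions, so the copy-and-skip loop can be written as one lazy pipeline that still holds only one line in memory at a time.

with open('file.csv', 'r') as infile, open('tmp.csv', 'w') as tmp:
    # writelines consumes the generator lazily, line by line
    tmp.writelines(line for i, line in enumerate(infile) if i != 10234)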
I like this solution, if your primary concern isn't raw speed, because you can replace the linenumber != value line with any conditional test, for example, filtering out lines that include a particular date:
test = lambda line: 'november' in line

with open('tmp.csv', 'w') as tmp:
    ...
    if test(line):
        ...
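Filled out, assuming "filtering out" means dropping the matching lines (so the write happens when the test fails), that sketch might look like the following. os.replace is a minor variation on the copy-back step above; it atomically swaps the filtered copy over the original instead of rewriting it line by line:

import os

test = lambda line: 'november' in line

with open('file.csv', 'r') as infile, open('tmp.csv', 'w') as tmp:
    for line in infile:
        if not test(line):   # keep only lines that don't match
            tmp.write(line)

os.replace('tmp.csv', 'file.csv')  # swap the filtered copy over the original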
In-place read-writes and memory-mapped file objects (which may be considerably faster) are going to require considerably more bookkeeping.
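To give a sense of that bookkeeping, here is a minimal sketch of the in-place mmap approach; the helper name is mine, and it assumes a 64-bit Python (a 7.7 GB file won't map on 32-bit). It finds the byte range of the target line, shifts the rest of the file down over it with memmove semantics, and truncates. It still rewrites every byte after the removed line, so it mainly saves disk space rather than time:

import mmap

def remove_line_inplace(path, lineno):
    # Remove the (zero-based) line `lineno` from `path` with no temp file.
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        try:
            # Walk newline by newline to the start of the target line.
            start = 0
            for _ in range(lineno):
                nl = mm.find(b'\n', start)
                if nl == -1:
                    raise ValueError('fewer lines than expected')
                start = nl + 1
            nl = mm.find(b'\n', start)
            end = mm.size() if nl == -1 else nl + 1
            # Shift everything after the line down over it.
            tail = mm.size() - end
            mm.move(start, end, tail)
            mm.flush()
        finally:
            mm.close()
        f.truncate(start + tail)  # drop the now-duplicated bytes at the end

# e.g. remove_line_inplace('file.csv', 2636758) for 1-based row 2636759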