python - remove a known exact row in a huge csv


I have a ~220 million row, 7 column CSV file. I need to remove row 2636759. The file is 7.7 GB, more than will fit in memory. I'm most familiar with R, but could also do this in Python or bash.

I can't read or write the file in one operation. What is the best way to build the modified file incrementally on disk, instead of trying to do it all in memory?

Everything I've been able to find so far covers files small enough to read/write in memory, or removing rows at the beginning of the file.

A Python solution:

import os

with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        for linenumber, line in enumerate(infile):
            if linenumber != 10234:
                tmp.write(line)

# copy back over the original file. You can skip this if you don't
# mind (or prefer) having both files lying around
with open('tmp.csv', 'r') as tmp:
    with open('file.csv', 'w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv')  # remove the temporary file

Note that enumerate counts from 0, so to remove the question's row 2636759 (counting from 1) the test would be linenumber != 2636758. This approach duplicates the data, which may not be optimal if disk space is an issue; an in-place write is more complicated without loading the whole file into RAM first.
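If the copy-back pass is the part that bothers you, one variation (a sketch, using the same file names and example line number as above) is to rename the temporary file over the original with os.replace instead of copying its contents back, which avoids the second read and write of 7.7 GB:

import os

with open('file.csv', 'r') as infile, open('tmp.csv', 'w') as tmp:
    for linenumber, line in enumerate(infile):
        if linenumber != 10234:
            tmp.write(line)

os.replace('tmp.csv', 'file.csv')  # single atomic rename, no second copy

It still needs the disk space for both copies while the filter runs, so the duplication caveat above applies either way.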


The key is that Python naturally supports handling files as iterables. This means they can be lazily evaluated, and you never need to hold the entire thing in memory at any one time.
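To make the laziness concrete, the same filter can be written as a generator expression fed straight into writelines (again a sketch with the same file names and example line number); nothing here materialises more than one line at a time:

with open('file.csv') as infile, open('tmp.csv', 'w') as tmp:
    # evaluated lazily: one line is read, tested and written per step
    tmp.writelines(line for i, line in enumerate(infile) if i != 10234)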


I like this solution, if your primary concern isn't raw speed, because you can replace the linenumber != value line with any conditional test, for example, filtering out lines that include a particular date:

test = lambda line: 'november' in line

with open('tmp.csv', 'w') as tmp:
    ...
    if test(line):
    ...
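Spelled out in full, that might look like the sketch below; the 'november' test is just an illustration, and whether matching lines are kept or dropped is only the polarity of the condition:

test = lambda line: 'november' in line  # true for lines mentioning november

with open('file.csv') as infile, open('tmp.csv', 'w') as tmp:
    for line in infile:
        if not test(line):  # drop lines that match, keep everything else
            tmp.write(line)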

In-place read-writes and memory-mapped file objects (which may be considerably faster) are going to require considerably more bookkeeping.
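For completeness, here is a rough sketch of what that in-place bookkeeping looks like (my own illustration, not part of the answer above): locate the byte range of the unwanted line, slide everything after it backwards in chunks, then truncate the duplicated tail. No temporary file is needed, but an interrupted run leaves the file corrupted, so treat it as a sketch rather than a drop-in tool.

def remove_line_in_place(path, target, chunk_size=1 << 20):
    """Remove zero-based line `target` from `path` without a temp file."""
    with open(path, 'r+b') as f:
        # locate the byte offsets where the doomed line starts and ends
        start = end = None
        offset = 0
        for i, line in enumerate(f):
            if i == target:
                start, end = offset, offset + len(line)
                break
            offset += len(line)
        if start is None:
            return  # file has fewer lines than `target`
        # slide the rest of the file back over the removed line
        read_pos, write_pos = end, start
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)  # chop off the now-duplicated tail

remove_line_in_place('file.csv', 2636758)  # row 2636759, counted from 1

A memory-mapped variant would do the same shift with mmap.mmap and mmap.move, which can be faster, but mapping a 7.7 GB file requires a 64-bit Python, so the chunked version above is the more portable sketch.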

