python - remove a known exact row in a huge csv
I have a ~220 million row, 7 column CSV file. I need to remove row 2636759. The file is 7.7 GB, more than will fit in memory. I'm most familiar with R, but could also do this in Python or Bash.
I can't read or write the file in one operation. What is the best way to build the file incrementally on disk, instead of trying to do it all in memory?
I've tried to find an answer, but have only been able to find approaches for files small enough to read/write in memory, or for rows at the beginning of the file.
A Python solution:
import os

with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        for linenumber, line in enumerate(infile):
            if linenumber != 10234:   # enumerate is zero-based
                tmp.write(line)

# Copy back over the original file. You can skip this if you don't
# mind (or prefer) having both files lying around.
with open('tmp.csv', 'r') as tmp:
    with open('file.csv', 'w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv')  # remove the temporary file
This duplicates the data, which may not be optimal if disk space is an issue. An in-place write would be more complicated without loading the whole file into RAM first.
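As an aside, since the question also mentions Bash (this is not part of the answer above): GNU sed can express the same skip-one-line idea in a single command. Note that sed's line addresses are 1-based, and that -i still writes a temporary copy behind the scenes, so the disk-space cost is the same as the Python version:

sed -i '2636759d' file.csv   # delete line 2636759 (1-based), editing "in place"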
The key is that Python naturally supports handling files as iterables. This means they can be lazily evaluated, and you never need to hold the entire thing in memory at one time.
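As a minimal sketch of that point (reusing the same hypothetical filenames and line number as above): file objects compose with generator expressions, so the copy-and-skip loop can be written as one lazy pipeline that still holds only one line in memory at a time.

with open('file.csv', 'r') as infile, open('tmp.csv', 'w') as tmp:
    # writelines consumes the generator lazily, line by line
    tmp.writelines(line for i, line in enumerate(infile) if i != 10234)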
I like this solution, if your primary concern isn't raw speed, because you can replace the linenumber != value line with any conditional test, for example, filtering out lines that include a particular date:
test = lambda line: 'november' in line

with open('tmp.csv', 'w') as tmp:
    ...
    if test(line):
        ...
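Filled out, assuming "filtering out" means dropping the matching lines (so the write happens when the test fails), that sketch might look like the following. os.replace is a minor variation on the copy-back step above; it atomically swaps the filtered copy over the original instead of rewriting it line by line:

import os

test = lambda line: 'november' in line

with open('file.csv', 'r') as infile, open('tmp.csv', 'w') as tmp:
    for line in infile:
        if not test(line):   # keep only lines that don't match
            tmp.write(line)

os.replace('tmp.csv', 'file.csv')  # swap the filtered copy over the original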
In-place read-writes and memory-mapped file objects (which may be considerably faster) are going to require considerably more bookkeeping.
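To give a sense of that bookkeeping, here is a minimal sketch of the in-place mmap approach; the helper name is mine, and it assumes a 64-bit Python (a 7.7 GB file won't map on 32-bit). It finds the byte range of the target line, shifts the rest of the file down over it with memmove semantics, and truncates. It still rewrites every byte after the removed line, so it mainly saves disk space rather than time:

import mmap

def remove_line_inplace(path, lineno):
    # Remove the (zero-based) line `lineno` from `path` with no temp file.
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        try:
            # Walk newline by newline to the start of the target line.
            start = 0
            for _ in range(lineno):
                nl = mm.find(b'\n', start)
                if nl == -1:
                    raise ValueError('fewer lines than expected')
                start = nl + 1
            nl = mm.find(b'\n', start)
            end = mm.size() if nl == -1 else nl + 1
            # Shift everything after the line down over it.
            tail = mm.size() - end
            mm.move(start, end, tail)
            mm.flush()
        finally:
            mm.close()
        f.truncate(start + tail)  # drop the now-duplicated bytes at the end

# e.g. remove_line_inplace('file.csv', 2636758) for 1-based row 2636759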