import os


def split(filehandler, delimiter=',', row_limit=10000,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    """
    Splits a CSV file into multiple pieces.

    A quick bastardization of the Python CSV library.

    Arguments:
        `row_limit`: The number of rows you want in each output file. 10,000 by default.
        `output_name_template`: A %s-style template for the numbered output files.
        `output_path`: Where to stick the output files.
        `keep_headers`: Whether or not to print the headers in each output file.

    Example usage:
        >> from toolbox import csv_splitter;
        >> csv_splitter.split(open('/home/ben/input.csv', 'r'));
    """
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = reader.next()
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)
Hi,
I'm trying to run this on windows and I get this error
splitter.split(open('C:\Users\Ochieng'\Downloads\Compressed\filename.csv','r'));
^
SyntaxError: unexpected character after line continuation character
That error indicator actually points at the semicolon at the end. How do I go about this?
I figured it out!
I was using a single quote instead of a double one. The backslash after the single quote is interpreted as a line continuation.
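For anyone hitting the same thing, here is a small sketch of the usual ways to write a Windows path in Python. The path is made up for illustration. Note that in Python 3 a plain 'C:\Users\...' string also fails because \U starts a Unicode escape, so a raw string or forward slashes is probably the safer habit.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only for illustration.
raw_path = r"C:\Users\someone\Downloads\filename.csv"         # raw string: backslashes kept literally
escaped_path = "C:\\Users\\someone\\Downloads\\filename.csv"  # every backslash escaped by hand
forward_path = "C:/Users/someone/Downloads/filename.csv"      # forward slashes are accepted on Windows

# The raw and the escaped spelling produce the same string.
same_string = raw_path == escaped_path

# PureWindowsPath treats both separators as equivalent.
same_path = PureWindowsPath(raw_path) == PureWindowsPath(forward_path)
```

Any of the three spellings can be passed to open() before handing the file object to split().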
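Works perfectly for me. Very fast.
An alternative way, though it seems slower:
import pandas as pd
import numpy as np
df = pd.read_csv("input.csv")  # placeholder file name: load the CSV you want to split
# Integer division (//) puts every 10,000 consecutive rows into one group on Python 3 too.
groups = df.groupby(np.arange(len(df.index)) // 10000)
for (frameno, frame) in groups:
    frame.to_csv("%s.csv" % frameno, header=False, index=False, encoding="ISO-8859-1")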
If you want to split a large CSV file by column, you can check this: https://github.com/harryhan1989/csvsplitter
https://gist.github.com/jrivero/1085501#gistcomment-1594889
"For Windows, to eliminate the blank rows..."
current_out_writer = csv.writer(open(current_out_path, 'wb'), delimiter=delimiter)
works for me (Win7/Python 2.7)
This solution works for me (Windows 10/Python 2.7):
    class MyDialect(csv.Dialect):
        delimiter = ';'
        quotechar = '"'
        doublequote = True
        skipinitialspace = False
        lineterminator = '\n'                # the solution
        quoting = csv.QUOTE_MINIMAL
    # ......
    current_out_writer = csv.writer(open(current_out_path, 'w'), MyDialect())
for a modified version to use on the command line
Hi, I'm on Python 2.7.
In my case it says:
Traceback (most recent call last):
  File "csv_splitter.py", line 43, in <module>
    current_out_writer.writerow(row)
NameError: name 'current_out_writer' is not defined
It is very weird. Has anybody had this issue?
It is the last line.
Thanks!
Just a side note: blank lines may appear between rows when running on Python 2.7.
To get around this, open the file in binary mode:
current_out_writer = csv.writer(open(current_out_path, 'wb'), delimiter=delimiter)
Very nice work.
I changed headers = reader.next() to headers = next(reader) for Python 3 to fix the error AttributeError: '_csv.reader' object has no attribute 'next'.
Works like a charm!
Thank you!
Hi, I am new to Python. Can you please help me understand where to put the file path and how to run this script? I copied and pasted the code, but when I run python script_name.py in cmd, nothing is generated.
On Unix-like systems, and if you don't need to include the headers in each file, you can use the built-in split command.
Got a 'charmap' codec can't decode byte 0x81 in position 6696: character maps to <undefined>. What's wrong?
"On Unix-like systems, and if you don't need to include the headers in each file, you can use the built-in split command."
split won't work if your data has quoted cells with newlines in them.
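A minimal sketch of why that is, using made-up sample data: a line-based tool counts physical lines, while the csv module counts records, and the two differ as soon as a quoted cell contains a newline.

```python
import csv
import io

# Header plus two data rows; the first data row has a quoted cell with an embedded newline.
data = 'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

# What a line-based splitter sees: physical lines.
physical_lines = data.splitlines()

# What the csv module sees: logical records.
records = list(csv.reader(io.StringIO(data, newline="")))

print(len(physical_lines))  # 4
print(len(records))         # 3
```

Cutting this file after the second physical line would tear the first data row in half, which is the failure mode described above.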
Thanks for the starting point! Was able to make a single file CSV split by column program after looking at your code. https://github.com/APAHRoot/HelpfulHopeful
Hopefully it works for other people too!
Thanks for posting this. Very useful. Why do I get a "NameError: name 'current_out_writer' is not defined"?
Thank you for this. Works like a charm. For those of you who have trouble using this:
- You need to set the row limit to suit your own data size.
- If you are using Python 3.x, you need to change the line "headers = reader.next()" into "headers = next(reader)".
- You need to change output_path to the folder where the split data files should be stored.
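Putting those three points together, here is a rough Python 3 sketch. It is not the gist author's code: the name split_csv, the encoding parameter, and the returned list are my own assumptions, and it takes a file path instead of an open file so the files can be closed properly.

```python
import csv
import os


def split_csv(input_path, row_limit=10000, output_dir=".",
              output_name_template="output_%s.csv", keep_headers=True,
              delimiter=",", encoding="utf-8"):
    """Split input_path into numbered pieces of at most row_limit data rows.

    Hypothetical helper for illustration; returns the list of files written.
    """
    written = []
    out_file = None
    writer = None
    with open(input_path, "r", newline="", encoding=encoding) as source:
        reader = csv.reader(source, delimiter=delimiter)
        # Python 3: next(reader) replaces reader.next().
        headers = next(reader, None) if keep_headers else None
        try:
            for i, row in enumerate(reader):
                if i % row_limit == 0:
                    # Every row_limit rows, close the current piece and start the next one.
                    if out_file is not None:
                        out_file.close()
                    piece = i // row_limit + 1
                    out_path = os.path.join(output_dir, output_name_template % piece)
                    out_file = open(out_path, "w", newline="", encoding=encoding)
                    writer = csv.writer(out_file, delimiter=delimiter)
                    if headers is not None:
                        writer.writerow(headers)
                    written.append(out_path)
                writer.writerow(row)
        finally:
            if out_file is not None:
                out_file.close()
    return written
```

A call such as split_csv("input.csv", row_limit=5000, output_dir="pieces") would then go at the bottom of the script; both names are placeholders, and the output folder has to exist already. Unlike the original, this sketch writes no output file at all when the input has no data rows.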
I wonder if it is easy to make each file have unique dates. For instance, I want 1st May of 2019 lines to be in one file only.
Should be able to do that with this: https://github.com/APAHRoot/HelpfulHopeful/blob/master/SortBySplitCSV.py
I made it to split data by county, but it should work with any value you want to use as an identifier
I am testing it now. I guess the column name has to be put in place of "Unnamed". The code doesn't have a row or MB limit. The split by date will help. Hopefully, I can edit that and add a date range (per month, etc.).
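For the date question, a standard-library sketch of splitting by the value of one column might look like the following. This is not the linked SortBySplitCSV.py script; the function name, the output naming scheme, and the assumption that the file has a header row are all mine.

```python
import csv
import os


def split_by_column(input_path, column, output_dir=".", encoding="utf-8"):
    """Write one CSV per distinct value of `column`; returns {value: path}.

    Hypothetical helper for illustration; assumes a header row is present.
    """
    handles = {}  # value -> open file object
    writers = {}  # value -> csv.DictWriter
    paths = {}    # value -> output path
    with open(input_path, "r", newline="", encoding=encoding) as source:
        reader = csv.DictReader(source)
        try:
            for row in reader:
                key = row[column]
                if key not in writers:
                    # Assumed naming scheme: <column>_<value>.csv
                    path = os.path.join(output_dir, "%s_%s.csv" % (column, key))
                    handles[key] = open(path, "w", newline="", encoding=encoding)
                    writers[key] = csv.DictWriter(handles[key], fieldnames=reader.fieldnames)
                    writers[key].writeheader()
                    paths[key] = path
                writers[key].writerow(row)
        finally:
            for handle in handles.values():
                handle.close()
    return paths
```

Two caveats: it keeps one file open per distinct value, and values containing characters that are not valid in file names (a date written as 01/05/2019, for instance) would need to be cleaned up first. Grouping by month would mean deriving the key from the date instead of using the cell as is.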
Hi,
I am new to Python and it is not working. Can someone please help? I am using Python 3.5.
From what I see, it is not going into the loop "for i, row in enumerate(reader):".
The output of the print(next(reader)) statement is ['d']; it looks like it is taking the first character of the file path.
One more thing I saw is that the output file that is created, output_1.csv, contains the complete path of the CSV file that needs to be broken into multiple CSV files.
cat output_1.csv
a
t
a
/
t
r
a
n
s
f
o
r
m
a
t
i
o
n
s
/
t
e
m
p
_
s
m
a
l
l
.
c
s
v
The csv file has 10K rows.
The code is as follows.
The call to the function is as follows.
csv_splitter.split("data/transformations/temp_small.csv", ',',row_limit=10000,
output_name_template='output_%s.csv', output_path='.', keep_headers=True)
The function is
import os
def split(filehandler, delimiter=',', row_limit=10000,
output_name_template='output_%s.csv', output_path='.', keep_headers=True):
"""
Splits a CSV file into multiple pieces.
A quick bastardization of the Python CSV library.
Arguments:
    `row_limit`: The number of rows you want in each output file. 10,000 by default.
    `output_name_template`: A %s-style template for the numbered output files.
    `output_path`: Where to stick the output files.
    `keep_headers`: Whether or not to print the headers in each output file.
Example usage:
    >> from toolbox import csv_splitter;
    >> csv_splitter.split(open('/home/ben/input.csv', 'r'));
"""
import csv
reader = csv.reader(filehandler, delimiter=delimiter)
current_piece = 1
current_out_path = os.path.join(
    output_path,
    output_name_template % current_piece
)
print("The current output path" + current_out_path)
print("The file name is " + filehandler)
current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
print("current_out_writer is " )
print(current_out_writer)
print(next(reader))
current_limit = row_limit
if keep_headers:
    #headers = next(reader)
    headers=next(headers)
    current_out_writer.writerow(headers)
for i, row in enumerate(reader):
    if i + 1 > current_limit:
        print("In the function csv_splitter")
        current_piece += 1
        current_limit = row_limit * current_piece
        current_out_path = os.path.join(
            output_path,
            output_name_template % current_piece
        )
        current_out_writer = csv.writer(open(current_out_path, 'w'), delimiter=delimiter)
        if keep_headers:
            current_out_writer.writerow(headers)
    current_out_writer.writerow(row)
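The likely cause of the output above, as far as I can tell: split() was given the path as a string, not an open file. csv.reader accepts any iterable of lines, and iterating over a string yields one character at a time, so every character of the path becomes its own one-cell row. A small sketch with a made-up path:

```python
import csv

path = "data/temp_small.csv"  # hypothetical path; it is never opened here

# csv.reader iterates over whatever it is given. A string is iterated character by
# character, so each character is parsed as a separate "line".
rows_from_string = list(csv.reader(path))

print(rows_from_string[:4])  # [['d'], ['a'], ['t'], ['a']]
```

Passing an open file instead, for example split(open(path, 'r', newline='')), should make the reader see the actual rows. The line headers=next(headers) would also need to go back to headers = next(reader), and the debugging print(next(reader)) consumes a row before the header is read.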
After fixing "headers = reader.next()" I got this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5376: character maps to <undefined>
Can anyone help me?
Many thanks.
@hoai97nam Add encoding='utf-8' as a parameter, for example:
open(current_out_path, 'w', encoding='utf-8')
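One detail worth adding: a decode error is raised while reading, so the encoding probably also has to be given when opening the input file, not only the output files. A self-contained sketch follows; the sample file and its contents are made up, and it assumes the real file is UTF-8, which may not be true for your data.

```python
import csv
import os
import tempfile

# Build a small UTF-8 sample whose bytes include 0x81 ("Á" is encoded as C3 81).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["name"], ["Ágata"]])

# Reading it with a Windows code page fails on byte 0x81 ...
try:
    with open(sample_path, "r", newline="", encoding="cp1252") as f:
        list(csv.reader(f))
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# ... while naming the file's actual encoding works.
with open(sample_path, "r", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
```

If the input is not UTF-8, the same parameter takes whatever encoding the file really uses.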
Adding a working solution for Python 3, with encoding:
import os


def split(filehandler, delimiter=',', row_limit=8500,
          output_name_template='output_%s.csv', output_path='.', keep_headers=True):
    """
    Splits a CSV file into multiple pieces.

    A quick bastardization of the Python CSV library.

    Arguments:
        `row_limit`: The number of rows you want in each output file. 8,500 by default.
        `output_name_template`: A %s-style template for the numbered output files.
        `output_path`: Where to stick the output files.
        `keep_headers`: Whether or not to print the headers in each output file.

    Example usage:
        >> from toolbox import csv_splitter;
        >> csv_splitter.split(open('/home/ben/input.csv', 'r'));
    """
    import csv
    reader = csv.reader(filehandler, delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
        output_path,
        output_name_template % current_piece
    )
    current_out_writer = csv.writer(open(current_out_path, 'w', encoding='utf-8'), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
                output_path,
                output_name_template % current_piece
            )
            current_out_writer = csv.writer(open(current_out_path, 'w', encoding='utf-8'), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)


split(open('test.csv', 'r', encoding='utf-8'))
Thanks for the starting point! Was able to make a single file CSV split by column program after looking at your code. https://github.com/APAHRoot/HelpfulHopeful
Hopefully it works for other people too!
This is amazing and is exactly what I was looking for. Thank you so much.
@alternateaccounts Can you test this tool with your 4.5 GB file?
https://github.com/BurntSushi/xsv
Thank you for the mention in your project.
I am sorry about this, I am new to the field, but where should I specify the file I want to split into smaller files? In which part of the code?
Add newline='' in open() to avoid blank rows.
For Windows, to eliminate the blank rows, add a newline option:
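A minimal Python 3 sketch of that option. The file name and rows are placeholders. With newline='' the csv module writes its own line endings and Python does not translate them a second time, which is what produces the extra blank rows on Windows.

```python
import csv
import os
import tempfile

# Placeholder output file in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "output_1.csv")

# newline='' leaves line endings entirely to the csv module.
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerow([1, "alice"])

# Read the raw bytes back to inspect the line endings that were written.
with open(out_path, "rb") as f:
    raw = f.read()
```

In the splitter this means changing both open(current_out_path, 'w') calls to open(current_out_path, 'w', newline=''). On Python 2 the equivalent is the 'wb' mode mentioned earlier in the thread.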