
Large CSV Processing Using Go #WeekendBuild

Is it possible to process 1 million rows of CSV data using Go? This #weekendbuild series is a journal of how to do it.


Hi, I am a cat lover and software engineer from Malang, Indonesia, mostly doing PHP and stuff. Visit my resume and portfolio at didiktrisusanto.dev. See you, folks!

So lately I was thinking about how to spend my weekends productively. Not the whole weekend, but a couple of hours to experiment and build something I could gain knowledge from. There are no boundaries; it could be anything as long as I am interested in or curious about it. But since I work in tech, it will mostly be around software development.

The Idea

This weekend I was thinking about playing around with several programming languages. One of the good ways to learn a language is to build something with it. So I challenged myself to do large CSV processing using different programming languages.

The idea is:

Given a large dummy CSV (1 million rows) containing sample customer data, process it with the goals below:

  • Extract the data from the CSV

  • Count the total rows

  • Group the number of customers by city

  • Sort cities by customer count from highest to lowest

  • Calculate the processing time

Pretty basic stuff, achievable in several languages such as Go, PHP, JavaScript, and Python, while trying not to use 3rd party dependencies to reach those goals.

A sample customers CSV can be downloaded here: https://github.com/datablist/sample-csv-files

Reading CSV with Go

I have been writing a service in Go for several months now, and I feel this language is quite simple, straightforward, and performant. So I believe processing 1 million CSV rows should not be an issue.

Load & Extract Data

Apparently Go has a standard library for CSV processing, so we don't need a 3rd party dependency to solve our problem, which is nice. The solution is pretty straightforward:

[Code screenshot: read the file and extract rows from the CSV in Go]

  1. Open the file from the given path

  2. Load the opened file into a CSV reader

  3. Hold all extracted CSV records/rows in a records slice for later processing

FieldsPerRecord is set to -1 because I want to skip field count checking on each row, since the number of fields or columns could differ between formats.

At this stage we can already load and extract all the data from the CSV, ready for the next processing stage. We can also find out how many rows the CSV has with len(records).

Grouping Total Customers per City

Now we can iterate over the records and build a map of city name to customer total, which looks like this:

map[string]int{"Jakarta": 10, "Bandung": 200, ...}

The city value in each CSV row is located at the 7th index, and the code will look like this:

[Code screenshot: convert the slice into a map in Go]

If the city key does not exist in the map, create it with a customer total of 1. Otherwise, just increment the total for the given city.

Now we have a map m containing each city and how many customers it has. At this point we have already solved the problem of grouping customers by city.

Sorting by Highest Customer Count

I tried to find a function in the standard library to sort the map, but unfortunately I couldn't find one. Sorting is only possible on a slice, because we can rearrange the data based on index positions. So let's make a slice from our current map.

[Code screenshot: change the map into a slice in Go]

Now how do we sort it by CustomerCount from highest to lowest? The most common algorithm for this is bubble sort. Although it's not the fastest, it can do the job.

Bubble Sort is the simplest sorting algorithm that works by repeatedly swapping the adjacent elements if they are in the wrong order. This algorithm is not suitable for large data sets as its average and worst-case time complexity is quite high.

Reference: https://www.geeksforgeeks.org/bubble-sort-algorithm/

Using our slice, the algorithm loops over the data, checks the value at the next index, and swaps the two if the current value is less than the next one. You can check the detailed algorithm on the reference website.

Now our sorting process could look like this:

[Code screenshot: bubble sort over the slice in Go]

By the end of the loops, the final slice gives us the sorted data.

Calculate Processing Time

Calculating the processing time is quite simple: get a timestamp before and after executing the main process and compute the difference. In Go the approach is simple enough:

package main

import (
	"fmt"
	"time"
)

func main() {
	start := time.Now() // start timing for processing time
	// the main process
	// ...
	duration := time.Since(start)
	fmt.Println("processing time (ms): ", duration.Milliseconds())
}

The Result

Run the program with this command:

go run main.go

The printed output will be the row count, the sorted data, and the processing time, something like this:

[Screenshot: program output showing the row count, sorted cities, and processing time]

As expected of Go's performance, it handled the 1 million row CSV in under 1 second!

All the completed code is already published in my GitHub repository:

https://github.com/didikz/csv-processing/tree/main/golang

Lesson Learned

  • CSV processing is already available in Go's standard library; no need for a 3rd party lib

  • Processing the data is quite easy. The challenge was figuring out how to sort the data, because it has to be done manually

What Comes to Mind?

I was thinking my current solution might be optimized further, because I loop over all the extracted records to build the map, and if we check the ReadAll() source, it also loops to build the slice from the given file reader. So 1 million rows could mean 2 loops over 1 million records, which is not nice.

I figured that if I could read the data directly from the file reader, only 1 loop would be needed, because I could build the map directly from it. The exception would be if the records slice were needed elsewhere, but that is not the case here.

I haven't had time to figure it out yet, but I also thought of some downsides of doing it manually:

  • I would probably need to handle more errors from the parsing process

  • I am not sure how significantly it would reduce the processing time, so it's unclear whether the workaround is worth it

Yeah, let's see if I can experiment with it in the next #weekendbuild.

Didik Tri Susanto