Large CSV Processing Using JavaScript #WeekendBuild
I challenged myself to figure out how to load and process large CSV data using JavaScript. Here's the journey and what I learned.
In the previous entries of this series I covered large CSV processing with PHP and Go. In this new #WeekendBuild I'll use JavaScript to process a large CSV file, with the same requirements: extract and parse the file, count the total number of rows, collect the cities and how many customers each one has, and finally sort them by most customers.
If you want to jump straight to the code, check out the final result in my GitHub repository.
Load & Extract Data
import fs from 'fs';

const path = '../data/customers-1000000.csv'; // the CSV file to process

const reader = fs.createReadStream(path, { encoding: "utf8" });

// handle read errors
reader.on('error', (err) => console.error(`Error reading file: ${err}`));

// collect the streamed chunks and concatenate them into a single variable
let data = '';
reader.on('data', (chunk) => data += chunk);
The idea was to read a huge file's contents without huge resource consumption. I found the createReadStream function, which reads a file as a readable stream, so I could consume it in chunks and parse the data later. My stupid idea was to load the chunks and concatenate them into a single string variable.
Let's see how it goes.
Parsing & Mapping Data
// at the end of the stream, parse the concatenated data, then map & sort it
reader.on('end', () => {
  const customers = data.split('\n')
    .slice(1) // do not include the header row
    .filter(row => row !== '') // do not include empty rows
    .map(row => {
      // split by comma, but keep commas that sit inside double-quoted values
      let splitted = [];
      let current = '';
      let inQuotes = false;
      for (let char of row) {
        if (char === '"') {
          inQuotes = !inQuotes;
        } else if (char === ',' && !inQuotes) {
          splitted.push(current);
          current = '';
        } else {
          current += char;
        }
      }
      splitted.push(current);
      return splitted;
    });
});
At the end of the stream I can parse the string built from the loaded chunks and determine the values in each row. I know there is a csv-parse library that could do this, but the challenge is to solve the requirements using only standard, built-in modules. The logic is:
Split the string by the '\n' character
Skip the header row with slice, since it doesn't need parsing
Exclude empty rows with filter, since they would be counted as rows but contain no data
Build the customers array with map, holding the parsed columns of each row
I assumed each column in a row is separated by a comma, which would make this easy, but it turned out some values are wrapped in double quotes and contain commas themselves:
7,EA4d384DfDbBf77,Darren,Peck,"Lester, Woodard and Mitchell",Lake Ana,Pitcairn Islands,(496)452-6181x3291,+1-247-266-0963x4995,tgates@cantrell.com,2021-08-24,https://www.le.com/
That unexpected data would ruin the parser, so it needs additional logic to detect strings inside double quotes.
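To see the quote handling in isolation, here is the same idea pulled out into a standalone helper and run on the tricky row above. splitCsvLine is just a name I'm using for illustration; the actual code keeps this logic inline inside map:

// splitCsvLine is a hypothetical helper, only to illustrate the quote handling;
// the real implementation keeps this logic inline inside map()
function splitCsvLine(row) {
  const splitted = [];
  let current = '';
  let inQuotes = false;
  for (const char of row) {
    if (char === '"') {
      inQuotes = !inQuotes;         // entering or leaving a quoted value
    } else if (char === ',' && !inQuotes) {
      splitted.push(current);       // a comma outside quotes ends the column
      current = '';
    } else {
      current += char;              // everything else belongs to the current column
    }
  }
  splitted.push(current);           // don't forget the last column
  return splitted;
}

const tricky = '7,EA4d384DfDbBf77,Darren,Peck,"Lester, Woodard and Mitchell",Lake Ana,Pitcairn Islands,(496)452-6181x3291,+1-247-266-0963x4995,tgates@cantrell.com,2021-08-24,https://www.le.com/';
console.log(splitCsvLine(tricky)[4]); // 'Lester, Woodard and Mitchell', still one column despite the comma
console.log(splitCsvLine(tricky)[5]); // 'Lake Ana', the city column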
Finally I have the customers array, and I need to map it into city and customer counts, then sort it using the sort function.
// map the data into city and total customers
// (this runs inside the 'end' handler above, where customers is in scope)
let cities = new Map(); // 'city name' => total customers
customers.forEach((val) => {
  if (cities.has(val[5])) { // column 5 is the city in this dataset
    cities.set(val[5], cities.get(val[5]) + 1);
  } else {
    cities.set(val[5], 1);
  }
});

// sort by the highest customer count first
const sorted = Array.from(cities).sort((a, b) => b[1] - a[1]);

console.log('rows count: ' + customers.length);
console.log(JSON.stringify(sorted));
First Result
It did a great job loading 100,000 rows but failed at 1 million. When processing 1 million rows, the script crashed with an error.
I think the problem was that concatenating all the loaded chunks into a single variable caused that error. Storing a huge string in memory is a bad idea.
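If you want to watch the memory grow for yourself, here is a minimal sketch (my addition, not part of the original code) that logs heap usage every so often while the chunks pile up:

import fs from 'fs';

// minimal sketch: watch heap usage grow while chunks are concatenated into one string
const reader = fs.createReadStream('../data/customers-1000000.csv', { encoding: 'utf8' });

let data = '';
let chunks = 0;
reader.on('data', (chunk) => {
  data += chunk;
  chunks++;
  if (chunks % 100 === 0) {
    const heapMb = Math.round(process.memoryUsage().heapUsed / 1024 / 1024);
    console.log(`chunks: ${chunks}, heap used: ${heapMb} MB`);
  }
});
reader.on('end', () => console.log(`final string length: ${data.length}`));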
Improvement
Okay, I needed another approach. I was thinking about async/await, so I turned the code into a function. It basically still uses the same createReadStream function, but now I also use readline, because it seems like a more proper way to read and iterate over the data line by line.
import fs from 'fs';
import readline from 'readline';

async function processCsvAwait(path) {
  // track the processing time
  console.time('processing_csv_time');

  const reader = fs.createReadStream(path, { encoding: "utf8" });
  const rl = readline.createInterface({
    input: reader,
    crlfDelay: Infinity
  });

  // ... the rest of the code
}

processCsvAwait('../data/customers-1000000.csv');
Then extracting and parsing the data looks like this:
let count = 0;
let cities = new Map();
for await (const line of rl) {
  // process each row, excluding the header row
  if (count > 0) { // skip the header row
    // the parsing logic is the same as in the first implementation
    // ...
    // the city mapping logic is also the same as in the first implementation
    // ...
  }
  count++;
}
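For reference, here is how those placeholders could be assembled into one self-contained function. This is a sketch based on the snippets above, not the exact code from the repository:

import fs from 'fs';
import readline from 'readline';

async function processCsvAwait(path) {
  console.time('processing_csv_time');

  const reader = fs.createReadStream(path, { encoding: 'utf8' });
  const rl = readline.createInterface({ input: reader, crlfDelay: Infinity });

  let count = 0;            // all lines read, including the header
  let rows = 0;             // data rows actually parsed
  const cities = new Map(); // 'city name' => total customers

  for await (const line of rl) {
    if (count > 0 && line !== '') { // skip the header row and empty lines
      // same quote-aware split as in the first implementation
      const splitted = [];
      let current = '';
      let inQuotes = false;
      for (const char of line) {
        if (char === '"') {
          inQuotes = !inQuotes;
        } else if (char === ',' && !inQuotes) {
          splitted.push(current);
          current = '';
        } else {
          current += char;
        }
      }
      splitted.push(current);

      // same city mapping as in the first implementation (column 5 is the city)
      const city = splitted[5];
      cities.set(city, (cities.get(city) ?? 0) + 1);
      rows++;
    }
    count++;
  }

  const sorted = Array.from(cities).sort((a, b) => b[1] - a[1]);
  console.log('rows count: ' + rows);
  console.log(JSON.stringify(sorted));
  console.timeEnd('processing_csv_time');
}

processCsvAwait('../data/customers-1000000.csv');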
Improvement Result
With the improved method I was finally able to process 1 million rows in about 2 seconds. That's close to the PHP result from the earlier post. It feels like something is still missing to push the performance further, but I couldn't figure out what yet. Hopefully in the next #WeekendBuild.
That's it!