Advanced Java 8 Streams

Example: A File Splitter in Java 8

In software development we often have to handle large amounts of data, and this data usually comes as one large file. In this article I will walk through a specific example of how to handle such a case.

Let's consider a large file of 253,572 lines that contains the quotations of 4 companies over a certain period of time. (The values in this file are fake and duplicated, so do not use them for any financial study.)

The content of this file looks like the following:

Filecontent

The purpose of our program is to split this file into 4 files according to the company code (listed in the first column, under the header "PERMNO"). We also want each output file to be named "CompanyGrossData_" + premno + ".txt"; for example, the output file for company code 10000 will be CompanyGrossData_10000.txt.
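The naming rule above can be sketched as follows. This is only an illustration: the field delimiter of the real file is not shown here, so the sketch assumes fields are separated by whitespace, commas, or semicolons, and the names OutputNaming, permnoOf, and outputFileName are hypothetical (they are not the helpers used in the repository).

```java
final class OutputNaming {

    // Assumption: the company code is the first field, and fields are
    // separated by whitespace, commas, or semicolons.
    static String permnoOf(String line) {
        return line.trim().split("[\\s,;]+", 2)[0];
    }

    // Builds the output file name for a given company code.
    static String outputFileName(String permno) {
        return "CompanyGrossData_" + permno + ".txt";
    }

    public static void main(String[] args) {
        String line = "10000 7952 19860107"; // made-up sample line
        System.out.println(outputFileName(permnoOf(line)));
        // prints "CompanyGrossData_10000.txt"
    }
}
```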

Java 8 introduced a lightweight method in the Files class that lets us iterate over the lines of a file, either sequentially or in parallel:

public static Stream<String> lines(Path path) throws IOException {
    BufferedReader br = Files.newBufferedReader(path);
    try {
        ....

Be careful when using this method: the returned stream keeps the file open until the stream is closed, so it should be used in a try-with-resources statement. Otherwise, while processing many files or one large file, you can run into a "Too many open files" exception or an OutOfMemoryError. In our example we code it as follows:
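As a minimal, self-contained sketch of the try-with-resources pattern around Files.lines, the following counts the data lines of a file while skipping the header. The class name LineCounter, the method countDataLines, and the sample data are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

final class LineCounter {

    // Counts the lines that do not contain the given header marker.
    static long countDataLines(Path path, String headerMarker) throws IOException {
        try (Stream<String> lines = Files.lines(path)) {
            return lines.filter(line -> !line.contains(headerMarker)).count();
        } // the stream, and the file handle behind it, is closed here
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample file: one header line and two data lines
        Path sample = Files.createTempFile("quotes", ".txt");
        Files.write(sample, Arrays.asList(
                "PERMNO PERMCO PRICE",
                "10000 7952 1.5",
                "10001 7953 2.5"));
        System.out.println(countDataLines(sample, "PERMCO")); // prints 2
        Files.delete(sample);
    }
}
```

The stream is closed when the try block exits, whether it completes normally or throws.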

// try-with-resources closes the stream (and the underlying file) when the block exits
try (Stream<String> lFileStream = Files.lines(file.toPath()).parallel()) {
    System.out.println("[" + ZonedDateTime.now().format(DateTimeFormatter.ISO_ZONED_DATE_TIME)
            + "] Start Splitting Process");
    // Skip the header line, then append each data line to its company's file.
    // Note: on a parallel stream, forEach does not preserve the original line order.
    lFileStream.filter(line -> !line.contains("PERMCO")).forEach(line -> {
        String premno = StringUtils.getPremnoFromline(line);

        File outputFile =
                new File(outputDirectory + File.separator + "CompanyGrossData_" + premno + ".txt");
        dumpLineIntoFile(outputFile, line);
        getNextUniqueIndex();
        // Log progress every 100,000 lines
        if (counter.intValue() % 100000 == 0) {
            System.out.println("[" + ZonedDateTime.now().format(DateTimeFormatter.ISO_ZONED_DATE_TIME)
                    + "] Number of processed lines: " + counter
                    + " Spent Time: " + (System.currentTimeMillis() - startTime) / 1000 + " Seconds");
        }
    });
}
....
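For comparison, here is a sketch of a sequential variant of the same idea. It is not the code from the repository: it swaps the parallel stream for a sequential one and keeps one BufferedWriter open per company, so line order is preserved and no file is reopened for every line. The class name SequentialSplitter and the delimiter handling (first field separated by whitespace, commas, or semicolons) are assumptions made for the illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

final class SequentialSplitter {

    static void split(Path input, Path outputDirectory) throws IOException {
        // One open writer per company code, created on first use
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (Stream<String> lines = Files.lines(input)) {
            lines.filter(line -> !line.trim().isEmpty())       // skip blank lines
                 .filter(line -> !line.contains("PERMCO"))     // skip the header
                 .forEach(line -> {
                     // Assumption: the company code is the first field
                     String permno = line.trim().split("[\\s,;]+", 2)[0];
                     try {
                         BufferedWriter writer = writers.get(permno);
                         if (writer == null) {
                             writer = Files.newBufferedWriter(
                                     outputDirectory.resolve("CompanyGrossData_" + permno + ".txt"));
                             writers.put(permno, writer);
                         }
                         writer.write(line);
                         writer.newLine();
                     } catch (IOException e) {
                         // Lambdas cannot throw checked exceptions, so wrap it
                         throw new UncheckedIOException(e);
                     }
                 });
        } finally {
            // Flush and close every output file
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }
}
```

A call such as SequentialSplitter.split(Paths.get("input.txt"), Paths.get("out")) would write one file per company code into the existing directory "out"; both paths are placeholders.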

 

You can clone the code from my GitHub page: https://github.com/mhimehdi/FileSplitter