Working with Big Files in Java

Reading a Large File Efficiently in Java

Learn to read all lines from a large file (gigabytes in size) in Java while avoiding performance pitfalls such as very high memory usage, or even an OutOfMemoryError if the file is large enough.

1. Approach to Read Large Files

Similar to the DOM and SAX parsers for XML files, we can read a file using one of two approaches:

  • Reading the complete file into memory before processing it
  • Reading the file content line by line and processing each line independently

The first approach looks cleaner and is suitable for small files where memory requirements are low (kilobytes or a few megabytes). If used to read large files, it quickly results in an OutOfMemoryError for files gigabytes in size.

The second approach is suitable for reading very large files, in gigabytes, when it is not feasible to read the whole file into memory. In this approach, we use line streaming, i.e. we read the lines from the file in the form of a stream or iterator.

This tutorial is focused on the solutions using the second approach.

2. Using Java NIO’s Files.lines()

Using the Files.lines() method, the contents of the file are read and processed lazily so that only a small portion of the file is stored in memory at any given time.

The good thing about this approach is that we can directly write Consumer actions and use newer language features such as lambda expressions with the Stream API.

Path filePath = Paths.get("C:/temp/file.txt");

// try-with-resources
try (Stream<String> lines = Files.lines(filePath)) {
    lines.forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}

3. Commons IO’s FileUtils.lineIterator()

The lineIterator() method uses a Reader to iterate over the lines of a specified file. Use try-with-resources to auto-close the iterator after reading the file.

Do not forget to add the latest version of the commons-io module to your project dependencies.

File file = new File("C:/temp/file.txt");

try (LineIterator it = FileUtils.lineIterator(file, "UTF-8")) {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with the line
        System.out.println(line);
    }
} catch (IOException e) {
    e.printStackTrace();
}

4. Reading Large Binary Files

Note that when we read a file as a Stream or line by line, we are referring to character-based or text files. For binary files, the UTF-8 charset may corrupt the data, so the above solutions do not apply to binary data files.

To read large raw data files, such as movies or large images, we can use Java NIO’s ByteBuffer and FileChannel classes. Remember that you will need to try different buffer sizes and pick the one that works best for you.

try (RandomAccessFile aFile = new RandomAccessFile("test.txt", "r");
     FileChannel inChannel = aFile.getChannel()) {

    // buffer size is 1024 bytes
    ByteBuffer buffer = ByteBuffer.allocate(1024);
    while (inChannel.read(buffer) > 0) {
        buffer.flip();
        for (int i = 0; i < buffer.limit(); i++) {
            System.out.print((char) buffer.get());
        }
        buffer.clear(); // do something with the data and clear/compact it
    }
} catch (IOException e) {
    e.printStackTrace();
}

This Java tutorial discussed a few efficient solutions for reading very large files. The correct solution depends on the type of file and other deciding factors specific to the problem.

I suggest benchmarking all the solutions in your environment and choosing based on their performance.
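
As a rough starting point, here is a minimal sketch of such a measurement. The file path is a placeholder, counting lines stands in for your real per-line processing, and for rigorous numbers a harness such as JMH is preferable:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class ReadBenchmark {

    public static void main(String[] args) throws IOException {
        Path filePath = Paths.get("C:/temp/file.txt"); // placeholder: your test file

        long start = System.nanoTime();
        long lineCount;
        try (Stream<String> lines = Files.lines(filePath)) {
            lineCount = lines.count(); // stand-in for your real per-line processing
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // rough snapshot of heap usage after the run
        long usedHeap = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();

        System.out.println("Lines read : " + lineCount);
        System.out.println("Time taken : " + elapsedMs + " ms");
        System.out.println("Used heap  : " + (usedHeap / (1024 * 1024)) + " MB");
    }
}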

How to Read Large File in Java

In our last article, we covered how to read a file in Java. This post will cover how to read a large file in Java efficiently.

Reading a large file in Java efficiently has always been a challenge. With new enhancements coming to the Java IO package, it is becoming more and more efficient.

We used a sample file of size 1 GB for all of these tests. Reading such a large file into memory is not a good option, so we will cover various methods for reading a large file in Java line by line.
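
If you want to reproduce numbers like these, you first need a comparably large input file. Below is a minimal sketch for generating one; the path, the ~1 GB target size, and the repeated sample line are arbitrary assumptions you can adjust:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TestFileGenerator {

    public static void main(String[] args) throws IOException {
        String fileName = "/tutorials/fileread/file.txt"; // adjust to your machine
        long targetBytes = 1024L * 1024 * 1024;           // ~1 GB target size
        String line = "The quick brown fox jumps over the lazy dog\n";

        try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName))) {
            long written = 0;
            while (written < targetBytes) {
                writer.write(line);
                written += line.length(); // ASCII content: one byte per char in UTF-8
            }
        }
    }
}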

1 Using Java API

We will cover various options for reading a file efficiently using the plain Java API.

1.1 Using Java BufferedReader

public class ReadLargeFileByBufferReader {

    public static void main(String[] args) throws IOException {
        String fileName = "/tutorials/fileread/file.txt"; // this path is on my local
        try (BufferedReader fileBufferReader = new BufferedReader(new FileReader(fileName))) {
            String fileLineContent;
            while ((fileLineContent = fileBufferReader.readLine()) != null) {
                // process the line
            }
        }
    }
}
Max Memory Used: 258 MB
Time Taken: 100 seconds

1.2 Using Java 8 Stream API

public class ReadLargeFileUsingStream {

    public static void main(String[] args) throws IOException {
        String fileName = "/tutorials/fileread/file.txt"; // this path is on my local
        // lines(Path path, Charset cs)
        try (Stream<String> inputStream = Files.lines(Paths.get(fileName), StandardCharsets.UTF_8)) {
            inputStream.forEach(System.out::println);
        }
    }
}
Max Memory Used: 390 MB
Time Taken: 60 seconds

1.3 Using Java Scanner

The Java Scanner API also provides a way to read a large file line by line.

public class ReadLargeFileByScanner {

    public static void main(String[] args) throws FileNotFoundException {
        String fileName = "/Users/umesh/personal/tutorials/fileread/file.txt"; // this path is on my local
        InputStream inputStream = new FileInputStream(fileName);
        try (Scanner fileScanner = new Scanner(inputStream, StandardCharsets.UTF_8.name())) {
            while (fileScanner.hasNextLine()) {
                System.out.println(fileScanner.nextLine());
            }
        }
    }
}
Max Memory Used: 460 MB
Time Taken: 60 seconds

2 Streaming a File Using Apache Commons IO

This can also be achieved using the Apache Commons IO FileUtils.lineIterator() method.

public class ReadLargeFileUsingApacheCommonIO {

    public static void main(String[] args) throws IOException {
        String fileName = "/Users/umesh/personal/tutorials/fileread/file.txt"; // this path is on my local
        try (LineIterator fileContents = FileUtils.lineIterator(new File(fileName), StandardCharsets.UTF_8.name())) {
            while (fileContents.hasNext()) {
                System.out.println(fileContents.nextLine());
            }
        }
    }
}
Max Memory Used: 400 MB
Time Taken: 60 seconds

That is how to read a large file in Java efficiently. There are a few things you need to pay close attention to:

  1. Reading a large file in one go is not a good option (you will get an OutOfMemoryError).
  2. We adapted the technique of reading the large file line by line to keep the memory footprint low.

I used VisualVM to monitor memory, CPU, and thread pool information while running these programs.

Based on our tests, BufferedReader has the lowest memory footprint, though its overall execution was the slowest.

All the code for this article is available over on GitHub. It is a Maven-based project.

Java – Reading a Large File Efficiently

What’s the most efficient and easiest way to read a large file in Java? Well, one way is to read the whole file at once into memory. Let us examine some issues that arise when doing so.

2. Loading Whole File Into Memory

One way to load the whole file into a String is to use NIO. This can be accomplished in a single line as follows:

String str = new String(Files.readAllBytes(Paths.get(pathname)), StandardCharsets.UTF_8);

There are several other ways to read a whole file into memory. Check this article for more details, including benchmarks.
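
For example, if you are on Java 11 or later, the same one-liner can be written with Files.readString() (a small sketch, not one of the benchmarked variants):

String str = Files.readString(Paths.get(pathname), StandardCharsets.UTF_8);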

The problem with the above approach is that, with a sufficiently large file, you end up with an OutOfMemoryError.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

On my machine with 4 GB of RAM and 12 GB of swap, I cannot load a 300 MB file successfully using this method. So we need to look at alternative methods of processing a whole file.

3. Loading a Binary File in Chunks

The following code demonstrates how to load and process the bytes in a file (which can be a binary file) one chunk at a time.

try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(pathname))) {
    byte[] bbuf = new byte[4096];
    int len;
    while ((len = in.read(bbuf)) != -1) {
        // process data here: bbuf[0] through bbuf[len - 1]
    }
}

4. Reading a Text File Line By Line

Processing a text file is easier when you need to do it line by line. There are several methods for doing so. Here is one method using a BufferedReader:

try (BufferedReader in = new BufferedReader(new FileReader(pathname))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

5. Using a Scanner

The Scanner class provides another convenient way to read a file line by line, using the hasNextLine() and nextLine() methods.

try (Scanner scanner = new Scanner(new File(pathname))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        // process line here
    }
}

If you need to read a file line by line, I recommend the BufferedReader method above, since the Scanner method is slow as molasses.

6. With Java 8 Streams

Java 8 provides the streams facility, which is useful in a wide variety of cases. Here we can use the Files.lines() method to create a stream of lines from a file, apply any filters, and do any processing we want. In the following example, we select lines that contain the string "abc" and collect the results into a List.

List<String> alist;
try (Stream<String> lines = Files.lines(Paths.get(pathname))) {
    alist = lines.filter(line -> line.contains("abc"))
                 .collect(Collectors.toList());
}

Review

We discussed some methods for loading and processing files efficiently. First off, you could just load the whole file into memory if the file is small enough. For large files, you need to process them in chunks. A binary file can be processed in chunks of, say, 4 KB. A text file can be processed line by line.

Reading Large Files Line by Line in Java

In the previous article, How to Save a Text File in Java, we learned how to write a large text file. This time, let's learn how to read it line by line and process each line in some way. For example, we will look for the greatest line length. Since the file is larger than 100 MB, reading it takes a noticeable amount of time.

As usual, we will consider several ways to solve this problem, gradually improving performance.

Reading the Whole File at Once

The nio package has a convenient static method, Files.readAllLines(). It returns a list of strings, i.e. it reads the entire contents of the file into memory.

public int getMaxLineLengthAllLines(File file) {
    try {
        return Files.readAllLines(Path.of(file.toURI()))
                .stream()
                .mapToInt(String::length)
                .max()
                .orElse(0);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

From this list we obtain a stream, then transform it into the line lengths using the mapToInt() method and find the greatest value among them. For the case when the list is empty, no maximum can be found, so we return 0 by default. For convenience, in the catch block we wrap the exception in a RuntimeException so that we don't have to declare exceptions in our method signature.

As you may already guess, reading a large file into memory is not the best idea, especially if you process each line separately. Reading a 111 MB file on my laptop takes about 1100 milliseconds, i.e. a little over a second.

Reading the File via a Stream

Let's look at a more efficient implementation, where we do without creating an intermediate collection of all the lines and instead obtain a stream right away and process its lines one by one. We will use the Files.lines() method. This option is preferable from a memory-saving point of view.

public int getMaxLineLengthStream(File file) {
    try (var lines = Files.lines(Paths.get(file.toURI()), StandardCharsets.UTF_8)) {
        return lines.mapToInt(String::length)
                .max()
                .orElse(0);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

Note that we initialize the stream in a try-with-resources block so that Java automatically releases the resources when leaving the method. Otherwise, the implementation of this method is similar to the previous one: we get the line lengths and look for the maximum among them.

With the same file, this implementation runs almost twice as fast, in about 600 milliseconds.

Buffering

If we set the nio package aside and use the classic combination of a FileReader wrapped in a BufferedReader, this option gives an even bigger performance boost.

public int getMaxLineLengthBuffered(File file) {
    try (
            var fr = new FileReader(file);
            var br = new BufferedReader(fr)
    ) {
        var line = br.readLine();
        var maxLineLength = 0;

        while (line != null) {
            if (maxLineLength < line.length()) {
                maxLineLength = line.length();
            }
            line = br.readLine();
        }
        return maxLineLength;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

We initialize the readers at the beginning of the try-with-resources block, set up the necessary variables, and read the whole file line by line in a loop. Along the way, we check the length of the current line, and if it is greater than the one found earlier, we take it as the result. After processing the whole file, we return the value found.

This implementation processes the whole file in about 530 milliseconds, but it requires writing a bit more code.

Conclusions

The Files.readAllLines() solution is the least efficient one because it reads all the lines into memory. It only makes sense to use it when reading the whole file at once cannot be avoided. If it is possible to process the file line by line, it is better to use Files.lines(), which returns a stream right away.

And the classic FileReader + BufferedReader combination is the most efficient, although it is blocking, unlike the first two options.
