Scan text in Java

Class Scanner

A Scanner breaks its input into tokens using a delimiter pattern, which by default matches whitespace. The resulting tokens may then be converted into values of different types using the various next methods.

For example, this code allows a user to read a number from System.in:

    Scanner sc = new Scanner(System.in);
    int i = sc.nextInt();

As another example, this code allows long types to be assigned from entries in a file myNumbers:

    Scanner sc = new Scanner(new File("myNumbers"));
    while (sc.hasNextLong()) {
        long aLong = sc.nextLong();
    }

The scanner can also use delimiters other than whitespace. This example reads several items in from a string:

    String input = "1 fish 2 fish red fish blue fish";
    Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
    System.out.println(s.nextInt());
    System.out.println(s.nextInt());
    System.out.println(s.next());
    System.out.println(s.next());
    s.close();

prints the following output:

    1
    2
    red
    blue

The same output can be generated with this code, which uses a regular expression to parse all four tokens at once:

    String input = "1 fish 2 fish red fish blue fish";
    Scanner s = new Scanner(input);
    s.findInLine("(\\d+) fish (\\d+) fish (\\w+) fish (\\w+)");
    MatchResult result = s.match();
    for (int i = 1; i <= result.groupCount(); i++)
        System.out.println(result.group(i));
    s.close();

The default whitespace delimiter used by a scanner is as recognized by Character.isWhitespace(). The reset() method will reset the value of the scanner's delimiter to the default whitespace delimiter regardless of whether it was previously changed.
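The effect of reset() on the delimiter can be seen in a minimal sketch. The class name DelimiterResetDemo and the helper tokens are hypothetical names chosen for this illustration, not part of the Scanner API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterResetDemo {
    // Sets a comma delimiter, optionally calls reset(), then collects every token.
    static List<String> tokens(String input, boolean resetBeforeReading) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(",");
            if (resetBeforeReading) {
                s.reset(); // back to the default whitespace delimiter
            }
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b,c d", false)); // split on the comma only
        System.out.println(tokens("a b,c d", true));  // split on whitespace only
    }
}
```

With the comma delimiter the input yields the two tokens "a b" and "c d"; after reset() it yields "a", "b,c" and "d".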

A scanning operation may block waiting for input.

The next() and hasNext() methods and their companion methods (such as nextInt() and hasNextInt()) first skip any input that matches the delimiter pattern, and then attempt to return the next token. Both hasNext() and next() methods may block waiting for further input. Whether a hasNext() method blocks has no connection to whether or not its associated next() method will block. The tokens() method may also block waiting for input.

The findInLine(), findWithinHorizon(), skip(), and findAll() methods operate independently of the delimiter pattern. These methods will attempt to match the specified pattern with no regard to delimiters in the input and thus can be used in special circumstances where delimiters are not relevant. These methods may block waiting for more input.
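As a small illustration of this independence, the following sketch sets a comma delimiter and then uses findInLine() with a pattern that itself contains a comma. The class name FindInLineDemo and the method findPair are hypothetical.

```java
import java.util.Scanner;

class FindInLineDemo {
    // Returns the first "digits,digits" pair on the line, or null if there is none.
    static String findPair(String line) {
        try (Scanner s = new Scanner(line)) {
            s.useDelimiter(","); // the delimiter is ignored by findInLine
            return s.findInLine("\\d+,\\d+");
        }
    }

    public static void main(String[] args) {
        System.out.println(findPair("id=12,34;rest")); // the match spans the delimiter
        System.out.println(findPair("no digits here")); // null when nothing matches
    }
}
```

The match "12,34" crosses a delimiter, which a token-based method such as next() could not return as one piece.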

When a scanner throws an InputMismatchException, the scanner will not pass the token that caused the exception, so that it may be retrieved or skipped via some other method.
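One way to use this behavior is to catch the exception and consume the offending token with next(). The sketch below is only an illustration; MismatchRecoveryDemo and sumInts are hypothetical names.

```java
import java.util.InputMismatchException;
import java.util.Scanner;

class MismatchRecoveryDemo {
    // Adds up the integer tokens and skips every token that is not an integer.
    static int sumInts(String input) {
        int sum = 0;
        try (Scanner s = new Scanner(input)) {
            while (s.hasNext()) {
                try {
                    sum += s.nextInt();
                } catch (InputMismatchException e) {
                    // The bad token was not consumed, so skip it explicitly.
                    s.next();
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("1 two 3"));
    }
}
```

Without the s.next() call in the catch block the loop would never advance past "two". In practice, testing with hasNextInt() before calling nextInt() avoids the exception altogether.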

Depending upon the type of delimiting pattern, empty tokens may be returned. For example, the pattern "\\s+" will return no empty tokens since it matches multiple instances of the delimiter. The delimiting pattern "\\s" could return empty tokens since it only passes one space at a time.
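The difference between the two patterns can be observed with an input that contains two consecutive spaces. This is a minimal sketch; EmptyTokenDemo and tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class EmptyTokenDemo {
    // Collects all tokens of the input using the given delimiter regex.
    static List<String> tokens(String input, String delimiterRegex) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(input).useDelimiter(delimiterRegex)) {
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "a  b"; // two spaces between a and b
        System.out.println(tokens(input, "\\s+").size()); // the run of spaces is one delimiter
        System.out.println(tokens(input, "\\s").size());  // each space is its own delimiter
    }
}
```

With "\\s+" the input yields two tokens; with "\\s" it yields three, the middle one being the empty string between the two spaces.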

A scanner can read text from any object which implements the Readable interface. If an invocation of the underlying readable's read() method throws an IOException then the scanner assumes that the end of the input has been reached. The most recent IOException thrown by the underlying readable can be retrieved via the ioException() method.
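The sketch below illustrates this with a made-up Readable that supplies a few characters and then fails. FailingSource, IoExceptionDemo, and the message "simulated failure" are all assumptions for the example.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.util.Scanner;

class IoExceptionDemo {
    // Hypothetical source: delivers "42 " on the first read, then throws.
    static final class FailingSource implements Readable {
        private boolean served = false;

        @Override
        public int read(CharBuffer cb) throws IOException {
            if (!served) {
                served = true;
                cb.put("42 ");
                return 3;
            }
            throw new IOException("simulated failure");
        }
    }

    static String describe() {
        Scanner s = new Scanner(new FailingSource());
        int value = s.nextInt();          // parsed from the first read
        boolean more = s.hasNext();       // second read throws; treated as end of input
        IOException e = s.ioException();  // the swallowed exception is still available
        s.close();
        return value + " " + more + " " + (e == null ? "none" : e.getMessage());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The scanner itself never throws the IOException; code that needs to distinguish a genuine end of input from a read failure has to check ioException() explicitly.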

When a Scanner is closed, it will close its input source if the source implements the Closeable interface.
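A quick way to see this is to wrap a reader that records whether close() was called. TrackingReader and CloseDemo are hypothetical names for this sketch.

```java
import java.io.StringReader;
import java.util.Scanner;

class CloseDemo {
    // Hypothetical reader that remembers whether it has been closed.
    static final class TrackingReader extends StringReader {
        boolean closed = false;

        TrackingReader(String text) {
            super(text);
        }

        @Override
        public void close() {
            closed = true;
            super.close();
        }
    }

    static boolean sourceClosedWithScanner() {
        TrackingReader reader = new TrackingReader("hello");
        // try-with-resources closes the scanner, which closes the reader.
        try (Scanner s = new Scanner(reader)) {
            s.next();
        }
        return reader.closed;
    }

    public static void main(String[] args) {
        System.out.println(sourceClosedWithScanner());
    }
}
```

The same rule means that closing a Scanner built on System.in also closes System.in, so a program that needs to keep reading from standard input should typically keep a single scanner open.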

A Scanner is not safe for multithreaded use without external synchronization.

Unless otherwise mentioned, passing a null parameter into any method of a Scanner will cause a NullPointerException to be thrown.

A scanner will default to interpreting numbers as decimal unless a different radix has been set by using the useRadix(int) method. The reset() method will reset the value of the scanner's radix to 10 regardless of whether it was previously changed.
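A short sketch of the radix behavior follows; RadixDemo and parse are hypothetical names.

```java
import java.util.Scanner;

class RadixDemo {
    // Reads two tokens in base 16, then resets the scanner and reads one in base 10.
    static int[] parse(String input) {
        try (Scanner s = new Scanner(input)) {
            s.useRadix(16);
            int first = s.nextInt();  // "ff" in base 16
            int second = s.nextInt(); // "10" in base 16
            s.reset();                // radix is 10 again
            int third = s.nextInt();  // "10" in base 10
            return new int[] { first, second, third };
        }
    }

    public static void main(String[] args) {
        for (int value : parse("ff 10 10")) {
            System.out.println(value);
        }
    }
}
```

Here the same token "10" is read as 16 before the reset and as 10 after it. For a one-off conversion, nextInt(int radix) takes the radix as an argument without changing the scanner's default.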

Localized numbers

An instance of this class is capable of scanning numbers in the standard formats as well as in the formats of the scanner's locale. A scanner's initial locale is the value returned by the Locale.getDefault(Locale.Category.FORMAT) method; it may be changed via the useLocale() method. The reset() method will reset the value of the scanner's locale to the initial locale regardless of whether it was previously changed.
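As an illustration, the sketch below parses the same quantity written in US and German conventions. LocaleScanDemo and parse are hypothetical names, and the separators are those the JDK's locale data supplies for Locale.US and Locale.GERMANY.

```java
import java.util.Locale;
import java.util.Scanner;

class LocaleScanDemo {
    // Parses a single double token using the number format of the given locale.
    static double parse(String text, Locale locale) {
        try (Scanner s = new Scanner(text).useLocale(locale)) {
            return s.nextDouble();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1,234.5", Locale.US));      // comma groups, period decimal
        System.out.println(parse("1.234,5", Locale.GERMANY)); // period groups, comma decimal
    }
}
```

Setting the locale explicitly keeps the parse independent of the default locale of the machine the program happens to run on.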

The localized formats are defined in terms of the following parameters, which for a particular locale are taken from that locale's DecimalFormat object, df, and its DecimalFormatSymbols object, dfs.

LocalGroupSeparator: The character used to separate thousands groups, i.e., dfs.getGroupingSeparator()
LocalDecimalSeparator: The character used for the decimal point, i.e., dfs.getDecimalSeparator()
LocalPositivePrefix: The string that appears before a positive number (may be empty), i.e., df.getPositivePrefix()
LocalPositiveSuffix: The string that appears after a positive number (may be empty), i.e., df.getPositiveSuffix()
LocalNegativePrefix: The string that appears before a negative number (may be empty), i.e., df.getNegativePrefix()
LocalNegativeSuffix: The string that appears after a negative number (may be empty), i.e., df.getNegativeSuffix()
LocalNaN: The string that represents not-a-number for floating-point values, i.e., dfs.getNaN()
LocalInfinity: The string that represents infinity for floating-point values, i.e., dfs.getInfinity()
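To inspect these parameters for a given locale, something like the following sketch can be used. LocaleSymbolsDemo and describe are hypothetical names; the instanceof check is a precaution, since NumberFormat.getNumberInstance is not guaranteed to return a DecimalFormat.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

class LocaleSymbolsDemo {
    // Builds a one-line summary of the number-format parameters of a locale.
    static String describe(Locale locale) {
        DecimalFormatSymbols dfs = DecimalFormatSymbols.getInstance(locale);
        StringBuilder sb = new StringBuilder();
        sb.append("group=").append(dfs.getGroupingSeparator());
        sb.append(" decimal=").append(dfs.getDecimalSeparator());
        sb.append(" nan=").append(dfs.getNaN());
        sb.append(" infinity=").append(dfs.getInfinity());
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        if (nf instanceof DecimalFormat) {
            DecimalFormat df = (DecimalFormat) nf;
            sb.append(" positivePrefix=[").append(df.getPositivePrefix()).append("]");
            sb.append(" negativePrefix=[").append(df.getNegativePrefix()).append("]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(Locale.US));
        System.out.println(describe(Locale.GERMANY));
    }
}
```

The exact strings for NaN and infinity depend on the locale data shipped with the JDK in use, so they are printed here rather than assumed.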

Number syntax

The strings that can be parsed as numbers by an instance of this class are specified in terms of the following regular-expression grammar, where Rmax is the highest digit in the radix being used (for example, Rmax is 9 in base 10).

    NonAsciiDigit: A non-ASCII character c for which Character.isDigit(c) returns true
    Non0Digit: [1-Rmax] | NonAsciiDigit
    Digit: [0-Rmax] | NonAsciiDigit
    GroupedNumeral: ( Non0Digit Digit? Digit? ( LocalGroupSeparator Digit Digit Digit )+ )
    Numeral: ( ( Digit+ ) | GroupedNumeral )
    Integer: ( [-+]? ( Numeral ) ) | LocalPositivePrefix Numeral LocalPositiveSuffix | LocalNegativePrefix Numeral LocalNegativeSuffix
    DecimalNumeral: Numeral | Numeral LocalDecimalSeparator Digit* | LocalDecimalSeparator Digit+
    Exponent: ( [eE] [+-]? Digit+ )
    Decimal: ( [-+]? DecimalNumeral Exponent? ) | LocalPositivePrefix DecimalNumeral LocalPositiveSuffix Exponent? | LocalNegativePrefix DecimalNumeral LocalNegativeSuffix Exponent?
    HexFloat: [-+]? 0[xX][0-9a-fA-F]*\.[0-9a-fA-F]+ ([pP][-+]?[0-9]+)?
    NonNumber: NaN | LocalNaN | Infinity | LocalInfinity
    SignedNonNumber: ( [-+]? NonNumber ) | LocalPositivePrefix NonNumber LocalPositiveSuffix | LocalNegativePrefix NonNumber LocalNegativeSuffix
    Float: Decimal | HexFloat | SignedNonNumber

Whitespace is not significant in the above regular expressions.
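The grammar can be probed with hasNextDouble(), which reports whether the next token matches the Float production. The sketch below assumes the US locale (comma as group separator, period as decimal separator); NumberSyntaxDemo and isFloat are hypothetical names.

```java
import java.util.Locale;
import java.util.Scanner;

class NumberSyntaxDemo {
    // Reports whether a single token is accepted as a floating-point number.
    static boolean isFloat(String token) {
        try (Scanner s = new Scanner(token).useLocale(Locale.US)) {
            return s.hasNextDouble();
        }
    }

    public static void main(String[] args) {
        String[] samples = { "1,234.5", "1,23.5", ".5", "1.5e3", "abc" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isFloat(sample));
        }
    }
}
```

"1,234.5" is accepted because the group separator is followed by exactly three digits, as GroupedNumeral requires, while "1,23.5" is rejected because only two digits follow the separator.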

Scanning

Objects of type Scanner are useful for breaking down formatted input into tokens and translating individual tokens according to their data type.

Breaking Input into Tokens

By default, a scanner uses white space to separate tokens. (White space characters include blanks, tabs, and line terminators. For the full list, refer to the documentation for Character.isWhitespace.) To see how scanning works, let's look at ScanXan, a program that reads the individual words in xanadu.txt and prints them out, one per line.

    import java.io.*;
    import java.util.Scanner;

    public class ScanXan {
        public static void main(String[] args) throws IOException {
            Scanner s = null;
            try {
                s = new Scanner(new BufferedReader(new FileReader("xanadu.txt")));
                while (s.hasNext()) {
                    System.out.println(s.next());
                }
            } finally {
                if (s != null) {
                    s.close();
                }
            }
        }
    }

Notice that ScanXan invokes Scanner's close method when it is done with the scanner object. Even though a scanner is not a stream, you need to close it to indicate that you're done with its underlying stream.

The output of ScanXan looks like this:

    In
    Xanadu
    did
    Kubla
    Khan
    A
    stately
    pleasure-dome
    ...

To use a different token separator, invoke useDelimiter(), specifying a regular expression. For example, suppose you wanted the token separator to be a comma, optionally followed by white space. You would invoke:

    s.useDelimiter(",\\s*");
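A comma-separated scan along these lines might look like the following sketch, which uses the regular expression ",\\s*" for "a comma, optionally followed by white space". CommaDelimiterDemo and tokens are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class CommaDelimiterDemo {
    // Splits the input on commas, ignoring any white space after each comma.
    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            s.useDelimiter(",\\s*");
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokens("apples, pears,plums,  figs")) {
            System.out.println(token);
        }
    }
}
```

Note that white space before a comma is not part of this delimiter, so it would remain attached to the preceding token.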

Translating Individual Tokens

The ScanXan example treats all input tokens as simple String values. Scanner also supports tokens for all of the Java language's primitive types (except for char), as well as BigInteger and BigDecimal. Also, numeric values can use thousands separators. Thus, in a US locale, Scanner correctly reads the string "32,767" as representing an integer value.
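The typed next methods can be combined on one input, as in this minimal sketch. TypedTokensDemo and describe are hypothetical names, and the US locale is set explicitly so that "32,767" is read with a comma as the thousands separator.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Locale;
import java.util.Scanner;

class TypedTokensDemo {
    // Reads an int, a BigInteger, a BigDecimal and a boolean, in that order.
    static String describe(String input) {
        try (Scanner s = new Scanner(input).useLocale(Locale.US)) {
            int i = s.nextInt();
            BigInteger big = s.nextBigInteger();
            BigDecimal dec = s.nextBigDecimal();
            boolean flag = s.nextBoolean();
            return i + " " + big + " " + dec + " " + flag;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("32,767 123456789012345678901234567890 0.125 true"));
    }
}
```

Each next method throws InputMismatchException if the next token does not fit its type, so input of uncertain shape is usually guarded with the matching hasNext method first.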

We have to mention the locale, because thousands separators and decimal symbols are locale specific. So, the following example would not work correctly in all locales if we didn't specify that the scanner should use the US locale. That's not something you usually have to worry about, because your input data usually comes from sources that use the same locale as you do. But this example is part of the Java Tutorial and gets distributed all over the world.

The ScanSum example reads a list of double values and adds them up. Here's the source:

    import java.io.FileReader;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.Scanner;
    import java.util.Locale;

    public class ScanSum {
        public static void main(String[] args) throws IOException {
            Scanner s = null;
            double sum = 0;
            try {
                s = new Scanner(new BufferedReader(new FileReader("usnumbers.txt")));
                s.useLocale(Locale.US);
                while (s.hasNext()) {
                    if (s.hasNextDouble()) {
                        sum += s.nextDouble();
                    } else {
                        s.next(); // skip tokens that are not numbers
                    }
                }
            } finally {
                // s is still null if the file could not be opened
                if (s != null) {
                    s.close();
                }
            }
            System.out.println(sum);
        }
    }

And here's the sample input file, usnumbers.txt:

    8.5
    32,767
    3.14159
    1,000,000.1

The output string is "1032778.74159". System.out.println(double) converts the value with Double.toString, which always uses a period as the decimal separator, so this output does not vary with the default locale. To print numbers in a locale-specific form, use formatting, as described in the next topic, Formatting.
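As a rough preview of that approach, the sketch below passes an explicit locale to String.format so the decimal separator is chosen deliberately. FormatLocaleDemo and format are hypothetical names, and the "%.5f" pattern is just one possible choice.

```java
import java.util.Locale;

class FormatLocaleDemo {
    // Formats a value with five decimal places using the given locale's separator.
    static String format(double value, Locale locale) {
        return String.format(locale, "%.5f", value);
    }

    public static void main(String[] args) {
        double sum = 1032778.74159;
        System.out.println(format(sum, Locale.US));      // period as decimal separator
        System.out.println(format(sum, Locale.GERMANY)); // comma as decimal separator
    }
}
```

Passing the locale explicitly keeps the output stable regardless of the default locale of the machine running the program.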
