Detecting non-ASCII characters in a text file
Our internal coding standard for C++ source files dictates that 7-bit US-ASCII should be used for file encoding.
This decision is based on the fact that the current C++ standard (2003) limits the characters that can be used in variable and type identifiers to ASCII letters, digits and the underscore. Although some compilers and the new (2011) C++ standard allow most Unicode code points in identifiers (basically whatever can be called a “letter” in the various scripts), the “same-glyph, different Unicode code-point syndrome” described here argues against that.
One could still allow non-ASCII characters in string constants and in comments, and this is tolerated by most modern compilers. But we decided to be quite conservative in the current version of our coding standard; once C++ 2011 is fully implemented, we might revise it.
The trouble is that sometimes non-ASCII characters sneak in, for example the euro sign €, the degree symbol ° and the en dash – which looks so similar to the minus sign -.
Long story short, we needed a utility to detect non-ASCII characters in a collection of text (source) files. This utility is called checkAscii, and the C++ source code is:
/*
@file checkAscii.cc
@brief Detect non-ASCII characters in a text file
@author (C) Copyright 2012 Paolo Greppi libpf.com
@date 20120525
@version 0.1
no warranties whatsoever
distribute freely and free of charge citing this:
Detecting non-ASCII characters in a text file
*/
#include <iostream>
#include <fstream>
#include <cstdio>

int main(int argc, char *argv[]) {
  std::istream *in = NULL;
  std::ifstream inf;
  if (argc == 1) {
    in = &std::cin;
    std::cout << "Now checking stdin" << std::endl;
  } else if (argc == 2) {
    inf.open(argv[1]);
    if (!inf) {
      std::cerr << "Error opening input file !" << std::endl;
      return -1;
    }
    in = &inf;
    std::cout << "Now checking file " << argv[1] << std::endl;
  } else {
    std::cerr << "Only 0 or 1 argument !" << std::endl;
    return -1;
  }
  // get() returns an int: keeping it in an int lets EOF be told apart from
  // the byte 0xFF, and a NUL byte no longer ends the scan early.
  int c;
  int line(0), column(0), count(0);
  while ((c = in->get()) != EOF) {
    if (c == '\n') {
      ++line;
      column = 0;
      continue;
    }
    ++column; // columns are counted from 1 on every line
    if (c & 0x80) { // bit 8 set: outside the 7-bit US-ASCII range
      std::cout << "line: " << line + 1 << " column: " << column
                << " nonascii " << static_cast<char>(c) << std::endl;
      count++;
    }
  }
  if (argc > 1) {
    inf.close();
  }
  return count;
}
Usage is as follows:
cat mySourceFile.cc | checkAscii
or:
checkAscii mySourceFile.cc
It will print this if non-ASCII characters are found (and return the number of non-ASCII characters found; on POSIX systems the exit status keeps only the low 8 bits, so a count of 256 wraps around to 0):
Now checking file mySourceFile.h
line: 53 column: 26 nonascii °
line: 54 column: 27 nonascii €
or will print this (and return 0) if only ASCII characters are found:
Now checking file mySourceFile.h
We use it on large sets of files using bash and xargs as follows:
ls -1 include/*.h | xargs -d '\n' -n 1 checkAscii
ls -1 src/*.cc | xargs -d '\n' -n 1 checkAscii
Enjoy !