This document describes a number of issues important to using the
MLC++ library under Windows NT

Dan Sommerfield
19 Nov 1997

Installation
------------
MLC++ for Windows NT is stored in a zip-format compressed archive.
Pkzip is the most common tool for uncompressing these archives.  You
can find pkzip for Windows at http://www.pkware.com

To extract the files, create a new directory to house all of MLC++.
For the rest of this document, we will assume you chose C:\mlc as
your directory.  Now, create this directory and extract the archive
into it.  If you did everything correctly, you should have a few
files and a large number of subdirectories off C:\mlc.

Compilers
---------
We compiled MLC++ for Windows NT using Microsoft Visual C++ 5.0.
Earlier versions (e.g. 4.2, 4.0, 2.x) have significant bugs which
will not allow you to compile the full source.  In particular, 4.2
will not work.  We recommend installing Microsoft Visual Studio
service pack #2 if you have not already.  However, it is not
necessary to install the service pack to compile the sources.

We also ship a number of external utilities which work with MLC++.
Some of these were compiled using gcc/g++ as part of the GNU-win32
environment by Cygnus Support (see below).  None of the main sources
directly depend on GNU-win32, however, and the MLC++ libraries will
NOT require Cygnus's cygwin32.dll to run.

Environment
-----------
MLC++ has a large suite of testers and scripts which are used to
automate builds and verify correct operation of the library.  These
scripts rely on a UNIX-like environment to operate.  We use the
GNU-win32 environment from Cygnus Support.  GNU-win32 is a freely
available implementation of the major GNU tools on Windows NT.  We
used the beta-17.1 version of GNU-win32.  Earlier versions have
significant bugs.
You can obtain information about gnu-win32 and the tools themselves
from:

  http://www.cygnus.com/misc/gnu-win32/

We have not tested the MLC++ scripts under other environments (such
as MKS ksh).  They should work with any shell which is sh-compliant
and with most standard make utilities.  Since different shell
environments handle drive letters differently on NT, you may need to
tweak some of the scripts to get pathnames converted correctly if you
use a different environment.  You cannot run the automatic builds and
testers using the standard CMD and NMAKE utilities.

In general, USING the MLC++ library should not require help from
UNIX-like utilities.  A few of the testers use utilities such as diff
from system() calls, but there should be no UNIX dependency in the
library itself.

Pathnames
---------
All MLC++ variables which refer to pathnames should be in standard
MLC++ format.  The format is very similar to the one accepted by
gnu-win32 utilities.  It is NOT the same as standard Windows NT paths:

1. Use forward slashes (/) as directory separators.
2. To specify a drive letter X, write //X/path instead of X:\path.
3. We do NOT yet support mounted volumes as is done in gnu-win32; the
   only way to access a directory on drive X is by writing //X/path.

We provide two simple utilities for converting pathnames in scripts:

  ptobash takes a standard Windows/DOS pathname and converts it to
          our format.
  ptodos  takes one of our pathnames and converts it to standard
          Windows/DOS format.

Usage:
  ptobash
  ptodos

You can write MYPATH=`ptobash $DOSPATH` in a shell script, for
example.

If you DO decide to use an alternate shell environment (such as MKS
ksh), you may need to modify ptobash or ptodos to convert to the
correct drive letter style.

Setting up an environment for MLC++
-----------------------------------
First, go to http://www.cygnus.com/misc/gnu-win32/ and follow the
instructions there to download and set up gnu-win32.
Follow the instructions for setting up a FULL development
environment, including gcc (you'll need it for the external
utilities).

Now, you need to make sure to do the following.  The variables are
best set in your .bashrc/.profile or similar files:

  All variables need to be exported.

  MLCDIR should be set to the path of the outermost directory in the
  MLC++ hierarchy.  This is //c/mlc if you used C:\mlc as your
  installation directory.  MLCDIR should be an MLC++ standard
  pathname (see the Pathnames section above).

Now, cd to $MLCDIR and type:

  . setup.NT

This will set some additional variables needed by the scripts.  You
should also put the following directories into your PATH (again, we
assume we installed in C:\mlc):

  C:\mlc\bin
  C:\mlc\bin\MSVC
  . (dot, current directory)

You are now ready to build.  Note that running a large number of
testers automatically is not particularly safe on NT.  Therefore, we
suggest doing "make minimal" to build the libraries first.  You can
then build the utilities by cd'ing to the util directory and doing a
"make".

Building external utilities
---------------------------
Most of these have NOT been ported to Windows NT.  T2 is an
exception.  You can attempt to build these by executing the
"buildexternal" script as follows:

  buildexternal make

Note that the external utilities are compiled with g++ under
GNU-win32, and NOT with MSVC.

Graphviz
--------
The utilities are designed to take advantage of the graphviz suite of
utilities by AT&T Bell Labs, available at

  http://www.research.att.com/sw/tools/graphviz/

Download the version for Win32 and follow all the installation
instructions, including setting necessary variables and path
components.  If graphviz is installed correctly, MLC++ will
automatically use these utilities.

MineSet visualizers
-------------------
The MineSet visualizers (Treeviz, Scatterviz, Ruleviz) are NOT
currently available for Windows NT; in this version you will not be
able to use the options that display classifiers in these formats.
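
For readers who want to convert pathnames from inside their own C++
code rather than from a shell script, the rules in the Pathnames
section can be sketched as follows.  This is only an illustration of
those rules, not the source of ptobash or ptodos: the function names
to_mlc_path, to_dos_path and check_round_trip are hypothetical, and
the sketch keeps the drive letter's case unchanged, which the shipped
tools may or may not do.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MLC++): "C:\mlc\bin" -> "//C/mlc/bin".
std::string to_mlc_path(const std::string& dos)
{
    std::string out;
    std::size_t pos = 0;
    // Rule 2: a leading "X:" drive prefix becomes "//X".
    if (dos.size() >= 2 &&
        std::isalpha(static_cast<unsigned char>(dos[0])) && dos[1] == ':') {
        out += "//";
        out += dos[0];
        pos = 2;
        // Assumption: "C:foo" is treated like "C:\foo".
        if (pos < dos.size() && dos[pos] != '\\' && dos[pos] != '/')
            out += '/';
    }
    // Rule 1: forward slashes are the directory separators.
    for (; pos < dos.size(); ++pos)
        out += (dos[pos] == '\\') ? '/' : dos[pos];
    return out;
}

// Hypothetical helper (not part of MLC++): "//C/mlc/bin" -> "C:\mlc\bin".
std::string to_dos_path(const std::string& mlc)
{
    std::string out;
    std::size_t pos = 0;
    // "//X" followed by '/' or end of string names drive X.
    if (mlc.size() >= 3 && mlc[0] == '/' && mlc[1] == '/' &&
        std::isalpha(static_cast<unsigned char>(mlc[2])) &&
        (mlc.size() == 3 || mlc[3] == '/')) {
        out += mlc[2];
        out += ':';
        pos = 3;
    }
    for (; pos < mlc.size(); ++pos)
        out += (mlc[pos] == '/') ? '\\' : mlc[pos];
    return out;
}

// Small usage example: converting one way and back gives the input again.
void check_round_trip()
{
    assert(to_mlc_path("C:\\mlc\\bin") == "//C/mlc/bin");
    assert(to_dos_path("//C/mlc/bin") == "C:\\mlc\\bin");
}
```

Relative pathnames carry no drive prefix, so only their separators
change.  Mounted volumes are not handled, matching rule 3 of the
Pathnames section.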
Specific Windows NT issues
--------------------------
Windows NT and SGI IRIX have some major differences.  Here are the
ways in which MLC++ gets around these:

1. Byte ordering.  Windows NT and SGI IRIX have reversed byte
   orderings.  We decided to use the "network byte ordering" as a
   standard in MLC++.  This ordering happens to be the same as the
   SGI ordering.  We provide the utility functions host_to_net() and
   net_to_host() for converting between the two.  These are templated
   typesafe functions which should work on all basic types (signed or
   unsigned).  We use them in the following places:

     binary files: our binary files are always stored in network byte
     order for compatibility between platforms.

     hashable structures: we use a bitwise hashing scheme in our hash
     tables.  Therefore, any data used as a hash key must be in
     network byte order.  Specifically, the integer values stored for
     nominal attribute values are ALWAYS stored in network byte
     order.

     c4.5 generated files: if you manage to port a version of c4.5 to
     NT, it will generate binary files in the wrong byte order, and
     MLC++ will be unable to read them.

2. End-of-line characters.  Windows NT uses a two-character cr-lf
   (\r\n) pair to signify the end of a line.  UNIX uses only a
   linefeed (\n).  MLC++ can correctly read input in EITHER format.
   However, text files written by MLC++ will always be in the format
   of the platform you're using.

3. Template instantiation.  The SGI compilers can instantiate
   templates at link time to speed compilation.  To allow this, we
   place symbolic links to all template body files in the $MLCDIR/inc
   directory.  The Makefile in the inc directory sets up the links.
   Under Windows NT, we make Visual C++ instantiate templates at
   compile time by including the template bodies at the bottom of
   their respective .h files.  The compile-time flag
   COMPILE_TIME_TEMPLATES may be defined to activate this behavior.
   We create ".ct" files in the inc directory which merely include
   the corresponding source files from the actual MLC++ tree.  These
   are included directly from the corresponding .h files.

4. File suffixes.  On UNIX, executables have no suffixes.  Object
   files have ".o".  MLC++ C++ files (as well as the few C files) use
   ".c".  On NT, we keep the ".c" suffix for source, but objects use
   a ".obj" suffix and executables use ".exe".  Our build utilities
   are aware of these differences and will perform accordingly.

Known bugs in Microsoft's compiler which affect this port
---------------------------------------------------------
Occasionally, you will get an INTERNAL COMPILER ERROR #C1001 when
compiling the sources.  This is a known Microsoft VC bug (article
#Q168957).  If you get this while compiling your own code, try
rearranging some header file includes so that the MLC++ files are all
included at once and after any standard system includes.

There are a few other bugs regarding initialization of pointers and
references; certain notations allowed in C++ will fail under MSVC.
There are usually obvious workarounds.

Finally, real-number results on the NT version may occasionally
differ from the IRIX version (this shows up as diffs from the
testers) because Intel chips are a different architecture.
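
To make the byte-ordering convention from item 1 of "Specific Windows
NT issues" concrete, here is a minimal sketch of how a templated
conversion to network (big-endian) byte order can be written.  It is
not the MLC++ source for host_to_net() and net_to_host(): the names
sketch_host_to_net, sketch_net_to_host, host_is_big_endian and
check_round_trip are hypothetical, and the sketch is restricted to
integer types, whereas the real functions are described as covering
all basic types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: true when the host already stores the most
// significant byte first (as SGI machines do, per the text above).
inline bool host_is_big_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first;
    std::memcpy(&first, &probe, 1);
    return first == 0x01;
}

// Hypothetical stand-in for host_to_net(): returns the value with its
// bytes arranged in network order.  Assumption: integers only.
template <class T>
T sketch_host_to_net(T value)
{
    static_assert(std::is_integral<T>::value, "integer types only");
    if (host_is_big_endian())
        return value;               // already in network order
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &value, sizeof(T));
    // Reverse the bytes in place; one-byte types are left unchanged.
    for (std::size_t i = 0, j = sizeof(T) - 1; i < j; ++i, --j) {
        const unsigned char tmp = bytes[i];
        bytes[i] = bytes[j];
        bytes[j] = tmp;
    }
    std::memcpy(&value, bytes, sizeof(T));
    return value;
}

// Reversing bytes is its own inverse, so the other direction is the
// same operation.
template <class T>
T sketch_net_to_host(T value)
{
    return sketch_host_to_net(value);
}

// Small usage example: after conversion, the first byte in memory is
// the most significant one on any host.
void check_round_trip()
{
    const std::uint32_t host = 0x11223344;
    const std::uint32_t net = sketch_host_to_net(host);
    unsigned char bytes[4];
    std::memcpy(bytes, &net, 4);
    assert(bytes[0] == 0x11 && bytes[3] == 0x44);
    assert(sketch_net_to_host(net) == host);
}
```

A value converted this way can be written to a binary file or used as
a hash key and will have the same byte layout on NT and on IRIX, which
is the property the convention above relies on.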