Installation (The RDKit documentation)
Installation
A number of installation recipes are presented below, with varying
degrees of complexity.
Cross-platform under anaconda python (fastest install)
Introduction to anaconda
Conda is an open-source, cross-platform, software package manager. It
supports the packaging and distribution of software components, and
manages their installation inside isolated execution environments. It
has several analogies with pip and virtualenv, but it is designed to be
more “python-agnostic” and more suitable for the distribution of binary
packages and their dependencies.
How to get conda
The easiest way to get Conda is to have it installed as part of the
Anaconda Python distribution. A possible (but a bit more complex to use)
alternative is provided by the smaller and more self-contained
Miniconda. The conda source code repository is available on GitHub, and
additional documentation is provided by the project website.
How to install RDKit with Conda
Creating a new conda environment with the RDKit installed using these
packages requires one single command similar to the following:
$ conda create -c https://conda.anaconda.org/rdkit -n my-rdkit-env rdkit
Finally, the new environment must be activated, so that the
corresponding python interpreter becomes available in the same shell:
$ source activate my-rdkit-env
If for some reason this does not work, try:
$ cd [anaconda folder]/bin
$ source activate my-rdkit-env
Windows users will use a slightly different command:
C:\> activate my-rdkit-env
How to build from source with Conda
For more details on building from source with Conda, see the
Installing and using PostgreSQL and the RDKit PostgreSQL cartridge from a conda environment
Due to the conda python distribution being a different version to the
system python, it is easiest to install PostgreSQL and the PostgreSQL
python client via conda.
With your environment activated, this is done simply by:
conda install -c https://conda.binstar.org/rdkit rdkit-postgresql
The PostgreSQL installed from the conda package needs to be initialized
by running the initdb command found in [conda folder]/envs/my-rdkit-env/bin:
[conda folder]/envs/my-rdkit-env/bin/initdb -D /folder/where/data/should/be/stored
PostgreSQL can then be run from the terminal with the command:
[conda folder]/envs/my-rdkit-env/bin/postgres -D /folder/where/data/should/be/stored
For most use cases you will instead need to run PostgreSQL as a daemon;
one way to do this is with supervisor. See the supervisor documentation
to find out more and for installation instructions. The required
configuration file will look something like this:
[program:postgresql]
command=[conda folder]/envs/my-rdkit-env/bin/postgres -D /folder/where/data/should/be/stored
user=[your username]
autorestart=true
Once PostgreSQL is up and running, all of the normal PostgreSQL commands
can then be run when your conda environment is activated. Therefore to
create a database you can run:
createdb my_rdkit_db
psql my_rdkit_db
If you are trying to use multiple installations of PostgreSQL in
different environments, you will need to set up different pid files, unix
sockets and ports by editing the PostgreSQL configuration files.
With the above configurations these files can be found in
/folder/where/data/should/be/stored.
Linux and OS X
Installation from repositories
Ubuntu 12.04 and later
Thanks to the efforts of the Debichem team, RDKit is available via the
Ubuntu repositories. To install:
sudo apt-get install python-rdkit librdkit1 rdkit-data
Fedora, CentOS, and RHEL
Gianluca Sforna creates binary RPMs that can be found here:
OS X
Eddie Cao has produced a homebrew formula that can be used to easily
build the RDKit.
Building from Source
Prerequisites
Installing prerequisites as packages
Ubuntu and other debian-derived systems
Install the following packages using apt-get:
build-essential python-numpy cmake python-dev sqlite3 libsqlite3-dev libboost-dev libboost-system-dev libboost-thread-dev libboost-serialization-dev libboost-python-dev libboost-regex-dev
Fedora, CentOS (5.7+), and RHEL
Install the following packages using yum:
cmake tk-devel readline-devel zlib-devel bzip2-devel sqlite-devel @development-tools
Packages to install from source (not required on RHEL/CentOS 6.x):
python 2.7 : use
./configure CFLAGS=-fPIC --enable-unicode=ucs4 --enable-shared
numpy : do export LD_LIBRARY_PATH="/usr/local/lib" before
python setup.py install
boost 1.48.0 or later: do
./bootstrap.sh --with-libraries=python, ./b2; ./b2 install
Older versions of CentOS
Here things are more difficult. Check this wiki page for information:
Installing prerequisites from source
Required packages:
cmake. You need version 2.6 (or more recent). Build it from source if
your linux distribution doesn’t have an appropriate package.
Note: It seems that v2.8 is a better bet than v2.6. It might be worth compiling your own copy of v2.8 even if v2.6 is already installed.
The following are required if you are planning on using the Python
wrappers:
The python headers. This probably means that you need to install
the python-dev package (or whatever it’s called) for your linux
distribution.
sqlite3. You also need the shared libraries. This may require that
you install a sqlite3-dev package.
numpy. You need to have numpy installed.
Note: when building with XCode 4 on OS X there seems to be a problem
with the version of numpy that comes with XCode 4. Please see the
Frequently Encountered Problems section below for a workaround.
Installing Boost
If your linux distribution has a boost-devel package including the
python, regex, threading, and serialization libraries, you can use that
and save yourself the steps below.
Note: if you do have a version of the boost libraries pre-installed and
you want to use your own version, be careful when you build the
code. We’ve seen at least one example on a Fedora system where cmake
compiled using a user-installed version of boost and then linked
against the system version. This led to segmentation faults. There
is a workaround for this in the Frequently Encountered Problems section
below.
Download the boost source distribution from the boost site.
Extract the source somewhere on your machine (e.g.
/usr/local/src/boost_1_58_0).
Build the required boost libraries. The boost site has instructions
for this, but here’s an overview:
If you want to use the python wrappers:
./bootstrap.sh --with-libraries=python,regex,thread,serialization
If not using the python wrappers:
./bootstrap.sh --with-libraries=regex,thread,serialization
./b2 install
If you have any problems with this step, check the boost
installation instructions.
Building the RDKit
Fetch the source, here as tar.gz but you could use git as well:
wget https://github.com/rdkit/rdkit/archive/Release_XXXX_XX_X.tar.gz
Ensure that the prerequisites are installed
Set the following environment variables:
RDBASE: the root directory of the RDKit distribution (e.g.
Linux: LD_LIBRARY_PATH: make sure it includes $RDBASE/lib and
wherever the boost shared libraries were installed
OS X: DYLD_LIBRARY_PATH: make sure it includes $RDBASE/lib and
wherever the boost shared libraries were installed
The following are required if you are planning on using the Python
wrappers:
PYTHONPATH: make sure it includes $RDBASE
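As a rough illustration of the checklist above, the following Python sketch reports which of these variables look wrong before a build is started. It is not part of the RDKit: the function name check_build_environment and the /opt/rdkit location are hypothetical, chosen only for this example, and the check for the boost library directory is left out because its location varies.

```python
import os


def check_build_environment(environ=None, want_python_wrappers=True):
    """Return a list of problems found; an empty list means all looks fine."""
    env = os.environ if environ is None else environ
    rdbase = env.get("RDBASE")
    if not rdbase:
        return ["RDBASE is not set"]

    problems = []
    # Linux uses LD_LIBRARY_PATH and OS X uses DYLD_LIBRARY_PATH; accept either.
    lib_dir = os.path.join(rdbase, "lib")
    lib_paths = []
    for name in ("LD_LIBRARY_PATH", "DYLD_LIBRARY_PATH"):
        lib_paths.extend(p for p in env.get(name, "").split(os.pathsep) if p)
    if lib_dir not in lib_paths:
        problems.append("%s is not on the library path" % lib_dir)

    # PYTHONPATH only matters when the Python wrappers are wanted.
    if want_python_wrappers:
        if rdbase not in env.get("PYTHONPATH", "").split(os.pathsep):
            problems.append("%s is not on PYTHONPATH" % rdbase)
    return problems


# Hypothetical settings, used only to exercise the function.
example = {
    "RDBASE": "/opt/rdkit",
    "LD_LIBRARY_PATH": os.pathsep.join(
        [os.path.join("/opt/rdkit", "lib"), "/usr/local/lib"]
    ),
    "PYTHONPATH": "/opt/rdkit",
}
print(check_build_environment(example))
```

Calling check_build_environment() with no arguments inspects the variables of the current shell instead of the example dictionary.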
cd to $RDBASE
mkdir build
cd build
cmake .. : See the section below on configuring the build if you
need to specify a non-default version of python or if you have boost
in a non-standard location
make : this builds all libraries, regression tests, and wrappers
(by default).
make install
See below for a list of FAQ and solutions.
Testing the build (optional, but recommended)
cd to $RDBASE/build and do ctest
you’re done!
Specifying an alternate Boost installation
You need to tell cmake where to find the boost libraries and header
files. If you have put boost in /opt/local, the cmake invocation would look
like this:
cmake -DBOOST_ROOT=/opt/local ..
Note that if you are using your own boost install on a system with a
system install, it’s normally a good idea to also include the argument
-D Boost_NO_SYSTEM_PATHS=ON in your cmake command.
Specifying an alternate Python installation
If you aren’t using the default python installation for your computer,
you need to tell cmake where to find the python library it should link
against and the python header files.
Here’s a sample command line:
cmake -D PYTHON_LIBRARY=/usr/lib/python2.7/config/libpython2.7.a -D PYTHON_INCLUDE_DIR=/usr/include/python2.7/ -D PYTHON_EXECUTABLE=/usr/bin/python ..
The PYTHON_EXECUTABLE part is optional if the correct python is the
first version in your PATH.
Disabling the Python wrappers
You can completely disable building of the python wrappers:
cmake -DRDK_BUILD_PYTHON_WRAPPERS=OFF ..
Recommended extras
You can enable support for generating InChI strings and InChI keys by
adding the argument -DRDK_BUILD_INCHI_SUPPORT=ON to your cmake
command line.
You can enable support for the Avalon toolkit by adding the argument
-DRDK_BUILD_AVALON_SUPPORT=ON to your cmake command line.
If you’d like to be able to generate high-quality PNGs for structure
depiction, cairo (for use with Python2) or cairocffi (for use with
Python3) and their respective Python bindings are recommended.
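The cmake options described in the preceding sections can be combined in a single invocation. The Python sketch below assembles such a command line from the options named in this document; the helper cmake_command is hypothetical and written only for illustration, and it prints the command rather than running it.

```python
def cmake_command(boost_root=None, python_wrappers=True, inchi=False, avalon=False):
    """Build the cmake argument list for an out-of-source build in ./build."""
    args = ["cmake"]
    if boost_root:
        # Point cmake at a specific boost and stop it falling back to the
        # system copy, as recommended when both are installed.
        args.append("-DBOOST_ROOT=" + boost_root)
        args.append("-DBoost_NO_SYSTEM_PATHS=ON")
    if not python_wrappers:
        args.append("-DRDK_BUILD_PYTHON_WRAPPERS=OFF")
    if inchi:
        args.append("-DRDK_BUILD_INCHI_SUPPORT=ON")
    if avalon:
        args.append("-DRDK_BUILD_AVALON_SUPPORT=ON")
    args.append("..")  # the source tree is the parent of the build directory
    return args


print(" ".join(cmake_command(boost_root="/opt/local", inchi=True)))
```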
Building the Java wrappers
Additional Requirements
SWIG v2.0.x:
When you invoke cmake add -D RDK_BUILD_SWIG_WRAPPERS=ON to the
arguments. For example: cmake -D RDK_BUILD_SWIG_WRAPPERS=ON ..
Build and install normally using make. The directory
$RDBASE/Code/JavaWrappers/gmwrapper will contain the three
required files: libGraphMolWrap.so (libGraphMolWrap.jnilib on OS X),
org.RDKit.jar, and org.RDKitDoc.jar.
Using the wrappers
To use the wrappers, the three files need to be in the same directory,
and that should be on your CLASSPATH and in the java.library.path. An
example using jython:
% CLASSPATH=$CLASSPATH:$RDBASE/Code/JavaWrappers/gmwrapper/org.RDKit.jar jython -Djava.library.path=$RDBASE/Code/JavaWrappers/gmwrapper
Jython 2.2.1 on java1.6.0_20
Type "copyright", "credits" or "license" for more information.
>>> from org.RDKit import *
>>> from java import lang
>>> lang.System.loadLibrary('GraphMolWrap')
>>> m = RWMol.MolFromSmiles('c1ccccc1')
>>> m.getNumAtoms()
Optional packages
If you would like to install the RDKit InChI support, follow the
instructions in $RDBASE/External/INCHI-API/README.
If you would like to install the RDKit Avalon toolkit support, follow
the instructions in $RDBASE/External/AvalonTool/README.
If you would like to build and install the PostgreSQL cartridge,
follow the instructions in $RDBASE/Code/PgSQL/rdkit/README.
Frequently Encountered Problems
In each case I’ve replaced specific pieces of the path with ....
Problem:
Linking CXX shared library libSLNParse.so
/usr/bin/ld: .../libboost_regex.a(cpp_regex_traits.o): relocation R_X86_64_32S against `std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Rep::_S_empty_rep_storage' can not be used when making a shared object; recompile with -fPIC
.../libboost_regex.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make[2]: *** [Code/GraphMol/SLNParse/libSLNParse.so] Error 1
make[1]: *** [Code/GraphMol/SLNParse/CMakeFiles/SLNParse.dir/all] Error 2
make: *** [all] Error 2
Solution: Add this to the arguments when you call cmake:
-DBoost_USE_STATIC_LIBS=OFF
More information here:
Problem:
.../Code/GraphMol/Wrap/EditableMol.cpp:114: instantiated from here
.../boost/type_traits/detail/cv_traits_impl.hpp:37: internal compiler error: in make_rtl_for_nonlocal_decl, at cp/decl.c:5067
Please submit a full bug report, with preprocessed source if appropriate.
See <URL:http://bugzilla.redhat.com/bugzilla> for instructions.
Preprocessed source stored into /tmp/ccgSaXge.out file, please attach this to your bugreport.
make[2]: *** [Code/GraphMol/Wrap/CMakeFiles/rdchem.dir/EditableMol.cpp.o] Error 1
make[1]: *** [Code/GraphMol/Wrap/CMakeFiles/rdchem.dir/all] Error 2
make: *** [all] Error 2
Solution: Add #define BOOST_PYTHON_NO_PY_SIGNATURES at the top of
Code/GraphMol/Wrap/EditableMol.cpp
More information here:
Problem: Your system has a version of boost installed in /usr/lib, but you would
like to force the RDKit to use a more recent one.
Solution: This can be solved by using cmake version 2.8.3 (or more recent) and
providing the -D Boost_NO_SYSTEM_PATHS=ON argument:
cmake -D BOOST_ROOT=/usr/local -D Boost_NO_SYSTEM_PATHS=ON ..
Building on OS X with XCode 4
The problem seems to be caused by the version of numpy that is
distributed with XCode 4, so you need to build a fresh copy.
Solution: Get a copy of numpy and build it like this as root:
export MACOSX_DEPLOYMENT_TARGET=10.6
export LDFLAGS="-Wall -undefined dynamic_lookup -bundle -arch x86_64"
export CFLAGS="-arch x86_64"
ln -s /usr/bin/gcc /usr/bin/gcc-4.2
ln -s /usr/bin/g++ /usr/bin/g++-4.2
python setup.py build
python setup.py install
Be sure that the new numpy is used in the build:
PYTHON_NUMPY_INCLUDE_PATH /Library/Python/2.6/site-packages/numpy/core/include
and is at the beginning of the PYTHONPATH:
export PYTHONPATH="/Library/Python/2.6/site-packages:$PYTHONPATH"
Now it’s safe to build boost and the RDKit.
Windows
Prerequisites
Python 2.7 or 3.4+
numpy (for example, via pip install numpy).
Binaries for win64 are available here:
Pillow (for example, via pip install Pillow)
Recommended extras
aggdraw: a library for high-quality drawing in Python. Instructions
for downloading are here:
The new (as of May 2008) drawing code has been tested with v1.2a3 of
aggdraw. Despite the alpha label, the code is stable and functional.
matplotlib: a library for scientific plotting from Python.
ipython : a very useful interactive shell (and much more) for Python.
win32all: Windows extensions for Python.
Installation of RDKit binaries
Get the appropriate windows binary build from:
Extract the zip file somewhere without a space in the name, i.e.
The rest of this will assume that the installation is in
Set the following environment variables:
RDBASE: C:\RDKit_
PYTHONPATH: %RDBASE% if there is already a PYTHONPATH, put
;%RDBASE% at the end.
PATH: add ;%RDBASE%\lib to the end
In Win7 systems, you may run into trouble due to missing DLLs, see one
thread from the mailing list:
You can download the missing DLLs from here:
Installation from source
Extra software to install
Microsoft Visual C++ : The Community version has everything necessary
and can be downloaded for free
This is a big installation and will take a while. The RDKit has been
successfully built with all versions of Visual C++ since 6.0, so the
current version of VC++ (2015 as of this writing) should be fine.
cmake : should be installed.
boost : It is strongly recommended to download and use a precompiled
version of the boost libraries. When you run the installer, the only
binary libraries you need are python, regex, and system. If you want to
install boost from source, download a copy and follow the instructions
in the “Getting Started” section of the documentation. Make sure the
libraries and headers are installed to C:\boost
a git client : This is only necessary if you are planning on
building development versions of the RDKit. It can be downloaded
separately; git is also included as an
optional add-on of Microsoft Visual Studio 2015.
Setup and Preparation
This section assumes that python is installed in C:\Python27, that
the boost libraries have been installed to C:\boost, and that you
will build the RDKit from a directory named C:\RDKit. If any of
these conditions is not true, just change the corresponding paths.
If you install things in paths that have spaces in their names, be
sure to use quotes properly in your environment variable definitions.
If you are planning on using a development version of the RDKit: get
a copy of the current RDKit source using git. If you’re using the
command-line client the command is:
git clone https://github.com/rdkit/rdkit.git C:\RDKit
If you are planning on using a released version of the RDKit: get a
copy of the most recent release and extract it into the directory
C:\RDKit
Set the required environment variables:
RDBASE = C:\RDKit
Make sure C:\Python27 is in your PATH
Make sure C:\RDKit\lib is in your PATH
Make sure C:\boost\lib is in your PATH.
Make sure C:\RDKit is in your PYTHONPATH
Building from the command line (recommended)
Create a directory C:\RDKit\build and cd into it
Run cmake. Here’s an example basic command line for 64bit windows
that will download the InChI and Avalon toolkit sources from the
InChI Trust and SourceForge repositories, respectively, and build the
PostgreSQL cartridge for the installed version of PostgreSQL:
cmake -DRDK_BUILD_PYTHON_WRAPPERS=ON -DBOOST_ROOT=C:/boost -DRDK_BUILD_INCHI_SUPPORT=ON -DRDK_BUILD_AVALON_SUPPORT=ON -DRDK_BUILD_PGSQL=ON -DPostgreSQL_ROOT="C:\Program Files\PostgreSQL\9.5" -G"Visual Studio 14 2015 Win64" ..
Build the code. Here’s an example command line:
C:/Windows/Microsoft.NET/Framework64/v4.0.30319/MSBuild.exe /m:4 /p:Configuration=Release INSTALL.vcxproj
If you have built in PostgreSQL support, you will need to open a
shell with administrator privileges, stop the PostgreSQL service, run
the pgsql_install.bat installation script, then restart the
PostgreSQL service (please refer to
%RDBASE%\Code\PgSQL\rdkit\README for further details):
"C:\Program Files\PostgreSQL\9.5\bin\pg_ctl.exe" -N "postgresql-9.5" -D "C:\Program Files\PostgreSQL\9.5\data" -w stop
C:\RDKit\build\Code\PgSQL\rdkit\pgsql_install.bat
"C:\Program Files\PostgreSQL\9.5\bin\pg_ctl.exe" -N "postgresql-9.5" -D "C:\Program Files\PostgreSQL\9.5\data" -w start
Before restarting the PostgreSQL service, make sure that the Boost
libraries the RDKit was built against are in the system PATH, or
PostgreSQL will fail to create the rdkit extension with a
deceptive error message such as:
ERROR: could not load library "C:/Program Files/PostgreSQL/9.5/lib/rdkit.dll": The specified module could not be found.
Testing the Build (optional, but recommended)
cd to C:\RDKit\build and run ctest. Please note that if you have
built in PostgreSQL support, the current logged in user needs to be a
PostgreSQL user with database creation and superuser privileges, or
the PostgreSQL test will fail. A convenient option to authenticate
will be to set the PGPASSWORD environment variable to the
PostgreSQL password of the current logged in user in the shell from
which you are running ctest.
You’re done!
This document is copyright (C)
by Greg Landrum
This work is licensed under the Creative Commons Attribution-ShareAlike
3.0 License. To view a copy of this license, visit
or send a letter to
Creative Commons, 543 Howard Street, 5th Floor, San Francisco,
California, 94105, USA.
The intent of this license is similar to that of the RDKit itself. In
simple words: “Do whatever you want with it, but please give us some
credit.”

Regular expressions
So far we have been reading through files, looking for patterns and extracting various bits of lines that we find interesting. We have been using string methods like split and find and using lists and string slicing to extract portions of the lines.
This task of searching and extracting is so common that Python has a very powerful library called regular expressions that handles many of these tasks quite elegantly. The reason we have not introduced regular expressions earlier in the book is because while they are very powerful, they are a little complicated and their syntax takes some getting used to.
Regular expressions are almost their own little programming language for searching and parsing strings. As a matter of fact, entire books have been written on the topic of regular expressions. In this chapter, we will only cover the basics of regular expressions. For more detail on regular expressions, see:
http://en.wikipedia.org/wiki/Regular_expression
http://docs.python.org/library/re.html
The regular expression library must be imported into your program before you can use it. The simplest use of the regular expression library is the search() function. The following program demonstrates a trivial use of the search function.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
We open the file, loop through each line and use the regular expression search() to only print out lines that contain the string "From:". This program does not use the real power of regular expressions since we could have just as easily used line.find() to accomplish the same result.
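To make that equivalence concrete, here is a minimal sketch that filters a few lines both ways and compares the results. The sample lines are made up for illustration and do not come from mbox-short.txt.

```python
import re

# Hypothetical header lines, used instead of reading a file.
lines = [
    'From: alice@example.com',
    'Subject: weekly report',
    'X-From: bob@example.com',
]

# re.search() returns a match object (truthy) or None.
with_search = [ln for ln in lines if re.search('From:', ln)]

# str.find() returns the index of the substring, or -1 when it is absent.
with_find = [ln for ln in lines if ln.find('From:') != -1]

print(with_search)
print(with_search == with_find)
```

Both approaches keep the first and third lines, because "From:" appears somewhere in each of them.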
The power of regular expressions comes when we add special characters to the search string that allow us to control more precisely which lines match the string. Adding these special characters to our regular expression allows us to do sophisticated matching and extraction while writing very little code.
For example, the caret character is used in regular
expressions to match "the beginning" of a line.
We could change our application to only match
lines where "From:" was at the beginning of the line as follows:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        print(line)
Now we will only match lines that start with the string "From:". This is still a very simple example that we could have done equivalently with the startswith() method from the string library. But it serves to introduce the notion that regular expressions contain special action characters that give us more control as to what will match the regular expression.
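As a small sketch of the difference the caret makes, the following compares an anchored search, an unanchored search, and startswith() on a few made-up lines (the addresses are placeholders, not data from the mail file).

```python
import re

# Hypothetical lines; only the first one begins with "From:".
lines = [
    'From: alice@example.com',
    'X-From: bob@example.com',
    'Reply-To: From: carol@example.com',
]

unanchored = [ln for ln in lines if re.search('From:', ln)]
anchored = [ln for ln in lines if re.search('^From:', ln)]
prefixed = [ln for ln in lines if ln.startswith('From:')]

print(len(unanchored))       # every line contains "From:" somewhere
print(anchored)              # only the line that starts with it
print(anchored == prefixed)  # same result as startswith()
```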
Character matching in regular expressions
There are a number of other special characters that let us build even more powerful regular expressions. The most commonly used special character is the period character which matches any character.
In the following example, the regular expression "F..m:" would match any of the strings "From:", "Fxxm:", "F12m:", or "F!@m:" since the period characters in the regular expression match any character.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        print(line)
This is particularly powerful when combined with the ability to indicate that a character can be repeated any number of times using the "*" or "+" characters in your regular expression. These special characters mean that instead of matching a single character in the search string they match zero-or-more in the case of the asterisk or one-or-more of the characters in the case of the plus sign.
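The difference between a fixed number of wildcards, zero-or-more, and one-or-more can be seen in a short sketch. The candidate strings below are invented for the example.

```python
import re

candidates = ['From:', 'Fxxm:', 'F12m:', 'Fm:', 'Form:']

# Exactly two characters between "F" and "m:".
two_wild = [s for s in candidates if re.search('^F..m:', s)]

# Zero or more characters between "F" and "m:", so "Fm:" also matches.
zero_or_more = [s for s in candidates if re.search('^F.*m:', s)]

# One or more characters between "F" and "m:", so "Fm:" is rejected.
one_or_more = [s for s in candidates if re.search('^F.+m:', s)]

print(two_wild)
print(zero_or_more)
print(one_or_more)
```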
We can further narrow down the lines that we match using a repeated wild card character in the following example:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)
The search string "^From:.+@" will successfully match lines that start with "From:" followed by one or more characters ".+" followed by an at-sign. So this will match the following line:
From: stephen.marquard@uct.ac.za
You can think of the ".+" wildcard as expanding to match all the characters between the
colon character and the at-sign.
It is good to think of the plus and asterisk characters as "pushy". For example, in the following string the ".+" pushes outwards, so the match extends all the way to the last at-sign:
From: stephen.marquard@uct.ac.za, csev@umich.edu, and cwen @iupui.edu
It is possible to tell an asterisk or plus-sign not to be so "greedy" by adding
another character. See the detailed documentation for information on turning off the
greedy behavior.
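As a brief sketch of the two behaviors: in Python's re module, following "+" or "*" with a question mark makes it non-greedy, so it stops at the first place the rest of the expression can match. The line below uses placeholder addresses chosen for this example.

```python
import re

line = 'From: alice@example.com, bob@example.com, and carol@example.com'

# Greedy: ".+" pushes outwards to the last at-sign on the line.
greedy = re.search('^From:.+@', line).group()

# Non-greedy: ".+?" stops at the first at-sign it can reach.
lazy = re.search('^From:.+?@', line).group()

print(greedy)
print(lazy)
```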
Extracting data using regular expressions
If we want to extract data from a string in Python we can use the findall() method to extract all of the substrings which match a regular expression. Let's use the example of wanting to extract anything that looks like an e-mail address from any line regardless of format. For example, we want to pull the e-mail addresses from each of the following lines:
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: (from apache@localhost)
We don't want to write code for each of the types of lines, splitting and slicing differently for each line. This following program uses findall() to find the lines with e-mail addresses in them and extract one or more addresses from each of those lines.
import re
# A raw string (r'...') keeps the backslashes in the pattern intact
s = 'Hello from csev@umich.edu to cwen@iupui.edu about the meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)
The findall() method searches the string in the second argument and returns a list of all of the strings that look like e-mail addresses. We are using a two-character sequence
that matches a non-whitespace character (\S).
The output of the program would be:
['csev@umich.edu', 'cwen@iupui.edu']
Translating the regular expression, we are looking for substrings that have at least one non-whitespace character, followed by an at-sign, followed by at least one more non-whitespace character. Also, the "\S+" matches as many non-whitespace characters as possible (this is called "greedy" matching in regular expressions).
The regular expression would match twice (csev@umich.edu and cwen@iupui.edu) but it would not match the string "@2PM" because there are no non-blank characters before the at-sign.
We can use this regular expression in a program to read all the lines in a file and print out anything that looks like an e-mail address as follows:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)
We read each line and then extract all the substrings that match our regular expression. Since findall() returns a list, we simply check if the number of elements in our returned list is more than zero to print only lines where we found at least one substring that looks like an e-mail address.
If we run the program on mbox.txt we get the following output:
Some of our e-mail addresses have incorrect characters like "<" or ";" at the beginning or end. Let's declare that we are only interested in the portion of the string that starts and ends with a letter or a number.
To do this, we use another feature of regular expressions. Square brackets are used to indicate a set of multiple acceptable characters we are willing to consider matching. In a sense, the "\S" is asking to match the set of "non-whitespace characters". Now we will be a little more explicit in terms of the characters we will match.
Here is our new regular expression:
[a-zA-Z0-9]\S*@\S*[a-zA-Z]
This is getting a little complicated and you can begin to see why regular expressions are their own little language unto themselves. Translating this regular expression, we are looking for substrings that start with a single lowercase letter, upper case letter, or number "[a-zA-Z0-9]" followed by zero or more non blank characters "\S*", followed by an at-sign, followed by zero or more non-blank characters "\S*" followed by an upper or lower case letter. Note that we switched from "+" to "*" to indicate zero-or-more non-blank characters since "[a-zA-Z0-9]" is already one non-blank character. Remember that the "*" or "+" applies to the single character immediately to the left of the plus or asterisk.
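To see what the character sets buy us, here is a small sketch that runs the earlier expression and the new one over the same text. The text is a made-up fragment in the style of a mail header; the addresses are placeholders.

```python
import re

# Hypothetical header fragment with stray punctuation around the addresses.
text = 'for <source@lists.example.org>; (from apache@localhost)'

# "\S+" happily swallows the surrounding "<", ">;" and ")".
loose = re.findall(r'\S+@\S+', text)

# The character sets force the match to start with a letter or digit
# and to end with a letter, trimming the punctuation.
strict = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', text)

print(loose)
print(strict)
```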
If we use this expression in our program, our data is much cleaner:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
Notice that on the "source@collab.sakaiproject.org" lines, our regular expression eliminated two characters at the end of the string (">;"). This is because when we append "[a-zA-Z]" to the end of our regular expression, we are demanding that whatever string the regular expression parser finds, it must end with a letter. So when it sees the ">" after "sakaiproject.org>;" it simply stops at the last "matching" letter it found (i.e. the "g" was the last good match).
Also note that the output of the program is a Python list that has a string as the single element in the list.
Combining searching and extracting
If we want to find numbers on lines that start with the string "X-" such as:
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
We don't just want any floating point numbers from any lines. We only want to extract numbers from lines that have the above syntax.
We can construct the following regular expression to select the lines:
^X-.*: [0-9.]+
Translating this, we are saying, we want lines that start with "X-" followed by zero or more characters ".*" followed by a colon (":") and then a space. After the space we are looking for one or more characters that are either a digit (0-9) or a period "[0-9.]+". Note that in between the square braces, the period matches an actual period (i.e. it is not a wildcard between the square brackets).
This is a very tight expression that will pretty much match only the lines we are interested in as follows:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X\S*: [0-9.]+', line):
        print(line)
When we run the program, we see the data nicely filtered to show
only the lines we are looking for.
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
But now we have to solve the problem of extracting the numbers using split. While it would be simple enough to use split, we can use another feature of regular expressions to both search and parse the line at the same time.
Parentheses are another special character in regular expressions. When you add parentheses to a regular expression they are ignored when matching the string, but when you are using findall(), parentheses indicate that while you want the whole expression to match, you only are interested in extracting a portion of the substring that matches the regular expression.
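A minimal sketch of the effect, using one of the lines shown above: without parentheses findall() returns the whole matched text, and with parentheses it returns only the part inside them.

```python
import re

line = 'X-DSPAM-Confidence: 0.8475'

# No parentheses: the full matching substring comes back.
whole = re.findall(r'^X\S*: [0-9.]+', line)

# Parentheses around the number: only that portion comes back.
part = re.findall(r'^X\S*: ([0-9.]+)', line)

print(whole)
print(part)
```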
So we make the following change to our program:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)
Instead of calling search(), we add parentheses around the part of the regular expression that represents the floating point number to indicate we only want findall() to give us back the floating point number portion of the matching string.
The output from this program is as follows:
['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
The numbers are still in a list and need to be converted from strings to floating point but we have used the power of regular expressions to both search and extract the information we found interesting.
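The conversion step can be sketched as follows. To keep the example self-contained it works on a short list of lines rather than the file; the first three are taken from the sample above and the last is a made-up non-matching line.

```python
import re

lines = [
    'X-DSPAM-Confidence: 0.8475',
    'X-DSPAM-Probability: 0.0000',
    'X-DSPAM-Confidence: 0.6178',
    'Subject: no number here',
]

values = []
for line in lines:
    # findall() returns a list of strings; convert each one to a float.
    for text in re.findall(r'^X\S*: ([0-9.]+)', line):
        values.append(float(text))

print(values)
print(max(values))
```

Once the values are floats they can be summed, averaged, or compared like any other numbers.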
As another example of this technique, if
you look at the file there are a number of lines of the form:
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772
If we wanted to extract all of the revision numbers (the integer number at the end of these lines) using the same technique as above, we could write the following program:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^Details:.*rev=([0-9]+)', line)
    if len(x) > 0:
        print(x)
Translating our regular expression, we are looking for lines that start with "Details:", followed by any number of characters ".*" followed by "rev=" and then by one or more digits. We want lines that match the entire expression but we only want to extract the integer number at the end of the line so we surround "[0-9]+" with parentheses.
When we run the program, we get the following output:
Remember that the "[0-9]+" is "greedy" and it tries to make as large a string of digits as possible before extracting those digits. This "greedy" behavior is why we get all five digits for each number. The regular expression library expands in both directions until it encounters a non-digit, the beginning, or the end of a line.
Now we can use regular expressions to re-do an exercise from earlier in the book where we were interested in the time of day of each mail message. We looked for lines of the form:
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
And wanted to extract the hour of the day for each line. Previously we did this with two calls to split. First the line was split into words and then we pulled out the fifth word and split it again on the colon character to pull out the two characters we were interested in.
While this worked, it actually results in pretty brittle code that assumes the lines are nicely formatted. If you were to add enough error checking (or a big try/except block) to ensure that your program never failed when presented with incorrectly formatted lines, the code would balloon to 10-15 lines that were pretty hard to read.
We can do this far more simply with the following regular expression:
^From .* [0-9][0-9]:
The translation of this regular expression is that we are looking for lines that start with "From " (note the space) followed by any number of characters ".*" followed by a space followed by two digits "[0-9][0-9]" followed by a colon character. This is the definition of the kinds of lines we are looking for.
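To check that reading of the expression, here is a minimal sketch that applies it to two lines. The sender address is a placeholder chosen for the example; note the two spaces before the day of the month, as in the mail file format.

```python
import re

pattern = r'^From .* [0-9][0-9]:'

header = 'From someone@example.com Sat Jan  5 09:14:16 2008'
other = 'From: someone@example.com'

# The match runs from "From " through the first colon of the time.
match = re.search(pattern, header)
print(match.group())

# "From:" has a colon where the pattern requires a space, so no match.
print(re.search(pattern, other))
```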
In order to pull out only the hour using findall(), we add parentheses around the two digits as follows:
^From .* ([0-9][0-9]):
This results in the following program:
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        print(x)
When the program runs, it produces the following output:
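The exact listing depends on the contents of mbox-short.txt. For readers without the file at hand, here is a small sketch that applies the same pattern to a few in-memory lines and contrasts it with the split-based approach. The sample lines, the addresses in them, and the helper name hour_with_split are all made up for illustration.

```python
import re

# Made-up sample lines; only the first has the full "From " layout.
lines = [
    'From alice@example.com Sat Jan  5 09:14:16 2008',
    'From: alice@example.com',
    'From bob@example.com',
]

# Regular expression version: lines that do not match are simply skipped.
hours = []
for line in lines:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0:
        hours.append(x[0])
print(hours)  # ['09']


def hour_with_split(line):
    """Hypothetical split-based version: brittle if the line is short."""
    words = line.split()
    return words[5].split(':')[0]


print(hour_with_split(lines[0]))  # 09

# The split-based version raises an error on a badly formatted line.
try:
    hour_with_split(lines[2])
except IndexError:
    print('split-based version failed on: ' + lines[2])
```

The regular expression version needs no explicit error handling because a line that does not fit the pattern just produces an empty list.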
Escape character
Since we use special characters in regular expressions to match the beginning or end of
a line or specify wild cards, we need a way to indicate that these characters are "normal"
and we want to match the actual character such as a dollar-sign or caret.
We can indicate that we want to simply match a character by prefixing that character
with a backslash. For example, we can find money amounts with the following regular
expression.
import re
x = 'We just received $10.00 for cookies.'
y = re.findall('\$[0-9.]+',x)
Since we prefix the dollar-sign with a backslash, it actually matches the dollar-sign
in the input string instead of matching the "end of line" and the rest of the regular
expression matches one or more digits or the period character. Note: In between
square brackets, characters are not "special". So when we say "[0-9.]", it really
means digits or a period. Outside of square brackets, a period is the "wild-card"
character and matches any character. In between square brackets, the period is a period.
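The effect of the backslash and of the square brackets can be checked with a short sketch. The r'' raw-string prefix is a choice made here so that Python does not try to interpret the backslash itself; it is not part of the original example.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: "\$" matches a literal dollar sign.
escaped = re.findall(r'\$[0-9.]+', x)
print(escaped)         # ['$10.00']

# Unescaped: "$" means end of line, so nothing can follow it and nothing matches.
unescaped = re.findall(r'$[0-9.]+', x)
print(unescaped)       # []

# Inside square brackets the period is a literal period.
literal_period = re.findall(r'[.]', 'a.b')
print(literal_period)  # ['.']

# Outside square brackets the period is a wild card and matches any character.
wild_card = re.findall(r'.', 'a.b')
print(wild_card)       # ['a', '.', 'b']
```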
While this only scratched the surface of regular expressions, we have learned a bit about the language of regular expressions. They are search strings that have special characters in them that communicate your wishes to the regular expression system as to what defines "matching" and what is extracted from the matched strings. Here are some of those special characters and character sequences:
^
Matches the beginning of the line.
$
Matches the end of the line.
.
Matches any character (a wildcard).
\s
Matches a whitespace character.
\S
Matches a non-whitespace character (opposite of \s).
*
Applies to the immediately preceding character and indicates to match zero or more of the preceding character.
*?
Applies to the immediately preceding character and indicates to match zero or more of the preceding character in "non-greedy mode".
+
Applies to the immediately preceding character and indicates to match one or more of the preceding character.
+?
Applies to the immediately preceding character and indicates to match one or more of the preceding character in "non-greedy mode".
[aeiou]
Matches a single character as long as that character is in the specified set. In this example, it would match "a", "e", "i", "o" or "u" but no other characters.
[a-z0-9]
You can specify ranges of characters using the minus sign. This example is a single character that must be a lower case letter or a digit.
[^A-Za-z]
When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an upper or lower case letter.
( )
When parentheses are added to a regular expression, they are ignored for the purpose of matching, but allow you to extract a particular subset of the matched string rather than the whole string when using findall().
\b
Matches the empty string, but only at the start or end of a word.
\B
Matches the empty string, but not at the start or end of a word.
\d
Matches any decimal digit; equivalent to the set [0-9].
\D
Matches any non-digit character; equivalent to the set [^0-9].
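A few of these special characters can be tried side by side in a short sketch. The sample strings are made up for illustration, and the r'' raw-string prefix is used here only to keep Python from interpreting the backslashes.

```python
import re

# Made-up sample text with two colons in it.
text = 'From: alice@example.com: 2008'

# "+" is greedy: it expands to the last colon on the line.
greedy = re.findall('^F.+:', text)
print(greedy)       # ['From: alice@example.com:']

# "+?" is non-greedy: it stops at the first colon.
non_greedy = re.findall('^F.+?:', text)
print(non_greedy)   # ['From:']

# "\d" is equivalent to the set [0-9].
digits = re.findall(r'\d+', text)
print(digits)       # ['2008']

# "\b" matches only at the start or end of a word.
whole_words = re.findall(r'\bcat\b', 'cat concat cats')
print(whole_words)  # ['cat']

# A set, and an inverted set.
vowels = re.findall('[aeiou]', 'regex')
print(vowels)       # ['e', 'e']
non_letters = re.findall('[^A-Za-z]', 'a1 b')
print(non_letters)  # ['1', ' ']
```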
Bonus section for Unix users
Support for searching files using regular expressions has been part of the Unix operating system
since its early days, and it is available in nearly all programming languages in one form or another.
As a matter of fact, there is a command-line program built into Unix
called grep (the name comes from the ed editor command g/re/p, "globally search for a regular expression and print") that does pretty much
the same as the search() examples in this chapter. So if you have a
Macintosh or Linux system, you can try the following command in your command line window.
$ grep '^From:' mbox-short.txt
This tells grep to show you lines that start with the string "From:" in the file mbox-short.txt. If you experiment with the grep command a bit and read the documentation for grep, you will find some subtle differences between the regular expression support in Python and the regular expression support in grep. As an example, some versions of grep do not support the non-blank character "\S", so you may need to use the slightly more complex set notation "[^ ]", which simply means: match a character that is anything other than a space.
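The two notations are close but not identical: "\S" excludes every whitespace character, while "[^ ]" excludes only the space. A small sketch in Python shows the difference. The sample line, with a tab after "From:", is made up for illustration.

```python
import re

# Made-up line containing a tab as well as a space.
line = 'From:\talice@example.com Sat'

# "\S+" treats both the tab and the space as separators.
non_whitespace = re.findall(r'\S+', line)
print(non_whitespace)  # ['From:', 'alice@example.com', 'Sat']

# "[^ ]+" treats only the space as a separator, so the tab stays in the match.
non_space = re.findall('[^ ]+', line)
print(non_space)       # ['From:\talice@example.com', 'Sat']
```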
Debugging
Python has some simple and rudimentary built-in documentation that can be quite helpful if you need a quick refresher to trigger your memory about the exact name of a particular method. This documentation can be viewed in the Python interpreter in interactive mode.
You can bring up an interactive help system using help().
>>> help()
Welcome to Python 2.6!  This is the online help utility.
If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/tutorial/.
Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".
To get a list of available modules, keywords, or topics, type "modules",
"keywords", or "topics".  Each module also comes with a one-line summary
of what it does; to list the modules whose summaries contain a given word
such as "spam", type "modules spam".
help> modules
If you know what module you want to use, you can use the dir() command to find the methods in the module as follows:
>>> import re
>>> dir(re)
[.. 'compile', 'copy_reg', 'error', 'escape', 'findall',
'finditer', 'match', 'purge', 'search', 'split', 'sre_compile',
'sre_parse', 'sub', 'subn', 'sys', 'template']
You can also get a small amount of documentation on a particular method using the help command.
>>> help(re.search)
Help on function search in module re:
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a match object, or None if no match was found.
The built-in documentation is not very extensive, but it can be helpful when you are in a hurry
or don't have access to a web browser or search engine.
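The same information is also available from inside a program, which can be handy in a script. The sketch below is written for Python 3, where the names listed by dir(re) differ somewhat from the Python 2.6 listing shown above. The variable names are illustrative.

```python
import re

# dir() returns a list of names; filter out the ones starting with an underscore.
public_names = [name for name in dir(re) if not name.startswith('_')]
print(public_names)

# The text that help() displays for a function comes from its docstring.
doc = re.search.__doc__
print(doc)
```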
Glossary
brittle code:
Code that works when the input data is in a particular format but prone to breakage
if there is some deviation from the correct format. We call this "brittle code"
because it is easily broken.
greedy matching:
The notion that the "+" and "*" characters in a regular expression expand outward to match the largest possible string.
grep:
A command available in most Unix systems that searches through text files looking for lines that match regular expressions. The command name comes from the ed editor command g/re/p, "globally search for a regular expression and print".
regular expression:
A language for expressing more complex search strings. A regular expression may contain special characters that indicate that a search only matches at the beginning or end of a line or many other similar capabilities.
wild card:
A special character that matches any character. In regular expressions the wild card character is the period character.
Exercises
Exercise 1
Write a simple program to simulate the operation of the grep command
on Unix. Ask the user to enter a regular expression and count the number
of lines that matched the regular expression:
$ python grep.py
Enter a regular expression: ^Author
mbox.txt had 1798 lines that matched ^Author
$ python grep.py
Enter a regular expression: ^X-
mbox.txt had 14368 lines that matched ^X-
$ python grep.py
Enter a regular expression: java$
mbox.txt had 4218 lines that matched java$
Exercise 2
Write a program to look for lines of the form
New Revision: 39772
And extract the number from each of the lines using a regular expression
and the findall() method. Compute the average of the numbers and
print out the average.
Enter file:mbox.txt
Enter file:mbox-short.txt
