Why?

I needed a tool that easily extracts information from a web page. Usually this meant extracting data from HTML lists or table cells and storing the data in CSV format.

As programming a specific tool for each HTML page did not make sense, I wanted to create a general tool that I could use for several purposes. Therefore a very simple language was created.

How does it work?

The program is executed from the command line and takes two mandatory arguments:

extract_data.py [HTML-file or URL] [instructions file]

The first argument can be a URL (starting with http://) or any text file, including a downloaded HTML file. The second argument is the name of a file containing the instructions for extracting strings from the data file.
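
For example, to extract data from a live page and save the result to a file (the file names here are only placeholders):

python extract_data.py http://www.example.com/list.html instructions.txt > output.csv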

The strings found in the HTML file are inserted into the "results" array. At the end of execution these results are printed to standard output in the following format (CSV):

field1;field2;field3;...fieldN;\n
field1;field2;field3;...fieldN;\n
....

Each register (record) is separated from the next by a newline '\n'. It is also possible to insert data into the results manually.
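
Since the separator is ';', the output can be read back with Python's standard csv module. A minimal sketch (Python 3, assuming the output was redirected to a file named output.csv; note that the trailing ';' on each line shows up as an empty last column):

import csv

# "output.csv" is a placeholder name for the redirected output of extract_data.py.
with open("output.csv", newline="") as f:
    for row in csv.reader(f, delimiter=";"):
        if row and row[-1] == "":
            row = row[:-1]          # drop the empty column caused by the trailing ';'
        print(row)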

The syntax of the instruction file is the following (case-sensitive):

search STRING             -- Go to the next occurrence of STRING in the data file - if it is 
                          -- not found, the program stops with an error. STRING can include spaces.
get STRING ** STRING      -- Insert the string found between STRING1 and STRING2 into the 
                          -- results. STRING1 and STRING2 are separated by **.
next                      -- Separates one register from the next, i.e. adds 
                          -- a new line to the results.
insert STRING             -- Insert STRING into the results. This can be used to add 
                          -- data to the results manually.
loop                      -- Start of the loop. Loops can be nested.
until STRING              -- Loop ends when STRING is encountered.

About the strings extracted with the 'get' command: all double spaces are replaced with a single space, and newlines and tabs are removed.
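
For illustration only, the behaviour of 'get' and the whitespace clean-up described above could look roughly like the sketch below. This is not the actual code of extract_data.py; the helper names are made up.

def normalize(s):
    # Remove newlines and tabs, then replace double spaces (repeatedly) with single spaces.
    s = s.replace("\n", "").replace("\t", "")
    while "  " in s:
        s = s.replace("  ", " ")
    return s

def get_between(data, pos, start, end):
    # Return the normalized text found between 'start' and 'end', searching
    # forward from offset 'pos', together with the offset just after 'end'.
    a = data.find(start, pos)
    if a == -1:
        raise ValueError("start delimiter not found: %r" % start)
    a += len(start)
    b = data.find(end, a)
    if b == -1:
        raise ValueError("end delimiter not found: %r" % end)
    return normalize(data[a:b]), b + len(end)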

Example of an instruction file:

loop                                            -- loop until the /html tag is found
   search <tbody>                               -- search for the first tbody tag
   loop                                         -- loop until /tbody is found
      search <!-- Start Entry data -->          -- search for the next Start Entry string
      get <td class="col1">   ** </td>          -- get the string between col1 and /td  => FIELD 1
      get <td class="col2">   ** <              -- get the string between col2 and '<'  => FIELD 2
      get <td class="col3">   ** </td>          -- get the string between col3 and /td  => FIELD 3
      get <h4> ** </h4>                         -- get the string between h4 and /h4    => FIELD 4
      get <h5> ** </h5>                         -- get the string between h5 and /h5    => FIELD 5
      get <p>( ** )                             -- get the string between <p>( and )    => FIELD 6 
      next                                      -- end of register, put a new line
   until </tbody>
until </html>
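
To make the example concrete, a purely made-up table row such as the following

<tbody>
  <!-- Start Entry data -->
  <tr>
    <td class="col1">Smith</td>
    <td class="col2">John</td>
    <td class="col3">1975</td>
    <td><h4>Manager</h4> <h5>Sales</h5> <p>(john.smith@example.com)</p></td>
  </tr>
</tbody>

would be turned by the instruction file above into one register, roughly:

Smith;John;1975;Manager;Sales;john.smith@example.com;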

Requirements

Python installed.


Install

Copy the file (below) to your hard disk.

Analyze the web page you want to extract data from, and create an instruction file.

The standard way is to execute it with Python:

python extract_data.py [arg1] [arg2]

You can also edit the script's first line to make it directly executable. Remember to change the permissions (chmod) if needed.
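
For example, on a Unix-like system (the file names below are only placeholders), the first line of the script could be a shebang such as

#!/usr/bin/env python

after which the script can be run directly:

chmod +x extract_data.py
./extract_data.py data.html instructions.txt > output.csv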

[DOWNLOAD v. 0.2] (not tested, use it at your own risk)
