I needed a tool that easily extracts information from a web page. Usually this meant extracting data from HTML lists or table cells and storing it in CSV format.
Since writing a specific tool for each HTML page did not make sense, I wanted a general tool that I could use for several purposes. I therefore created a very simple instruction language.
The program is run from the command line and takes two mandatory arguments:
extract_data.py [HTML-file or URL] [instructions file]
The first argument can be a URL (starting with http://) or any text file, including a downloaded HTML file. The second argument is the name of a file containing the instructions for extracting strings from the data file.
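The handling of the first argument could look roughly like the sketch below. This is not the code of extract_data.py itself: the function name load_data, the use of Python 3's urllib.request, and the UTF-8 decoding are all assumptions made for illustration.

```python
import urllib.request


def load_data(source):
    """Return the text to search, read from a URL or from a local file.

    Hypothetical helper: the name and the UTF-8 assumption are not taken
    from the original script.
    """
    if source.startswith("http://"):
        # Remote page: download it and decode the bytes,
        # replacing anything that is not valid UTF-8.
        with urllib.request.urlopen(source) as response:
            return response.read().decode("utf-8", errors="replace")
    # Anything else is treated as the path of a local text or HTML file.
    with open(source, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

With this, load_data("page.html") reads a saved page, while an argument starting with http:// is downloaded first.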
The strings found in the HTML file are appended to the array "results". At the end of execution the results are printed to standard output in the following format (CSV):
field1;field2;field3;...fieldN;\n
field1;field2;field3;...fieldN;\n
....
Every register (record) is separated from the next by a newline '\n'. Data can also be inserted into the results manually.
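To make the output format concrete, here is a small sketch that produces it. It assumes the results are kept as a list of registers, each a list of field strings; the script itself may store them differently (for example as one flat array). The name format_results is hypothetical. No escaping is done for fields that contain ';', because the format described above does not mention any.

```python
def format_results(registers):
    """Render registers as 'field1;field2;...;' lines, one register per line."""
    lines = []
    for register in registers:
        # Every field is followed by ';', including the last one.
        lines.append("".join(field + ";" for field in register) + "\n")
    return "".join(lines)


registers = [["Alice", "42"], ["Bob", "17"]]
print(format_results(registers), end="")
# prints:
# Alice;42;
# Bob;17;
```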
The syntax of the instructions file is the following (case sensitive):
search STRING
    -- Goto next STRING in the data file - if not found then error.
    -- STRING can include white spaces.
get STRING ** STRING
    -- Insert string between STRING1 and STRING2 into the results.
    -- STRING1 and STRING2 are separated with **
next
    -- This command separates a register from another, i.e. adds
    -- a new line in the results.
insert STRING
    -- Insert STRING into the results. This can be used to insert
    -- data manually into the results.
loop
    -- Start of the loop. Loops can be nested.
until STRING
    -- Loop ends when STRING is encountered.
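One way to read a single instruction line is sketched below. The function name parse_instruction is hypothetical, and two details are assumptions rather than documented behaviour: whitespace around the strings (and around **) is stripped, and trailing "--" comments are not handled.

```python
def parse_instruction(line):
    """Split one instruction line into (command, list_of_string_arguments)."""
    command, _, rest = line.strip().partition(" ")
    rest = rest.strip()
    if command in ("search", "insert", "until"):
        # One STRING argument; it may contain spaces.
        return command, [rest]
    if command == "get":
        # Two strings separated by '**'.
        start, separator, end = rest.partition("**")
        if not separator:
            raise ValueError("get needs two strings separated by **")
        return command, [start.strip(), end.strip()]
    if command in ("next", "loop"):
        # These commands take no argument.
        return command, []
    # Commands are case sensitive, so 'Search' or 'GET' also end up here.
    raise ValueError("unknown command: " + command)


print(parse_instruction('get <td class="col1"> ** </td>'))
# prints: ('get', ['<td class="col1">', '</td>'])
```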
A note on the string extracted by the 'get' command: all double spaces are replaced with a single space, and newlines and tabs are removed.
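This clean-up could be written as in the sketch below. The function name clean is hypothetical, and repeating the replacement until no double space is left is an assumption; the script might make only a single pass.

```python
def clean(text):
    """Remove newlines and tabs, then reduce runs of spaces to one space."""
    text = text.replace("\n", "").replace("\t", "")
    # A single replace() turns three spaces into two, so repeat
    # until no double space is left.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


print(repr(clean("John \n\t  Smith")))
# prints: 'John Smith'
```

Note that newlines and tabs are removed rather than turned into spaces, so words separated only by a line break end up joined together.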
loop                                -- loop until /html -tag is found
search <tbody>                      -- search first tbody-tag
loop                                -- loop until /tbody is found
search <!-- Start Entry data -->    -- search next Start Entry -string
get <td class="col1"> ** </td>      -- get the string between col1 and /td => FIELD 1
get <td class="col2"> ** <          -- get the string between col2 and /td => FIELD 2
get <td class="col3"> ** </td>      -- get the string between col3 and /td => FIELD 3
get <h4> ** </h4>                   -- get the string between h4 and /h4 => FIELD 4
get <h5> ** </h5>                   -- get the string between h5 and /h5 => FIELD 5
get <p>( ** )                       -- get the string between <p>( and ) => FIELD 6
next                                -- end of register, put a new line
until </tbody>
until </html>
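To show how such an instruction file might be executed, here is a small interpreter sketch. It is only one possible reading of the rules and not the code of extract_data.py. All names (NotFound, clean, parse, find, execute, run) are hypothetical. The main assumption concerns loops: a loop ends as soon as a string inside its body cannot be found before the loop's until STRING, and the cursor then moves past that STRING. Trailing "--" comments are not supported.

```python
class NotFound(Exception):
    """Raised when a string does not occur before the current limit."""

def clean(text):
    """Remove newlines and tabs, then collapse double spaces."""
    text = text.replace("\n", "").replace("\t", "")
    while "  " in text:
        text = text.replace("  ", " ")
    return text

def parse(lines):
    """Nest the instructions: a loop becomes ("loop", (body, end_marker))."""
    program, stack = [], []
    for line in lines:
        command, _, arg = line.strip().partition(" ")
        arg = arg.strip()
        if command == "loop":
            stack.append(program)
            program = []
        elif command == "until":
            body, program = program, stack.pop()
            program.append(("loop", (body, arg)))
        elif command:
            program.append((command, arg))
    if stack:
        raise ValueError("loop without matching until")
    return program

def find(data, text, state, limit):
    """Move the cursor just past the next `text`; return where it starts."""
    index = data.find(text, state["pos"], limit)
    if index < 0:
        raise NotFound(text)
    state["pos"] = index + len(text)
    return index

def execute(program, data, state, limit):
    for command, arg in program:
        if command == "search":
            find(data, arg, state, limit)
        elif command == "get":
            start, _, end = arg.partition("**")
            find(data, start.strip(), state, limit)
            begin = state["pos"]
            stop = find(data, end.strip(), state, limit)
            state["row"].append(clean(data[begin:stop]))
        elif command == "insert":
            state["row"].append(arg)
        elif command == "next":
            state["rows"].append(state["row"])
            state["row"] = []
        elif command == "loop":
            body, marker = arg
            while True:
                end = data.find(marker, state["pos"], limit)
                if end < 0:
                    raise NotFound(marker)
                before = state["pos"]
                try:
                    # Strings inside the body must occur before the end marker.
                    execute(body, data, state, end)
                except NotFound:
                    state["pos"] = end + len(marker)
                    break
                if state["pos"] == before:
                    # The body did not advance: stop instead of looping forever.
                    state["pos"] = end + len(marker)
                    break
        else:
            raise ValueError("unknown command: " + command)

def run(lines, data):
    """Run the instruction lines on data; return a list of registers."""
    state = {"pos": 0, "row": [], "rows": []}
    execute(parse(lines), data, state, len(data))
    return state["rows"]

PROGRAM = ["loop", "search <tbody>", "loop",
           "search <!-- Start Entry data -->",
           'get <td class="col1"> ** </td>', "get <h4> ** </h4>", "next",
           "until </tbody>", "until </html>"]
PAGE = ('<html><table><tbody>\n<!-- Start Entry data -->\n'
        '<tr><td class="col1">1</td><td><h4>First\n entry</h4></td></tr>\n'
        '<!-- Start Entry data -->\n'
        '<tr><td class="col1">2</td><td><h4>Second  entry</h4></td></tr>\n'
        '</tbody></table></html>')
print(run(PROGRAM, PAGE))
# prints: [['1', 'First entry'], ['2', 'Second entry']]
```

Outside of any loop, a string that is not found surfaces as a NotFound error, which matches the "if not found then error" rule of the search command.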
Copy the file (below) to your hard disk.
Analyze the web page you want to extract data from, and create an instructions file.
The standard way is to run it with Python:
python extract_data.py [arg1] [arg2]
You can also run the script directly: edit the file's first line (the #! line) so that it points to your Python interpreter, and change the file permissions (chmod) if needed.
[DOWNLOAD v. 0.2] (not tested, use it at your own risk)