
Markus wrote:
>>I want to build a tool for data mining from an html page. I want the user to
>>select an element from a web page, and train my application to recognize it
>>in its later updates. For example, suppose the user wants to extract some
>>data from a financial page. He wants to extract his total balance, plus the table
>>of the last transactions. What he should do is to highlight the elements
>>inside the html page. After doing that, the application should analyze the
>>html element structure, and learn how to find it in similar pages (even
>>when they are not identical). What I really need is an algorithm to
>>"understand" a single element (by its structure, position in page or any
>>other methods), and then I want to look in a new page, and choose the most
>>similar element (which should probably be the right one).
>
>
> Seems you are trying to "learn" a structure, for example a grammar for
> a pattern language. There are a bunch of algorithms out there that can
> learn text patterns nicely.
>
> I've seen something like what you described before, I think it was
> with the Lexikon Project at DFKI (www.dfki.de). I don't know of any
> publications off the top of my head, though.
>
> Markus
Is this tool going to be stealing the data off just one page, or a
series of pages that are templated?
I wrote a tool in ColdFusion that would extract elements from a remote
dynamic site (say Amazon.com) and store specific fields in a local database.
It would step through ID values in the URL query string by sending requests to
http://site/page.php?ID=1
http://site/page.php?ID=2
http://site/page.php?ID=3
http://site/page.php?ID=4
and so on, indefinitely. You want to make sure the target URL is
really a template; otherwise it's going to be tough to ensure the fields
are delimited the same way on every page (that is, that the HTML markers
before and after the target text match).
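A minimal sketch of that approach, in Python rather than the original
ColdFusion, which is not shown here. The function names, the bounded ID
range, and the `<span id="balance">` markers are all hypothetical; a real
page would need its own markers, and fetching the pages (for example with
urllib.request.urlopen) is left out so the example runs offline.

```python
from typing import Iterator, Optional


def template_urls(base: str, start: int, stop: int) -> Iterator[str]:
    """Yield base?ID=start through base?ID=stop-1 (hypothetical helper)."""
    for page_id in range(start, stop):
        yield f"{base}?ID={page_id}"


def extract_between(html: str, before: str, after: str) -> Optional[str]:
    """Return the text between the first `before` marker and the next
    `after` marker, or None if either marker is missing."""
    start = html.find(before)
    if start == -1:
        return None
    start += len(before)
    end = html.find(after, start)
    if end == -1:
        return None
    return html[start:end].strip()


# Assumed sample page; a templated site would repeat this layout per ID.
SAMPLE_PAGE = '<b>Total:</b> <span id="balance"> 1,234.56 </span>'

for url in template_urls("http://site/page.php", 1, 5):
    # In a real run the HTML would be downloaded from `url` here.
    field = extract_between(SAMPLE_PAGE, '<span id="balance">', "</span>")
    print(url, field)
```

If a page in the series is not built from the template, extract_between
returns None for it, which is an easy way to spot the pages where the
markers do not line up.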
Dustin Smith
[ comp.ai is moderated. To submit, just post and be patient, or if ]
[ that fails mail your article to <[EMAIL PROTECTED]>, and ]
[ ask your news administrator to fix the problems with your system. ]