Usenet.com

Comp Thread Archive from Usenet.com

Re: Finding an HTML element



Markus wrote:
>>I want to build a tool for data mining from an HTML page. I want the user to
>>select an element from a web page, and train my application to recognize it
>>in later updates of that page. For example, suppose the user wants to extract
>>some data from a financial page. He wants to extract his total balance, plus
>>the table of the last transactions. What he should do is highlight the
>>elements inside the HTML page. After that, the application should analyze the
>>HTML element structure and learn how to find it in similar pages (even
>>when they are not identical). What I really need is an algorithm to
>>"understand" a single element (by its structure, position in the page, or any
>>other method), and then I want to look in a new page and choose the most
>>similar element (which should probably be the right one).
> 
> 
> Seems you are trying to "learn" a structure, for example a grammar for
> a pattern language. There are a bunch of algorithms out there that can
> learn text patterns nicely.
> 
> I've seen something like what you described before; I think it was
> with the Lexikon Project at DFKI (www.dfki.de). I don't know of any
> publications off the top of my head, though.
> 
> Markus
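
For what it's worth, the "choose the most similar element" idea from the
original question can be sketched very roughly as follows. This is only an
illustration under my own assumptions, not anything from the Lexikon Project:
an element is described by the path of tag names leading to its text, and the
candidate in the new page with the closest path wins. The names PathCollector,
collect and most_similar are made up for this sketch, and it ignores
attributes, sibling position and the text itself.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never get a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Record (tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # list of (path tuple, stripped text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def collect(html):
    """Return all (path, text) pairs found in an HTML string."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def most_similar(trained_path, html):
    """Return the (path, text) node whose tag path best matches trained_path.

    Raises ValueError if the page contains no text nodes at all.
    """
    def score(node):
        return SequenceMatcher(None, trained_path, node[0]).ratio()
    return max(collect(html), key=score)
```

The user's highlighted element supplies trained_path once (for example by
looking up its text in collect(old_page)); most_similar(trained_path,
new_page) then returns the best guess in an updated page. When several nodes
share the same path, such as cells in one table row, this sketch simply takes
the first, so a real tool would need extra features to tell them apart.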

Is this tool going to be pulling the data off just one page, or off a 
series of pages that share a template?

I wrote a tool in ColdFusion that would extract elements from a remote 
dynamic site (say Amazon.com) and store specific fields in a local database.

It would loop over the ID value in the URL's query string, sending requests to
        http://site/page.php?ID=1
        http://site/page.php?ID=2
        http://site/page.php?ID=3
        http://site/page.php?ID=4

and so on, indefinitely. You want to make sure the target URL is 
really a template; otherwise it's going to be tough to ensure the fields 
stay the same (that is, that the HTML markers before and after the target 
text don't change from page to page).
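
The original was ColdFusion, which I'm not reproducing here. As a minimal
sketch of the same idea in Python, assuming a bounded ID range instead of an
indefinite loop: build the URLs, fetch each page, and cut out whatever sits
between a fixed "before" marker and a fixed "after" marker. The function names
and the marker strings are hypothetical, and the database step is left out.

```python
from urllib.request import urlopen


def page_urls(base, first_id, last_id):
    """Build template URLs such as http://site/page.php?ID=1, ?ID=2, ..."""
    return [f"{base}?ID={i}" for i in range(first_id, last_id + 1)]


def extract_between(html, before, after):
    """Return the text between the first `before` marker and the next
    `after` marker, or None if either marker is missing."""
    start = html.find(before)
    if start == -1:
        return None
    start += len(before)
    end = html.find(after, start)
    if end == -1:
        return None
    return html[start:end].strip()


def scrape(base, first_id, last_id, before, after):
    """Fetch every page in the ID range and extract one field from each.

    Returns a dict mapping URL to the extracted text (None when the
    markers were not found, which suggests the page is off-template).
    """
    results = {}
    for url in page_urls(base, first_id, last_id):
        with urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")
        results[url] = extract_between(html, before, after)
    return results
```

A call might look like scrape("http://site/page.php", 1, 4, '<td class="price">',
"</td>"), where the marker strings are whatever surrounds the target text in the
template. A None result is a cheap warning that a page does not follow the
template.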

Dustin Smith

[ comp.ai is moderated.  To submit, just post and be patient, or if ]
[ that fails mail your article to <[EMAIL PROTECTED]>, and ]
[ ask your news administrator to fix the problems with your system. ]


