Extracting data from webpage

  1. #1
    Registered User
    Join Date
    06-11-2009
    Location
    Croatia
    MS-Off Ver
    Excel 2007
    Posts
    5

    Extracting data from webpage

    Hey, I need to extract specific data from a series of webpages into specific cells on my sheet.
    Here is an example page:
    http://www.erepublik.com/en/citizen/profile/1248794

    The data I need is the number between
    (src="/images/parts/icon_skill_strenght.gif" /> <span class="special">) and (</span>), in this case 5.092.

    Can anyone help me with this, or point me to some tutorials I can read?
    Thank you

  2. #2
    Forum Contributor
    Join Date
    02-23-2006
    Location
    Near London, England
    MS-Off Ver
    Office 2003
    Posts
    770

    Re: Extracting data from webpage

    I am unable to view that webpage from here. (If you could right-click somewhere on it, choose 'View Source', then save that file and attach it here, it would help.)

    However, it looks quite like something I had to do in the past, so I have put together some code for you to try. I have not been able to test it because of the access problem above, so if it doesn't work, please get the source of the webpage as described and attach it.

    Please Login or Register  to view this content.
    You would then use this function as follows:

    Please Login or Register  to view this content.
    Whether it will work is going to be a bit hit-and-miss without seeing the source of the webpage, though, as it relies on both of your delimiting bits of text being on the same line with the data value between them.
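
    If the two markers turn out to sit on different lines, a pattern that tolerates whitespace (including line breaks) between the pieces avoids the same-line restriction. Here is a minimal sketch of that idea, written in Python rather than VBA purely for illustration. The sample string is made up from the markers quoted in the first post, and extract_strength is a hypothetical name, not part of the code above.

```python
import re

# Whitespace (including newlines) is allowed between the image tag,
# the opening span, the value, and the closing span.
PATTERN = re.compile(
    r'icon_skill_strenght\.gif"\s*/>\s*<span class="special">\s*([^<]*?)\s*</span>'
)

def extract_strength(html):
    """Return the text inside the span that follows the icon, or None."""
    match = PATTERN.search(html)
    return match.group(1) if match else None

# Assumed sample: the real page source may differ. The line break is deliberate.
sample = ('<img src="/images/parts/icon_skill_strenght.gif" />\n'
          ' <span class="special">5.092</span>')

print(extract_strength(sample))  # prints 5.092
```

    Because the search runs over the whole source at once, it does not matter how the page happens to be split into lines.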


  3. #3
    Registered User
    Join Date
    06-11-2009
    Location
    Croatia
    MS-Off Ver
    Excel 2007
    Posts
    5

    Re: Extracting data from webpage

    Hmm, well, the code is too big to post here, so I only copied the bit with the information I need:
    Please Login or Register  to view this content.
    The information I need is the number 5.092.
    What I have in mind is a list of URLs in A1 to Ax, with the number from each URL next to it in B1 to Bx.
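
    To make that layout concrete, here is a rough sketch (in Python, not VBA) of pairing each URL with the value found on its page, the way column A and column B would line up. The URLs and page contents are invented stand-ins so the example runs offline, and build_rows is a hypothetical helper name.

```python
import re

# Matches the value inside the "special" span quoted earlier in the thread.
SPECIAL = re.compile(r'<span class="special">\s*([^<]*?)\s*</span>')

def build_rows(urls, fetch):
    """Pair each URL (column A) with the value found on its page (column B)."""
    rows = []
    for url in urls:
        match = SPECIAL.search(fetch(url))
        # Leave the cell empty when the marker is missing from a page.
        rows.append((url, match.group(1) if match else ""))
    return rows

# Hypothetical pages standing in for real downloads.
fake_pages = {
    "http://example.invalid/profile/1": '<span class="special">5.092</span>',
    "http://example.invalid/profile/2": "<p>no skill shown</p>",
}

rows = build_rows(list(fake_pages), fake_pages.get)
for url, value in rows:
    print(url, value)
```

    Passing the download step in as a function keeps the loop itself easy to check without touching the network.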

  4. #4
    Registered User
    Join Date
    06-11-2009
    Location
    Croatia
    MS-Off Ver
    Excel 2007
    Posts
    5

    Re: Extracting data from webpage

    I tweaked your code to do everything I need it to do, but I was wondering if there is any way to make it run faster. Is it downloading the pictures from each page, and if so, is there any way to download only the source code of the page without the pictures?

    Please Login or Register  to view this content.
    Last edited by Gramzon; 06-12-2009 at 03:24 AM.

  5. #5
    Forum Contributor
    Join Date
    02-23-2006
    Location
    Near London, England
    MS-Off Ver
    Office 2003
    Posts
    770

    Re: Extracting data from webpage

    I had meant that you could post the HTML source file here as a file attachment, not paste it all here, but never mind.
    If you uncomment the following line you will be able to see exactly what your IE session is doing:
    Please Login or Register  to view this content.
    It effectively visits the page, loads all of it (including images), and then grabs the HTML from it. I don't know of a way to tell that instance of IE not to load images. I imagine you could set it in IE's own settings, but then it would apply system-wide (and I don't know how to do that via VBA).
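
    For comparison, a plain HTTP request for the page returns only the HTML document; images are separate requests that a browser makes afterwards. The sketch below shows that approach using Python's standard library rather than IE automation. It is only an illustration: fetch_html is a hypothetical name, the User-Agent value is an assumption, and whether this particular site serves the same content to a non-browser client has not been checked.

```python
from urllib.request import Request, urlopen

def fetch_html(url, timeout=10):
    """Download only the document at `url` and return it as text."""
    # Assumption: some sites expect a browser-like User-Agent header.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

    A call such as fetch_html("http://www.erepublik.com/en/citizen/profile/1248794") would then return the page source as one string, with no images downloaded.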
    I notice that you have moved the "ie.quit" call to much later in the code. There is nothing wrong with this, but be careful. As IE is not visible by default, the only way to close that instance of IE is via your VBA code or via Task Manager. If you end your VBA code early (i.e. before ie.quit is called), or lose the handle to the IE object within your code, you will be left with a non-visible instance of IE running on your PC. You would then have to use Task Manager to get rid of it.
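
    The general pattern for guarding against that is to make the cleanup call run even when something fails partway through. The sketch below shows the idea in Python with a made-up FakeBrowser class standing in for the invisible IE instance. VBA has no try/finally, so there the usual equivalent would be an On Error GoTo handler that ends in the cleanup call.

```python
class FakeBrowser:
    """Hypothetical stand-in for an automated, invisible browser session."""

    def __init__(self):
        self.running = True

    def page_source(self, url):
        if "bad" in url:
            raise RuntimeError("page failed to load")
        return "<html>ok</html>"

    def quit(self):
        self.running = False

def scrape_all(browser, urls):
    """Collect the source of every URL, always closing the browser afterwards."""
    results = []
    try:
        for url in urls:
            results.append(browser.page_source(url))
    finally:
        # Runs on success and on error, so no hidden session is left behind.
        browser.quit()
    return results
```

    With this shape, one browser session can still be reused for the whole list of URLs without the risk of orphaning it.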

  6. #6
    Registered User
    Join Date
    06-11-2009
    Location
    Croatia
    MS-Off Ver
    Excel 2007
    Posts
    5

    Re: Extracting data from webpage

    Yes, I know, but there are a lot of URLs and I thought it would be faster if it doesn't have to open IE for every one. I turned off showing images in IE's settings, but I still can't resolve a very annoying problem.
    When the code is running it works fine, but then it stops at a random URL and I get:
    Please Login or Register  to view this content.
    When I click Debug, it goes to this line:
    Please Login or Register  to view this content.
    But if I press Continue, the code runs normally and keeps writing the correct values into the table, until I get the error again and have to press Continue again. Is there any way to avoid this error?
    Here is the final code:
    Please Login or Register  to view this content.
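
    The error text is not visible here, so the cause can only be guessed at. Since pressing Continue lets the same step succeed, one general way to handle a call that fails only now and then is to retry it after a short pause. The sketch below illustrates that idea in Python; fetch_with_retry and its parameters are hypothetical names, and it is not a diagnosis of this particular error.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying after a pause if it raises an exception."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            # Wait before the next try, but not after the final one.
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error
```

    If every attempt fails, the last error is raised again, so a page that is genuinely broken still shows up instead of being skipped silently.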

  7. #7
    Registered User
    Join Date
    06-11-2009
    Location
    Croatia
    MS-Off Ver
    Excel 2007
    Posts
    5

    Re: Extracting data from webpage

    I can't find the problem; it's driving me nuts.
    Last edited by Gramzon; 06-13-2009 at 06:18 AM.

  8. #8
    Registered User
    Join Date
    03-12-2009
    Location
    ch
    MS-Off Ver
    Excel 2003
    Posts
    2

    Re: Extracting data from webpage

    Hi Gramzon:

    This may require something simple. Let me first show you a quick 4-line script.


    Please Login or Register  to view this content.
    I have tested this script on your webpage http://www.erepublik.com/en/citizen/profile/12 . It correctly shows the number 5.092. Let me explain how it extracts this number.

    The cat command gets the web page and puts it into the string variable $page. The stex command is a string extractor. It takes an argument of the form ^xyz^ and searches for xyz. In your case, xyz is "src="/images/parts/icon_skill_strenght.gif" /> <span class="special">". (Thank you for describing the marker strings so accurately.) I am using regular expressions with the -r option, so I have backslashed the double quotes and used & to match any number of (irrelevant) characters.

    You will see a ] after the search string. That tells the stex command to extract everything up to and including the search string. We want to remove that part from the input string. Similarly, the second stex command extracts everything from the following </span> onward, including the </span> itself.
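
    The same two-step trimming can be sketched in Python for anyone who wants to follow the logic without installing anything. This is an illustration only: it uses plain string matching instead of the regular expression described above, the helper names are made up, and the sample page is an assumed fragment built from the marker strings in the first post.

```python
def strip_through(text, marker):
    """Drop everything up to and including the first `marker`."""
    _, found, rest = text.partition(marker)
    return rest if found else text

def strip_from(text, marker):
    """Drop everything from the first `marker` onward, marker included."""
    return text.partition(marker)[0]

# Assumed fragment; the real page source may differ.
page = ('<img src="/images/parts/icon_skill_strenght.gif" /> '
        '<span class="special">5.092</span> more text')

# Step 1: remove the leading part through the opening marker.
remainder = strip_through(page, '<span class="special">')
# Step 2: remove the trailing part starting at the closing marker.
number = strip_from(remainder, '</span>')

print(number)  # prints 5.092
```

    What is left after both cuts is just the value that sat between the two markers.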

    The script is in biterscripting. To try this script on your web page,

    1. Save the script as C:\GetNumber.txt.

    2. Download biterscripting from http://www.biterscripting.com . It should install in minutes.

    3. Start biterscripting. Run the GetNumber script with the following command.

    script "C:\GetNumber.txt"

    You should see the number 5.092, as I am seeing it.


    I am a non-programmer, and I use biterscripting to extract info from our own web pages. Someone got me started with it by providing a simple sample script, just like this one.

    J

    (You should always be respectful of the copyright of the owner of a web site when extracting info. If it is for your own use, that's ok. If you are going to republish the extracted info, you should seek their permission.)
