Crawling a website for data, converting data to standardized datafile

Completed Posted Jul 7, 2009 Paid on delivery
Completed Paid on delivery

A website is presenting some numerical data. I want a crawler to extract and transform that data into a standardized data file(s), preferably in Excel format, but if necessary, in CSV or other similar format.

## Deliverables

{Please note, this request is related to, but not the same as my other work request # 1188594. Those who may have bid on that project may bid on this one, and bidders on this project may also want to try to bid on the other one.}

The website [login to view URL] has a bunch of numerical information on over 80,000 Facebook applications that I want extracted and transformed into numerical format. The final deliverable will be a data file, preferably a single file, but multiple if necessary, that has all of that data. I prefer the file(s) to be in Excel format, but can also be in CSV or other nonproprietary data format.

You can obtain a standard login username/password from the site for free. Using that standard login, you would crawl all of the apps, starting from the following anchor page:

[login to view URL]

Note, I have been informed that there is an unfortunate pagination bug in the website which you can see here: Note that while the anchor page claims to display apps 1-25, it does not actually display apps 1-25. Hence, the spider cannot simply click on "Next" for doing so would actually mean skipping some apps. {If you click on Next manually, you will see what I mean). Furthermore, the number of apps that is skipped seems to be unpredictable and hence you cannot simply crawl using a fixed increment value within the search query. Hence, the spider should be programmed to be smart enough to see the number of apps that were displayed and then construct the proper query to display the true next set of apps, without skipping any.

For example, if apps 1-17 are actually displayed (as opposed to apps 1-25 that the site claims to display), then the next query could be:

[login to view URL]

Basically, you would append the string ?0=x where x = the number of the last application in the previous search page. Or, if you have a better idea, then feel free to use it. What is important is that the crawler not skip any apps. Again, if that is not clear, then playing with the site should clarify the matter.

*The Final Output Data

I want all data fields, and importantly, all the information from the Javascript graphs that the crawler can see. For example let's consider the Top Friends apps:

[login to view URL]

With the free standard login, you will see that information in the Summary, Reach, and Audience Profile tabs are available (the info in the Engagement and Growth tabs will be grayed out).

From the Summary tab, I want the data regarding:

By Company Name (for example, RockYou, Slide, etc.)

Rank

DAU

Social Graph Influence

MAU

Categories

Description

The entire Unique Active Users graphs for daily, weekly, and monthly (where x=date, y= UAU) - note, while the graph is Adobe Flash, all of the data is viewable in the Page Source

From the Reach tab, I want

DAU

MAU

The entire UAU graph (just like above)

From the Audience Profile tab, I want:

Male/Female

Average Age

Average Number of Friends

Gender

App User Overlap (all of the fields)

App User Affinity (all of the fields)

Age (all of the categories in the histogram)

Social Graph Influence(all of the categories in the histogram)

Note, some of the data will be repetitive. I don't care - I just want to make sure that the data is complete, even if some of it is repetitive.

Important: many of the apps won't have all of these tabs or all of the fields. If the crawler can't find a tab or field for a particular app, it should just input a "-" string into the data file.

Engineering MySQL PHP Project Management Software Architecture Software Testing

Project ID: #2798068

About the project

14 proposals Remote project Active Jul 18, 2009

Awarded to:

khalidsafwatvw

See private message.

$127.5 USD in 14 days
(64 Reviews)
5.5

14 freelancers are bidding on average $243 for this job

hwanghendra

See private message.

$191.25 USD in 14 days
(468 Reviews)
7.5
wangpretty

See private message.

$127.5 USD in 14 days
(27 Reviews)
5.3
rxhector2k5

See private message.

$85 USD in 14 days
(67 Reviews)
5.0
Robotapps

See private message.

$65.45 USD in 14 days
(91 Reviews)
5.4
codelabsl

See private message.

$552.5 USD in 14 days
(58 Reviews)
5.0
anurag7vw

See private message.

$59.5 USD in 14 days
(63 Reviews)
4.9
jvavadiya

See private message.

$102 USD in 14 days
(19 Reviews)
4.1
webconsultantvw

See private message.

$51 USD in 14 days
(4 Reviews)
2.7
liaisonsolu

See private message.

$170 USD in 14 days
(3 Reviews)
1.3
rkvermavw

See private message.

$425 USD in 14 days
(0 Reviews)
0.0
rizwanofuk

See private message.

$850 USD in 14 days
(0 Reviews)
0.0
noumanhanif

See private message.

$170 USD in 14 days
(0 Reviews)
0.0
g33kwesley

See private message.

$425 USD in 14 days
(0 Reviews)
0.0