# The Extraction Plan
The Extraction plan is where you tell bob what to extract from a web page
# A Simple Selector
As we already saw, the most simple extraction plan is a CSS selector (opens new window) that is evaluated on the page. The extracted value is by default the text value of the first matching element.
bob.work(
"http://hostname/path/post-01.html",
"div.content > div.post > h1"
);
In this case, the result is a simple string which is probably the title of the targeted post.
"This is the Post Title"
If you want to extract not only the first matching element, but all of them, just enclose the selector in an array. The extracted value is then an array of strings.
bob.work(
"http://hostname/path/post-01.html",
["div.content > div.post > h1"]
);
Result :
["This is the Title", "Another title" ]
But ok, this is quite basic right ? Bob can also provide a complete object if you tell him how to do. We will see that on the next section.
# A Complex Extraction Plan
If you want Bob to produce an object instead of a simple text value, you can give him an extraction plan describing exactly this object. Here is an example... Are you able to understand it ?
bob.work(
"http://hostname/path/post-01.html",
{
"title": "div.content > div.post > h1",
"sub-title": "div.content > div.post > h3",
"text" : ["div.content > div.post > p"]
}
);
The object we asked Bob to build has 3 properties. The title
and sub-title
of a post are simple strings. The text
property is an array of strings, each one extracted from a paragraph of the post in this page.
The next example should be applied on a HTML page that contains several posts :
bob.work(
"http://hostname/path/index.html",
{
"postList": {
"selector": ["div.content > div.post"],
"type": {
"title": '.post-header > h3.title'
"body" : 'div.excerpt'
}
}
}
);
The object we ask Bob to build has one property named postList. This property is an array (that's its type) that contains objects representings posts. Each object in this array has 2 properties.
- title : a string representing the title of the post
- body : a string representing the content of the post
Here is how the result could look like :
[
{ "title" : " some title", "body" : "lorem ipsum etc ..."},
{ "title" : " some other title", "body" : "let it be etc."},
{ "title" : " the last title", "body" : "Bob is in da place !"}
]
One thing to note in this example is that the CSS selector of title and body properties are relative to the parent's selector (in this case
div.content > div.post
).
# Value Type
Up to now we have only been extracting the text content of the selected elements hidding somewhere in HTML page. But we can also ask Bob to extract attributes values.
Yes this is possible, look :
bob.work(
"http://hostname/path/index.html",
{
"some-classes": {
"selector": ["div.content > div"],
"type": "@class"
}
}
);
This extraction plan asks Bob to select all elements that match the div.content > div
CSS selector, and for each one of them, read the value of the class
attribute.
The extracted data could look like this :
["post", "post"]
Ok, this is not super useful (who knows ?), but imagine that you want to extract a list of URL from anchors elements ? ... and imagine you then want to use this list of URL as an Itinerary from where to extract real data. This is exactly what the URL extraction section is all about.
Here is the example again :
bob.work(
"http://hostname/path/index.html",
{
selector: ["div.post a"],
type: "@href absolute",
}
);
As you can see, the plan
property (our extraction plan) is designed to extract the absolute URL defined by the href
attribute for all element that match the div.post a
CSS selector.
The result could be something like this :
[
"http://hostname/post1/page.html",
"http://hostname/post2/page.html",
"http://hostname/post3/page.html"
]
Of course, if you don't add the absolute modifier, Bob will just gives you what was found in the attribute value.