Nodejs html to json

Contents

Converting html code to a JavaScript object in node.js with html-to-json
1 — Basic example of html-to-json in node.
2 — Converting many files to javaScript objects
3 — Conclusion

Converting html code to a JavaScript object in node.js with html-to-json

So I wanted to make a simple tool to run through all of my blog posts that have been parsed into html, and find certain values such as word count for my posts. In other words I want to create a collection of objects, one for each html file, or have a way to convert from HTML to a JSON format. So there should be some kind of dependency in the npmjs ecosystem that I can use to quickly turn html into an object type form that I can work with in a node environment, similar to what I can work with in a browser or client side javaScript environment.

With that being said I took a little time to see what there is to work with, if anything, and after doing so I found a few projects that work nicely. However in this post I will mostly be writing about an npm package called html-to-json. This package has a method that I can feed an html string, and what is returned is a workable object.

1 — Basic example of html-to-json in node.

So of course as always the first thing is to install the package into a node project folder. I assume that you know the basics of setting up a new node project folder. If not, this is not the post to start out with for the basics of using node and its default package manager, npm.

$ npm install html-to-json --save

After that I wanted to make my typical hello world example of how to get started with html to json. As such I put together a simple example to just test out how it works, by seeing if I can just pull the text from a single paragraph element just for starters.

So with that said there is the parse method of this project. The first argument is an html string, and the second argument is an object that serves as a filter. A callback can be given as a third argument, but the method also returns a promise. The resulting object, delivered via the callback or via the resolved promise, contains what I would expect.

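var htmlToJson = require('html-to-json');

// parse an html string, using a filter with a single key
htmlToJson.parse('<p>This is only an example</p>', {
    // the value of result.p will be the text of the paragraph element
    p: function (doc) {
        return doc.find('p').text();
    }
}).then(function (result) {
    console.log(result.p); // 'This is only an example'
});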

So far so good. It looks like this project is more or less what I had in mind, but let's look at a few more examples just for the hell of it.

2 — Converting many files to javaScript objects

To do this I used another javaScript dependency called node-dir, which comes in handy when I want to grab the content of many files that exist in a complex file structure. I wrote a post on it if you want to learn more about how to loop over files recursively with it.


There are other options that can be used to walk the contents of a file system, and in fact I am not sure I can say node-dir is the best option when it comes to file system walkers. I wrote a post on the subject of file system walking that you might want to check out for those other options. In any case I just need a way to loop over the contents of a file system recursively, open each html file, and then use this project to parse the html into a workable object.
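As one of those other options, a recursive walk can be written with nothing but the built-in fs and path modules. What follows is a minimal sketch and not what node-dir does internally. The function name findHtmlFiles is hypothetical, and it reads directories synchronously to keep things simple.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect the paths of all .html files under a root folder.
function findHtmlFiles(root) {
    let found = [];
    fs.readdirSync(root, { withFileTypes: true }).forEach(function (entry) {
        const fullPath = path.join(root, entry.name);
        if (entry.isDirectory()) {
            // walk into sub folders
            found = found.concat(findHtmlFiles(fullPath));
        } else if (entry.isFile() && path.extname(entry.name).toLowerCase() === '.html') {
            found.push(fullPath);
        }
    });
    return found;
}

// usage: list html files under ./html when that folder exists
if (fs.existsSync('./html')) {
    console.log(findHtmlFiles('./html'));
}
```

Each path in the returned array could then be read with fs.readFile and passed to the parse method.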

Anyway, using node-dir with html-to-json I was able to quickly build the json report that I wanted.

var htmlToJson = require('html-to-json'),
fs = require('fs'),
dir = require('node-dir'),
results = [],
source = './html',
jsonFN = './report.json';

// using the readFiles method in node-dir
dir.readFiles(source,
    // a function to call for each file in the path
    function (err, content, fileName, next) {
        // if an error happens log it, and do not try to parse this file
        if (err) {
            console.log(err);
            return;
        }
        // log current filename
        console.log(fileName);
        // using html-to-json's parse method
        htmlToJson.parse(content, {
            // include the filename
            fn: fileName,
            // get the text of the title element in my post
            title: function (doc) {
                return doc.find('title').text().replace(/\n/g, '').trim();
            },
            // getting word count
            wc: function (doc) {
                // rough word count from the text of all p elements
                return doc.find('p').text().split(' ').length;
            }
        }).then(function (result) {
            // store the result, and only then move on to the next file
            results.push(result);
            next();
        }, function (parseErr) {
            // a failed parse should not stop the whole run
            console.log(parseErr);
            next();
        });
    }, function (err) {
        if (err) {
            console.log(err);
        }
        // write out a json file, fs.writeFile needs a callback
        fs.writeFile(jsonFN, JSON.stringify(results), 'utf-8', function (writeErr) {
            if (writeErr) {
                console.log(writeErr);
            }
        });
    });
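Splitting on a single space gives only a rough count, because line breaks, tabs and runs of spaces in the text all skew the number. A slightly more careful counter, sketched here with a hypothetical helper name countWords, splits on any run of whitespace and treats blank text as zero words.

```javascript
// Count words in a string by splitting on runs of whitespace.
function countWords(text) {
    const trimmed = String(text).trim();
    if (trimmed === '') {
        return 0;
    }
    return trimmed.split(/\s+/).length;
}

const sample = 'This is only\nan example';
console.log(sample.split(' ').length); // 4, the newline is not treated as a separator
console.log(countWords(sample)); // 5
```

Inside the wc filter this would be used as countWords(doc.find('p').text()).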

3 — Conclusion

So I might want to work on my content analysis tool some more as a great deal more comes to mind other than just word count of my posts. It seems like what is a lot more important than a high word count is targeting the right string of keywords that people are searching for. Anyway this is a great solution for helping me with the task of converting html to json, I hope this post helped you.


Nodejs html to json

Parses HTML strings into objects using flexible, composable filters.

htmlToJson.parse(html, filter, [callback]) -> promise

The parse() method takes a string of HTML, and a filter, and responds with the filtered data. This supports both callbacks and promises.

var promise = htmlToJson.parse('<div>content</div>', {
  'text': function ($doc) {
    return $doc.find('div').text();
  }
}, function (err, result) {
  console.log(result);
});

promise.done(function (result) {
  //Works as well
});
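A function can support both styles by always returning a promise and, when a callback is given, also calling it Node-style with (err, result). The sketch below shows that general pattern with a hypothetical function named parseLength. It is not html-to-json's code and it does no HTML parsing.

```javascript
// Returns a promise, and also calls an optional (err, result) callback.
function parseLength(html, callback) {
    const promise = new Promise(function (resolve, reject) {
        if (typeof html !== 'string') {
            reject(new TypeError('html must be a string'));
            return;
        }
        resolve({ length: html.length });
    });
    if (typeof callback === 'function') {
        // forward the outcome to the callback as well
        promise.then(
            function (result) { callback(null, result); },
            function (err) { callback(err); }
        );
    }
    return promise;
}

// callback style
parseLength('<div>content</div>', function (err, result) {
    console.log(err, result); // null { length: 18 }
});

// promise style
parseLength('<div>content</div>').then(function (result) {
    console.log(result.length); // 18
});
```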

htmlToJson.request(requestOptions, filter, [callback]) -> promise

The request() method takes options for a call to the request library and a filter, then returns the filtered response body.

var promise = htmlToJson.request('http://prolificinteractive.com/team', {
  'images': ['img', function ($img) {
    return $img.attr('src');
  }]
}, function (err, result) {
  console.log(result);
});

htmlToJson.batch(html, dictionary, [callback]) -> promise

Performs many parsing operations against one HTML string. This transforms the HTML into a DOM only once instead of for each filter in the dictionary, which can quickly get expensive in terms of processing. This also allows you to break your filters up into more granular components and mix and match them as you please.

The values in the dictionary can be htmlToJson.Parser objects, generated methods from htmlToJson.createMethod , or naked filters that you might normally pass into htmlToJson.parse . For example:

return getProlificHomepage().then(function (html) {
  return htmlToJson.batch(html, {
    sections: htmlToJson.createParser(['#primary-nav a', {
      'name': function ($section) {
        return $section.text();
      },
      'link': function ($section) {
        return $section.attr('href');
      }
    }]),
    offices: htmlToJson.createMethod(['.office', {
      'location': function ($office) {
        return $office.find('.location').text();
      },
      'phone': function ($office) {
        return $office.find('.phone').text();
      }
    }]),
    socialInfo: ['#footer .social-link', {
      'name': function ($link) {
        return $link.text();
      },
      'link': function ($link) {
        return $link.attr('href');
      }
    }]
  });
});

htmlToJson.createMethod(filter) -> function (html, [callback])

Generates a method that wraps the passed filter argument. The generated method takes an HTML string and processes it against that filter .

var parseFoo = htmlToJson.createMethod({
  'foo': function ($doc) {
    return $doc.find('#foo').bar();
  }
});

htmlToJson.createParser(filter), new htmlToJson.Parser(filter)

For the sake of reusability, creates an object with .parse and .request helper methods, which use the passed filter. For example:

var linkParser = htmlToJson.createParser(['a[href]', {
  'text': function ($a) {
    return $a.text();
  },
  'href': function ($a) {
    return $a.attr('href');
  }
}]);

linkParser.request('http://prolificinteractive.com').done(function (links) {
  //Do stuff with links
});

This is equivalent to passing the filter directly to htmlToJson.request:

htmlToJson.request('http://prolificinteractive.com', ['a[href]', {
  'text': function ($a) {
    return $a.text();
  },
  'href': function ($a) {
    return $a.attr('href');
  }
}]).done(function (links) {
  //Do stuff with links
});

The former allows you to easily reuse the filter (and make it testable), while the latter is a one-off.

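parser.parse(html, [callback])

Parses the passed html argument against the parser’s filter.

parser.method(html, [callback])

Returns a method that wraps parser.parse().

parser.request(requestOptions, [callback])

Makes a request with the request options, then runs the response body through the parser’s filter.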

The return values of functions are mapped against their corresponding keys. Function filters are passed cheerio objects, which allows you to play with a jQuery-like interface.

htmlToJson.parse('<div id="foo">foo</div>', {
  'foo1': function ($doc, $) {
    return $doc.find('#foo').text(); //foo
  }
}, callback);

Arrays of data can be parsed out by either using the .map() method within a filter function or using the shorthand [selector, filter] syntax:

A filter is applied incrementally against each matched element, and the results are returned within an array.

var html = '<div class="item">1</div><div class="item">2</div>';

htmlToJson.parse(html, function () {
  return this.map('.item', function ($item) {
    return $item.text();
  });
}).done(function (items) {
  // Items should be: ['1','2']
}, function (err) {
  // Handle error
});

The [selector, filter] array syntax is essentially a shorthand alias for .map(), making the filter look more like its output:

var html = '<div class="item">1</div><div class="item">2</div>';

htmlToJson
  .parse(html, ['.item', function ($item) {
    return $item.text();
  }])
  .done(function (items) {
    // Items should be: ['1','2']
  }, function (err) {
    // Handle error
  });

As an added convenience you can pass in a 3rd argument into the array filter, which allows you to manipulate the results. You can return a promise if you wish to do an asynchronous operation.

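// lodash provides the _.map() helper used below
var _ = require('lodash');
var html = '<div class="item">1</div><div class="item">2</div>';

htmlToJson
  .parse(html, ['.item', function ($item) {
    return +$item.text();
  }, function (items) {
    return _.map(items, function (item) {
      return item * 3;
    });
  }])
  .done(function (items) {
    // Items should be: [3,6]
  }, function (err) {
    // Handle error
  });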
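The three parts of the array filter can be pictured as a small pipeline: select the elements, map each one through the filter, then hand the whole list to the optional third function. The sketch below mimics that flow on a plain array of strings instead of a DOM. The applyArrayFilter helper is hypothetical and is not part of html-to-json.

```javascript
// Toy version of [selector, filter, after] over plain values.
// "items" stands in for the elements a selector would match.
function applyArrayFilter(items, filter, after) {
    // run the filter against each item; filters may return promises
    return Promise.all(items.map(function (item) {
        return filter(item);
    })).then(function (results) {
        // the optional third function may also return a promise
        return typeof after === 'function' ? after(results) : results;
    });
}

applyArrayFilter(['1', '2'], function (text) {
    return +text;
}, function (numbers) {
    return numbers.map(function (n) {
        return n * 3;
    });
}).then(function (result) {
    console.log(result); // [ 3, 6 ]
});
```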

Filter functions may also return promises, which get resolved asynchronously.

function getProductDetails (id, callback) {
  return htmlToJson.request({
    uri: 'http://store.prolificinteractive.com/products/' + id
  }, {
    'id': function ($doc) {
      return $doc.find('#product-details').attr('data-id');
    },
    'colors': ['.color', {
      'id': function ($color) {
        return $color.attr('data-id');
      },
      'hex': function ($color) {
        return $color.css('background-color');
      }
    }]
  }, callback);
}

function getProducts (callback) {
  return htmlToJson.request({
    uri: 'http://store.prolificinteractive.com'
  }, ['.product', {
    'id': function ($product) {
      return $product.attr('data-id');
    },
    'image': function ($product) {
      return $product.find('img').attr('src');
    },
    'colors': function ($product) {
      // This is where we use a promise to get the colors asynchronously
      return this
        .get('id')
        .then(function (id) {
          return getProductDetails(id).get('colors');
        });
    }
  }], callback);
}

Dependencies on other values

Filter functions may use the .get(propertyName) to use a value from another key in that filter. This returns a promise representing the value rather than the value itself.

function getProducts (callback) {
  return htmlToJson.request('http://store.prolificinteractive.com', ['.product', {
    'id': function ($product) {
      return $product.attr('data-id');
    },
    'image': function ($product) {
      return $product.find('img').attr('src');
    },
    'colors': function ($product) {
      // Resolve 'id' then get product details with it
      return this
        .get('id')
        .then(function (id) {
          return getProductDetails(id).get('colors');
        });
    }
  }], callback);
}
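One way to picture .get(propertyName) is as a cache of promises, one per key, that filter functions can ask for by name. The following is a toy sketch of that idea over a plain object and is not html-to-json's actual implementation. The resolveFilter function and its context argument are hypothetical, and circular dependencies between keys are not handled.

```javascript
// Resolve every key of a filter object against a context value.
// Function filters are called with "this.get(key)" available.
function resolveFilter(filter, context) {
    const cache = {};
    const api = {
        // return a promise for the value of another key, computing it once
        get: function (key) {
            if (!Object.prototype.hasOwnProperty.call(cache, key)) {
                const value = filter[key];
                cache[key] = Promise.resolve(
                    typeof value === 'function' ? value.call(api, context) : value
                );
            }
            return cache[key];
        }
    };
    const keys = Object.keys(filter);
    return Promise.all(keys.map(function (key) {
        return api.get(key);
    })).then(function (values) {
        const result = {};
        keys.forEach(function (key, i) {
            result[key] = values[i];
        });
        return result;
    });
}

resolveFilter({
    id: function (product) {
        return product.id;
    },
    label: function (product) {
        // depends on the resolved value of the 'id' key
        return this.get('id').then(function (id) {
            return product.name + ' #' + id;
        });
    },
    source: 'mock'
}, { id: 7, name: 'Shirt' }).then(function (result) {
    console.log(result); // { id: 7, label: 'Shirt #7', source: 'mock' }
});
```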

Nested objects within a filter are run against the same HTML context as the parent filter.

var html = '<div id="foo"><div id="bar">foobar</div></div>';

htmlToJson.parse(html, {
  'foo': {
    'bar': function ($doc) {
      return $doc.find('#bar').text();
    }
  }
});

You may specify a more specific DOM context by setting the $container property on the object filter:

var html = '<div id="foo"><div id="bar">foobar</div></div>';

htmlToJson.parse(html, {
  'foo': {
    $container: '#foo',
    'bar': function ($foo) {
      return $foo.find('#bar').text();
    }
  }
});

Strings, numbers, and null values are simply used as the filter’s value. This especially comes in handy for incrementally converting from mock data to parsed data.

htmlToJson.parse(' ', {
  x: 1,
  y: 'string value',
  z: null
});

Tests are written in mocha and located in the test directory. Run them with:

$ npm test

This script also executes jshint against lib/ and test/ directories.

Please read the existing code in order to learn the conventions.
