JavaScript objects in Python

Contents

Chompjs
Quickstart
Features
JavaScript objects
class explanation
Custom item methods
On metaclass
metaclass construction
metaclass demonstration
metaclass explanation
Never instantiated
Conclusion

Chompjs

Transforms JavaScript objects into Python data structures.

In web scraping, you sometimes need to transform JavaScript objects embedded in HTML pages into valid Python dictionaries. chompjs is a library designed to do that, as a more powerful replacement for the standard json.loads.

Quickstart

1. Installation

    $ git clone https://github.com/Nykakin/chompjs
    $ cd chompjs
    $ python setup.py build
    $ python setup.py install

Features

There are two functions available:

parse_js_object: tries to read the first JSON-like object encountered. Raises ValueError on failure.
parse_js_objects: returns a generator yielding all JSON-like objects encountered. Can be used to read JSON Lines. Does not raise on invalid input.


JavaScript objects

In JavaScript all objects are part of an inheritance tree. The create function adds a node to the inheritance tree.

    // A JavaScript object.
    js> root = {}
    // Prototype inheritance.
    js> create = function (obj) {
            var f = function () {return this;};
            f.prototype = obj;
            return new f;
        }
    js> a = create(root)
    js> b = create(a)
    js> a.name = 5
    js> a.name
    5
    js> b.name
    5

In Python classes inherit in the same way.

    >>> root = type  # Most classes are instances of type.
    >>> class a(root): pass
    >>> class b(a): pass  # Class inheritance.
    >>> a.name = 5  # Just like JavaScript.
    >>> a.name
    5
    >>> b.name
    5

class explanation

In Python we can subclass anything whose type is type (or a subclass of type). A subclass (and its instances) inherits properties from the super-class.

    >>> type(root) == type(a) == type(b) == type
    True

Custom item methods

In JavaScript attribute and item access are the same.

    js> a = create(root)
    js> a.name = 5
    js> a['name']
    5
    js> a['key'] = 6
    js> a.key
    6
    js> a[1] = 6
    js> a['1']
    6

In Python we can define our own item methods. (The programmer owns the dot.)

    >>> class A(object):
    ...
    ...     def __getitem__(self, key):
    ...         return getattr(self, str(key))
    ...     def __setitem__(self, key, value):
    ...         return setattr(self, str(key), value)
    >>> a = A()
    >>> a.name = 5
    >>> a['name']
    5
    >>> a['key'] = 6
    >>> a.key
    6
    >>> a[1] = 6
    >>> a['1']
    6

Because type(a) is A, which has the special item methods, we get the special item behaviour.
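The role of type(a) can be checked directly. The following is a minimal sketch in Python 3, where the class name Plain is made up for illustration: a __getitem__ function attached to an instance is ignored by item access, because Python looks special methods up on the type rather than on the instance.

```python
class Plain:
    """A class with no item methods (hypothetical name, for illustration)."""


p = Plain()
p.name = 5

# Attaching __getitem__ to the instance does not enable p['name']:
# implicit special-method lookup goes through type(p), not p itself.
p.__getitem__ = lambda key: getattr(p, str(key))

try:
    p['name']
    instance_lookup_worked = True
except TypeError:
    instance_lookup_worked = False

# Putting the method on the type makes item access work.
Plain.__getitem__ = lambda self, key: getattr(self, str(key))
value_via_type = p['name']
```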

On metaclass

Using the previous definition, we cannot subclass a to create b.

    >>> class b(a): pass
    Traceback (most recent call last):
        class b(a): pass
    TypeError: Error when calling the metaclass bases
        object.__new__() takes no parameters

This is because a is not a type. The solution involves Python metaclasses (an advanced topic).

metaclass construction

We will subclass type, not object, and add to it the special item methods.

    >>> class ObjectType(type):
    ...
    ...     def __getitem__(self, key):
    ...         return getattr(self, str(key))
    ...
    ...     def __setitem__(self, key, value):
    ...         return setattr(self, str(key), value)

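Here is a fancy way of calling ObjectType. (This is Python 2 syntax; Python 3 ignores __metaclass__ and spells the same thing as class root(object, metaclass=ObjectType).)

    >>> class root(object):
    ...     __metaclass__ = ObjectType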

Here is a more direct (and equivalent) construction (create an instance of ObjectType, whose instances are objects).

    >>> root = ObjectType('root', (object,), {})
    >>> isinstance(root(), object)
    True
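For readers on Python 3, here is a minimal sketch of both constructions side by side. The name root2 is introduced only for this illustration; everything else follows the definitions above.

```python
class ObjectType(type):
    """Metaclass whose instances (classes) support item access."""

    def __getitem__(cls, key):
        return getattr(cls, str(key))

    def __setitem__(cls, key, value):
        setattr(cls, str(key), value)


# Class-statement form, Python 3 spelling.
class root(metaclass=ObjectType):
    pass


# Direct form: call the metaclass with a name, bases and a namespace.
root2 = ObjectType('root2', (object,), {})

# Item assignment on the class goes through ObjectType.__setitem__.
root['name'] = 5
```

Both root and root2 have ObjectType as their type, so either can serve as the root of the inheritance tree.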

metaclass demonstration

    >>> class a(root): pass
    >>> class b(a): pass
    >>> a.name = 5
    >>> a.name
    5
    >>> b.name
    5
    >>> a['name']
    5
    >>> b['name']
    5
    >>> a[1] = 6
    >>> a['1']
    6

metaclass explanation

Because type(root) is a subclass of type we can subclass root.

    >>> issubclass(type(root), type)
    True

Because type(root) is ObjectType, which has special item methods, we get the special item behaviour.

    >>> type(root) == type(a) == type(b) == ObjectType
    True

Never instantiated

We can't call a JavaScript object unless it is a function, and create produces ordinary, non-callable objects.

    js> a = create(root)
    js> a(1, 2, 3)
    TypeError: a is not a function

We will monkey-patch the previous Python class to provide custom behaviour when called.

    >>> def raise_not_a_function(obj, *argv, **kwargs):
    ...     raise TypeError(obj.__name__ + ' is not a function')
    >>> ObjectType.__call__ = raise_not_a_function
    >>> a(1, 2, 3)
    Traceback (most recent call last):
        a(1, 2, 3)
    TypeError: a is not a function

Conclusion

JavaScript objects are like Python classes (because they inherit like Python classes).

In JavaScript, attribute and item access are the same. This is achieved in Python by providing custom item methods.

In Python the custom item methods must be placed on the type of the object (or a superclass of its type).

Ordinary JavaScript objects are not functions and cannot be called. A Python class can be called (to create an instance of the class). But we can override this behaviour by supplying a custom __call__ method.

JavaScript objects are like Python classes with custom item methods (on the metaclass) that are never instantiated.
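Putting the pieces together, the whole model might look like the sketch below in Python 3. The create helper and its name argument are hypothetical: they mirror the JavaScript create function and are not part of the original text.

```python
class ObjectType(type):
    """Metaclass giving classes JavaScript-like item access."""

    def __getitem__(cls, key):
        return getattr(cls, str(key))

    def __setitem__(cls, key, value):
        setattr(cls, str(key), value)

    def __call__(cls, *args, **kwargs):
        # Ordinary JavaScript objects are not callable.
        raise TypeError(cls.__name__ + ' is not a function')


root = ObjectType('root', (object,), {})


def create(obj, name='anonymous'):
    """Return a new 'object' (really a class) that inherits from obj.

    Hypothetical helper; name only labels the class in error messages.
    """
    return ObjectType(name, (obj,), {})


a = create(root, 'a')
b = create(a, 'b')
a.name = 5
a[1] = 6
```

Here b.name and b['1'] are both inherited from a, and calling a(1, 2, 3) raises TypeError. Building root still works because ObjectType.__call__ only applies to instances of ObjectType, not to ObjectType itself.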


Nykakin/chompjs


Transforms JavaScript objects into Python data structures.

    >>> import chompjs
    >>> chompjs.parse_js_object("{a: 100}")
    {'a': 100}
    >>>
    >>> json_lines = """
    ... {'a': 12}
    ... {'b': 13}
    ... {'c': 14}
    ... """
    >>> for entry in chompjs.parse_js_objects(json_lines):
    ...     print(entry)
    ...
    {'a': 12}
    {'b': 13}
    {'c': 14}


An example of usage with scrapy:

    import chompjs
    import scrapy


    class MySpider(scrapy.Spider):
        # ...

        def parse(self, response):
            script_css = 'script:contains("__NEXT_DATA__")::text'
            script_pattern = r'__NEXT_DATA__ = (.*);'
            # warning: for some pages you need to pass replace_entities=True
            # into re_first to have JSON escaped properly
            script_text = response.css(script_css).re_first(script_pattern)
            try:
                json_data = chompjs.parse_js_object(script_text)
            except ValueError:
                self.log('Failed to extract data from {}'.format(response.url))
                return

            # work on json_data

    >>> data = """
    ... {
    ...   // comments
    ...   unquoted: 'and you can quote me on that',
    ...   singleQuotes: 'I can use "double quotes" here',
    ...   lineBreaks: "Look, Mom! \
    ... No \\n's!",
    ...   hexadecimal: 0xdecaf,
    ...   leadingDecimalPoint: .8675309, andTrailing: 8675309.,
    ...   positiveSign: +1,
    ...   trailingComma: 'in objects', andIn: ['arrays',],
    ...   "backwardsCompatible": "with JSON",
    ... }
    ... """
    >>> chompjs.parse_js_object(data)
    {'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom! No \n's!", 'hexadecimal': 912559, 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}

If the input string is not yet unescaped and contains a lot of \\ characters, the unicode_escape=True argument might help to sanitize it:

    >>> chompjs.parse_js_object('{\\\"a\\\": 12}', unicode_escape=True)
    {'a': 12}
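To see what such sanitizing involves, here is a standard-library sketch. It assumes the flag amounts to running the input through Python's unicode_escape codec before parsing. The text does not say how chompjs implements it, so treat this as an analogy rather than the library's code.

```python
import json

# The raw text as it might appear in a page: quotes are backslash-escaped.
raw = '{\\"a\\": 12}'

# Decode the backslash escapes with the standard-library codec.
unescaped = raw.encode('utf-8').decode('unicode_escape')

# The result is now plain JSON.
parsed = json.loads(unescaped)
```

The unicode_escape codec treats non-escape bytes as Latin-1, so this sketch is only safe for ASCII input.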

By default chompjs tries to start at the first { or [ character it finds, omitting the rest:

    >>> chompjs.parse_js_object('<div>...</div><script>foo = [1, 2, 3];</script><div>...</div>')
    [1, 2, 3]
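The same "skip to the first bracket, ignore the rest" idea can be sketched with the standard library for strict JSON. The helper name parse_first_json is hypothetical, and this is not how chompjs itself is implemented: it only illustrates the behaviour using json.JSONDecoder.raw_decode.

```python
import json
import re


def parse_first_json(text):
    """Parse the first JSON value that starts at a '{' or '[' in text.

    Hypothetical helper: handles strict JSON only, unlike chompjs.
    Raises ValueError if no bracket is found or the value is invalid.
    """
    match = re.search(r'[{\[]', text)
    if match is None:
        raise ValueError('no object or array found')
    # raw_decode parses one value starting at the given index and
    # ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(text, match.start())
    return value


result = parse_first_json('<script>foo = [1, 2, 3];</script>')
```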

The json_params argument can be used to pass options to the underlying json.loads, such as strict or object_hook:

    >>> import decimal
    >>> import chompjs
    >>> chompjs.parse_js_object('[23.2]', json_params={'parse_float': decimal.Decimal})
    [Decimal('23.2')]
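For reference, this is what those options do when given to json.loads directly. The sketch uses only the standard library and assumes, as stated above, that json_params is forwarded unchanged. The helper tag_object is made up for the example.

```python
import decimal
import json

# parse_float: called with the text of every JSON float.
prices = json.loads('[23.2]', parse_float=decimal.Decimal)


# object_hook: called with every decoded JSON object (a dict).
def tag_object(obj):
    obj['_seen'] = True
    return obj


tagged = json.loads('{"a": {"b": 1}}', object_hook=tag_object)

# strict=False: allow control characters, such as a literal tab,
# inside JSON strings.
loose = json.loads('{"a": "tab\there"}', strict=False)
```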

In web scraping, data is often not present directly in the HTML but is instead provided as an embedded JavaScript object that is later used to initialize the page, for example:

    <html>
    <head>...</head>
    <body>
    ...
    <script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
    ...
    </body>
    </html>

The standard library function json.loads is usually sufficient to extract this data:

    >>> # scrapy shell file:///tmp/test.html
    >>> import json
    >>> script_text = response.css('script:contains(__PRELOADED_STATE__)::text').re_first('__PRELOADED_STATE__=(.*)')
    >>> json.loads(script_text)
    {u'foo': u'bar'}
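Outside a scrapy shell, the same extraction can be sketched with only the standard library. The html string below is a stand-in for the downloaded page, and the regular expression is an assumption that fits this sample. Real pages may need a different pattern.

```python
import json
import re

# Stand-in for the downloaded page (sample data for this sketch).
html = """
<html>
<head></head>
<body>
<script type="text/javascript">window.__PRELOADED_STATE__={"foo": "bar"}</script>
</body>
</html>
"""

# Capture everything between the assignment and the closing script tag.
match = re.search(r'__PRELOADED_STATE__=(.*?)</script>', html, re.DOTALL)
state = json.loads(match.group(1)) if match else None
```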

The problem is that not all valid JavaScript objects are also valid JSON. For example, all of these strings are valid JavaScript objects but not valid JSON:

"{'a': 'b'}" is not valid JSON because it uses the ' character to quote
'{a: "b"}' is not valid JSON because the property name is not quoted at all
'{"a": [1, 2, 3,]}' is not valid JSON because there is an extra , character at the end of the array
'{"a": .99}' is not valid JSON because the float value lacks a leading 0

As a result, json.loads fails to extract any of those:

    >>> json.loads("{'a': 'b'}")
    Traceback (most recent call last):
      ...
    ValueError: Expecting property name: line 1 column 2 (char 1)
    >>> json.loads('{a: "b"}')
    Traceback (most recent call last):
      ...
    ValueError: Expecting property name: line 1 column 2 (char 1)
    >>> json.loads('{"a": [1, 2, 3,]}')
    Traceback (most recent call last):
      ...
    ValueError: No JSON object could be decoded
    >>> json.loads('{"a": .99}')
    Traceback (most recent call last):
      ...
    json.decoder.JSONDecodeError: Expecting value: line 1 column 7 (char 6)
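In Python 3 all four failures surface as json.JSONDecodeError, a subclass of ValueError, so a single except clause covers them. A minimal sketch, with illustrative inputs chosen to match the four defects listed above:

```python
import json

# Illustrative inputs, one per defect listed above.
samples = [
    "{'a': 'b'}",          # single-quoted strings
    '{a: "b"}',            # unquoted property name
    '{"a": [1, 2, 3,]}',   # trailing comma in the array
    '{"a": .99}',          # float without a leading 0
]

failures = []
for text in samples:
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:  # a subclass of ValueError
        failures.append((text, exc.msg))
```

The exact wording in exc.msg varies between Python versions, so code should rely on the exception type rather than the message.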

The chompjs library was designed to bypass this limitation; it lets you scrape such JavaScript objects into proper Python dictionaries:

    >>> import chompjs
    >>>
    >>> chompjs.parse_js_object("{'a': 'b'}")
    {'a': 'b'}
    >>> chompjs.parse_js_object('{a: "b"}')
    {'a': 'b'}
    >>> chompjs.parse_js_object('{"a": [1, 2, 3,]}')
    {'a': [1, 2, 3]}
    >>> chompjs.parse_js_object('{"a": .99}')
    {'a': 0.99}

Internally, chompjs uses a parser written in C to iterate over the raw string, fixing its issues along the way. The final result is then passed to the standard library's json.loads, which keeps it fast compared to full-blown JavaScript parsers such as demjson.

    >>> import json
    >>> import _chompjs
    >>>
    >>> _chompjs.parse('{a: 1}')
    '{"a":1}'
    >>> json.loads(_)
    {'a': 1}

Pull requests are welcome.
